Pensum Flashcards

1
Q

What are the two main machine learning techniques for data mining?

A

Supervised machine learning and unsupervised machine learning

2
Q

What are examples of supervised machine learning?

A

Decision trees, linear classifiers, linear regression

3
Q

What are examples of unsupervised machine learning?

A

Clustering

4
Q

What are the characteristics of unsupervised machine learning?

A

No specific target value is defined for unsupervised methods. The system just looks for patterns in the data, with no "teacher" supplying the right answers. Ideally the data can be grouped neatly into a small number of categories, but we have to inspect the result to see what was found.

5
Q

What is the goal of clustering?

A

The goal is to group together similar instances using some metric of similarity, that is, to create groupings whose members are similar to each other. For example, group similar customers together and design a different campaign for each group.
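
As a rough sketch of the grouping idea, the snippet below runs a tiny k-means loop, which is one common clustering method (the card itself does not name a method). The customer figures, the `kmeans` helper, and the naive choice of starting centroids are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=10):
    """Tiny k-means: assign each point to its nearest centroid, then re-average."""
    # Assumption: use the first k points as starting centroids (naive but deterministic).
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Hypothetical customers described by (monthly spend, visits per month).
customers = [(10, 1), (12, 2), (11, 1), (95, 9), (100, 10), (98, 8)]
groups = kmeans(customers, k=2)
for number, members in enumerate(groups, start=1):
    print(f"group {number}: {members}")
```

No target value appears anywhere in the input; the two groups come only from how close the customers are to each other.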

6
Q

What are the characteristics of clustering?

A

It is like classification, but the groupings are not predefined. It is more open-ended than classification and regression. It could find a way to group similar customers together, but those groups may or may not relate to the churn question.

7
Q

What is the fundamental goal of data mining techniques?

A

Exploration to find patterns in a dataset.

8
Q

What is similarity matching?

A

Instances are compared based on their attributes to determine how similar they are. Example: Amazon finds books that are similar to a book you have read. If the book you read is described by three attributes, the most similar book is one that shares all three.
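
A minimal sketch of the book example, assuming similarity is simply the number of shared attributes. The titles, the attribute names, and the `shared_attributes` helper are invented for illustration and are not from the source.

```python
def shared_attributes(a, b):
    """Similarity score: how many attributes two instances have in common."""
    return len(set(a) & set(b))

# Hypothetical attributes of the book you have already read.
read_book = {"crime", "nordic", "series"}

# Hypothetical catalogue to search for similar books.
catalog = {
    "Book A": {"crime", "nordic", "series"},
    "Book B": {"crime", "series"},
    "Book C": {"romance"},
}

ranked = sorted(
    catalog,
    key=lambda title: shared_attributes(read_book, catalog[title]),
    reverse=True,
)
print(ranked)  # ['Book A', 'Book B', 'Book C']
```

"Book A" shares all three attributes, so it ranks first.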

9
Q

When do we use similarity matching?

A

The general idea of similarity matching can be applied in many different forms of data mining, including classification, regression, and clustering.

10
Q

What is important with similarity matching?

A

It is important to have information about the relevant attributes, and about which of those attributes matter most.

11
Q

What is regression?

A

Regression predicts a numerical value. It is related to classification, but there is a difference: classification predicts whether something will happen, while regression predicts how much.

12
Q

Give an example of when we would use regression.

A

"How much will a customer spend?" is a question that is solved with regression.
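
One way to make the "how much" idea concrete is a least-squares line fitted through past observations. The sketch below assumes a single input (visits per month) and made-up spend figures. `fit_line` is a hypothetical helper, and the card does not say which regression method the course uses.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: visits per month and amount spent.
visits = [1, 2, 3, 4]
spend = [30, 50, 70, 90]

slope, intercept = fit_line(visits, spend)
predicted = slope * 5 + intercept  # estimated spend for a customer with 5 visits
print(slope, intercept, predicted)
```

The output is a number (an estimated amount), not a class label, which is what separates regression from classification.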

13
Q

What does supervised machine learning do in general?

A

A target value is specified for each instance. Because the instances can be examined one by one, we can simply compute how often the system makes the right choice.

14
Q

What is classification?

A
Classification involves defining a small number of classes and then trying to predict, for each instance, which class it belongs to. In the churn example, classification is a natural fit: one class for "will churn" and one for "will not churn".
Each instance is labelled with a target value indicating which class it belongs to.
15
Q

What is data preparation?

A

Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Why do we need data preparation?

A

It is solid practice to start with an initial dataset to get familiar with the data, to discover first insights into it, and to get a good understanding of any possible data quality issues. Data preparation is often a time-consuming process and highly prone to errors.

17
Q

What is a database?

A

A database collects, stores, and manages information so that users can retrieve, add, update, or remove it. It presents the information in tables with rows and columns.
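
The four operations named above can be sketched with Python's built-in `sqlite3` module and an in-memory database. The `customers` table, its columns, and the rows are hypothetical.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, plan TEXT)")

# Add two rows.
conn.execute("INSERT INTO customers (name, plan) VALUES (?, ?)", ("Ada", "basic"))
conn.execute("INSERT INTO customers (name, plan) VALUES (?, ?)", ("Bo", "premium"))

# Update one row and remove another.
conn.execute("UPDATE customers SET plan = ? WHERE name = ?", ("premium", "Ada"))
conn.execute("DELETE FROM customers WHERE name = ?", ("Bo",))

# Retrieve what is left: each result is one table row, each field one column.
rows = conn.execute("SELECT name, plan FROM customers").fetchall()
print(rows)  # [('Ada', 'premium')]
conn.close()
```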

18
Q

What tools can you use to assess data?

A

Accuracy, precision/recall, training/test splits, and cross-validation.
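
As an illustration of the first two measures, the sketch below computes accuracy, precision, and recall from a list of actual labels and a list of predicted labels. The labels are invented, and `evaluate` is a hypothetical helper rather than a library function.

```python
def evaluate(actual, predicted, positive="churn"):
    """Accuracy, precision and recall, treating `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Of everything predicted positive, how much really was positive?
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Of everything really positive, how much did we find?
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }

# Hypothetical test-set labels and model predictions.
actual = ["churn", "churn", "stay", "stay", "stay", "churn"]
predicted = ["churn", "stay", "stay", "churn", "stay", "churn"]
scores = evaluate(actual, predicted)
print(scores)
```

In practice these scores are computed on data that was held out from training, which is where training/test splits and cross-validation come in.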

19
Q

What technique do you use when you want to find out how much a customer will use a service?

A

Regression

20
Q

What does classification do?

A

Predicts the class each individual belongs to

21
Q

What does regression do?

A

Estimates a numerical value for each individual

22
Q

What does clustering do?

A

Identifies similar individuals based on data known about them

23
Q

Can we find groups of customers who are likely to cancel the service when the contract expires?

A

This is a problem for supervised learning, because there is a specific target: whether the customer cancels.

24
Q

What can a line in a CSV data file look like?

A

sunny,short,boring,no
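
A short sketch of reading such a line with Python's `csv` module. The header row (`outlook,length,rating,watch`) and the second data row are assumptions added for illustration; the source gives only the single line above and no column names.

```python
import csv
import io

# Hypothetical file contents: an assumed header plus the line from the card.
text = (
    "outlook,length,rating,watch\n"
    "sunny,short,boring,no\n"
    "rainy,long,exciting,yes\n"
)

# DictReader uses the first line as column names and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0])
# {'outlook': 'sunny', 'length': 'short', 'rating': 'boring', 'watch': 'no'}
```

Each line is one instance and each comma-separated value is one attribute; in a supervised setting the last column would typically hold the target value.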

25
Q

How do you calculate entropy (the measure that information gain is based on)?

A

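entropy = -p1*log2(p1) - p2*log2(p2) - ..., where p_i is the proportion of instances in the set that belong to class i. Information gain is the entropy of the parent set minus the weighted average entropy of the child sets produced by a split.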
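
For reference, the standard definition is that the entropy of a set is the negative sum, over its classes, of p * log2(p), where p is the share of instances in that class. Information gain is the parent's entropy minus the size-weighted entropy of the children after a split. The sketch below assumes that definition. The labels are made up, and `entropy` and `information_gain` are hypothetical helper names.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical target values before and after splitting on some attribute.
parent = ["yes", "yes", "no", "no"]
children = [["yes", "yes"], ["no", "no"]]

mixed = entropy(parent)                     # evenly mixed set: maximally impure
gain = information_gain(parent, children)   # this split separates the classes perfectly
```

An evenly mixed two-class set has entropy 1 bit and a pure set has entropy 0, so a split that produces pure children recovers the full 1 bit as information gain.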

26
Q

A Linear Classifier is a Parameterized Model – the Parameters are what is learned in the training process. What are the parameters for a Linear Classifier?

A

The weights
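
To show the weights actually being learned, here is a minimal sketch that uses the perceptron update rule. That rule is one simple way to train a linear classifier and is not necessarily the method the course has in mind. The data points, the +1/-1 label encoding, and the inclusion of a bias term alongside the weights are assumptions.

```python
def predict(features, weights, bias):
    """Linear classifier: sign of the weighted sum of the features."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else -1

def train_perceptron(data, epochs=10):
    """Start from zero weights and nudge them whenever an instance is misclassified."""
    weights = [0] * len(data[0][0])
    bias = 0
    for _ in range(epochs):
        for features, label in data:
            if predict(features, weights, bias) != label:
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label
    return weights, bias

# Hypothetical, linearly separable training data: (features, label).
data = [((2, 1), 1), ((3, 2), 1), ((-1, -1), -1), ((-2, -1), -1)]
weights, bias = train_perceptron(data)
print(weights, bias)  # [2, 1] 1
```

After training, the model is fully described by `weights` (and `bias`); classifying a new instance needs nothing else.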

27
Q

A good way to recognise overfitting is:

A

Compare accuracy on holdout data with accuracy on training data
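
The comparison can be sketched with a deliberately overfitting model: a lookup table that memorizes every training instance and falls back to the majority class for anything it has not seen. The customer data (age, minutes used) and both helper functions are invented for illustration.

```python
from collections import Counter

def train_lookup_table(rows):
    """Memorize each training instance; unseen instances get the majority class."""
    table = {features: label for features, label in rows}
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: table.get(features, majority)

def accuracy(model, rows):
    """Share of instances for which the model's prediction matches the label."""
    return sum(model(features) == label for features, label in rows) / len(rows)

# Hypothetical customers: (age, minutes used) -> outcome.
training = [
    ((25, 300), "churn"), ((40, 120), "stay"), ((31, 450), "churn"),
    ((52, 80), "stay"), ((47, 95), "stay"),
]
holdout = [
    ((29, 410), "churn"), ((45, 100), "stay"),
    ((33, 380), "churn"), ((50, 90), "stay"),
]

model = train_lookup_table(training)
train_acc = accuracy(model, training)
holdout_acc = accuracy(model, holdout)
print(train_acc, holdout_acc)  # 1.0 0.5
```

Perfect accuracy on the training data combined with much lower accuracy on the holdout data is the signature of a model that memorized rather than generalized.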

28
Q

What can the data mining technique kNN be used for?

A

Classification and regression
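
Both uses can be sketched with the same neighbour search: classification takes a majority vote among the k nearest neighbours, and regression averages their values. Euclidean distance, k = 3, and the data below are assumptions made for illustration.

```python
import math
from collections import Counter

def nearest(train, query, k):
    """The k training rows whose features are closest to the query."""
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]

def knn_classify(train, query, k=3):
    """Majority class among the k nearest neighbours."""
    votes = Counter(label for _, label in nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Mean target value of the k nearest neighbours."""
    return sum(value for _, value in nearest(train, query, k)) / k

# Hypothetical labelled instances for classification.
labelled = [
    ((1, 1), "stay"), ((2, 1), "stay"), ((1, 2), "stay"),
    ((8, 9), "churn"), ((9, 8), "churn"), ((9, 9), "churn"),
]
# Hypothetical numeric targets for regression.
valued = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]

predicted_class = knn_classify(labelled, (8, 8))
predicted_value = knn_regress(valued, (2,))
print(predicted_class, predicted_value)  # churn 20.0
```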

29
Q

What are the most widely used techniques in data mining?

A

Classification, regression and clustering.

30
Q

What does co-occurrence and association analysis (market-basket analysis) do?

A

It finds items that go together. For example, by analyzing market-basket data, you might find that customers who bought a pork sandwich also bought water. Learning these associations can be very useful.
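
A minimal sketch of the idea: count how often pairs of items appear in the same basket, then ask how often buyers of one item also bought another. The baskets and the `confidence` helper are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets.
baskets = [
    {"pork sandwich", "water"},
    {"pork sandwich", "water", "chips"},
    {"pork sandwich", "chips"},
    {"water"},
]

# Count every pair of items bought together (sorted so each pair has one spelling).
pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))

def confidence(antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

together = pair_counts[("pork sandwich", "water")]
rule_strength = confidence("pork sandwich", "water")
print(together, rule_strength)
```

Here two of the three baskets with a pork sandwich also contain water, so the rule "pork sandwich implies water" holds about two thirds of the time in this toy data.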

31
Q

What is the core of data analytical thinking?

A
Data should be considered an asset
Can help to structure business problems
Applying data science to a well-structured problem vs. exploratory data mining
32
Q

What is the aim of generalisation in data analytical thinking?

A

We want patterns that generalize to data we have not seen

33
Q

Mention four ways to extract knowledge from data

A
  1. Identifying informative attributes
  2. Fitting a numeric function model to data
  3. Controlling complexity - generalization and overfitting
  4. Calculating similarity between objects