Pensum Flashcards

Question 1

Q

What are the two main machine learning techniques for data mining?

Answer

A

Supervised machine learning and unsupervised machine learning

Question 2

Q

What are examples of supervised machine learning?

Answer

A

Decision trees, linear classifiers, linear regression

Question 3

Q

What are examples of unsupervised machine learning?

Answer

A

Clustering

Question 4

Q

What are the characteristics of unsupervised machine learning?

Answer

A

There is no specific target value for unsupervised methods. The system just looks for patterns in the data, with nothing acting like “a teacher”. The hope is that the data can be grouped neatly into a small number of categories; we then have to examine the result ourselves.

Question 5

Q

What is the goal of clustering?

Answer

A

The goal is to group together similar instances using some metric of similarity, that is, to create groupings where the members of a given group are similar to each other. For example, group similar customers together and design a different campaign for each group.
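
As a rough sketch of grouping by similarity, the snippet below runs a tiny one-dimensional k-means on made-up customer spend figures. The function name `kmeans_1d`, the data, and the starting centroids are all assumptions for illustration; absolute difference is used as the similarity metric.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest centroid, then re-average."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid if a group ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return groups, centroids


# Hypothetical yearly spend per customer.
spend = [10, 12, 11, 95, 100, 105]
groups, centers = kmeans_1d(spend, centroids=[0.0, 50.0])
```

No group labels are supplied in advance; the two groups (low and high spenders) emerge from the data alone.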

Question 6

Q

What are the characteristics of clustering?

Answer

A

It is like classification, but the groupings are not predefined. It is more open-ended than classification and regression. It could find a way to group similar customers together, which may or may not relate to the churn question.

Question 7

Q

What is the fundamental goal of data mining techniques?

Answer

A

Exploration to find patterns in a dataset.

Question 8

Q

What is similarity matching?

Answer

A

Instances are compared based on their attributes to determine how similar they are. Example: Amazon finding books that are similar to a book you have read. If the book you read has three attributes, the most similar book is one that shares all three.
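
The book example can be sketched by counting shared attributes. The titles, the attribute names, and the helper `shared_attributes` are invented for illustration; real systems typically use richer similarity measures.

```python
def shared_attributes(a, b):
    """Similarity as the number of attributes two items have in common."""
    return len(set(a) & set(b))


# Attributes of the book already read (hypothetical).
read = {"crime", "nordic", "paperback"}

catalog = {
    "Book A": {"crime", "nordic", "paperback"},
    "Book B": {"crime", "hardcover"},
    "Book C": {"cookery", "hardcover"},
}

# Most similar first: Book A shares all three attributes.
ranked = sorted(catalog, key=lambda title: shared_attributes(read, catalog[title]), reverse=True)
```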

Question 9

Q

When do we use similarity matching?

Answer

A

The general idea of similarity matching is applicable in many different forms of data mining, including classification, regression, and clustering.

Question 10

Q

What is important with similarity matching?

Answer

A

It is important to have information about the relevant attributes, and about which attributes matter most.

Question 11

Q

What is regression?

Answer

A

Regression predicts a numerical value. It is related to classification, but there is a difference: classification predicts whether something will happen, while regression predicts how much.

Question 12

Q

Give an example of when we would use regression.

Answer

A

How much will a customer spend? That question is answered with regression.
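
A minimal sketch of the "how much" question, assuming a single made-up feature (number of visits) and ordinary least squares as the fitting method. The function `fit_line` and the data are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: visits per month and amount spent.
visits = [1, 2, 3, 4]
spend = [15.0, 25.0, 35.0, 45.0]

slope, intercept = fit_line(visits, spend)
predicted = slope * 5 + intercept  # estimated spend for a customer with 5 visits
```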

Question 13

Q

What does supervised machine learning do in general?

Answer

A

A target value is specified for each instance. Instances are examined one by one, so we can simply compute how often the system makes the right choice.

Question 14

Q

What is classification?

Answer

A

Classification involves defining a small number of classes and then trying to predict, for each instance, which class it belongs to. In the churn example, classification is a natural fit: one class for "will churn" and one for "will not churn".
Each instance is labelled with a target value indicating what class it belongs to.

Question 15

Q

What is data preparation?

Answer

A

Data preparation is about constructing a dataset from one or more data sources, to be used for exploration and modeling.

Question 16

Q

Why do we need data preparation?

Answer

A

It is solid practice to start with an initial dataset to get familiar with the data, discover first insights, and gain a good understanding of any data quality issues. Data preparation is often time-consuming and highly error-prone.

Question 17

Q

What is a database?

Answer

A

A database collects, stores and manages information so that users can retrieve, add, update or remove it. It presents information in tables with rows and columns.

Question 18

Q

What tools can you use to assess data?

Answer

A

Accuracy, precision/recall, training/test splits, and cross-validation.
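
As a sketch of the first two measures, the snippet below computes accuracy, precision and recall from invented binary labels (1 = positive class). The function name `evaluate` and the label lists are assumptions for illustration.

```python
def evaluate(actual, predicted):
    """Return (accuracy, precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    return accuracy, precision, recall


# Hypothetical labels for eight test instances.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
accuracy, precision, recall = evaluate(actual, predicted)
```

This sketch does not guard against division by zero, which occurs when no positives are predicted or none are present.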

Question 19

Q

What technique do you use when you want to find out how much a customer wants to use a service?

Answer

A

Regression

Question 20

Q

What does classification do?

Answer

A

Predicts the class each individual belongs to

Question 21

Q

What does regression do?

Answer

A

Estimates a numerical value for each individual

Question 22

Q

What does clustering do?

Answer

A

Identifies similar individuals based on data known about them

Question 23

Q

Can we find groups of customers who are likely to cancel the service when the contract expires?

Answer

A

This is a problem for supervised learning

Question 24

Q

What can a CSV data file look like?

Answer

A

sunny,short,boring,no
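
A row like this can be read with Python's standard `csv` module. In this sketch the column names and the second row are invented, and the last field is assumed to be the target label.

```python
import csv
import io

# The first row is the card's example; the second row and the column names are invented.
raw = "sunny,short,boring,no\nrainy,long,interesting,yes\n"
columns = ["weather", "length", "content", "target"]

rows = list(csv.DictReader(io.StringIO(raw), fieldnames=columns))

# Split each row into its attribute values and its (assumed) target label.
features = [[row[c] for c in columns[:-1]] for row in rows]
targets = [row["target"] for row in rows]
```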

Question 25

Q

How do you calculate entropy? (Entropy is the measure on which information gain is based.)
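
As a sketch of the standard definition, Shannon entropy in bits is the negative sum of p * log2(p) over the class proportions p, and information gain is the parent's entropy minus the size-weighted entropy of the child groups after a split. The function names and example labels below are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(parent)
    return entropy(parent) - sum(len(ch) / total * entropy(ch) for ch in children)


# Hypothetical split that separates the two classes perfectly.
parent = ["yes", "yes", "no", "no"]
children = [["yes", "yes"], ["no", "no"]]
gain = information_gain(parent, children)
```

An evenly mixed group has entropy 1 bit and a pure group has entropy 0, so a split that produces pure groups yields the maximum gain.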

Question 26

Q

A Linear Classifier is a Parameterized Model – the Parameters are what is learned in the training process. What are the parameters for a Linear Classifier?

Answer

A

The weights
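
A minimal sketch of how the weights are used once learned: the classifier computes a weighted sum of the features and checks which side of zero it falls on. The weight values, the class names, and the separate `bias` term (often counted as one more learned weight) are assumptions for illustration; training is not shown.

```python
def linear_classify(features, weights, bias):
    """Classify by the sign of the weighted sum w . x + bias."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "yes" if score > 0 else "no"


# Hypothetical weights, as if produced by a training process.
weights = [1.0, -2.0]
bias = 0.5

first = linear_classify([3.0, 1.0], weights, bias)   # score = 3 - 2 + 0.5 = 1.5
second = linear_classify([1.0, 2.0], weights, bias)  # score = 1 - 4 + 0.5 = -2.5
```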

Question 27

Q

A good way to recognise overfitting is:

Answer

A

Compare accuracy on holdout data with accuracy on training data
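
The comparison can be sketched with a deliberately overfit model: a lookup table that memorizes the training data. The data, the `memorizer` model, and its default answer for unseen inputs are invented for illustration.

```python
def accuracy(model, data):
    """Fraction of (input, label) pairs the model gets right."""
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical labelled data, split into training and holdout sets.
train = [(1, "no"), (2, "no"), (3, "yes"), (4, "yes")]
holdout = [(5, "yes"), (6, "yes"), (0, "no"), (7, "yes")]

memory = dict(train)


def memorizer(x):
    """Overfit model: recalls training labels exactly, answers "no" for anything unseen."""
    return memory.get(x, "no")


train_acc = accuracy(memorizer, train)
holdout_acc = accuracy(memorizer, holdout)
gap = train_acc - holdout_acc  # a large gap suggests overfitting
```

Here the model is perfect on the training data but right on only one of the four holdout instances.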

Question 28

Q

What can kNN be used for as a data mining technique?

Answer

A

Classification and regression
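
A small sketch of both uses, assuming a single numeric feature and absolute difference as the distance: classification takes a majority vote among the k nearest neighbours, regression averages their values. All names and data are illustrative.

```python
from collections import Counter


def nearest(train, x, k):
    """The k training pairs whose feature value is closest to x."""
    return sorted(train, key=lambda pair: abs(pair[0] - x))[:k]


def knn_classify(train, x, k=3):
    """Majority vote among the k nearest labels."""
    votes = Counter(label for _, label in nearest(train, x, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, x, k=3):
    """Average of the k nearest target values."""
    neighbours = nearest(train, x, k)
    return sum(value for _, value in neighbours) / len(neighbours)


# Hypothetical training data: (feature, target).
labelled = [(1, "no"), (2, "no"), (3, "no"), (8, "yes"), (9, "yes"), (10, "yes")]
numeric = [(1, 10.0), (2, 20.0), (3, 30.0), (8, 80.0)]

label = knn_classify(labelled, 7.5)
estimate = knn_regress(numeric, 2)
```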

Question 29

Q

What are the most widely used techniques in data mining?

Answer

A

Classification, regression and clustering.

Question 30

Q

What does the data mining technique co-occurrences and associations/market-basket analysis do?

Answer

A

Finding items that go together. For example, by analyzing market basket data, you might find that customers who bought a pork sandwich also bought water. Learning these associations can be very useful.
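
A minimal sketch of co-occurrence counting over invented baskets. It counts how often each pair of items appears together and estimates how often a pork sandwich purchase comes with water; the basket contents are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, one set of items per basket.
baskets = [
    {"pork sandwich", "water"},
    {"pork sandwich", "water", "crisps"},
    {"salad", "water"},
    {"pork sandwich", "crisps"},
]

# Count every unordered pair of items bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Share of pork sandwich baskets that also contain water.
with_sandwich = sum("pork sandwich" in b for b in baskets)
confidence = pair_counts[("pork sandwich", "water")] / with_sandwich
```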

Question 31

Q

What is the core of data analytical thinking?

Answer

A

Data should be considered an asset
Can help to structure business problems
Applying data science to a well-structured problem vs exploratory data mining

Question 32

Q

What is the aim of generalisation in data analytical thinking?

Answer

A

We want patterns that generalize to data we have not yet seen.

Question 33

Q

Mention four ways to extract knowledge from data

Answer

A

Identifying informative attributes
Fitting a numeric function model to data
Controlling complexity - generalization and overfitting
Calculating similarity between objects
