Introduction to Data Analysis Flashcards

After working your way through this week, you should be able to:
- Differentiate between supervised and unsupervised learning
- Differentiate between inference and prediction
- Understand the trade-off between prediction accuracy and interpretability
- Understand the bias-variance trade-off

1
Q

What is data mining?

A

It is the process of automatically extracting information from large datasets.

"Automatically" is important here, as extracting the information manually from large datasets is infeasible.

2
Q

What are the differences between structured and unstructured data?

A

Structured data:
Usually in a database, with numbers and measurements

Unstructured data:
Free-form text from documents

3
Q

What is machine learning?

A

Machine learning refers to the branch of artificial intelligence that studies methods for automatically learning from data.

4
Q

What is probability theory?

A

The branch of mathematics that is concerned with random phenomena.

5
Q

What is statistics?

A

Statistics is the science of collection, organisation and interpretation of data.

A statistic is a function of datasets, intended to summarise data.

6
Q

What is a statistical model?

A

A statistical model is a mathematical statement or function that describes the relationship between variables that have a random component.

We use them to describe our world in data science.

Many machine learning algorithms are based on statistical models and they are important in Natural Language Processing.

7
Q

What are the differences between statistics and machine learning?

A

ML typically involves the analysis of much larger datasets and is more automated.

Statistics focuses on hypothesis testing and inference, whereas ML focuses on prediction.

8
Q

What is supervised learning?

A

Supervised learning refers to ML algorithms that learn from labelled training data, i.e. data containing both independent (input) and dependent (output) variables.

9
Q

What is unsupervised learning?

A

Unsupervised learning involves creating ML models that can describe the data without having a clear target variable. E.g. topic modelling, clustering

10
Q

Well then what is semi-supervised learning?

A

Semi-supervised learning combines both scenarios: the dataset used to train the model contains both labelled and unlabelled data.

11
Q

In the case of predicting sales from the advertising spend on TV, newspapers and radio, what are the input variables and output variables?

A

Input/predictor/independent variable: TV, newspaper and radio spending
Output/response/dependent variable: Sales

12
Q

What are the differences between prediction and inference?

A

Prediction requires finding an estimate f̂ such that Ŷ = f̂(X). We just want accurate predictions for Y; f̂(X) can be a black box, as long as it is accurate.

Inference aims to understand how the dependent variable is affected by the independent variables.

13
Q

Most statistical learning methods are categorized as parametric or non-parametric methods.

What are these referring to?

A

Parametric methods assume a functional form for f, reducing the problem of estimating f to estimating a set of parameters. E.g. the linear regression model

Non-parametric methods make no explicit assumptions about the functional form of f. E.g. random forests, K-nearest neighbours
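The parametric idea can be sketched in plain Python: once we assume f(x) = b0 + b1*x, estimating f reduces to estimating just two parameters. This ordinary-least-squares sketch and its toy data are purely illustrative:

```python
def fit_linear(xs, ys):
    """Parametric: assume f(x) = b0 + b1*x, so estimating f reduces
    to estimating two parameters via ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

# Toy data generated exactly by y = 1 + 2x, so OLS recovers (1.0, 2.0)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(fit_linear(xs, ys))  # (1.0, 2.0)
```

A non-parametric method such as K-nearest neighbours would instead let the data speak for themselves, without committing to a straight-line form.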

14
Q

What is the trade-off between accuracy and model interpretability?

A

Generally, these two qualities have an inverse relationship.

Models with high interpretability often restrict flexibility, and so use less complex functions to represent the data.

Low-interpretability models such as random forests often act as a "black box" but produce quite accurate results.

15
Q

What are regression vs classification tasks?

A

Regression involves a quantitative response, e.g. a continuous sales variable. Can be done with linear regression

Classification involves a qualitative response, e.g. the type of sale involved with a purchase. Can be done with Logistic Regression

16
Q

What is a good way of measuring the fit of a model?

A

We want to measure how well the predictions match the true labels in a training dataset.

In regression, we commonly use the Mean Squared Error (MSE).

This is: MSE = (1/n) * Σ_{i=1}^{n} (y_i − ŷ_i)²

That is, the average of the squared differences between the true values and the predicted values.
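The formula can be sketched in a few lines of plain Python (variable names are illustrative):

```python
def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared differences
    between true and predicted values."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

# Example: errors of 1 and 3 give MSE = (1 + 9) / 2 = 5.0
print(mse([2.0, 5.0], [3.0, 2.0]))  # 5.0
```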

17
Q

What is cross-validation?

A

CV is a method for estimating the training and test error of a model.

It repeatedly splits the dataset into training and test subsets, trains the model on the training portion, and measures how well it generalises on the held-out test portion.
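The splitting behind k-fold cross-validation can be sketched in plain Python (a real workflow would typically shuffle the data first and use a library implementation such as scikit-learn's KFold):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold CV.
    Each fold serves as the held-out test set exactly once."""
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

# 6 observations, 3 folds: every observation is tested exactly once
for train, test in k_fold_splits(6, 3):
    print(train, test)
```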

18
Q

What is the bias-variance trade-off?

A

The U-shape of the test MSE is due to two competing properties of learning methods, the bias and the variance.

Bias is the difference between the average prediction of our model and the correct value.

Variance is the amount by which the predictions would change if we estimated the model using a different training dataset (with the same method).
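Both quantities can be estimated empirically by refitting the same model on many datasets drawn from the same process. In this pure-Python sketch, the data-generating process and the deliberately high-bias model (one that ignores x entirely) are invented for illustration:

```python
import random

random.seed(0)

def true_f(x):
    return 2.0 * x  # the true relationship (illustrative)

def fit_mean_model(ys):
    """An intentionally high-bias model: always predicts the sample mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

# Draw many datasets from the same process and look at the prediction
# at x0 = 1.0: its spread across datasets is the variance, and the gap
# between its average and true_f(x0) is the bias.
x0, preds = 1.0, []
for _ in range(2000):
    xs = [random.uniform(-1, 1) for _ in range(20)]
    ys = [true_f(x) + random.gauss(0, 0.5) for x in xs]
    model = fit_mean_model(ys)
    preds.append(model(x0))

avg_pred = sum(preds) / len(preds)
bias = avg_pred - true_f(x0)          # close to -2: the model badly underfits
variance = sum((p - avg_pred) ** 2 for p in preds) / len(preds)  # small
print(round(bias, 2), round(variance, 3))
```

A more flexible model fitted to the same datasets would show the opposite pattern: low bias but high variance, which is exactly the trade-off behind the U-shaped test MSE.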

19
Q

What is preferred, low training or low testing MSE?

A

We want the sweet spot where the test MSE is lowest. A very low training MSE can simply mean the model is overfitting.

20
Q

What is the K-nearest neighbours model?

A

This model is a method for estimating the conditional distribution of Y given X, and then classifying an observation to the class with the highest estimated probability.

  1. Identify the K points in the training data that are closest to the test observation, x0
  2. Estimate the conditional probability for class j as the fraction of points in N0 whose response values equal j
  3. Use Bayes’ rule to classify the test observation to the class with highest probability
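The three steps above can be sketched in plain Python for 1-D inputs (the training data are invented for illustration):

```python
from collections import Counter

def knn_classify(train, x0, k):
    """K-nearest neighbours for 1-D inputs.
    train: list of (x, label) pairs; x0: test point; k: neighbourhood size."""
    # 1. Identify the K training points closest to x0
    neighbours = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    # 2. Estimate P(class = j | X = x0) as the fraction of the K
    #    neighbours whose label equals j
    counts = Counter(label for _, label in neighbours)
    # 3. Classify to the class with the highest estimated probability
    return counts.most_common(1)[0][0]

train = [(0.0, "a"), (0.1, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
print(knn_classify(train, 0.95, 3))  # "b": all 3 nearest points are "b"
```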