1.1: Data Science Flashcards

1
Q

What is the value of learning about multivariate statistics and machine learning?

A

Multivariate statistics (MS) methods are probably the most widely used in psychology, and probably in other sciences too. A solid understanding of the basics of classical MS methods paves the way for understanding and using more advanced methods such as multi-level modelling and structural equation modelling, which are very popular techniques.

2
Q

What is multivariate statistics?

A

A collection of methods that form the heart of what is now hailed as data science. It involves:
• quite advanced data analysis techniques
• models designed for multiple dependent and independent variables
• the most commonly applied statistical methods in psychology
• applied multivariate statistics is now popularised as data science

3
Q

What is machine learning?

A

An approach to AI in which computers are programmed to carry out tasks (often ones that until recently only humans could do) by being shown many examples of the task being performed, for example by a human, e.g. diagnosing from X-ray scans. To a large extent, it is an advanced application of multivariate statistical methods to big data.

4
Q

What is the relationship between multivariate statistics and machine learning?

A

In the questions it asks, ML overlaps considerably with the empirical sciences (e.g. psychology) and their statistical methods. Many machine learning techniques have therefore been borrowed and adapted from earlier multivariate statistical methods.

5
Q

Which sub-disciplines make up data science?

A

Applied multivariate statistics, computational statistics and machine learning (classic ML and also deep learning).

Also AI, visualisation and big data (volume, velocity, variety, distributed databases, cloud computing)

6
Q

What is meant by a model?

A

An object or mechanism that captures essential behaviour of the object under study, which helps us to rigorously reason about and predict what might happen under changing conditions.

e.g. animal models (for making inferences about humans in research), diagram models (e.g. a line graph), mathematical models

7
Q

Give an example of an animal model, a diagram model and a mathematical model in psychology

A

Rat model; the spotlight model of visual attention (more precise than a mere analogy); and the linear structural relations model (a latent-variable model showing complex relationships between variables, which we use to reason about how things fit together or change) or the forgetting curve.

8
Q

Why use statistical models?

A

Empirical sciences are fraught with uncertainty about relations between variables, noise etc. Statistical models allow us to model this uncertainty with the help of random variables.

  • it’s an object: a mathematical expression of relations
  • it uses random variables to account for stochastic (random) behaviour
  • it helps us to reason about the relations
  • it helps us to predict what happens when things change
9
Q

In linear regression the model is as follows:
Y = B0 + B1X + e

Describe what each term represents.

A
Y: dependent variable
X: independent variable
B0: intercept
B1: regression slope
e (epsilon): error term, which models the uncertainty in the relationship between X and Y
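
The model above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the parameter values, the sample size, the predictor distribution and the function name `simulate_and_fit` are made up for illustration, and NumPy is assumed to be available. It generates data from Y = B0 + B1X + e and then recovers the intercept and slope by least squares.

```python
import numpy as np

def simulate_and_fit(n=1000, b0=2.0, b1=0.5, sigma=1.0, seed=0):
    """Simulate Y = b0 + b1*X + e and estimate b0 and b1 by least squares.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=7.0, scale=1.5, size=n)    # hypothetical predictor X
    e = rng.normal(loc=0.0, scale=sigma, size=n)  # error term: mean 0, variance sigma^2
    y = b0 + b1 * x + e

    # Least-squares estimates: slope = cov(X, Y) / var(X),
    # intercept = mean(Y) - slope * mean(X)
    b1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0_hat = y.mean() - b1_hat * x.mean()
    return b0_hat, b1_hat

b0_hat, b1_hat = simulate_and_fit()
print(f"estimated intercept: {b0_hat:.2f}, estimated slope: {b1_hat:.2f}")
```

With a reasonably large sample the estimates should land close to the values used to generate the data, though not exactly on them, because of the error term.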
10
Q

What assumptions do we make about epsilon (the uncertainty)?

A

It is normally distributed with mean 0 and a given variance (the standard deviation squared). The error is also assumed to be uncorrelated with X.

11
Q

What does the regression equation as a mathematical model allow us to deduce?

A

The expected value of Y (e.g. attention) for a given value of X (e.g. hours slept). It also allows us to deduce the variance of Y and the covariance between X and Y.
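
As a sketch of these deductions, the expressions below follow from Y = B0 + B1X + e, assuming the error e has mean 0 and variance σ² and is uncorrelated with X. The symbol σ² is introduced here for illustration; the cards only refer to "a given variance".

```latex
\begin{align*}
\mathrm{E}[Y \mid X = x] &= B_0 + B_1 x \\
\mathrm{Var}(Y) &= B_1^2 \,\mathrm{Var}(X) + \sigma^2 \\
\mathrm{Cov}(X, Y) &= B_1 \,\mathrm{Var}(X)
\end{align*}
```

The cross term 2 B_1 Cov(X, e) drops out of Var(Y), and the term Cov(X, e) drops out of Cov(X, Y), precisely because the error is assumed to be uncorrelated with X.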

12
Q

What is meant by covariance?

A

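A measure of how two variables vary together. It is similar to correlation, but covariance is unstandardised: its size depends on the units of the variables, whereas correlation is rescaled to lie between -1 and 1.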
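
The difference can be seen numerically. The following is a minimal sketch using simulated data (the variables and their values are invented for illustration, and NumPy is assumed): dividing the covariance by both standard deviations gives the correlation, and changing the unit of one variable changes the covariance but leaves the correlation untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)             # hypothetical variable X
y = 2.0 * x + rng.normal(size=500)   # hypothetical variable Y, related to X

# Covariance is in the units of X times the units of Y
cov_xy = np.cov(x, y, ddof=1)[0, 1]

# Correlation is the covariance standardised by both standard deviations
corr_xy = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Rescaling X (e.g. metres to centimetres) rescales the covariance...
x_scaled = 100.0 * x
cov_scaled = np.cov(x_scaled, y, ddof=1)[0, 1]

# ...but the correlation stays the same
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]

print(f"covariance: {cov_xy:.3f} -> {cov_scaled:.3f} after rescaling X")
print(f"correlation: {corr_xy:.3f} -> {corr_scaled:.3f} after rescaling X")
```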

13
Q

Are mathematical models based on causal reasoning or are they purely descriptive?

A
  • Statistical models may be based on causal reasoning: more associated with statistical inference in science
  • Statistical models may be purely descriptive—i.e., it is a summary: more associated with machine learning
  • Mathematically there is no difference
14
Q

Is the general factor model of intelligence causal or descriptive?

A

It explains why people who score highly on one test also score highly on another. Many people take this factor to be an underlying cause of both test scores, but others say it is only a summary of the observed relationships in the data. It is therefore unclear whether this model is causal or simply descriptive, which is often the case with psychological models. Neither statistics nor mathematics can determine whether a model is causal.

15
Q

Multivariate data require more sophisticated statistical models, such as:
• Models with multiple independent and dependent variables
• Latent variable models (Factor Analysis), PCA
• Multivariate (normal) probability distribution
• Clustering

What do all of these models mostly have in common?

A

Almost always focused on means and (co-)variances

16
Q

What is the difference between statistics and machine learning?

A

They share many techniques. However, statistics is typically used for making inferences through causal modelling, whereas ML is typically used for making accurate predictions through descriptive modelling.

There are also the following differences:
• Statistics interprets a few, highly interpretable model parameters, while ML largely ignores the meaning of its parameters (there are often very many, even 10^6)
• Statistics typically works with smaller data sets, so statistical power is a concern; ML uses large data sets, hence the need for big data
• Statistics is concerned with the chance model, while ML is concerned with prediction accuracy
17
Q

What does machine learning use in place of the hypothesis testing used in statistics?

A

Because of its focus on prediction, ML divides the data into a training set and a test set instead of conducting hypothesis tests.

• Training data: used to develop a predictive model
• Test Data: used to test the prediction accuracy
• Cross-validation (in large data sets): the training data is split again into subsets:
1. fit model on one subset
2. cross-validate the fitted model on the other
3. repeat 1. and 2. with different data splits
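
The procedure above can be sketched in a few lines. This is an illustrative example only, not a prescribed implementation: it uses simulated data, a simple straight-line model, and k-fold cross-validation, which is one common variant in which the model is fitted on all folds but one and validated on the fold left out. The helper names (`fit_line`, `mse`, `cross_validate`), the 25% test split and k = 5 are assumptions made for the example, and NumPy is assumed to be available.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    slope, intercept = np.polyfit(x, y, deg=1)  # highest-degree coefficient first
    return intercept, slope

def mse(x, y, intercept, slope):
    """Mean squared prediction error of the fitted line."""
    return float(np.mean((y - (intercept + slope * x)) ** 2))

def cross_validate(x, y, k=5, seed=0):
    """k-fold cross-validation: each fold is held out once for validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        fit_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        intercept, slope = fit_line(x[fit_idx], y[fit_idx])              # 1. fit
        scores.append(mse(x[val_idx], y[val_idx], intercept, slope))     # 2. validate
    return scores                                                        # 3. repeated k times

# Simulated data (hypothetical true intercept 1.0 and slope 3.0)
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 1.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

# Hold out 25% as test data; it is not touched during cross-validation
order = rng.permutation(len(x))
test_idx, train_idx = order[:50], order[50:]

cv_scores = cross_validate(x[train_idx], y[train_idx], k=5)
intercept, slope = fit_line(x[train_idx], y[train_idx])
test_mse = mse(x[test_idx], y[test_idx], intercept, slope)

print("cross-validation MSE per fold:", [round(s, 3) for s in cv_scores])
print("test MSE:", round(test_mse, 3))
```

The cross-validation scores are used while developing the model; the test data is only used once at the end to estimate prediction accuracy.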