1.1: Data Science Flashcards
What is the value of learning about multivariate statistics and machine learning?
Multivariate statistical (MS) methods are probably the most widely used statistical methods in psychology, and probably in other sciences too. A solid understanding of the basics of classical MS methods paves the way for understanding and using more advanced methods such as multi-level modelling and structural equation modelling, both of which are very popular techniques.
What is multivariate statistics?
A collection of methods that form the heart of what is now hailed as data science. It involves:
• quite advanced data analysis techniques
• models designed for multiple dependent and independent variables
• the most commonly applied statistical methods in psychology
• applied multivariate statistics is now popularised as data science
What is machine learning?
An approach to AI in which computers are programmed to carry out tasks (often human-like abilities that until recently only humans could perform) by being shown many examples of the task being performed, for example by a human, e.g. diagnosing from X-ray scans. To a large extent it is an advanced application of multivariate statistical methods to Big Data.
What is the relationship between multivariate statistics and machine learning?
ML asks many of the same questions as the empirical sciences (e.g. psychology) and their statistical methods. Many machine learning techniques have therefore been borrowed and adapted from earlier multivariate statistical methods.
Which sub-disciplines make up data science?
Applied multivariate statistics, computational statistics and machine learning (classic ML and also deep learning).
Also AI, visualisation and big data (volume, velocity, variety, distributed databases, cloud computing).
What is meant by a model?
An object or mechanism that captures essential behaviour of the object under study, which helps us to rigorously reason about and predict what might happen under changing conditions.
e.g. animal models (for making inferences about humans in research), diagram models (e.g. a line graph), mathematical models
Give an example of an animal model, a diagram model and a mathematical model in psychology.
Animal model: the rat model. Diagram model: the spotlight model of visual attention (more precise than a simple analogy). Mathematical model: the linear structural relations model (a latent-variable model that shows complex relationships between variables, which we use to reason about how things may fit together or change) or the forgetting curve.
Why use statistical models?
Empirical sciences are fraught with uncertainty about relations between variables, noise etc. Statistical models allow us to model this uncertainty with the help of random variables.
- it’s an object: a mathematical expression of relations
- it uses random variables to account for stochastic (random) behaviour
- it helps us to reason about the relations
- it helps us to predict what happens when things change
In linear regression the model is as follows:
Y = B0 + B1X + e
Describe what each term represents.
- Y: dependent variable
- X: independent variable
- B0: intercept
- B1: regression slope
- e (epsilon): error term, which models the uncertainty in the relationship between X and Y
What assumptions do we make about epsilon (the uncertainty)?
It is normally distributed and centred on 0, with a given variance (the standard deviation squared). The error is also assumed to be uncorrelated with X.
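As a rough illustration of the model and its assumptions, the sketch below simulates data from Y = B0 + B1X + e with a normally distributed error centred on 0, then recovers the intercept and slope by least squares. The parameter values, the sample size and the function names `simulate` and `fit_line` are illustrative assumptions, not part of the flashcards.

```python
import random
from statistics import fmean


def simulate(n=5000, b0=2.0, b1=0.5, sigma=1.0, seed=0):
    """Draw n observations from Y = b0 + b1*X + e, with e ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    # X on a made-up "hours slept" scale; e is drawn independently of X.
    x = [rng.gauss(7.0, 1.5) for _ in range(n)]
    e = [rng.gauss(0.0, sigma) for _ in range(n)]
    y = [b0 + b1 * xi + ei for xi, ei in zip(x, e)]
    return x, y


def fit_line(x, y):
    """Least-squares estimates of the intercept and slope."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


x, y = simulate()
b0_hat, b1_hat = fit_line(x, y)
residuals = [yi - (b0_hat + b1_hat * xi) for xi, yi in zip(x, y)]

print("estimated B0:", round(b0_hat, 3))
print("estimated B1:", round(b1_hat, 3))
# With an intercept in the model, least-squares residuals average to zero
# by construction, mirroring the assumption that e is centred on 0.
print("mean residual:", fmean(residuals))
```

With a large sample the estimates should land close to the values used in the simulation, which is what lets us reason from data back to the model.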
What does the regression equation as a mathematical model allow us to deduce?
The expected value of Y (e.g. attention) for a given value of X (e.g. hours slept). It also allows us to deduce the variance of Y and the covariance between X and Y.
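These deductions can be written out as a short derivation. It is a sketch that assumes e has mean 0 for every value of X and is uncorrelated with X; the symbol \sigma^2 for the variance of e is notation introduced here, not taken from the flashcards.

```latex
\begin{align*}
E[Y \mid X = x] &= B_0 + B_1 x + E[e \mid X = x] = B_0 + B_1 x \\
\operatorname{Var}(Y) &= B_1^2 \operatorname{Var}(X) + \operatorname{Var}(e) + 2 B_1 \operatorname{Cov}(X, e)
                       = B_1^2 \operatorname{Var}(X) + \sigma^2 \\
\operatorname{Cov}(X, Y) &= \operatorname{Cov}(X,\; B_0 + B_1 X + e)
                          = B_1 \operatorname{Var}(X) + \operatorname{Cov}(X, e)
                          = B_1 \operatorname{Var}(X)
\end{align*}
```

The last line also shows that the slope is the covariance between X and Y divided by the variance of X.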
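What is meant by covariance?
A measure of how two variables vary together. It is similar to correlation, but covariance is unstandardised: it depends on the units of the variables, whereas correlation is the covariance divided by the product of the two standard deviations.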
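A minimal sketch of the difference, using made-up numbers: changing the units of one variable (hours to minutes) rescales the covariance but leaves the correlation unchanged. The data and the helper names `covariance` and `correlation` are illustrative assumptions.

```python
from math import sqrt
from statistics import fmean


def covariance(x, y):
    """Sample covariance (n - 1 in the denominator)."""
    mx, my = fmean(x), fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Covariance standardised by the two standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Made-up example data, not taken from any study.
hours_slept = [5.0, 6.0, 7.0, 8.0, 9.0]
attention = [52.0, 55.0, 61.0, 64.0, 68.0]
minutes_slept = [h * 60 for h in hours_slept]

cov_hours = covariance(hours_slept, attention)
cov_minutes = covariance(minutes_slept, attention)
corr_hours = correlation(hours_slept, attention)
corr_minutes = correlation(minutes_slept, attention)

print("covariance (hours):  ", cov_hours)
print("covariance (minutes):", cov_minutes)   # 60 times larger
print("correlation (hours):  ", corr_hours)
print("correlation (minutes):", corr_minutes)  # same as in hours
```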
Are mathematical models based on causal reasoning or are they purely descriptive?
- Statistical models may be based on causal reasoning: more associated with statistical inference in science
- Statistical models may be purely descriptive, i.e. a summary of the data: more associated with machine learning
- Mathematically there is no difference
Is the general factor model of intelligence causal or descriptive?
It explains why people who score highly on one test also score highly on another. Many people take this factor to be an underlying cause of both test scores, but others say it is only a summary of the observed relationships in the data. It is therefore unclear whether this model is causal or simply descriptive, which is often the case with psychological models. Neither statistics nor mathematics can determine whether a model is causal.
Multivariate data require more sophisticated statistical models such as:
• Models with multiple independent and dependent variables
• Latent variable models (Factor Analysis), PCA
• Multivariate (normal) probability distribution
• Clustering
What do all of these models mostly have in common?
Almost always focused on means and (co-)variances
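To make "means and (co-)variances" concrete, the sketch below computes the two summaries these models typically work from: the mean vector and the covariance matrix of a small multivariate data set. The scores and the function names `mean_vector` and `covariance_matrix` are made up for illustration.

```python
from statistics import fmean


def mean_vector(rows):
    """Mean of each variable (column) across observations (rows)."""
    return [fmean(col) for col in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix: variances on the diagonal, covariances off it."""
    n = len(rows)
    means = mean_vector(rows)
    centred = [[v - m for v, m in zip(row, means)] for row in rows]
    p = len(means)
    return [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


# Four made-up participants measured on three tests.
scores = [
    [10, 12, 30],
    [12, 15, 28],
    [14, 15, 35],
    [16, 18, 31],
]

means = mean_vector(scores)
cov = covariance_matrix(scores)

print("mean vector:", means)
for row in cov:
    print([round(v, 3) for v in row])
```

The matrix is symmetric, because the covariance between test 1 and test 2 is the same as the covariance between test 2 and test 1.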