Statistics Flashcards
What is the Central Limit Theorem?
The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, regardless of the shape of the population distribution. A common rule of thumb is that sample sizes of about 30 or more are large enough for this to hold.
In other words, as you take more samples, especially large ones, the distribution of the sample means will look more and more like a normal distribution.
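A minimal sketch of this (assuming NumPy is available, with a simulated exponential population): even though the population is heavily skewed, the means of repeated samples of size 30 cluster in a roughly bell-shaped curve around the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)
population_mean, n_samples, n = 1.0, 10_000, 30

# Draw 10,000 samples of size 30 from a skewed (exponential) population
# and compute the mean of each sample.
sample_means = rng.exponential(scale=population_mean, size=(n_samples, n)).mean(axis=1)

# The sample means cluster around the population mean with spread sigma / sqrt(n).
print(sample_means.mean())         # ~1.0 (the population mean)
print(sample_means.std(ddof=1))    # ~1.0 / sqrt(30), i.e. about 0.18
```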
In statistics, what is differentiable?
Differentiable means that a function has a derivative. In simple terms, it means there is a slope (one that you can calculate). This slope will tell you something about the rate of change: how fast or slow an event (like acceleration) is happening.
The derivative must exist at every point in the domain; otherwise, the function is not differentiable. This can happen when there is a hole in the graph: at a hole there is no slope (there's a drop-off!).
In statistics, what is non-differentiable?
In general, a function is not differentiable (you cannot calculate the slope) for four reasons:
- Corners,
- Cusps,
- Vertical tangents,
- Jump discontinuities.
You’ll be able to see these different types of scenarios by graphing the function on a graphing calculator.
In statistics, what does ‘differentiate’ mean?
When you differentiate (or “take the derivative”), you’re finding the slope of a function at a particular point. It tells you the rate of change (i.e. how fast or how slow something is changing).
For example, if you know the position of a car, you can use differentiation to tell you how fast the car is going at that point.
The major difference between differentiation and using the slope formula is that differentiation can give you all of the slopes at all of the points, while the slope formula can only give you one at a time.
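A small numerical sketch of the car example (assuming NumPy; the position function is made up): differentiating position with respect to time gives the speed at every point, not just one.

```python
import numpy as np

# Position of a car: s(t) = 3t^2 (metres), so the speed should be ds/dt = 6t.
t = np.linspace(0, 10, 1001)   # time in seconds
s = 3 * t**2                   # position in metres

# Numerical differentiation: the slope of s with respect to t at every point.
speed = np.gradient(s, t)

print(speed[100])   # ~6.0  (speed at t = 1 s)
print(speed[500])   # ~30.0 (speed at t = 5 s)
```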
What is a Deterministic model?
Deterministic models are based on precise relationships between the input variables and the model’s outputs.
The inputs to a deterministic model are fixed and known with certainty.
Given the same set of inputs, a deterministic model will always produce the same output.
Deterministic models do not incorporate randomness or uncertainty in their calculations.
These models are suitable when the system being modeled is well-defined and the inputs are known precisely.
Examples of deterministic models include mathematical equations, linear regression models, and optimization models.
What is a Stochastic model?
Stochastic models take into account randomness and uncertainty in the input variables or parameters, and produce probabilistic outcomes.
The inputs to a stochastic model are not fixed but rather described by probability distributions.
Given the same set of inputs, a stochastic model may produce different outputs due to the inherent randomness.
Stochastic models are appropriate when the system being modeled involves uncertainty or when the inputs are subject to variation.
Examples of stochastic models include Monte Carlo simulations, Markov chains, and certain types of machine learning models like Gaussian processes.
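A toy contrast between the two kinds of model (a sketch assuming NumPy; the linear relationship and noise level are invented for illustration): the deterministic function always returns the same output for the same input, while the stochastic version adds randomness, so repeated calls differ.

```python
import numpy as np

rng = np.random.default_rng()

def deterministic_model(x):
    # Fixed relationship: the same input always gives the same output.
    return 2.0 * x + 1.0

def stochastic_model(x, noise_sd=0.5):
    # Same relationship plus random noise: repeated calls give different outputs.
    return 2.0 * x + 1.0 + rng.normal(0.0, noise_sd)

print(deterministic_model(3.0), deterministic_model(3.0))  # identical
print(stochastic_model(3.0), stochastic_model(3.0))        # differ from run to run
```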
What is standard deviation?
The standard deviation measures the amount of variation of a set of values.
It’s calculated by taking the square root of the sum of squared differences from the mean divided by the number of data points; in other words, it is the square root of the variance.
In a normal distribution, about 95% of all values fall within two standard deviations of the mean.
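The same calculation, step by step (a sketch assuming NumPy; note that np.std divides by n by default, and ddof=1 gives the sample version):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sum of squared differences from the mean, divided by n, then square-rooted.
variance = ((data - data.mean()) ** 2).sum() / len(data)
std_dev = np.sqrt(variance)

print(std_dev)        # 2.0
print(np.std(data))   # 2.0, the same result from NumPy's built-in
```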
What is univariate analysis?
Univariate analysis explores each variable in a data set, separately.
It looks at the range of values as well as their central tendency, and describes the pattern of response to each variable on its own.
It is a form of descriptive statistics, which describe and summarize data.
What is bivariate analysis?
Bivariate analysis refers to analyzing two variables to determine relationships between them.
If there is a relationship between the two variables, it will show up as a correlation (though correlation only captures how strongly they are linearly related).
What is a population in statistics?
In statistics, the population comprises all observations (data points) about the subject under study.
An example of a population is studying the voters in an election. In the 2019 Lok Sabha elections, nearly 900 million voters were eligible to vote in 543 constituencies.
What are measures of central tendency?
Measures of central tendency are the measures that are used to describe the distribution of data using a single value.
Mean, median, and mode are the three main measures of central tendency.
In statistics, what is variance?
Variance measures the variability of the data around the mean; it is the average of the squared deviations from the mean.
In statistics, what is skewness?
Skewness measures the asymmetry of a distribution’s shape.
A distribution is symmetrical when the proportion of data at an equal distance from the mean (or median) is equal. If the tail extends to the right, the distribution is right-skewed; if the tail extends to the left, it is left-skewed.
In statistics, what is kurtosis?
Kurtosis in statistics is used to check whether the tails of a given distribution have extreme values. It also represents the shape of a probability distribution.
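A quick sketch computing both measures on simulated data (assuming SciPy is available; scipy.stats.kurtosis reports excess kurtosis by default, i.e. 0 for a normal distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(size=10_000)       # symmetric, light tails
skewed_data = rng.exponential(size=10_000)  # asymmetric, heavier right tail

print(stats.skew(normal_data), stats.kurtosis(normal_data))   # both ~0
print(stats.skew(skewed_data), stats.kurtosis(skewed_data))   # ~2 and ~6 for an exponential
```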
What is Gaussian distribution?
In statistics and probability, the Gaussian (normal) distribution is a widely used continuous probability distribution.
It is characterized by two parameters: the mean μ and the standard deviation σ.
Properties of Gaussian Distribution:
- Mean, median, and mode are the same
- Symmetrical bell shape
- 68% of the data lies within 1 standard deviation of the mean
- 95% of the data lies within 2 standard deviations of the mean
- 99.7% of the data lies within 3 standard deviations of the mean
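A quick check of those three figures using the normal CDF (a sketch assuming SciPy is available):

```python
from scipy import stats

mu, sigma = 0.0, 1.0
dist = stats.norm(mu, sigma)

# Probability mass within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    p = dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    print(f"within {k} sd: {p:.4f}")   # ~0.6827, 0.9545, 0.9973
```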
What is the p value?
The p stands for probability: the p-value is the probability of seeing a difference at least as large as the one observed if the null hypothesis were true, i.e. how likely the observed difference between groups is to arise by chance alone.
If the p-value > 0.05, fail to reject the null hypothesis.
If the p-value < 0.05, reject the null hypothesis (0.05 is the conventional significance threshold).
Some popular hypothesis tests are:
- Chi-square test
- T-test
- Z-test
- Analysis of Variance (ANOVA)
When is a t-test used and what are its assumptions?
A t-test is a type of statistical analysis used to compare the averages of two groups and determine whether the differences between them are likely to have arisen by chance.
Assumptions: (CRINE)
- Continuous data
- Random sample
- Independent observations
- Normal distribution of data in each group
- Equal variance (i.e., the variability of the data in each group is similar; if this is not the case, use Welch’s t-test).
A t-test can be one-sample, two-sample, or paired.
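A minimal two-sample sketch on simulated data (assuming SciPy is available): scipy.stats.ttest_ind runs the standard test, and equal_var=False switches to Welch’s t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=11.0, scale=2.0, size=40)

# Standard two-sample t-test (assumes equal variances) ...
t_stat, p_value = stats.ttest_ind(group_a, group_b)
# ... and Welch's t-test when the equal-variance assumption is doubtful.
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(t_stat, p_value)
print(t_welch, p_welch)
```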
What is correlation?
Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate).
It’s a common tool for describing simple relationships without making a statement about cause and effect.
We describe correlations with a unit-free measure called the correlation coefficient, which ranges from -1 to +1 and is denoted by r (values closer to zero mean weaker correlation; negative values mean negative correlation).
Statistical significance is indicated with a p-value. Therefore, correlations are typically written with two key numbers: r = and p = .
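A minimal sketch (assuming SciPy, with simulated data) showing how both numbers are obtained at once:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)   # roughly linearly related to x

# Pearson correlation coefficient and its p-value.
r, p = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.3g}")   # the two numbers a correlation is reported with
```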
In statistics, what is simple linear regression?
Simple linear regression is used to model the relationship between two continuous variables.
Often, the objective is to predict the value of an output variable (or response) based on the value of an input (or predictor) variable.
What is multiple linear regression?
Multiple linear regression is used to model the relationship between a continuous response variable and continuous or categorical explanatory variables.
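A minimal sketch of fitting such a model by ordinary least squares (assuming NumPy and simulated continuous predictors; a categorical predictor would first need to be encoded, e.g. as dummy variables):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coefs)   # ~[3.0, 2.0, -1.5]: the intercept and the two slopes
```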
Why should you do residual analysis?
To verify that the assumptions underlying the regression model are valid.
Check that residuals:
1. Have constant variance (no non-random pattern)
2. Are approximately normally distributed
3. Are independent from one another
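A minimal residual-check sketch (assuming NumPy and Matplotlib, with simulated data): the left panel should show a patternless cloud with roughly constant spread, and the right panel should look roughly normal.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.5 * x + 2.0 + rng.normal(scale=1.0, size=200)

# Fit a simple linear model and compute residuals = observed - fitted.
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.scatter(fitted, residuals, s=10)   # should look like a patternless cloud
ax1.axhline(0, color="red")
ax1.set(xlabel="fitted values", ylabel="residuals")
ax2.hist(residuals, bins=20)           # should look roughly normal
ax2.set(xlabel="residuals")
plt.show()
```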
What are externally studentized residuals?
A studentized residual is calculated by dividing the residual by an estimate of its standard deviation.
The standard deviation for each residual is computed with the observation excluded.
For this reason, studentized residuals are sometimes referred to as externally studentized residuals.
What is Multicollinearity?
When two or more predictors are highly correlated with one another.
In regression, multicollinearity can make it difficult to determine the effect of each predictor on the response, and can make it challenging to determine which variables to include in the model.
Multicollinearity can also cause other problems:
- Coefficients might be poorly estimated, or inflated.
- Coefficients might have signs that don’t make sense.
- Standard errors for these coefficients might be inflated.
To resolve, try:
- removing a redundant term from the model
- principal component analysis (PCA)
- partial least squares (PLS)
- tree-based methods
- penalized regression
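One quick way to spot the problem (a sketch assuming NumPy, with simulated predictors) is to inspect the correlation matrix of the predictors; variance inflation factors (VIF) are a more formal diagnostic.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly a copy of x1: multicollinearity
x3 = rng.normal(size=500)

# Pairwise correlations between predictors; values near +/-1 flag trouble.
print(np.round(np.corrcoef([x1, x2, x3]), 2))
```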
What is One Way ANOVA?
One-way analysis of variance (ANOVA) is a statistical method for testing for differences in the means of three or more groups.
One-way ANOVA can only be used when investigating a single factor and a single dependent variable.
When comparing the means of three or more groups, it can tell us if at least one pair of means is significantly different, but it can’t tell us which pair.
Also, it requires that the dependent variable be normally distributed in each of the groups and that the variability within groups is similar across groups.
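A minimal sketch with three simulated groups (assuming SciPy is available); a post-hoc test such as Tukey’s HSD would then be needed to identify which pair of means differs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, size=30)
group_b = rng.normal(10.5, 2.0, size=30)
group_c = rng.normal(13.0, 2.0, size=30)

# Tests H0: all group means are equal, against "at least one differs".
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a small p-value says at least one pair of means differs
```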
What is the Sum of Squares in statistics?
The sum of squares gives us a way to quantify variability in a data set by focusing on the difference between each data point and the mean of all data points in that data set.
A higher sum of squares indicates higher variability while a lower result indicates low variability from the mean.
To calculate the sum of squares, subtract the mean from the data points, square the differences, and add them together.
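The same calculation on a small made-up data set (a sketch assuming NumPy):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean, square the differences, add them up.
sum_of_squares = ((data - data.mean()) ** 2).sum()
print(sum_of_squares)   # 32.0
```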
What is Degrees of Freedom in statistics?
- Number of pieces of information that can freely vary
- Without violating restrictions
- Independent pieces of information available to estimate other pieces of information (variable features available)
What is the Chi-squared test?
A hypothesis testing method involving checking if observed frequencies in one or more categories match expected frequencies.
If you have a single categorical variable, you use a Chi-square goodness of fit test.
If you have two categorical variables, you use a Chi-square test of independence. There are other Chi-square tests, but these two are the most common.
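Minimal sketches of both tests (assuming SciPy is available; the counts are made up):

```python
import numpy as np
from scipy import stats

# Goodness of fit: do observed counts in one categorical variable match the
# expected counts (here, a fair six-sided die)?
observed = np.array([18, 22, 16, 25, 20, 19])
chi2, p = stats.chisquare(observed, f_exp=np.full(6, observed.sum() / 6))
print(chi2, p)

# Test of independence: are two categorical variables related?
table = np.array([[30, 10],
                  [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)
```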
What are Hidden Markov Models?
Statistical models used to describe and analyze sequential data, particularly data that exhibits temporal dependence.
Particularly useful for modeling sequential data where the underlying states are not directly observable but have an impact on the observed data.
HMM assumes that the system transitions from one state to another according to a probabilistic process. State transitions are governed by transition probabilities, which determine the likelihood of moving from one state to another.
The underlying assumption of an HMM is the Markov property, which states that the probability of being in a particular state depends only on the immediately preceding state. In other words, the current state is assumed to be conditionally independent of all previous states given the most recent state.
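A minimal sketch of the generative process behind an HMM (assuming NumPy; the transition matrix and emission means are invented): hidden states evolve according to the Markov property, and only noisy emissions are observed. Recovering the hidden states from the observations (e.g. with the Viterbi algorithm) is what HMM libraries then do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transition probabilities: row i gives P(next state | current state i).
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
emission_means = np.array([0.0, 5.0])  # each hidden state emits around a different mean

state = 0
states, observations = [], []
for _ in range(100):
    states.append(state)
    # We never see the state itself, only a noisy emission from it.
    observations.append(rng.normal(emission_means[state], 1.0))
    # Markov property: the next state depends only on the current state.
    state = rng.choice(2, p=transition[state])

print(states[:10])
print(np.round(observations[:10], 1))
```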
What is a Monte Carlo Simulation?
A computational technique used to model and analyze systems or processes that involve uncertainty.
Relies on generating random samples or scenarios to estimate the behavior or outcomes of complex systems.
- Assign random values, or values based on probability distributions to uncertain parts of the model
- Run model on all combinations of values to show the range of potential outcomes (these are plotted on a probability distribution curve)
- The frequencies of the different outcomes often form an approximately normal distribution:
- the mean is the most likely outcome, with an equal chance of falling on either side
- there is a 68% chance the true outcome falls within 1 standard deviation of the mean
- and a 95% chance it falls within 2 standard deviations of the mean
Limitations:
- Results are only as good as the input data and assumed distributions (random values, features, etc.)
- Computationally expensive
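A minimal sketch of the procedure described above (assuming NumPy; the cost figures and distributions are invented for illustration): uncertain inputs are drawn from probability distributions, the model is run for every draw, and the spread of outcomes is summarized.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 100_000

# Uncertain inputs, each described by a probability distribution (made-up figures):
labour    = rng.normal(50_000, 5_000, n_runs)              # roughly 50k +/- 5k
materials = rng.triangular(20_000, 25_000, 40_000, n_runs)  # min, most likely, max
overrun   = rng.exponential(3_000, n_runs)                  # occasional large overruns

total_cost = labour + materials + overrun   # run the model for every scenario

print(total_cost.mean())                        # centre of the simulated outcomes
print(np.percentile(total_cost, [2.5, 97.5]))   # range covering ~95% of outcomes
```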
What is Homoscedasticity?
Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.
What are Eigenvectors and Eigenvalues?
In linear algebra, an eigenvector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it.
The corresponding eigenvalue, often denoted by lambda, is the factor by which the eigenvector is scaled.
e.g. when you stretch or shear an image, the eigenvectors are the axes along which all other points in the image slide, while not changing direction themselves (they are only rescaled).
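A quick numerical check (assuming NumPy): applying the matrix to one of its eigenvectors only rescales it by the corresponding eigenvalue.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of `vectors` are the eigenvectors; `values` are the matching eigenvalues.
values, vectors = np.linalg.eig(A)

v = vectors[:, 0]
# Applying A only rescales an eigenvector: A @ v equals lambda * v.
print(A @ v)
print(values[0] * v)
```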