Statistics Flashcards

1
Q

Random variable

A

A measurable phenomenon that can take more than one possible value.
Discrete variable - countably many possible values (e.g. counts)
Continuous variable - infinitely many possible values within an interval (e.g. measurements)

2
Q

Probability distribution

A

The probability distribution depicts the relative probability that a random variable will have a particular value (in the case of discrete variables) or a value within a certain interval (continuous variables) on a particular measurement occasion.

3
Q

Expected value

A

Each random variable has an expected value (also called its expectation): the probability-weighted average of its possible values, i.e. the long-run mean over many repeated observations. It is not necessarily the single most likely value; that is the mode.

4
Q

Central tendency and dispersion

A

Random variables are characterized by the shape of their distribution. Important aspects are the central tendency and dispersion, which are described by the parameters:
central tendency: mean, median, mode
dispersion: variance, standard deviation

5
Q

mean, median, mode

A

Mean: The average of a set of numbers, calculated by adding all values and dividing by the total count.

Median: The middle value in a sorted list of numbers, or the average of the two middle values if there’s an even count.

Mode: The most frequently occurring value in a set of numbers.
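
A quick sketch of the three measures in Python (invented numbers; the statistics module is in the standard library):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]      # hypothetical sample

print(statistics.mean(data))    # (2+3+3+5+7+10) / 6 = 5.0
print(statistics.median(data))  # even count: average of 3 and 5 = 4.0
print(statistics.mode(data))    # most frequent value: 3
```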

6
Q

Skewness

A

A distribution that is not symmetrical is said to be skewed. Generally this means that the mode differs from the mean, with the mean pulled toward the longer tail.

7
Q

Kurtosis

A

Kurtosis describes how peaked or flat the distribution is around the mode, i.e. how its peak and tails compare with those of a normal distribution.

8
Q

Unimodal / bimodal

A

A unimodal distribution has a single mode; a distribution may also have more than one mode, e.g. a bimodal distribution with two.

9
Q

Population

A

The population of a random variable = all possible unique observations of the variable.

10
Q

Sample

A

A set of one or more observations drawn from the population.

11
Q

Estimator

A

A statistic calculated from a sample in order to estimate a parameter of the population distribution. For example, the sample standard deviation is an estimator of the dispersion of the population distribution, and the sample mean is an estimator of the expected value of the population.

12
Q

mean square, variance, standard deviation

A

Mean square - the sum of squared deviations from the mean divided by the degrees of freedom; an estimator of the population variance, needed to compute the variance and the standard deviation.

Variance - the variance (s^2) measures how spread out the numbers in a data set are. It is calculated from the squared differences from the mean (for a sample, dividing their sum by n - 1). A higher variance indicates that the data points are more spread out from the mean; a lower variance indicates they are closer to the mean.

Standard deviation - the square root of the variance; roughly, the typical distance between the observations and the mean.
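
A minimal sketch of these estimators in Python (invented data; dividing the sum of squares by n - 1, the degrees of freedom, gives the mean square, i.e. the sample variance):

```python
data = [4.0, 7.0, 6.0, 3.0, 5.0]  # hypothetical observations
n = len(data)
mean = sum(data) / n              # sample mean

# sum of squared deviations from the mean
ss = sum((x - mean) ** 2 for x in data)

variance = ss / (n - 1)           # mean square: SS / degrees of freedom
std_dev = variance ** 0.5         # standard deviation = sqrt(variance)
print(mean, variance, std_dev)    # 5.0 2.5 1.581...
```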

13
Q

Notations

A

X̄ = sample mean
s^2 = sample variance
s = sample standard deviation
μ = expected value / population mean
σ = population standard deviation
σ^2 = population variance

14
Q

Normal distribution

A

Bell-shaped curve; always unimodal, symmetrical, and:
- ca 68% of the probability lies within one standard deviation above or below μ
- ca 95% of the probability lies within two standard deviations above or below μ
- ca 99.7% (almost all) of the probability lies within three standard deviations above or below μ

15
Q

Standard normal distribution

A

A normally distributed variable can be standardized by subtracting the mean, then dividing by the standard deviation:
z = (x - μ) / σ
The resulting z value is what the standard normal (z) table is for.
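
For example, with assumed values μ = 100 and σ = 15:

```python
mu, sigma = 100.0, 15.0  # assumed population mean and standard deviation
x = 130.0                # one observation

z = (x - mu) / sigma     # how many standard deviations above the mean
print(z)                 # 2.0 -> look up in the standard normal table
```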

16
Q

Standard error

A

The standard error is a measure of how much the sample mean of a data set is expected to vary from the true population mean. It’s calculated as the standard deviation of the data set divided by the square root of the sample size. A smaller standard error indicates a more accurate estimate of the population mean.
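
A small sketch of the calculation (invented sample):

```python
import statistics

data = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]  # hypothetical sample
s = statistics.stdev(data)                   # sample standard deviation
se = s / len(data) ** 0.5                    # standard error = s / sqrt(n)
print(se)
```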

17
Q

Confidence interval

A

A confidence interval is a range of values, derived from the sample data, that is likely to contain the value of an unknown population parameter. It gives an estimated range of values which is likely to include the parameter, based on the data in the sample and the chosen confidence level (like 95%). The wider the interval, the more uncertain the estimate.
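
A sketch of a 95% confidence interval for a mean, using the t distribution via SciPy (invented data; the scipy dependency is an assumption, not something this deck prescribes):

```python
import statistics
from scipy import stats

data = [5.2, 4.8, 5.5, 5.0, 4.9, 5.3, 5.1]  # hypothetical sample
n = len(data)
mean = statistics.mean(data)
se = statistics.stdev(data) / n ** 0.5      # standard error of the mean

# critical t value for 95% confidence with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

print(mean - t_crit * se, mean + t_crit * se)  # lower, upper bound
```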

18
Q

Infer

A

In statistics, we are interested in using samples to infer information (parameters like the mean) about the population. That is why a sample is drawn. For example, we would like to know how close the mean of our sample is to the true population mean. To what extent is it representative? If we can quantify how confident we are of this, then we can say something about the population.

19
Q

Inference

A

To derive a conclusion from facts, premises, or theory. Here, the theory is based on our knowledge of what the sampling distribution of the means looks like. We use the standard error, together with our knowledge that the population mean should equal the mean of the sampling distribution of the sample means.

20
Q

t distribution

A

The t-distribution is a type of probability distribution that is symmetric and bell-shaped, like the normal distribution, but with heavier tails. It’s used in statistics, especially in situations where the sample size is small and the population standard deviation is unknown. As the sample size increases, the t-distribution approaches the normal distribution. It’s commonly used in hypothesis testing and constructing confidence intervals.

21
Q

statistical test

A

A statistical test is a method used in statistics to make decisions or inferences about a population based on sample data. It evaluates a hypothesis, such as comparing means or proportions, by determining the likelihood that the observed data occurred by chance. Common examples include t-tests, chi-square tests, and ANOVA. The outcome of a statistical test is usually a p-value, which helps determine whether the results are statistically significant.

22
Q

Inference

A

Inference in statistics is the process of drawing conclusions about a population’s characteristics based on a sample of data from that population. It involves using probability theory to estimate population parameters, test hypotheses, and make predictions. There are two main types of statistical inference:
- Estimation, where you estimate population parameters (like mean or proportion) using sample data.
- Hypothesis testing, where you test assumptions about a population based on sample data.

23
Q

null hypothesis H0

A

The null hypothesis (H0) in statistics is a default assumption that there is no effect or no difference. It is tested to see whether it can be rejected, which would suggest that a significant effect or difference exists.

24
Q

Simultaneous sample

A

A simultaneous sample means gathering data from a group of people all at the same time, giving a snapshot of their collective traits or behaviors in that moment.

25
Q

Alternative hypothesis HA

A

The alternative hypothesis (HA or H1) in statistics is a statement that suggests a new effect, difference, or relationship exists in the data, contrary to the null hypothesis (H0). It’s what you aim to support through your data analysis.

26
Q

significance level α

A

The significance level (α) in statistics is a threshold used to determine the statistical significance of a result. It’s the probability of rejecting the null hypothesis when it is actually true, often set at 0.05 (or 5%). A result is considered statistically significant if the p-value is less than α.

27
Q

Causal relationship

A

A causal relationship refers to a cause-and-effect connection between two variables, where one variable (the cause) brings about changes in another variable (the effect).

28
Q

Correlation

A

Correlation refers to a statistical measure that shows the relationship or association between two variables. It indicates how changes in one variable relate to changes in another variable, without implying causation.

29
Q

Covariance

A

Covariance measures how much two random variables vary together. It indicates the degree to which changes in one variable are associated with changes in another variable.

30
Q

Pearson correlation coefficient

A

The Pearson correlation coefficient measures how strongly and in what direction two variables are related on a scale from -1 to 1.
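
A sketch with scipy.stats.pearsonr (invented, roughly linear data):

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]  # increases roughly linearly with x

r, p = stats.pearsonr(x, y)
print(r)   # close to +1: strong positive linear association
print(p)   # small p: correlation unlikely to arise by chance alone
```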

31
Q

one-sided, two-sided comparison

A

In t-tests, comparisons can be either one-sided or two-sided:
One-Sided Test: This test looks for a difference in a specific direction. For example, you might test if one mean is greater than another, not just different. It’s used when the research hypothesis predicts a specific direction of effect.
Two-Sided Test: This test checks for any difference, regardless of direction. It’s used when you want to determine if two means are different, but you don’t have a specific direction in mind (either greater or lesser).

32
Q

test statistic

A

A test statistic is a calculated value used in statistical hypothesis testing to determine whether to reject the null hypothesis. It’s derived from sample data and is used to measure the degree of agreement between the sample data and the null hypothesis. The type of test statistic depends on the test being performed (like a t-statistic for t-tests or a z-statistic for z-tests) and is compared against a critical value from a statistical distribution to determine significance.
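
For instance, the one-sample t statistic, t = (x̄ - μ0) / (s / √n), computed by hand on invented data:

```python
import statistics

data = [5.4, 5.1, 5.8, 5.5, 5.3, 5.6]  # hypothetical sample
mu0 = 5.0                              # population mean under H0

n = len(data)
xbar = statistics.mean(data)
se = statistics.stdev(data) / n ** 0.5

t = (xbar - mu0) / se  # compare to the critical t with n - 1 df
print(t)
```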

33
Q

degrees of freedom

A

Degrees of freedom in statistics are the number of independent values in a calculation. They influence the shape of various statistical distributions and are important for the accuracy of tests like t-tests. It’s often calculated as the sample size minus the number of parameters estimated.

34
Q

critical value

A

A critical value in statistics is a point on the test distribution that is compared to the test statistic to decide whether to reject the null hypothesis. It’s determined by the chosen significance level (like 0.05) and the test’s distribution (like t-distribution for t-tests). If the test statistic is more extreme than the critical value, the null hypothesis is rejected. You find this in the tables.

35
Q

Type I Error

A

Occurs when the null hypothesis is true, but is incorrectly rejected. It’s also known as a “false positive.” The probability of making a Type I error is equal to the significance level (α), often set at 0.05.

36
Q

Type II Error

A

Occurs when the null hypothesis is false, but is incorrectly retained (not rejected). It’s also known as a “false negative.” The probability of making a Type II error is denoted by β, and 1−β is the power of the test.

37
Q

Confidence interval of the mean

A

A confidence interval of the mean is a range of values, calculated from the sample data, that is likely to contain the population mean. It’s based on the sample mean, the standard error of the mean, and the desired confidence level (like 95%). For a given sample, the higher the confidence level, the wider the interval.

38
Q

Paired observations

A

Paired observations are sets of two related measurements, often from the same subjects at different times or under different conditions. They’re used in statistics to compare changes or effects more accurately within paired groups.

39
Q

paired samples t test

A

A paired samples t-test is a statistical method used to compare the means of two related groups. It’s typically used when the same subjects are measured under two different conditions (like before and after a treatment), or when pairs of subjects are matched in terms of key variables. The test checks whether the average difference between pairs is significantly different from zero.
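
A sketch using scipy.stats.ttest_rel on invented before/after measurements:

```python
from scipy import stats

before = [80, 75, 90, 85, 70, 88]  # hypothetical pre-treatment scores
after  = [78, 71, 85, 80, 69, 84]  # same subjects after treatment

t, p = stats.ttest_rel(before, after)  # tests: mean pair difference == 0?
print(t, p)
```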

40
Q

independent samples

A

Independent samples in statistics refer to two or more groups of data where the subjects in each group are not related or paired with each other. Each subject in one group is independent of the subjects in the other group(s). This concept is crucial in choosing the right statistical test, like an independent sample t-test, which is used to compare the means of two independent groups.

41
Q

two-sample t test

A

The two-sample t-test, also known as the independent samples t-test, is a statistical method used to determine if there is a significant difference between the means of two independent groups. This test is appropriate when the data are normally distributed, the variances of the two groups are equal, and the samples are collected independently from each other. It’s commonly used in experiments where different subjects are assigned to different treatments or conditions.
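
A sketch with scipy.stats.ttest_ind (invented groups; equal_var=True matches the equal-variance assumption mentioned above):

```python
from scipy import stats

group_a = [23, 25, 28, 22, 26, 27]  # hypothetical measurements
group_b = [30, 29, 31, 28, 32, 30]

t, p = stats.ttest_ind(group_a, group_b, equal_var=True)
print(t, p)  # small p suggests the group means differ
```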

42
Q

F distribution

A

The F distribution is a probability distribution that arises frequently in statistics, particularly in the context of variance analysis and hypothesis testing. It’s used primarily in ANOVA (Analysis of Variance) and in tests that compare variances. The F distribution is asymmetric and skewed to the right, with its shape depending on two sets of degrees of freedom: one for the numerator and one for the denominator. These degrees of freedom correspond to the variances being compared. The F-test, based on this distribution, is used to test if two population variances are equal.

43
Q

Between samples variance

A

The “between samples” variance, often encountered in the context of ANOVA (Analysis of Variance), refers to the variability among the means of different groups or samples. It measures how much the group means differ from the overall mean. This variance is crucial for determining whether the differences between group means are significant, suggesting that the groups are not all the same. In ANOVA, this is contrasted with “within samples” variance, which measures the variability within each group.

44
Q

‘Within’ samples variance

A

“Within samples” variance measures the variability of data points within each group or sample, showing how spread out the data are around their own group mean. It’s used in statistics to evaluate the consistency within each group.

45
Q

F-test or ANOVA

A

The F-test in ANOVA (Analysis of Variance) is a statistical method used to determine if there are significant differences between the means of three or more groups. It compares the variance between the groups (between-group variability) to the variance within the groups (within-group variability). If the between-group variability is significantly larger than the within-group variability, the F-test suggests that at least one group mean is different from the others. ANOVA is commonly used in experiments with multiple groups or treatments to test for overall differences.
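
A sketch of a one-way ANOVA with scipy.stats.f_oneway (invented groups):

```python
from scipy import stats

g1 = [4.1, 4.5, 4.3, 4.2]
g2 = [5.0, 5.2, 4.9, 5.1]
g3 = [4.4, 4.6, 4.5, 4.3]

f, p = stats.f_oneway(g1, g2, g3)  # between- vs within-group variance
print(f, p)  # small p: at least one group mean differs
```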

46
Q

Sampling artifact

A

A sampling artifact refers to an error or bias that occurs in research due to the way a sample is selected or collected, leading to inaccurate or misleading conclusions about the larger population.

47
Q

t test for the significance of a correlation

A

A t-test for the significance of a correlation is used to determine if the observed correlation coefficient between two variables is significantly different from zero in a sample, suggesting a meaningful relationship in the population.
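
The usual statistic is t = r·√(n - 2) / √(1 - r²), with n - 2 degrees of freedom; a hand-rolled sketch with assumed values:

```python
import math

r = 0.62  # assumed sample correlation
n = 30    # assumed sample size

t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(t)  # compare to the critical t value with n - 2 df
```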

48
Q

Regression analysis

A

Regression analysis is a statistical tool that examines and models the relationship between variables, helping predict one variable based on others.

49
Q

Dependent variable

A

The dependent variable is the outcome or response in a study that is being measured or predicted based on changes in other variables.

50
Q

Independent variable

A

The independent variable is the factor or input in a study that is manipulated or changed by the researcher to observe its effect on the dependent variable.

51
Q

Linear model

A

A linear model is a statistical representation that assumes a linear relationship between the independent variable(s) and the dependent variable, often depicted as a straight line on a graph.

52
Q

Intercept and slope parameter

A
  • The intercept is the point where the regression line crosses the y-axis (the predicted value when the independent variable is zero).
  • The slope represents the rate of change of the dependent variable per unit change in the independent variable.

53
Q

Least squares method

A

The least squares method finds the line that minimizes the sum of the squared vertical distances (residuals) between the observed data points and the line.
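
A sketch using numpy.polyfit, which returns the least-squares slope and intercept (invented points; numpy is an assumption):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8])

slope, intercept = np.polyfit(x, y, deg=1)  # straight-line fit
print(slope, intercept)
```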

54
Q

Residuals

A

Residuals are the differences between observed data points and the predicted values from a regression model. They represent the unexplained variability or errors in the model’s predictions.

55
Q

Coefficient of determination R^2

A

The coefficient of determination, often denoted R^2, is a statistical measure that represents the proportion of variation in the dependent variable explained by the independent variables in a regression model.
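
One way to read R^2 is as 1 minus the ratio of residual to total variation; a sketch continuing the least-squares example above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept         # model predictions

ss_res = np.sum((y - y_hat) ** 2)     # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation

print(1 - ss_res / ss_tot)            # proportion of variation explained
```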

56
Q

Explained variation

A

Explained variation refers to the portion of variability in a dependent variable that is accounted for or explained by the independent variables in a statistical model, typically represented by the coefficient of determination R^2.

57
Q

Correlation versus regression

A

Correlation measures the relationship between variables, while regression models this relationship and predicts one variable based on another.

58
Q

Non-linear relationship

A

A non-linear relationship refers to a connection between variables where the change in one variable doesn’t correspond uniformly with the change in the other variable; it doesn’t follow a straight line on a graph.

59
Q

Transformation

A

Transformation in statistics refers to altering data using mathematical operations (like logarithms, square roots, etc.) to make it more suitable for analysis or to meet the assumptions of statistical tests.

60
Q

Polynomial regression

A

Polynomial regression fits a curved line to data instead of a straight line, allowing for more complex relationships between variables.

61
Q

Multiple regression

A

Multiple regression is a statistical technique that examines the relationship between a dependent variable and two or more independent variables, enabling the analysis of how multiple factors simultaneously influence the outcome variable.

62
Q

Autocorrelation (temporal and spatial)

A

Autocorrelation can violate the independence assumption in statistical models, as it implies that current values are influenced by previous ones in time series (temporal autocorrelation) or that values in one location depend on values in nearby locations in spatial data (spatial autocorrelation). This can lead to inaccurate results in analyses that assume each data point is independent.

63
Q

normality test

A

A normality test checks if a dataset follows a normal distribution. Common tests include the Shapiro-Wilk and Kolmogorov-Smirnov tests. It’s important for methods that assume data are normally distributed.
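
A sketch with the Shapiro-Wilk test from SciPy (simulated data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=50)  # simulated normal sample

w, p = stats.shapiro(data)
print(w, p)  # large p: no evidence against normality
```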

64
Q

normal probability plot

A

A normal probability plot is a graphical tool used to assess if a dataset follows a normal distribution. It plots the quantiles of the data against the quantiles of a normal distribution. If the data are normally distributed, the points will roughly form a straight line. Deviations from the line indicate departures from normality.
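
scipy.stats.probplot computes the points for such a plot; a sketch (matplotlib assumed for display):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(size=100)  # simulated normal sample

stats.probplot(data, dist="norm", plot=plt)  # quantile-quantile plot
plt.show()  # points fall near a straight line if the data are normal
```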

65
Q

lognormal distribution

A

A lognormal distribution is one in which the variable’s logarithm is normally distributed, producing a right-skewed distribution. It’s often used for positively skewed data like income or certain biological measures.

66
Q

data transformation

A

Changing or altering raw data to better suit analysis or modeling needs. Log transformation involves taking the logarithm of data points, often used to handle exponential growth or stabilize variance, while square transformation entails squaring each data point, emphasizing differences or capturing quadratic relationships in the data.
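
A minimal sketch of the two transformations named above (invented skewed data; numpy assumed):

```python
import numpy as np

skewed = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 64.0])  # invented data

log_t = np.log(skewed)  # log transform: compresses large values
sq_t = skewed ** 2      # square transform: emphasizes large values
print(log_t, sq_t)
```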

67
Q

Multiple comparisons

A

Risk of false positives when conducting many tests. Use corrections like Bonferroni, Tukey’s HSD, FDR, Sidak, or Dunn’s Test to control errors.

68
Q

Bonferroni correction

A

Adjusts significance level for multiple comparisons to control Type I errors.
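
A minimal sketch: divide α by the number of comparisons m:

```python
alpha = 0.05
m = 10                       # number of tests performed

alpha_corrected = alpha / m  # Bonferroni-adjusted threshold
print(alpha_corrected)       # 0.005: each test must beat this level
```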

69
Q

assumptions in regression

A
  1. Linearity: The relationship between the independent variables and the dependent variable is assumed to be linear.
  2. Independence: The residuals (the differences between observed and predicted values) should be independent of each other.
  3. Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variables, meaning that the spread of the residuals should be roughly consistent.
  4. Normality: The residuals should be approximately normally distributed. This assumption is often relaxed for larger sample sizes due to the Central Limit Theorem.
70
Q

Outliers

A

Outliers are data points that significantly deviate from the overall pattern or distribution of the data.

71
Q

Homoscedasticity

A

Homoscedasticity, often called homogeneity of variance, is an assumption in regression analysis where the variance of the residuals (the differences between observed and predicted values) is constant across all levels of the independent variables. In simpler terms, it means that the spread of the residuals should be roughly consistent throughout the data.

72
Q

Non-parametric test

A

Non-parametric tests analyze data without assuming a specific distribution, unlike parametric tests that require specific assumptions about the data’s distribution.

73
Q

Ranked data

A

Ranked data refers to a data arrangement where values are replaced by their rank order from lowest to highest, allowing comparison and analysis without relying on specific numerical values.

74
Q

Mann-Whitney U

A

The Mann-Whitney U test is a non-parametric test to compare two independent groups and determine if there’s a significant difference between them, suitable for non-normally distributed or ordinal data.
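
A sketch with scipy.stats.mannwhitneyu (invented scores):

```python
from scipy import stats

group_a = [3, 5, 7, 2, 8, 6]      # hypothetical scores
group_b = [10, 9, 12, 11, 8, 13]

u, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u, p)
```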

75
Q

Kruskal-Wallis test

A

The Kruskal-Wallis test checks whether three or more independent groups have different medians, without relying on specific assumptions about the data’s distribution.
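
A sketch with scipy.stats.kruskal (invented groups):

```python
from scipy import stats

g1 = [2.9, 3.0, 2.5, 2.6, 3.2]
g2 = [3.8, 2.7, 4.0, 2.4]
g3 = [2.8, 3.4, 3.7, 2.2, 2.0]

h, p = stats.kruskal(g1, g2, g3)
print(h, p)  # small p: at least one group differs
```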