Statistics Flashcards
Random variable
A measurable quantity that can take more than one possible value.
Discrete variable - takes a countable number of possible values
Continuous variable - takes any value within an interval (uncountably many possible values)
Probability distribution
The probability distribution gives the probability that a random variable takes a particular value (in the case of discrete variables) or a value within a certain interval (continuous variables) on a given measurement occasion.
Expected value
Each random variable has an expected value (or expectation): the probability-weighted average of its possible values, i.e. the long-run mean over many observations. It is not necessarily the value the variable is most likely to take on (that is the mode).
Central tendency and dispersion
Random variables are characterized by the shape of their distribution. Important aspects are the central tendency and dispersion which are described by the parameters:
central tendency: mean, median, mode
dispersion: variance, standard deviation
mean, median, mode
Mean: The average of a set of numbers, calculated by adding all values and dividing by the total count.
Median: The middle value in a sorted list of numbers, or the average of the two middle values if there’s an even count.
Mode: The most frequently occurring value in a set of numbers.
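A minimal illustration using Python's built-in statistics module (the data are made up):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical sample

print(statistics.mean(data))    # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5.0
print(statistics.median(data))  # even count: average of 3 and 5 = 4.0
print(statistics.mode(data))    # 3, the most frequent value
```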
Skewness
A distribution that is not symmetrical is said to be skewed: one tail is longer than the other, so the mean, median, and mode generally differ.
Kurtosis
Kurtosis describes how peaked or flat the distribution is relative to a normal distribution, i.e. how heavy its tails are.
Unimodal / bimodal
A distribution may have more than one mode, e.g. a bimodal distribution.
Population
The population of a random variable = all possible unique observations of the variable.
Sample
A set of one or more observations drawn from the population.
Estimator
An estimator is a statistic computed from a sample and used to estimate a population parameter. For example, the sample standard deviation is an estimator of the dispersion of the population distribution, and the sample mean is an estimator of the expected value (population mean).
mean square, variance, standard deviation
Mean square - the sum of squared deviations from the mean divided by the degrees of freedom (n − 1); an estimator of the population variance, used to compute the variance and the standard deviation.
Variance - variance (s²) measures how spread out the numbers in a data set are. It is calculated from the squared differences from the mean (for a sample, their sum divided by n − 1). A higher variance indicates that the data points are more spread out from the mean; a lower variance indicates they are closer to the mean.
Standard deviation - the square root of the variance; roughly, the typical deviation of the observations from the mean.
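A sketch of the sample formulas on made-up data, checked against Python's statistics module; note the n − 1 (degrees of freedom) denominator:

```python
import statistics

data = [4, 8, 6, 5, 3]  # hypothetical sample

n = len(data)
mean = sum(data) / n
# Sample variance: sum of squared deviations from the mean, divided by n - 1
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
sd = variance ** 0.5  # standard deviation = square root of the variance

print(variance, statistics.variance(data))  # both print 3.7
print(sd, statistics.stdev(data))
```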
Notations
x̄ = sample mean
s² = sample variance
s = sample standard deviation
μ (mu) = expected value / population mean
σ (sigma) = population standard deviation
σ² = population variance
Normal distribution
Bell-shaped curves; always unimodal, symmetrical, and:
- ca 68% of the probability lies within one standard deviation above or below μ
- ca 95% of the probability lies within two standard deviations above or below μ
- almost all of the probability (ca 99.7%) lies within three standard deviations above or below μ
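These percentages can be checked numerically; a minimal sketch using scipy (assumed installed) and the standard normal distribution:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
# for a standard normal distribution (mu = 0, sigma = 1)
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {p:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```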
Standard normal distribution
A normally distributed variable can be standardized by subtracting the mean, then dividing by the standard deviation:
z = (x − μ) / σ
Probabilities for the standardized variable can then be looked up in a standard normal (z) table.
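A small worked example of the standardization (all numbers are hypothetical):

```python
# Standardizing one observation: z = (x - mu) / sigma
mu, sigma = 100, 15  # hypothetical population mean and standard deviation
x = 130              # hypothetical observation
z = (x - mu) / sigma
print(z)  # 2.0 -> the observation lies two standard deviations above the mean
```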
Standard error
The standard error is a measure of how much the sample mean of a data set is expected to vary from the true population mean. It’s calculated as the standard deviation of the data set divided by the square root of the sample size. A smaller standard error indicates a more accurate estimate of the population mean.
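A minimal sketch of the calculation, using Python's built-in statistics module on made-up data:

```python
import statistics

sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]  # hypothetical measurements

# Standard error of the mean = sample standard deviation / sqrt(n)
se = statistics.stdev(sample) / len(sample) ** 0.5
print(se)
```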
Confidence interval
A confidence interval is a range of values, derived from the sample data, that is likely to contain the value of an unknown population parameter. It gives an estimated range of values which is likely to include the parameter, based on the data in the sample and the chosen confidence level (like 95%). The wider the interval, the more uncertain the estimate.
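As a sketch, a 95% confidence interval for the mean can be built from the standard error and a critical value from the t distribution (defined in a later card); scipy is assumed installed and the data are made up:

```python
from scipy import stats

sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]  # hypothetical measurements

n = len(sample)
mean = sum(sample) / n
se = stats.sem(sample)                 # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical t value
print(mean - t_crit * se, mean + t_crit * se)  # lower and upper bounds
```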
Infer
In statistics, we are interested in using samples to infer information (parameters like the mean) about the population. That is why a sample is drawn. For example, we would like to know how close the mean of our sample is to the true population mean. To what extent is it representative? If we can quantify how confident we are of this, then we can say something about the population.
Inference
To derive a conclusion from facts, premises, or theory. Here, the theory is our knowledge of what the sampling distribution of the means looks like: we use the standard error, together with the fact that the mean of the sampling distribution of sample means equals the population mean.
t distribution
The t-distribution is a type of probability distribution that is symmetric and bell-shaped, like the normal distribution, but with heavier tails. It’s used in statistics, especially in situations where the sample size is small and the population standard deviation is unknown. As the sample size increases, the t-distribution approaches the normal distribution. It’s commonly used in hypothesis testing and constructing confidence intervals.
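The heavier tails show up in the critical values; a quick comparison, assuming scipy is installed:

```python
from scipy import stats

# Two-sided 95% critical values: the t distribution needs larger cutoffs
# at small sample sizes and approaches the normal value (~1.96) as df grows
for df in (3, 10, 30, 1000):
    print(df, round(stats.t.ppf(0.975, df), 3))
print("normal:", round(stats.norm.ppf(0.975), 3))
```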
statistical test
A statistical test is a method used in statistics to make decisions or inferences about a population based on sample data. It evaluates a hypothesis, such as comparing means or proportions, by determining the likelihood that the observed data occurred by chance. Common examples include t-tests, chi-square tests, and ANOVA. The outcome of a statistical test is usually a p-value, which helps determine whether the results are statistically significant.
Statistical inference
Inference in statistics is the process of drawing conclusions about a population’s characteristics based on a sample of data from that population. It involves using probability theory to estimate population parameters, test hypotheses, and make predictions. There are two main types of statistical inference:
- Estimation, where you estimate population parameters (like mean or proportion) using sample data.
- Hypothesis testing, where you test assumptions about a population based on sample data.
null hypothesis H0
The null hypothesis (H0) in statistics is a default assumption that there is no effect or no difference. It’s tested against to see if it can be rejected, suggesting a significant effect or difference exists.
Simultaneous sample
A simultaneous sample means gathering data from a group of people all at the same time, giving a snapshot of their collective traits or behaviors in that moment.
Alternative hypothesis HA
The alternative hypothesis (HA or H1) in statistics is a statement that suggests a new effect, difference, or relationship exists in the data, contrary to the null hypothesis (H0). It’s what you aim to support through your data analysis.
significance level α
The significance level (α) in statistics is a threshold used to determine the statistical significance of a result. It’s the probability of rejecting the null hypothesis when it is actually true, often set at 0.05 (or 5%). A result is considered statistically significant if the p-value is less than α.
Causal relationship
A causal relationship refers to a cause-and-effect connection between two variables, where one variable (the cause) brings about changes in another variable (the effect).
Correlation
Correlation refers to a statistical measure that shows the relationship or association between two variables. It indicates how changes in one variable relate to changes in another variable, without implying causation.
Covariance
Covariance measures how much two random variables vary together. It indicates the degree to which changes in one variable are associated with changes in another variable.
Pearson correlation coefficient
The Pearson correlation coefficient measures how strongly and in what direction two variables are related on a scale from -1 to 1.
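A sketch computing both the covariance (previous card) and Pearson's r directly from their definitions, on made-up paired data; r is the covariance rescaled by both standard deviations, which confines it to the range [-1, 1]:

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]  # hypothetical paired observations

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Sample covariance: average co-deviation from the means (n - 1 denominator)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5
sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5
r = cov / (sx * sy)  # Pearson correlation coefficient, between -1 and 1
print(cov, r)
```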
one-sided, two-sided comparison
In t-tests, comparisons can be either one-sided or two-sided:
One-Sided Test: This test looks for a difference in a specific direction. For example, you might test if one mean is greater than another, not just different. It’s used when the research hypothesis predicts a specific direction of effect.
Two-Sided Test: This test checks for any difference, regardless of direction. It’s used when you want to determine if two means are different, but you don’t have a specific direction in mind (either greater or lesser).
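A sketch of both variants with scipy's independent-samples t-test (assuming a scipy version recent enough to support the alternative argument; the data are invented):

```python
from scipy import stats

a = [5.1, 4.9, 5.6, 5.3, 5.8, 5.0]  # hypothetical group A
b = [4.8, 4.5, 5.0, 4.7, 4.9, 4.6]  # hypothetical group B

# Two-sided: is there a difference in either direction?
print(stats.ttest_ind(a, b))
# One-sided: is the mean of group A greater than the mean of group B?
print(stats.ttest_ind(a, b, alternative="greater"))
```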
test statistic
A test statistic is a calculated value used in statistical hypothesis testing to determine whether to reject the null hypothesis. It’s derived from sample data and is used to measure the degree of agreement between the sample data and the null hypothesis. The type of test statistic depends on the test being performed (like a t-statistic for t-tests or a z-statistic for z-tests) and is compared against a critical value from a statistical distribution to determine significance.
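For a one-sample t-test, the statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors; a sketch verifying the hand calculation against scipy (assumed installed), on made-up data:

```python
from scipy import stats

sample = [5.1, 4.9, 5.6, 5.3, 5.8, 5.0]  # hypothetical sample
mu0 = 5.0                                # hypothesized population mean (H0)

# t statistic = (sample mean - hypothesized mean) / standard error
t_by_hand = (sum(sample) / len(sample) - mu0) / stats.sem(sample)
t_scipy = stats.ttest_1samp(sample, mu0).statistic
print(t_by_hand, t_scipy)  # the two values should agree
```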