VL 2 and VL 3 Flashcards
What’s important for sampling in experiments in science?
Sampling must be
- random
- independent
- as large as possible (the more the merrier)
Common sampling schemes (see the R sketch after this list):
- simple random sampling (e.g. rolling a die)
- systematic sampling (every 10th person)
- stratified sampling (making subgroups based on categories)
- accidental sampling (close to hand, opportunity sampling, often not representative)
- cluster sampling (divide a city into areas, sample within each area, …)
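A minimal R sketch of the first three schemes; the population vector, the stratum labels, and the sample sizes are invented for illustration:

```r
set.seed(42)                     # reproducibility
population <- 1:1000             # hypothetical population of IDs

# simple random sampling: every unit has the same chance, draws are independent
srs <- sample(population, size = 50)

# systematic sampling: every 10th unit after a random start
start <- sample(1:10, 1)
sys <- population[seq(start, length(population), by = 10)]

# stratified sampling: sample separately within subgroups (strata)
strata <- rep(c("young", "medium", "old"), length.out = 1000)
strat <- unlist(lapply(split(population, strata), sample, size = 20))
```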
Explain the two research methods
1. Correlational research
2. Experimental research
- Correlational research:
Correlational research is a type of scientific study that aims to explore the relationship between two or more variables. It focuses on examining the statistical association between variables without manipulating them directly. Correlational research provides valuable insights into the degree and direction of the relationship between variables but does not establish causation.
- Experimental research:
Experimental research is a scientific method used to study cause-and-effect relationships between variables. It involves manipulating one or more variables under controlled conditions to observe the impact on another variable. By carefully designing and controlling the experiment, researchers can draw conclusions about the causal relationship between the variables being studied.
(sometimes experimental research is not possible for ethical reasons)
What measurement levels / data types exist?
Data types depend on the measurement levels.
- categorical data (quality)
  * nominal:
    - gender: female, male (binary)
    - smoker: yes, no (binary)
    - protein structures: H, E, …
    - nucleotides: A, C, G, T, U
  * ordinal:
    - age: young, medium, old
    - grade: 1, 2, 3, 4, 5
    - lucky, ok, unlucky
- numerical data (quantity)
  * discrete
    - age: 6, 8, 84
    - height: 150, 176 cm
    - helices per 1000 AA
    - cigarettes per day: 0, 20, 30
  * continuous
    - weight: 79.99 kg, 72 kg
    - height: 12.2, 12.5
How do you figure out what data type it is?
(Data type question)
1. Can you calculate a mean?
- yes (it is numeric)
  1.1 Is the mean always a possible value?
  - yes (continuous numeric)
  - no (discrete numeric)
- no (it is categorical)
  1.1 Is there a logical order of the values?
  - yes (ordinal categorical)
  - no (nominal categorical)
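A small R sketch of how these types look in code; the variables and values are made-up examples:

```r
smoker <- factor(c("yes", "no", "no", "yes"))                   # nominal categorical
grade  <- factor(c(1, 3, 2, 5), levels = 1:5, ordered = TRUE)   # ordinal categorical
cigs   <- c(0, 20, 30, 10)                                      # discrete numeric
weight <- c(79.99, 72.0, 81.3, 68.5)                            # continuous numeric

class(smoker)        # "factor"            -> no meaningful mean
class(grade)         # "ordered" "factor"  -> ordered categories
is.numeric(weight)   # TRUE                -> mean(weight) is meaningful
mean(weight)
```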
Which terms are used in statistics to describe the number of variables in an analysis?
Univariate:
Univariate analyses refer to the study of a single variable. Here, statistical measures such as the average, standard deviation, or median are used to gain information about that variable.
age, gender, length
Bivariate:
Bivariate analyses, on the other hand, refer to the study of two variables simultaneously. This involves examining whether and how these two variables are related to each other. Correlation coefficients or scatter plots can be used to analyze this relationship.
weight - height, weight - age, smoker- gender,
aa.freq- aa.weight
(the dash "-" reads as "versus")
Multivariate:
In multivariate analyses, three or more variables are examined simultaneously. Here, complex statistical methods such as regression analyses or factor analyses are used to understand the relationships between the variables and to make predictions.
weight - age | gender
weight - age | smoker*gender
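A hedged R sketch of the three levels of analysis. The data frame d and its columns (weight, age, gender, smoker) are invented, and the interaction model is just one common way to look at a "weight - age | gender" type of question, not necessarily the exact model from the lecture:

```r
set.seed(1)
d <- data.frame(
  weight = rnorm(100, mean = 75, sd = 10),
  age    = sample(18:80, 100, replace = TRUE),
  gender = factor(sample(c("f", "m"), 100, replace = TRUE)),
  smoker = factor(sample(c("yes", "no"), 100, replace = TRUE))
)

summary(d$weight)                    # univariate: one variable
cor(d$weight, d$age)                 # bivariate: weight - age (correlation coefficient)
plot(weight ~ age, data = d)         # bivariate: scatter plot
lm(weight ~ age * gender, data = d)  # multivariate: weight - age | gender
```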
Parameters and statistics are both important concepts in statistics, but they have different meanings and uses, explain them.
Parameters:
Characteristics or measures that describe a population (the population is characterised by parameters). They represent fixed, unknown quantities that define the characteristics of the entire population under study. Parameters are usually denoted by Greek letters (e.g., μ for the population mean, σ for the population standard deviation).
Statistics:
Statistics are calculated from sample data (the sample is characterised by statistics) and are used to estimate population parameters or describe the sample itself. They provide information about the sample and can be used to make inferences about the population. Common examples of statistics include the sample mean, sample standard deviation, sample proportion, or correlation coefficient.
WE USE THE SAMPLE TO ESTIMATE THE PARAMETERS OF THE POPULATION
–> the sample mean m is an unbiased estimator of the parameter µ, the mean of the population!
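A tiny illustration in R. Because the data are simulated, the true µ and σ are known here, so we can see how the statistics m and s estimate them; the numbers are made up:

```r
set.seed(7)
mu    <- 170                              # true population mean µ (known because we simulate)
sigma <- 10                               # true population standard deviation σ
x     <- rnorm(50, mean = mu, sd = sigma) # a sample of n = 50

m <- mean(x)   # statistic: sample mean, estimates the parameter µ
s <- sd(x)     # statistic: sample standard deviation, estimates σ
c(m = m, s = s)
```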
Describe a box plot.
A box plot is a graphical representation of the distribution of a dataset, showing the median, quartiles, and potential outliers.
A box plot consists of a box and two whiskers. The box is drawn from the first quartile (Q1 25%) to the third quartile (Q3 75%) of the data, with a vertical line inside representing the median (Q2). The length of the box represents the interquartile range (IQR), which is the range containing the middle 50% of the data.
The whiskers extend from the box and represent the minimum and maximum values within a specified range, typically calculated using the 1.5*IQR rule. Outliers, i.e. data points that fall outside the whiskers, are drawn as individual dots.
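A minimal R sketch using simulated weights with one obvious outlier; the values are invented:

```r
set.seed(3)
w <- c(rnorm(40, mean = 75, sd = 5), 110)   # 110 will fall outside 1.5*IQR
boxplot(w, main = "Weight", ylab = "kg")    # box, median line, whiskers, outlier dot

# the same quantities by hand
quantile(w, probs = c(0.25, 0.5, 0.75))     # Q1, median, Q3
IQR(w)                                      # interquartile range
```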
What is the z-score? And how do you calculate it?
A z-score is a way to tell how far away a particular data point is from the average (mean) of a group of data, and it’s measured in terms of standard deviations. It helps you understand if a data point is typical or unusual compared to the rest of the data.
A positive z-score means the data point is above average, while a negative z-score means it’s below average. The magnitude of the z-score tells you how many standard deviations the data point lies away from the average.
By using z-scores, you can compare data from different groups or distributions on a common scale and determine if a particular data point is relatively high or low compared to others.
The formula to calculate the z-score is:
z = (x - μ) / σ
z is the z-score
x is the data point
μ is the mean of the distribution
σ is the standard deviation of the distribution
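The same formula in R, once by hand and once with scale() for a whole vector; x, μ, σ and the heights are example values:

```r
x     <- 190               # data point, e.g. a height in cm
mu    <- 170               # mean of the distribution
sigma <- 10                # standard deviation of the distribution
(z <- (x - mu) / sigma)    # 2: the point lies 2 SDs above the mean

# for a whole vector, scale() standardizes using the sample mean and sd
heights <- c(160, 165, 170, 175, 190)
scale(heights)
```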
Why is table() useful?
The table() command is helpful for quickly summarizing and analyzing the frequency distribution of categorical variables in R.
Numerical data can also be transformed into categorical data by using the cut() function in R.
- cut(): splits numeric values into different categories (bins)
- levels(): assigns level/class names to those categories
- categorical/qualitative data are called factors in R
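A short sketch of all three points; the smoker and age values are made up:

```r
smoker <- factor(c("yes", "no", "no", "yes", "no"))
table(smoker)                                     # frequency of each category

age <- c(6, 25, 47, 63, 84)
age.cat <- cut(age, breaks = c(0, 30, 60, 100))   # numeric -> categorical (factor)
levels(age.cat) <- c("young", "medium", "old")    # assign class names
table(age.cat)
class(age.cat)                                    # "factor"
```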
What graphics are good for descriptive statistics?
histogram, barplot, pie, dot chart
(Learn all in R!!!!)
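One call per plot type mentioned above, with small invented example data:

```r
x <- rnorm(100)                                      # numeric data
f <- table(factor(c("A", "C", "G", "T", "A", "G")))  # categorical counts

hist(x)                                      # histogram: distribution of numeric data
barplot(f)                                   # barplot: counts per category
pie(f)                                       # pie chart: proportions per category
dotchart(as.numeric(f), labels = names(f))   # dot chart of the same counts
```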
One sample prop.test?
One-sample proportion test (prop.test):
The one-sample proportion test in R (prop.test) helps us compare the proportion of successes in a group to a specific expected proportion. It’s used when dealing with categorical data.
For example, let’s say you have a sample of 200 individuals and you want to test if the proportion of individuals who prefer coffee is significantly different from a hypothesized proportion of 0.5 (50%). You can use prop.test() to perform this test.
The prop.test() function provides results like the estimated proportion, a test statistic (reported as X-squared, a chi-squared value), a p-value, and a confidence interval. By looking at the p-value, we can determine if there’s a significant difference between the observed proportion and the expected proportion.
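The coffee example from above in R; the observed count of 120 coffee drinkers is invented for illustration:

```r
# 120 of 200 individuals prefer coffee; test against the hypothesized proportion 0.5
prop.test(x = 120, n = 200, p = 0.5)
# output: estimated proportion (0.6), X-squared statistic, p-value,
# and a 95% confidence interval for the true proportion
```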
One sample t.test?
One-sample t-test (t.test):
It helps us find out if the average of a group is significantly different from a specific expected average. We typically use it when working with continuous numerical data.
For example, let’s say you measured the weights of 50 randomly selected apples and want to check if the average weight is significantly different from 150 grams (the expected average). To analyze this, you can use the t.test() function in R.
The t.test() function provides results such as the average weight of the sample, a test statistic called the t-value, a p-value, and a confidence interval. By looking at the p-value, we can determine if there’s a significant difference between the observed average and the expected average.
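The apple example from above in R, with simulated weights standing in for the real measurements:

```r
set.seed(11)
apples <- rnorm(50, mean = 155, sd = 12)   # 50 measured apple weights in grams (made up)

t.test(apples, mu = 150)                   # test against the expected average of 150 g
# output: sample mean, t-value, p-value, and a 95% confidence interval for the true mean
```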
Why use prop.test() and t.test() in inferential statistics?
In summary, the prop.test() compares proportions, while the t.test() compares averages. Both tests help us determine if there’s a significant difference between what we observed and what we expected.
Name 3 different Data Scatter Measures and explain them.
Standard Deviation (SD):
tells you how much the data points deviate from the mean. A higher standard deviation means more variability, while a lower standard deviation means less variability. It helps quantify the dispersion or uncertainty in the data.
- sample standard deviation s
- population standard deviation σ
coefficient of variation (CV):
is a measure of relative variability. It compares the standard deviation of a data set to its mean. A higher CV indicates higher relative variability, while a lower CV indicates lower relative variability. It is useful for comparing variability between data sets with different means or units.
standard error of the mean (SEM):
is a measure of how much the sample mean is likely to vary from the true population mean. It represents the precision of the estimate. A smaller SEM means a more reliable estimate, while a larger SEM means a less precise estimate.
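All three measures computed in R for a small made-up sample x (CV and SEM are derived from sd() as described above):

```r
x <- c(72, 75, 78, 80, 69, 74, 77)

s   <- sd(x)                     # sample standard deviation
cv  <- sd(x) / mean(x)           # coefficient of variation (often reported * 100 as %)
sem <- sd(x) / sqrt(length(x))   # standard error of the mean

c(SD = s, CV = cv, SEM = sem)
```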
What is an Outlier and why can they be important?
Outlier is a data point that significantly differs from the other observations in a dataset. It is an extreme value that can impact statistical analyses and calculations. Identifying and understanding outliers is important for ensuring accurate results and interpretations.
Outliers can be important because they may reveal errors in data, provide insights into rare events, test the robustness of analyses, identify distinct subgroups, and challenge assumptions about data distribution. They offer valuable information that can enhance understanding and improve the accuracy of statistical analyses. (tomato plant example)