VL 2 and VL 3 Flashcards
What’s important for sampling in experiments in science?
Sampling must be
- random
- independent
- as large as possible (the more the merrier)
Common sampling schemes (see the R sketch after this list):
- simple random sampling (e.g. rolling a die)
- systematic sampling (every 10th person)
- stratified sampling (making subgroups based on categories)
- accidental sampling (close to hand, opportunity sampling, often not representative)
- cluster sampling (divide a city into areas, sample within each area, …)
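A minimal R sketch of the first three schemes; the population vector, the stratum labels, and the sample sizes are invented for illustration:

```r
set.seed(42)                     # reproducibility
population <- 1:1000             # hypothetical population of IDs

# simple random sampling: every unit has the same chance, draws are independent
srs <- sample(population, size = 50)

# systematic sampling: every 10th unit after a random start
start <- sample(1:10, 1)
sys <- population[seq(start, length(population), by = 10)]

# stratified sampling: sample separately within subgroups (strata)
strata <- rep(c("young", "medium", "old"), length.out = 1000)
strat <- unlist(lapply(split(population, strata), sample, size = 20))
```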
Explain the two research methods
1. Correlational research
2. Experimental research
- Correlational research:
Correlational research is a type of scientific study that aims to explore the relationship between two or more variables. It focuses on examining the statistical association between variables without manipulating them directly. Correlational research provides valuable insights into the degree and direction of the relationship between variables but does not establish causation.
- Experimental research:
Experimental research is a scientific method used to study cause-and-effect relationships between variables. It involves manipulating one or more variables under controlled conditions to observe the impact on another variable. By carefully designing and controlling the experiment, researchers can draw conclusions about the causal relationship between the variables being studied.
(sometimes experimental research is not possible for ethical reasons)
What measurement levels / data types exist?
Data types depend on the measurement levels.
- categorical data (quality)
  * nominal:
    - gender: female, male (binary)
    - smoker: yes, no (binary)
    - protein structures: H, E, …
    - nucleotides: A, C, G, T, U
  * ordinal:
    - age: young, medium, old
    - grade: 1, 2, 3, 4, 5
    - lucky, ok, unlucky
- numerical data (quantity)
  * discrete
    - age: 6, 8, 84
    - height: 150, 176 cm
    - helices per 1000 AA
    - cigarettes per day: 0, 20, 30
  * continuous
    - weight: 79.99 kg, 72 kg
    - height: 12.2, 12.5
How do you figure out what data type it is?
(Data type question)
1. Can you calculate a mean?
- yes (it is numeric)
  1.1 Is the mean always a possible value?
  - yes (continuous numeric)
  - no (discrete numeric)
- no (it is categorical)
  1.1 Is there a logical order of the values?
  - yes (ordinal categorical)
  - no (nominal categorical)
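A small R sketch of how these types look in code; the variables and values are made-up examples:

```r
smoker <- factor(c("yes", "no", "no", "yes"))                   # nominal categorical
grade  <- factor(c(1, 3, 2, 5), levels = 1:5, ordered = TRUE)   # ordinal categorical
cigs   <- c(0, 20, 30, 10)                                      # discrete numeric
weight <- c(79.99, 72.0, 81.3, 68.5)                            # continuous numeric

class(smoker)        # "factor"            -> no meaningful mean
class(grade)         # "ordered" "factor"  -> ordered categories
is.numeric(weight)   # TRUE                -> mean(weight) is meaningful
mean(weight)
```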
Which terms are used in statistics to describe the number of variables in an analysis?
Univariate:
Univariate analyses refer to the study of a single variable. Here, statistical measures such as the average, standard deviation, or median are used to gain information about that variable.
age, gender, length
Bivariate:
Bivariate analyses, on the other hand, refer to the study of two variables simultaneously. This involves examining whether and how these two variables are related to each other. Correlation coefficients or scatter plots can be used to analyze this relationship.
weight - height, weight - age, smoker- gender,
aa.freq- aa.weight
(the dash "-" reads as "versus")
Multivariate:
In multivariate analyses, three or more variables are examined simultaneously. Here, complex statistical methods such as regression analyses or factor analyses are used to understand the relationships between the variables and to make predictions.
weight - age | gender
weight - age | smoker*gender
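A hedged R sketch of the three levels of analysis. The data frame d and its columns (weight, age, gender, smoker) are invented, and the interaction model is just one common way to look at a "weight - age | gender" type of question, not necessarily the exact model from the lecture:

```r
set.seed(1)
d <- data.frame(
  weight = rnorm(100, mean = 75, sd = 10),
  age    = sample(18:80, 100, replace = TRUE),
  gender = factor(sample(c("f", "m"), 100, replace = TRUE)),
  smoker = factor(sample(c("yes", "no"), 100, replace = TRUE))
)

summary(d$weight)                    # univariate: one variable
cor(d$weight, d$age)                 # bivariate: weight - age (correlation coefficient)
plot(weight ~ age, data = d)         # bivariate: scatter plot
lm(weight ~ age * gender, data = d)  # multivariate: weight - age | gender
```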
Parameters and statistics are both important concepts in statistics, but they have different meanings and uses, explain them.
Parameters:
Characteristics or measures that describe a population (the population is characterised by parameters). They represent fixed, unknown quantities that define the characteristics of the entire population under study. Parameters are usually denoted by Greek letters (e.g., μ for the population mean, σ for the population standard deviation).
Statistics:
Statistics are calculated from sample data (the sample is characterised by statistics) and are used to estimate population parameters or describe the sample itself. They provide information about the sample and can be used to make inferences about the population. Common examples of statistics include the sample mean, sample standard deviation, sample proportion, or correlation coefficient.
WE USE THE SAMPLE TO ESTIMATE THE PARAMETERS OF THE POPULATION
–> the sample mean m is an unbiased estimator of the parameter µ, the mean of the population!
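A tiny illustration in R. Because the data are simulated, the true µ and σ are known here, so we can see how the statistics m and s estimate them; the numbers are made up:

```r
set.seed(7)
mu    <- 170                              # true population mean µ (known because we simulate)
sigma <- 10                               # true population standard deviation σ
x     <- rnorm(50, mean = mu, sd = sigma) # a sample of n = 50

m <- mean(x)   # statistic: sample mean, estimates the parameter µ
s <- sd(x)     # statistic: sample standard deviation, estimates σ
c(m = m, s = s)
```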
Describe a box plot.
A box plot is a graphical representation of the distribution of a dataset, showing the median, quartiles, and potential outliers.
A box plot consists of a box and two whiskers. The box is drawn from the first quartile (Q1 25%) to the third quartile (Q3 75%) of the data, with a vertical line inside representing the median (Q2). The length of the box represents the interquartile range (IQR), which is the range containing the middle 50% of the data.
The whiskers extend from the box and represent the minimum and maximum values within a specified range, typically calculated using the 1.5*IQR rule. Outliers, i.e. data points that fall outside the whiskers, are drawn as individual dots.
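A minimal R sketch using simulated weights with one obvious outlier; the values are invented:

```r
set.seed(3)
w <- c(rnorm(40, mean = 75, sd = 5), 110)   # 110 will fall outside 1.5*IQR
boxplot(w, main = "Weight", ylab = "kg")    # box, median line, whiskers, outlier dot

# the same quantities by hand
quantile(w, probs = c(0.25, 0.5, 0.75))     # Q1, median, Q3
IQR(w)                                      # interquartile range
```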
What is the z-score? And how do you calculate it?
A z-score is a way to tell how far away a particular data point is from the average (mean) of a group of data, and it’s measured in terms of standard deviations. It helps you understand if a data point is typical or unusual compared to the rest of the data.
A positive z-score means the data point is above average, while a negative z-score means it’s below average. The magnitude of the z-score tells you how many standard deviations the data point lies away from the average.
By using z-scores, you can compare data from different groups or distributions on a common scale and determine if a particular data point is relatively high or low compared to others.
The formula to calculate the z-score is:
z = (x - μ) / σ
z is the z-score
x is the data point
μ is the mean of the distribution
σ is the standard deviation of the distribution
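The same formula in R, once by hand and once with scale() for a whole vector; x, μ, σ and the heights are example values:

```r
x     <- 190               # data point, e.g. a height in cm
mu    <- 170               # mean of the distribution
sigma <- 10                # standard deviation of the distribution
(z <- (x - mu) / sigma)    # 2: the point lies 2 SDs above the mean

# for a whole vector, scale() standardizes using the sample mean and sd
heights <- c(160, 165, 170, 175, 190)
scale(heights)
```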
Why is table() useful?
The table() command is helpful for quickly summarizing and analyzing the frequency distribution of categorical variables in R.
Numerical data can also be transformed into categorical data by using the cut() function in R.
- cut(): splits numeric values into different categories (bins)
- levels(): assigns level/class names to those categories
- categorical/qualitative data are called factors in R
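A short sketch of all three points; the smoker and age values are made up:

```r
smoker <- factor(c("yes", "no", "no", "yes", "no"))
table(smoker)                                     # frequency of each category

age <- c(6, 25, 47, 63, 84)
age.cat <- cut(age, breaks = c(0, 30, 60, 100))   # numeric -> categorical (factor)
levels(age.cat) <- c("young", "medium", "old")    # assign class names
table(age.cat)
class(age.cat)                                    # "factor"
```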
What graphics are good for descriptive statistics?
histogram, barplot, pie, dot chart
(Learn all in R!!!!)
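One call per plot type mentioned above, with small invented example data:

```r
x <- rnorm(100)                                      # numeric data
f <- table(factor(c("A", "C", "G", "T", "A", "G")))  # categorical counts

hist(x)                                      # histogram: distribution of numeric data
barplot(f)                                   # barplot: counts per category
pie(f)                                       # pie chart: proportions per category
dotchart(as.numeric(f), labels = names(f))   # dot chart of the same counts
```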
One sample prop.test?
One-sample proportion test (prop.test):
The one-sample proportion test in R (prop.test) helps us compare the proportion of successes in a group to a specific expected proportion. It’s used when dealing with categorical data.
For example, let’s say you have a sample of 200 individuals and you want to test if the proportion of individuals who prefer coffee is significantly different from a hypothesized proportion of 0.5 (50%). You can use prop.test() to perform this test.
The prop.test() function provides results like the estimated proportion, a test statistic (reported as X-squared, a chi-squared value), a p-value, and a confidence interval. By looking at the p-value, we can determine if there’s a significant difference between the observed proportion and the expected proportion.
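The coffee example from above in R; the observed count of 120 coffee drinkers is invented for illustration:

```r
# 120 of 200 individuals prefer coffee; test against the hypothesized proportion 0.5
prop.test(x = 120, n = 200, p = 0.5)
# output: estimated proportion (0.6), X-squared statistic, p-value,
# and a 95% confidence interval for the true proportion
```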
One sample t.test?
One-sample t-test (t.test):
It helps us find out if the average of a group is significantly different from a specific expected average. We typically use it when working with continuous numerical data.
For example, let’s say you measured the weights of 50 randomly selected apples and want to check if the average weight is significantly different from 150 grams (the expected average). To analyze this, you can use the t.test() function in R.
The t.test() function provides results such as the average weight of the sample, a test statistic called the t-value, a p-value, and a confidence interval. By looking at the p-value, we can determine if there’s a significant difference between the observed average and the expected average.
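The apple example from above in R, with simulated weights standing in for the real measurements:

```r
set.seed(11)
apples <- rnorm(50, mean = 155, sd = 12)   # 50 measured apple weights in grams (made up)

t.test(apples, mu = 150)                   # test against the expected average of 150 g
# output: sample mean, t-value, p-value, and a 95% confidence interval for the true mean
```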
Why use prop.test() and t.test() in inferential statistics?
In summary, the prop.test() compares proportions, while the t.test() compares averages. Both tests help us determine if there’s a significant difference between what we observed and what we expected.
Name 3 different Data Scatter Measures and explain them.
Standard Deviation (SD):
tells you how much the data points deviate from the mean. A higher standard deviation means more variability, while a lower standard deviation means less variability. It helps quantify the dispersion or uncertainty in the data.
- sample standard deviation s
- population standard deviation σ
coefficient of variation (CV):
is a measure of relative variability. It compares the standard deviation of a data set to its mean. A higher CV indicates higher relative variability, while a lower CV indicates lower relative variability. It is useful for comparing variability between data sets with different means or units.
standard error of the mean (SEM):
is a measure of how much the sample mean is likely to vary from the true population mean. It represents the precision of the estimate. A smaller SEM means a more reliable estimate, while a larger SEM means a less precise estimate.
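All three measures computed in R for a small made-up sample x (CV and SEM are derived from sd() as described above):

```r
x <- c(72, 75, 78, 80, 69, 74, 77)

s   <- sd(x)                     # sample standard deviation
cv  <- sd(x) / mean(x)           # coefficient of variation (often reported * 100 as %)
sem <- sd(x) / sqrt(length(x))   # standard error of the mean

c(SD = s, CV = cv, SEM = sem)
```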
What is an Outlier and why can they be important?
Outlier is a data point that significantly differs from the other observations in a dataset. It is an extreme value that can impact statistical analyses and calculations. Identifying and understanding outliers is important for ensuring accurate results and interpretations.
Outliers can be important because they may reveal errors in data, provide insights into rare events, test the robustness of analyses, identify distinct subgroups, and challenge assumptions about data distribution. They offer valuable information that can enhance understanding and improve the accuracy of statistical analyses. (tomato plant example)