Statistics Basics Flashcards
What does “doing science” mean?
Collecting Data so that sample information is a useful representation of the world.
Summarizing Data to make it easier to understand and use for describing the real world.
Using data to critically evaluate evidence for or against a specific hypothesis.
Population of Interest (Main Problem)
Too large to study
Sample
A subset of the population of interest. Using knowledge gained from measurements on a sample, scientists can make estimates about the larger population.
What determines whether or not the data collected for a study are representative of the real world?
The methods used to obtain such data. Such methods must include unbiased, random procedures.
What is a Variable?
Is a characteristic of an object or group of objects that can be represented with a number that has more than 1 possible value
Columns represent…
Variables
Rows represent…
Observations
Ratio-Scale Variables
Have a true absolute zero value.
Quantitative data measured on a scale that has a constant increment between successive values.
Ordinal or Rank Scale Variables
Have values that represent the ranked order of the objects or individuals with regard to a variable.
However, the actual differences between ranks can differ. Ex. Top 3 GPAs: 4.0, 3.96, 3.7
Discrete variables
Can take only specific values and are often based on counting.
Continuous variables
Can take an infinite number of possible values, limited only by the number of decimal places to which the value can be precisely measured.
Categorical variables
Have values that indicate the individual belongs to a class or category. Although these values cannot be inherently represented by numbers, they are often analyzed in terms of the count or proportion of individuals that fall within that class or category.
Population of Interest
Entire group of objects or individuals about which information is desired.
Data
Refers to a collection of observations and/or measurements for one or more variables, made on one or more individuals from the population of interest for the purpose of addressing a specific question.
Statistics
Numbers that describe characteristics of a sample. These are calculated from the data obtained from individuals in a sample.
Sample Statistics and their relation to Population Parameters
Sample statistics are used to estimate or infer something about the values of population parameters.
Sample unit
Is an individual unit that makes up the sample or pop. of interest. For example, sample = people; sample unit = person.
Are Sample Statistics considered accurate with a valid study design?
Even with a valid study design, a sample statistic is only more or less accurate. It is an estimate and almost never equals the true value of the pop. parameter exactly.
Population Parameters
Numbers that describe the characteristics of the entire population of interest.
Random Sample Variation (RSV)
Is the variation in the values of a sample statistic computed from different, independent samples taken from the same population.
It will always occur as long as scientists use samples to estimate population parameters.
Why does RSV happen?
It is a consequence of the randomness of the process by which individuals are selected from a population to create a sample.
In other words, it occurs because repeated samples include different subsets of individuals who vary with regard to the variable of interest (VoI).
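A minimal Python sketch of RSV, assuming a made-up population (the mean of 170, the spread of 10, and the sample size of 25 are illustrative values, not from these cards). Five independent random samples from the same population give five different sample means:

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 values of some variable of interest.
population = [rng.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)  # the population parameter

# Draw five independent random samples and compute the same statistic on each.
sample_means = [
    statistics.mean(rng.sample(population, 25))
    for _ in range(5)
]

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean:   {m:.2f}")
```

Each sample mean lands near the population mean but not exactly on it, and no two samples agree. That spread is the RSV.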
One example of how sample variation and the consequent uncertainty (associated with the estimates of the pop. parameters) can be minimized.
If data are obtained using appropriate methods and unbiased procedures.
Bias
Is any systematic deviation of sample statistics away from the true value of population parameters.
“Systematic” here means consistently wrong.
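A rough Python sketch of the difference between bias and random variation. The population and the +5 measurement offset (standing in for a miscalibrated instrument) are hypothetical values chosen only for illustration:

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for repeatability

# Hypothetical population and its true mean (the population parameter).
population = [rng.gauss(100, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

def sample_mean(measurement_offset=0.0, n=50):
    """Mean of one random sample; the offset mimics a consistently wrong instrument."""
    sample = rng.sample(population, n)
    return statistics.mean(x + measurement_offset for x in sample)

unbiased = [sample_mean() for _ in range(200)]
biased = [sample_mean(measurement_offset=5.0) for _ in range(200)]

print(f"true mean:                 {true_mean:.2f}")
print(f"average of unbiased means: {statistics.mean(unbiased):.2f}")
print(f"average of biased means:   {statistics.mean(biased):.2f}")
```

The unbiased sample means scatter around the true value, while the biased ones scatter around a value that is shifted by the offset. Taking more samples does not remove that shift.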
Three most common reasons for bias
Confounding, Selection Bias, Information Bias
Selection Bias
Happens when individuals included in the sample are not representative of the larger population of interest.
This is determined by the method used to select individuals from the population into your sample.
Information Bias
Measurements do not adequately represent the variable of interest.
This can happen when the method chosen to measure the variable of interest (VoI) yields the wrong value of the VoI, OR when an appropriate measurement method is used but is applied in a way that consistently measures the VoI wrong (for example, because of lack of training).
Measurement Validity
Is the idea that a measurement made on the study subjects accurately quantifies the variable of interest.
Precision
Refers to the amount of variation among the values of a sample statistic derived from repeated, independent samples of the same population.
For example, if repeated samples produce very similar values of a sample statistic then the estimates are said to be precise estimates of the population parameter.
Sample size and its effect on unusual observations
Since some samples, by chance, include unusual individuals, the effect of such few unusual observations on the value of the sample statistic can be reduced by a large sample size.
In other words, if many individuals are included in a sample, random variation among individuals will average out and sample statistics computed from repeated samples will be less variable, and thus, more precise.
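A minimal sketch of this effect in Python, assuming a hypothetical population and two arbitrary sample sizes (10 and 400). It measures how much the sample mean varies across repeated samples of each size:

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for repeatability

# Hypothetical population of 20,000 values.
population = [rng.gauss(50, 15) for _ in range(20_000)]

def spread_of_sample_means(sample_size, repeats=300):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

small = spread_of_sample_means(10)
large = spread_of_sample_means(400)

print(f"n=10:  spread of sample means = {small:.2f}")
print(f"n=400: spread of sample means = {large:.2f}")
```

The spread for the larger sample size comes out much smaller, which is what "more precise" means here.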
Precise Estimates
The values of a precise statistic derived by repeatedly sampling the same population tend to fall within a very narrow range.
Accurate estimates
The values of a precise statistic derived from an unbiased study design are very likely to fall close to the true parameter value.
What makes a statistic an unbiased estimate of the population parameter?
If all individuals in the population of interest have an equal chance of being selected AND the measurement procedure produces valid data for the VoI.
How do you minimize bias?
You do this by reducing selection bias (making sure your sample is representative of the larger population), reducing information/measurement error (choosing an adequate tool to measure your VoI AND making sure your personnel are adequately trained), and checking for confounding.
How do you maximize precision?
Collecting data from large sample sizes and also training personnel to ensure consistent methodology
Study Design
Is a description of the methods that the investigator will be using to acquire data
Study design depends on
the requirements for statistics to be accurate estimates of parameters and the specific objectives of the study
Sampling study design
Is generally used when the study objective is to estimate the value of the pop. parameter.
Principles of a representative sample
All individuals in the pop. should have an equal chance of being included in the sample AND the sample should include enough individuals to represent the range of variation that is present within the population.
Randomization
Process for selecting individuals who will be included in a sample based on some random mechanism
Purpose of randomization
Minimize bias
Replication in Experimental Design
Refers to the number of individuals included in a sample
RSV and Sample Size
Sample statistics computed from large samples exhibit less random sampling variation and are less affected by a few unusual individuals than statistics computed from small samples.
Purpose of replication in Experimental Design
Control sampling variability and increase precision
Types of sampling study designs
Completely random sampling, randomized systematic sampling, stratified sampling design
Completely random sampling
All members of the pop. must have an equal chance of being selected for the sample. This sampling can happen through the random selection from a list OR random location of sample points.
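A short Python sketch of random selection from a list. The sampling frame of 200 labelled people is hypothetical; `random.sample` draws without replacement so that every member has the same chance of selection:

```python
import random

rng = random.Random(7)  # fixed seed for repeatability

# Hypothetical sampling frame: a complete list of the population's members.
sampling_frame = [f"person_{i:03d}" for i in range(1, 201)]

# Draw 20 members without replacement; each member is equally likely to be chosen.
sample = rng.sample(sampling_frame, k=20)
print(sample)
```

The key requirement is that the list is complete. Anyone missing from the sampling frame has no chance of being selected.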
Haphazard selection
Is not random sampling. Examples include opening the phonebook to pick names "at random" or walking aimlessly in the woods to pick a "random" location. It is not possible to confirm that such a selection is truly random.
Randomized systematic sampling Application
Commonly used in field sampling when it is very difficult to travel to random points or when a relatively small number of sample points will be used to describe a larger area.
Randomized systematic sampling
A random starting point is chosen and then sampling locations are placed at fixed distance intervals or travel-time intervals proceeding away from this random point. It reduces cost and effort.
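A minimal sketch of the idea in Python, assuming a hypothetical 1000 m transect sampled every 100 m. The function name and parameters are made up for illustration:

```python
import random

def systematic_sample_points(transect_length, interval, rng):
    """Random start within the first interval, then fixed spacing along the transect."""
    start = rng.uniform(0, interval)  # the only random choice
    n_points = int((transect_length - start) // interval) + 1
    return [start + i * interval for i in range(n_points)]

rng = random.Random(3)  # fixed seed for repeatability
points = systematic_sample_points(transect_length=1000.0, interval=100.0, rng=rng)
print([round(p, 1) for p in points])
```

Only the starting point is random. Every later location follows from it, which is what keeps travel and effort low.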
Stratified sampling design
Involves identifying the various subpopulations called strata and taking separate random samples from each.
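A rough Python sketch of stratified sampling. The two strata ("urban" and "rural"), their sizes, and the sample of 10 per stratum are hypothetical:

```python
import random
from collections import defaultdict

rng = random.Random(11)  # fixed seed for repeatability

# Hypothetical population: (individual id, stratum label) pairs.
population = [(i, "urban") for i in range(600)] + [(i, "rural") for i in range(600, 1000)]

# Group individuals by stratum.
strata = defaultdict(list)
for individual, stratum in population:
    strata[stratum].append(individual)

# Take a separate random sample within each stratum.
per_stratum_n = 10
stratified_sample = {
    stratum: rng.sample(members, per_stratum_n)
    for stratum, members in strata.items()
}

for stratum, members in stratified_sample.items():
    print(stratum, sorted(members))
```

Sampling each stratum separately guarantees that every subpopulation appears in the sample, which a single random draw from the whole population does not.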
Experiment
Deliberate imposition of a treatment by the investigator on a sample of subjects to evaluate the response of the subjects to the treatment.
Primary purpose of experimental study designs
Determine cause-effect relationships between a treatment variable and a response variable
Treatment/Explanatory Variable
Measures the condition(s) that the investigator imposes on the study subjects
Response/Outcome Variable
Measures some characteristic of the study subjects that is hypothesized to change as a result of the treatment
How are experiments conducted in relation to nontreatment variables?
They are done in controlled settings to minimize the possibility that nontreatment variables might influence the response of study subjects
Experimental design establishes what?
Establishes conditions such that there are only two possible explanations for why groups that received different treatments are different at the end of the experiment: the treatment caused the difference, or it was due to chance (random sampling variation)
What happens if nontreatment factors are allowed to influence study subjects, or if treatment groups were different from the beginning?
It is impossible to determine if the treatment was the cause of the observed differences
What is an essential component of good experimental design?
Equivalent study (treatment and control) groups
What is meant by equivalent study (treatment and control) groups?
Groups that prior to the imposition of treatment are similar. They have the same variation of all nontreatment variables
Randomization of Assignment does what?
It randomly assigns the treatment to one of the two groups and ensures that the groups are equivalent before the treatment is imposed.
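A minimal Python sketch of random assignment, assuming 200 hypothetical subjects with one nontreatment variable (age). Shuffling and splitting gives each subject an equal chance of ending up in either group:

```python
import random
import statistics

rng = random.Random(5)  # fixed seed for repeatability

# Hypothetical subjects, each with a nontreatment variable (age).
subjects = [{"id": i, "age": rng.randint(20, 70)} for i in range(200)]

# Shuffle a copy, then split it in half.
shuffled = subjects[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
treatment_group = shuffled[:half]
control_group = shuffled[half:]

mean_age_treatment = statistics.mean(s["age"] for s in treatment_group)
mean_age_control = statistics.mean(s["age"] for s in control_group)
print(f"mean age, treatment: {mean_age_treatment:.1f}")
print(f"mean age, control:   {mean_age_control:.1f}")
```

The two mean ages should come out close but not identical. Random assignment balances nontreatment variables on average, and the balance improves as the groups get larger.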
What is adequate replication?
Since replication involves including many subjects in your sample, which in the case of an experimental design will be divided into two groups, adequate replication refers to including enough individuals to average out individual differences, minimizing the possibility that differences between groups are due to random-chance differences between the individuals assigned to groups.
What is the experimental unit?
The basic unit of the population of interest. For example, if the population is patients with HBP, then the experimental unit is a patient.
What does independence refer to?
The value of the outcome variable on one experimental unit (patient) should not influence, or be influenced by, the value measured on other units
Different types of experimental design
Completely randomized (one treatment variable, two or more levels),
Before-after (used when the effect of the treatment is expected to be small relative to the variation among the study subjects),
Matched-pairs (unlike before-after, there is no carryover effect),
Randomized block (used when a response to treatment is influenced by an extraneous/nontreatment variable that cannot be controlled or eliminated; for example gender),
Factorial (used when the researcher wants to determine the response of the study subjects to two or more treatments but has reason to believe that the effect of one variable will interact with the effect of other variables)
Problems with Cause-Effect Inferences
Since the purpose of an experimental study design is to provide a realistic test of the effect of a treatment on individuals in a population of interest, an investigator might reach an ERRONEOUS conclusion regarding the effect of the treatment on the pop. of interest due to any of several types of problems:
Confounding factors: factors that were not controlled by the investigator might actually be the cause of the differences between groups;
Poor measurement validity (aka information bias): measurements that are poor representations of the phenomenon of interest provide misleading information;
Groups are not similar: often happens when randomization is not done or when there is not adequate replication;
Nonrepresentative subjects included in the experiment (aka selection bias): often seen when researchers, who are trying to control for external factors, include subjects from a homogeneous group;
Investigator bias: researchers are more likely to see or not see a treatment effect due to their preconceived expectations about how experiments should turn out;
Placebo effect;
Lack of realism: The more realistic the treatment and experimental conditions, the less control the researcher may have over confounding factors; the more control over confounding factors, the less realistic the experimental conditions and the less likely the study subjects will respond as they might in their natural environment.
What are Natural Experiments?
They are experiments that involve comparing samples obtained from two or more populations in their natural environment
Main observations about natural experiments
Researcher has limited or no control over which subjects received treatments AND he/she has limited control over how the subjects have been influenced by extraneous, nontreatment factors.
Main reasons why natural experiments are adequate
They are more realistic and they avoid ethical concerns, since it was not the researcher who imposed the natural treatment/exposure
Main problems with natural experiments
There is a substantial risk that the two comparison groups are nonequivalent before the exposure happened AND the researcher cannot completely control for extraneous factors, HENCE observed differences may not have been caused by the natural treatment/exposure
Adequate solution to natural experiments
Selecting study subjects based on criteria that make comparison groups as similar as possible. Caution must be taken when reporting conclusions about the population of interest
How can the independence assumption be violated?
Sampling of each individual is not truly random, such that some individuals in the sample are located in close spatial proximity to each other or are genetically related.
Pseudo-replication: multiple measurements made on individuals from the pop. of interest are treated the same as single measurements from different, randomly chosen individuals
Are the scenarios that violate independence problematic and can you still make analyses with them?
Analyzing data from these scenarios is not itself problematic. The problem is assuming that your sample is independent when it is not. For pseudo-replication, you will need to average the multiple measurements to obtain one single measurement per individual, and measurements on multiple individuals located around a single randomly located point must be analyzed using procedures specifically developed to handle data from that type of study design.
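A small Python sketch of the averaging step for pseudo-replication. The individuals ("A", "B", "C") and their measurements are made-up values:

```python
import statistics
from collections import defaultdict

# Hypothetical data: (individual id, measurement), with repeated measurements on some individuals.
measurements = [
    ("A", 4.0), ("A", 6.0), ("A", 5.0),
    ("B", 7.0), ("B", 9.0),
    ("C", 3.0),
]

# Collect all measurements belonging to the same individual.
by_individual = defaultdict(list)
for individual, value in measurements:
    by_individual[individual].append(value)

# Collapse to one value per individual.
per_individual_mean = {
    individual: statistics.mean(values)
    for individual, values in by_individual.items()
}
n_independent = len(per_individual_mean)

print(per_individual_mean)  # {'A': 5.0, 'B': 8.0, 'C': 3.0}
print(n_independent)        # 3
```

The sample size for analysis is 3 individuals, not 6 measurements. Treating the 6 measurements as independent would overstate the amount of replication.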
How does randomization lead to equivalent groups before the treatment is imposed?
By guaranteeing that each individual has an equal chance of being assigned to any group.
When participants are randomly allocated to different groups, the law of large numbers ensures that, on average, each group will have similar characteristics. Randomization creates balance across groups in terms of both measured and unmeasured factors that could potentially influence the outcome being studied.
By randomizing, you’re making sure each group has a fair mix of different preferences and characteristics. It’s like shuffling the cards before dealing them out.
So, randomization helps make sure your experiment is fair and that any differences you see between the groups are because of what you’re testing, not because of other factors. It’s like setting up a level playing field for your experiment.
First rule after collecting data
LOOK at the data values for unanticipated patterns or oddities. Do not jump straight to complex statistical analyses. Remember, all statistical analyses are bound by the GIGO principle: Garbage in, Garbage out. This means that if you apply a statistical analysis to data that are not appropriate for that analysis, your apparently scientific and precise results will be nonsense.
What is exploratory data analysis (EDA)?
It is based on looking at the data to assimilate and understand the information embodied within the data. The goal is to summarize the data in a way that accentuates patterns (systematic variation in data values) and distinguishes patterns from noise (random variations or deviations from the pattern).
What is the goal of the EDA?
The goal is to summarize the data in a way that accentuates patterns (systematic variation in data values) and distinguishes patterns from noise (random variations or deviations from the pattern). We are looking for evidence of ANY PATTERN without preconceived notions as to its nature.
What are associations?
They are a type of pattern that allows us to make informed predictions of future events, the true goal in science.
What are the primary tools of EDA?
Graphics and Summary Statistics
Examples of graphics
Histograms, stem-leaf plots, box plots, scatter plots
Examples of summary statistics
Mean, Median, Mode (Measure of Center)
Range, Variance, Standard Deviation, Interquartile Range, Min, Max (Measures of Spread), Percentiles (Measure of Data Value Location)
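A short Python sketch computing these summary statistics with the standard `statistics` module. The seven data values are made up, and the quartiles use the module's default ("exclusive") method; other software may use a different quartile rule and give slightly different values:

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

# Measures of center
mean = statistics.mean(data)      # 6
median = statistics.median(data)  # 5
mode = statistics.mode(data)      # 4

# Measures of spread
minimum, maximum = min(data), max(data)  # 2 and 11
data_range = maximum - minimum           # 9
variance = statistics.variance(data)     # sample variance (divides by n - 1): 10
std_dev = statistics.stdev(data)         # square root of the sample variance

# Measure of data value location: quartiles (25th, 50th, 75th percentiles)
q1, q2, q3 = statistics.quantiles(data, n=4)  # 4.0, 5.0, 9.0
iqr = q3 - q1                                 # 5.0

print(mean, median, mode)
print(data_range, variance, round(std_dev, 3), iqr)
```

Note that `statistics.variance` and `statistics.stdev` are the sample versions. The population versions are `statistics.pvariance` and `statistics.pstdev`.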
What is frequency?
Is the number of times a specific data value occurs in the sample data.
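A minimal Python sketch of counting frequencies with `collections.Counter`, using a made-up list of data values:

```python
from collections import Counter

data = [3, 1, 2, 3, 3, 2, 5]  # hypothetical sample

# Map each distinct data value to the number of times it occurs.
frequency = Counter(data)

for value in sorted(frequency):
    print(f"value {value}: frequency {frequency[value]}")
```

Here the value 3 has a frequency of 3, the value 2 has a frequency of 2, and the values 1 and 5 each have a frequency of 1. These counts are what a histogram draws as bar heights.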