Module 1: Introduction to Data Flashcards
Concept
Answer
A frequency table exhibits how…
frequencies are distributed over various categories (known as a frequency distribution)
Associated variables
When two variables show some connection/relationship with one another
Blocking (experimental design)
Grouping the sample based on variables which may affect the outcome and then randomizing within groups
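A minimal sketch of blocking, assuming made-up patients tagged with a hypothetical `risk` level as the blocking variable. The function name `blocked_assignment` and the half-and-half split are illustrative choices, not a prescribed procedure.

```python
import random
from collections import defaultdict

def blocked_assignment(subjects, block_of, seed=0):
    """Split each block at random into 'treatment' and 'control' halves."""
    rng = random.Random(seed)             # seeded only for reproducibility
    blocks = defaultdict(list)
    for s in subjects:
        blocks[block_of(s)].append(s)     # group by the blocking variable
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)              # randomize within the block
        half = len(members) // 2
        for s in members[:half]:
            assignment[s] = "treatment"
        for s in members[half:]:
            assignment[s] = "control"
    return assignment

# Hypothetical subjects: (id, risk level); ids 0-3 are high risk, 4-9 low risk
patients = [(i, "high" if i < 4 else "low") for i in range(10)]
groups = blocked_assignment(patients, block_of=lambda p: p[1])
```

Each risk level ends up represented in both groups, so a difference in outcome is less likely to be explained by risk alone.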
Categorical variable
The individual entries are categories, the possible values are called “levels”
Cluster sample
Break the population into groups and then sample a fixed number of those groups and include all observations from each group; helpful when there’s a lot of variability between cases within a cluster but the clusters themselves don’t differ much from one another
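A minimal sketch of a cluster sample, assuming a made-up population of villages and households. The helper `cluster_sample` and all names in the data are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=0):
    """Pick n_clusters whole clusters at random and keep every case in them."""
    rng = random.Random(seed)                        # seeded for reproducibility
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [case for name in chosen for case in clusters[name]]

# Hypothetical population: 5 villages with 3 households each
villages = {f"village_{v}": [f"v{v}_h{h}" for h in range(3)] for v in range(5)}
sample = cluster_sample(villages, n_clusters=2)
```

Randomness enters only at the cluster level: once a village is chosen, all of its households are included.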
Confounding variable
A variable that is correlated with both the explanatory and the response variables
Continuous variable
A numerical variable that can take any value within a range (e.g. any number of decimal places of precision); e.g. height, weight (think how much)
Controlling (experimental design)
Mitigate the differences between groups
Convenience sample bias
When individuals who are more accessible are more likely to be included in the sample
Cumulative frequency
The total of a frequency and all frequencies below it in a frequency distribution; the running total of frequencies
Cumulative relative frequency
Cumulative frequency for that category/sum of all frequencies
Data
Information we gather with experiments and with surveys
Description
Summarizing the data that are obtained
Descriptive statistics
Refers to methods for summarizing the data; describes the sample only (graphs, numerical summaries)
Design
Planning how to obtain data to answer the questions of interest (experimental design, sample size, power, etc.)
Discrete variable
A numerical variable that only takes number values in jumps (e.g. whole numbers); e.g. the number that appears when throwing a die (think how many)
Experiment
Used to investigate the possible causal connection between variables
Explanatory variable
The (first) variable that is suspected to causally affect the other
Frequency
The number of elements that belong in a certain category
Graphical methods
Histogram, boxplot, bar graph, etc.
Graphs (categorical)
Bar chart, pie chart; focuses on frequencies or relative frequencies of the levels of the variable
Graphs (numerical/scale)
Dot chart (discrete variable), stem-and-leaf plot, histogram, boxplot, scatterplot
Histogram
A bar chart that gives the frequencies or relative frequencies of occurrences of a scale variable in certain intervals; the heights of the bars in the histogram are called the distribution of the sample
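A rough sketch of how histogram bar heights can be computed, assuming equal-width intervals that include their left endpoint. The data and the helper `histogram_counts` are made up for illustration.

```python
import math
from collections import Counter

def histogram_counts(values, start, width):
    """Count values per interval [start + k*width, start + (k+1)*width)."""
    counts = Counter(math.floor((v - start) / width) for v in values)
    return {(start + k * width, start + (k + 1) * width): counts[k]
            for k in sorted(counts)}

ages = [21, 23, 25, 25, 29, 31, 34, 38, 45]   # made-up data
bins = histogram_counts(ages, start=20, width=10)
for (low, high), n in bins.items():
    print(f"[{low}, {high}): {'#' * n}")       # one '#' per observation
```

Dividing each count by `len(ages)` would give relative frequencies instead.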
Characteristics of a distribution: left-skewed
Negatively skewed; the values to the left of the center fall further away from the center than those to the right of the center; the mean is less than the median
Characteristics of a distribution: right-skewed
Positively skewed; the values to the right of the center fall further away from the center than those to the left of the center; the mean is greater than the median
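A small check of the mean-versus-median rule for a right-skewed sample, using made-up incomes with one large value in the right tail.

```python
from statistics import mean, median

incomes = [28, 30, 31, 33, 35, 36, 40, 250]   # made-up; 250 sits far out in the right tail

avg = mean(incomes)      # pulled toward the long right tail
mid = median(incomes)    # barely affected by the extreme value
print(avg, mid)          # 60.375 34.0
```

Mirroring the data so that the long tail is on the left would put the mean below the median instead.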
Characteristics of a distribution: symmetric
Left and right sides of the graph are roughly mirror images of each other; the center is the mean and the mean ~ the median
How to describe graphical data
Center, variation, distribution, outliers, time
Independent variables
When two variables are not associated/there is no evident relationship between the two
Inference
Making decisions and predictions based on the data
Inferential statistics
Are used when data are available only for a sample but we want to make a decision or prediction about the entire population (confidence intervals, significance tests)
Intensity map (heat map)
Colors are used to show higher and lower values of a variable
Multi-stage sample
Clustering, but sample within each cluster rather than the entire cluster
Negatively associated
Downward trend: as one variable increases, the other tends to decrease
Nominal variable
A categorical variable where the levels have no hierarchy; e.g. eye color, type of car
Non-response bias
When the nonresponse rate among those recruited for a sample is high, so it is unclear whether those who responded really represent the population
Numerical summaries, location (descriptive statistics)
Mean, median, quantile/percentile, quartile, mode
Numerical summaries, spread (descriptive statistics)
Standard deviation, sample variance, range, interquartile range, coefficient of variation
Numerical variable
Can take a wide range of number values, and it is sensible to add/subtract/take averages
Observational data
No treatment has been explicitly applied or withheld in the collection of the data
Observational study
When data are collected in a way that does not interfere with how the data arise; can provide evidence of a naturally occurring association but alone cannot show a causal connection
Ordinal variable
A categorical variable where the levels have a natural ordering; e.g. level of education
Population
Is the total set of subjects in which we are interested
Positively associated
Upward trend: as one variable increases, the other tends to increase
Probability
Is the basic tool for evaluating chances and is also the key to how well inferential statistics work
Qualitative data in a one way table can include
Absolute frequency, relative frequency, cumulative frequency, cumulative relative frequency
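A minimal sketch of a one-way table built from made-up eye-color data, showing all four quantities. The alphabetical ordering of levels is an arbitrary assumption; running totals are most meaningful when the levels have a natural order.

```python
from collections import Counter
from itertools import accumulate

eye_colors = ["brown", "blue", "brown", "green",
              "brown", "blue", "brown", "blue"]   # made-up data

counts = Counter(eye_colors)
levels = sorted(counts)                    # fixed order for the running totals
freq = [counts[lv] for lv in levels]       # absolute frequency
total = sum(freq)
rel = [f / total for f in freq]            # relative frequency
cum = list(accumulate(freq))               # cumulative frequency
cum_rel = [c / total for c in cum]         # cumulative relative frequency

for row in zip(levels, freq, rel, cum, cum_rel):
    print(row)
```

The last cumulative frequency equals the sample size, and the last cumulative relative frequency equals 1.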
Qualitative data in a two way table can
Indicate the relationship between two variables
Random sample reduces…
The chance of introducing biases
Randomization (experimental design)
Accounts for variables that can’t be controlled
Randomized experiment
When individuals are randomly assigned to a group in an experiment
Relative frequency
Frequency for that category/sum of all frequencies
Replication (experimental design)
Can be accomplished via a sufficiently large sample, or by duplicating a study
Response variable
The second variable that changes based on the explanatory variable
Sample
The subset of the population for whom we have or plan to have data
Sampling methods are based on the notion of…
Implied randomness, and tend to be a good reflection of population when each subject in the population has the same chance of being included in that sample.
Scatterplot
Represents the bivariate relationship between two variables (usually continuous variables) by plotting a data point for each observation in the data set; useful for visualizing the relationship
Simple random sampling
Each case in a population has an equal chance of being included in the final sample; knowing a case is included does not provide useful info about what other cases are included (raffle-style)
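A minimal sketch of a raffle-style draw, assuming a hypothetical population of 100 case IDs. `random.sample` draws without replacement, so no case appears twice.

```python
import random

population = list(range(1, 101))       # hypothetical case IDs 1..100
rng = random.Random(42)                # seeded only to make the sketch reproducible
srs = rng.sample(population, k=10)     # every case is equally likely to be drawn
```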
Stratified sampling
Population is divided into strata (similar cases grouped together, e.g. by age), then a second sampling method, usually simple random sampling, is employed within each stratum (useful when cases in each stratum are similar with respect to the studied outcome)
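A minimal sketch of stratified sampling, assuming made-up people labelled with a hypothetical age group and a simple random sample of equal size within each stratum. The helper `stratified_sample` is illustrative only.

```python
import random
from collections import defaultdict

def stratified_sample(cases, stratum_of, n_per_stratum, seed=0):
    """Group cases into strata, then draw a simple random sample from each."""
    rng = random.Random(seed)                  # seeded for reproducibility
    strata = defaultdict(list)
    for c in cases:
        strata[stratum_of(c)].append(c)        # build the strata
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical population: (id, age group)
people = [(i, "under_40" if i % 3 else "40_plus") for i in range(30)]
picked = stratified_sample(people, stratum_of=lambda p: p[1], n_per_stratum=4)
```

Unlike a cluster sample, every stratum contributes cases, and randomness enters inside each stratum.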
Subjects
The entities that we measure in a study
Tabular methods
Table summary with frequency and/or percent frequency
Types of descriptive statistics
Numerical methods, tabular methods, graphical methods
Characteristic of data: center
A representative or average value that indicates where the middle of the data set is located
Characteristic of data: variation
A measure of the amount that the data values vary among themselves
Characteristics of data: distribution
The nature or shape of the distribution of the data
Characteristics of the data: outliers
Sample values that lie very far away from the vast majority of the other sample values
Characteristics of data: time
Changing characteristics of the data over time (is there a trend?)
Shape of a distribution: Modality
How many prominent peaks are apparent within the distribution
Shape of a distribution: unimodal
A single prominent peak in the distribution
Shape of a distribution: bimodal
Two prominent peaks in the distribution
Shape of a distribution: multimodal
Several prominent peaks in the distribution
Shape of a distribution: uniform
No prominent peaks, mostly smooth
Mean (measure of center)
A measure of center; the sample mean is denoted as an x with a bar across the top, and the population mean is denoted as the Greek letter mu (the little u with a tail)
Sample mean (x with bar over it)
A sample statistic that serves as a point estimate of the population mean
Variance (measures of variability)
The average squared deviation from the mean; we use the squared deviation to get rid of negatives so that observations equally distant from the mean are weighted equally, and to weight larger deviations more heavily
Standard deviation (measures of variability)
The square root of the variance, and has the same units as the data
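A minimal sketch of both measures on made-up data. Note one detail the card glosses over: the sample variance conventionally divides by n - 1 rather than n, so it is only approximately an average of the squared deviations.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]        # made-up data, mean = 5
var = sample_variance(data)            # squared deviations sum to 32, so 32 / 7
sd = sample_sd(data)
```

The standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 convention.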
Median (measures of center)
The value that splits the data in half when ordered in ascending order; if there is an even number of observations then the median is the average of the two values in the middle; also called the 50th percentile
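The odd and even cases can be sketched as follows; `median_of` is a hypothetical helper name, and `statistics.median` in the standard library behaves the same way.

```python
def median_of(xs):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2:                          # odd count: a single middle value
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2   # even count: average the two middle values

print(median_of([5, 1, 3]))      # 3
print(median_of([4, 1, 3, 2]))   # 2.5
```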
IQR (measures of variability)
The range of the middle 50% of the data, between the first quartile (25th percentile) and the third quartile (75th percentile); IQR = Q3 - Q1
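A minimal sketch of IQR = Q3 - Q1 on made-up data. Quartile conventions differ between textbooks and software; the `method="inclusive"` option of `statistics.quantiles` is an assumption here and may not match the convention used in the course.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # made-up data

# n=4 returns the three cut points Q1, Q2 (the median), Q3
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)
```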
Box plot
The box represents the middle 50% of the data, the line dividing the box is the median, the whiskers extend to the most extreme observations within their maximum reach (1.5 x IQR beyond the quartiles), and any dots beyond them are suspected outliers
Box plot: Whiskers
Max upper whisker reach = Q3 + 1.5 x IQR; max lower whisker reach = Q1 - 1.5 x IQR
Box plot: Outliers
Defined as an observation beyond the max reach of the whiskers; helpful for identifying extreme skew in the distribution, identifying data collection/entry errors, and providing insight into interesting features of the data
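The whisker limits and the outlier rule can be sketched together. The helper `boxplot_outliers` is hypothetical, the data are made up, and the quartiles use the `inclusive` method of `statistics.quantiles`, which is one convention among several.

```python
from statistics import quantiles

def boxplot_outliers(xs):
    """Return the max whisker reach (lower, upper) and the values beyond it."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr             # max lower whisker reach
    upper = q3 + 1.5 * iqr             # max upper whisker reach
    return lower, upper, [x for x in xs if x < lower or x > upper]

values = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # made-up data with one extreme value
lower, upper, outliers = boxplot_outliers(values)
print(lower, upper, outliers)
```

Here Q1 = 3 and Q3 = 7, so the reach is from -3 to 13 and only 30 is flagged.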
Robust statistics
Median and IQR are more robust to skewness and outliers than the mean and standard deviation
For skewed distributions, use…
Median (center) and IQR (spread)
For symmetric distributions, use…
Mean (center) and standard deviation (spread)
Log transformation
Useful when data are extremely skewed as it can make outliers less prominent, but the results of the analysis might be difficult to interpret because the log of a measured variable usually has no direct meaning
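A small illustration of the effect on made-up, strongly right-skewed data, assuming a base-10 log (the choice of base is arbitrary) and strictly positive values, since the log is undefined otherwise.

```python
import math
from statistics import mean, median

raw = [1, 2, 3, 5, 8, 13, 1000]            # made-up data; 1000 dominates the scale
logged = [math.log10(x) for x in raw]      # requires every value to be > 0

# Gap between mean and median before and after the transformation
gap_raw = mean(raw) - median(raw)
gap_log = mean(logged) - median(logged)
print(gap_raw, gap_log)
```

On the log scale the extreme value sits at 3 rather than 1000, so it pulls the mean far less. The transformed values are in log units, which is where the interpretation cost comes from.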