DATA ANALYSIS Flashcards
A/B TEST
TYPE OF INFERENTIAL ANALYSIS
DESCRIPTIVE ANALYSIS
Descriptive analysis lets us describe, summarize, and visualize data so that patterns can emerge. Sometimes we’ll only do a descriptive analysis, but most of the time a descriptive analysis is the first step in our analysis process.
Descriptive analyses include measures of central tendency (e.g., mean, median, mode) and spread (e.g., range, quartiles, variance, standard deviation, distribution), which are referred to as descriptives or summary statistics.
Typically, data visualization is also included in descriptive analysis.
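As a minimal sketch of the summary statistics named above, the following Python uses only the standard-library `statistics` module. The `scores` list is made-up illustrative data, not taken from any real dataset.

```python
import statistics

# Hypothetical sample data for illustration only.
scores = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency.
mean = statistics.mean(scores)      # arithmetic average
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequent value

# Measures of spread.
value_range = max(scores) - min(scores)
quartiles = statistics.quantiles(scores, n=4)  # three cut points: Q1, Q2, Q3
variance = statistics.variance(scores)         # sample variance (n - 1 denominator)
std_dev = statistics.stdev(scores)             # sample standard deviation

print(mean, median, mode, value_range, quartiles, variance, std_dev)
```

Note that `variance` and `stdev` compute sample statistics; `pvariance` and `pstdev` are the population versions in the same module.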
EXPLORATORY ANALYSIS
Exploratory analyses show us underlying patterns and relationships within datasets.
Exploratory analyses cannot determine causation.
INFERENTIAL ANALYSIS
Inferential analysis lets us test a hypothesis on a sample of a population and then extend our conclusions to the whole population.
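One simple way to extend a result from a sample to a population is a confidence interval for the mean. The sketch below assumes a simulated population (the values 50 and 10, the sample size of 200, and the seed are all arbitrary choices for illustration) and uses a normal approximation with z = 1.96.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated measurements.
population = [random.gauss(50, 10) for _ in range(10_000)]

# Draw a simple random sample and summarize it.
sample = random.sample(population, 200)
sample_mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate 95% confidence interval for the population mean
# (normal approximation, z = 1.96).
ci_low = sample_mean - 1.96 * std_error
ci_high = sample_mean + 1.96 * std_error

print(f"sample mean = {sample_mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is a statement about the population mean, made using only the 200 sampled values.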
CAUSAL ANALYSIS
CORRELATION != CAUSATION
Experiments that support causal analysis:
Correlation does not equal causation.
Proving causation is tricky and generally requires very careful experimental design.
Replication, randomization, and control are key components of good experimental design.
PREDICTIVE ANALYSIS
DATA ANALYSIS
SUMMARY STATISTICS
CENTRAL TENDENCY
INCLUDED IN DESCRIPTIVE ANALYSIS
e.g., mean, median, mode
EX OF SUMMARY STATISTIC
SPREAD
INCLUDED IN DESCRIPTIVE ANALYSIS
e.g., range, quartiles, variance, standard deviation, distribution
EX OF SUMMARY STATISTIC
UNSUPERVISED LEARNING
CLUSTERING ALGORITHMS
PRINCIPAL COMPONENT ANALYSIS
K-MEANS CLUSTERING
Rand statistic
GOOD EXPERIMENTAL DESIGN
REPLICATION
RANDOMIZATION
CONTROL
REPLICATION
GATHER ENOUGH SUBJECTS (REPLICATES) TO SUPPORT STATISTICAL ANALYSIS
RANDOMIZATION
ASSIGN SUBJECTS RANDOMLY INTO TREATMENT GROUPS, SO EACH SUBJECT HAS AN EQUAL CHANCE TO BE IN ANY TREATMENT GROUP
CONTROL
CONTROL ALL FACTORS THAT ARE NOT THE EXPERIMENT’S FOCUS BUT COULD INFLUENCE THE OUTCOME
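The randomization step described above can be sketched in Python as follows. The function name `assign_groups`, the group labels, and the subject IDs are hypothetical, chosen only for illustration. Shuffling and then dealing subjects round-robin gives each subject an equal chance of landing in any group.

```python
import random

def assign_groups(subjects, groups=("treatment", "control"), seed=None):
    """Randomly assign subjects to groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)  # every ordering is equally likely
    assignment = {group: [] for group in groups}
    # Deal shuffled subjects round-robin so group sizes differ by at most one.
    for i, subject in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(subject)
    return assignment

subjects = [f"subject_{i}" for i in range(1, 11)]  # hypothetical subject IDs
result = assign_groups(subjects, seed=42)
print(result)
```

Passing a `seed` makes the assignment reproducible, which helps when the assignment needs to be audited later.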
Causal inference with observational data
requires:
Advanced techniques to identify a causal effect
Meeting very strict conditions
Appropriate statistical tests
SUPERVISED MACHINE LEARNING
Supervised machine learning algorithms are trained with labeled data and predict the likelihood of future outcomes.
Supervised machine learning algorithms can only be as good as the data used to train them.
POPULAR SUPERVISED MACHINE LEARNING TECHNIQUES
REGRESSION MODELS
SUPPORT VECTOR MACHINES
DEEP LEARNING CNN
REGRESSION MODELS
SUPPORT VECTOR MACHINES
DEEP LEARNING CONVOLUTIONAL NEURAL NETWORKS
GARBAGE IN
GARBAGE OUT
LOW RISK PREDICTION
HIGH RISK PREDICTION
AUTOMATION BIAS
Automation bias stems from the idea that computers or machines are more trustworthy than humans because they are more objective. Automation bias is at the root of why people follow their GPS into trouble, even when contradictory information is available.
BIAS
SYSTEMATIC ERRORS IN THINKING INFLUENCED BY CULTURAL AND PERSONAL EXPERIENCE.
DISTORT OUR PERCEPTION AND CAUSE US TO MAKE INCORRECT DECISIONS.
SELECTION/SAMPLING BIAS
Selection bias occurs when study subjects (i.e., the sample) are not representative of the population. Selection bias can be due to poor study design if the sample is too small or is not randomized. Selection bias can also crop up when the only data available is influenced by historical bias.
HISTORICAL BIAS
Systematic influence based on historical social and cultural beliefs.
ALGORITHMIC BIAS
Algorithmic bias arises when an algorithm produces systematic and repeatable errors that lead to unfair outcomes, such as privileging one group over another. Algorithmic bias can be initiated through selection bias and then reinforced and perpetuated by other bias types.
EVALUATION BIAS
Testing an algorithm with a non-representative dataset leads to evaluation bias. Testing with a non-representative benchmarking dataset would give high overall accuracy scores, even if the algorithms were inaccurate for certain groups.
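A small sketch of how a non-representative benchmark can hide poor performance for one group. The records below are invented for illustration: group "A" makes up 95% of the benchmark and group "B" only 5%.

```python
# Hypothetical evaluation records: (group, true_label, predicted_label).
# Group "A" dominates the benchmark; group "B" is underrepresented.
records = (
    [("A", 1, 1)] * 90    # 90 correct predictions for group A
    + [("A", 1, 0)] * 5   # 5 errors for group A
    + [("B", 1, 1)] * 1   # 1 correct prediction for group B
    + [("B", 1, 0)] * 4   # 4 errors for group B
)

def accuracy(rows):
    """Fraction of rows where the prediction matches the true label."""
    return sum(true == pred for _, true, pred in rows) / len(rows)

overall = accuracy(records)

# Accuracy computed separately for each group.
by_group = {}
for group in ("A", "B"):
    group_rows = [row for row in records if row[0] == group]
    by_group[group] = accuracy(group_rows)

print(f"overall: {overall:.2f}")
for group, acc in by_group.items():
    print(f"group {group}: {acc:.2f}")
```

With these made-up numbers the overall accuracy is 0.91, while accuracy for group "B" is only 0.20. Reporting accuracy per group is one way to surface the problem.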
BLACK BOX
When algorithms are proprietary, they are “black boxes”. In addition to not knowing what data were used to train and test the algorithm, we can’t know how it was designed or how it works. As a result, it’s impossible to evaluate the algorithms themselves.
CONFIRMATION BIAS
Confirmation bias is our tendency to seek out information that supports our views. It influences data analysis when we consciously or unconsciously interpret results in a way that supports our original hypothesis. To limit confirmation bias, clearly state hypotheses and goals before starting an analysis, and then honestly evaluate how they influenced our interpretation and reporting of results.
OVERGENERALIZATION BIAS
Overgeneralization bias is inappropriately extending observations made with one dataset to other datasets, leading to overinterpreted results and unjustified extrapolation. To limit overgeneralization bias, be thoughtful when interpreting data, extend results beyond the dataset used to generate them only when it is justified, and extend results only to the proper population.
REPORTING BIAS
Reporting bias is the human tendency to report or share only results that affirm our beliefs or hypotheses, also known as “positive” results. Editors, publishers, and readers are also subject to reporting bias, as positive results are published, read, and cited more often. To limit reporting bias, report negative results and cite others who do, too.
NOMINAL
ORDINAL