EDA Flashcards
The result of making observations, either on a single variable or simultaneously on two or more variables
Data
Types of data
Primary data
Secondary data
Categorical data
Continuous data
Data collected fresh and for the first time, and thus original in character
Primary data
Data which have been collected by someone else and which have already been passed through a statistical analysis
Secondary data
A variable type with two or more categories; it takes on one of a limited number of possible values
Categorical data
Data that can take any value within a range, measured on an infinitely divisible scale
Continuous data
Process of gathering and measuring information on targeted variables in an established system, which then enables one to answer relevant questions and evaluate outcomes
Data collection
Process of using diverse analytical methods to review data and arrive at relevant conclusions
Data interpretation
Process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making
Data analysis
Steps of Data Analysis
Defining the question
Collecting the data
Cleaning the data
Analyzing the data
Sharing your results
Embracing failure
Summary
Methods of Data collection
Population
Sample
The entire collection of individuals or objects about which information is desired is called the ____ of interest
Population
A subset of the population, selected for study in some prescribed manner
Sample
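A minimal sketch of the population/sample distinction, assuming a made-up population of ID numbers and simple random sampling without replacement as the "prescribed manner" (the cards do not name a specific sampling scheme):

```python
import random

# Hypothetical population: ID numbers 1..1000 (illustrative only).
population = list(range(1, 1001))

# Fixed seed so the draw is repeatable; the seed value is arbitrary.
rng = random.Random(42)

# Simple random sample of 50 units, drawn without replacement.
sample = rng.sample(population, k=50)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean = {population_mean}")
print(f"sample mean     = {sample_mean:.1f}")
```

The sample mean is an estimate of the population mean; it will usually be close to it but not equal.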
Steps in data gathering
Measurement
Representation
Statistical tools to analyze data
Data gathering under measurement
Construct
Measurement
Response
Edited response
Data gathering under Representation
Target population
Sampling frame
Sample/respondents
Post-survey adjustments
Data gathering under statistical tools to analyze data
Statistical tests
Some advanced modeling techniques
Some bias reduction techniques
Regression
Discrete choice
Three Major Methods of Data Collection
Mail
Telephone
Face to face survey
Types of Questionnaire Survey
Stated preference survey
Revealed preference survey
Types of stated choice experiment
Conjoint analysis
Contingent valuation method
Typically asking participants to choose one alternative from a set of hypothetical alternatives, where the attributes of the alternatives are set by the researcher
Conjoint analysis
Typically asking participants to state the (monetary) value of some public (non-market) good
Contingent valuation method
The group of elements for which the survey investigator wants to make inferences by using the sample statistics
Target population
Lists/procedures intended to identify all elements of a target population or a set of units who are potentially selected as respondents
Sampling frame
Sample selected from a sampling frame
Sample
Members of the sample who successfully answered
Respondents
Two types of Sample Non-response
Item nonresponse
Unit nonresponse
Respondent refuses to answer one or more survey questions
Item nonresponse
Respondent refuses to take the survey at all
Unit nonresponse
Handling non-response data by simply excluding the data having item-nonresponse
Procedures with completely recorded units
Handling non-response data by excluding the item-missing data and handling the impact by changing the weights
Weighting procedures
Handling non-response data in a way that the missing values are filled in and the resultant completed data are analyzed by standard methods
Imputation-based procedures
Handling non-response data by defining a model for the observed data with a certain missing-data mechanism
Model-based imputation procedures
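Two of the procedures above can be sketched in a few lines: keeping only completely recorded units, and a very simple imputation. The survey records, the field names, and the choice of mean imputation are all assumptions made for illustration; the cards do not specify which imputation rule is meant.

```python
# Hypothetical survey records; None marks an item nonresponse.
records = [
    {"age": 34, "income": 52000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
    {"age": None, "income": 61000},
]


def complete_cases(rows):
    """Procedures with completely recorded units: drop rows with any missing item."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_impute(rows, key):
    """Imputation-based procedure (mean imputation, chosen here for simplicity):
    fill missing values of `key` with the mean of the observed values."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = sum(observed) / len(observed)
    # Build new dicts so the original records are left untouched.
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


cc = complete_cases(records)
imputed = mean_impute(records, "income")

print(len(cc), "complete cases out of", len(records))
print("imputed income for record 2:", round(imputed[1]["income"], 2))
```

Dropping incomplete units shrinks the data set, while mean imputation keeps every unit but understates the variability of the filled-in variable.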
A function that gives the frequency of different possible values
Distribution
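As a small illustration of a distribution as a frequency function, the sketch below counts how often each value occurs in a made-up list and prints a text histogram. The data values are hypothetical.

```python
from collections import Counter

# Hypothetical observations of a single discrete variable.
values = [2, 3, 3, 5, 2, 3, 4, 5, 3, 2]

# Map each distinct value to the number of times it occurs.
freq = Counter(values)

# Print a simple text histogram, one row per distinct value.
for value in sorted(freq):
    print(f"{value}: {'#' * freq[value]}")
```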
Examples of data visualization for a single variable
Histogram
Boxplot
Examples of data visualization for categorical-continuous data (multiple variables)
Point
Histogram
Boxplot
Examples of data visualization for continuous-continuous data (multiple variables)
Scatterplot
Heatplot
Yi = B0 + B1 X1 + ui
What are B0 and B1?
parameters
Yi = B0 + B1 X1 + ui
What is ui?
error term
Two types of linear model
Regression analysis
Linear Regression
Form of predictive modelling technique which investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables
Regression analysis
The supervised machine learning model that finds the best-fit straight line between the independent and dependent variables, i.e. it finds the linear relationship between them
Linear regression
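A minimal sketch of fitting the model Yi = B0 + B1 X1 + ui with one predictor, assuming ordinary least squares as the "best fit" criterion (the cards do not name the estimator). The function name `fit_simple_ols` and the data are hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1), the least-squares estimates of B0 and B1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


# Made-up data lying exactly on y = 1 + 2x, so every residual is zero.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

b0, b1 = fit_simple_ols(x, y)

# Residuals are the sample counterpart of the error term ui.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(f"b0 = {b0}, b1 = {b1}")
```

With real data the points do not fall exactly on a line, and the residuals are the part of Yi that the fitted line leaves unexplained.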
Probabilities of occurrence of different possible outcomes
Probability distribution
A “bell-shaped” distribution
Normal Distribution
A “discrete-probability” distribution
Poisson distribution
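As a sketch of what a discrete probability distribution looks like in practice, the Poisson probability mass function P(K = k) = exp(-lam) * lam^k / k! can be evaluated directly. The rate `lam = 3.0` is an arbitrary, hypothetical choice.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 3.0  # hypothetical average number of events per interval

# Probabilities for k = 0..49; the remaining tail is negligible for this rate.
probs = [poisson_pmf(k, lam) for k in range(50)]

# The mean of the distribution, computed from the probabilities.
mean = sum(k * p for k, p in enumerate(probs))

for k in range(6):
    print(f"P(K = {k}) = {probs[k]:.4f}")
```

The probabilities sum to 1 (up to the truncated tail), and the mean equals the rate `lam`.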
Confidence interval is calculated from what?
Standard Error
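A minimal sketch of how a confidence interval for a mean can be built from the standard error, assuming a normal approximation (critical value about 1.96 for 95%). The data are made up, and for a sample this small a t critical value would usually be preferred.

```python
import math
import statistics

# Hypothetical measurements.
data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

n = len(data)
mean = statistics.mean(data)

# Standard error of the mean: sample standard deviation divided by sqrt(n).
se = statistics.stdev(data) / math.sqrt(n)

# Two-sided 95% critical value from the standard normal distribution.
z = statistics.NormalDist().inv_cdf(0.975)

ci = (mean - z * se, mean + z * se)
print(f"mean = {mean:.3f}, SE = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```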
Is more intuitive and has clear quantitative implications
Confidence interval
Very popular in the old style
p-value
The probability of wrongly rejecting a correct null hypothesis
p-value
The 95% confidence interval does not intersect the zero line
p-value
A statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate)
Correlation
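The definition above corresponds to the Pearson correlation coefficient, which can be sketched as follows. The helper name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: y rises with x, but not perfectly.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 60, 68, 71]

r = pearson_r(hours, score)
print(f"r = {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship, though a nonlinear one may still exist.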
Correlation does not necessarily mean what?
Causation
Refers to the claim that a set of observed data are not the result of chance but can instead be attributed to a specific cause. It is a way to tell you if your test results are solid.
Statistical Significance
Process of determining the magnitude of statistical variates at some future point of time.
Prediction
The process of using correlations between variables to hypothesize about future events and outcomes
Prediction
Used to model the relationship between two continuous variables
Simple Linear Regression
When to use Simple Linear Regression
Positive relationship
Negative relationship
Linear relationship
Curvilinear relationship
Used to model the relationship between a continuous response variable and continuous or categorical explanatory variables
Multiple Linear Regression
A statistical test that is used to compare the means of two groups
T-test
It is often used in hypothesis testing to determine whether a process or treatment influences the population of interest, or whether two groups are different from one another
T-test
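A sketch of the test statistic behind a two-group comparison of means. The cards do not say which variant is meant, so this assumes Welch's form (unequal variances), and the group data are made up. Turning the statistic into a p-value needs the t distribution, which is left out here.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


# Hypothetical outcomes for a treated group and a control group.
treated = [5.1, 5.5, 5.8, 6.0, 5.4]
control = [4.8, 4.9, 5.2, 5.0, 4.7]

t_stat = welch_t(treated, control)
print(f"t = {t_stat:.2f}")
```

A large absolute value of t means the difference in means is large relative to its standard error.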
A statistical method for testing for differences in the means of three or more groups
One-way ANOVA
Meaning of ANOVA
Analysis of Variance
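The name comes from how the test works: it compares the variance between group means with the variance within groups. A minimal sketch of the one-way F statistic, using made-up groups and a hypothetical helper `one_way_anova_f`:

```python
def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical measurements for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")  # prints F = 13.00
```

Here the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, giving F = 13.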
Test that measures how a model compares to actual observed data
Chi-square test
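A minimal sketch of the chi-square goodness-of-fit statistic, which sums (observed - expected)^2 / expected over categories. The die-roll counts are hypothetical, and the model being compared is a fair die.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts from 120 rolls of a die, faces 1..6.
observed = [18, 22, 20, 25, 15, 20]

# A fair die predicts 120 / 6 = 20 rolls per face.
expected = [20] * 6

chi2 = chi_square_statistic(observed, expected)
print(f"chi-square = {chi2:.2f}")  # prints chi-square = 2.90
```

The statistic is 0 when the observed counts match the model exactly and grows as they diverge; judging significance requires comparing it with a chi-square distribution with 5 degrees of freedom in this example.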