L5: Data science solutions Flashcards
CRISP-DM
Import –> Tidy <–> Visualize <–> Model <–> Transform
Six types of questions
Data analysis flowchart
Descriptive
Exploratory
Inferential
Predictive
Causal
Mechanistic
What is a model
A simplified representation of reality created to serve a purpose
What is a predictive model?
· A formula for estimating the unknown value of interest: the target
· The formula can be mathematical, a logical statement (e.g., a rule), etc.
What is prediction?
· Estimate an unknown value (i.e. the target)
Instance / example:
· Represents a fact or a data point
· Described by a set of attributes (fields, columns, variables, or features)
Model induction:
o The creation of models from data
Training data
The input data for the induction algorithm
Test data
Data used to test the model
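The terms above (instance, attribute, target, model induction, training data, test data) can be sketched in a few lines. This is only an illustrative sketch: the toy data, the attribute name hours_studied, the target passed, and the helper functions split, accuracy, and induce_rule are all hypothetical and not from the lecture. The "induction algorithm" here is a deliberately simple threshold search, standing in for a real learning algorithm.

```python
import random

# Hypothetical toy data: each instance has one attribute (hours_studied)
# and a target (passed). The values are made up for illustration.
instances = [{"hours_studied": h, "passed": h >= 5} for h in range(1, 11)]


def split(data, test_fraction=0.3, seed=0):
    """Shuffle the instances and return (training_data, test_data)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(threshold, data):
    """Share of instances whose target the rule predicts correctly."""
    hits = sum((row["hours_studied"] >= threshold) == row["passed"] for row in data)
    return hits / len(data)


def induce_rule(training_data):
    """Model induction: pick the threshold that best fits the training data."""
    candidates = sorted({row["hours_studied"] for row in training_data})
    return max(candidates, key=lambda t: accuracy(t, training_data))


training_data, test_data = split(instances)
# The induced model is the logical rule "passed if hours_studied >= threshold".
threshold = induce_rule(training_data)
print("induced threshold:", threshold)
print("training accuracy:", accuracy(threshold, training_data))
print("test accuracy:", accuracy(threshold, test_data))
```

The test data is kept out of induce_rule on purpose: it is only used afterwards, to check how well the induced rule predicts the target for instances it has not seen.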
How to choose a model
Based on the question!!
· Descriptive questions demand descriptive statistics, or unsupervised learning
· Predictive questions are best answered with predictive models, e.g. machine learning
· For all other questions, inferential statistics are probably your best bet
· Pay attention to:
o Your dependent variable: its kind
o Your independent variables: their kind and number
Assumptions
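The "based on the question" rule of thumb can be written as a small lookup. This is a minimal sketch that only restates the bullets above; the function name suggest_approach is hypothetical, and the returned strings are a memory aid, not a full decision procedure (the kind and number of variables and the assumptions still have to be checked).

```python
def suggest_approach(question_type):
    """Map one of the six question types to the family of methods in the notes."""
    q = question_type.strip().lower()
    if q == "descriptive":
        return "descriptive statistics or unsupervised learning"
    if q == "predictive":
        return "predictive model, e.g. machine learning"
    if q in {"exploratory", "inferential", "causal", "mechanistic"}:
        # The notes say "all other questions": inferential statistics are probably the best bet.
        return "inferential statistics"
    raise ValueError(f"unknown question type: {question_type!r}")


for question in ["Descriptive", "Predictive", "Causal"]:
    print(question, "->", suggest_approach(question))
```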
Non-parametric tests
Non-parametric tests are used when the data do not meet the assumptions of parametric tests. This is the case if:
· the dependent variable is not a continuous interval-scaled or ratio-scaled variable
· the errors (also called residuals), which represent the difference between the expected or predicted values and the observed values, do not approximate a normal distribution
· the dependent variable is ordinal (it represents ranks)
Chi-square test
The test evaluates whether the observed frequencies in a contingency table match the expected frequencies if the two categorical variables are independent.
E.g., suppose you want to test whether there is an association between gender (male, female) and preference for a particular product (like, dislike).
Assumptions:
· Observations are independent of each other.
· Categories are mutually exclusive.
· Expected frequency for each cell should be 5 or more for a 2x2 table. For larger tables, at least 80% of the cells should have expected frequencies of 5 or more, and all cells should have expected frequencies of 1 or more.
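The gender by preference example can be worked through by hand in code. This is a sketch under stated assumptions: the observed counts are invented for illustration, the helper names are hypothetical, the statistic is the plain Pearson chi-square without a continuity correction, and the p-value shortcut via math.erfc is valid only for 1 degree of freedom (a 2x2 table). In practice a statistics library would normally be used instead.

```python
import math

# Hypothetical observed counts (made up): rows = gender (male, female),
# columns = preference (like, dislike).
observed = [[30, 20],
            [15, 35]]


def expected_frequencies(table):
    """Expected count per cell if rows and columns are independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row))


def expected_counts_ok(table):
    """Check the expected-frequency assumption listed in the notes."""
    cells = [e for row in expected_frequencies(table) for e in row]
    if len(table) == 2 and len(table[0]) == 2:
        return all(e >= 5 for e in cells)
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return share_at_least_5 >= 0.8 and all(e >= 1 for e in cells)


def p_value_df1(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom only."""
    return math.erfc(math.sqrt(statistic / 2))


expected = expected_frequencies(observed)
statistic = chi_square_statistic(observed)
p_value = p_value_df1(statistic)
print("expected frequencies:", expected)
print("assumption met:", expected_counts_ok(observed))
print("chi-square:", statistic, "p-value:", p_value)
```

A small p-value means the observed frequencies differ from the expected ones by more than independence would plausibly produce, so the two categorical variables appear to be associated.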