final Flashcards
benefits of python
large ecosystem
community support
readability and simplicity
versatility
interactivity
integration
python basic data types
str, float, int, bool
tuple vs list vs dictionary
tuple: (), immutable - its only methods are count() and index()
list: [], mutable
dictionary: {key: value}
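A minimal sketch of the three containers; the variable names and values are made up for illustration.

```python
# Tuple: fixed once created; count() and index() are its only methods.
point = (3, 4, 3)
threes = point.count(3)       # how many times 3 appears
first_four = point.index(4)   # position of the first 4

# List: mutable, so it can grow and change in place.
scores = [70, 85]
scores.append(90)
scores[0] = 75

# Dictionary: maps keys to values.
ages = {"ana": 30}
ages["ben"] = 25

# Trying to change a tuple element raises TypeError.
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False
```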
what is a function? what is a library?
function: block of organized, reusable code used to perform a single, related action
library: collections of pre-written code that provide ready-made functions and methods to accomplish specific tasks
* reusable code, promoting modularity, reducing redundancy + time used
common python libraries: numpy, pandas, matplotlib, scikit-learn, math
what does math.ceil() do?
numpy: numerical operations, arrays
pandas: data transformation, analysis, dataframes
matplotlib: data visualization, plotting
sklearn: tools for machine learning, predictive analytics
math: mathematical operations
math.ceil(): rounds up to integer
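A quick check of math.ceil() on a few made-up inputs. Note that "up" means toward positive infinity, so negative numbers move closer to zero.

```python
import math

up_positive = math.ceil(2.1)    # smallest integer >= 2.1
up_negative = math.ceil(-2.1)   # smallest integer >= -2.1
already_int = math.ceil(5.0)    # whole numbers are unchanged
```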
types of analytics
descriptive: what happened?
predictive: what could happen?
prescriptive: what should happen?
diagnostic: why did it happen?
predictive analytics: noise, models, environment
noise: other factors impacting observations
models: mathematical approximations
environment: success depends on environment
linear regression
one continuous response variable, one or more continuous explanatory variables
use x to predict y by mapping a straight line through the data
the line is determined by OLS
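A minimal sketch of OLS for one explanatory variable, using the textbook closed-form formulas in plain Python. The function name fit_line and the data are illustrative assumptions; in practice a library such as scikit-learn would do this.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
intercept, slope = fit_line(xs, ys)
predicted_at_6 = intercept + slope * 6
```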
types of predictive analytics
predict values
* exact value
* probability
* proportion
predict categories
* nominal groups
* ordinal groups (probability groups)
assumptions of linear regression
- error terms follow normal dist
- mean of error terms = 0
- variance of the error terms is constant, and independent of X
- error terms are independent of each other
- no multicollinearity
interpreting model estimates: coef, SD, t, P
coef: constant/slope of each term
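SD: standard error of the coefficient; how much the estimate would vary from sample to sample
t: coef / SD; the further from 0, the stronger the evidence that the true coefficient is not 0
P: probability of a t at least this extreme if the true coefficient were 0; < 0.05 is the usual cutoff for significance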
R squared
adjusted R-squared?
how much variation in Y is explained by X in linear regression
* never decreases when more variables are included, even uninformative ones
* adjusted R-squared: adjusts for multiple predictors, decreases when additional variables do not contribute to model’s significance
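A sketch of both measures using the standard formulas R^2 = 1 - SS_res/SS_tot and adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. The function names and numbers are made up for illustration.

```python
def r_squared(ys, preds):
    """Share of the variation in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in ys)            # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalize R-squared for the number of predictors k given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Made-up observations and fitted values from a one-predictor model
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
preds = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(ys, preds)
adj_one = adjusted_r_squared(r2, n=5, k=1)
# Same R-squared but with a second predictor that adds nothing: the adjusted value drops
adj_two = adjusted_r_squared(r2, n=5, k=2)
```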
continuous vs binary response
continuous:
* values are (-inf, +inf) or (0, +inf)
* fits straight line
* ex. profitability, attendance, capacity
binary:
* {0, 1}
* logit line
* ex. win/loss, survival, normal/failure
types of binary responses
winning percentage
probability of 1
failure rate
winning prob
what is logistic regression?
continuous/discrete variable predicts binary categorical variable
how do we transform linear y to a probability dist? how do we calculate odds?
assumptions of logit
y = 1 if y* >= 0 (y* is the underlying linear score)
y = 0 if y* < 0
odds = Pr(Y=1)/Pr(Y = 0)
error term follows a logistic distribution
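A sketch of the transformation: with a logistic error term, Pr(Y=1) is the logistic function of the linear score, and the odds follow from that probability. The function names are illustrative.

```python
import math

def logistic(z):
    """Map a linear score in (-inf, +inf) to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def odds(p):
    """Odds = Pr(Y=1) / Pr(Y=0)."""
    return p / (1 - p)

p_zero = logistic(0.0)       # a score of 0 gives an even chance
odds_80 = odds(0.8)          # an 80% chance corresponds to odds of about 4 to 1
# The log of the odds recovers the linear score
recovered = math.log(odds(logistic(1.5)))
```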
interpretation of logistic regression: coefficient, z-value, p-value
for each 1 unit increase in Xk, the odds are multiplied by exp(Bk), holding the other variables constant
z-value: how many standard errors the estimated coefficient is away from 0 on a standard normal curve; |z| should be above about 2 (1.96 at the 5% level) to be statistically significant
p-value: < 0.05 to be statistically significant
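A small numeric check of the coefficient interpretation. The coefficients b0 and b1 are hypothetical, not estimates from any real model.

```python
import math

# Hypothetical fitted model: log(odds) = b0 + b1 * x
b0, b1 = -1.0, 0.7

def odds_at(x):
    """Odds of Y=1 at a given x under the hypothetical model."""
    return math.exp(b0 + b1 * x)

# Raising x by one unit multiplies the odds by exp(b1), wherever you start
ratio_low = odds_at(3) / odds_at(2)
ratio_high = odds_at(11) / odds_at(10)
multiplier = math.exp(b1)
```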
accuracy/hit rate, true positive rate, true negative rate
accuracy rate: correct predictions/all predictions
TPR: true positives/all actual positives
TNR: true negatives/all actual negatives
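A sketch computing the three rates from made-up labels, using the standard definitions: TPR divides by the actual positives (TP + FN) and TNR by the actual negatives (TN + FP).

```python
# Made-up actual outcomes and model predictions (1 = positive, 0 = negative)
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives

accuracy = (tp + tn) / len(pairs)
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
```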
machine learning + types
machine learning: gives computers the ability to learn without being explicitly programmed
- supervised: we know the results so we can check accuracy; inputs > training > outputs
- unsupervised: does not predict anything, just identifies patterns/structures (ex. clustering); inputs > outputs
- reinforcement: learns from +ve and -ve reinforcement to maximize rewards; inputs > outputs > rewards
regression vs classification tree
regression: response variable continuous; each leaf predicts a value (the mean of the training observations in it)
classification: response variable discrete/categorical; classifies to categories
overfitting, pruning, cross-validation, random forests
what do CARTs depend on?
overfitting: model describes random error/noise instead of underlying relationship; easily happens in CART models
* pruning: pruning leaves to reduce overfitting
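cross-validation: repeatedly splits the data into training and validation folds, so every model is scored on data it was not trained on; used to detect overfitting and choose the best model
random forests: construct many decision trees at training time and output the mode of the classes (classification) or the mean prediction (regression)
CARTs depend on: training sample, variables used, algorithms used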
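A sketch of how k-fold cross-validation divides the rows, in plain Python. The helper k_fold_indices is a hypothetical name, and this version does not shuffle the rows first, which real implementations usually offer as an option.

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k folds over n rows."""
    indices = list(range(n))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
# Each row is used for validation exactly once across the folds
validated = sorted(i for _, validation in folds for i in validation)
```

Each candidate model would be trained on `train` and scored on `validation` for every fold, and the average score used to compare models.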
gini impurity, entropy
both are impurity measures used to choose which attribute to split on, starting at the top of the tree
gini impurity: weighted avg of 1 - P(true)^2 - P(false)^2 for each leaf
entropy: -P(true)*log2(P(true)) - P(false)*log2(P(false)) for each leaf > used to calculate information gain (entropy before - weighted avg entropy after)
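A sketch of both measures for a True/False split, using the standard two-class formulas. The function names and leaf counts are made up for illustration.

```python
import math

def gini(p_true):
    """Gini impurity of a leaf: 1 - P(true)^2 - P(false)^2."""
    return 1 - p_true ** 2 - (1 - p_true) ** 2

def entropy(p_true):
    """Entropy of a leaf in bits; a term with probability 0 contributes nothing."""
    total = 0.0
    for p in (p_true, 1 - p_true):
        if p > 0:
            total -= p * math.log2(p)
    return total

def weighted(measure, leaves):
    """Weighted average of a measure over leaves given as (n_true, n_false)."""
    n_total = sum(t + f for t, f in leaves)
    return sum((t + f) / n_total * measure(t / (t + f)) for t, f in leaves)

# Hypothetical split: a 5/5 parent node divided into two leaves
leaves = [(4, 1), (1, 4)]
gini_after = weighted(gini, leaves)
info_gain = entropy(0.5) - weighted(entropy, leaves)  # entropy before - after
```

The attribute whose split gives the lowest weighted gini impurity, or the highest information gain, would be placed at the top of the tree.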
prescriptive analytics + subcategories
examples
takes what we know (descriptive) to forecast what could happen (predictive) and decide what to do (prescriptive)
produces a reliable path to an optimal solution to business needs
characterized by rules, constraints, thresholds
helps managers make decisions under complex environments
* mathematical programming
* evolutionary computation
* probabilistic models
* simulation
* logic-based models
farm, self-driving car, flight prices
decision, objective, predictive models, environment
decision → decision variables are the input
objectives and measurable outcomes → output
predictive models → understanding input/output relationship
environment → complexity