quiz2 Flashcards
there are three types of data: quantitative, ordinal, and nominal. describe them
quantitative: numeric values with magnitude (think numbers)
ordinal: values or categories that can be ordered (think grades)
nominal: values or categories that can't be ordered (think colours)
what is the point of inferential statistics
to use well-chosen samples to come to a probably-correct conclusion about the population
what is a probability distribution
the description of probabilities for all possible outcomes
what is covariance. when is it a + covariance and when is it a - covariance
it describes how two variables are related (how they vary together).
+: large x tends to come with large y
-: large x tends to come with small y
in inferential statistics we need to form a hypothesis: we need a null hypothesis and an alternate hypothesis. What's the difference? What do we need to remember about hypotheses?
Alternate hypothesis is usually what we are hoping to conclude, null hypothesis is the opposite.
these two hypotheses have to cover all possibilities
we assume the null is true and look for the data to force us to conclude that it isn't. If the data does force that conclusion, we have a proof by contradiction and we can accept the alternate as true.
we can never conclude the null is true. We can only falsify it
what does the T-test do and what is its null hypothesis
if we have two samples that are both normally distributed and have equal variance, the T-test will tell us if the distributions have different means
the samples MUST BE normally distributed and have equal variance
null hypothesis: means are equal
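A minimal sketch of running the test with scipy, on two made-up samples (the names and numbers are just placeholders):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, 50)   # hypothetical sample 1
group_b = rng.normal(11, 2, 50)   # hypothetical sample 2

# equal_var=True is the classic Student t-test (assumes equal variance)
p = stats.ttest_ind(group_a, group_b, equal_var=True).pvalue
print(p)   # if p < 0.05, reject the null that the means are equal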
what does the p-value tell us
every inferential test ends up with a probability (the p-value): the probability of seeing our data if the null hypothesis were true. Alternatively, you can think of it as: if the p-value is small, the data would be unlikely under the null, so we can reject the null hypothesis and accept the alternative hypothesis.
if smaller than 0.05 we reject the null hypothesis. If greater than 0.05 we do not reject it.
what do you do if you don’t know if a distribution is normal or not
use stats.normaltest
whose null hypothesis is: the data is normally distributed
stats.normaltest(data).pvalue
if the p-value returned is > 0.05 we cannot falsify the null hypothesis, so we treat the data as normal (enough)
what do you do if you don’t know if two distributions have equal variance
use Levene's test
which has the null hypothesis: the two samples do have equal variance
stats.levene(data1, data2).pvalue
if p-value >0.05 we can assume they do have equal variance because we cannot falsify the null hypothesis
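A minimal sketch of doing both pre-checks before a t-test, on made-up samples:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data1 = rng.normal(10, 2, 50)   # made-up sample 1
data2 = rng.normal(11, 2, 50)   # made-up sample 2

# null: the data is normal
print(stats.normaltest(data1).pvalue, stats.normaltest(data2).pvalue)
# null: the variances are equal
print(stats.levene(data1, data2).pvalue)
# if all three p-values are > 0.05, the t-test's assumptions look OK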
we can transform data if it isn’t normal to make it normal enough.
Assuming all data are greater than 0, what are the 4 ways you can transform data and when would they be useful
e^x (if data left-skewed, longer on left)
x^2 (if data left-skewed, longer on left)
root(x) (if data right-skewed, longer on right)
log(x) (if data right-skewed, longer on right)
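A quick sketch of the four transforms with NumPy on a made-up positive, right-skewed sample:

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(0, 1, 1000)   # made-up right-skewed data, all values > 0

x_log = np.log(x)      # strong fix for right skew
x_sqrt = np.sqrt(x)    # milder fix for right skew
x_sq = np.square(x)    # would be used on left-skewed data instead
x_exp = np.exp(x)      # would be used on left-skewed data instead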
what is the issue with doing t-tests on 2+ datasets
what should we do to prevent the issue
if you do multiple t-tests, it increases the likelihood of an incorrect rejection of the null hypothesis (a false positive)
instead you should use the Bonferroni correction, where you choose a threshold of 0.05/(number of t-tests conducted)
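A tiny sketch of the correction, assuming three hypothetical t-tests were run:

num_tests = 3
alpha = 0.05 / num_tests              # Bonferroni-corrected threshold
p_values = [0.030, 0.004, 0.200]      # made-up p-values from the three tests
print([p < alpha for p in p_values])  # only p-values below 0.05/3 count as rejections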
What is ANOVA and its purpose
to test whether the means of any of the groups differ; it is like a t-test but for more than 2 groups
Requirements:
- observations must be independent and identically distributed
- normally distributed
- equal variance
Null hypothesis: groups have the same mean
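A minimal sketch with scipy's one-way ANOVA on three made-up groups:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(10, 2, 40)
g2 = rng.normal(10, 2, 40)
g3 = rng.normal(12, 2, 40)

# null hypothesis: all groups have the same mean
p = stats.f_oneway(g1, g2, g3).pvalue
print(p)   # p < 0.05 means at least one group's mean differs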
What does ANOVA not tell you. What do you need to do to find out.
ANOVA tells you that there is a difference in the means (if there is) but we don’t know which groups have the different mean.
Use Post Hoc Analysis, but only if the ANOVA p-value is less than 0.05,
i.e. the groups do not all have the same mean
How do we use the Post Hoc Analysis: Tukey's HSD, and what does it return
use pandas' melt (pd.melt) to get the data into the format you want (unpivoted/long data), and then you can use the post hoc Tukey HSD test
it returns a list of pairs of groups and tells us if they have different means. The reject column tells us: if True, they are different
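A minimal sketch with pandas and statsmodels; the column and group names are made up:

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# hypothetical 'wide' data: one column of values per group
df = pd.DataFrame({'g1': [1, 2, 3, 2], 'g2': [4, 5, 4, 6], 'g3': [7, 8, 9, 8]})

# melt to long/unpivoted form: one 'group' column and one 'value' column
melted = pd.melt(df, var_name='group', value_name='value')

posthoc = pairwise_tukeyhsd(melted['value'], melted['group'], alpha=0.05)
print(posthoc)   # one row per pair of groups; the reject column marks different means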
give an example of where you would want to use a one sided tail test rather than two.
what is a way you can conduct this test
if you only want to determine if there is a difference between groups in a specific direction (e.g. will studying get me a better grade?)
conduct your test and look at the p-value, but change the significance level to 0.10
a two-sided test where p < 0.10 (with the difference in the expected direction) is the same as a one-sided test where p < 0.05
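A small sketch of both routes on made-up samples; note that the alternative= argument is only available in newer SciPy versions:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
studied = rng.normal(80, 5, 30)       # made-up grades
not_studied = rng.normal(75, 5, 30)

# route 1: two-sided test with the doubled significance level (0.10)
p_two_sided = stats.ttest_ind(studied, not_studied).pvalue
print(p_two_sided < 0.10 and studied.mean() > not_studied.mean())

# route 2: ask for the one-sided test directly and compare to 0.05
p_one_sided = stats.ttest_ind(studied, not_studied, alternative='greater').pvalue
print(p_one_sided < 0.05)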
when is a Mann-Whitney U-test used and what is it good for
what is the null hypothesis
it is used when you know nothing about your distribution or you cannot transform it into a normal distribution.
it is used to check for a difference in the distributions of two independent samples
can be used on ordinal or continuous data
null: there is no significant difference between the groups' distributions
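A minimal sketch with scipy on two made-up non-normal samples:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.exponential(1.0, 50)   # made-up, clearly non-normal sample
b = rng.exponential(1.5, 50)

# null: no difference between the two groups' distributions
p = stats.mannwhitneyu(a, b, alternative='two-sided').pvalue
print(p)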
what is chi-square, what does it do, why is it used
what is the null hypothesis
what does it need to run
chi-square is used for categorical data with little structure
it tells you if the categories (the two categorical variables) are independent
null: the categories are independent (e.g. university does not affect your happiness)
a contingency table
what is a way to produce a contingency table for categorical data for a chi-square test
pandas' crosstab function
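A minimal sketch with pandas and scipy; the column names and categories are made up:

import pandas as pd
from scipy import stats

# hypothetical categorical data
df = pd.DataFrame({
    'university': ['u1', 'u2', 'u1', 'u2', 'u1', 'u2', 'u1', 'u2'],
    'happy':      ['yes', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no'],
})

# contingency table: counts for each combination of categories
contingency = pd.crosstab(df['university'], df['happy'])

# null hypothesis: the two variables are independent
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(p)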
when training a model on data that looks linear, what model should you use
Linear Regression: it fits a straight line through the input/training data (the best-fit line) and estimates are made on this line
what do you do if you don’t have data that a linear regression can cleanly fit through
use polynomial regression
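A minimal sketch of both with scikit-learn on made-up data with a non-linear shape:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = 2 * X[:, 0] ** 2 + rng.normal(0, 5, 100)   # clearly non-linear relationship

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(linear.score(X, y), poly.score(X, y))   # the polynomial fit should score closer to 1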
when validating the training what are you looking for in the score returned by
model.score(X_valid, y_valid)
a high(er) number = a better fit, i.e. a number closer to 1 (for regressors the score is R²; for classifiers it is accuracy)
describe the Naive Bayes method
sometimes regression doesn't work because it assumes a continuous value that we can map to.
sometimes we only have categories, so instead we ask: which category are you most likely to be a part of?
There are a few ways to classify; Naive Bayes is one of them
get the probability of the input being in each category. After, we look for the category with the highest probability and that becomes our category.
For this method we assume that the input features are independent of (not related to) each other
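A minimal sketch with scikit-learn's Gaussian Naive Bayes on made-up two-category data:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)   # two categories

model = GaussianNB()
model.fit(X, y)
print(model.predict([[2.5, 2.5]]))        # most likely category for this input
print(model.predict_proba([[2.5, 2.5]]))  # probability of each category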
what are Bayesian priors used for
they define the probability of finding each category before we start predicting; i.e. we state how likely each category is up front, which gives the categories weight before we even look at the data
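A tiny sketch of setting priors by hand on the same kind of made-up data; the 0.9/0.1 split is just an assumption for illustration:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# tell the model category 0 is believed to be far more common before seeing data
model = GaussianNB(priors=[0.9, 0.1])
model.fit(X, y)
print(model.predict_proba([[1.5, 1.5]]))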
what is the nearest neighbors method of classification in model training
look at the k nearest neighbors (the k most similar training points) to make a prediction about which category you would most likely fit into
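A minimal sketch with scikit-learn's k-nearest-neighbors classifier on made-up data:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = KNeighborsClassifier(n_neighbors=5)   # k = 5
model.fit(X, y)
print(model.predict([[2.5, 2.5]]))   # the category of the majority of the 5 nearest points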
what is the decision trees classification technique
what are some useful settings when making decision trees
build a big nested if/else structure. At the end of each branch, make a prediction.
we can limit the depth (height), leaf size, and number of splits of the decision tree to stop us from overfitting the data
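A minimal sketch of those limits with scikit-learn on made-up data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = DecisionTreeClassifier(
    max_depth=4,            # limit the tree's height
    min_samples_leaf=5,     # each leaf must cover at least 5 training points
    min_samples_split=10,   # only split nodes that still have at least 10 points
)
model.fit(X, y)
print(model.score(X, y))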
what is the point of an ensemble in decision trees
what is random forest in decision trees
what is boosting in decision trees
ensemble: combine multiple models to improve overall performance, since basic decision trees can overfit the data
random forests: build multiple decision trees using random subsets of the data. At the end, merge the trees' predictions together; this increases robustness
boosting: build decision trees sequentially, each tree correcting the errors of the previous one. The final prediction is the weighted sum of the individual tree predictions
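A minimal sketch of both ensembles with scikit-learn on made-up data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

forest = RandomForestClassifier(n_estimators=100, max_depth=4)   # many trees on random subsets
boost = GradientBoostingClassifier(n_estimators=100)             # trees built one after another

forest.fit(X, y)
boost.fit(X, y)
print(forest.score(X, y), boost.score(X, y))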
what is the point of PCA (principal component analysis)
it decreases the number of dimensions in data but keeps the information that matters
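A minimal sketch with scikit-learn, reducing made-up 10-dimensional data to 2 dimensions:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 10))   # made-up 10-dimensional data

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (200, 2)
print(pca.explained_variance_ratio_)   # how much of the variance each component keeps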
what is the difference between supervised and unsupervised learning
supervised: we work with examples where we know what the output should be
unsupervised: there is no known right answer prior
what should you do when you want to do a regression to predict a number but the data doesn't fit a linear model
use the k-nearest neighbors regressor: take the neighbors' values and find their mean; this is our prediction
random forest regressor: instead of each decision being a category, put a number value at each leaf
neural network regressor: take the inputs, weight them somehow, and have an activation function to normalize the results. Do some magic to learn the weights from the training data, and add extra layers of computation for more complex decisions
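A minimal sketch of all three regressors with scikit-learn on made-up non-linear data:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)   # clearly non-linear target

knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
forest = RandomForestRegressor(n_estimators=100).fit(X, y)
nn = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=2000).fit(X, y)

print(knn.score(X, y), forest.score(X, y), nn.score(X, y))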
what is clustering in ML techniques
it is an unsupervised technique where you find observations that are similar and group them together into a cluster
differing clustering algorithms will give different results and take different parameters
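A minimal sketch with scikit-learn's k-means (one possible clustering algorithm) on made-up data with three obvious groups:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2)),
               rng.normal(10, 1, (50, 2))])

kmeans = KMeans(n_clusters=3, n_init=10)   # we have to choose the number of clusters
labels = kmeans.fit_predict(X)
print(labels[:10])   # which cluster each observation was grouped into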
what is anomaly detection in ML techniques
it is another unsupervised technique where you find unusual observations that don't fit the rest of the data, and use them to try to detect more outliers
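A minimal sketch with scikit-learn's IsolationForest (one possible detector) on made-up data with a few planted outliers:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_points = rng.normal(0, 1, (200, 2))
outliers = rng.uniform(8, 10, (5, 2))        # a few planted anomalies
X = np.vstack([normal_points, outliers])

detector = IsolationForest(contamination=0.05)
labels = detector.fit_predict(X)    # -1 = flagged as unusual, 1 = normal
print(np.where(labels == -1)[0])    # indices of the unusual observations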
what is a neural network
neural network: a method in AI that teaches the computer to process data in a way that is inspired by the human brain. Where you will have interconnected nodes in a layered structure
you have a dataframe called searches and a column called uid; you want to create a new column called 'whichUI' based on whether the uid is even or odd
write it
searches['whichUI'] = np.where(searches['uid'] % 2 == 0, 'even', 'odd')