quiz2 Flashcards
there are three types of data: quantitative, ordinal, and nominal. describe them
quantitative: numeric values with magnitude (think numbers)
ordinal: values or categories that can be ordered (think grades)
nominal: values or categories that can't be ordered (think colours)
what is the point of inferential statistics
to use well-chosen samples to come to a probably-correct conclusion about the population
what is a probability distribution
the description of probabilities for all possible outcomes
what is covariance. when is it a + covariance and when is it a - covariance
it describes how two variables are related (how they vary together).
+: large x tends to come with large y
-: large x tends to come with small y
in inferential statistics we need to form a hypothesis: we need a null hypothesis and an alternate hypothesis. What's the difference? What do we need to remember about hypotheses?
Alternate hypothesis is usually what we are hoping to conclude, null hypothesis is the opposite.
these two hypotheses have to cover all possibilities
we assume the null is true and look for the data to force us to conclude that it isn't. If the data does force that conclusion, we have a proof by contradiction and we can accept the alternate as true.
we can never conclude the null is true. We can only falsify it
what does the T-test do and what is its null hypothesis
if we have two samples that are both normally distributed and have equal variance, the T-test will tell us if the distributions have different means
the samples MUST BE normally distributed and have equal variance
null hypothesis: means are equal
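A minimal sketch of running the test with scipy, on two made-up samples (the names and numbers are just placeholders):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, 50)   # hypothetical sample 1
group_b = rng.normal(11, 2, 50)   # hypothetical sample 2

# equal_var=True is the classic Student t-test (assumes equal variance)
p = stats.ttest_ind(group_a, group_b, equal_var=True).pvalue
print(p)   # if p < 0.05, reject the null that the means are equal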
what does the p-value tell us
every inferential test ends up with a probability (the p-value): the probability of seeing our data if the null hypothesis were true. Alternatively, you can think of it as: if the p-value is small, the data would be unlikely under the null, so we can reject the null hypothesis and accept the alternative hypothesis.
if smaller than 0.05 we reject the null hypothesis. If greater than 0.05 we do not reject it.
what do you do if you don’t know if a distribution is normal or not
use stats.normaltest
whose null hypothesis is: the data is normally distributed
stats.normaltest(data).pvalue
if the p-value returned is > 0.05 we cannot falsify the null hypothesis, so we treat the data as normal (enough)
what do you do if you don’t know if two distributions have equal variance
use Levene's test
which has the null hypothesis: the two samples do have equal variance
stats.levene(data1, data2).pvalue
if p-value >0.05 we can assume they do have equal variance because we cannot falsify the null hypothesis
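A minimal sketch of doing both pre-checks before a t-test, on made-up samples:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data1 = rng.normal(10, 2, 50)   # made-up sample 1
data2 = rng.normal(11, 2, 50)   # made-up sample 2

# null: the data is normal
print(stats.normaltest(data1).pvalue, stats.normaltest(data2).pvalue)
# null: the variances are equal
print(stats.levene(data1, data2).pvalue)
# if all three p-values are > 0.05, the t-test's assumptions look OK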
we can transform data if it isn’t normal to make it normal enough.
Assuming all data are greater than 0, what are the 4 ways you can transform data and when would they be useful
e^x (if data left-skewed, longer on left)
x^2 (if data left-skewed, longer on left)
root(x) (if data right-skewed, longer on right)
log(x) (if data right-skewed, longer on right)
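A quick sketch of the four transforms with NumPy on a made-up positive, right-skewed sample:

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(0, 1, 1000)   # made-up right-skewed data, all values > 0

x_log = np.log(x)      # strong fix for right skew
x_sqrt = np.sqrt(x)    # milder fix for right skew
x_sq = np.square(x)    # would be used on left-skewed data instead
x_exp = np.exp(x)      # would be used on left-skewed data instead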
what is the issue with doing t-tests on 2+ datasets
what should we do to prevent the issue
if you do multiple t-tests, it increases the likelihood of an incorrect rejection of the null hypothesis (a false positive)
instead you should use the Bonferroni correction, where you choose a threshold of 0.05/(number of t-tests conducted)
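A tiny sketch of the correction, assuming three hypothetical t-tests were run:

num_tests = 3
alpha = 0.05 / num_tests              # Bonferroni-corrected threshold
p_values = [0.030, 0.004, 0.200]      # made-up p-values from the three tests
print([p < alpha for p in p_values])  # only p-values below 0.05/3 count as rejections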
What is ANOVA and its purpose
to test whether the means of any of the groups differ; it is like a t-test but for more than 2 groups
Requirements:
- observations must be independent and identically distributed
- normally distributed
- equal variance
Null hypothesis: groups have the same mean
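A minimal sketch with scipy's one-way ANOVA on three made-up groups:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(10, 2, 40)
g2 = rng.normal(10, 2, 40)
g3 = rng.normal(12, 2, 40)

# null hypothesis: all groups have the same mean
p = stats.f_oneway(g1, g2, g3).pvalue
print(p)   # p < 0.05 means at least one group's mean differs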
What does ANOVA not tell you. What do you need to do to find out.
ANOVA tells you that there is a difference in the means (if there is) but we don’t know which groups have the different mean.
Use Post Hoc Analysis, but only if the ANOVA p-value is less than 0.05,
i.e. the groups do not all have the same mean
How do we use the Post Hoc Analysis: Tukey's HSD, and what does it return
use pandas' melt (pd.melt) to get the data into the format you want (unpivoted/long data), and then you can use the post hoc Tukey HSD test
it returns a list of pairs of groups and tells us if they have different means. The reject column tells us: if True, they are different
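A minimal sketch with pandas and statsmodels; the column and group names are made up:

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# hypothetical 'wide' data: one column of values per group
df = pd.DataFrame({'g1': [1, 2, 3, 2], 'g2': [4, 5, 4, 6], 'g3': [7, 8, 9, 8]})

# melt to long/unpivoted form: one 'group' column and one 'value' column
melted = pd.melt(df, var_name='group', value_name='value')

posthoc = pairwise_tukeyhsd(melted['value'], melted['group'], alpha=0.05)
print(posthoc)   # one row per pair of groups; the reject column marks different means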
give an example of where you would want to use a one sided tail test rather than two.
what is a way you can conduct this test
if you only want to determine if there is a difference between groups in a specific direction (e.g. will studying get me a better grade?)
conduct your test and look at the p-value, but change the significance level to 0.10
a two-sided test where p < 0.10 (with the difference in the expected direction) is the same as a one-sided test where p < 0.05
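A small sketch of both routes on made-up samples; note that the alternative= argument is only available in newer SciPy versions:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
studied = rng.normal(80, 5, 30)       # made-up grades
not_studied = rng.normal(75, 5, 30)

# route 1: two-sided test with the doubled significance level (0.10)
p_two_sided = stats.ttest_ind(studied, not_studied).pvalue
print(p_two_sided < 0.10 and studied.mean() > not_studied.mean())

# route 2: ask for the one-sided test directly and compare to 0.05
p_one_sided = stats.ttest_ind(studied, not_studied, alternative='greater').pvalue
print(p_one_sided < 0.05)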
when is a Mann-Whitney U-test used and what is it good for
what is the null hypothesis
it is used when you know nothing about your distribution or you cannot transform it into a normal distribution.
it is used to check for a difference in the distributions of two independent samples
can be used on ordinal or continuous data
null: there is no significant difference between the groups' distributions
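A minimal sketch with scipy on two made-up non-normal samples:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.exponential(1.0, 50)   # made-up, clearly non-normal sample
b = rng.exponential(1.5, 50)

# null: no difference between the two groups' distributions
p = stats.mannwhitneyu(a, b, alternative='two-sided').pvalue
print(p)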
what is chi-square, what does it do, why is it used
what is the null hypothesis
what does it need to run
chi-square is used for categorical data with little structure
it tells you if the categories (the two categorical variables) are independent
null: the categories are independent (e.g. university does not affect your happiness)
a contingency table
what is a way to produce a contingency table for categorical data for a chi-square test
pandas' crosstab function
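A minimal sketch with pandas and scipy; the column names and categories are made up:

import pandas as pd
from scipy import stats

# hypothetical categorical data
df = pd.DataFrame({
    'university': ['u1', 'u2', 'u1', 'u2', 'u1', 'u2', 'u1', 'u2'],
    'happy':      ['yes', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no'],
})

# contingency table: counts for each combination of categories
contingency = pd.crosstab(df['university'], df['happy'])

# null hypothesis: the two variables are independent
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(p)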
when training a model on data that looks linear, what model should you use
Linear Regression: it fits a straight line through the input/training data (the best-fit line) and estimates are made on this line
what do you do if you don’t have data that a linear regression can cleanly fit through
use polynomial regression
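A minimal sketch of both with scikit-learn on made-up data with a non-linear shape:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = 2 * X[:, 0] ** 2 + rng.normal(0, 5, 100)   # clearly non-linear relationship

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(linear.score(X, y), poly.score(X, y))   # the polynomial fit should score closer to 1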
when validating the training what are you looking for in the score returned by
model.score(X_valid, y_valid)
a high(er) number = a better fit, i.e. a number closer to 1 (for regressors the score is R²; for classifiers it is accuracy)
describe the Naive Bayes method
sometimes regression doesn't work because it assumes a continuous value that we can map to.
sometimes we only have categories, so instead we ask: which category are you most likely to be a part of?
There are a few ways to classify; Naive Bayes is one of them
get the probability of the input being in each category. After, we look for the category with the highest probability and that becomes our category.
For this method we assume that the input features are independent of (not related to) each other
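A minimal sketch with scikit-learn's Gaussian Naive Bayes on made-up two-category data:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)   # two categories

model = GaussianNB()
model.fit(X, y)
print(model.predict([[2.5, 2.5]]))        # most likely category for this input
print(model.predict_proba([[2.5, 2.5]]))  # probability of each category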
what are Bayesian priors used for
they define the probability of finding each category before we start predicting; i.e. we state how likely each category is up front, which gives the categories weight before we even look at the data
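A tiny sketch of setting priors by hand on the same kind of made-up data; the 0.9/0.1 split is just an assumption for illustration:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# tell the model category 0 is believed to be far more common before seeing data
model = GaussianNB(priors=[0.9, 0.1])
model.fit(X, y)
print(model.predict_proba([[1.5, 1.5]]))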
what is the nearest neighbors method of classification in model training
look at the k nearest neighbors (the k most similar training points) to make a prediction about which category you would most likely fit into
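A minimal sketch with scikit-learn's k-nearest-neighbors classifier on made-up data:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = KNeighborsClassifier(n_neighbors=5)   # k = 5
model.fit(X, y)
print(model.predict([[2.5, 2.5]]))   # the category of the majority of the 5 nearest points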
what is the decision trees classification technique
what are some useful settings when making decision trees
build a big nested if/else structure. At the end of each branch, make a prediction.
we can limit the depth (height), leaf size, and number of splits of the decision tree to stop us from overfitting the data
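A minimal sketch of those limits with scikit-learn on made-up data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = DecisionTreeClassifier(
    max_depth=4,            # limit the tree's height
    min_samples_leaf=5,     # each leaf must cover at least 5 training points
    min_samples_split=10,   # only split nodes that still have at least 10 points
)
model.fit(X, y)
print(model.score(X, y))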
what is the point of an ensemble in decision trees
what is random forest in decision trees
what is boosting in decision trees
ensemble: combine multiple models to improve overall performance, since basic decision trees can overfit the data
random forests: build multiple decision trees using random subsets of the data. At the end, merge the trees' predictions together; this increases robustness
boosting: build decision trees sequentially, each tree correcting the errors of the previous one. The final prediction is the weighted sum of the individual tree predictions
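A minimal sketch of both ensembles with scikit-learn on made-up data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

forest = RandomForestClassifier(n_estimators=100, max_depth=4)   # many trees on random subsets
boost = GradientBoostingClassifier(n_estimators=100)             # trees built one after another

forest.fit(X, y)
boost.fit(X, y)
print(forest.score(X, y), boost.score(X, y))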
what is the point of PCA (principal component analysis)
it decreases the number of dimensions in data but keeps the information that matters
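A minimal sketch with scikit-learn, reducing made-up 10-dimensional data to 2 dimensions:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 10))   # made-up 10-dimensional data

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (200, 2)
print(pca.explained_variance_ratio_)   # how much of the variance each component keeps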
what is the difference between supervised and unsupervised learning
supervised: we work with examples where we know what the output should be
unsupervised: there is no known right answer prior
what should you do when you want to do a regression to predict a number but the data doesn't fit a linear model
use the k-nearest neighbors regressor: take the neighbors' values and find their mean; this is our prediction
random forest regressor: instead of each decision being a category, put a number value at each leaf
neural network regressor: take the inputs, weight them somehow, and have an activation function to normalize the results. Do some magic to learn the weights from the training data, and add extra layers of computation for more complex decisions
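A minimal sketch of all three regressors with scikit-learn on made-up non-linear data:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)   # clearly non-linear target

knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
forest = RandomForestRegressor(n_estimators=100).fit(X, y)
nn = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=2000).fit(X, y)

print(knn.score(X, y), forest.score(X, y), nn.score(X, y))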
what is clustering in ML techniques
it is an unsupervised technique where you find observations that are similar and group them together into a cluster
differing clustering algorithms will give different results and take different parameters
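A minimal sketch with scikit-learn's k-means (one possible clustering algorithm) on made-up data with three obvious groups:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2)),
               rng.normal(10, 1, (50, 2))])

kmeans = KMeans(n_clusters=3, n_init=10)   # we have to choose the number of clusters
labels = kmeans.fit_predict(X)
print(labels[:10])   # which cluster each observation was grouped into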
what is anomaly detection in ML techniques
it is another unsupervised technique where you find unusual observations that don't fit the rest of the data, and use them to try to detect more outliers
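A minimal sketch with scikit-learn's IsolationForest (one possible detector) on made-up data with a few planted outliers:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_points = rng.normal(0, 1, (200, 2))
outliers = rng.uniform(8, 10, (5, 2))        # a few planted anomalies
X = np.vstack([normal_points, outliers])

detector = IsolationForest(contamination=0.05)
labels = detector.fit_predict(X)    # -1 = flagged as unusual, 1 = normal
print(np.where(labels == -1)[0])    # indices of the unusual observations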
what is a neural network
neural network: a method in AI that teaches the computer to process data in a way that is inspired by the human brain. Where you will have interconnected nodes in a layered structure
you have a dataframe called searches and a column called uid; you want to create a new column called 'whichUI' based on whether the uid is even or odd
write it
searches['whichUI'] = np.where(searches['uid'] % 2 == 0, 'even', 'odd')