data science - statistics I Flashcards by Zach M

EDA

Exploratory Data Analysis - the first step of a data science project, familiarizing yourself with the data

continuous data

data that can take any value in an interval (float)

discrete data

data that can only take integer values, such as counts (int)

categorical data

data that can take on only a specific set of values representing a set of possible categories

binary data

special case of categorical data with just two category values (ex: 0/1, true/false)

ordinal data

categorical data that has an explicit ordering

feature

often a column in a table, attribute/predictor of a row of data

record

a row in a table

data scientists use features to predict a target, while statisticians...

use predictor variables in a model to predict a response/dependent variable

trimmed mean

average of all values after dropping a fixed number of extreme values from each end of the sorted data
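
A minimal sketch of the idea, assuming the same number of values `p` is dropped from each end of the sorted data. The function name `trimmed_mean` and the sample scores are made up for illustration.

```python
def trimmed_mean(values, p):
    """Average of the values left after dropping the p smallest and p largest."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must satisfy 0 <= p < len(values) / 2")
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]
    return sum(kept) / len(kept)


# Made-up scores with one extreme value (98).
scores = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]
plain = sum(scores) / len(scores)   # 16.0, pulled up by the extreme value
trimmed = trimmed_mean(scores, 1)   # 7.5, after dropping 2 and 98
print(plain, trimmed)
```

With `p = 0` the function reduces to the ordinary mean.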

robust

not sensitive to extreme values

x-bar

the mean of a sample drawn from a population

reasons to use a weighted mean

1) observations that are highly variable may be given a lower weight (ex: a sensor that is less accurate)
2) the data collected doesn't equally represent the different groups that we are interested in measuring (ex: give greater weight to underrepresented groups)
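
As a rough illustration of reason 1, the sketch below down-weights one less accurate sensor. The readings, the weights, and the name `weighted_mean` are invented for the example.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Made-up temperature readings; the third sensor is assumed to be less accurate.
readings = [20.0, 22.0, 30.0]
weights = [2.0, 2.0, 1.0]

unweighted = sum(readings) / len(readings)   # 24.0
weighted = weighted_mean(readings, weights)  # (40 + 44 + 30) / 5 = 22.8
print(unweighted, weighted)
```

Giving every observation the same weight returns the ordinary mean.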

why is the median more robust than mean as an estimate of central tendency

It isn’t influenced by outliers / extreme cases that could skew the results
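
A quick way to see this, using Python's standard `statistics` module and made-up income figures: adding one extreme value moves the mean a lot and the median very little.

```python
import statistics

# Made-up incomes (in thousands); the second list adds one extreme value.
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [1000]

mean_before = statistics.mean(incomes)            # 35
mean_after = statistics.mean(with_outlier)        # about 195.8
median_before = statistics.median(incomes)        # 35
median_after = statistics.median(with_outlier)    # 36.5

print(mean_before, mean_after)
print(median_before, median_after)
```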

what is thought to be a compromise between mean and median

trimmed mean - robust to extreme values in data, but uses more data to calculate the estimate for central tendency than median

variance

sum of squared deviations from the mean divided by n - 1 (aka: mean-squared-error, though that term more often refers to the average squared prediction error of a model)

standard deviation

the square root of the variance (aka: l2-norm, Euclidean norm)
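
The two cards above can be sketched by hand and checked against the standard library. The helper names `sample_variance` and `sample_std` and the data are illustrative only; `statistics.variance` and `statistics.stdev` are the library functions that use the n - 1 divisor.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values; mean is 5
var = sample_variance(data)       # 32 / 7
std = sample_std(data)

print(var, statistics.variance(data))
print(std, statistics.stdev(data))
```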

range

difference between largest and smallest value in a dataset

interquartile range

the difference between the 75th percentile and the 25th percentile
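
A small sketch comparing the range and the interquartile range on made-up data with one extreme value. It uses `statistics.quantiles`, whose default "exclusive" method is only one of several conventions for interpolating percentiles, so other tools may report slightly different quartiles.

```python
import statistics

# Made-up data: 1 through 10 plus one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

value_range = max(data) - min(data)   # 99, dominated by the extreme value

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                         # 9.0 - 3.0 = 6.0

print(value_range, q1, q3, iqr)
```

The extreme value of 100 sets the range but leaves the IQR unchanged.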

mean absolute deviation

mean of the abs value of the deviations from the mean
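
A minimal sketch of this definition; the function name and the three sample values are chosen for illustration.

```python
def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


data = [1, 4, 4]                        # mean is 3
mad = mean_absolute_deviation(data)     # (2 + 1 + 1) / 3
print(mad)
```

Taking absolute values matters: the raw deviations (-2, 1, 1) always sum to zero.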

Why use n-1 instead of n when calculating variance?

Dividing by n gives a biased estimate that tends to underestimate the true variance and standard deviation of the population; dividing by n - 1 makes the variance estimate unbiased.
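
One way to see the effect is a small simulation, sketched here under stated assumptions: samples of size 5 are drawn from a normal population whose true variance is 1, and both estimates are averaged over many trials. The `ddof` parameter name is borrowed from NumPy's convention and is just a label in this sketch.

```python
import random


def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by (n - ddof)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


rng = random.Random(0)      # fixed seed so the run is repeatable
sample_size = 5
trials = 20000

total_n = 0.0               # running sum of estimates that divide by n
total_n_minus_1 = 0.0       # running sum of estimates that divide by n - 1
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    total_n += variance(sample, ddof=0)
    total_n_minus_1 += variance(sample, ddof=1)

avg_n = total_n / trials
avg_n_minus_1 = total_n_minus_1 / trials
print(avg_n, avg_n_minus_1)
```

On average the n version lands near (n - 1)/n = 0.8 of the true variance, while the n - 1 version lands near 1.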

what measure of variability is most robust to extremes / outliers?

median absolute deviation from the median (MAD) = median(|x1 - m|, |x2 - m|, … , |xN - m|), where m is the median
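
The formula above can be sketched with the standard `statistics` module. The function name and the data are illustrative; the standard deviation is printed only for comparison.

```python
import statistics


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    m = statistics.median(values)
    return statistics.median([abs(x - m) for x in values])


data = [1, 2, 3, 4, 100]                 # made-up data with one extreme value
mad = median_absolute_deviation(data)    # deviations 2, 1, 0, 1, 97 -> median 1
std = statistics.stdev(data)             # about 43.6, inflated by the 100

print(mad, std)
```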

Graph that shows the min/max, IQR, median

box plot, box and whisker plot

tally of the count of data falling into intervals/bins

frequency table

plot of the frequency table

histogram
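
The two cards above can be sketched together: a frequency table built from equal-width bins, then a crude text histogram of its counts. Equal-width bins spanning the minimum to the maximum are an assumption of this sketch, and the function name and data are made up.

```python
def frequency_table(values, num_bins):
    """Count the values that fall into num_bins equal-width bins from min to max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are equal; cannot form bins")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum value is placed in the last bin instead of a bin of its own.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = frequency_table(data, 4)   # bins of width 2 from 1 to 9

# Text "histogram": one row per bin, one '#' per record in that bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f} | {'#' * count}")
```

Changing the number of bins changes the shape of the picture, which is why bin width is usually worth experimenting with.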

smoothed version of a histogram

density plot

When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence.

expected value

How is the expected value calculated (2 steps)

A marketer for a new cloud technology, for example, offers two levels of service, one priced at $300/month and another at $50/month. The marketer offers free webinars to generate leads, and the firm figures that 5% of the attendees will sign up for the $300 service, 15% for the $50 service, and 80% will not sign up for anything. This data can be summed up, for financial purposes, in a single “expected value,” which is a form of weighted mean in which the weights are probabilities.
The expected value is calculated as follows:
1. Multiply each outcome by its probability of occurring.
2. Sum these values.
In the cloud service example, the expected value of a webinar attendee is thus $22.50 per month, calculated as follows:
EV = (0.05)(300) + (0.15)(50) + (0.80)(0) = 22.5
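
The two steps can be sketched in a few lines, using the prices and sign-up probabilities from the webinar example. The function name `expected_value` is illustrative.

```python
def expected_value(outcomes):
    """Multiply each outcome by its probability, then sum the products."""
    total_probability = sum(p for _, p in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in outcomes)


# (monthly revenue, probability) for each possible result of one attendee.
webinar_outcomes = [(300, 0.05), (50, 0.15), (0, 0.80)]
ev = expected_value(webinar_outcomes)
print(round(ev, 2))
```

This is the same calculation as a weighted mean in which the weights are the probabilities.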

Exploratory data analysis often begins with what 3 things?

1. univariate analysis 2. examining correlation among predictors (features) 3. examining correlation between the features and the target

A metric that measures the extent to which numeric variables are associated with one another (ranges from –1 to +1).

correlation coefficient (r)

A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.

correlation matrix

A plot in which the x-axis is the value of one variable, and the y-axis the value of another.

scatterplot

How do we compute the correlation coefficient?

we multiply the deviations from the mean for variable 1 by those for variable 2, sum the products, and divide by (n - 1) times the product of the standard deviations: $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$
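
A direct translation of that calculation, as a sketch: the function name `pearson_r` and the data are made up, and the standard deviations use the n - 1 divisor to match the formula.

```python
import math


def pearson_r(xs, ys):
    """Sum of cross-deviation products divided by (n - 1) * s_x * s_y."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)


hours = [1, 2, 3, 4, 5]     # made-up data
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)   # 6 / sqrt(60), about 0.77
print(round(r, 4))
```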

what is an example where correlation coefficient isn't a useful metric

when the relationship is not linear
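
A small made-up example: y is completely determined by x (y = x squared), yet the sum of cross-deviation products, which is the numerator of the correlation coefficient, comes out to zero, so the coefficient is 0.

```python
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]        # perfect, but non-linear, dependence on x

x_bar = sum(xs) / len(xs)       # 0.0
y_bar = sum(ys) / len(ys)       # 4.0

# Numerator of the correlation coefficient: sum of (x - x_bar) * (y - y_bar).
cross_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
print(cross_products)           # positive and negative products cancel
```

A scatterplot of the same data would show the U-shaped relationship immediately.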

A tally of counts between two or more categorical variables.

Contingency tables
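
A minimal sketch of a contingency table using `collections.Counter`. The two categorical variables (a group label and a payment status) and all records are invented for illustration.

```python
from collections import Counter

# Made-up records: (group, status) pairs.
records = [
    ("A", "paid"), ("A", "paid"), ("A", "late"),
    ("B", "paid"), ("B", "late"), ("B", "late"),
]

counts = Counter(records)                        # tally of each (group, status) pair
groups = sorted({group for group, _ in records})
statuses = sorted({status for _, status in records})

# Rows are groups, columns are statuses, cells are counts.
table = {g: {s: counts[(g, s)] for s in statuses} for g in groups}

print("group", *statuses)
for g in groups:
    print(g, *(table[g][s] for s in statuses))
```

Pairs that never occur show up as 0, because `Counter` returns 0 for missing keys.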

A plot of two numeric variables with the records binned into hexagons.

Hexagonal binning

A plot showing the density of two numeric variables like a topographical map.

Contour plots

Similar to a boxplot but showing the density estimate.

Violin plots

scatterplots are good for smaller amounts of data; what are good alternatives for large amounts of data?

hexagonal binning, contour plots, heat maps

(39 cards)