data science - statistics I Flashcards

1
Q

EDA

A

Exploratory Data Analysis - the first step of a data science project: familiarizing yourself with the data

2
Q

continuous data

A

data that can take any value in an interval (float)

3
Q

discrete data

A

data that can only take integer values, such as counts (int)

4
Q

categorical data

A

data that can take on only a specific set of values representing a set of possible categories

5
Q

binary data

A

special case of categorical data with just two categories of values (ex: 0/1, true/false)

6
Q

ordinal data

A

categorical data that has an explicit ordering

7
Q

feature

A

often a column in a table, attribute/predictor of a row of data

8
Q

record

A

a row in a table

9
Q

data scientists use features to predict a target, while statisticians...

A

use predictor variables in a model to predict a response/dependent variable

10
Q

trimmed mean

A

avg of all values after dropping a fixed number of extreme values from each end of the sorted data
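
A minimal Python sketch of the trimmed mean, assuming the same number of values (k) is dropped from each end of the sorted data. The function name `trimmed_mean` and the sample data are illustrative, not from the source.

```python
def trimmed_mean(values, k):
    """Mean of the values left after dropping the k smallest and k largest."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= 2*k < len(values)")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Illustrative data with one extreme value (100).
data = [2, 3, 3, 4, 5, 5, 6, 100]
plain_mean = sum(data) / len(data)   # pulled upward by the extreme value
trimmed = trimmed_mean(data, 1)      # drops 2 and 100 before averaging
```

Here the plain mean is dominated by the single extreme value, while the trimmed mean stays close to the bulk of the data.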

11
Q

robust

A

not sensitive to extreme values

12
Q

x-bar

A

mean of a sample drawn from a population (the population mean is written μ)

13
Q

reasons to use a weighted mean

A

1) observations that are highly variable may be given a lower weight (ex: a sensor that is less accurate)
2) the data collected doesn’t equally represent the different groups that we are interested in measuring (ex: give greater weight to underrepresented minorities)
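
A short Python sketch of a weighted mean, using the less-accurate-sensor case as the example. The readings and weights are hypothetical values chosen for illustration, and `weighted_mean` is an assumed helper name.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical sensor readings; the third sensor is assumed to be less
# accurate, so it gets half the weight of the others.
readings = [10.0, 12.0, 20.0]
weights = [1.0, 1.0, 0.5]

unweighted = sum(readings) / len(readings)
weighted = weighted_mean(readings, weights)
```

Down-weighting the noisy reading moves the estimate toward the two sensors that are trusted more.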

14
Q

why is the median more robust than mean as an estimate of central tendency

A

It depends only on the values in the middle of the sorted data, so it isn’t influenced by outliers / extreme cases that could skew the results

15
Q

what is thought to be a compromise between mean and median

A

trimmed mean - robust to extreme values in data, but uses more data to calculate the estimate for central tendency than median

16
Q

variance

A

sum of squared deviations from the mean divided by n - 1 (aka: mean-squared-error)

17
Q

standard deviation

A

the square root of variance (aka: L2-norm, Euclidean norm)

18
Q

range

A

difference between largest and smallest value in a dataset

19
Q

interquartile range

A

the difference between the 75th percentile and the 25th percentile
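
A small Python sketch comparing the range and the interquartile range on data with and without an extreme value. Percentiles can be computed under several conventions; this sketch assumes the "inclusive" interpolation method of the standard library's `statistics.quantiles`, so other tools may give slightly different quartiles. The data are illustrative.

```python
import statistics


def data_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


def iqr(values):
    """75th percentile minus 25th percentile (inclusive interpolation)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 80]   # same data, one extreme value

range_clean, range_outlier = data_range(clean), data_range(with_outlier)
iqr_clean, iqr_outlier = iqr(clean), iqr(with_outlier)
```

The range changes drastically when one value is extreme, while the IQR of this example stays the same because it ignores the tails.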

20
Q

mean absolute deviation

A

mean of the abs value of the deviations from the mean

21
Q

Why use n-1 instead of n when calculating variance?

A

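Dividing by n underestimates the true variance and std of the population, because the deviations are measured from the sample mean rather than the true mean. Dividing by n - 1 makes the variance estimate unbiased.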
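
A minimal Python sketch of the two denominators, cross-checked against the standard library. The helper name `variance` and its `ddof` parameter (how much to subtract from n) are assumptions for illustration, and the sample is made up.

```python
import statistics


def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by n - ddof."""
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more than ddof values")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - ddof)


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

divide_by_n = variance(sample, ddof=0)           # smaller estimate
divide_by_n_minus_1 = variance(sample, ddof=1)   # usual sample variance
std = divide_by_n_minus_1 ** 0.5                 # standard deviation

# statistics.variance also divides by n - 1; statistics.pvariance divides by n.
library_sample_variance = statistics.variance(sample)
library_population_variance = statistics.pvariance(sample)
```

For any sample with at least two distinct values, dividing by n gives the smaller of the two numbers, which is the underestimate the card refers to.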

22
Q

what measure of variability is most robust to extremes / outliers?

A

median absolute deviation from the median (MAD) = median(|x1 - m|, |x2 - m|, …, |xN - m|), where m is the median
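
A short Python sketch of the median absolute deviation from the median, following the formula above. The function name and the data are illustrative. Note that some libraries multiply the result by a scaling constant, which is not done here.

```python
import statistics


def median_abs_deviation(values):
    """Median of the absolute deviations from the median."""
    m = statistics.median(values)
    return statistics.median([abs(x - m) for x in values])


data = [1, 2, 3, 4, 100]
more_extreme = [1, 2, 3, 4, 1000]   # same data, even larger extreme value

mad = median_abs_deviation(data)
mad_more_extreme = median_abs_deviation(more_extreme)

# For comparison: the standard deviation reacts strongly to the extreme value.
std = statistics.stdev(data)
std_more_extreme = statistics.stdev(more_extreme)
```

Making the extreme value ten times larger leaves the MAD unchanged in this example, while the standard deviation grows with it.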

23
Q

Graph that shows the min/max, IQR, median

A

box plot, box and whisker plot

24
Q

tally of the count of data falling into intervals/bins

A

frequency table
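
A minimal Python sketch of a frequency table with equal-width bins, plus a text rendering that mimics a histogram. The binning rule (equal widths from min to max, with the maximum placed in the last bin) is one common choice and is an assumption here; the function names and data are illustrative.

```python
def frequency_table(values, n_bins):
    """Return (bin_start, bin_end, count) for n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if n_bins < 1 or hi == lo:
        raise ValueError("need n_bins >= 1 and at least two distinct values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it to the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(n_bins)]


def text_histogram(table):
    """Render each bin as a row of '#' characters, one per record."""
    return "\n".join(f"{start:6.1f} to {end:6.1f} | {'#' * count}"
                     for start, end, count in table)


data = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]
table = frequency_table(data, 3)
print(text_histogram(table))
```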

25
Q

plot of the frequency table

A

histogram

26
Q

smoothed version of a histogram

A

density plot

27
Q

When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence.

A

expected value

28
Q

How is the expected value calculated (2 steps)

A

A marketer for a new cloud technology, for example, offers two levels of service, one priced at $300/month and another at $50/month. The marketer offers free webinars to generate leads, and the firm figures that 5% of the attendees will sign up for the $300 service, 15% for the $50 service, and 80% will not sign up for anything. This data can be summed up, for financial purposes, in a single “expected value,” which is a form of weighted mean in which the weights are probabilities.

The expected value is calculated as follows:

  1. Multiply each outcome by its probability of occurring.
  2. Sum these values.

In the cloud service example, the expected value of a webinar attendee is thus $22.50 per month, calculated as follows:
EV = (0.05)(300) + (0.15)(50) + (0.80)(0) = 22.5
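
The two steps can be sketched in Python using the cloud service numbers from the example. The function and variable names are illustrative, and the check that the probabilities sum to 1 is an added safeguard rather than part of the original procedure.

```python
import math


def expected_value(outcomes, probabilities):
    """Multiply each outcome by its probability, then sum the products."""
    if len(outcomes) != len(probabilities):
        raise ValueError("outcomes and probabilities must have the same length")
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(outcomes, probabilities))


monthly_revenue = [300, 50, 0]            # $300 plan, $50 plan, no signup
signup_probability = [0.05, 0.15, 0.80]   # shares of webinar attendees

ev = expected_value(monthly_revenue, signup_probability)
print(round(ev, 2))   # dollars per attendee per month
```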

29
Q

Exploratory data analysis often begins with what 3 things?

A

  1. univariate analysis
  2. examining correlation among predictors (features)
  3. examining correlation among features and the target

30
Q

A metric that measures the extent to which numeric variables are associated with one another (ranges from –1 to +1).

A

correlation coefficient (Pearson’s r)

31
Q

A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.

A

correlation matrix

32
Q

A plot in which the x-axis is the value of one variable, and the y-axis the value of another.

A

scatterplot

33
Q

How do we compute the correlation coefficient?

A

multiply the deviations from the mean for variable 1 by those for variable 2, sum the products, and divide by (n - 1) times the product of the standard deviations:

r = (Summation of (xi - xbar)(yi - ybar)) / ((n - 1) * SDx * SDy)
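
A minimal Python sketch of the formula above, using sample standard deviations (n - 1 denominator) from the standard library. The function name `pearson_r` and the data are illustrative.

```python
import statistics


def pearson_r(xs, ys):
    """Sum of products of deviations, divided by (n - 1) * SDx * SDy."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    cross_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross_products / ((n - 1) * sd_x * sd_y)


xs = [1, 2, 3, 4, 5]
r_increasing = pearson_r(xs, [2, 4, 6, 8, 10])   # y rises with x
r_decreasing = pearson_r(xs, [10, 8, 6, 4, 2])   # y falls as x rises
```

A perfectly increasing straight line gives a value at the +1 end of the scale and a perfectly decreasing one gives the -1 end, up to floating-point rounding.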

34
Q

what is an example where correlation coefficient isn’t a useful metric

A

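when the relationship is not linear (ex: a U-shaped relationship can have a correlation near 0 even though the variables are strongly related)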
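
A small Python sketch of a nonlinear case: y is completely determined by x (y = x squared), yet the correlation coefficient comes out as zero because the relationship is symmetric rather than linear. The data and the helper `pearson_r` are illustrative.

```python
import statistics


def pearson_r(xs, ys):
    """Pearson correlation using sample standard deviations."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    cross_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross_products / ((n - 1) * statistics.stdev(xs) * statistics.stdev(ys))


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]          # perfect U-shaped dependence on x

r_quadratic = pearson_r(xs, ys)   # no linear association despite the dependence
```

A scatterplot of these points would show the pattern immediately, which is why plots are worth checking alongside the coefficient.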

35
Q

A tally of counts between two or more categorical variables.

A

Contingency tables
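
A minimal Python sketch of building a contingency table by tallying pairs of categories. The records (a grade and a payment status per row) are hypothetical and exist only to show the counting.

```python
from collections import Counter

# Hypothetical records: (grade, status) for each row of a table.
records = [
    ("A", "paid"), ("A", "late"), ("B", "paid"), ("A", "paid"),
    ("B", "late"), ("B", "paid"), ("C", "paid"),
]

pair_counts = Counter(records)                       # tally each (grade, status) pair
grades = sorted({grade for grade, _ in records})     # row labels
statuses = sorted({status for _, status in records}) # column labels

# Nested dict: table[grade][status] -> count (0 when the pair never occurs).
table = {g: {s: pair_counts[(g, s)] for s in statuses} for g in grades}

for g in grades:
    print(g, [table[g][s] for s in statuses])
```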

36
Q

A plot of two numeric variables with the records binned into hexagons.

A

Hexagonal binning

37
Q

A plot showing the density of two numeric variables like a topographical map.

A

Contour plots

38
Q

Similar to a boxplot but showing the density estimate.

A

Violin plots

39
Q

scatterplots are good for smaller amounts of data, what are good alternatives when having large amounts of data

A

hexagonal binning, contour plots, heat maps