EDA Flashcards

1
Q

Continuous vs Discrete Data

A

Numerical data

1) Continuous: data that can take on any value in an interval
2) Discrete: data that can take on only integer values, such as counts

2
Q

Binary vs Ordinal Data

A

Categorical data: data that can take on only a specific set of values representing a set of possible categories

1) Binary: special case of categorical data with just 2 categories of values
2) Ordinal: categorical data that has an explicit ordering

3
Q

Advantages of explicitly identifying data types

A

1) tells software how statistical procedures, such as making a chart or fitting a model, should behave
Example: sklearn.preprocessing.OrdinalEncoder (scikit-learn's support for ordinal data)
2) storage and indexing can be optimized
3) the possible values a given categorical variable can take are enforced in the software

4
Q

Types of nonrectangular data structures

A

1) time series - successive measurements of the same variable, raw material for statistical forecasting methods
2) spatial data structure - mapping and location
3) graph data structures - physical, social, and abstract relationships

5
Q

Estimate

A

Getting a typical value for each feature: an estimate of where most of the data is located

6
Q

Mean

A

Average

Sum of all values divided by the number of values

7
Q

Weighted mean

A

Weighted average

Sum of each value multiplied by its weight, divided by the sum of the weights
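
The definition above can be sketched in a few lines of Python. The function name `weighted_mean` and the sample scores and weights are made up for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical data: the last score counts three times as much as the others.
scores = [80, 90, 100]
weights = [1, 1, 3]

plain = sum(scores) / len(scores)          # ordinary mean: 270 / 3
weighted = weighted_mean(scores, weights)  # (80 + 90 + 300) / 5
```

With equal weights the weighted mean reduces to the ordinary mean.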

8
Q

Median

A

Value such that half of the data lies above it and half below

9
Q

Percentile

A

Quantile

Value such that p percent of the data lies below

10
Q

Weighted median

A

Value in the sorted data such that half of the sum of the weights lies above it and half below
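
One way to compute this is to sort the data and walk through it until the running total of the weights reaches half of the total weight. The sketch below uses that approach. The name `weighted_median` is made up, and returning the first value that reaches the halfway point is only one of several conventions for handling ties.

```python
def weighted_median(values, weights):
    """First value (in sorted order) where the cumulative weight reaches half the total."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    pairs = sorted(zip(values, weights))  # sort by value, keeping each weight attached
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("cannot take the weighted median of empty data")


# Hypothetical data: the heavy weight on 4 pulls the weighted median upward.
heavy_tail = weighted_median([1, 2, 3, 4], [1, 1, 1, 5])
equal = weighted_median([3, 1, 2], [1, 1, 1])
```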

11
Q

Trimmed mean

A

Truncated mean
Average of all values after dropping a fixed number of extreme values from each end of the sorted data
Eliminates the influence of extreme values
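
A minimal sketch of the idea, assuming the same number `k` of values is dropped from each end of the sorted data. The function name and sample data are illustrative.

```python
def trimmed_mean(values, k):
    """Drop the k smallest and k largest values, then average the rest."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value after trimming")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value.
data = [2, 3, 4, 5, 100]
plain = sum(data) / len(data)    # pulled upward by the value 100
trimmed = trimmed_mean(data, 1)  # averages only 3, 4, 5
```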

12
Q

Robust

A

Not sensitive to extreme values

Median is a robust estimator

13
Q

Outlier

A

Data value that is very different from most of the data

14
Q

Metrics vs estimates

A

Estimate (statisticians): a value calculated from the data that accounts for uncertainty, drawing a distinction between what we see in the data and the theoretical true or exact state of affairs
Metric (data scientists): a value tied to the concrete business or organizational objectives that are the focus of data science

15
Q

Anomaly detection

A

The points of interest are the outliers, and the greater mass of data serves primarily to define the “normal” against which anomalies are measured

16
Q

Dispersion/variability

A

Measures whether the data values are tightly clustered or spread out
Heart of stats: measuring variability, reducing it, distinguishing random from real variability, identifying the various sources of real variability, and making decisions in the presence of it

17
Q

Deviations

A

Difference between the observed values and the estimate of location (mean)
Also called errors or residuals

18
Q

Variance

A

Sum of squared deviations from the mean divided by n-1, where n is the number of data values

Roughly an average of the squared deviations

Also called mean squared error

19
Q

Standard deviation

A

Square root of the variance

L2 norm, Euclidean norm
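
As a quick check of the variance and standard deviation definitions, the sketch below computes both by hand and compares them with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n-1. The data values are made up.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values; their mean is 5

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]

manual_var = sum(squared_deviations) / (n - 1)  # sum of squared deviations / (n - 1)
manual_sd = math.sqrt(manual_var)               # square root of the variance

library_var = statistics.variance(data)  # sample variance (n - 1 denominator)
library_sd = statistics.stdev(data)      # sample standard deviation
```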

20
Q

Mean absolute deviation

A

Mean of the absolute value of the deviations from the mean

L1 norm, Manhattan norm

21
Q

Median absolute deviation from the median

A

The median of the absolute value of the deviations from the median
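
The two deviation-based measures can be compared side by side. In this sketch the mean absolute deviation is taken around the mean and the median absolute deviation around the median, as the card titles say. The data are hypothetical and include one extreme value to show how differently the two react.

```python
import statistics

data = [1, 2, 3, 4, 100]  # hypothetical values with one outlier

mean = sum(data) / len(data)
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)

median = statistics.median(data)
median_abs_dev = statistics.median(abs(x - median) for x in data)
```

Here the mean absolute deviation is dominated by the value 100, while the median absolute deviation barely notices it, which is why the latter is considered robust.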

22
Q

Range

A

Difference between the largest and the smallest value in a dataset

23
Q

Order statistics

A

Metrics based on the data values sorted from smallest to biggest

24
Q

Percentile

A

Value such that P percent of the values take on this value or less and (100-P) percent take on this value or more

Quantile

25
Q

Interquartile range

A

Difference between the 75th percentile and the 25th percentile

IQR
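
A short sketch using `statistics.quantiles` from the Python standard library (Python 3.8 or later). Software packages differ in how they interpolate percentiles, so the `method="inclusive"` choice here is an assumption, and other tools may give slightly different quartiles for the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical values

# n=4 asks for the three cut points that split the data into quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # 75th percentile minus 25th percentile
```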

26
Q

Degrees of freedom

A

n-1 in the denominator instead of n

If you use n, you will underestimate the true value of the variance and the standard deviation in the population (a biased estimate)
If you use n-1, the variance becomes an unbiased estimate

Degrees of freedom take into account the number of constraints in computing an estimate
One constraint here: the standard deviation depends on calculating the sample mean, so there are n-1 degrees of freedom
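
The bias can be seen in a small simulation. This is only an illustrative sketch: it assumes a standard normal population (true variance 1), samples of size 5, and a fixed random seed, none of which come from the card itself.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
n, trials = 5, 5000     # small samples make the bias easy to see

biased, unbiased = [], []
for _ in range(trials):
    sample = [rng.gauss(0, 1) for _ in range(n)]  # population variance is 1
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)     # sum of squared deviations
    biased.append(ss / n)          # n in the denominator
    unbiased.append(ss / (n - 1))  # n - 1 in the denominator

avg_biased = statistics.fmean(biased)      # tends toward (n - 1) / n = 0.8
avg_unbiased = statistics.fmean(unbiased)  # tends toward the true variance, 1
```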

27
Q

Boxplot

A

Visualize distribution of the data

Top and bottom of the box are the 75th and 25th percentiles, respectively
Median is the horizontal line in the box

Whiskers - extend from the top and bottom of the box to indicate the range of the bulk of the data

28
Q

Frequency table

A

Tally of the count of numeric data values that fall into a set of intervals
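
A minimal sketch of building such a table with equal-width bins. The helper name `frequency_table`, the choice of three bins, and the data are all illustrative. Placing the maximum value in the last bin is one common convention, not the only one.

```python
def frequency_table(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if n_bins < 1 or hi == lo:
        raise ValueError("need at least one bin and a nonzero data range")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


data = [1, 2, 2, 3, 5, 8, 9, 10]  # hypothetical values
edges, counts = frequency_table(data, 3)
```

Plotting `counts` as bars over the intervals in `edges` gives the histogram of the same data.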

29
Q

Histogram

A

Plot of the frequency table with the bins on the x-axis and the count on the y-axis

30
Q

Density plot

A

Smooth version of the histogram, often based on a kernel density estimate

31
Q

Why make bins?

A

Both frequency tables and percentiles summarize the data by creating bins
In general, quartiles and deciles will have the same count in each bin (equal-count bins), but the bin sizes will differ; a frequency table has equal-size bins with differing counts
Bins that are too small make the result too granular, and the ability to see the bigger picture is lost

32
Q

Statistical moments

A

1) location
2) variability
3) skewness
4) kurtosis

33
Q

Skewness

A

Refers to whether the data is skewed to larger or smaller values

34
Q

Kurtosis

A

Propensity of the data to have extreme values
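
Skewness and kurtosis can both be computed from the central moments of the data. The sketch below uses the plain moment-based definitions (third and fourth central moments scaled by powers of the second). This is an assumption about convention: many libraries apply sample-size corrections or report excess kurtosis (this value minus 3).

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness (m3 / m2**1.5) and kurtosis (m4 / m2**2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("all values are identical; moments are undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical data: one symmetric set and one with a long right tail.
sym_skew, sym_kurt = skewness_and_kurtosis([1, 2, 3, 4, 5])
right_skew, right_kurt = skewness_and_kurtosis([1, 1, 1, 2, 10])
```

The symmetric data give a skewness of zero, while the long right tail gives a positive skewness and a larger kurtosis.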

35
Q

Density estimates, density plot

A

Smoothed histogram

A density plot corresponds to plotting the histogram as proportions rather than counts

36
Q

Mode

A

The most commonly occurring category or value in a data set
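
For categorical data the mode can be found by tallying the values. A small sketch with `collections.Counter`, using a made-up list of colors:

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "red", "blue"]  # hypothetical categories

counts = Counter(colors)                            # tally of each category
mode_value, mode_count = counts.most_common(1)[0]   # most frequent category and its count
```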

37
Q

Expected value

A

When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence

1) multiply each outcome by its probability of occurring
2) sum these values

  • combines the ideas of future expectations and probability weights
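
The two steps above can be sketched as follows. The outcomes and probabilities are hypothetical: say a customer pays 300 with probability 0.05, 50 with probability 0.15, and nothing otherwise.

```python
# Hypothetical (value, probability) pairs; the probabilities must sum to 1.
outcomes = [(300, 0.05), (50, 0.15), (0, 0.80)]

total_probability = sum(p for _, p in outcomes)
if abs(total_probability - 1.0) > 1e-9:
    raise ValueError("probabilities must sum to 1")

# Step 1: multiply each outcome by its probability. Step 2: sum the products.
expected_value = sum(value * p for value, p in outcomes)
```

The expected value here is about 22.5, a probability-weighted average rather than any single outcome that actually occurs.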
38
Q

Bar charts

A

Frequency or proportion for each category plotted as bars

39
Q

Pie charts

A

Frequency or proportion for each category plotted as wedges in a pie

40
Q

Correlation coefficient

A

A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1)

Multiply deviations from the mean for variable 1 by those for variable 2, average the products (dividing by n-1), and divide by the product of the standard deviations
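
A sketch of this calculation for Pearson's correlation coefficient, written out by hand so each step is visible. The function name `pearson_r` and the data are illustrative, and the n-1 divisor matches the sample standard deviation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: averaged product of deviations over the product of SDs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sum_products / ((n - 1) * sd_x * sd_y)


# Hypothetical data: y rises exactly with x, y_down falls exactly with x.
x = [1, 2, 3, 4]
r_up = pearson_r(x, [2, 4, 6, 8])
r_down = pearson_r(x, [8, 6, 4, 2])
```

Perfectly linear data give values at the ends of the range: close to +1 for `r_up` and close to -1 for `r_down`.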

41
Q

Correlation matrix

A

A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables

42
Q

Scatterplot

A

A plot in which the x-axis is the value of one variable and the y-axis is the value of another

43
Q

Contingency tables

A

A tally of counts between 2 or more categorical variables
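
For two categorical variables, the tally can be built by counting each pair of categories. The sketch below uses `collections.Counter`. The variable names and records (a loan grade paired with a payment status) are hypothetical.

```python
from collections import Counter

# Hypothetical records: (grade, status) for six loans.
records = [
    ("A", "paid"), ("A", "late"), ("B", "paid"),
    ("A", "paid"), ("B", "late"), ("B", "paid"),
]

table = Counter(records)  # count of each (grade, status) combination

grades = sorted({grade for grade, _ in records})
statuses = sorted({status for _, status in records})

# Print one row per grade and one column per status.
for grade in grades:
    row = [table[(grade, status)] for status in statuses]
    print(grade, dict(zip(statuses, row)))
```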

44
Q

Hexagonal binning

A

A plot of two numeric variables with the records binned into hexagons

45
Q

Contour plots

A

A plot showing the density of 2 numeric variables like a topographical map

46
Q

Violin plots

A

Similar to a boxplot but showing the density estimate

Plot a numeric variable against a categorical variable

47
Q

Boxplot

A

Visually compare the distributions of a numeric variable grouped according to a categorical variable