Exploratory Data Analysis Flashcards

1
Q

Data that are expressed on a numerical scale are what data type?

A

numeric

2
Q

Data that can take only a specific set of values representing a set of possible categories (enums, enumerated, factors, nominal) are what data type?

A

categorical

3
Q

Cite the two numerical data types

A

continuous and discrete

4
Q

Cite the two categorical data types

A

binary and ordinal

5
Q

Data that can take on any value in an interval (float, numeric)

A

continuous

6
Q

Data that can take only integer values, such as counts

A

discrete

7
Q

True or False

Data typing in software acts as a signal on how to process the data

A

True
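
As a small illustration of why the type matters, the sketch below sorts the same three values twice, once stored as text and once as integers. The variable names are made up for this example; only built-in Python behavior is used.

```python
# The same values stored under two different types.
as_text = ["10", "9", "2"]
as_numbers = [10, 9, 2]

# Strings sort character by character, integers sort by magnitude.
sorted_text = sorted(as_text)
sorted_numbers = sorted(as_numbers)

print(sorted_text)     # ['10', '2', '9']
print(sorted_numbers)  # [2, 9, 10]
```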

8
Q

Rectangular data (like a spreadsheet) is the basic structure for statistical and machine learning models. What is this structure called?

A

dataframe

9
Q

A column (series) within a table is commonly referred to as a _______?

A

feature

10
Q

Many data science projects involve predicting an ______?

A

outcome (dependent variable, response, target, output)

11
Q

A row in a table is referred to as a ______?

A

record
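
The terms dataframe, feature, record, and outcome can be tied together with a minimal pandas sketch. The column names and values below are hypothetical toy data, and treating `churned` as the outcome is an assumption made only for illustration.

```python
import pandas as pd

# Hypothetical toy data: each row is a record, each column is a feature.
df = pd.DataFrame({
    "weight_kg": [61.5, 72.0, 80.3],   # numeric, continuous
    "visits": [3, 1, 4],               # numeric, discrete
    "churned": [False, True, False],   # categorical, binary
})

feature = df["weight_kg"]  # one column (a pandas Series)
record = df.iloc[0]        # one row, selected by position
outcome = df["churned"]    # the column a model would try to predict

print(df.shape)  # (number of records, number of features)
```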

12
Q

What is the sum of all values divided by the number of values?

A

mean

13
Q

The sum of all values times a weight divided by the sum of the weights

A

weighted mean
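
The two definitions above (mean and weighted mean) can be checked by hand with a short sketch. The values and weights are made-up numbers chosen so the arithmetic is easy to follow.

```python
# Hypothetical values and weights, chosen for easy arithmetic.
values = [2.0, 4.0, 9.0]
weights = [1.0, 1.0, 2.0]

# Mean: sum of all values divided by the number of values.
mean = sum(values) / len(values)

# Weighted mean: sum of value * weight divided by the sum of the weights.
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(mean)           # 5.0
print(weighted_mean)  # 6.0
```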

14
Q

The value such that one-half of the data lies above it and one-half lies below it

A

median

15
Q

The value such that P percent of the data lies below

A

percentile (quantile)
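
A minimal sketch of the median and of a percentile follows. The `percentile` helper is a hypothetical function written for this card; it uses the nearest-rank convention, which is only one of several definitions, so library functions that interpolate may return slightly different values.

```python
import math
import statistics

def percentile(data, p):
    """Nearest-rank percentile: smallest value with at least p percent of data at or below it."""
    ordered = sorted(data)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data with an odd number of values.
data = [15, 3, 7, 1, 9, 11, 5, 13, 17]

median = statistics.median(data)
p25 = percentile(data, 25)
p50 = percentile(data, 50)
p75 = percentile(data, 75)

print(median, p25, p50, p75)  # 9 5 9 13
```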

16
Q

The value in the sorted data such that one-half of the sum of the weights lies above it and one-half lies below it

A

weighted median
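
One way to sketch a weighted median is to sort the values, accumulate the weights, and stop at the first value where the running total reaches half of the total weight. The `weighted_median` helper below is hypothetical and returns the lower weighted median when there is a tie; other conventions exist.

```python
def weighted_median(values, weights):
    """Return the first sorted value at which cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("values and weights must be non-empty")

# With equal weights the result matches the ordinary median.
equal = weighted_median([1, 2, 3], [1, 1, 1])

# A heavy weight on the largest value pulls the weighted median toward it.
skewed = weighted_median([1, 2, 3, 4], [1, 1, 1, 5])

print(equal, skewed)  # 2 4
```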

17
Q

The average of all values after dropping a fixed number of extreme values

A

trimmed mean (truncated mean)
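
The sketch below compares the ordinary mean with a trimmed mean on data containing one extreme value. The `trimmed_mean` helper and the choice to drop `k` values from each end are assumptions made for this example.

```python
def trimmed_mean(data, k):
    """Mean after dropping the k smallest and k largest values."""
    if 2 * k >= len(data):
        raise ValueError("k is too large for this data set")
    ordered = sorted(data)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]

plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 1)

print(plain)    # 22.0
print(trimmed)  # 3.0
```

The trimmed mean stays close to the bulk of the data, while the plain mean is pulled toward the extreme value.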

18
Q

Not sensitive to extreme values

A

Robust (resistant)

19
Q

What is a data value that is very different from most of the data?

A

Outlier (extreme value)

20
Q

The differences between the observed values and the estimate of location?

A

deviations

21
Q

The sum of squared deviations from the mean divided by n-1, where n is the number of data values.

A

variance

22
Q

The square root of the variance

A

standard deviation
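
As a worked check, the sketch below computes the variance as the sum of squared deviations from the mean divided by n-1, takes its square root, and compares both with Python's `statistics` module. The data values are made up.

```python
import math
import statistics

# Hypothetical data.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(data)
mean = sum(data) / n

# Variance: sum of squared deviations from the mean divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(variance)                   # 2.5
print(statistics.variance(data))  # same n - 1 definition
print(std_dev)
```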

23
Q

The mean of the absolute values of the deviations from the mean (L1-norm, Manhattan norm)

A

mean abs deviation

24
Q

The median of the absolute values of the deviations from the median

A

median abs deviation from the median
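
The mean absolute deviation and the median absolute deviation from the median can be compared on the same made-up data. This is a plain sketch of the two definitions; it applies no scaling constant, which some library implementations add.

```python
import statistics

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]

mean = statistics.mean(data)
median = statistics.median(data)

# Mean absolute deviation: mean of |x - mean|.
mean_abs_dev = statistics.mean(abs(x - mean) for x in data)

# Median absolute deviation: median of |x - median|.
median_abs_dev = statistics.median(abs(x - median) for x in data)

print(mean_abs_dev)    # 31.2
print(median_abs_dev)  # 1
```

The median-based measure barely reacts to the extreme value, which is why it is described as robust.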

25
The difference between the largest and smallest value in a data set.
range
26
Metrics based on the data values sorted from smallest to largest (ranks)
order statistics
27
The value such that P percent of the values take on this value or less and (100-P) percent take on this value or more
percentile
28
The difference between the 75th percentile and the 25th percentile (IQR)
interquartile range
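
The range and the interquartile range can be sketched from sorted data. The `percentile` helper is hypothetical and uses the nearest-rank convention; libraries that interpolate between values may give a slightly different IQR.

```python
import math

def percentile(data, p):
    """Nearest-rank percentile of the data."""
    ordered = sorted(data)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data.
data = [15, 3, 7, 1, 9, 11, 5, 13, 17]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# IQR: 75th percentile minus 25th percentile.
iqr = percentile(data, 75) - percentile(data, 25)

print(data_range)  # 16
print(iqr)         # 8
```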
29
The basic metric for location is the ____, but it can be sensitive to extreme values (_____).
mean, outliers
30
Location is one dimension in summarizing a _______?
feature
31
A second dimension, variability (dispersion), measures what?
whether the data values are tightly clustered or spread out.
32
What are the objectives being performed during the exploratory data analysis phase (EDA)?
```
elements of structured data
estimates of location
estimates of variability (dispersion metrics)
exploring the data distribution
exploring binary and categorical data
correlation
exploring two or more variables
```
33
Estimates of location can be described by?
mean, median and robust estimates.
34
What is being described in estimates of location?
the mean, median and robust estimates.
35
What is being described in the estimates of variability?
standard deviation and related estimates | estimates based on percentiles
36
What is being described when exploring the data distributions?
percentiles and boxplots | frequency tables and histograms | density plots and estimates
37
What is meant by exploring binary and categorical data?
mode, expected value, probability
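
A short sketch of the three terms follows: the mode is the most frequent category, a probability can be estimated as a proportion, and an expected value is the sum of each value times its probability. The colors, payouts, and probabilities are invented for illustration.

```python
import statistics
from collections import Counter

# Hypothetical categorical observations.
colors = ["red", "blue", "red", "green", "red", "blue"]

# Mode: the most commonly occurring category.
mode = statistics.mode(colors)

# Probability estimated as the proportion of observations in a category.
counts = Counter(colors)
p_red = counts["red"] / len(colors)

# Expected value: sum of value * probability over all outcomes.
payouts = {0: 0.5, 10: 0.25, 50: 0.25}  # value -> probability (hypothetical)
expected_value = sum(value * prob for value, prob in payouts.items())

print(mode)            # red
print(p_red)           # 0.5
print(expected_value)  # 15.0
```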
38
What describes exploring two or more variables?
hexagon binning and contours (plotting numeric vs. numeric data) | two categorical variables | visualizing multiple variables
39
Give an example of numerical continuous data.
weight (which can be infinitely divided)
40
Give an example of numerical discrete data.
year of birth (numerical data that can't be divided)
41
Give an example of a categorical binary value? (only two options)
yes/no (for example, whether a camera is a Sony or not)
42
Give an example of a categorical ordinal value? (ordinal meaning order)
data where the order matters, such as a rating from 1 to 5
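
To show what order means for a categorical value, the sketch below sorts some labels alphabetically and then by an explicit ranking. The level names and the `rank` mapping are assumptions made for this example.

```python
# Hypothetical ordinal levels, listed from lowest to highest.
levels = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(levels)}

observed = ["medium", "high", "low", "medium"]

# Alphabetical sorting ignores the meaning of the categories.
alphabetical = sorted(observed)

# Sorting by rank respects the intended order.
by_rank = sorted(observed, key=rank.__getitem__)

print(alphabetical)  # ['high', 'low', 'medium', 'medium']
print(by_rank)       # ['low', 'medium', 'medium', 'high']
```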
43
Pandas stands for what?
panel data