Exploratory Data Analysis Flashcards

1
Q

Data expressed on a numerical scale are what data type?

A

numeric

2
Q

Data that can take only a specific set of values representing a set of possible categories (enums, enumerated, factors, nominal) are what data type?

A

categorical

3
Q

Cite the two numerical data types

A

continuous and discrete

4
Q

Cite the two categorical data types

A

binary and ordinal

5
Q

Data that can take on any value in an interval (float, numeric)

A

continuous

6
Q

Data that can take only integer values, such as counts

A

discrete

7
Q

True or False

Data typing in software acts as a signal on how to process the data

A

True

8
Q

Rectangular data (like a spreadsheet) is the basic structure for statistical and machine learning models. What is this structure called?

A

dataframe

9
Q

A column (series) within a table is commonly referred to as a _______?

A

feature

10
Q

Many data science projects involve predicting an ______?

A

outcome (dependent variable, response, target, output)

11
Q

A row in a table is referred to as a ______?

A

record

12
Q

What is the sum of all values divided by the number of values?

A

mean

13
Q

The sum of each value times its weight, divided by the sum of the weights

A

weighted mean
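
A minimal sketch of the mean and weighted mean definitions from the two cards above, using only the Python standard library. The helper name `weighted_mean` and the sample numbers are made up for illustration.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


values = [10, 20, 30]   # made-up observations
weights = [1, 1, 2]     # made-up weights; the last value counts double

plain = mean(values)                       # (10 + 20 + 30) / 3
weighted = weighted_mean(values, weights)  # (10 + 20 + 60) / 4
```

With these numbers the plain mean is 20 and the weighted mean is 22.5, because the heavier weight pulls the estimate toward 30.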

14
Q

The value such that one-half of the data lies above it and one-half lies below it

A

median

15
Q

The value such that P percent of the data lies below

A

percentile (quantile)

16
Q

The value in the sorted data such that one-half of the sum of the weights lies above it and one-half lies below it

A

weighted median
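
One possible way to compute a weighted median, sketched with the standard library only. The function name `weighted_median` is hypothetical, and the tie-breaking rule (return the first value at which the running weight reaches half the total) is an assumption; other conventions interpolate between neighbors.

```python
def weighted_median(values, weights):
    """Return the first sorted value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))   # sort by value, keeping each weight attached
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("values must not be empty")


# Made-up data: the value 4 carries most of the weight.
result = weighted_median([1, 2, 3, 4], [1, 1, 1, 5])
```

Here the total weight is 8, so the running weight first reaches 4 at the value 4, which becomes the weighted median. With equal weights this convention returns the lower of the two middle values rather than their average.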

17
Q

The average of all values after dropping a fixed number of extreme values

A

trimmed mean (truncated mean)
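
A short sketch of a trimmed mean, assuming the "fixed number" means dropping `p` values from each end of the sorted data. The name `trimmed_mean` and the parameter `p` are illustrative choices, not a standard library API.

```python
from statistics import mean


def trimmed_mean(values, p):
    """Average the values left after dropping the p smallest and p largest."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must be non-negative and leave at least one value")
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]   # slice off p values at each end
    return mean(kept)


data = [1, 2, 3, 4, 100]           # made-up data with one extreme value
untrimmed = trimmed_mean(data, 0)  # same as the ordinary mean
trimmed = trimmed_mean(data, 1)    # averages 2, 3, 4
```

Dropping one value from each end moves the estimate from 22 to 3, which shows why the trimmed mean is considered more robust than the mean.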

18
Q

Not sensitive to extreme values

A

Robust (resistant)

19
Q

What is a data value that is very different from most of the data?

A

Outlier (extreme value)

20
Q

The difference between the observed values and the estimate of location?

A

deviations

21
Q

The sum of squared deviations from the mean divided by n-1 where n is the number of data values.

A

variance

22
Q

The square root of the variance

A

standard deviation
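
The deviation, variance, and standard deviation cards can be worked through step by step. This is a sketch on made-up data; the variable names are illustrative.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]   # made-up sample
n = len(data)

center = sum(data) / n                    # estimate of location (the mean)
deviations = [x - center for x in data]   # observed value minus the estimate

# Sum of squared deviations divided by n - 1 (the sample variance).
variance = sum(d ** 2 for d in deviations) / (n - 1)
std_dev = math.sqrt(variance)

# The standard library implements the same n - 1 definitions.
library_variance = statistics.variance(data)
library_std_dev = statistics.stdev(data)
```

For this sample the squared deviations sum to 10, so the variance is 10 / 4 = 2.5 and the standard deviation is its square root.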

23
Q

The mean of the absolute values of the deviations from the mean (L1-norm, Manhattan norm)

A

mean absolute deviation

24
Q

The median of the absolute values of the deviations from the median

A

median absolute deviation from the median (MAD)
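
To compare the two estimates, here is a small sketch on made-up data. The mean absolute deviation averages the absolute deviations from the mean, while the median absolute deviation takes the median of the absolute deviations from the median.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 100]   # made-up data with one extreme value

# Mean absolute deviation: average distance from the mean.
data_mean = mean(data)
mean_abs_dev = mean([abs(x - data_mean) for x in data])

# Median absolute deviation: median distance from the median.
data_median = median(data)
median_abs_dev = median([abs(x - data_median) for x in data])
```

The single extreme value inflates the mean absolute deviation to 31.2, while the median absolute deviation stays at 1.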

25
Q

The difference between the largest and smallest value in a data set.

A

range

26
Q

Metrics based on the data values sorted from smallest to largest (ranks)

A

order statistics

27
Q

The value such that P percent of the values take on this value or less and (100-P) percent take on this value or more

A

percentile

28
Q

The difference between the 75th percentile and the 25th percentile (IQR)

A

interquartile range
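
A sketch of the range, quartiles, and interquartile range using `statistics.quantiles` (Python 3.8+). Percentile definitions differ in how they interpolate between data points; the `method="inclusive"` choice here is an assumption, and other tools may return slightly different values.

```python
from statistics import quantiles

data = [9, 1, 5, 3, 7, 2, 8, 4, 6]   # made-up data, deliberately unsorted

ordered = sorted(data)                 # order statistics start from sorted data
data_range = ordered[-1] - ordered[0]  # largest minus smallest

# n=4 splits the data into quartiles: 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
```

For these nine values the quartiles are 3, 5, and 7, so the IQR is 4, while the range is 8.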

29
Q

The basic metric for location is the ____, but it can be sensitive to extreme values (_____).

A

mean, outliers
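
A quick illustration of that sensitivity, on made-up numbers: adding one outlier moves the mean a long way but barely moves the median.

```python
from statistics import mean, median

clean = [1, 2, 3, 4, 5]          # made-up data
with_outlier = clean + [1000]    # same data plus one extreme value

clean_mean, clean_median = mean(clean), median(clean)
outlier_mean, outlier_median = mean(with_outlier), median(with_outlier)
```

The mean jumps from 3 to roughly 169, while the median only shifts from 3 to 3.5, which is why the median is described as a robust estimate of location.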

30
Q

Location is one dimension in summarizing a _______?

A

feature

31
Q

A second dimension, variability (dispersion) measures?

A

whether the data values are tightly clustered or spread out.

32
Q

What are the objectives being performed during the exploratory data analysis phase (EDA)?

A
elements of structured data
estimates of location
estimates of variability (dispersion metrics)
exploring the data distribution
exploring binary and categorical data
correlation
exploring two or more variables
33
Q

Estimates of location can be described by?

A

mean, median and robust estimates.

34
Q

What is being described in estimates of location?

A

the mean, median and robust estimates.

35
Q

What is being described in the estimates of variability?

A

standard deviation and related estimates

estimates based on percentiles

36
Q

What is being described when exploring the data distributions?

A

percentiles and boxplots
frequency tables and histograms
density plots and estimates

37
Q

What is meant by exploring binary and categorical data?

A

mode, expected value, probability

38
Q

What describes exploring two or more variables?

A

hexagon binning and contours (plotting numeric vs numeric data)
two categorical variables
visualizing multiple variables

39
Q

Give an example of continuous numerical data.

A

weight (which can be infinitely divided)

40
Q

Give an example of discrete numerical data.

A

year of birth (numerical data that can’t be divided)

41
Q

Give an example of a categorical binary value? (only two options)

A

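whether a camera is a Sony or not (yes/no; a brand name alone, with many possible values, would be nominal rather than binary)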

42
Q

Give an example of a categorical ordinal value? (ordinal meaning order)

A

data whose categories have an explicit order, such as a numerical rating (1, 2, 3, 4, or 5)

43
Q

Pandas stands for what?

A

panel data