Data Science using Python and R - 15 Flashcards

1
Q

What does descriptive statistics refer to?

A

Methods for summarizing and organizing the information in a data set.

2
Q

What are elements in a data set?

A

Entities for which information is collected.

3
Q

What is a variable?

A

A characteristic of an element, which takes on different values for different elements.

4
Q

What are observations in a data set?

A

The set of variable values for a particular element.

5
Q

What are qualitative variables?

A

Variables that enable the elements to be classified or categorized according to some characteristic.

6
Q

What are quantitative variables?

A

Variables that take numeric values and allow arithmetic to be meaningfully performed on them.

7
Q

What are the four levels of measurement for data?

A
  • Nominal
  • Ordinal
  • Interval
  • Ratio
8
Q

What is nominal data?

A

Data that refer to names, labels, or categories without natural ordering.

9
Q

What is ordinal data?

A

Data that can be rendered into a particular order but cannot have arithmetic meaningfully performed on them.

10
Q

What is interval data?

A

Quantitative data defined on an interval without a natural zero. Addition and subtraction may be performed, but multiplication and division are not meaningful.

11
Q

What is ratio data?

A

Quantitative data for which all arithmetic operations may be performed and a natural zero exists.

12
Q

What is a discrete variable?

A

A numerical variable that can take either a finite or a countable number of values.

13
Q

What is a continuous variable?

A

A numerical variable that can take infinitely many values, forming an interval on the number line.

14
Q

What is a population in statistics?

A

The set of all elements of interest for a particular problem.

15
Q

What is a parameter?

A

A characteristic of a population, usually unknown but constant.

16
Q

What is a sample?

A

A subset of the population.

17
Q

What is a statistic?

A

A characteristic of a sample.

18
Q

What is a census?

A

The collection of information from every element in the population.

19
Q

What does statistical inference refer to?

A

Methods for estimating or drawing conclusions about population characteristics based on a sample.

20
Q

What is a random sample?

A

A sample for which each element has an equal chance of being selected.

21
Q

What is a predictor variable?

A

A variable whose value is used to help predict the value of the response variable.

22
Q

What is a response variable?

A

A variable of interest whose value is presumably determined by the predictor variables.

23
Q

What does frequency refer to in categorical data?

A

The number of data values in each category.

24
Q

What is a relative frequency?

A

The frequency of a category divided by the total number of cases.

25
Q

What is a frequency distribution?

A

A distribution consisting of all categories that the variable assumes, together with their frequencies.
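As a small illustration of frequency, relative frequency, and a frequency distribution, the sketch below tallies a categorical variable with the standard library. The `colors` list is made-up data, not taken from the deck.

```python
from collections import Counter

# Hypothetical values of a categorical variable (illustrative only).
colors = ["red", "blue", "red", "green", "blue", "red", "red", "green"]

# Frequency: the number of data values in each category.
freq = Counter(colors)

# Relative frequency: frequency divided by the total number of cases.
n = len(colors)
rel_freq = {category: count / n for category, count in freq.items()}

# Frequency distribution: every category with its frequency,
# listed here in decreasing order of frequency.
for category, count in freq.most_common():
    print(f"{category:<6} {count:>2}  {rel_freq[category]:.3f}")
```

The relative frequencies always sum to 1, which is a quick sanity check on any table built this way.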

26
Q

What is a bar chart?

A

A graph used to represent the frequencies or relative frequencies for a categorical variable.

27
Q

What is a Pareto chart?

A

A bar chart where the bars are arranged in decreasing order.

28
Q

What is a pie chart?

A

A circle divided into slices, with the size of each slice proportional to the relative frequency of the category.

29
Q

How are quantitative data grouped?

A

Into classes, with lower and upper limits defining the classes.

30
Q

What is a cumulative frequency distribution?

A

A distribution showing the total number of data values less than or equal to the upper class limit.
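A cumulative frequency is a running total taken class by class. The sketch below assumes hypothetical class limits and frequencies, chosen only to show the arithmetic.

```python
from itertools import accumulate

# Hypothetical classes (lower limit, upper limit) and their frequencies.
classes = [(0, 9), (10, 19), (20, 29), (30, 39)]
freqs = [3, 7, 5, 1]

# Cumulative frequency: the number of values less than or equal to
# the upper limit of each class, i.e. a running sum of the frequencies.
cum_freqs = list(accumulate(freqs))

for (lower, upper), f, cf in zip(classes, freqs, cum_freqs):
    print(f"{lower:>2}-{upper:<2}  freq={f}  cumulative={cf}")
```

The last cumulative frequency equals the total number of data values.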

31
Q

What is a histogram?

A

A graphical representation of a frequency distribution for a quantitative variable.

32
Q

What is a stem-and-leaf display?

A

A display showing the shape of the data distribution while retaining the original data values.

33
Q

What is a dotplot?

A

A plot where each dot represents one or more data values set above the number line.

34
Q

What is a symmetric distribution?

A

A distribution that can be split into two halves that are approximately mirror images of each other.

35
Q

What characterizes right-skewed data?

A

A longer tail on the right than the left.

36
Q

What characterizes left-skewed data?

A

A longer tail on the left than the right.

37
Q

What does the summation notation ∑x mean?

A

To add up all the data values x.

38
Q

What are measures of center?

A
  • Mean
  • Median
  • Mode
  • Midrange
39
Q

How is the mean calculated?

A

Add up the values and divide by the number of values.

40
Q

What is the median?

A

The middle data value when the data are sorted in ascending order; with an even number of values, the mean of the two middle values.

41
Q

What is the mode?

A

The data value that occurs with the greatest frequency.

42
Q

What is the midrange?

A

The average of the maximum and minimum values in a data set.

43
Q

How does skewness affect the mean and median?

A

For symmetric data, they are approximately equal; for right-skewed, mean > median; for left-skewed, median > mean.
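The four measures of center can be computed in a few lines. The sketch below uses a small made-up data set with one large value, so it also shows the mean being pulled above the median in right-skewed data.

```python
import statistics

# Hypothetical right-skewed data (one large value stretches the right tail).
values = [2, 3, 3, 5, 7, 10, 19]

mean = statistics.mean(values)               # sum of values / number of values
median = statistics.median(values)           # middle value of the sorted data
mode = statistics.mode(values)               # most frequent value
midrange = (max(values) + min(values)) / 2   # average of the max and the min

print(mean, median, mode, midrange)
```

Here the mean is 7 and the median is 5, consistent with mean > median for right-skewed data.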

44
Q

What are measures of variability?

A
  • Range
  • Variance
  • Standard deviation
  • Interquartile range (IQR)
45
Q

How is the range calculated?

A

Difference between the maximum and minimum values.

46
Q

What is the relationship between the mean and median in right-skewed data?

A

The mean is greater than the median.

47
Q

What is the relationship between the mean and median in left-skewed data?

A

The median is greater than the mean.

48
Q

What are the measures of variability that quantify the amount of variation in data?

A
  • Range
  • Variance
  • Standard deviation
  • Interquartile range (IQR)
49
Q

How is the range of a variable calculated?

A

Range = max(value) - min(value).

50
Q

What is a deviation in the context of data analysis?

A

The signed difference between a data value and the mean value.

51
Q

What is the formula for population variance?

A

σ² = (Σ(x - μ)²) / N

52
Q

What is the square root of the population variance called?

A

Population standard deviation.

53
Q

What is the difference between sample variance and population variance?

A

Sample variance uses n - 1 in the denominator to be an unbiased estimator.
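The two variances differ only in the denominator. The sketch below computes both from the same squared deviations, using made-up values, and takes square roots to get the corresponding standard deviations.

```python
import math

# Hypothetical data values (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n

# Squared deviations from the mean.
squared_devs = [(x - mean) ** 2 for x in values]

pop_var = sum(squared_devs) / n           # population variance: divide by N
sample_var = sum(squared_devs) / (n - 1)  # sample variance: divide by n - 1

pop_sd = math.sqrt(pop_var)        # population standard deviation
sample_sd = math.sqrt(sample_var)  # sample standard deviation

print(pop_var, sample_var, pop_sd, sample_sd)
```

Because it divides by a smaller number, the sample variance is always at least as large as the population formula applied to the same values.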

54
Q

What is the formula for sample standard deviation?

A

s = √(s²) = √(Σ(x - x̄)² / (n - 1))

55
Q

Why is standard deviation preferred over variance for reporting results?

A

Standard deviation is expressed in the original units.

56
Q

What does the sample standard deviation represent?

A

The size of the typical deviation from the mean.

57
Q

What is the definition of the pth percentile in a data set?

A

The data value such that p percent of the values are at or below this value.

58
Q

What is the percentile rank of a data value?

A

The percentage of values in the data set that are at or below that value.
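A percentile rank follows directly from this definition: count the values at or below the one of interest and divide by the number of values. The helper name `percentile_rank` and the scores are hypothetical.

```python
def percentile_rank(data, x):
    """Percentage of values in data that are at or below x."""
    at_or_below = sum(1 for v in data if v <= x)
    return 100 * at_or_below / len(data)

# Hypothetical exam scores (illustrative only).
scores = [55, 62, 70, 70, 78, 81, 85, 90, 94, 99]

rank = percentile_rank(scores, 78)
print(rank)
```

Five of the ten scores are at or below 78, so its percentile rank is 50.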

59
Q

How is the Z-score calculated?

A

Z-score = (x - x̄) / s for a sample, or (x - μ) / σ for a population.

60
Q

What does a Z-score represent?

A

How many standard deviations a data value lies above or below the mean.
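A short sketch of the sample version of the Z-score, using the sample mean and sample standard deviation. The data and the helper name `z_score` are assumptions made for illustration.

```python
import statistics

# Hypothetical sample (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)  # sample mean
s = statistics.stdev(values)    # sample standard deviation (n - 1 denominator)

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = z_score(9, mean, s)
print(round(z, 3))
```

A value equal to the mean has a Z-score of 0; values below the mean have negative Z-scores.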

61
Q

What does the Empirical Rule state for a normal distribution?

A
  • About 68% of data within one standard deviation
  • About 95% of data within two standard deviations
  • About 99.7% of data within three standard deviations
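The three percentages can be checked against the normal distribution itself. The sketch below uses `statistics.NormalDist` from the standard library; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The computed proportions are roughly 0.6827, 0.9545, and 0.9973, which the Empirical Rule rounds to 68%, 95%, and 99.7%.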
62
Q

What is the first quartile (Q1) in a data set?

A

The 25th percentile of the data set.

63
Q

What is the formula for calculating the interquartile range (IQR)?

A

IQR = Q3 - Q1

64
Q

How is an outlier defined using the IQR method?

A
  • x ≤ Q1 - 1.5(IQR)
  • x ≥ Q3 + 1.5(IQR)
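The two conditions above can be applied as follows. This is a sketch on made-up data; it assumes the "inclusive" quartile method of `statistics.quantiles`, and other quartile conventions can give slightly different fences.

```python
import statistics

# Hypothetical data with one suspiciously large value.
values = [1, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles (assumption: the "inclusive" interpolation method).
q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# A value is flagged as an outlier if it lies 1.5 * IQR or more
# below Q1 or above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x <= lower_fence or x >= upper_fence]

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

With these values Q1 = 4 and Q3 = 8, so the fences are -2 and 14 and only 30 is flagged.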
65
Q

What does the five-number summary of a data set consist of?

A
  • Minimum
  • Q1
  • Median
  • Q3
  • Maximum
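A five-number summary can be assembled from the minimum, the three quartiles, and the maximum. The function name `five_number_summary` is hypothetical, and the quartiles again assume the "inclusive" method of `statistics.quantiles`.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

# Hypothetical, unsorted data (quantiles sorts a copy internally).
values = [7, 1, 9, 3, 30, 5, 8, 4, 6]

summary = five_number_summary(values)
print(summary)
```

These five numbers are exactly what a boxplot draws, apart from any points shown separately as outliers.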
66
Q

What is a boxplot used for?

A

To recognize symmetry and skewness in a data distribution.

67
Q

What does the left whisker in a boxplot indicate?

A

The minimum value that is not an outlier.

68
Q

What characterizes a left-skewed distribution in a boxplot?

A

The left whisker is longer than the right whisker.

69
Q

What is a bivariate relationship?

A

The relationship between two variables.

70
Q

What is a contingency table?

A

A crosstabulation of two categorical variables.
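A contingency table can be sketched by counting pairs of category values. The variable names and records below are made up for illustration.

```python
from collections import Counter

# Hypothetical records: (response, region) for each element.
records = [
    ("yes", "north"), ("no", "north"), ("yes", "south"),
    ("yes", "south"), ("no", "south"), ("yes", "north"),
]

# Count how often each combination of the two categories occurs.
table = Counter(records)

responses = sorted({response for response, _ in records})
regions = sorted({region for _, region in records})

# Print the crosstabulation: one row per response, one column per region.
print("      " + "  ".join(f"{region:>5}" for region in regions))
for response in responses:
    cells = "  ".join(f"{table[(response, region)]:>5}" for region in regions)
    print(f"{response:<6}{cells}")
```

The cell counts sum to the number of records, since each element falls in exactly one cell.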

71
Q

What does a clustered bar chart represent?

A

A graphical representation of a contingency table.

72
Q

How can we summarize the relationship between a quantitative variable and a categorical variable?

A

By calculating summary statistics for the quantitative variable for each level of the categorical variable.
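One way to do this is to group the quantitative values by category level and summarize each group separately. The sketch below uses made-up data and standard-library tools only.

```python
import statistics
from collections import defaultdict

# Hypothetical rows: (level of the categorical variable, quantitative value).
rows = [("A", 10), ("B", 20), ("A", 14), ("B", 26), ("A", 12)]

# Collect the quantitative values for each level.
groups = defaultdict(list)
for level, value in rows:
    groups[level].append(value)

# Summary statistics for each level of the categorical variable.
summary = {
    level: {
        "n": len(vals),
        "mean": statistics.mean(vals),
        "sd": statistics.stdev(vals),  # sample standard deviation
    }
    for level, vals in groups.items()
}

for level, stats in sorted(summary.items()):
    print(level, stats)
```

Comparing the per-level means and standard deviations gives a first impression of how the quantitative variable differs across categories.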

73
Q

What is an individual value plot?

A

A set of vertical dotplots for each category in a categorical variable.

74
Q

What is a scatter plot used for?

A

To visualize the relationship between two quantitative variables.

75
Q

What does the correlation coefficient r indicate?

A

The strength and direction of the linear relationship between two quantitative variables.

76
Q

What are the possible values of the correlation coefficient r?

A

−1 ≤ r ≤ 1.
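The correlation coefficient can be computed from deviations about the two means. The function `pearson_r` below is a plain-Python sketch on made-up data, not a reference implementation.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired quantitative values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data (illustrative only).
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 5, 4, 5]     # tends to rise with x
ys_down = [10, 8, 6, 4, 2]  # falls exactly linearly with x

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)
print(round(r_up, 4), round(r_down, 4))
```

The first result is positive and below 1, the second is exactly -1, and both lie in the allowed range from -1 to 1.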

77
Q

What does a positive and significant r indicate?

A

x and y are positively correlated.

78
Q

What does a negative and significant r indicate?

A

x and y are negatively correlated.