Data Science using Python and R - 15 Flashcards
What does descriptive statistics refer to?
Methods for summarizing and organizing the information in a data set.
What are elements in a data set?
Entities for which information is collected.
What is a variable?
A characteristic of an element, which takes on different values for different elements.
What are observations in a data set?
The set of variable values for a particular element.
What are qualitative variables?
Variables that enable the elements to be classified or categorized according to some characteristic.
What are quantitative variables?
Variables that take numeric values and allow arithmetic to be meaningfully performed on them.
What are the four levels of measurement for data?
- Nominal
- Ordinal
- Interval
- Ratio
What is nominal data?
Data that refer to names, labels, or categories without natural ordering.
What is ordinal data?
Data that can be rendered into a particular order but cannot have arithmetic meaningfully performed on them.
What is interval data?
Quantitative data measured on an interval scale with no natural zero; addition and subtraction are meaningful, but multiplication and division are not.
What is ratio data?
Quantitative data for which all arithmetic operations may be performed and a natural zero exists.
What is a discrete variable?
A numerical variable that can take either a finite or a countably infinite number of values.
What is a continuous variable?
A numerical variable that can take infinitely many values, forming an interval on the number line.
What is a population in statistics?
The set of all elements of interest for a particular problem.
What is a parameter?
A characteristic of a population, usually unknown but constant.
What is a sample?
A subset of the population.
What is a statistic?
A characteristic of a sample.
What is a census?
The collection of information from every element in the population.
What does statistical inference refer to?
Methods for estimating or drawing conclusions about population characteristics based on a sample.
What is a random sample?
A sample for which each element has an equal chance of being selected.
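The distinction between a parameter (population) and a statistic (sample) can be illustrated with a minimal sketch. The population of 1,000 numbered elements, the sample size of 50, and the seed are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(42)  # assumed seed, only so the draw is repeatable

# Hypothetical population: 1,000 elements with values 1..1000.
population = list(range(1, 1001))

# Parameter: a characteristic of the whole population (constant).
parameter = statistics.mean(population)

# Random sample without replacement: every element is equally likely to be chosen.
sample = random.sample(population, k=50)

# Statistic: the same characteristic computed from the sample only.
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter}")
print(f"sample mean (statistic):     {statistic}")
```

The statistic changes from sample to sample, while the parameter stays fixed; a census would correspond to using the whole `population` list.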
What is a predictor variable?
A variable whose value is used to help predict the value of the response variable.
What is a response variable?
A variable of interest whose value is presumably determined by the predictor variables.
What does frequency refer to in categorical data?
The number of data values in each category.
What is a relative frequency?
The frequency of a category divided by the number of cases.
What is a frequency distribution?
A distribution consisting of all categories that the variable assumes, together with their frequencies.
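As a rough sketch, frequencies and relative frequencies for a categorical variable can be tabulated with the standard library. The function name `frequency_distribution` and the blood-type data are hypothetical.

```python
from collections import Counter

def frequency_distribution(values):
    """Map each category to (frequency, relative frequency).

    Assumes `values` is a non-empty list of category labels.
    """
    counts = Counter(values)
    n = len(values)
    return {category: (count, count / n) for category, count in counts.items()}

# Hypothetical data: blood types of 8 donors.
blood_types = ["A", "O", "B", "O", "A", "O", "AB", "O"]
distribution = frequency_distribution(blood_types)

for category, (freq, rel_freq) in sorted(distribution.items()):
    print(f"{category}: frequency={freq}, relative frequency={rel_freq:.3f}")
```

The relative frequencies always sum to 1, which is what makes them suitable for the slice sizes of a pie chart.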
What is a bar chart?
A graph used to represent the frequencies or relative frequencies for a categorical variable.
What is a Pareto chart?
A bar chart where the bars are arranged in decreasing order of frequency.
What is a pie chart?
A circle divided into slices, with the size of each slice proportional to the relative frequency of the category.
How are quantitative data grouped?
Into classes, with lower and upper limits defining the classes.
What is a cumulative frequency distribution?
A distribution showing, for each class, the total number of data values less than or equal to that class's upper limit.
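Grouping quantitative data into classes and accumulating the counts might look like the sketch below. The class limits and exam scores are made-up values, and the limits are treated as inclusive on both ends.

```python
from itertools import accumulate

# Hypothetical classes as (lower limit, upper limit), both inclusive.
classes = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]

# Hypothetical exam scores.
scores = [55, 62, 67, 71, 73, 78, 84, 85, 91, 58]

def class_frequencies(values, class_limits):
    """Count how many values fall inside each class."""
    return [sum(lower <= v <= upper for v in values)
            for lower, upper in class_limits]

frequencies = class_frequencies(scores, classes)

# Running total: values less than or equal to each class's upper limit.
cumulative = list(accumulate(frequencies))

for (lower, upper), freq, cum in zip(classes, frequencies, cumulative):
    print(f"{lower}-{upper}: frequency={freq}, cumulative={cum}")
```

The last cumulative frequency equals the number of data values, which is a quick check that no value fell outside the classes.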
What is a histogram?
A graphical representation of a frequency distribution for a quantitative variable.
What is a stem-and-leaf display?
A display showing the shape of the data distribution while retaining the original data values.
What is a dotplot?
A plot where each dot represents one or more data values set above the number line.
What is a symmetric distribution?
A distribution that can be split into two halves that are approximately mirror images of each other.
What characterizes right-skewed data?
A longer tail on the right than the left.
What characterizes left-skewed data?
A longer tail on the left than the right.
What does the summation notation ∑x mean?
To add up all the data values x.
What are measures of center?
- Mean
- Median
- Mode
- Midrange
How is the mean calculated?
Add up the values and divide by the number of values.
What is the median?
The middle data value when the data are sorted in ascending order; with an even number of values, the mean of the two middle values.
What is the mode?
The data value that occurs with the greatest frequency.
What is the midrange?
The average of the maximum and minimum values in a data set.
How does skewness affect the mean and median?
For symmetric data, they are approximately equal; for right-skewed, mean > median; for left-skewed, median > mean.
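The four measures of center can be computed with Python's `statistics` module; `midrange` is a small helper written here because the module does not provide one. The data set is an invented, right-skewed example.

```python
import statistics

def midrange(values):
    """Average of the maximum and minimum values."""
    return (max(values) + min(values)) / 2

# Hypothetical right-skewed data: the value 10 stretches the right tail.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of values divided by their count
median = statistics.median(data)  # mean of the two middle values (even n)
mode = statistics.mode(data)      # most frequent value
mid = midrange(data)

print(f"mean={mean}, median={median}, mode={mode}, midrange={mid}")
```

In this example the mean exceeds the median, which is consistent with the right-skew pattern described above.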
What are measures of variability?
- Range
- Variance
- Standard deviation
- Interquartile range (IQR)
How is the range calculated?
Difference between the maximum and minimum values.
What is the relationship between the mean and median in right-skewed data?
The mean is greater than the median.
What is the relationship between the mean and median in left-skewed data?
The median is greater than the mean.
What are the measures of variability that quantify the amount of variation in data?
- Range
- Variance
- Standard deviation
- Interquartile range (IQR)
How is the range of a variable calculated?
Range = max(value) - min(value).
What is a deviation in the context of data analysis?
The signed difference between a data value and the mean value.
What is the formula for population variance?
σ² = (Σ(x - μ)²) / N
What is the square root of the population variance called?
Population standard deviation.
What is the difference between sample variance and population variance?
Sample variance divides by n - 1 instead of N, which makes it an unbiased estimator of the population variance: s² = (Σ(x - x̄)²) / (n - 1)
What is the formula for sample standard deviation?
s = √(s²) = √((Σ(x - x̄)²) / (n - 1))
Why is standard deviation preferred over variance for reporting results?
Standard deviation is expressed in the original units.
What does the sample standard deviation represent?
The size of the typical deviation from the mean.
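The variance and standard deviation formulas can be written out directly, as in this sketch. The function names and the data are illustrative; the standard library's `statistics.pvariance` and `statistics.variance` compute the same quantities.

```python
import math

def population_variance(values):
    """sigma^2 = sum((x - mu)^2) / N"""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """s^2 = sum((x - x_bar)^2) / (n - 1)"""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Signed deviations from the mean always sum to zero,
# which is why they are squared before averaging.
deviations = [x - sum(data) / len(data) for x in data]

pop_var = population_variance(data)
samp_var = sample_variance(data)
pop_sd = math.sqrt(pop_var)    # back in the original units
samp_sd = math.sqrt(samp_var)

print(f"sum of deviations: {sum(deviations)}")
print(f"population variance={pop_var}, population sd={pop_sd}")
print(f"sample variance={samp_var:.4f}, sample sd={samp_sd:.4f}")
```

Dividing by n - 1 makes the sample variance slightly larger than the population formula would give for the same values.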
What is the definition of the pth percentile in a data set?
The data value such that p percent of the values are at or below this value.
What is the percentile rank of a data value?
The percentage of values in the data set that are at or below that value.
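A percentile rank under the "at or below" definition can be sketched as follows. The helper name `percentile_rank` and the data are hypothetical, and other textbooks and libraries use slightly different conventions.

```python
def percentile_rank(values, x):
    """Percentage of values in the data set that are at or below x."""
    at_or_below = sum(v <= x for v in values)
    return 100 * at_or_below / len(values)

# Hypothetical data set of 10 values.
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

rank_of_70 = percentile_rank(data, 70)
print(f"percentile rank of 70: {rank_of_70}")
```

Here 7 of the 10 values are at or below 70, so the rank is 70.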
How is the Z-score calculated?
For sample data, Z-score = (x - x̄) / s; for population data, Z-score = (x - μ) / σ.
What does a Z-score represent?
How many standard deviations a data value lies above or below the mean.
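A minimal sketch of the Z-score calculation, assuming the data are a sample, so the sample mean and sample standard deviation are used. The function name `z_score` and the data are illustrative.

```python
import statistics

def z_score(x, values):
    """Number of sample standard deviations that x lies from the sample mean."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return (x - x_bar) / s

# Hypothetical sample with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

z_of_mean = z_score(5, data)   # a value equal to the mean
z_of_nine = z_score(9, data)   # a value above the mean
z_of_two = z_score(2, data)    # a value below the mean

print(f"z(5)={z_of_mean:.3f}, z(9)={z_of_nine:.3f}, z(2)={z_of_two:.3f}")
```

Values above the mean get positive Z-scores and values below it get negative ones.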
What does the Empirical Rule state for a normal distribution?
- About 68% of data within one standard deviation
- About 95% of data within two standard deviations
- About 99.7% of data within three standard deviations
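The Empirical Rule can be checked approximately by simulation. This sketch draws 100,000 values from a standard normal distribution; the seed, sample size, and helper name `share_within` are assumptions, and the simulated shares will only be close to 68%, 95%, and 99.7%, not exact.

```python
import random

random.seed(0)  # assumed seed, only for repeatability

# Simulated draws from a normal distribution with mean 0 and sd 1.
draws = [random.gauss(0, 1) for _ in range(100_000)]

def share_within(values, k, mean=0.0, sd=1.0):
    """Fraction of values within k standard deviations of the mean."""
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)

within_1 = share_within(draws, 1)
within_2 = share_within(draws, 2)
within_3 = share_within(draws, 3)

print(f"within 1 sd: {within_1:.3f}")
print(f"within 2 sd: {within_2:.3f}")
print(f"within 3 sd: {within_3:.3f}")
```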
What is the first quartile (Q1) in a data set?
The 25th percentile of the data set.
What is the formula for calculating the interquartile range (IQR)?
IQR = Q3 - Q1
How is an outlier defined using the IQR method?
A data value x is an outlier if either condition holds:
- x ≤ Q1 - 1.5(IQR)
- x ≥ Q3 + 1.5(IQR)
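The IQR outlier rule can be sketched with `statistics.quantiles`. Quartile conventions differ between textbooks and software, so the choice of `method="inclusive"` here is an assumption; the function name `iqr_outliers` and the data are also illustrative.

```python
import statistics

def iqr_outliers(values):
    """Return the values at or beyond the 1.5 * IQR fences."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x <= lower_fence or x >= upper_fence]

# Hypothetical data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
outliers = iqr_outliers(data)

print(f"Q1={q1}, Q3={q3}, IQR={q3 - q1}")
print(f"outliers: {outliers}")
```

With a different quartile convention the fences shift slightly, so borderline values may be classified differently.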
What does the five-number summary of a data set consist of?
- Minimum
- Q1
- Median
- Q3
- Maximum
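A five-number summary could be assembled as below. As with any quartile calculation, the result depends on the quartile convention; `method="inclusive"` is an assumption, and `five_number_summary` is a hypothetical helper.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Hypothetical data (unsorted on purpose; quantiles() sorts internally).
data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

summary = five_number_summary(data)
print("min, Q1, median, Q3, max =", summary)
```

These five values are the ones a boxplot draws, apart from any outliers plotted separately.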
What is a boxplot used for?
To recognize symmetry and skewness in a data distribution.
What does the left whisker in a boxplot indicate?
The minimum value that is not an outlier.
What characterizes a left-skewed distribution in a boxplot?
The left whisker is longer than the right whisker.
What is a bivariate relationship?
The relationship between two variables.
What is a contingency table?
A crosstabulation of two categorical variables.
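A contingency table is a count of each combination of two categorical variables, which can be sketched with `collections.Counter`. The customer records and variable names (plan, churned) are invented for illustration.

```python
from collections import Counter

# Hypothetical records: (plan, churned) for 8 customers.
records = [
    ("basic", "yes"), ("basic", "no"), ("premium", "no"), ("basic", "yes"),
    ("premium", "no"), ("premium", "yes"), ("basic", "no"), ("premium", "no"),
]

# Count each (row category, column category) pair.
table = Counter(records)

plans = sorted({plan for plan, _ in records})
outcomes = sorted({churned for _, churned in records})

# Print the crosstabulation with one row per plan.
print("plan".ljust(10) + "".join(o.rjust(6) for o in outcomes))
for plan in plans:
    row = "".join(str(table[(plan, o)]).rjust(6) for o in outcomes)
    print(plan.ljust(10) + row)
```

The cell counts sum to the number of records, and each row of counts is what a clustered bar chart would display as one cluster.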
What does a clustered bar chart represent?
A graphical representation of a contingency table.
How can we summarize the relationship between a quantitative variable and a categorical variable?
By calculating summary statistics for the quantitative variable for each level of the categorical variable.
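Summary statistics per category might be computed as in this sketch. The helper `summarize_by_group`, the choice of statistics (count, mean, median), and the data are assumptions for illustration.

```python
from collections import defaultdict
import statistics

def summarize_by_group(pairs):
    """Group (category, value) pairs and summarize the values per category."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    return {
        category: {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
        for category, values in groups.items()
    }

# Hypothetical data: (group label, measurement).
observations = [("A", 10), ("A", 20), ("A", 30), ("B", 5), ("B", 15)]

summary_by_group = summarize_by_group(observations)
for category, stats in sorted(summary_by_group.items()):
    print(category, stats)
```

Comparing the per-category means or medians gives a first impression of whether the quantitative variable differs across levels.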
What is an individual value plot?
A set of vertical dotplots for each category in a categorical variable.
What is a scatter plot used for?
To visualize the relationship between two quantitative variables.
What does the correlation coefficient r indicate?
The strength and direction of the linear relationship between two quantitative variables.
What are the possible values of the correlation coefficient r?
−1 ≤ r ≤ 1.
What does a positive and significant r indicate?
x and y are positively correlated.
What does a negative and significant r indicate?
x and y are negatively correlated.
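The correlation coefficient r can be computed from deviations about the means, as in this sketch. The function name `correlation` and the three small data sets are illustrative; the sketch does not assess statistical significance.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data.

    Assumes xs and ys have equal length and that neither is constant.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]

r_perfect_positive = correlation(xs, [2, 4, 6, 8, 10])  # y rises exactly with x
r_perfect_negative = correlation(xs, [10, 8, 6, 4, 2])  # y falls exactly with x
r_noisy_positive = correlation(xs, [2, 1, 4, 3, 5])     # upward trend with noise

print(f"perfect positive: {r_perfect_positive}")
print(f"perfect negative: {r_perfect_negative}")
print(f"noisy positive:   {r_noisy_positive:.2f}")
```

The sign of r gives the direction of the linear relationship and its magnitude gives the strength, always within the range from -1 to 1.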