Chapter 5 - Statistics Flashcards
What are you conducting if you want to obtain data for every element of your population? Why is it not generally done?
You’re conducting a census. It’s not generally done because the resources needed to check every single member of your chosen population are huge!
You’ve identified that you need to analyze data about films directed by Steven Spielberg. What would your population be?
The population would be all films directed by Steven Spielberg.
After finding all films directed by Spielberg, you want to concentrate your analysis on the ones made with a particular camera model. What name is given to this type of analysis?
Univariate Analysis
Pick the right words for the gaps.
______ pertain to the sample and ________ pertain to the population
Statistics pertain to the sample and parameters pertain to the population
Which branch of statistics summarizes and describes data?
Descriptive Statistics
What type of statistics do you use to help you understand the characteristics of your data?
Descriptive Statistics
In descriptive statistics, the first step is applying measures of what to your sample data? Why?
Using measures of frequency (like the count) to determine the size of the data set
It will help you determine if you can analyse the data simply on your laptop, or will require more processing power than a laptop provides.
What’s the most common measure of frequency?
Count
When measuring the count of a dataset, what must you decide how to handle?
How to handle null values
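A minimal sketch of why null handling matters for the count. The list `observations` is made-up data, and treating `None` as the null marker is an assumption for illustration:

```python
# Hypothetical sample with two missing (null) observations.
observations = [12, None, 7, 7, None, 3]

total_rows = len(observations)                       # counts nulls too
non_null = [x for x in observations if x is not None]
non_null_count = len(non_null)                       # counts only real values
null_count = total_rows - non_null_count

print(total_rows, non_null_count, null_count)
```

The two counts differ, so you need to decide (and state) which one you are reporting.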
What are the 3 measures of frequency mentioned in the book?
Count
Percentage
Frequency
A histogram is typically used to visualize what measure when conducting what kind of analysis?
Used to visualize frequency when conducting univariate analysis
What measure of frequency can help you identify biases in your dataset? Bias must be taken in the context of what?
Percentage measures
Bias must be taken within the context of your OBJECTIVES. It’s fine if the percentage of males in a sample is 100% if you’re only concerned with data that should only include men!
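A small sketch of the three frequency measures on one made-up variable. The `genders` list is hypothetical data, not from the book:

```python
from collections import Counter

# Hypothetical categorical sample.
genders = ["M", "F", "M", "M", "F", "M", "M", "M"]

count = len(genders)              # Count: size of the sample
frequency = Counter(genders)      # Frequency: how often each value occurs
percentage = {value: 100 * n / count for value, n in frequency.items()}

print(count, dict(frequency), percentage)
```

Here 75% of the sample is "M". Whether that is a bias depends on your objectives.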
What are the 3 measures of central tendency?
Mean
Median
Mode
The Mean is also known as?
The average
When calculating the mid-point value (Median) of an even number of data observations, what must you do?
Add together the two values closest to the mid-point and divide by 2.
What is the calculation that tells you which POSITION (not value) in an ordered list with an odd number of observations is the median? Describe what ‘n’ is.
(n + 1) divided by 2. n = the number of observations.
Which central tendency measure identifies the most frequently occurring observation?
The Mode.
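The central tendency cards above can be sketched in a few lines. The data and the helper name `median_by_hand` are made up for illustration; Python's built-in `statistics` module is used as a cross-check:

```python
import statistics

def median_by_hand(values):
    """Median using the rules from the cards above."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2          # 1-based POSITION of the median
        return ordered[position - 1]     # convert to a 0-based index
    lower = ordered[n // 2 - 1]          # the two values closest to the mid-point
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd_scores = [2, 4, 4, 6, 9]             # hypothetical data, n = 5
even_scores = [2, 4, 4, 6, 9, 11]        # hypothetical data, n = 6

print(statistics.mean(odd_scores))       # mean (the average)
print(median_by_hand(odd_scores))        # value at position (5 + 1) / 2 = 3
print(median_by_hand(even_scores))       # (4 + 6) / 2
print(statistics.mode(odd_scores))       # most frequently occurring value
```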
What are the measures of dispersion mentioned in the book?
Range
Distribution
Variance
Standard Deviation
What’s the name given to the difference between a variable’s max and min values?
The Range
Why is it that calculating the range on temperature values by themselves won’t help you identify invalid data?
Because temperature values can vary widely and have positive and negative values. You need additional information like location and time of year to give context
Which tool is effective to visualize a probability distribution? Why?
Histogram. Because the shape you see provides additional insights as to how to proceed with analysis.
Which theorem states that as sample size increases, the sampling distribution of the mean gets closer to a normal distribution?
Central Limit Theorem.
Whilst they look very similar, a frequency histogram and a distribution histogram are different. How?
Frequency histograms focus on the raw count of observations that fall in each interval.
Distribution histograms focus on the shape and spread by looking at how often an interval value occurs in relation to the total number of values
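A rough sketch of the difference, without any plotting. The `values` list and the bin width of 5 are assumptions for illustration:

```python
from collections import Counter

# Hypothetical observations, binned into intervals of width 5.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]
bin_width = 5

# Frequency view: raw count per interval (keyed by interval start).
counts = Counter((v // bin_width) * bin_width for v in values)

# Distribution view: each count relative to the total number of values.
total = len(values)
relative = {start: n / total for start, n in counts.items()}

print(dict(counts))
print(relative)
```

The bars have the same shape either way; only the vertical scale changes, from counts to proportions that sum to 1.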
Jon is sampling from a parent population that is normal. He takes several samples of varying sizes, some of them smaller than 30. Would the distribution of these sample means be skewed?
No. They would all be normal because the parent population is also normal.
If the parent population is skewed, you may need a sufficiently large sample size to get a normally distributed pattern. How large is ‘sufficient’?
Sample sizes of 30 or more are generally considered sufficiently large.
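A small simulation sketch of this idea, assuming an exponential distribution as the skewed parent population (my choice for illustration, not from the book). It uses the gap between mean and median as a crude skew indicator:

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_means(sample_size, n_samples=2000):
    """Draw n_samples samples from a right-skewed parent; return their means."""
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

def mean_median_gap(values):
    """Crude skew indicator: distance between the mean and the median."""
    return abs(statistics.fmean(values) - statistics.median(values))

gap_n1 = mean_median_gap(sample_means(1))    # effectively the parent itself
gap_n30 = mean_median_gap(sample_means(30))  # means of samples of size 30

print(gap_n1, gap_n30)
```

With samples of size 30 the gap shrinks noticeably, which is consistent with the sample means looking more normal.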
Pat is analysing a sample dataset and sees that the mean and the median are far apart. What is the probability distribution most likely going to be? What would it be if they were close together?
It will be skewed. If they’re close together it will more likely be normally distributed.
If the mean is greater than the median, data may be skewed _____?
If the mean is less than the median, data may be skewed ______?
If the mean is greater than the median, data may be skewed RIGHT
If the mean is less than the median, data may be skewed LEFT
If a histogram distribution is skewed left, the mean is ______ than the median
If a histogram distribution is skewed right, the mean is _____ than the median
If a histogram distribution is skewed left, the mean is LESS than the median
If a histogram distribution is skewed right, the mean is GREATER than the median
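The mean-versus-median rule of thumb from these cards can be sketched as below. The function name `skew_direction` and the datasets are hypothetical, and this is only a hint about skew, not a formal test:

```python
import statistics

def skew_direction(values):
    """Rule of thumb from the cards: compare the mean with the median."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "no skew indicated"

incomes = [20, 22, 25, 27, 30, 95]       # one large value pulls the mean up
exam_marks = [5, 70, 73, 75, 78, 80]     # one small value pulls the mean down
balanced = [1, 2, 3, 4, 5]

print(skew_direction(incomes))
print(skew_direction(exam_marks))
print(skew_direction(balanced))
```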
Po visualizes data about bus usage and sees there are two separate peaks in the data. What kind of distribution is this called?
bimodal distribution
You want to understand the variability of a dataset. You have the variance, but not the standard deviation. Can you still infer anything useful from variance alone?
Yes. You can determine the magnitude of the deviations from the mean and compare this between different sets of data.
Why is standard deviation preferred over variance for understanding the dispersion of data?
Standard deviation is preferred because it is expressed in the same units as the original data, which allows for easy comparison with the standard deviations of other datasets.
Variance emphasizes BLANK whereas standard deviation emphasizes BLANK
Variance emphasizes MAGNITUDE whereas standard deviation emphasizes ACTUAL DEVIATION from the mean
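A quick sketch of range, variance and standard deviation on made-up data. It assumes the sample versions of the formulas (dividing by n - 1), which is what `statistics.variance` and `statistics.stdev` compute; the cards don't say which version the book uses:

```python
import statistics

# Hypothetical sample of heights in centimetres.
heights_cm = [150, 160, 170, 180, 190]

value_range = max(heights_cm) - min(heights_cm)   # max minus min, in cm
variance = statistics.variance(heights_cm)        # in cm squared
std_dev = statistics.stdev(heights_cm)            # square root of variance, in cm

print(value_range, variance, std_dev)
```

The variance comes out in squared centimetres, while the standard deviation is back in centimetres like the data itself.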
If a dataset had a mean value of 130 and a standard deviation of 20, what would be the upper and lower limit of one standard deviation?
Lower limit = 130 - 20 = 110
Upper limit = 130 + 20 = 150
List the 3 values of the empirical rule (for normally distributed data) below:
xx% of values fall within 1 standard deviation
xx% of values fall within 2 standard deviations
xx% of values fall within 3 standard deviations
68% = 1 standard deviation
95% = 2 standard deviations
99.7% = 3 standard deviations
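As a rough check of the last two cards, the sketch below assumes the data follow a normal distribution with the mean of 130 and standard deviation of 20 from the earlier card. The helper names `band` and `share_within` are made up for illustration:

```python
from statistics import NormalDist

mean, sd = 130, 20
dist = NormalDist(mu=mean, sigma=sd)   # assumes normally distributed data

def band(k):
    """Lower and upper limits of k standard deviations around the mean."""
    return (mean - k * sd, mean + k * sd)

def share_within(k):
    """Proportion of a normal distribution that falls inside band(k)."""
    lower, upper = band(k)
    return dist.cdf(upper) - dist.cdf(lower)

for k in (1, 2, 3):
    print(k, band(k), round(share_within(k) * 100, 1))
```

The percentages only hold for roughly normal data. For a skewed distribution the shares can differ.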