Chapter 5 - Statistics Flashcards

1
Q

What are you conducting if you want to obtain data for every element of your population? Why is it not generally done?

A

You’re conducting a census. It’s generally not done because checking every single member of your chosen population takes a huge amount of resources.

2
Q

You’ve identified that you need to analyze data about films directed by Steven Spielberg. What would your population be?

A

The population would be all films directed by Steven Spielberg.

3
Q

After finding all films made by Spielberg, you want to concentrate your analysis on the ones made with a particular camera model. What name is given to this type of analysis?

A

Univariate Analysis

4
Q

Pick the right word for the gaps….

______ pertain to the sample and ________ pertain to the population

A

statistics pertain to the sample and parameters pertain to the population

5
Q

which branch of statistics summarizes and describes data?

A

Descriptive Statistics

6
Q

what type of statistics do you use to help you understand the characteristics of your data?

A

Descriptive Statistics

7
Q

in descriptive statistics, the first step is applying measures of what to your sample data? Why?

A

Using measures of frequency (like the count) to determine the size of the data set
It will help you determine if you can analyse the data simply on your laptop, or will require more processing power than a laptop provides.

8
Q

What’s the most common measure of frequency?

A

Count

9
Q

When measuring the count of a dataset, what must you decide how to handle?

A

How to handle null values

10
Q

What are the 3 measures of frequency mentioned in the book?

A

Count
Percentage
Frequency
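
The three measures can be sketched in a few lines of Python. This is only an illustration: the `genres` list is made-up data, and the choice to exclude nulls from the percentages is an assumption rather than something the book prescribes.

```python
from collections import Counter

# Hypothetical sample; None marks a null (missing) value.
genres = ["drama", "sci-fi", None, "drama", "adventure", "drama", None, "sci-fi"]

total_rows = len(genres)                           # count including nulls
non_null = [g for g in genres if g is not None]    # assumption: nulls are dropped
count = len(non_null)                              # count excluding nulls

frequency = Counter(non_null)                      # how often each value occurs
percentage = {g: 100 * n / count for g, n in frequency.items()}

print(total_rows, count)
print(dict(frequency))
print(percentage)
```

Whether the count should include or exclude nulls depends on the analysis; the point is that the two counts differ, so the decision has to be made explicitly.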

11
Q

A histogram is typically used to visualize what measure when conducting what kind of analysis?

A

Used to visualize frequency when conducting univariate analysis

12
Q

What measure of frequency can help you identify biases in your dataset? Bias must be taken in the context of what?

A

Percentage measures
Bias must be taken within the context of your OBJECTIVES. It’s fine if the percentage of males in a sample is 100% if you’re only concerned with data that should only include men!

13
Q

What are the 3 measures of central tendency?

A

Mean
Median
Mode
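
As a quick illustration, Python's standard `statistics` module computes all three. The `values` list is hypothetical data chosen so each measure comes out differently.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]   # hypothetical observations

mean = statistics.mean(values)      # sum / count = 30 / 6
median = statistics.median(values)  # even count: average of the two middle values (3 and 5)
mode = statistics.mode(values)      # most frequently occurring value

print(mean, median, mode)
```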

14
Q

The Mean is also known as?

A

The average

15
Q

When calculating the mid-point value (Median) of an even number of data observations, what must you do?

A

Add together the two values closest to the mid-point and divide the sum by 2.

16
Q

What is the calculation that tells you which POSITION (not value) in an ordered list with an odd number of observations is the median? Describe what ‘n’ is.

A

(n + 1) divided by 2, where n = the number of observations.
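
A minimal sketch of the position formula, using a made-up list of five observations. The helper name `median_position` is hypothetical.

```python
def median_position(n: int) -> float:
    """1-based position of the median in an ordered list of n observations."""
    return (n + 1) / 2

ordered = sorted([12, 7, 3, 9, 15])     # hypothetical data, n = 5 (odd)
pos = median_position(len(ordered))     # (5 + 1) / 2 = 3.0
median_value = ordered[int(pos) - 1]    # convert 1-based position to 0-based index

print(pos, median_value)
```

The formula gives a position, so the list has to be sorted before the value at that position is read off.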

17
Q

Which central tendency measure identifies the most frequently occurring observation?

A

the Mode.

18
Q

what are the measures of dispersion mentioned in the book?

A

Range
Distribution
Variance
Standard Deviation
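
The numeric measures in this list can be sketched as follows. The data is hypothetical, and the population versions of variance and standard deviation (`pvariance`, `pstdev`) are used as an assumption; for a sample you would typically use `variance` and `stdev` instead.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations, mean = 5

value_range = max(values) - min(values)    # 9 - 2
variance = statistics.pvariance(values)    # mean of squared deviations from the mean
std_dev = statistics.pstdev(values)        # square root of the variance

print(value_range, variance, std_dev)
```

The variance is in squared units, while the standard deviation is back in the units of the original data.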

19
Q

What’s the name given to the difference between a variable’s max and min values?

A

The Range

20
Q

Why is it that calculating the range on temperature values by themselves won’t help you identify invalid data?

A

Because temperature values can vary widely and have positive and negative values. You need additional information like location and time of year to give context

21
Q

Which tool is effective to visualize a probability distribution? Why?

A

Histogram. Because the shape you see provides additional insights as to how to proceed with analysis.

22
Q

Which theorem states that as sample size increases, the sampling distribution of the sample mean gets closer to a normal distribution?

A

Central Limit Theorem.
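
A small simulation can make the theorem concrete. This is a sketch under stated assumptions: the parent population is exponential (clearly skewed, with mean 1 and standard deviation 1), and the seed, sample size, and number of samples are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 30      # a commonly quoted "sufficiently large" sample size
NUM_SAMPLES = 2000    # hypothetical number of repeated samples

# Draw many samples from the skewed parent population and keep each sample mean.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

centre = statistics.mean(sample_means)    # should sit near the population mean of 1
spread = statistics.stdev(sample_means)   # should sit near 1 / sqrt(30)

print(round(centre, 2), round(spread, 2))
```

Even though individual observations are skewed, the sample means cluster roughly symmetrically around the population mean, which a histogram of `sample_means` would show.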

23
Q

Whilst they look very similar, a frequency histogram and a distribution histogram are different. How?

A

The frequency histograms focus on the raw counts that each interval occurs.
Distribution histograms focus on the shape and spread by looking at how often an interval value occurs in relation to the total number of values
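
The difference can be sketched by computing both versions for the same bins. The `ages` data and the bin width of 10 are made up for illustration.

```python
from collections import Counter

ages = [21, 25, 34, 37, 38, 42, 45, 47, 51, 58]   # hypothetical observations
BIN_WIDTH = 10

# Frequency histogram: raw count of observations in each interval.
counts = Counter((a // BIN_WIDTH) * BIN_WIDTH for a in ages)

# Distribution histogram: each count relative to the total number of values.
relative = {start: n / len(ages) for start, n in counts.items()}

for start in sorted(counts):
    print(f"{start}-{start + BIN_WIDTH - 1}: count={counts[start]}, share={relative[start]:.0%}")
```

Both versions have the same shape; only the vertical scale changes, which is why the two charts look so similar.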

24
Q

Jon is taking a sample of which the parent population is normal. He takes several samples at varying sizes, some of them are less than 30. Would the distribution of these sampling means be skewed?

A

No. They would all be normal because the parent population is also normal.

25
Q

If the parent population is skewed, you may need a sufficiently large sample size to get a normally distributed pattern. How large is ‘sufficient’?

A

Sample sizes of 30 or more are generally considered sufficiently large.

26
Q

Pat is analysing a sample dataset and sees that the mean and the median are far apart. What is the probability distribution most likely going to be? What would it be if they were close together?

A

It will most likely be skewed. If they’re close together, it is more likely to be symmetric, for example normally distributed.

27
Q

if the mean is greater than the median, data may be skewed _____?
If the mean is less than the median, data may be skewed ______?

A

if the mean is greater than the median, data may be skewed RIGHT
If the mean is less than the median, data may be skewed LEFT

28
Q

If a histogram distribution is skewed left, the mean is ______ than the median
If a histogram distribution is skewed right, the mean is _____ than the median

A

If a histogram distribution is skewed left, the mean is LESS than the median
If a histogram distribution is skewed right, the mean is GREATER than the median
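
The mean-versus-median comparison can be sketched as a small helper. The function name `skew_hint` and both datasets are hypothetical, and the result is only a hint: the rule of thumb can fail for unusual distributions.

```python
import statistics

def skew_hint(values):
    """Suggest a skew direction by comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none"

incomes = [20, 22, 25, 27, 30, 95]   # one very large value pulls the mean up
scores = [5, 70, 73, 75, 78, 80]     # one very small value pulls the mean down

print(skew_hint(incomes), skew_hint(scores))
```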

29
Q

Po visualizes data about bus usage and sees there are two separate peaks in the data. What kind of distribution is this called?

A

bimodal distribution

30
Q

You want to understand the variability of a dataset, you have the variance calculation, but don’t have the standard deviation. Can you still infer anything useful from variance alone?

A

Yes. You can determine the magnitude of the deviations from the mean and compare this between different sets of data.

31
Q

Why is standard deviation preferred over variance for understanding the dispersion of data?

A

Standard deviation is preferred because it is expressed in the same units as the original data and allows for easy comparison with other dataset standard deviations

32
Q

Variance emphasizes BLANK whereas standard deviation emphasizes BLANK

A

Variance emphasizes MAGNITUDE whereas standard deviation emphasizes ACTUAL DEVIATION from the mean

33
Q

If a dataset had a mean value of 130 and a standard deviation of 20, what would be the upper and lower limit of one standard deviation?

A

Lower limit = 130 - 20 = 110
Upper limit = 130 + 20 = 150

34
Q

List the 3 values of the empirical rule below:
xx% of values fall within 1 standard deviation
xx% of values fall within 2 standard deviations
xx% of values fall within 3 standard deviations

A

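For normally distributed data, approximately:
68% of values fall within 1 standard deviation of the mean
95% of values fall within 2 standard deviations of the mean
99.7% of values fall within 3 standard deviations of the mean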
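
These percentages can be checked against a normal distribution using `statistics.NormalDist` (Python 3.8+). The mean of 130 and standard deviation of 20 are reused from card 33 purely as an example.

```python
from statistics import NormalDist

mean, sd = 130, 20                 # example values, as in card 33
dist = NormalDist(mu=mean, sigma=sd)

coverage = {}
for k in (1, 2, 3):
    lower, upper = mean - k * sd, mean + k * sd
    # Share of a normal distribution lying between the two limits.
    coverage[k] = dist.cdf(upper) - dist.cdf(lower)
    print(f"within {k} SD: {lower} to {upper}, share = {coverage[k]:.1%}")
```

The computed shares are slightly different from the rounded 68 / 95 / 99.7 figures, which are approximations that are easy to remember.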

35
Q

Ron wants to compare different normal distributions that have different means and standard deviations. The sample sizes are large. How should he do it?

A

He should convert the values into Z-scores to standardize the distribution
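
A minimal sketch of the conversion, assuming each dataset is treated as a full population (so `pstdev` is used). The helper name `z_scores` and both exam datasets are hypothetical.

```python
import statistics

def z_scores(values):
    """Convert each value to the number of standard deviations it lies from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # assumption: population standard deviation
    return [(v - mean) / sd for v in values]

exam_a = [50, 60, 70]      # scored out of 100
exam_b = [500, 600, 700]   # scored out of 1000

print(z_scores(exam_a))
print(z_scores(exam_b))
```

Although the raw scales differ by a factor of ten, both lists produce the same z-scores, which is what makes the two distributions directly comparable.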

36
Q

Jeremy wants to plot the sampling distribution of a very small data set and then compare it with other distributions. What should he do?

A

Convert the values to t-scores to standardize the distribution and plot a t-distribution.

37
Q

What element affects the height of the bell portion and the thickness of the ‘tails’ either side of it?

A

The number of degrees of freedom, which depends on the size of the sampled data. The fewer the degrees of freedom, the lower the bell and the thicker the tails of the distribution.

38
Q

Complete sentence: Z-score calculates a single ? in a ?. A z-statistic calculates a sample ? in a ?

A

Z-score calculates for a SINGLE VALUE in a SAMPLE and a z-statistic calculates for a SAMPLE MEAN from a population of sample means

39
Q

What do you need to know to calculate a z-statistic and what do you need to use instead if you don’t know it?

A

You need to know the population standard deviation. If you don’t, then you need to use the T-statistic which uses the standard deviation of the sample.
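
The two statistics differ only in which standard deviation goes into the denominator. The sketch below uses hypothetical function names and made-up numbers (a hypothesized mean of 13 and a population standard deviation of 4).

```python
import math
import statistics

def z_statistic(sample, hypothesized_mean, population_sd):
    """Uses the KNOWN population standard deviation."""
    standard_error = population_sd / math.sqrt(len(sample))
    return (statistics.mean(sample) - hypothesized_mean) / standard_error

def t_statistic(sample, hypothesized_mean):
    """Uses the SAMPLE standard deviation when the population value is unknown."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.mean(sample) - hypothesized_mean) / standard_error

sample = [12, 14, 16, 18]   # hypothetical sample, mean = 15

z = z_statistic(sample, hypothesized_mean=13, population_sd=4)
t = t_statistic(sample, hypothesized_mean=13)
print(z, round(t, 3))
```

Both results say how many standard errors the sample mean lies from the hypothesized mean; they differ here because the sample standard deviation is not equal to the assumed population value.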

40
Q

a t-score and z-score convert WHAT to a WHAT. And this tells you how many WHAT the value is from WHAT?

A

A t-score and z-score convert an observation value to a standard value. This tells you how many standard deviations the value is from the sample mean.

41
Q

When is a t-score used?
When is a z-score used?

A

A t-score is used where a) the population SD is not known or b) the sample size is less than 30.
A z-score is used for large sample sizes or if the population standard deviation is known.

42
Q

The output value of a z or t-test is known as what?

A

The value is known as the z or t-statistic. They calculate how a sample mean (if the mean is the statistic chosen) relates to a population mean.

43
Q

Assuming you’re looking at the mean, a z-statistic or t-statistic calculates how many WHAT the WHAT is from WHAT?

A

It calculates how many standard deviations away the mean of the sample is from the hypothesized (null) mean

44
Q

Calculating a z-score converts what to what?

A

Converts the values to use a standardized scale. This allows you to compare where observations in a sample fall in relation to each other

45
Q

if you knew the population standard deviation or if sample sizes were sufficiently large, what distribution would you use to be able to do the below two things:
1) compare the dispersion across different variables
2) Calculate probabilities

A

using the z-distribution or standard normal distribution

46
Q

What is the following describing “25% of the data points are as small or smaller AND 75% of the data points are as large or larger”

A

The 1st Quartile or the 25th Percentile

47
Q

Quartile divisions are always at the upper or lower bound of the quartile values?

A

At the upper bound

48
Q

What is used to help identify outliers outside the acceptable range?

A

The Interquartile range
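
One common way to apply this is the 1.5 x IQR rule, sketched below. The rule itself is a widely used convention and an assumption here, not something stated on the card; the data is made up, and `statistics.quantiles` uses its default "exclusive" method, so other tools may give slightly different quartiles.

```python
import statistics

data = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 60]   # hypothetical observations

q1, q2, q3 = statistics.quantiles(data, n=4)   # quartile cut points
iqr = q3 - q1                                  # spread of the middle 50%

# Assumed convention: values beyond 1.5 x IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, outliers)
```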

49
Q

The IQR and the range both indicate the spread of data but what’s the difference between them?

A

The IQR focuses on the spread of the middle 50% of the data whilst the Range considers only the two most extreme values

50
Q

Why is the interquartile range less sensitive to extreme values?

A

Because it doesn’t consider all of the data and its associated outliers; it only looks at the spread of the middle 50% of the data.

51
Q

When you express an estimate for a population parameter, it is good practice to express it in terms of the values it lies between. For example, the mean value of the population lies between 28 and 32. What is this called?

A

A Confidence Interval

52
Q

What two things affect the width of a confidence interval?

A

1) the population variation
2) sample size

53
Q

population with lots of _______ leads to varied samples with high ___________ which leads to ______ confidence intervals

A

population with lots of VARIATION leads to varied samples with high VARIATION which leads to WIDER confidence intervals

54
Q

Small sample sizes will have greater/less variability and will result in a narrower/wider confidence interval

A

Small sample sizes will have GREATER variability and will result in a WIDER confidence interval

55
Q

how do you calculate the standard error which is used to find the confidence interval?

A

divide the sample standard deviation by the square root of the sample size.
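
A minimal sketch of the calculation and how it feeds a confidence interval. The function names and summary figures (mean 30, standard deviation 6, sample size 36) are hypothetical, and using a z critical value is an assumption suited to large samples; a small sample would normally use a t critical value instead.

```python
import math
from statistics import NormalDist

def standard_error(sample_sd, n):
    """Sample standard deviation divided by the square root of the sample size."""
    return sample_sd / math.sqrt(n)

def confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    alpha = 1 - confidence                      # significance level
    z = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.96 for 95% confidence
    margin = z * standard_error(sample_sd, n)
    return sample_mean - margin, sample_mean + margin

se = standard_error(6, 36)                      # 6 / 6 = 1.0
low, high = confidence_interval(30, 6, 36)
print(se, round(low, 2), round(high, 2))
```

Because the standard error has the square root of the sample size in the denominator, a larger sample gives a smaller standard error and therefore a narrower interval.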

56
Q

The null hypothesis (Ho) is the thing that we’re trying to provide evidence for/against?

A

AGAINST

57
Q

The p-value tells us what?

A

How likely it is to get a result at least as extreme as this one if the null hypothesis is true

58
Q

the smaller the p-value, the less likely it is that what?

A

The less likely it is that the result we got was down to pure luck

59
Q

inferential statistics is based on the premise that you can/cannot prove something, but can/cannot disprove something by finding an…..

A

inferential statistics is based on the premise that you CANNOT prove something, but CAN disprove something by finding an exception

60
Q

which hypothesis refers to the status quo, or the thing we’re trying to find evidence against?

A

the Null hypothesis

61
Q

the goal of hypothesis testing is to minimize what?

A

both Type 1 and Type 2 errors.

62
Q

If the standard deviation measures the variability of data points in a sample (or population) then what would the standard deviation of the means of multiple samples be called?

A

The Standard Error

A.K.A. SEM (Standard Error of the Mean)

63
Q

It represents the average variability of sample means around the true population mean.

A

Standard Error

if this is a lot, it could indicate sampling isn’t good enough

64
Q

You hypothesize about the _________. You collect sample data to draw inferences about the ________. We KNOW whether there is a difference between the sample means; we use information about the samples to decide, using the T or Z _____ and the __ value, whether there is evidence to say there is a difference between the two _____________ means you are testing

A

You hypothesize about the POPULATION. You collect sample data to draw inferences about the POPULATION. We KNOW whether there is a difference between the sample means; we use information about the samples to decide, using the T or Z-TEST and the P-VALUE, whether there is evidence to say there is a difference between the two POPULATION means you are testing

65
Q

what is this calculating? 100% - C (confidence level)?

A

this calculates the alpha level

a.k.a. Significance level

66
Q

Which hypothesis test would you conduct to compare categorical variables against what was expected?

A

Chi-squared test

67
Q

in linear regression, what measures how well the regression line fits the observations? What also indicates the strength of this value?

A

Correlation.
The correlation co-efficient indicates the strength of the correlation

68
Q

Does a correlation coefficient of 0.9 mean the observations are weakly or highly correlated?

A

Highly correlated.
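
For reference, a Pearson correlation coefficient can be computed by hand as sketched below. The helper name `pearson_r` and the paired data are hypothetical, and Pearson's r is assumed to be the coefficient meant here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]     # hypothetical paired observations
grades = [2, 4, 5, 4, 5]

r = pearson_r(hours, grades)
print(round(r, 4))
```

The result always falls between -1 and 1; values near either end indicate a strong linear relationship and values near 0 a weak one.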

69
Q

What is the name of the value that helps you understand by how much the sample statistic may differ from the population parameter?

A

The Standard Error

helps determine if your sample is representative of the population

70
Q

They calculate how a sample mean (if the mean is the statistic chosen) relates to a population mean.

A

T-statistic or z-statistic

71
Q

measures the variability or spread of individual data points within a sample or population.

A

Standard Deviation

72
Q

Outlier, Max and Min determine what element of a dataset?

A

Range

73
Q

=MODE.MULT calculates what in Excel?

A

Will find multiple modes in a list of data

datasets can have more than one mode if they have the same frequency

74
Q

type of data that will display skewness where more of the data falls on the left or right side of the mean.

A

non-parametric data

75
Q

Chi-squared test - is used to assess whether there is a statistically significant association between two categorical variables.

A

Test of independence

76
Q

Test that compares the observed frequencies (actual results) with the expected frequencies?

A

Chi-Square test
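
The test statistic itself is simple to compute: sum the squared differences between observed and expected counts, each divided by the expected count. The sketch below uses made-up counts for four categories; the critical value of 7.815 is the standard table value for 3 degrees of freedom at a 0.05 significance level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20, 40]   # hypothetical actual counts per category
expected = [25, 25, 25, 25]   # counts expected if all categories were equally likely

chi2 = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1

CRITICAL_VALUE = 7.815        # table value for df = 3, alpha = 0.05
reject_null = chi2 > CRITICAL_VALUE

print(round(chi2, 2), degrees_of_freedom, reject_null)
```

A statistic above the critical value suggests the observed counts differ from the expected ones by more than chance alone would explain.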

77
Q

It is the statistical association between two (or more) equal variables

A

Correlation

78
Q

What important variables are associated with t-tests or z-tests?

A

Dependent variables
Independent variables

79
Q

in statistical testing - the categorical variable that defines the groups we are comparing

A

Independent Variable

80
Q

the thing that we compare between two groups.

A

Dependent Variable

81
Q

Which test identifies how confident you can be that the results are (or are not) different from what is expected and that there is a relationship between the variables?

A

chi-squared test

82
Q

Type of Chi-Square test - Determines whether a single categorical variable follows a hypothesized distribution

A

Goodness of fit

83
Q

chi-squared test - is used to assess whether there is a statistically significant association between two categorical variables.

A

Test of Independence

84
Q

What term describes a range of values around a sample statistic that is likely to contain the true population parameter?

A

Confidence Interval

85
Q

in a distribution with a left skew, which end will represent the lowest value? Tail end or the other?

A

Tail end.