Mid-Term Exam Flashcards
population
the group of all items (data) of interest.
- frequently very large; sometimes infinite.
sample
a set of items (data) drawn from the population of interest.
- potentially large but much smaller than the population.
- the sample is a subset of the population.
parameter
a descriptive measure of a population.
- Ex. population mean
statistic
a descriptive measure of a sample.
- Ex. sample mean
statistical inference
sample statistics are used to make inferences about population parameters: an estimate, prediction or decision about a population is produced from sample data, so what is known about a sample is applied to the larger population.
numerical data
- values are real numbers
- all calculations are valid
- data may be treated as ordinal or nominal
nominal data
- values are the arbitrary numbers that represent categories
- only calculations based on the frequencies of occurrence, such as proportions, are valid
- data may not be treated as ordinal or numerical
ordinal data
- values must represent the ranked order of the data
- calculations based on an ordering process are valid
- data may be treated as nominal but not as numerical
bar chart
a bar chart is mainly used for nominal data and graphically represents the frequency of each category as a bar rising vertically from the horizontal axis.
- bar height is proportional to frequency of the corresponding category
pie chart
a circle subdivided into slices whose areas are proportional to the frequencies, thereby displaying the proportion of occurrences of each category.
- popular tool to represent proportions of appearance for nominal data
steps to building a histogram (3)
1) collect the data
2) create a frequency distribution for the data
- determine number of classes
- determine class width
3) draw a histogram of rectangle bars using the class intervals and the corresponding frequencies
class width
generally best to use equal class widths. unequal class widths are used when the frequency associated with some classes is too low; then:
- several classes are combined to form a wider and more populated class
- an open-ended class can be formed at the higher or lower end of the histogram
relative frequency
the proportion of observations falling into each class; it should be used when comparing two or more histograms with different numbers of observations.
- often preferable to the frequency itself
class relative frequency (formula)
(class frequency) divided by (total number of observations)
equal class width (formula)
(largest value - smallest value) divided by (number of classes)
cumulative frequency of a class
the number of measurements less than the upper limit of that class.
to obtain the cumulative frequency of a class
add the frequency of that class to the frequencies of all previous classes.
cumulative relative frequency of a particular class
the proportion of measurements that are less than the upper limit of that class.
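Worked example (a minimal Python sketch; the data, the choice of 4 classes and all variable names are made up for illustration). It applies the class width, relative frequency and cumulative frequency formulas from the cards above:

```python
from itertools import accumulate

# Hypothetical data: 10 made-up measurements.
data = [12, 15, 21, 22, 27, 31, 34, 38, 45, 52]
num_classes = 4  # assumed choice, not a rule

# Equal class width = (largest value - smallest value) / number of classes
lowest, highest = min(data), max(data)
class_width = (highest - lowest) / num_classes

# Count how many values fall into each class.
frequencies = [0] * num_classes
for x in data:
    # Class index of x; the largest value is placed in the last class.
    i = min(int((x - lowest) / class_width), num_classes - 1)
    frequencies[i] += 1

n = len(data)
relative = [f / n for f in frequencies]            # class frequency / total
cumulative = list(accumulate(frequencies))         # running total of frequencies
cumulative_relative = [c / n for c in cumulative]  # running total / total

print(class_width, frequencies, cumulative)
```

The relative frequencies sum to 1, and the last cumulative frequency equals the number of observations.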
arithmetic mean
most popular and useful measure of central location.
- all values are used
- it is unique
- the sum of the deviations from the mean is 0
- calculated by summing the values and dividing by the number of values
median of a set of measurements
the value that falls in the middle when the measurements are arranged in order of magnitude.
- unique median for each data set
- commonly used measure of central location
mode of a set of observations
the value that occurs most frequently.
- data set may have one, two or more modes (modal classes)
- useful for all data, mainly used for nominal
- for large data sets, modal class is more relevant than a single-value mode
which measure of central location?
- mean is generally first selection unless outliers are present in the dataset, then the median should be used.
- mode is seldom the best measure of central location.
- median is not as sensitive to extreme values as the mean is.
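Worked example (a minimal Python sketch with made-up data; the variable names are hypothetical). It shows how one extreme value moves the mean far more than the median, and checks that the deviations from the mean sum to 0:

```python
import statistics

values = [3, 4, 4, 5, 6]      # hypothetical data
with_outlier = values + [60]  # same data plus one extreme value

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# multimode returns every value tied for most frequent.
modes = statistics.multimode(values)

# The deviations from the mean sum to 0 (up to rounding error).
deviation_sum = sum(x - mean_before for x in values)

print(mean_before, mean_after)      # the mean shifts a lot
print(median_before, median_after)  # the median barely moves
```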
variance
a measure of dispersion that reflects the values of all the measurements: the average of the squared deviations from the mean.
standard deviation
the square root of the variance of the measurements.
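Worked example (a minimal Python sketch with made-up data). The cards above do not say whether the data form a population or a sample, so both versions are shown: the population variance divides by N and the sample variance divides by n - 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements, mean = 5

pop_var = statistics.pvariance(data)    # sum of squared deviations / N
pop_sd = statistics.pstdev(data)        # square root of pop_var
sample_var = statistics.variance(data)  # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(data)      # square root of sample_var

print(pop_var, pop_sd, sample_var, sample_sd)
```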
empirical rule (for bell-shaped distributions)
- approximately 68% of all observations fall within 1 standard deviation of the mean
- approximately 95% of all observations fall within 2 standard deviations of the mean
- approximately 99.7% of all observations fall within 3 standard deviations of the mean
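Quick check (a minimal Python sketch, assuming an exactly normal distribution). The three percentages are the areas under the standard normal curve within 1, 2 and 3 standard deviations of the mean:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(k, round(p, 4))
```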
probability of an event
the probability P(A) of event A is the sum of the probabilities assigned to the simple events contained in A.
intersection of event A and B
the event that occurs when both A and B occur.
joint probability of A and B
the probability of the intersection of A and B.
conditional probability
conditional probability is used to determine how two events are related; that is, it gives the probability of one event given the occurrence of another related event.
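Worked example (a minimal Python sketch; the joint probabilities are made up). It uses the standard formula P(A given B) = P(A and B) / P(B), with the joint probabilities playing the role of the simple-event probabilities that are summed to get P(A) and P(B):

```python
# Hypothetical joint probabilities of two events A and B (they sum to 1).
joint = {
    ("A", "B"): 0.20,
    ("A", "not B"): 0.20,
    ("not A", "B"): 0.10,
    ("not A", "not B"): 0.50,
}

# The probability of an event is the sum over the outcomes it contains.
p_a = joint[("A", "B")] + joint[("A", "not B")]
p_b = joint[("A", "B")] + joint[("not A", "B")]

# Conditional probability of A given that B has occurred.
p_a_given_b = joint[("A", "B")] / p_b

print(p_a, p_b, p_a_given_b)
```

Here P(A given B) differs from P(A), so knowing that B occurred changes the probability of A: the two events are related.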
discrete random variable
one that takes on a countable number of values (e.g. integers).
continuous random variable
one whose values are uncountable (e.g. any real number in an interval).
discrete probability distribution
a table, formula or graph that lists all possible values a discrete random variable can assume, together with their associated probabilities.
expected value
the weighted average of the possible values a random variable can assume, where the weights are the corresponding probabilities of each xi.
population variance
the weighted average of the squared deviations of the values of x from their mean, where the weights are the corresponding probabilities of each xi.
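Worked example (a minimal Python sketch; the distribution is made up). It computes the expected value and the population variance as probability-weighted averages, as described in the two cards above:

```python
# Hypothetical discrete probability distribution: value -> probability.
distribution = {0: 0.2, 1: 0.5, 2: 0.3}

# Expected value: weighted average of the values.
expected = sum(x * p for x, p in distribution.items())

# Population variance: weighted average of the squared deviations.
variance = sum((x - expected) ** 2 * p for x, p in distribution.items())
std_dev = variance ** 0.5

print(expected, variance, std_dev)
```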
Statistical inference
The process of drawing conclusions about the properties of a population based on information obtained from a sample.
Sampling distribution
The probability distribution of a statistic over all possible samples of a given size; it tells us how close the statistic is likely to be to the parameter.
Standard error
The standard deviation of the sampling distribution of the sample mean.
Central limit theorem
Random sample from a normal population: the sampling distribution of the sample mean is normally distributed
Random sample from any population: the sampling distribution of the sample mean is approximately normal for a large sample size (rule of thumb: n >= 30)
What causes a closer resemblance of the sampling distribution of the sample mean to a normal distribution?
A larger sample size (n)
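Simulation (a minimal Python sketch; the Uniform(0, 1) population, the seed and the number of repetitions are assumptions chosen for illustration). It draws many samples from a non-normal population and compares the spread of the sample means with the standard error formula sigma / sqrt(n):

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed population: Uniform(0, 1), which has mean 0.5 and
# standard deviation sqrt(1/12). It is not a normal population.
sigma = (1 / 12) ** 0.5

def sample_means(n, repetitions=2000):
    """Return the means of `repetitions` random samples of size n."""
    return [
        statistics.fmean([rng.random() for _ in range(n)])
        for _ in range(repetitions)
    ]

results = {}
for n in (5, 30):
    means = sample_means(n)
    observed = statistics.stdev(means)  # spread of the simulated sample means
    predicted = sigma / n ** 0.5        # standard error: sigma / sqrt(n)
    results[n] = (statistics.fmean(means), observed, predicted)
    print(n, round(observed, 4), round(predicted, 4))
```

The sample means centre on the population mean, and their spread shrinks as n grows. This sketch only checks the centre and spread; plotting a histogram of `means` would show the bell shape.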
What does capital N mean?
Population size
When the population size is large relative to the sample size, the finite population correction factor is …
Close to 1 and can be ignored
How large does a population have to be, relative to the sample, to be considered “large”?
At least 20 times larger than the sample size
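Worked example (a minimal Python sketch; the sizes are made up, and it assumes the cards refer to the usual finite population correction factor sqrt((N - n) / (N - 1)), which multiplies the standard error):

```python
def correction_factor(N, n):
    """Finite population correction factor: sqrt((N - n) / (N - 1))."""
    return ((N - n) / (N - 1)) ** 0.5

n = 50  # hypothetical sample size

small_population = correction_factor(N=200, n=n)   # population 4 times the sample
large_population = correction_factor(N=1000, n=n)  # population 20 times the sample

print(round(small_population, 4), round(large_population, 4))
```

With a population 20 times the sample size the factor is already close to 1, which is why it is usually ignored in that case.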
Method for making statistical inferences:
- identify the parameter to be estimated
- specify the parameter's estimator and its sampling distribution
- construct an interval estimator
Types of estimation (2)
- point estimator
- interval estimator
Point estimator
Estimates the value of an unknown parameter using a single value calculated from the sample data.
Interval estimator
Draws inferences about a population by estimating the value of an unknown population parameter by using an interval.
Estimator characteristics (3)
- unbiasedness
- consistency
- relative efficiency
Unbiasedness
An unbiased estimator is one whose expected value is equal to the parameter it estimates.
Consistency
An unbiased estimator is said to be consistent if the difference between the estimator and the population parameter grows smaller as the sample size increases.
Relative efficiency
If there are two unbiased estimators available, the one with a smaller variance is said to be relatively efficient.
Examples of unbiased estimators
- sample mean
- sample median (of the population mean, for a symmetric population)
- sample variance
- sample proportion
Examples of consistent estimators
- sample mean
- sample median
Examples of efficient estimators
Both the sample mean and the sample median are unbiased estimators of the population mean (for a symmetric population). However, the sample median has a greater variance than the sample mean, so the sample mean is relatively efficient when compared to the sample median.
Which is the “best” estimator?
The sample mean, as it is unbiased, consistent and relatively efficient.
The expected value (E(X)) of the sampling distribution of the sample mean equals the population mean…
…for all populations.
As the level of confidence increases…
…the width of the confidence interval also increases.
If the standard deviation is doubled…
…the width 2B is doubled, and vice versa.
When n increases…
…the width of the confidence interval decreases.
The width of the confidence interval (2B) is affected by:
- level of confidence
- population standard deviation
- sample size
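Worked example (a minimal Python sketch; the numbers are made up, `interval_width` is a hypothetical helper, and it assumes a z-interval for the mean with known population standard deviation, so 2B = 2 * z * sigma / sqrt(n)):

```python
from statistics import NormalDist

def interval_width(confidence, sigma, n):
    """Width 2B of a z-interval for the mean when sigma is known."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * sigma / n ** 0.5

base = interval_width(0.95, sigma=10, n=100)
higher_confidence = interval_width(0.99, sigma=10, n=100)  # wider
double_sigma = interval_width(0.95, sigma=20, n=100)       # twice as wide
larger_n = interval_width(0.95, sigma=10, n=400)           # half as wide

print(round(base, 2), round(higher_confidence, 2),
      round(double_sigma, 2), round(larger_n, 2))
```

Quadrupling n halves the width because n enters through its square root.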
Wide confidence intervals provide:
Little information
t-distribution
Mound-shaped and symmetrical around zero.
Degrees of freedom (n-1)
A function of the sample size, which determines how spread out the distribution is compared to the normal distribution.
Purpose of hypothesis testing
To determine whether there is enough statistical evidence in favour of a certain belief about a population parameter.
Rejection region
Consists of all values of the test statistic for which Ho is rejected.
Acceptance region
Consists of all values of the test statistic for which Ho is not rejected.
Critical value
Value that separates the acceptance and rejection regions.
Decision rule
Defines the range of values of the test statistic for which Ho is rejected in favour of HA.
A 90% confidence interval estimate of the population mean can be interpreted to mean…
If we repeatedly draw samples of the same size from the same population, 90% of the values of the sample mean will result in a confidence interval that includes the population mean.
P-value
The minimum level of significance that is required to reject the null hypothesis.
If a hypothesis is not rejected at the 0.10 level of significance it will…
…not be rejected at the 0.05 level.
P-value method:
- Good measure of amount of statistical evidence supporting HA
- Usually computed with statistical software
- Yields same conclusions as rejection region method
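Worked example (a minimal Python sketch; the numbers are made up, `p_value_upper_tail` is a hypothetical helper, and it assumes an upper-tail z-test of the mean with known population standard deviation):

```python
from statistics import NormalDist

def p_value_upper_tail(sample_mean, mu0, sigma, n):
    """p-value for testing Ho: mu = mu0 against HA: mu > mu0 (sigma known)."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)  # test statistic
    return 1 - NormalDist().cdf(z)                # area to the right of z

p = p_value_upper_tail(sample_mean=103, mu0=100, sigma=15, n=36)

# Ho is rejected only when the p-value is below the significance level.
reject_at_10 = p < 0.10
reject_at_05 = p < 0.05

print(round(p, 4), reject_at_10, reject_at_05)
```

Since the p-value here is above 0.10, Ho is not rejected at the 0.10 level and therefore cannot be rejected at the stricter 0.05 level either.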
The statement that the expected value of the difference of two sample means equals the difference of the corresponding population means is…
…always correct.
Description of linear relationship between two variables:
- covariance
- correlation coefficient
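Worked example (a minimal Python sketch; the paired data are made up, and the sample versions with divisor n - 1 are assumed). The correlation coefficient is the covariance divided by the product of the two standard deviations:

```python
# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: average product of deviations (divisor n - 1).
covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Sample standard deviations of x and y.
sd_x = (sum((a - mean_x) ** 2 for a in x) / (n - 1)) ** 0.5
sd_y = (sum((b - mean_y) ** 2 for b in y) / (n - 1)) ** 0.5

# Correlation coefficient: always between -1 and 1.
correlation = covariance / (sd_x * sd_y)

print(covariance, round(correlation, 4))
```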
If the problem objective is to analyse the relationship…
Use correlation and regression analysis
Regression analysis
Used to predict the value of one variable on the basis of other variables.
Deterministic model
An equation or set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables.
Probabilistic model
A model used to capture the randomness that is part of a real-life process.
To create a probabilistic model:
Start with a deterministic model that approximates the relationship we want to model, and add a random term that measures the error of the deterministic model.
Random term (error variable)
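The difference between the actual value of the dependent variable and the value given by the deterministic model (e.g. the actual selling price of a house minus the price estimated from its size).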
Estimated least squares regression line
The least squares method produces a straight line that minimises the sum of the squared differences between the points and the line.
The smaller the sum of the squared differences…
… the better the fit.
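Worked example (a minimal Python sketch; the data are made up and `sum_squared_differences` is a hypothetical helper). It computes the least squares slope and intercept for one independent variable, then compares the sum of squared differences with that of another candidate line:

```python
# Hypothetical paired observations (x = independent, y = dependent).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares estimates of the slope and the intercept.
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x

def sum_squared_differences(b0, b1):
    """Sum of squared vertical differences between the points and the line."""
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

sse_fitted = sum_squared_differences(intercept, slope)
sse_other = sum_squared_differences(2.5, 0.5)  # an arbitrary alternative line

print(round(slope, 4), round(intercept, 4))
print(round(sse_fitted, 4), round(sse_other, 4))
```

Any other straight line gives a larger sum of squared differences than the fitted one.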