Midterm Flashcards
Statistic
A numerical measurement describing some characteristic of a sample
Parameter
A numerical measurement describing some characteristic of a population
Population
The complete collection of all elements or subjects (scores, people, measurements, and so on) to be studied
Census
The collection of data from EVERY element in a population
Sample
A subcollection of elements drawn from a population
Discrete data
Results when the number of possible values is either finite or “countable” (dealing with counts)
Continuous data
Results from infinitely many possible values that correspond to some continuous scale covering a range of values without gaps, interruptions, or jumps (often has units of measure attached)
Nominal
Characterized by data that consist of names, labels, or categories only
Ordinal
Can be arranged in some order, but the differences between the data values either cannot be determined or are meaningless
Interval
Similar to the ordinal level, but the difference between any two data values is meaningful. However, there is no natural zero starting point (where none of the quantity is present)
Ratio
Similar to the interval level, but has a natural zero starting point (where zero indicates none of the quantity is present)
Observational study
Observe and measure specific characteristics, but we don’t attempt to modify the subject being studied
Experiment
A treatment is applied to observe its effect on the subjects
Simulation
Mathematical or physical model used to reproduce a situation
Survey
Investigation of characteristics of a population
Placebo
A fake treatment that looks like the real treatment
Placebo effect
Occurs when an untreated subject incorrectly believes that he/she is receiving a treatment and reports an improvement in symptoms
Blinding
A technique in which the subject doesn’t know whether he/she is receiving a treatment or placebo
Single blind
The researcher knows which subject received which treatment, but the subjects do not know
Double blind
Neither the researcher nor the subject knows who received a placebo or treatment
Block
A group of similar subjects (or experimental units) used to test the effectiveness of one or more treatments
Randomized design
A way to assign subjects to blocks through random selection
Controlled design
Experimental units are carefully chosen so that the subjects in each block are similar in the ways that are important
Confounding
Occurs in an experiment when the effect from two or more variables cannot be distinguished from each other
Sample size
- make sure your sample size is large enough; however, an extremely large sample is not necessarily a good sample
- make sure the sample is large enough to see the true nature of the effects
Replication
Helps to confirm results by repeating the experiment
Systematic sampling
Randomly select a starting point through a random number generator and take every kth subject of the population
- Identify and define the pop.
- Determine sample size
- List all members of the pop.
- Determine k by dividing the number of members in the pop. by the desired sample size (pop. size / sample size = k, so take every kth person)
- Choose a random starting point in the pop. list
- Starting at that point in the pop. list, select every kth name on the list until the desired sample size is met
- If the end of the list is reached before the desired sample size is drawn, go to the top of the list and continue
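The steps above can be sketched in Python. This is only an illustration under assumptions: the function name systematic_sample, the made-up roster list, and the use of integer division to get k are not part of the original notes.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick every k-th member after a random start, wrapping around the list."""
    rng = random.Random(seed)                 # seed only makes the example repeatable
    k = len(population) // sample_size        # k = pop. size / sample size (rounded down)
    start = rng.randrange(len(population))    # random starting point in the list
    # Step k positions at a time; the modulo goes back to the top of the list
    return [population[(start + i * k) % len(population)]
            for i in range(sample_size)]

roster = [f"student_{i:02d}" for i in range(40)]   # hypothetical population list
sample = systematic_sample(roster, 8, seed=1)
print(sample)
```

With 40 names and a sample of 8, k works out to 5, so every 5th name after the starting point is taken.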
Convenience sample
A researcher chooses a sample that is convenient or easy for them to access
Sampling error
The difference between a sample result and the true population result; such an error results from chance sample fluctuations
Non-sampling error
Occurs when the sample data are incorrectly collected, recorded, or analyzed (such as selecting a biased sample, using a defective measurement instrument, or copying the data incorrectly)
Quantitative data
Values that answer questions about the quantity or amount (with units) of what is being measured
Categorical data
(Qualitative data) can be separated into different categories that are often distinguished by some nonnumeric characteristic
Multistage samples
Sampling schemes that combine several methods
Randomization
Collect data in an appropriate way (through random selection); otherwise our data are useless
Random sample
Members of a population are selected in a way that each has an equal chance of being selected
Simple random sample (SRS)
Subjects are selected in a way that every possible sample of size n has the same chance of being chosen
- Identify and define the pop.
- Determine the sample size
- List all members of the pop.
- Assign each member of the pop. a consecutive number, starting from zero and ending at one less than the pop. size
- Select an arbitrary starting number from the random number table
- Look for the subject who was assigned that number. If there is a subject with that assigned number, they are in the sample
- Look to the next number in the random number table and repeat steps 6 and 7 until the appropriate number of participants has been selected
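A short Python sketch of the same idea, where random.sample stands in for reading a random number table. The function name simple_random_sample and the numbered members list are made up for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members so that every possible group of size n is equally likely."""
    rng = random.Random(seed)       # seeded only so the example is repeatable
    return rng.sample(population, n)   # sampling without replacement

members = [f"member_{i:02d}" for i in range(30)]   # hypothetical list numbered 00 to 29
chosen = simple_random_sample(members, 5, seed=42)
print(chosen)
```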
Cluster sampling
First divide the population area into sections (clusters), then randomly select some of those clusters, and then choose all members from those selected clusters
- Identify and define the pop.
- determine the sample size
- Identify and define a cluster
- List all clusters
- Estimate the average number of pop. members per cluster
- Determine the desired number of clusters (sample size / average cluster size)
- Choose the desired number of clusters using the simple random sampling technique
- All pop. members in the included clusters are part of the sample
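A minimal Python sketch of cluster sampling, assuming the clusters are already listed in a dictionary. The name cluster_sample and the classroom data are hypothetical.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly pick whole clusters, then keep every member of the picked clusters."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), num_clusters)   # SRS of cluster names
    sample = [member for name in picked for member in clusters[name]]
    return picked, sample

classrooms = {                      # made-up clusters
    "room_A": ["Ann", "Bo", "Cy"],
    "room_B": ["Dee", "Eli"],
    "room_C": ["Fay", "Gus", "Hal", "Ida"],
    "room_D": ["Jo", "Kim", "Lee"],
}
picked, sample = cluster_sample(classrooms, 2, seed=3)
print(picked, sample)
```

Note that the final sample size depends on how large the chosen clusters happen to be.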
Stratified sampling
We subdivide the pop. into at least two different subgroups (or strata) that share the same characteristics (such as age or gender), then draw a sample from each stratum
- identify and define the pop.
- determine the sample size
- Identify the variable and strata for which equal representation is desired
- Classify all members of the population as a member of one stratum
- Choose the desired number of subjects from each stratum using the simple random sampling technique
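The stratified steps above could look like this in Python. It is a sketch under assumptions: stratified_sample, the people list, and the two age groups are invented for the example.

```python
import random
from collections import defaultdict

def stratified_sample(members, stratum_of, per_stratum, seed=None):
    """Group members by stratum, then draw a simple random sample from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in members:
        strata[stratum_of(member)].append(member)   # classify each member into one stratum
    return {name: rng.sample(group, per_stratum) for name, group in strata.items()}

# Hypothetical population: (id, age group) pairs, 10 people in each group
people = [(f"p{i}", "under_30" if i % 2 else "30_plus") for i in range(20)]
picked = stratified_sample(people, lambda person: person[1], 3, seed=0)
print(picked)
```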
Descriptive statistics
To summarize or describe the important characteristics of a set of data (the results of data)
Inferential statistics
We use these methods when we use sample data to make inferences or generalizations about a population
When describing, exploring, and comparing quantitative data sets, the following characteristics of data are usually most important
- Shape
- Center
- Spread
Frequency distribution
List classes (or categories) of values, along with frequencies (or counts) of the number of values that fall into each class
Lower class limits
The smallest numbers that belong to different classes
Upper class limits
The largest numbers that belong to different classes
Class boundaries
The numbers used to separate classes, but without the gaps created by class limits
- find the size of the gap between the upper limit of one class and lower limit of the next
- add half that amount to each upper class limit to find the upper class boundaries
- subtract half of that amount from each lower class limit to find the lower class boundaries
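As a quick check of the three steps above, here is a small Python sketch. The function name class_boundaries and the example class limits are assumptions.

```python
def class_boundaries(limits):
    """Turn (lower limit, upper limit) pairs into class boundaries."""
    gap = limits[1][0] - limits[0][1]   # gap between one upper limit and the next lower limit
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in limits]

limits = [(52, 61), (62, 71), (72, 81)]   # made-up class limits
boundaries = class_boundaries(limits)
print(boundaries)
```

Each upper boundary equals the next lower boundary, so the gaps between classes disappear.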
Class midpoints
Midpoints of the classes, found by adding the lower and upper class limits of each class and dividing by 2
Class width
The difference between two consecutive lower class limits
Steps to a frequency distribution
- Figure out class width:
- class width = (max number - min number) / number of classes (use between 5 and 20 classes)
- that is, range / number of classes
- make it the next largest integer (always round up to have enough categories)
- Start the first class with the min #
- add the class width to the min # to get the next lower limit
- continue to do this until you have the number of classes that are required
- go back and create your upper limits (1 less than the next lower limit)
- last upper limit (either add the class width to the previous upper limit, or ask what the upper limit would be if there was another class)
- Make tallies
- decide which class each data value falls into and put a tally at that class
- Count tallies and put the number in the frequency column
- Find the midpoint of each class
- add the upper and lower class limits together and divide by 2
- Find the relative frequency for each class
- the frequency in that class / total number of frequencies
- Find the cumulative frequency
- the sum of the frequency for that class and all the classes above
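The whole procedure can be sketched in Python for whole-number data. The function name frequency_table and the scores are made up, and "next largest integer" is read here as the first integer strictly greater than range / number of classes, so the maximum always fits in the last class.

```python
import math

def frequency_table(data, num_classes):
    """Build classes, frequencies, midpoints, relative and cumulative frequencies."""
    low, high = min(data), max(data)
    width = math.floor((high - low) / num_classes) + 1   # next largest integer
    rows, running = [], 0
    for i in range(num_classes):
        lower = low + i * width            # first class starts at the min
        upper = lower + width - 1          # 1 less than the next lower limit
        freq = sum(lower <= x <= upper for x in data)   # the tally count
        running += freq
        rows.append({
            "limits": (lower, upper),
            "midpoint": (lower + upper) / 2,
            "freq": freq,
            "rel_freq": freq / len(data),
            "cum_freq": running,
        })
    return width, rows

scores = [52, 55, 61, 63, 64, 68, 70, 71, 71, 75, 78, 80, 83, 85, 88, 90, 94, 99]
width, rows = frequency_table(scores, 5)
for row in rows:
    print(row)
```

Here the range is 47 and 47 / 5 = 9.4, so the class width becomes 10.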
Pie chart
Contains slices of the pie that are proper proportions of the total categorical data
Bar chart
Categories are on the x-axis and frequencies are on the y-axis; bars have gaps between them
Compressed scale
- The scale from 0-100 could be compressed and then continued normally from 100-400
- this is shown by a squiggle
- the bars themselves could also be compressed
Stem and leaf
Represents data by separating each value into two parts: the stem and the leaves
- it shows the same distribution as a histogram, but preserves the raw data
- if your data are too crowded in a row, split each stem into two rows: leaves 0-4 and leaves 5-9
Dot plots
Consist of a graph in which each data value is plotted as a point along a scale of values. Dots representing the same value are stacked, so they also preserve original data values
Scatter plot
A plot of the paired (x,y) data to measure the correlation or association between two quantitative variables
Unimodal distribution/histogram
Has one apparent peak
Bimodal
Histogram has two apparent peaks
Uniform
A histogram that doesn’t appear to have any mode; all the bars are approximately the same height
Symmetric distributions
If you fold the histogram along a vertical line through the middle and have the edges match pretty closely, the histogram is symmetric
Skewed distributions
- the (usually) thinner ends of a distribution are called the tails. If one tail stretches out farther than the other, the histogram is said to be skewed to the side of the longer tail
- a histogram whose longer tail is on the left is said to be skewed left, while a histogram whose longer tail is on the right is said to be skewed right
Relative frequency histogram
These have the same shape as a histogram with frequency, but the frequencies change to relative frequency percents
Frequency polygon
Uses line segments connected to points located directly above class midpoint values
Ogive
A line graph that depicts cumulative frequencies, just as the cumulative frequency table lists cumulative frequencies
Pareto chart
Another bar chart for categorical data where the bars are arranged in descending order according to frequencies
Histogram
A bar graph for quantitative data in which the horizontal scale represents the classes and the vertical scale represents the frequencies. The heights of the bars correspond to the frequency values, and the bars touch: NO GAPS (unless there are gaps in the data)
3 types of center
Mean and median (quantitative)
Mode (categorical)
Mean
Add up all the data values and divide by the sample size
- the mean is generally the most important/most utilized descriptive measurement
Median
- the middle value of the data set arranged in ascending (or descending) order
- if the number of values is odd, the median is the exact middle value
- if the number of values is even, the median is the mean of the two middle values
- often denoted as x with a tilde (x̃)
Mode
- the value in a set of data that occurs the most
- if no value is repeated, we say that it has no mode. However, one could argue that all values are modes…
Outliers
Values that are much higher or lower than the rest of the data
- affect the mean
- have little or no effect on the median
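A small Python sketch makes the three measures of center and the effect of an outlier concrete. The scores are made up; statistics is Python's standard library module.

```python
import statistics

scores = [70, 72, 75, 75, 78, 80]     # made-up quiz scores
with_outlier = scores + [200]          # same scores plus one extreme value

for label, values in [("original", scores), ("with outlier", with_outlier)]:
    print(label,
          "mean:", round(statistics.mean(values), 2),
          "median:", statistics.median(values),
          "mode:", statistics.mode(values))
```

Adding the single value 200 pulls the mean up sharply while the median and mode stay where they were.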
General notes about center measures
- when data are fairly symmetric, the mean and median tend to be about the same, but the mean is usually a better measure of center
- if the data are skewed, the median is the better measure of center
4 measures of variation (spread)
- Range
- Interquartile range
- Variance
- Standard deviation
Range
The distance between the maximum and minimum values
Range = max-min
- affected by outliers
Interquartile range
- not affected by outliers
- the distance between the first and third quartiles
- IQR=Q3-Q1
- Q1 25% of the data lie below the first quartile
- Q2 the median
- Q3 25% of the data lie above the third quartile
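One way to compute the quartiles and IQR in Python is sketched below. It assumes the "median of each half" convention (the middle value is left out of both halves when n is odd); textbooks, calculators, and software use slightly different quartile conventions, so results may differ a little. The function name quartiles and the data are made up.

```python
import statistics

def quartiles(data):
    """Return Q1, Q2, Q3 using the median-of-halves convention."""
    values = sorted(data)
    n = len(values)
    lower_half = values[: n // 2]          # values below the median position
    upper_half = values[(n + 1) // 2:]     # values above the median position
    return (statistics.median(lower_half),
            statistics.median(values),
            statistics.median(upper_half))

data = [2, 4, 5, 7, 8, 10, 12, 15, 18]    # hypothetical data set
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```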
Variance
- main measure of spread
- the variance, notated by s^2, is found by summing the squared deviations from the mean and (almost) averaging them (dividing by n - 1 instead of n)
Standard deviation
The square root of the variance; it is measured in the same units as the original data
- the standard deviation measures roughly how far the values typically are from the mean
- unless otherwise specified, assume the data collected are a sample and find the sample SD
- affected by outliers
Calculator for standard deviation
Use Sx for the sample standard deviation and σx for the population standard deviation
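The variance and both standard deviations can be worked out by hand in Python and checked against the standard library. The data values are made up; the variable names are only for illustration.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]        # made-up values with mean 5
mean = sum(data) / len(data)
squared_devs = [(x - mean) ** 2 for x in data]   # squared deviations from the mean

sample_var = sum(squared_devs) / (len(data) - 1)           # "almost" averaging: divide by n - 1
sample_sd = math.sqrt(sample_var)                          # sample standard deviation
population_sd = math.sqrt(sum(squared_devs) / len(data))   # population standard deviation

# Cross-check against Python's statistics module
print(sample_var, statistics.variance(data))
print(sample_sd, statistics.stdev(data))
print(population_sd, statistics.pstdev(data))
```

The sample standard deviation comes out a little larger than the population one because it divides by n - 1 rather than n.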
Rule of thumb for spread
- When the histogram of our data is fairly symmetric, report the standard deviation, because it is a more accurate measure of spread
- when the histogram of your data is skewed in any direction, report the IQR as an appropriate measure of spread
Five number summary
Of a distribution reports its median, quartiles, and extremes (max and min)
Boxplot
- Is a graphical display of the five-number summary
- are useful when comparing groups
- good at pointing out outliers
Constructing boxplots
- Draw a single vertical axis spanning the range of the data. Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box
- Draw “fences” around the main part of the data
- the upper fence is 1.5 * (IQR) above the upper quartile: Q3 + 1.5(IQR)
- the lower fence is 1.5 * (IQR) below the lower quartile: Q1 - 1.5(IQR)
Note: the fences only help with constructing the boxplot and should not appear in the final display
- anything above the upper fence is an outlier
- anything below the lower fence is an outlier
- Use the fences to grow “whiskers”
- draw lines from the ends of the box up and down to the most extreme data values found within the fences
- if a data value falls outside one of the fences, we do not connect it with a whisker
- Add the outliers by displaying any data values beyond the fences with special symbols
- we often use a different symbol for “far outliers” that are farther than 3 IQRs from the quartiles
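The fence and whisker rules above can be checked with a short Python sketch. The function name boxplot_fences and the data are assumptions, and the quartiles are passed in directly because quartile conventions vary.

```python
def boxplot_fences(data, q1, q3):
    """Compute fences, whisker ends, outliers, and far outliers for a boxplot."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    # "Far outliers" lie more than 3 IQRs beyond the quartiles
    far = [x for x in outliers if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
    return {
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),   # most extreme values within the fences
        "outliers": outliers,
        "far_outliers": far,
    }

data = [2, 4, 5, 7, 8, 10, 12, 15, 40]    # made-up data with one large value
summary = boxplot_fences(data, q1=4.5, q3=13.5)
print(summary)
```

In this example the upper whisker stops at 15 and the value 40 is drawn separately as an outlier.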
Overview of boxplots
- if the tail is longer to the high end it is skewed right
- outliers tell us which way it is skewed
- when in doubt, use the whiskers and outliers
What do boxplots tell us
- the box shows the middle half of the data, between the quartiles
- the height of the box is equal to the IQR
- If the median is roughly centered between the quartiles, then the middle half of the data is roughly symmetric. Thus, if the median is not centered, the distribution is skewed
- the whiskers also show the skewness if they are not the same length
- outliers are plotted out of the way so they do not distort your judgment of skewness, but give them special attention
What do z-scores represent
The number of standard deviations that a value is from the mean
Event
Any collection of results or outcomes of an experiment
Outcome
Result of a single trial
Simple event
An outcome or event that cannot be further broken down into simpler components
Theoretical probability
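What the probability should be: number of favorable outcomes / total number of equally likely outcomes
Experimental probability
The actual probability found after an experiment: number of times the event occurred / number of trials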
Sample space
The set of all possible simple events (outcomes)
Law of large numbers
The more an experiment is repeated the closer the experimental probability will get to the theoretical probability
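A quick simulation sketch of the law of large numbers, assuming a fair six-sided die and a fixed random seed (both chosen only for this example):

```python
import random

rng = random.Random(0)            # fixed seed so the run is repeatable
theoretical = 1 / 6               # theoretical P(rolling a six) on a fair die

sixes = 0
proportions = {}
for roll in range(1, 100_001):
    if rng.randint(1, 6) == 6:
        sixes += 1
    if roll in (10, 100, 1_000, 10_000, 100_000):
        proportions[roll] = sixes / roll    # experimental probability so far

for trials, p in proportions.items():
    print(trials, round(p, 4), "gap:", round(abs(p - theoretical), 4))
```

The gap between the experimental and theoretical probability tends to shrink as the number of rolls grows, though it does not have to shrink at every single step.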
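Odds against
The ratio of the probability that an event does not occur to the probability that it does, P(not A) / P(A), usually written in the form a:b
Mutually exclusive events
Two events are mutually exclusive if they cannot occur at the same time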
Independent events
When the outcome of one event does not affect the probability of the other event
- not the same as mutually exclusive
Dependent events
When the outcome of one event affects the probability of the other event
Conditional probability
The probability of event B occurring, given that event A has already occurred; written P(B | A)
Multiplication rule
P(A and B) is the probability of event A times the probability of event B occurring, given that event A already occurred
- if your events are independent, your second probability won’t be affected by the first so you would just multiply the two probabilities together
- if your events are dependent, you have to calculate the second probability given that the first already occurred
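Here is a worked example of the multiplication rule in Python, using a standard 52-card deck with 4 aces as an assumed scenario. Drawing without replacement gives dependent events; drawing with replacement gives independent events.

```python
from fractions import Fraction

aces, cards = 4, 52

# Dependent: the first card is NOT put back, so the second probability changes
p_first_ace = Fraction(aces, cards)
p_second_ace_given_first = Fraction(aces - 1, cards - 1)
p_two_aces_dependent = p_first_ace * p_second_ace_given_first

# Independent: the first card IS put back, so the second probability is unchanged
p_two_aces_independent = p_first_ace * p_first_ace

print(p_two_aces_dependent)     # 1/221
print(p_two_aces_independent)   # 1/169
```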
Contingency table
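A table of counts that cross-classifies two categorical variables; know how to build one and find probabilities from it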
Simulation
Is a process that behaves the same way as trials of an experiment, so that similar results are produced
random digits tables
Tables of randomly generated digits; recall their use from chapter 1
Graphing calculator
Recall that we can use randInt in the graphing calc
Online random number generators
There are many random number generators online
Random number generator software
There are many software packages with random number generators. Minitab is one of them
Combination rule
When order does not matter and we want to calculate the number of ways (combinations) r items can be selected from n different items
Permutations (where all items are different)
When r items are selected from n available items (without replacement) and order matters
Distinguishable permutations
When some items are identical
Factorial rule
A collection of n different items can be arranged in order n! different ways
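The four counting rules can be tried out in Python. math.factorial, math.perm, and math.comb are standard library functions (perm and comb need Python 3.8 or later); the helper distinguishable_permutations and the example values are made up for illustration.

```python
import math
from collections import Counter

def distinguishable_permutations(items):
    """n! divided by the factorial of each repeated item's count."""
    total = math.factorial(len(items))
    for count in Counter(items).values():
        total //= math.factorial(count)
    return total

print(math.factorial(5))    # factorial rule: arrange 5 different items -> 120
print(math.perm(5, 2))      # permutations: pick 2 of 5, order matters -> 20
print(math.comb(5, 2))      # combinations: pick 2 of 5, order does not matter -> 10
print(distinguishable_permutations("MISSISSIPPI"))   # some letters identical -> 34650
```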
Expected value
The expected value of a discrete random variable represents the average value of the outcomes; this is the same as the mean of the distribution
Random variable
A variable, usually denoted as x, that has a single numerical value, determined by chance, for each outcome of a procedure
Probability distribution
A graph, table, or formula that gives the probability for each value of the random variable
Discrete random variable
Has either a finite number of values or a countable number of values
Continuous random variable
Has infinitely many values, and those values can be associated with measurements on a continuous scale in such a way that there are no gaps or interruptions
- usually has units
Discrete probability distribution
Lists each possible value of the random variable with its corresponding probability
- all of the probabilities must be between 0 and 1
- the sum of the probabilities must equal 1
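A short Python sketch that checks the two requirements above and also computes the expected value (the sum of each value times its probability). The function names and the coin-flip distribution are assumptions for the example; fractions are used so the sum is exact.

```python
from fractions import Fraction

def is_valid_distribution(dist):
    """Every probability is between 0 and 1, and they add up to exactly 1."""
    probs = dist.values()
    return all(0 <= p <= 1 for p in probs) and sum(probs) == 1

def expected_value(dist):
    """Sum of (value * probability) over all values of the random variable."""
    return sum(x * p for x, p in dist.items())

# x = number of heads in two fair coin flips
heads_in_two_flips = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
not_a_distribution = {0: Fraction(1, 2), 1: Fraction(3, 4)}   # sums to more than 1

print(is_valid_distribution(heads_in_two_flips), expected_value(heads_in_two_flips))
print(is_valid_distribution(not_a_distribution))
```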
Binomial distribution
- Procedure has a fixed number of trials
- The trials must be independent
- Each trial must have all outcomes classified into two categories (usually success or failure)
- the probabilities must remain constant for each trial
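When all four conditions hold, the probability of exactly x successes in n trials is given by the standard binomial formula C(n, x) * p^x * (1 - p)^(n - x). The formula is not written out in these notes, so treat the sketch below as a supplement; the function name binomial_pmf and the coin example are made up.

```python
import math

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials with success probability p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 0.5      # e.g. 5 fair coin flips, success = heads
table = {x: binomial_pmf(x, n, p) for x in range(n + 1)}
for x, prob in table.items():
    print(x, prob)
```

The probabilities in the table add up to 1, as any discrete probability distribution must.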
Geometric probability
- Each observation is in one of two categories: success or failure
- The probability is the same for each observation
- Observations are independent
- The variable of interest is the number of trials required to obtain the first success
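Under these conditions the probability that the first success comes on trial k is (1 - p)^(k - 1) * p: k - 1 failures followed by one success. The notes do not state this formula, so the Python sketch below is a supplement; geometric_pmf and the die example are assumptions.

```python
def geometric_pmf(k, p):
    """P(first success happens on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p      # k - 1 failures, then one success

p_six = 1 / 6      # success = rolling a six on a fair die
first_six_on_third_roll = geometric_pmf(3, p_six)
within_three_rolls = sum(geometric_pmf(k, p_six) for k in (1, 2, 3))
print(first_six_on_third_roll, within_three_rolls)
```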
Normal distribution
A continuous random variable whose distribution has a graph that is symmetric and bell-shaped has a normal distribution
- total area under the curve is 1
Z scores
Tell us a value’s distance from the mean in terms of standard deviations
Usual or not usual
Usual values are within two standard deviations of the mean; unusual values are more than two standard deviations away
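A small Python sketch of z scores, assuming the common rule of thumb that a value is usual when its z score is between -2 and 2 (that is, within two standard deviations of the mean). The function names, the mean of 70, and the standard deviation of 10 are made up.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / sd

def is_usual(value, mean, sd, cutoff=2):
    """Usual means within `cutoff` standard deviations of the mean."""
    return abs(z_score(value, mean, sd)) <= cutoff

mean, sd = 70, 10      # hypothetical test with mean 70 and standard deviation 10
for score in (85, 95, 48):
    label = "usual" if is_usual(score, mean, sd) else "unusual"
    print(score, z_score(score, mean, sd), label)
```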
Sampling distribution
Of a mean is the probability distribution of sample means, with all samples having the same sample size n
Central limit theorem
The distribution of sample means will, as the sample size increases, approach a normal distribution
CLT rules
- for sample sizes larger than 30, the distribution of the sample mean can be approximated reasonably well by a normal distribution
- if the original population is itself normally distributed, then the sample means will be normally distributed for any sample size (even 30 or under)
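The central limit theorem can be illustrated with a simulation sketch. Everything here is an assumption chosen for the example: a strongly right-skewed population (exponential with mean 1 and standard deviation 1), samples of size 40, 2,000 repeated samples, and a fixed seed.

```python
import random
import statistics

rng = random.Random(1)     # fixed seed so the run is repeatable
n = 40                     # sample size (larger than 30)
num_samples = 2_000        # how many samples to draw

# Each entry is the mean of one sample of size n from the skewed population
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

center = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)
print(round(center, 3), round(spread, 3), round(1 / n ** 0.5, 3))
```

The sample means cluster around the population mean of 1, and their standard deviation is close to 1 / sqrt(40), the population standard deviation divided by the square root of the sample size. A histogram of sample_means would look roughly bell-shaped even though the population is skewed.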