Chapter 1 Flashcards
Cases
The objects described by a set of data.
Ex. Customers, companies, subjects in a study, stocks
Label
Is a SPECIAL VARIABLE used in some data sets to distinguish the different cases
Variable
Is a characteristic of the case–> different cases can have different values for variables
Observation
Describes the data for a particular case
Categorical Variable
Places a case into one of several groups or categories
Displayed with bar graphs, pie charts, and Pareto charts
Quantitative Variable
Takes numerical values for which arithmetic operations, such as adding and averaging, make sense
Statistical Software
In some statistical software, spaces are not allowed in variable names–> use an underscore instead
Ordered Categorical Variable
A categorical variable whose categories have a natural order–> Ex. possible values for a grade are A, B, C, D, etc., because A is better than B, which is better than C, and so on
Nominal Variable
A categorical variable that is not ordered
Instruments
Different areas of application (such as marketing) can have their own special variables–> these variables are measured with instruments
Rate
Computing a rate is one of several ways of adjusting one variable to create another–> sometimes more meaningful than a count
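A minimal sketch of why a rate can be more meaningful than a count. The store names and figures below are hypothetical, made up only for illustration, and `rate_per_thousand` is an illustrative helper, not a standard function.

```python
def rate_per_thousand(count, population):
    """Adjust a raw count by the size of the group it came from."""
    return 1000 * count / population


# Hypothetical data: (number of complaints, number of customers)
stores = {"Store A": (50, 10_000), "Store B": (30, 2_000)}

rates = {name: rate_per_thousand(count, customers)
         for name, (count, customers) in stores.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} complaints per 1,000 customers")
```

Store A has the larger count, but Store B has the larger rate once the number of customers is taken into account.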
Distribution
Describes how the values of a variable vary from case to case
Pareto Chart
A bar graph whose categories are ordered from MOST frequent to least frequent–> highlights the most important categories of a categorical variable
Ex. frequently used in quality control settings
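A short sketch of the ordering behind a Pareto chart, using the standard-library `Counter`. The defect labels are hypothetical quality-control data invented for the example.

```python
from collections import Counter

# Hypothetical defect log: one entry per defective item
defects = ["scratch", "dent", "scratch", "crack", "scratch", "dent"]

# most_common() returns (category, count) pairs sorted from most to least frequent
pareto_order = Counter(defects).most_common()

for category, count in pareto_order:
    print(f"{category:8s} {'#' * count}")
```

Drawing the bars in this order puts the categories that account for most of the cases at the left of the chart.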
Histogram
The most common graph of the distribution of a quantitative variable, where we group nearby values into classes–> for small data sets a stemplot can be used
How can you describe the overall pattern of a histogram?
You can describe the overall pattern of a histogram by its SHAPE, CENTER, and SPREAD
Outlier
The most important type of deviation–> an individual value that falls outside the overall pattern
When is a distribution symmetric?
If the right and left sides of the histogram are mirror images of each other
Skewed to the right
If the right side of the histogram extends much farther out than the left side–> skewed to the left is the reverse
Positively skewed
Data that skews to the right–> positive skewness is the MOST common type of skewness that we see in real data
Time plot
Plots each observation against the time at which it was measured–> time on the horizontal scale and the variable you are measuring on the vertical scale
Mean
The most common measure of center is the ordinary arithmetic average–> NOT a resistant measure of center as it can be influenced by outliers
Median
The median is the midpoint of a distribution, the number such that half the observations are smaller and half are larger
Median Odd
With an odd number N of observations, the median is the center observation–> located (N+1)/2 observations up from the bottom of the ordered list
Median Even
With an even number of observations, the median is the mean of the two center observations in the ordered list
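The odd and even cases can be sketched in a few lines. This is an illustrative implementation of the rule on these cards, not a library routine (Python's own `statistics.median` does the same job).

```python
def median(values):
    """Median by the location rule: sort, then take the center."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 counting from 1 is index mid counting from 0
        return ordered[mid]
    # Even n: mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([7, 1, 5])       # ordered list: 1, 5, 7
even_result = median([7, 1, 5, 3])   # ordered list: 1, 3, 5, 7
```

With three observations the median is the second value in the ordered list (5). With four it is the mean of the second and third values (4.0).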
Median vs Mean
The median is more resistant than the mean
Median and Mean in a Symmetric Distribution
They are close together–> if the distribution is exactly symmetric, they are exactly the same
Median and Mean in a skewed distribution
The mean is farther out on the long tail than the median
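A small sketch of resistance, using made-up numbers: replacing one observation with an extreme value moves the mean a long way toward the long tail but leaves the median where it was.

```python
from statistics import mean, median

data = [2, 3, 4, 5, 6]            # symmetric: mean and median agree
with_outlier = [2, 3, 4, 5, 100]  # same data with the largest value made extreme

print(mean(data), median(data))                  # both equal 4
print(mean(with_outlier), median(with_outlier))  # mean is pulled up, median stays at 4
```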
The five number summary
Consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation–> written in order from smallest to largest–> a boxplot is a graph of the five-number summary
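A sketch of computing the five numbers. Quartile conventions differ between textbooks and software. This version assumes the first and third quartiles are the medians of the observations below and above the location of the overall median, so other tools may report slightly different quartiles.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]            # observations below the median's location
    upper = ordered[half + n % 2:]    # observations above the median's location
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


summary = five_number_summary([9, 1, 8, 2, 7, 3, 6, 4, 5])
```

For the nine values 1 through 9, this gives a minimum of 1, first quartile 2.5, median 5, third quartile 7.5, and maximum 9.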
The five number summary vs. distribution
The five-number summary is NOT the most common numerical description of a distribution
Most common numerical description of distribution
The mean to measure the center and the standard deviation to measure the spread
Standard deviation
Measures spread by calculating how far the observations are from their mean–> should only be used when the mean is chosen as the measure of center
n-1
Degrees of freedom of the variance or standard deviation
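A sketch of the calculation, showing where n-1 enters: the squared deviations from the mean are summed and divided by n-1 to give the variance, and S is the square root of the variance. `sample_std` is an illustrative name. The standard library's `statistics.stdev` computes the same quantity.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    variance = squared_deviations / (n - 1)   # n - 1 degrees of freedom
    return math.sqrt(variance)


data = [1, 2, 3, 4, 5]
s = sample_std(data)                # square root of 10 / 4
no_spread = sample_std([4, 4, 4])   # all observations equal, so no spread
```

Because S is the square root of an average of squared distances, it is 0 only when every observation equals the mean.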
S=0
Only when there is no spread–> means all the observations have the same value, otherwise S is greater than 0
What does it mean if the standard deviation is higher?
S gets larger when the observations are more spread out about their mean
Units
S has the same units of measurement as the original observations
S and the Mean
Like the mean, S is not resistant–> a few outliers or strong skewness can greatly increase S
How do you measure risk in finance?
By looking at the standard deviation of returns–> large spread–> less predictable–> more risky
BUT the five-number summary would be more informative
Density curve
A density curve is a mathematical model for the distribution of a quantitative variable
What does a density curve describe?
The overall pattern of a distribution. The area under the curve AND within any range of values is the proportion of all observations that fall within that range
68-95-99.7 rule
In a Normal distribution, approximately 68% of observations fall within 1 standard deviation of the mean
Approximately 95% of observations fall within 2 standard deviations of the mean
Approximately 99.7% of observations fall within 3 standard deviations of the mean
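A quick numerical check of the rule, assuming the distribution is Normal (the rule describes Normal distributions and the percentages are rounded). It uses `NormalDist` from Python's standard `statistics` module, available in Python 3.8 and later.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the Normal curve within k standard deviations of the mean
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(f"within {k} standard deviation(s): {proportion:.4f}")
```

The areas come out close to 0.68, 0.95, and 0.997.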
Z-Score
Standardized value–> tells us how many standard deviations the observation falls away from the mean and in which direction
Z-score positive
Observations larger than the mean
Z-score negative
Observations smaller than the mean
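A sketch of standardizing: subtract the mean and divide by the standard deviation. The mean of 100 and standard deviation of 15 are hypothetical values chosen to keep the arithmetic simple.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean (signed)."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


above = z_score(130, mean=100, sd=15)   # observation larger than the mean
below = z_score(85, mean=100, sd=15)    # observation smaller than the mean
```

Here 130 is two standard deviations above the mean (z = 2.0) and 85 is one standard deviation below it (z = -1.0).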
Sample survey
Collects data from a sample of cases that represent a larger population of cases
Observation vs Experiment
In an observational study we do not attempt to influence the responses–> in an experiment we deliberately impose a treatment (change) and observe the responses
Training Data Set
In some studies we use one set of data to generate a set of results
Ex. Building a model to predict something
Database
A store of data from which data sets for statistical analysis can be extracted
Data warehouse
System for organizing, storing, and analyzing complex data
Sampling frame
A list of items to be sampled
Response rate
The proportion of the original sample who actually provide usable data
Undercoverage
Some groups in the population are left out of the process of choosing the sample
Nonresponse
Occurs when a case chosen for the sample cannot be contacted or does not cooperate