Midterm 1 Flashcards
What do measures of central tendency yield?
Measures of central tendency yield information about the center (middle) of a group of numbers
What is the mode?
Mode: the most frequently occurring value in a data set
* Applicable to all levels of data measurement
* Sometimes no mode exists or there is more than one mode (bimodal or multimodal)
* Often used with nominal/ordinal data (e.g., determining the most common hair color or blood type)
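As a small illustration of the multimodal case, the sketch below uses Python's standard `statistics.multimode` on a made-up blood-type sample (the data are hypothetical, not from the course material):

```python
from statistics import multimode

# Hypothetical nominal data: blood types of eight donors.
blood_types = ["O", "A", "O", "B", "A", "AB", "O", "A"]

# multimode returns every value tied for the highest count,
# in the order each value first appears in the data.
modes = multimode(blood_types)
print(modes)  # ['O', 'A'] -> two modes, so this sample is bimodal
```

When every value occurs equally often, `multimode` returns all of them, which corresponds to the "no mode" case.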
What is the median? What are some advantages and disadvantages?
Median: the middle value in an ordered array of numbers
* Arrange the values in order
* The median of the array is the center number, or with an even number of observations, the average of the middle two terms
* Advantage: not affected by extreme values, so often preferable to the mean when the data includes some unusually large or small observations (e.g., income in the U.S., house prices in a given area)
* Disadvantage: it does not include all of the information in the data
* Data measurement level must at least be ordinal
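To see why the median resists extreme values, here is a minimal sketch using Python's `statistics` module. The income figures are invented purely for illustration:

```python
from statistics import mean, median

# Hypothetical annual incomes (made-up numbers for illustration).
incomes = [32_000, 41_000, 45_000, 52_000, 58_000]
with_outlier = incomes + [2_000_000]  # add one extreme value

# Odd count: the median is the single middle value.
print(median(incomes), mean(incomes))
# Even count: the median is the average of the two middle values.
print(median(with_outlier), mean(with_outlier))
```

In this made-up sample the extreme value moves the median only from 45,000 to 48,500, while the mean jumps from 45,600 to more than 370,000.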
What is the arithmetic mean?
Arithmetic Mean: the average of a group of numbers, computed by summing all the values and dividing by the number of values
* Most common measure of central tendency
* Includes all information in the data set
What are percentiles?
Percentiles: measures of central tendency that divide a group of data into 100 parts
* At least n% of the data lie at or below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile
* Example: the 90th percentile indicates that at least 90% of the data are equal to or less than it, and at most 10% of the data lie above it
What are quartiles?
Quartiles: measures of central tendency that divide a group of data into four subgroups
25% of the data set is below the first quartile
50% of the data set is below the second quartile (also called the median)
75% of the data set is below the third quartile
100% of the data set is below the fourth quartile
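Quartile conventions differ between textbooks and software, so treat the following as one possible sketch rather than the course's method. It uses `statistics.quantiles` with its default "exclusive" method on a made-up data set:

```python
from statistics import median, quantiles

# Hypothetical data set of 11 ordered values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)

# The second quartile coincides with the median.
print(q2 == median(data))
```

Other interpolation rules can give slightly different first and third quartiles for the same data, so check which rule your textbook or software uses.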
What are measures of variability?
Measures of variability: describe the spread or dispersion of a set of data
* Distributions may have the same mean but different variability
Explain range and give an advantage and a disadvantage
Range: the difference between the largest and the smallest values in a set of data
* Advantage – easy to compute
* Disadvantage – affected by extreme values
What is interquartile range?
Interquartile range: range of values between the first and third quartile
* Range of the “middle half”; middle 50%
* Useful when analysts are interested in the middle 50% and not the extremes
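The contrast between the range and the interquartile range can be sketched as follows. The helper names `data_range` and `iqr` and both data sets are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which may differ from a textbook's rule:

```python
from statistics import quantiles

def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def iqr(values):
    """Third quartile minus first quartile (default 'exclusive' method)."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

# Two hypothetical data sets that differ only in their largest value.
typical = [2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 16]
extreme = [2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 95]

print(data_range(typical), iqr(typical))
print(data_range(extreme), iqr(extreme))
```

Changing the single largest value from 16 to 95 raises the range from 14 to 93, while the IQR stays the same because the middle 50% is unchanged.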
What is variance?
Variance: average of the squared deviations about the arithmetic mean for a set of numbers
What is the Standard Deviation and what does it allow for?
Standard Deviation: square root of the variance
* Closely related to the variance but more easily interpretable
* The standard deviation allows us to apply the empirical rule and Chebyshev’s Theorem
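The two definitions above can be followed step by step in code. This is a minimal sketch that treats a made-up list as a whole population, so it divides by the number of values:

```python
from math import sqrt

# Hypothetical population of eight values.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = sum(values) / len(values)                   # arithmetic mean
squared_devs = [(x - mu) ** 2 for x in values]   # squared deviations about the mean
variance = sum(squared_devs) / len(values)       # average of the squared deviations
std_dev = sqrt(variance)                         # square root of the variance

print(mu, variance, std_dev)
```

The variance is in squared units, while the standard deviation is back in the original units of the data, which is why it is easier to interpret.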
Explain the empirical rule
Empirical rule: states the approximate percentage of values that lie within a given number of standard deviations of the mean when the data are normally distributed
* About 68% of values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3
* Data must be (approximately) normally distributed
* Many real-world variables are approximately normal, so the empirical rule is widely used
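The familiar 68 / 95 / 99.7 percentages of the empirical rule can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} SD: {share:.1%}")
```

The percentages do not depend on the particular mean or standard deviation, only on the data being normally distributed.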
Explain Chebyshev’s Theorem
Chebyshev’s theorem: at least 1 − 1/k² of the values lie within k standard deviations of the mean, for any k > 1; this is a lower bound, and if the distribution is close to normal, the actual proportion will be greater
Unlike the empirical rule, it applies to data with any distribution
For example, at least 75% of the data will lie within 2 standard deviations of the mean, no matter how the data are distributed
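The standard statement of Chebyshev's theorem is that at least 1 − 1/k² of the values lie within k standard deviations of the mean, for any k > 1. A minimal sketch of that bound (the function name `chebyshev_bound` is just an illustrative choice):

```python
def chebyshev_bound(k: float) -> float:
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3, 4):
    print(k, chebyshev_bound(k))
```

For k = 2 the bound is 0.75, which is the "at least 75%" figure. Compare this with roughly 95% under the empirical rule, which applies only to normally distributed data.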
Sample variance and standard deviation estimate what? Why is the denominator important?
The sample variance and standard deviation are used as estimators of the population values
The sample formulas divide by (n − 1), whereas the population formulas divide by N; dividing by (n − 1) makes the sample variance an unbiased estimator of the population variance
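Python's `statistics` module exposes both versions, which makes the effect of the denominator easy to see. The data below are made up for illustration:

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical sample of eight observations.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Population formulas divide the sum of squared deviations by the count.
print(pvariance(sample), pstdev(sample))
# Sample formulas divide by (count - 1), giving slightly larger values.
print(variance(sample), stdev(sample))
```

Here the sum of squared deviations is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7, about 4.57.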
Explain z-scores and what the z scores represent if positive or negative
z Scores: represent the number of standard deviations a value (x) is above or below the mean for normally distributed data
Negative z scores indicate that the raw value (x) is below the mean; positive z scores indicate x values above the mean
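A z score is computed as z = (x − mean) / standard deviation. A minimal sketch, with a hypothetical exam whose mean is 70 and standard deviation is 8:

```python
def z_score(x: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Hypothetical exam scores: mean 70, standard deviation 8.
print(z_score(82, 70, 8))  # above the mean -> positive z
print(z_score(62, 70, 8))  # below the mean -> negative z
print(z_score(70, 70, 8))  # exactly at the mean -> zero
```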
What is the coefficient of variation? What does it measure?
The Coefficient of Variation (CV): ratio of the standard deviation to the mean, expressed as a percentage
It measures relative variability, so it can compare data sets with different means or units
The CV can be used as a measure of risk
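Because the CV scales the standard deviation by the mean, it can compare variability across data sets with very different means. The sketch below uses two hypothetical stocks with invented figures:

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

# Hypothetical stock prices: (standard deviation, mean).
cv_a = coefficient_of_variation(4, 20)
cv_b = coefficient_of_variation(8, 80)

print(cv_a, cv_b)
```

Stock B has the larger standard deviation (8 versus 4), but its CV is 10% against 20% for stock A, so relative to its mean it is the less variable of the two.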
What are measures of shape?
Measures of shape: tools that can be used to describe the shape of a distribution of data
What is skewness?
Skewness: a distribution is skewed when it is asymmetrical (lacks symmetry)
The skewed portion is the long, thin part (the tail) of the curve
Explain in depth how measures of central tendency relate to skewness
- The relative positions of the mean, median, and mode indicate the direction of skew
- Symmetric: mean, median, and mode are equal
- Negatively skewed: mean is less than the median, which is less than the mode
- Positively skewed: mode is less than the median, which is less than the mean
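The orderings above can be checked on small made-up data sets. This is only a sketch: the ordering is a rule of thumb for typical unimodal data, and the two lists are hypothetical:

```python
from statistics import mean, median, mode

# Hypothetical data with a long right tail (positively skewed).
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]
# Mirror image of the same data: long left tail (negatively skewed).
left_skewed = [20 - x for x in right_skewed]

for name, data in (("right", right_skewed), ("left", left_skewed)):
    print(name, mode(data), median(data), mean(data))
```

For the right-skewed list the mode (2) is less than the median (3), which is less than the mean (5). The mirrored list reverses the order: mean 15, median 17, mode 18.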
What does Kurtosis describe?
Kurtosis: the amount of peakedness of a distribution
Explain the box-and-whisker plot
Box-and-whisker plot: a diagram that uses the upper and lower quartiles, the median, and the two most extreme values to depict a distribution graphically
These five values are sometimes called the 5-number summary
* A box is drawn around the median with the lower and upper quartiles (Q1 and Q3) as the box endpoints (hinges)
* The interquartile range (IQR = Q3 − Q1) is used to construct the inner fences: Q1 − 1.5 ∙ IQR and Q3 + 1.5 ∙ IQR
* If data fall outside the inner fences, outer fences are constructed: Q1 − 3.0 ∙ IQR and Q3 + 3.0 ∙ IQR
* A line segment (whisker) is drawn from the lower hinge of the box outward to the smallest data value
* A second whisker is drawn from the upper hinge to the largest data value
What is the use of box and whisker plots?
One use of box-and-whisker plots is to find outliers
* Data values that fall outside the mainstream of values in a distribution are called outliers
o Sometimes merely extremes of the data
o Sometimes due to measurement or recording error
o Sometimes so unusual that they should not be considered with the rest of the data
* Values that are outside the inner fences but inside the outer fences are mild outliers
* Values that fall outside the outer fences are extreme outliers
Another use is to determine if the distribution is skewed
* The position of the median in the box gives information about the skew of the middle 50% of the data
o If the median is in the left (lower) part of the box, the middle 50% is skewed right
o If the median is in the right (upper) part of the box, the middle 50% is skewed left
* The length of the whiskers shows the skew of the outer values
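The fence-based outlier check can be sketched as below. It assumes the usual convention that the fences are measured from the quartiles (Q1 − 1.5 ∙ IQR and Q3 + 1.5 ∙ IQR for the inner fences, 3.0 ∙ IQR for the outer fences). The function name `classify_outliers` and the data are hypothetical, and the quartiles come from `statistics.quantiles`, whose default rule may differ from a textbook's:

```python
from statistics import quantiles

def classify_outliers(values):
    """Return inner fences, outer fences, mild outliers, and extreme outliers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    # Mild: outside the inner fences but not beyond the outer fences.
    mild = [x for x in values
            if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
    # Extreme: beyond the outer fences.
    extreme = [x for x in values if x < outer[0] or x > outer[1]]
    return inner, outer, mild, extreme

# Hypothetical data with two unusually large values.
data = [1, 4, 5, 7, 8, 9, 11, 12, 14, 30, 45]
inner, outer, mild, extreme = classify_outliers(data)
print(inner, outer, mild, extreme)
```

With Q1 = 5 and Q3 = 14 the IQR is 9, so the inner fences are −8.5 and 27.5 and the outer fences are −22 and 41. The value 30 is a mild outlier and 45 is an extreme outlier.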
Why does business analytics use descriptive statistics?
- Descriptive statistics are at the foundation of statistical techniques and numerical measures that can be used to gain an initial understanding of data in business analytics
- Descriptive statistics allow a business analyst to begin to mine and understand any meanings and/or relationships that might exist in data
What is a (random) experiment? Give an example
Experiment: a process that produces well-defined outcome(s)
Example: sampling every 200th bottle of cola and weighing it
What is an event? Give an example
Event: an outcome of an experiment
Example: 10 of the sampled bottles are too full
What is an elementary event? Give an example
Elementary event: an event that cannot be decomposed or broken down into other events
o Elementary events are denoted by lowercase letters
o Suppose that the experiment is to roll a die
o Elementary events are to roll a 1, a 2, a 3, etc.
o In this case, there are six elementary events, e1, e2, etc.
What is the sample space?
Sample space: a complete listing of all elementary events (all possible outcomes) for a random experiment
What is the classical method of assigning probability?
The probability of an individual event occurring is the ratio of the number of items in a population that contain the event (n_e) to the total number of items in the population (N): P(E) = n_e / N
- Because n_e can never be greater than N, the highest value of a probability is 1
- The lowest probability, if none of the N possibilities has the desired characteristic, e, is 0
- Thus, 0 ≤ P(E) ≤ 1
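For a die roll, the classical method can be sketched by listing the sample space and counting the outcomes in the event. The function name `classical_probability` is an illustrative choice, and `fractions.Fraction` keeps the ratio exact:

```python
from fractions import Fraction

# Sample space for rolling one die: six equally likely elementary events.
sample_space = [1, 2, 3, 4, 5, 6]

def classical_probability(event, space):
    """Outcomes in the event divided by total outcomes in the sample space."""
    return Fraction(len(event), len(space))

even = [o for o in sample_space if o % 2 == 0]   # event: roll an even number
seven = [o for o in sample_space if o == 7]      # impossible event

print(classical_probability(even, sample_space))
print(classical_probability(seven, sample_space))
print(classical_probability(sample_space, sample_space))
```

The impossible event gives 0 and the whole sample space gives 1, matching the bounds 0 ≤ P(E) ≤ 1.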
What is a priori probability?
A priori probability (classical probability): the probability can be determined before the experiment takes place
What is the relative frequency of occurrence (empirical probability)?
Probability of an event occurring is equal to the number of times the event has occurred in the past divided by the total number of opportunities for the event to have occurred
Based on historical data; the past may or may not be a good predictor of the future
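The relative frequency calculation is a single division over historical counts. In the sketch below, the function name and the shipment figures are hypothetical:

```python
def relative_frequency(times_occurred: int, opportunities: int) -> float:
    """Past occurrences divided by the number of opportunities to occur."""
    if opportunities <= 0:
        raise ValueError("need at least one opportunity")
    return times_occurred / opportunities

# Hypothetical history: 12 of a supplier's last 400 shipments arrived late.
p_late = relative_frequency(12, 400)
print(p_late)
```

The estimate describes the past record only. If conditions change, it may no longer describe future shipments well.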
What is subjective probability? Give an example
- Based on the insights or feelings of the person determining the probability
- Different individuals may (correctly or incorrectly) assign different numeric probabilities to the same event
- The subjective approach is usually limited to experiments that are unrepeatable
Example: an experienced airline mechanic estimates the probability that a particular plane will have a certain type of defect
Explain the Venn diagram structure of probability
- Rectangular area represents the sample space for the random experiment and contains all possible outcomes.
- Circle represents event A and contains only the outcomes that belong to A.
- Shaded region of the rectangle contains all outcomes not in event A.