Midterm 1 Flashcards
What do measures of central tendency yield?
Measures of central tendency yield information about the center (middle) of a group of numbers
What is the mode?
Mode: the most frequently occurring value in a data set
* Applicable to all levels of data measurement
* Sometimes no mode exists or there is more than one mode (bimodal or multimodal)
* Often used with nominal/ordinal data (e.g., determining the most common hair color or blood type)
What is the median? What are some advantages and disadvantages?
Median: the middle value in an ordered array of numbers
* Array values in order
* The median of the array is the center number, or with an even number of observations, the average of the middle two terms
* Advantage: not affected by extreme values, so often preferable to the mean when the data includes some unusually large or small observations (e.g., income in the U.S., house prices in a given area)
* Disadvantage: it does not include all of the information in the data
* Data measurement level must at least be ordinal
What is the arithmetic mean?
Arithmetic Mean: the average of a group of numbers
* Most common measure of central tendency
* Includes all information in the data set
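As a rough illustration of the mode, median, and mean described above, here is a minimal Python sketch using the standard-library `statistics` module. The `hours` data set is hypothetical, chosen so that one extreme value pulls the mean away from the median.

```python
import statistics

# Hypothetical sample: weekly overtime hours for nine employees.
hours = [2, 3, 3, 4, 5, 5, 5, 7, 38]

mode = statistics.mode(hours)      # most frequently occurring value
median = statistics.median(hours)  # middle value of the ordered data
mean = statistics.mean(hours)      # sum of the values divided by the count

print("mode:", mode)
print("median:", median)
print("mean:", mean)
```

Here the single large value (38) raises the mean above the median, which is why the median is often preferred for data with extreme observations.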
What are percentiles?
Percentiles: measures of central tendency that divide a group of data into 100 parts
* At least n% of the data lie at or below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile
* Example: the 90th percentile indicates that at least 90% of the data are equal to or less than it, and at most 10% of the data lie above it
What are quartiles?
Quartiles: measures of central tendency that divide a group of data into four subgroups
25% of the data set is below the first quartile
50% of the data set is below the second quartile (also called the median)
75% of the data set is below the third quartile
100% of the data set is below the fourth quartile
What are measures of variability?
Measures of variability: describe the spread or dispersion of a set of data
* Distributions may have the same mean but different variability
Explain range and give an advantage and a disadvantage
Range: the difference between the largest and the smallest values in a set of data
* Advantage – easy to compute
* Disadvantage – affected by extreme values
What is interquartile range?
Interquartile range: range of values between the first and third quartile
* Range of the “middle half”; middle 50%
* Useful when analysts are interested in the middle 50% and not the extremes
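The range, quartiles, and interquartile range described above can be sketched in Python as follows. The data set is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; textbooks and software use several slightly different conventions, so cut points may differ.

```python
import statistics

# Hypothetical data set (not from the flashcards).
values = [4, 7, 9, 10, 12, 15, 18, 21, 40]

# Range: largest value minus smallest value.
data_range = max(values) - min(values)

# Quartiles: three cut points that split the ordered data into four groups.
# Assumption: the "inclusive" interpolation method.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1

print("range:", data_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
```

The range (36) is driven by the extreme value 40, while the interquartile range (9.0) ignores it.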
What is variance?
Variance: average of the squared deviations about the arithmetic mean for a set of numbers
What is the Standard Deviation and what does it allow for?
Standard Deviation: square root of the variance
* Closely related to the variance but more easily interpretable
* The standard deviation allows us to apply the empirical rule and Chebyshev’s Theorem
Explain the empirical rule
Used to state the approximate percentage of values that lie within a given number of standard deviations from the mean of a set of data if the data are normally distributed
* Data must be normally distributed
* Since this is common for many things, the empirical rule is widely used
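The percentages behind the empirical rule (roughly 68%, 95%, and 99.7% of values within 1, 2, and 3 standard deviations of the mean) can be checked with a short Python sketch. It assumes a standard normal distribution and uses `statistics.NormalDist` from the standard library; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k) * 100:.1f}%")
```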
Explain Chebyshev’s Theorem
Chebyshev’s theorem tells us at least what percentage of the data will lie within a certain range; if the distribution is closer to normal, the actual amount will be greater
Unlike the empirical rule, data can have any distribution
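For example, at least 75% of the data will lie within 2 standard deviations of the mean, no matter how the data are distributed (in general, at least 1 − 1/k² of the data lie within k standard deviations of the mean, for k > 1)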
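A small Python sketch of Chebyshev's bound, 1 − 1/k² for k > 1, compared against a deliberately skewed data set. The data values and the function name `chebyshev_lower_bound` are hypothetical, and the data are treated as a whole population.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum share of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2

# Hypothetical, strongly right-skewed data set.
data = [1, 1, 1, 2, 2, 3, 4, 5, 8, 30]
mean = statistics.mean(data)
sd = statistics.pstdev(data)  # population standard deviation

k = 2
inside = [x for x in data if abs(x - mean) <= k * sd]
observed_share = len(inside) / len(data)

print("guaranteed share:", chebyshev_lower_bound(k))
print("observed share:", observed_share)
```

The observed share is at least as large as the guaranteed share, even though the data are far from normal.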
Sample variance and standard deviation estimate what? Why is the denominator important?
The sample variance and standard deviation are used as estimators of the population values
The denominator is (n − 1) rather than N, which makes the sample variance an unbiased estimator of the population variance
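To make the role of the denominator concrete, this Python sketch computes the variance and standard deviation both ways for the same hypothetical data: dividing by N (treating the data as a population) and by n − 1 (treating them as a sample).

```python
import math
import statistics

# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations about the arithmetic mean.
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n    # denominator N
sample_variance = squared_deviations / (n - 1)  # denominator n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print("population variance, SD:", population_variance, population_sd)
print("sample variance, SD:", sample_variance, sample_sd)
```

The sample versions are slightly larger; the standard-library functions `statistics.pvariance` and `statistics.variance` follow the same two definitions.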
Explain z-scores and what the z scores represent if positive or negative
z Scores: represent the number of standard deviations a value (x) is above or below the mean for normally distributed data
Negative z scores indicate that the raw value (x) is below the mean; positive z scores indicate x values above the mean
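A z score is computed as z = (x − mean) / standard deviation. A minimal Python sketch, using hypothetical exam scores with an assumed mean of 50 and standard deviation of 10:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical values: mean 50, standard deviation 10.
above = z_score(65, mean=50, sd=10)  # positive: x is above the mean
below = z_score(40, mean=50, sd=10)  # negative: x is below the mean

print(above, below)
```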
What is the coefficient of variation? What does it measure?
The Coefficient of Variation: ratio of the standard deviation to the mean expressed as a percentage
The CV can be used as a measure of risk
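A short Python sketch of the coefficient of variation, CV = (standard deviation / mean) × 100. The two "stocks" and their figures are hypothetical and only illustrate comparing relative variability.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * sd / mean

# Hypothetical price statistics for two stocks.
cv_a = coefficient_of_variation(sd=5, mean=50)
cv_b = coefficient_of_variation(sd=10, mean=200)

print(cv_a, cv_b)
```

Stock B has the larger standard deviation but the smaller CV, so relative to its mean it varies less.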
What are measures of shape?
tools that can be used to describe the shape of a distribution of data
What is skewness?
is when a distribution is asymmetrical or lacks symmetry
Skewed portion is the long, thin part of the curve
Explain in depth how measures of central tendency relate to skewness
- The relative positions of the mean, median, and mode indicate the direction of skew
- Symmetric: mean, median, and mode are equal
- Negatively skewed: mean is less than the median, which is less than the mode
- Positively skewed: mode is less than the median, which is less than the mean
What does Kurtosis describe?
the amount of peakedness of a distribution
Explain the box-and-whisker plot
a diagram that utilizes the upper and lower quartiles along with the median and the two most extreme values to depict a distribution graphically
The five values used (minimum, lower quartile, median, upper quartile, maximum) are sometimes called the 5-number summary
* A box is drawn around the median with the upper and lower quartiles as the box endpoints (hinges)
* The interquartile range is used to construct the inner fences: Q1 − 1.5 ∙ IQR and Q3 + 1.5 ∙ IQR
* If data fall outside the inner fences, outer fences are constructed: Q1 − 3.0 ∙ IQR and Q3 + 3.0 ∙ IQR
* A line segment (whisker) is drawn from the lower hinge of the box outward to the smallest data value
* A second whisker is drawn from the upper hinge to the largest data value
What is the use of box and whisker plots?
One use of box-and-whisker plots is to find outliers
* Data values that fall outside the mainstream of values in a distribution are called outliers
o Sometimes merely extremes of the data
o Sometimes due to measurement or recording error
o Sometimes so unusual that they should not be considered with the rest of the data
* Values that are outside the inner fences but inside the outer fences are mild outliers
* Values that fall outside the outer fences are extreme outliers
Another use is to determine if the distribution is skewed
* The position of the median in the box gives information about the skew of the middle 50% of the data
o If the median is to the left, the middle 50% is skewed right
o If the median is to the right, the middle 50% is skewed left
* The length of the whiskers shows the skew of the outer values
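The fence-based outlier check described above can be sketched in Python. The data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since different conventions can move the fences slightly.

```python
import statistics

# Hypothetical data set containing two unusually large values.
values = [5, 7, 9, 10, 12, 15, 18, 40, 60]

q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Inner fences lie 1.5 IQR beyond the hinges; outer fences lie 3.0 IQR beyond.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

# Mild outliers: outside the inner fences but inside the outer fences.
mild = [v for v in values
        if (outer_low <= v < inner_low) or (inner_high < v <= outer_high)]
# Extreme outliers: outside the outer fences.
extreme = [v for v in values if v < outer_low or v > outer_high]

print("inner fences:", inner_low, inner_high)
print("outer fences:", outer_low, outer_high)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```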
Why does business analytics use descriptive statistics?
- Descriptive statistics are at the foundation of statistical techniques and numerical measures that can be used to gain an initial understanding of data in business analytics
- Descriptive statistics allow a business analyst to begin to mine and understand any meanings and/or relationships that might exist in data
What is a (random) experiment? Give an example
a process that produces well-defined outcome(s)
Sampling every 200th bottle of cola and weighing it
What is an event, give an example
an outcome of an experiment
There are 10 bottles that are too full
What is an elementary event? Give an example
event that cannot be decomposed or broken down into other events
o Elementary events are denoted by lowercase letters
o Suppose that the experiment is to roll a die
o Elementary events are to roll a 1, a 2, a 3, etc.
o In this case, there are six elementary events, e1, e2, etc.
What is the sample space?
a complete listing of all elementary events (all possible outcomes ) for a random experiment
What is the classical method of assigning probability?
The probability of an individual event occurring is determined by the ratio of the number of items in a population that contain the event (ne) to the total number of items in the population (N)
- Because ne can never be greater than N, the highest value of a probability is 1
- The lowest probability, if none of the N possibilities has the desired characteristic, e, is 0
- Thus, 0≤P(E)≤1
What is a priori probability?
A priori (classical) probability: the probability can be determined before the experiment takes place
What is the relative frequency of occurrence (empirical probability)?
Probability of an event occurring is equal to the number of times the event has occurred in the past divided by the total number of opportunities for the event to have occurred
Based on historical data; the past may or may not be a good predictor of the future
What is subjective probability? Give an example
- Based on the insights or feelings of the person determining the probability
- Different individuals may (correctly or incorrectly) assign different numeric probabilities to the same event
- subjective approach is usually limited to experiments that are unrepeatable
Example: an experienced airline mechanic estimates the probability that a particular plane will have a certain type of defect
Explain the Venn diagram structure of probability
- Rectangular area represents the sample space for the random experiment and contains all possible outcomes.
- Circle represents event A and contains only the outcomes that belong to A.
- Shaded region of the rectangle contains all outcomes not in event A.
What is a mutually exclusive event?
- Mutually Exclusive Events
o Events with no common outcomes
o Occurrence of one event precludes the occurrence of the other event
o Example: if you toss a coin and get heads, you cannot get tails
What are collectively exhaustive events?
o Contains all possible elementary events for an experiment
* Rolling a die: {1,2,3,4,5,6}
* Generating a random integer number: {>5, ≤5}
o The sample space for an experiment can be described as mutually exclusive (events do not have any outcome in common) and collectively exhaustive
What are complementary events?
o Given an event X, the complement of X is defined to be the event consisting of all outcomes that are not in X.
o Complementary events are denoted X′ (or X̄), which is pronounced as “not X”
o In any probability application, either event X or its complement X′ must occur.
P(X′) = 1 − P(X)
Explain unions and intersections
- Set notation is the use of braces to group numbers
o The union of sets X, Y is denoted X ∪ Y
- Given two events X and Y, the union of X and Y is defined as the event containing all outcomes belonging to X or Y or both
o The intersection of sets X, Y is denoted X ∩ Y
- An element is part of the intersection if it is in set X and set Y
Describe Addition Laws, when does a special case arise?
The General Law of Addition (addition law) is used to find the probability of the union of two events
* The probability that event X or event Y or both will occur (at least one of the two events will occur).
P ( X ∪ Y ) = P ( X ) + P (Y ) − P ( X ∩ Y )
A special case arises for mutually exclusive events: because P(X ∩ Y) = 0, the law reduces to P(X ∪ Y) = P(X) + P(Y).
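The addition law can be checked with a small Python sketch that treats events as sets of outcomes from one roll of a fair die. The specific events and the helper `prob` are illustrative assumptions; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Sample space for one roll of a fair die.
sample_space = frozenset(range(1, 7))

def prob(event):
    """Classical probability: outcomes in the event over outcomes in the sample space."""
    return Fraction(len(event), len(sample_space))

# General law: X = "even number", Y = "greater than 4" (they share the outcome 6).
x = {2, 4, 6}
y = {5, 6}
union_direct = prob(x | y)
union_by_law = prob(x) + prob(y) - prob(x & y)

# Special case: mutually exclusive events have an empty intersection.
a = {1, 2}
b = {5, 6}
exclusive_union = prob(a | b)

print(union_direct, union_by_law, exclusive_union)
```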
Under addition laws, what is a probability matrix, joint probabilities, and marginal probabilities?
A probability matrix displays the intersection (joint) probabilities along with the marginal probabilities of a given problem
* When values give the probability of the intersection of two events, the probabilities are called joint probabilities.
o Inner cells show joint probabilities
* Marginal probabilities are found by summing the joint probabilities in the corresponding row or column of the joint probability table.
o Outer cells show marginal probabilities
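A probability matrix can be sketched in Python as a dictionary of joint probabilities, with the marginals obtained by summing rows and columns. The category labels (`A1`, `A2`, `B1`, `B2`) and the counts are hypothetical.

```python
from fractions import Fraction

# Hypothetical counts for 100 items cross-classified by two attributes.
counts = {
    ("A1", "B1"): 20, ("A1", "B2"): 30,
    ("A2", "B1"): 10, ("A2", "B2"): 40,
}
total = sum(counts.values())

# Inner cells: joint probabilities P(A ∩ B).
joint = {cell: Fraction(count, total) for cell, count in counts.items()}

# Outer cells: marginal probabilities, summed across each row or column.
row_marginals = {}
col_marginals = {}
for (row, col), p in joint.items():
    row_marginals[row] = row_marginals.get(row, 0) + p
    col_marginals[col] = col_marginals.get(col, 0) + p

print("row marginals:", row_marginals)
print("column marginals:", col_marginals)
```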
What is the Counting rule? give an example
The mn Counting Rule:
* If an operation can be done m ways and a second operation can be done n ways, then there are mn ways for the two operations to occur in order
o A cafeteria offers 5 salads, 4 meats, 8 vegetables, 3 breads, 4 desserts, and 3 drinks
* How many meals are available?
* 5 × 4 × 8 × 3 × 4 × 3 = 5760
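The cafeteria example above extends the mn rule to six operations; a minimal Python check multiplies the number of choices at each step (the dictionary name `choices` is just for illustration).

```python
import math

# Number of options for each part of the meal, from the cafeteria example.
choices = {"salads": 5, "meats": 4, "vegetables": 8,
           "breads": 3, "desserts": 4, "drinks": 3}

# Counting rule: multiply the number of ways each operation can be done.
meals = math.prod(choices.values())

print(meals)  # 5760
```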
Explain sampling from a population with replacement
Sampling from a Population with Replacement:
* Sampling n items from a population of size N with replacement provides N^n possibilities
o Six lottery numbers are drawn from the digits 0 to 9, with replacement: 10^6 = 1,000,000 possibilities
Explain sampling from a population without replacement
Sampling n items from a population of size N without replacement provides NCn = N! / (n!(N − n)!) possibilities
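Both counting formulas can be sketched in a few lines of Python. The "with replacement" case uses the lottery example (six draws from ten digits). The "without replacement" case assumes the order of selection does not matter (combinations), and its numbers (3 items from 20) are hypothetical.

```python
import math

# With replacement: N^n ordered outcomes (six draws from the digits 0 to 9).
with_replacement = 10 ** 6

# Without replacement, order ignored: N! / (n! (N - n)!) combinations.
# Hypothetical example: choosing 3 items from a population of 20.
without_replacement = math.comb(20, 3)

print(with_replacement, without_replacement)
```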
What are independent events?
o The occurrence or nonoccurrence of one event does not affect the occurrence or nonoccurrence of the other event(s)
o The probability of someone wearing glasses is unlikely to affect the probability that the person likes milk
o Many events are not independent
* The probability of carrying an umbrella changes when the weather forecast predicts rain
If events are independent, then:
P(X | Y) = P(X), and P(Y | X) = P(Y)
P(X | Y) is the probability that X occurs given that Y has occurred.
What is conditional probability?
Conditional probability: When the probability of one event is dependent on whether some related event has already occurred.
Conditional probabilities can be computed as the ratio of joint probability to a marginal probability.
What are multiplication laws?
General Law of Multiplication
P ( X ∩ Y ) = P ( X ) ⋅ P (Y | X ) = P (Y ) ⋅ P ( X | Y )
* Used to find the joint probability
What is the special law of multiplication?
Special Law of Multiplication
* If X and Y are independent,
P(X ∩ Y) = P(X) · P(Y)
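The two multiplication laws can be compared with a hypothetical card example in Python: drawing two aces from a standard 52-card deck, first without replacement (dependent draws, general law) and then with replacement (independent draws, special law).

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)

# General law: P(X ∩ Y) = P(X) · P(Y | X).
# Without replacement, only 3 aces remain among 51 cards after the first ace.
p_second_ace_given_first = Fraction(3, 51)
p_both_without_replacement = p_first_ace * p_second_ace_given_first

# Special law: P(X ∩ Y) = P(X) · P(Y) when the draws are independent.
# With replacement, the second draw again has 4 aces among 52 cards.
p_both_with_replacement = p_first_ace * Fraction(4, 52)

print(p_both_without_replacement, p_both_with_replacement)
```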
What are independent events under conditional probability?
Independent Events
If events are independent, then
P ( X | Y ) = P ( X ) and P (Y | X ) = P (Y )
Explain the law of conditional probability
Law of Conditional Probability: the conditional probability of X occurring, given that Y is known or has occurred, is expressed as P(X | Y) = P(X ∩ Y) / P(Y) = P(X) · P(Y | X) / P(Y)
What is Bayes’ Rule?
Bayes’ Rule extends the use of the law of conditional probabilities to allow revision of original probabilities with new information
o The denominator is a weighted average of the conditional probabilities, with the weights being the prior probabilities
o The formula allows statisticians to incorporate new information to revise probability estimates
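A minimal Python sketch of Bayes' Rule as a revision of prior probabilities. The scenario is hypothetical: two machines produce 65% and 35% of output, with assumed defect rates of 8% and 12%, and the question is which machine a defective item probably came from.

```python
from fractions import Fraction

# Prior probabilities: share of output from each (hypothetical) machine.
priors = {"machine_a": Fraction(65, 100), "machine_b": Fraction(35, 100)}
# Conditional probabilities: P(defective | machine), assumed values.
defect_given = {"machine_a": Fraction(8, 100), "machine_b": Fraction(12, 100)}

# Numerators: prior times conditional, i.e. the joint probabilities.
joint = {m: priors[m] * defect_given[m] for m in priors}

# Denominator: conditional probabilities weighted by the priors.
p_defect = sum(joint.values())

# Revised (posterior) probabilities: P(machine | defective).
posterior = {m: joint[m] / p_defect for m in joint}

for machine, p in posterior.items():
    print(machine, float(p))
```

Machine A still has the higher revised probability, but its share drops from 0.65 to about 0.553 once the defect is observed.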
What is statistics?
o A science dealing with the collection, analysis, interpretation, and presentation of numerical data
o Collect data -> analyze data -> interpret data -> present findings
Population Vs Sample
Population: all
A collection of all persons, objects, or items under study
Can be broadly or narrowly defined
Census: gathering data from the whole population
Sample: gathering data on a subset of the population
Should be representative of the whole population
Use information about the sample to infer about the population
What are the two branches of statistics?
Descriptive
Uses data gathered on a group to describe or reach conclusions about that same group
Produces graphical or numerical summaries of data
Inferential
Gathers data from a sample and uses the statistics generated to reach conclusions about the population from which the sample was taken
Sometimes called inductive statistics
What is a parameter?
Parameter: descriptive measure of the population
What is a statistic?
- Statistic: descriptive measure of a sample
What are the levels of data measurement?
Nominal -> ordinal -> interval -> ratio data (levels of data)
What is nominal data?
Nominal: Used only to classify or categorize
No quantitative value statement is implied
Lowest level of measurement
Examples
* Profession (doctor, lawyer…)
* Sex (male, female)
* Eye color (brown, green, blue)
* Location (zip code)
Best way to represent is by pie charts
For example, class section numbers such as 9200 and 9201 are simply names; the numbers carry no quantitative meaning, so the data are nominal. Section 9201 does not have more seats than section 9200
What is ordinal data?
Ordinal: ranking or ordering
Distances between ranks are not always equal
Nominal and ordinal data are nonmetric data or qualitative data because their measurements are imprecise
Example
* Ranking mutual funds by risk
* 50 most-admired companies
* Coffee cup size
Often used in surveys
* Like a professor very much, not very much, worst professor ever
What is interval data?
Interval: numerical data in which the distances between consecutive numbers have meaning
Interval data have equal intervals
Example
* Fahrenheit temperature scale
o The zero point is a matter of convenience or convention
o A temperature of 0 does not mean that there is no temperature
o The amounts of heat between consecutive readings are the same
* Time
o 0 is a value
What is ratio data?
Ratio: numerical data in which the distances between consecutive numbers have meaning and the zero value represents the absence of the characteristic being studied
Highest level of data measurement
Interval and ratio data are called metric or quantitative data because their measurements are precise
* Example
o Volume
o Weight
o Kelvin temperature
Metric Vs Nonmetric data
o Nominal and ordinal (qualitative)
o Interval and ratio (quantitative)
What does parametric statistics require?
require interval or ratio data
What can nonparametric statistics be used with?
can be used with any data, but nominal and ordinal data require nonparametric methods
What is Big data?
A collection of large and complex datasets from different sources that are difficult to process using traditional data management and processing applications
What are the 5 V’s?
Volume
Ever-increasing size of data and databases
Velocity
The speed with which the data are available and can be processed
Variety
Different forms and sources of data
Veracity
Data quality, correctness, and accuracy
Value
Sometimes considered a fifth characteristic
What are the categories of Business Analytics?
- Descriptive analytics
- Predictive analytics
- Prescriptive analytics
What is descriptive analytics?
- Descriptive analytics: takes traditional data and describes what has happened or is happening in a business
o Used to discover hidden relationships and patterns
o Simplest and most commonly used category
o Data visualization
o Also called reporting analytics
What is Predictive Analytics?
Predictive analytics: finds relationships in the data that are not readily apparent with descriptive analytics
o Patterns or relationships are extrapolated forward in time and the past is used to make predictions about the future
o Topics include regression, time-series, forecasting, data mining, statistical modeling, machine learning techniques, decision tree models, and neural networks
What is Prescriptive analytics?
Prescriptive analytics: examines current trends and likely forecasts to make better decisions
o Optimization models are an example of prescriptive analytics
o Takes uncertainty into account, recommends ways to mitigate risks, and tries to foresee the effects of future decisions
o Uses a set of mathematical techniques that determine optimal decisions given a complex set of objectives, requirements, and constraints
o Topics include management science or operations research aimed at optimizing performance of a system such as mathematical programming, simulation, and network analysis
What is data mining?
Data mining: collecting, exploring, and analyzing large volumes of data to uncover hidden patterns to enhance decision making
What is Data visualization?
Data visualization: the study of the visual representation of data; it is employed to convey data or information by presenting it as visual objects
What is a discrete random variable?
Discrete random variable
o If the set of all possible values is at most a finite or a countably infinite number of possible values
o Most of the time, discrete random variables produce nonnegative whole numbers
o Example: a group of 6 people is randomly selected from a population and the number of left-handed people is counted. The random variable is discrete because the only possible values are {0, 1, 2, 3, 4, 5, 6}; it is impossible to obtain a non-whole number.
What are continuous distributions?
Take on values at every point over a given interval
o No gaps or un-assumed values
o Are generated from things that are measured
o Examples are:
Time, weight, height, and volume
o Once this type of data is recorded it becomes discrete data because the data is rounded off to a discrete number
What are the 3 types of discrete distributions?
- Binomial distribution
- Poisson distribution
- Hypergeometric distribution
What are the 6 continuous distributions?
- Uniform distribution
- Normal distribution
- Exponential distribution
- t distribution
- Chi-square distribution
- F distribution
What are the binomial assumptions?
Binomial assumptions
The experiment involves n identical trials
Each trial has only 2 possible outcomes denoted as success or failure
Each trial is independent of the previous trials
The terms p and q remain constant throughout the experiment, where the term p is the probability of getting a success on any one trial and the term q = 1 – p is the probability of getting a failure on any one trial
Binomial trials must be what?
Independent
o This means that either the experiment is by nature one that produces independent trials, or the experiment is conducted with replacement.
Explain the mean and standard deviation for a binomial distribution?
A binomial distribution has an expected value, or long-run average, denoted by μ, which is determined by μ = n · p
The standard deviation of a binomial distribution is denoted by σ and is given by σ = √(n · p · q)
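The binomial mean n · p and standard deviation √(n · p · q) can be sketched in Python and cross-checked against the standard binomial probability function, C(n, x) · p^x · q^(n − x). The values n = 20 and p = 0.3 are hypothetical.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical experiment: 20 trials, success probability 0.3 on each trial.
n, p = 20, 0.3
q = 1 - p

mean = n * p               # expected value (long-run average)
sd = math.sqrt(n * p * q)  # standard deviation

# Cross-check: the mean computed directly from the probabilities.
mean_from_pmf = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))

print("mean:", mean)
print("standard deviation:", round(sd, 3))
print("mean from pmf:", round(mean_from_pmf, 6))
```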
What is the Poisson distribution?
- Poisson distribution focuses only on the number of discrete occurrences over some interval or continuum
- Another discrete distribution
- Has been referred to as the law of improbable events
- Often used to describe the number of random arrivals per some time interval
What are the characteristics of the Poisson distribution?
- Discrete distribution
- Describes rare events
- Each occurrence is independent of the other occurrences
- It describes discrete occurrences over a continuum or interval
- The occurrences in each interval can range from zero to infinity
- The expected number of occurrences must hold constant throughout the experiment
Give an example of a Poisson distribution
- Number of telephone calls per minute at a small business
- Number of hazardous waste sites per province in Canada
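For the telephone-call example, a Python sketch of the standard Poisson probability function, P(x) = λ^x · e^(−λ) / x!, where λ is the expected number of occurrences per interval. The rate of 3.2 calls per minute is a made-up value.

```python
import math

def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when lam are expected per interval."""
    return lam**x * math.exp(-lam) / math.factorial(x)

# Hypothetical rate: an average of 3.2 calls per minute.
lam = 3.2

p_zero = poisson_pmf(0, lam)                                  # no calls in a minute
p_at_most_two = sum(poisson_pmf(x, lam) for x in range(3))    # 0, 1, or 2 calls

print("P(0 calls):", round(p_zero, 4))
print("P(at most 2 calls):", round(p_at_most_two, 4))
```

The function assumes the rate stays constant across intervals and that occurrences are independent, matching the characteristics listed above.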