Exam 1: Lectures 1, 2, 3 Flashcards
Business Analytics
refers to the skills, technologies, and practices for continuous iterative exploration and investigation of past business performance (e.g., sales and return on investment) to gain insight and drive business planning
Descriptive Analytics
Tools that summarize what happened
Prescriptive Analytics
Statistical techniques that make predictions and then suggest decision options to take advantage of the predictions
Predictive Analytics
A variety of statistical techniques that analyze data to make predictions about the future
Business Analytics Advantages
1) Drive Revenue
2) Save Money
3) Encourage Experimentation
4) Side-step Politics
5) Persuade Executives
4 Key Challenges in Doing Business Analytics
1) Managing 6V’s of Big Data
2) Growth of Unstructured Data
3) Underestimating the Hard Work
4) Hiring the Right Person(s)
4 Main Elements of Data-Driven Tasks
1) Data Access
2) Data Management
3) Data Analysis
4) Data Presentation
Model
an abstraction of a real problem that tries to capture the essence and key features of the problem
Key Challenges of Managing 6V’s of Big Data
1) Volume
2) Velocity
3) Variety
4) Volatility
5) Validity
6) Value
Volume
Big data implies large volumes of data
Velocity
It is the speed of data processing
Variety
Data comes from many sources and in many types, both structured and unstructured
Volatility
It refers to how long data is valid and how long it should be stored
Validity
Data should be correct and accurate for the intended use
Seven step modeling process
1) Define the problem
2) Collect and summarize data
3) Develop a model
4) Verify the model
5) Select one or more suitable decisions
6) Present the results to the organization
7) Implement the model and update it over time
graphs
bar charts, pie charts, histograms, scatter charts, and time series graphs
numerical summary measures
counts, percentages, averages, and measures of variability
tables of summary measures
totals, averages, counts, and grouped by categories
population
includes all of the entities of interest in a study (people, households, machines, etc.)
sample
a subset of the population, often randomly chosen and preferably representative of the population as a whole
Four Scales of Measurement
1) Nominal
2) Ordinal
3) Interval
4) Ratio
Nominal
variables have two or more categories with no natural order. Two levels: gender (male and female). Multiple levels: marital status (single, married, divorced, widowed)
ordinal
a categorical variable for which the possible categories are ordered. Example: education level (less than high school, high school, college degree, graduate degree)
interval
the measure is ordered and the distance between adjacent values is equal; however, there is no natural zero. Example: temperature, where the difference between 10°C and 20°C is the same as the difference between 20°C and 30°C
ratio
interval variables with the added condition of a natural zero (origin). Examples: money, sales revenue
interquartile range
the third quartile minus the first quartile
Thus, it is the range of the middle 50% of the data
It is less sensitive to extreme values than the range
variance
essentially the average of the squared deviations from the mean
If Xi is a typical observation, its squared deviation from the mean is (Xi – mean)²
range
the maximum value minus the minimum value
standard deviation
the square root of the variance
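A minimal sketch of the four measures of variability defined on these cards, using only the Python standard library. The data values are made up for illustration. The variance here divides by n - 1 (the usual sample variance), which is why the card says "essentially" the average of the squared deviations.

```python
import math
import statistics

# Made-up sample with one extreme value (30) to show its effect.
data = [2, 4, 4, 5, 7, 9, 11, 30]

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Interquartile range: third quartile minus first quartile.
# statistics.quantiles uses its default "exclusive" method here;
# other software may compute quartiles slightly differently.
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sample variance: squared deviations summed and divided by n - 1.
mean = statistics.mean(data)
variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(data_range, iqr, variance, std_dev)
```

The single extreme value stretches the range far more than the interquartile range, which only looks at the middle 50% of the data.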
skewness
occurs when there is a lack of symmetry
kurtosis
has to do with the “fatness” of the tails of the distribution relative to the tails of a normal distribution
Statisticians generally consider a value as an outlier if
it is more than three standard deviations from the mean
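The three-standard-deviation rule can be sketched as follows. The function name `find_outliers` and the data are hypothetical, chosen only to illustrate the rule.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) > threshold * sd]

# Made-up data: thirty typical values plus one extreme value.
values = [50] * 10 + [52] * 10 + [48] * 10 + [120]
outliers = find_outliers(values)
print(outliers)
```

In very small samples no single value can sit three standard deviations from the mean, so the rule is only meaningful with a reasonable number of observations.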
dummy variable
a 0–1 coded variable for a specific category
It is coded as 1 for all observations in that category and 0 for all observations not in that category
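As a small illustration, a dummy variable for one category could be built like this. The helper name `make_dummy` and the sample values are assumptions for the example.

```python
def make_dummy(values, category):
    """Code each observation as 1 if it is in `category`, otherwise 0."""
    return [1 if v == category else 0 for v in values]

# Made-up marital status column.
marital_status = ["single", "married", "divorced", "married", "widowed"]
married_dummy = make_dummy(marital_status, "married")
print(married_dummy)
```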
bin variable
corresponds to a numerical variable that has been categorized into discrete categories
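One possible way to build a bin variable is to compare each number against a list of cut points. The cut points, labels, and the helper name `to_bin` below are made up for illustration.

```python
import bisect

def to_bin(value, edges, labels):
    """Map a number to a category label.

    `edges` are sorted cut points; a value equal to an edge falls
    into the bin that starts at that edge.
    """
    return labels[bisect.bisect_right(edges, value)]

# Hypothetical age bins.
edges = [30, 50]
labels = ["under 30", "30 to 49", "50 and over"]

ages = [22, 30, 49, 50, 71]
age_bins = [to_bin(a, edges, labels) for a in ages]
print(age_bins)
```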
when a distribution has a negative skew, ____ is typically larger than ____ (the order reverses for a positive skew)
median, mean
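A quick check of how skew pulls the mean away from the median, using two made-up data sets that are mirror images of each other.

```python
import statistics

# Positive (right) skew: a long tail of large values.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

# Negative (left) skew: the mirror image, with a long tail of small values.
left_skewed = [21 - x for x in right_skewed]

for name, data in [("positive skew", right_skewed), ("negative skew", left_skewed)]:
    print(name, "mean:", statistics.mean(data), "median:", statistics.median(data))
```

The extreme value drags the mean toward the tail, while the median stays near the bulk of the data.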
Two Types of Estimators
1) Point Estimators
2) Interval Estimators
Point Estimators
to estimate a population characteristic with a single value
Interval Estimators
to estimate a population characteristic with an interval, or range, of values
simple random sampling mechanism
Under simple random sampling, the sample mean is typically used as a “best guess” for the population mean. This estimate is a point estimate
The accuracy of the point estimate is measured by its standard error. It is the standard deviation of the sampling distribution of the point estimate
A confidence interval (with 95% confidence) for the population mean extends to approximately two standard errors on either side of the sample mean
From the central limit theorem, the sampling distribution of 𝑋̅ is approximately normal when n is reasonably large
There is approximately a 95% chance that any particular 𝑋̅ will be within two standard errors of the population mean μ
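A minimal sketch of the "two standard errors" interval described above. It assumes the standard error of the sample mean is estimated as the sample standard deviation divided by the square root of n; the function name `approx_95_ci` and the simulated data are hypothetical.

```python
import math
import random
import statistics

def approx_95_ci(sample):
    """Approximate 95% confidence interval: mean plus or minus 2 standard errors."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    return mean - 2 * std_error, mean + 2 * std_error

# Simulated sample of 200 observations (made-up parameters).
rng = random.Random(1)
sample = [rng.gauss(100, 15) for _ in range(200)]

lower, upper = approx_95_ci(sample)
print(lower, statistics.mean(sample), upper)
```

The multiplier 2 is the rounded version of 1.96; for small samples a t-based multiplier would be somewhat larger.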
For simple random sampling, if we have 10,000 customers and want to select 1,000 of them at random, each customer should have a ___ chance of being selected
1 in 10
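The 1-in-10 figure is just 1,000 divided by 10,000. A simple random sample of this kind could be drawn as follows; the customer IDs are made up and the seed is fixed only so the sketch is repeatable.

```python
import random

rng = random.Random(42)  # fixed seed only for repeatability

customer_ids = list(range(1, 10_001))       # 10,000 hypothetical customers
selected = rng.sample(customer_ids, 1_000)  # sampling without replacement

# Every customer has the same chance of selection: 1,000 / 10,000.
selection_chance = len(selected) / len(customer_ids)
print(selection_chance)
```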
typical sampling mistakes
1) Unrepresentative sample
2) Biased respondents
3) Low response rate (non-response bias)
4) Biased questions
unrepresentative sample
Sample does not represent population
biased respondents
Respondents incorrectly answer sensitive questions such as annual income
low response rate (non-response bias)
Only a few respondents participate in the survey
biased questions
Poorly worded questions make it hard to interpret what respondents' answers mean
confidence interval
a range of values that we are fairly confident contains the true population value
systematic sampling
is a type of probability sampling method in which sample members from a larger population are selected according to a random starting point and a fixed periodic interval. This interval, called the sampling interval, is calculated by dividing the population size by the desired sample size
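The procedure can be sketched as below: compute the sampling interval, pick a random starting point inside the first interval, then take every interval-th member. The function name `systematic_sample` is hypothetical, and integer division is an assumption for the case where the population size is not an exact multiple of the sample size.

```python
import random

def systematic_sample(population, sample_size, rng):
    """Select members at a fixed interval from a random starting point."""
    interval = len(population) // sample_size  # sampling interval
    start = rng.randrange(interval)            # random start in the first interval
    return [population[start + i * interval] for i in range(sample_size)]

rng = random.Random(7)  # fixed seed only for repeatability
population = list(range(1, 10_001))
sample = systematic_sample(population, 1_000, rng)
print(sample[:5])
```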
stratified sampling
Suppose various subpopulations within the total population can be identified. These subpopulations are strata
Instead of taking a simple random sample from the entire population, it might make more sense to select a simple random sample from each stratum separately
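A minimal sketch of stratified sampling: group the units by stratum, then draw a simple random sample within each group. The function name, the region labels, and the equal sample size per stratum are all assumptions for illustration; in practice the size per stratum is often proportional to the stratum's share of the population.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, per_stratum, rng):
    """Draw a simple random sample of `per_stratum` units from each stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

rng = random.Random(3)  # fixed seed only for repeatability

# Hypothetical customers as (id, region) pairs.
customers = [(i, "east" if i % 3 == 0 else "west") for i in range(300)]
sample = stratified_sample(customers, lambda c: c[1], per_stratum=10, rng=rng)
print({region: len(chosen) for region, chosen in sample.items()})
```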
cluster sampling
the population is separated into clusters, such as cities or city blocks, and then a random sample of the clusters is selected
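Cluster sampling can be sketched as choosing whole clusters at random. In this version every unit in a chosen cluster is kept, which is one common variant and an assumption here; the block names and household labels are made up.

```python
import random

rng = random.Random(5)  # fixed seed only for repeatability

# Hypothetical city blocks, each holding a list of households.
clusters = {
    "block A": ["A1", "A2", "A3"],
    "block B": ["B1", "B2"],
    "block C": ["C1", "C2", "C3", "C4"],
    "block D": ["D1", "D2"],
}

# Randomly select clusters, not individual households.
chosen_clusters = rng.sample(sorted(clusters), 2)

# Keep every household in the selected clusters.
sample = [unit for name in chosen_clusters for unit in clusters[name]]
print(chosen_clusters, sample)
```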
p-value
the probability of obtaining a result at least as extreme as what was actually observed, when the null hypothesis is true
What % of observations fall within 1, 2, or 3 standard deviations of the mean when a variable x follows a normal distribution?
approximately 68%, 95%, and 99.7%
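These percentages can be checked against the standard normal distribution with the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Share of observations within k standard deviations of the mean.
shares = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```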
When do we reject or fail to reject the null hypothesis at the 0.05 significance level?
p-value < 0.05: reject; p-value > 0.05: fail to reject
hypothesis
a claim that can be tested statistically
one-tailed alternative
supported only by evidence in a single direction
two-tailed alternative
supported by evidence in either of two directions
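A sketch that ties the p-value, the 0.05 decision rule, and one- versus two-tailed alternatives together. It assumes a test statistic z that follows a standard normal distribution under the null hypothesis; the function names `p_value` and `decide` and the value z = 1.8 are made up for illustration.

```python
from statistics import NormalDist

def p_value(z, tails="two"):
    """p-value for a z statistic under a standard normal null distribution."""
    std_normal = NormalDist()
    if tails == "two":      # evidence in either direction counts
        return 2 * (1 - std_normal.cdf(abs(z)))
    if tails == "upper":    # one-tailed: only large values count
        return 1 - std_normal.cdf(z)
    if tails == "lower":    # one-tailed: only small values count
        return std_normal.cdf(z)
    raise ValueError("tails must be 'two', 'upper', or 'lower'")

def decide(p, alpha=0.05):
    """Reject the null hypothesis when the p-value is below alpha."""
    return "reject" if p < alpha else "fail to reject"

z = 1.8  # hypothetical test statistic
for tails in ("upper", "two"):
    p = p_value(z, tails)
    print(tails, round(p, 4), decide(p))
```

The same statistic leads to rejection under the one-tailed alternative but not under the two-tailed one, because the two-tailed p-value is twice as large.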
how to deal with missing values in variables
One option is to simply ignore them. Then you will have to be aware of how the software deals with missing values
Another option is to fill in missing values with the average of nonmissing values. We use this option!
A third option is to examine the nonmissing values in the row of a missing value; these values might provide clues on what the missing value should be
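The second option, filling missing values with the average of the nonmissing values, can be sketched as follows. Representing a missing value as `None` and the helper name `fill_with_mean` are assumptions for this example.

```python
import statistics

def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the nonmissing entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

# Made-up income column (in thousands) with two missing entries.
incomes = [40.0, None, 60.0, 50.0, None]
filled = fill_with_mean(incomes)
print(filled)
```

Mean imputation keeps the column's average unchanged but understates its variability, since every filled value sits exactly at the mean.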