Exam 1 Review Flashcards
The process of extracting portions of a data set that are relevant to the analysis is called
subsetting
The methodology of extracting information and knowledge from data to improve a company’s bottom line and enhance the consumer experience
business analytics
How does business analytics benefit companies? (6)
- develop better marketing strategies
- deepen customer engagement
- enhance efficient in procuremnt
- uncover ways to reduce expense
- identify emerging market trends
- mitigate risk and fraud
What topics do business analytics encompass?
- statistics
- computer science
- information systems
What questions do the 3 types of analytics techniques ask?
- Descriptive: What has happened?
- Predictive: What could happen in the future?
- Prescriptive: What should we do?
Data that have been organized, analyzed, and processed in a meaningul and purposeful way
Information
Derived from a blend of data, contextual information, experience, and intuition
Knowledge
Data collected by recording a characteristic of many subjects at the same point in time
cross-sectional data
Data collected over several time periods
Time series data
Provide examples of human-generated and machine-generated, structured and unstructured data
What are the 3 characteristics of big data?
- volume (immense amount)
- velocity (generated at rapid speed)
- variety (different types and forms of data)
When a characteristic of interest differs in kind or degree among various observations
variable
What are the 2 broad types of variable divisions?
- Categorical (qualitative)
- Numerical (quantitative)
What are the 2 types of numerical variables?
provide examples
- continuous
ex: weight, time, height, investment return - discrete (countable)
ex: number of points or children
What are the 4 measurement scales?
Provide definitions and examples
- nominal (categorical): observations just differ by name
- ordinal (categorical): observations can be categorized or ranked (but differences are meaningless)
ex: ratings - interval (numerical): observations can be categorized or ranked (differences are meaningful)
ex: temperatures - ratio (numerical): observations are on interval-scale w/true zero point
ex: grades, weight, time, distance
Process of retrieving, cleansing, integrating, transforming, and enriching data to support subsequent data analysis
Data wrangling
What are the objectives of data wrangling? (3)
- improve data quality
- reduce time and effort required to perform analytics
- help reveal true intelligence in the data
What helps us to verify that the data set is complete or may have missing values
counting & sorting
What allows us to review the range of values for each variable?
sorting data
What are 2 common strategies for dealing with missing values?
Provide definitions and when to use them
- omission (complete-case analysis): exclude missing values
ex: use when amount of missing values is small and expected to be randomly distributed across observations - imputation: replace missing values
ex: may replace with mean; used when variable w/missing values is deemed important
Process of converting data from one format or structure to another
Provide Examples
Data transformation
ex: convert dates into seasons; convert values into natural logarithms; combine height and weight to create BMI
Process of transforming numerical into categorical variables
What are the constraints?
binning
Bins must be consecutive and nonoverlapping
What are 3 common approaches for transforming categorical data?
Explain/provide examples
- category reduction: combining categories
ex: Mon-Fri = Weekdays; “Other” - dummy variables: AKA indicator or binary variable that takes on value of 1 or 0 to describe two cateogires of a variable (n - 1)
- category scores: ex: recode satisfaction survey to numbers
ex: used when data are ordinal and have natural, ordered categories
In addition to binning, another common approach is to create new variables through ____ transformations
mathematical
What are the 3 common measures of central location?
- mean
- median
- mode
What is the measure of relative position?
Explain how it works
percentile
- Approx. p% of observations are less than the pth percentile
- Approx (100-p)% of observations are greater than the pth percentile
If a variable has outliers, which measure of central location is preffered?
median is preferred over mean
What type of variable is mode useful for?
categorial variable
What are the 5 measures of dispersion?
Define them
- Range: max - min
- IRQ: Q3 - Q1
range of middle 50% of oservations - Mean absolute deviation
absolute differences of all observations from mean- Variance: avg of squared differences from mean
- Standard Deviation: square root of variance (lower value means obs closer to mean)
What are the measures of shape? (2)
Define
- Skewness Coefficient: degree of distribution not symmetric about mean
symmetric distribution = 0 - Kurtosis Coefficient: adnormal tails
norm = 3; excess is KC - 3
What are the measures of association? (2)
Define
Covariance: direction of linear relationship (senstitive to units of measure)
Correlation Coefficient: dirent and strength of linear relationship
Identifiers: 0 (no linear relation); 0.12 (weak); 0.8 (strong)
What does a box plot graphically display?
- min
- Q1
- Q2 (median)
- Q3
- max
How are the upper and lower fence calculated on a boxplot graph?
- lower fence: Q1 - (1.5 x IQR)
- upper fence: Q3 + (1.5 x IQR)
Anything greater or less is outlier
What does the Empirical Rule state?
- ~ 68% of all obs fall in between sample mean +/- sample SD
- ~ 95% of all obs fall in between sample mean +/- 2Xsample SD
- ~ 100% of all obs fall in between sample mean +/- 3Xsample SD
The population mean is referred to as a ____ and the sample mean is referred to as a _______.
- parameter
- statistic
What is the z-score used for?
Provide example
- find distance of obs from mean in terms of SD
z score of 2 -> obs is 2 SD above mean
What is standardizing?
When is it commonly used?
converting obs into z-scores
common when dealing w/ variabes measured using different scales
What methods are used to visualize a categorical variable? (2)
- frequency distribution
- bar chart (graphical rep of frequency distribution)
What methods are used to visualize a numerical variable? (2)
- frequency distribution
- histogram (helps see shape of distribution (skewness)
What methods are used to visualize two categorical variables? (2)
- contingency table (frequency for 2 categorical variables)
- stacked column chart
What data visualization techniques can be used with multiple variables? (3)
Explain
- bubble plot (3 numerical variables)
- line chart (connects consecutive obs of numerical variable) (can track changes over time)
- heat map (can identify combinations of categorical variables that have economic significance)
What method is used to visualize two numerical variables?
- scatter plot (shows linear relationship) (can also use for categorical variable)
Reminder: Tableau can extract data from many sources, including Excel
When the value of the repsonse variable is uniquely determined by predictor values
Provide example
Deterministic Relationship
ex: p = mv
When the value of the response variable is not uniquely determined due to other factors
stochastic relationship
A dummy variable can also be callled? (2)
- reference
- benchmark
What is a measure that summarizes how well the sample regression equation fits the data?
Goodness-of-fit
Instead of se2,we generally report the standard deviation of the residual, denoted se, more commonly referred to as…?
the standard error of the estimate
What is the residual in linear regression?
difference btwn the observed and predicted values of variable
What are the Goodness-of-fit measures? (3)
State ideal preferences
- Standard error of the estimate (Se)
smaller Se is preffered - Coefficient of Determination (R2)
never decreases as add more predictor variables to the model; closer to 1, better the fit - Adjusted Coefficient of Determination (adjusted R2)
choose the model w/ the highest adjusted R2 value
We use analysis of variance (ANOVA) in the context of the linear regression model to derive R2.We denote the total variation in y as Σ(yi−y ̄)2, which is the numerator in the formula for the variance of y. What is this total variation called?
Total sum of squares
What is a good solution when confronted with multicollinearity?
- drop one of the collinear variables
- obtain more data b/c the sample correlation may get weaker
- sometimes, do nothing
The logistic regression model cannot be estimated with standard ordinary least squares (OLS) procedures. Instead, we rely on which method?
Maximum likelihood estimation (MLE)
In the holdout method we partition the data into two independent and mutually exclusive data sets. What are they called?
- training set
- validation set
Often it is preferable to use the k-fold cross-validation method, where we partition the data into k subsets, and the one that is left out in each iteration is the ____ set.
validation
What are the other performance measures for logistic regression?
Define them
- accuracy: making sure the #’s are accurate
- sensitivity: proportion of target class cases that are classified correctly
- specificity: proportion of nontarget class cases that are classified correctly