Exam 1 Review Flashcards

1
Q

The process of extracting portions of a data set that are relevant to the analysis is called

A

subsetting

2
Q

The methodology of extracting information and knowledge from data to improve a company’s bottom line and enhance the consumer experience

A

business analytics

3
Q

How does business analytics benefit companies? (6)

A
  • develop better marketing strategies
  • deepen customer engagement
  • enhance efficiency in procurement
  • uncover ways to reduce expenses
  • identify emerging market trends
  • mitigate risk and fraud
4
Q

What topics does business analytics encompass?

A
  • statistics
  • computer science
  • information systems
5
Q

What questions do the 3 types of analytics techniques ask?

A
  • Descriptive: What has happened?
  • Predictive: What could happen in the future?
  • Prescriptive: What should we do?
6
Q

Data that have been organized, analyzed, and processed in a meaningful and purposeful way

A

Information

7
Q

Derived from a blend of data, contextual information, experience, and intuition

A

Knowledge

8
Q

Data collected by recording a characteristic of many subjects at the same point in time

A

cross-sectional data

9
Q

Data collected over several time periods

A

Time series data

10
Q

Provide examples of human-generated and machine-generated, structured and unstructured data

A
11
Q

What are the 3 characteristics of big data?

A
  • volume (immense amount)
  • velocity (generated at rapid speed)
  • variety (different types and forms of data)
12
Q

When a characteristic of interest differs in kind or degree among various observations

A

variable

13
Q

What are the 2 broad types of variable divisions?

A
  • Categorical (qualitative)
  • Numerical (quantitative)
14
Q

What are the 2 types of numerical variables?

provide examples

A
  • continuous
    ex: weight, time, height, investment return
  • discrete (countable)
    ex: number of points or children
15
Q

What are the 4 measurement scales?

Provide definitions and examples

A
  • nominal (categorical): observations just differ by name
  • ordinal (categorical): observations can be categorized or ranked (but differences are meaningless)
    ex: ratings
  • interval (numerical): observations can be categorized or ranked (differences are meaningful)
    ex: temperatures
  • ratio (numerical): observations are on interval-scale w/true zero point
    ex: grades, weight, time, distance
16
Q

Process of retrieving, cleansing, integrating, transforming, and enriching data to support subsequent data analysis

A

Data wrangling

17
Q

What are the objectives of data wrangling? (3)

A
  • improve data quality
  • reduce time and effort required to perform analytics
  • help reveal true intelligence in the data
18
Q

What helps us verify whether the data set is complete or has missing values?

A

counting & sorting

19
Q

What allows us to review the range of values for each variable?

A

sorting data

20
Q

What are 2 common strategies for dealing with missing values?

Provide definitions and when to use them

A
  • omission (complete-case analysis): exclude missing values
    ex: use when amount of missing values is small and expected to be randomly distributed across observations
  • imputation: replace missing values
    ex: may replace with mean; used when variable w/missing values is deemed important
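
A minimal sketch of the two strategies, assuming a made-up list of ages in which None marks a missing value (the data and variable names are illustrative only):

```python
from statistics import mean

# Hypothetical data: None marks a missing value.
ages = [25, None, 31, 40, None, 24]

# Omission (complete-case analysis): drop observations with missing values.
complete_cases = [a for a in ages if a is not None]

# Imputation: replace each missing value with the mean of the observed values.
fill_value = mean(complete_cases)
imputed = [fill_value if a is None else a for a in ages]

print(complete_cases)
print(imputed)
```

Omission shrinks the data set, while imputation keeps every observation at the cost of inserting estimated values.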
21
Q

Process of converting data from one format or structure to another

Provide Examples

A

Data transformation
ex: convert dates into seasons; convert values into natural logarithms; combine height and weight to create BMI
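
The three examples above could be sketched as follows. The season mapping assumes Northern Hemisphere meteorological seasons, BMI is taken as kg / m^2, and all data values are made up:

```python
from datetime import date
from math import log

def season(d):
    """Map a date to a season (assumes Northern Hemisphere, Dec-Feb = winter)."""
    return ["winter", "spring", "summer", "fall"][(d.month % 12) // 3]

def bmi(weight_kg, height_m):
    """Combine weight and height into body mass index (kg / m^2)."""
    return weight_kg / height_m ** 2

# Hypothetical sales figures converted to natural logarithms.
sales = [100, 1000, 10000]
log_sales = [log(v) for v in sales]

print(season(date(2024, 7, 4)))
print(round(bmi(70, 1.75), 1))
print(log_sales)
```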

22
Q

Process of transforming numerical into categorical variables

What are the constraints?

A

binning

Bins must be consecutive and nonoverlapping
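
A small sketch of binning, assuming hypothetical age cutoffs and labels. Each bin includes its lower edge and excludes its upper edge, so the bins are consecutive and every value lands in exactly one bin:

```python
from bisect import bisect_right

# Hypothetical bins: [0, 18), [18, 35), [35, 65), [65, and up)
edges = [18, 35, 65]
labels = ["minor", "young adult", "middle-aged", "senior"]

def to_bin(value):
    """Return the label of the single bin that contains value."""
    return labels[bisect_right(edges, value)]

ages = [12, 18, 34, 35, 70]
binned = [to_bin(a) for a in ages]
print(binned)
```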

23
Q

What are 3 common approaches for transforming categorical data?

Explain/provide examples

A
  • category reduction: combining categories
    ex: Mon-Fri = Weekdays; “Other”
  • dummy variables: AKA indicator or binary variables; each takes on a value of 1 or 0 to describe two categories of a variable (a variable with n categories needs n - 1 dummy variables)
  • category scores: recode categories as numbers; used when data are ordinal and have natural, ordered categories
    ex: recode satisfaction survey responses to numbers
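
A sketch of dummy variables and category scores, assuming a hypothetical size variable. The helper make_dummies and the score values are illustrative, not from the course material:

```python
def make_dummies(values, reference):
    """Encode values as n - 1 dummy variables, leaving out the reference category."""
    categories = sorted(set(values) - {reference})
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

sizes = ["small", "large", "medium", "small"]

# Three categories, so two dummy variables; "small" is the left-out category.
dummies = make_dummies(sizes, reference="small")

# Category scores: recode ordered categories as numbers (assumed scoring).
scores = {"small": 1, "medium": 2, "large": 3}
scored = [scores[s] for s in sizes]

print(dummies)
print(scored)
```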
24
Q

In addition to binning, another common approach is to create new variables through ____ transformations

A

mathematical

25
Q

What are the 3 common measures of central location?

A
  • mean
  • median
  • mode
26
Q

What is the measure of relative position?

Explain how it works

A

percentile
- Approx. p% of observations are less than the pth percentile
- Approx (100-p)% of observations are greater than the pth percentile
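
As a rough check of the two statements above, this sketch estimates the 90th percentile of the made-up observations 1 to 100 and counts the share on each side. Percentile formulas differ slightly between software packages; this one uses the default method of Python's statistics.quantiles:

```python
from statistics import quantiles

data = list(range(1, 101))  # hypothetical observations 1..100

# quantiles(n=100) returns 99 cut points; cuts[p - 1] estimates the pth percentile.
cuts = quantiles(data, n=100)
p = 90
p90 = cuts[p - 1]

share_below = sum(x < p90 for x in data) / len(data)
share_above = sum(x > p90 for x in data) / len(data)
print(p90, share_below, share_above)
```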

27
Q

If a variable has outliers, which measure of central location is preferred?

A

median is preferred over mean
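
A quick illustration with made-up salaries (in thousands): adding one extreme value pulls the mean far more than the median.

```python
from statistics import mean, median

salaries = [40, 42, 45, 47, 50]   # hypothetical values
with_outlier = salaries + [500]   # add one extreme observation

print(mean(salaries), median(salaries))
print(mean(with_outlier), median(with_outlier))
```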

28
Q

What type of variable is mode useful for?

A

categorical variable

29
Q

What are the 5 measures of dispersion?

Define them

A
  • Range: max - min
  • IQR: Q3 - Q1
    range of middle 50% of observations
  • Mean absolute deviation
    avg of absolute differences of all observations from mean
  • Variance: avg of squared differences from mean
  • Standard Deviation: square root of variance (lower value means obs closer to mean)
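
The five measures can be sketched on a small made-up data set. This assumes the population versions of variance and standard deviation (divide by n); the sample versions divide by n - 1 (statistics.variance and statistics.stdev). Quartile conventions also vary by software:

```python
from statistics import mean, pstdev, pvariance, quantiles

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

data_range = max(data) - min(data)
q1, q2, q3 = quantiles(data, n=4)         # quartiles (default "exclusive" method)
iqr = q3 - q1
m = mean(data)
mad = mean(abs(x - m) for x in data)      # mean absolute deviation
variance = pvariance(data)                # population variance
sd = pstdev(data)                         # population standard deviation

print(data_range, iqr, mad, variance, sd)
```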
30
Q

What are the measures of shape? (2)

Define

A
  • Skewness Coefficient: degree of distribution not symmetric about mean
    symmetric distribution = 0
  • Kurtosis Coefficient: whether the tails are heavier or lighter than those of a normal distribution
    normal distribution = 3; excess kurtosis is KC - 3
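
A sketch of both coefficients using simple population moments. This is an assumption for illustration; spreadsheet and statistics packages often apply sample-size adjustments and may report excess kurtosis directly:

```python
from statistics import mean

def shape_measures(data):
    """Return (skewness, kurtosis) computed from population central moments."""
    m = mean(data)
    m2 = mean((x - m) ** 2 for x in data)
    m3 = mean((x - m) ** 3 for x in data)
    m4 = mean((x - m) ** 4 for x in data)
    return m3 / m2 ** 1.5, m4 / m2 ** 2

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical data with a long right tail

skew_sym, kurt_sym = shape_measures(symmetric)
skew_right, _ = shape_measures(right_skewed)
print(skew_sym, kurt_sym - 3, skew_right)
```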
31
Q

What are the measures of association? (2)

Define

A

Covariance: direction of linear relationship (sensitive to units of measure)
Correlation Coefficient: direction and strength of linear relationship
Identifiers: 0 (no linear relation); 0.12 (weak); 0.8 (strong)
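
To see why covariance is sensitive to units while correlation is not, this sketch computes both by hand (sample versions, dividing by n - 1) on made-up data, then rescales x from meters to centimeters:

```python
from math import sqrt
from statistics import mean

def covariance(x, y):
    """Sample covariance of two equal-length lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

x_m = [1.0, 2.0, 3.0, 4.0]        # hypothetical lengths in meters
y = [2.0, 4.1, 5.9, 8.2]          # hypothetical response values
x_cm = [100 * v for v in x_m]     # same lengths in centimeters

print(covariance(x_m, y), covariance(x_cm, y))
print(correlation(x_m, y), correlation(x_cm, y))
```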

32
Q

What does a box plot graphically display?

A
  • min
  • Q1
  • Q2 (median)
  • Q3
  • max
33
Q

How are the upper and lower fence calculated on a boxplot graph?

A
  • lower fence: Q1 - (1.5 x IQR)
  • upper fence: Q3 + (1.5 x IQR)

Anything above the upper fence or below the lower fence is an outlier
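
A minimal sketch of the fence calculation on made-up data. Quartile conventions differ between tools, so Q1 and Q3 here follow the default method of Python's statistics.quantiles:

```python
from statistics import quantiles

data = [10, 12, 13, 14, 15, 16, 18, 45]  # hypothetical observations

q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```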

34
Q

What does the Empirical Rule state?

A
  • ~ 68% of all obs fall within sample mean +/- 1 x sample SD
  • ~ 95% of all obs fall within sample mean +/- 2 x sample SD
  • ~ 99.7% (almost all) of all obs fall within sample mean +/- 3 x sample SD
    (applies to symmetric, bell-shaped distributions)
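
These percentages come from the normal distribution. As a quick check, the sketch below computes the exact share within 1, 2, and 3 standard deviations of the mean for a standard normal variable:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# Share of the distribution within k standard deviations of the mean.
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.4f}")
```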
35
Q

The population mean is referred to as a ____ and the sample mean is referred to as a _______.

A
  1. parameter
  2. statistic
36
Q

What is the z-score used for?

Provide example

A
  • find distance of obs from mean in terms of SD

z score of 2 -> obs is 2 SD above mean

37
Q

What is standardizing?

When is it commonly used?

A

converting obs into z-scores

common when dealing w/ variables measured using different scales
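
A sketch of standardizing two made-up variables measured on different scales. It assumes the sample standard deviation; after conversion both are expressed in SDs from their own mean:

```python
from statistics import mean, stdev

def standardize(values):
    """Convert each observation into a z-score: (value - mean) / SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

heights_cm = [150, 160, 170, 180, 190]   # hypothetical data
weights_kg = [50, 60, 70, 80, 90]        # hypothetical data

z_heights = standardize(heights_cm)
z_weights = standardize(weights_kg)
print([round(z, 2) for z in z_heights])
print([round(z, 2) for z in z_weights])
```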

38
Q

What methods are used to visualize a categorical variable? (2)

A
  • frequency distribution
  • bar chart (graphical rep of frequency distribution)
39
Q

What methods are used to visualize a numerical variable? (2)

A
  • frequency distribution
  • histogram (helps see shape of distribution, e.g., skewness)
40
Q

What methods are used to visualize two categorical variables? (2)

A
  • contingency table (frequency for 2 categorical variables)
  • stacked column chart
41
Q

What data visualization techniques can be used with multiple variables? (3)

Explain

A
  • bubble plot (3 numerical variables)
  • line chart (connects consecutive obs of numerical variable) (can track changes over time)
  • heat map (can identify combinations of categorical variables that have economic significance)
42
Q

What method is used to visualize two numerical variables?

A
  • scatter plot (shows the relationship between the two variables, e.g., linear; a categorical variable can also be included, e.g., as marker color)
43
Q

Reminder: Tableau can extract data from many sources, including Excel

44
Q

When the value of the response variable is uniquely determined by predictor values

Provide example

A

Deterministic Relationship

ex: p = mv

45
Q

When the value of the response variable is not uniquely determined due to other factors

A

stochastic relationship

46
Q

The category of a categorical variable that is left without a dummy variable (all dummies = 0) can also be called? (2)

A
  • reference category
  • benchmark category
47
Q

What is a measure that summarizes how well the sample regression equation fits the data?

A

Goodness-of-fit

48
Q

Instead of s_e^2, we generally report the standard deviation of the residual, denoted s_e, more commonly referred to as…?

A

the standard error of the estimate

49
Q

What is the residual in linear regression?

A

difference between the observed and predicted values of the response variable (e = y - ŷ)

50
Q

What are the Goodness-of-fit measures? (3)

State ideal preferences

A
  • Standard error of the estimate (s_e)
    smaller s_e is preferred
  • Coefficient of Determination (R^2)
    never decreases as more predictor variables are added to the model; the closer to 1, the better the fit
  • Adjusted Coefficient of Determination (adjusted R^2)
    choose the model w/ the highest adjusted R^2 value
51
Q

We use analysis of variance (ANOVA) in the context of the linear regression model to derive R^2. We denote the total variation in y as Σ(y_i − ȳ)^2, which is the numerator in the formula for the variance of y. What is this total variation called?

A

Total sum of squares
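
To tie the goodness-of-fit cards together, here is a sketch that fits a simple linear regression by the usual least-squares formulas on made-up data, then computes the residuals, the total sum of squares (SST), R^2, adjusted R^2, and the standard error of the estimate. The names n, k, and sse are illustrative (k = number of predictor variables):

```python
from math import sqrt
from statistics import mean

x = [1, 2, 3, 4, 5]                  # hypothetical predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]       # hypothetical response values
n, k = len(x), 1

x_bar, y_bar = mean(x), mean(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, y_hat)]   # observed minus predicted

sst = sum((yi - y_bar) ** 2 for yi in y)            # total sum of squares
sse = sum(e ** 2 for e in residuals)                # unexplained variation
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
se = sqrt(sse / (n - k - 1))                        # standard error of the estimate

print(round(b0, 3), round(b1, 3), round(r2, 4), round(adj_r2, 4), round(se, 4))
```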

52
Q

What is a good solution when confronted with multicollinearity?

A
  • drop one of the collinear variables
  • obtain more data b/c the sample correlation may get weaker
  • sometimes, do nothing
53
Q

The logistic regression model cannot be estimated with standard ordinary least squares (OLS) procedures. Instead, we rely on which method?

A

Maximum likelihood estimation (MLE)

54
Q

In the holdout method we partition the data into two independent and mutually exclusive data sets. What are they called?

A
  • training set
  • validation set
55
Q

Often it is preferable to use the k-fold cross-validation method, where we partition the data into k subsets, and the one that is left out in each iteration is the ____ set.

A

validation
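
A bare-bones sketch of how k-fold partitions could be generated. The helper k_fold_splits is hypothetical, and in practice the observations are usually shuffled before being split:

```python
def k_fold_splits(n_obs, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_obs))
    folds = [indices[i::k] for i in range(k)]   # k roughly equal subsets
    for i in range(k):
        validation = folds[i]                   # the subset left out this iteration
        training = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield training, validation

splits = list(k_fold_splits(n_obs=10, k=5))
for training, validation in splits:
    print(validation, training)
```

Each observation lands in the validation set exactly once and in the training set the other k - 1 times.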

56
Q

What are the other performance measures for logistic regression?

Define them

A
  • accuracy: overall proportion of cases that are classified correctly
  • sensitivity: proportion of target class cases that are classified correctly
  • specificity: proportion of nontarget class cases that are classified correctly
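
The three measures can be sketched from made-up actual and predicted classes, where 1 is assumed to denote the target class and 0 the nontarget class:

```python
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # hypothetical true classes
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]   # hypothetical model output

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)   # target cases classified correctly
fn = sum(a == 1 and p == 0 for a, p in pairs)   # target cases missed
tn = sum(a == 0 and p == 0 for a, p in pairs)   # nontarget cases classified correctly
fp = sum(a == 0 and p == 1 for a, p in pairs)   # nontarget cases flagged as target

accuracy = (tp + tn) / len(pairs)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(accuracy, sensitivity, specificity)
```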