Exam One Flashcards

1
Q

What is analytics?

A

Transforms data into insight for making decisions (it “informs”)

2
Q

What do data analysts do?

A

- Collect and interpret data
- Analyze results
- Report results back to the relevant members
- Identify patterns and trends in data sets
- Work alongside teams within the business or the management team to establish business needs

3
Q

Applications of Business Analytics

A
  • Customer relationship
  • Sports game strategies
  • Pricing Decision
  • Health care
  • Human resource planning
  • Supply Chain Management
  • Financial and Marketing
4
Q

Importance of Business Analytics

A
  • Profitability of businesses
  • Revenue of businesses
  • Shareholder return
  • Enhances understanding of the data
  • Vital to remain competitive
  • Enables creation of informative reports
5
Q

Descriptive analytics

A
  • Uses data to understand past and present
  • Summarizes data into meaningful charts and reports
  • Identifies patterns and trends in data
    (Example: pie chart showing sales of products X and Y by region)
6
Q

Predictive analytics

A
  • Analyzes past performance
  • Extrapolates to the future
  • Predicts risk
    (Example: linear demand prediction model; a line chart showing demand falling as price increases)
7
Q

Prescriptive analytics

A
  • Uses optimization techniques to identify best alternatives
  • Often combined with predictive analytics to account for risk
8
Q

For analysis and Decision making, you need

A

Metrics to quantify performance
Measures are the values of metrics
Discrete metrics involve counting (on time or not, number of on-time deliveries)
Continuous metrics are measured on a continuum (Delivery time, package weight, purchase price)

9
Q

Categorical data

A

Data that helps sort things into groups or types. Doesn’t involve numbers but rather labels or names

10
Q

Ordinal Data

A

Involves categories that can be arranged in a specific order or rank. Example: rating an experience at a restaurant as “bad”, “okay”, or “good”. You know that one is better than another, but not by how much.

11
Q

Interval data

A

Has order and measurable differences between values but no true zero point (no natural starting point). Example: temperature in degrees Celsius or Fahrenheit.

12
Q

Ratio

A

It has all the features of interval data but also has a true zero. With ratio data you can add, subtract, and use comparisons like “twice as much”.

13
Q

Good decision making

A

Requires a mixture of skills: creative development and identification of options, clarity of judgment, firmness of decision, and effective implementation

14
Q

Steps to problem solving

A
  • Recognize problem
  • Define problem
  • Structure the problem
  • Analyze the problem (Role of BA)
  • Interpreting results and making decisions
  • Implement the solution
15
Q

Recognizing the Problem

A

A problem exists when there is a gap between what is happening and what we think should be happening
(Ex. Distribution costs being too high)

16
Q

Defining the problem

A

Clearly defining the problem
ex. High distribution costs stem from:
- Inefficiencies in routing trucks
- Poor location of distribution centers
- External factors such as increasing fuel costs

17
Q

Structuring the problem

A
  • Stating the goals and objectives (minimizing the total delivered costs of the product)
  • Characterizing the possible decisions (new manufacturing, new locations for warehouses)
  • Identifying any constraints or restrictions (Deliver orders within 48 hrs)
18
Q

Analyzing the Problem

A

Identifying and applying appropriate BA techniques

19
Q

Interpreting Results and Making a Decision

A
  • Managers interpret results from the analysis phase
  • Incorporate subjective judgment as needed
  • Understand limitations and model assumptions
  • Make a decision utilizing the information
20
Q

Implementing the solution

A
  • Translate the results of the model back to the real world
  • Make solution work in the organization by:
    – Providing adequate resources
    – Motivating Employees
    – Eliminating resistance to change
    – Modifying organizational policies
    – Developing Trust
21
Q

Experiment (random)

A

Process of observation that leads to a single outcome that cannot be predicted with certainty

21
Q

Sample point

A

The most basic outcome of a random experiment

22
Q

Sample Space

A

Collection of all possible outcomes (Depends on experimenter)

23
Q

Event

A

Set of one or more outcomes (sample points) of a probability experiment; a subset of the sample space

24
Q

Steps for calculating probability

A
  • Define experiment; describe the process used to make an observation and the type of observation that will be recorded
  • List sample points
  • Assign probabilities to sample points
  • Determine collection of sample points contained in the event of interest
  • Sum the sample points’ probabilities to get the probability of the event
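
The five steps above can be walked through for a small experiment. This is only a sketch, assuming a fair six-sided die and the event "roll an even number"; the names `sample_points`, `probabilities`, and `event` are made up for the illustration.

```python
from fractions import Fraction

# Step 1: define the experiment (assumed here): roll one fair six-sided die
# and record the face that shows.

# Step 2: list the sample points.
sample_points = [1, 2, 3, 4, 5, 6]

# Step 3: assign a probability to each sample point (equally likely here).
probabilities = {point: Fraction(1, 6) for point in sample_points}

# Step 4: determine which sample points are contained in the event of interest.
event = [point for point in sample_points if point % 2 == 0]

# Step 5: sum the sample points' probabilities to get the event's probability.
p_event = sum(probabilities[point] for point in event)
print(p_event)  # 1/2
```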
25
Q

Union

A

Outcomes in either event A or event B or both
- Denoted by ∪: A ∪ B
- ‘OR’ statement

26
Q

Intersection

A

Outcomes in both events A and B
- Denoted by ∩: A ∩ B
- ‘AND’ statement

27
Q

P(A|B)

A

P(A ∩ B) / P(B), defined when P(B) > 0 (the conditional probability of A given B)
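
Union, intersection, and conditional probability can be checked on a small sample space. The sketch below assumes one roll of a fair die with two made-up events, A (roll is even) and B (roll is greater than 3); the helper `prob` is hypothetical and only valid when all outcomes are equally likely.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed example)
A = {2, 4, 6}                      # event A: roll is even
B = {4, 5, 6}                      # event B: roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_union = prob(A | B)                   # P(A ∪ B): outcomes in A or B or both
p_intersection = prob(A & B)            # P(A ∩ B): outcomes in both A and B
p_a_given_b = p_intersection / prob(B)  # P(A|B) = P(A ∩ B) / P(B)

print(p_union, p_intersection, p_a_given_b)  # 2/3 1/3 2/3
```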

28
Q

Data preprocessing

A
  • Transforming raw data into an understandable format
  • Helps us understand the data and supports knowledge discovery from it
29
Q

Why is data preprocessing needed?

A

Real-world data tends to be incomplete, noisy, and inconsistent
- This leads to poor-quality data and poor-quality models built on that data

Preprocessing provides operations that help organize data into a proper form for better understanding in the data mining process

30
Q

Examples of poor-quality data

A

Incomplete - Lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
Noisy - Containing errors or outliers
Intentional - Disguised missing data

31
Q

Why preprocess data?

A

To improve data quality, which is measured by:
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability

32
Q

Data Cleaning

A
  • Handling missing data
  • Outlier detection and removal
  • Noise reduction
33
Q

Data Transformation

A
  • Scaling
  • Smoothing
  • Aggregation
  • Generalization
34
Q

Data reduction

A
  • Feature selection
  • Dimensionality reduction
  • Numerosity reduction
35
Q

Handling Imbalance

A
  • Oversampling
  • Under-sampling
36
Q

Data Integration

A

Combining tables

37
Q

Tasks of data cleaning

A
  • Fill in missing values
  • Identify outliers
  • Smooth out noisy data
  • Correct inconsistent data
38
Q

Handling missing data

A

Data is not always available
- many tuples have no recorded value for several attributes

39
Q

Missing data may be due to

A
  • Equipment malfunction
  • Data deleted because it was inconsistent with other recorded data
  • Data not entered due to misunderstanding
  • Certain data may not be considered important at the time of entry
  • Missing data may need to be inferred
40
Q

Causes of outliers

A
  • Experimental errors
  • Measurement errors (instrument errors)
  • Data entry errors (human errors)
  • Data processing errors (data manipulation or data set unintended mutations)
  • Sampling errors (extracting or mixing data from wrong or various sources)
  • Natural (not an error, novelties in data)
41
Q

Outlier detection and Removal

A
  • Z-score or Extreme Value Analysis
  • Interquartile Range Method
  • Probabilistic and Statistical modeling
  • Linear Regression Models
  • Proximity Based Models
  • Information Theory Models
  • High Dimensional Outlier Detection Methods
42
Q

Z-score removal

A
  • Very effective when values in the feature fit a Gaussian distribution
  • Easy to implement
  • Useful for low-dimensional feature sets
  • Not recommended when data cannot be assumed to be parametric
  • Eliminate data points with a z-score greater than 3 or less than -3
  • For normally distributed data, this eliminates about 0.27% of data points
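
The z-score rule above can be sketched in a few lines. This is an illustration only: the function name `remove_outliers_zscore`, the cut-off parameter `threshold`, and the `readings` data are all assumptions made for the example, and the population standard deviation is used for simplicity.

```python
import statistics

def remove_outliers_zscore(values, threshold=3.0):
    """Keep only values whose z-score lies within [-threshold, +threshold]."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return list(values)  # all values identical, so nothing is an outlier
    return [v for v in values if abs((v - mean) / stdev) <= threshold]

# Made-up sensor readings with one obviously extreme value (100).
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
            10, 9, 12, 10, 11, 10, 9, 11, 10, 100]
cleaned = remove_outliers_zscore(readings)
print(len(readings), len(cleaned))  # 20 19
```

Note that a single extreme value inflates the mean and standard deviation, so in very small samples the rule may fail to flag anything.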
43
Q

Noise reduction can be handled by:

A

Binning:
- First sort data and partition it into (equal-frequency) bins
- Then smooth by bin means, bin medians, or bin boundaries

Regression:
- Smooth by fitting the data to regression functions

Clustering:
- Detect and remove outliers
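
Binning with smoothing by bin means might look like the sketch below. The function name `smooth_by_bin_means`, the `bin_size` parameter, and the `prices` data are assumptions for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-frequency bins, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # made-up data
print(smooth_by_bin_means(prices, bin_size=3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```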

44
Q

Scaling

A

Normalization - converting numerical values into a new range using a mathematical function

45
Q

Two primary reasons to normalize data

A

Make two variables in different scales comparable

Some models may need the data to be normalized before modeling

46
Q

Min-max normalization

A

v’ = (v - min_A) / (max_A - min_A)

where min_A and max_A are the minimum and maximum values of attribute A
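
A minimal sketch of the formula above, which maps each value into the range [0, 1]. The function name `min_max_normalize` and the `incomes` data are assumptions made for the example.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max normalization is undefined")
    return [(v - lo) / (hi - lo) for v in values]

incomes = [12000, 30000, 48000, 84000]  # made-up data
print(min_max_normalize(incomes))  # [0.0, 0.25, 0.5, 1.0]
```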

47
Q

Smoothing

A

Statistical technique designed to detect trends in the presence of noisy data, assuming that the trend is smooth

48
Q

Different types of smoothing

A
  • Bin Smoothing
  • Kernels
  • Local weighted regression
49
Q

Generalization

A

Process of broadening the classification of data in a database

Ex:
Age groups instead of age
Income levels instead of income
State instead of county
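
Replacing exact ages with age groups could be sketched as follows. The cut-offs (18 and 65) and the group labels are illustrative assumptions, not values from the flashcards.

```python
def age_group(age):
    """Map an exact age to a broader group (cut-offs are assumed)."""
    if age < 18:
        return "under 18"
    if age < 65:
        return "18-64"
    return "65+"

ages = [12, 25, 40, 70]  # made-up data
print([age_group(a) for a in ages])
# ['under 18', '18-64', '18-64', '65+']
```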

50
Q

Data reduction

A

Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

51
Q

Why is data reduction needed?

A

Complex data analysis may take a very long time to run on the complete data set

Additional data does not necessarily mean better results

52
Q

Feature Selection

A

Select a subset of the most relevant existing features to make the data mining process more efficient (constructing new features by combining the given ones is feature construction, not selection)

53
Q

Dimensionality Reduction

A

Used to reduce the number of features (dimensions) in the data set

54
Q

Numerosity Reduction

A

Replace original data by a smaller form of data representation
- Parametric - Regression models
- Non-parametric - histograms, data sampling and data cube aggregation

55
Q

Why do feature selection?

A
  • Improved model performance
  • Interpretability and Simplicity
  • Identification of important features (variables)
56
Q

Feature selection methods:

A

Pearson Correlation Coefficient:
- Measures linear relationship between two continuous variables

Chi-squared test:
- Used to test if two categorical variables are independent

Analysis of Variance (ANOVA):
- Used to compare one categorical and one continuous variable
- Tests whether the mean of the continuous variable is equal across the groups of the categorical variable

57
Q

Data Integration

A

Combining data from different sources to provide a unified view or dataset

58
Q

Data imbalance

A

Uneven distribution of classes in a dataset

Ex.
- Fraud detection data
- Spam classification data
- Medical diagnosis data

59
Q

Why is data imbalance a problem?

A
  • Bias towards the majority class
  • Poor generalization
  • Misleading metrics
60
Q

Popular data-level approaches to handle imbalance:

A
  • Over-sampling minority class
  • Under-sampling majority class
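
The two data-level approaches can be sketched with simple random resampling. This is only an illustration: the function names, the fixed `seed`, and the 95/5 fraud data are assumptions, and both functions assume the majority class is the larger one.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate random minority examples until both classes are equal in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class the size of the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

legit = [f"legit_{i}" for i in range(95)]  # made-up majority class
fraud = [f"fraud_{i}" for i in range(5)]   # made-up minority class

over_major, over_minor = random_oversample(legit, fraud)
under_major, under_minor = random_undersample(legit, fraud)
print(len(over_major), len(over_minor))    # 95 95
print(len(under_major), len(under_minor))  # 5 5
```

Over-sampling keeps all the data but repeats minority examples; under-sampling discards majority examples, which loses information.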
61
Q

Steps in machine learning data preprocessing

A

1 - Import libraries
2 - Import the data set
3 - Check for missing values
4 - Inspect the categorical values
5 - Split the data set into training and test sets
6 - Feature engineering (scaling, selection, etc.)
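
The six steps can be traced in miniature using only the Python standard library. This is a rough sketch, not a real pipeline: the in-memory CSV, the column names, the 80/20 split, and the fixed seed are all assumptions, and in practice dedicated data libraries would usually be used instead.

```python
# Step 1: import libraries (standard library only in this sketch).
import csv
import io
import random

# Step 2: import the data set (an in-memory CSV stands in for a real file).
raw = """age,city,bought
25,Austin,yes
,Dallas,no
40,Austin,yes
31,Houston,no
52,Dallas,yes
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Step 3: check for missing values (empty strings in this data).
missing = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}

# Step 4: inspect the categorical values.
cities = sorted({r["city"] for r in rows})

# Step 5: split into training and test sets (80/20, fixed seed).
rng = random.Random(42)
shuffled = rows[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

# Step 6: feature engineering, here min-max scaling of age using
# the minimum and maximum of the training rows only.
train_ages = [float(r["age"]) for r in train if r["age"] != ""]
lo, hi = min(train_ages), max(train_ages)
for r in train + test:
    r["age_scaled"] = (float(r["age"]) - lo) / (hi - lo) if r["age"] != "" else None

print(missing, cities, len(train), len(test))
```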

62
Q

Population

A

Set of all items of interest for a particular decision or investigation

63
Q

Sample

A
  • Subset of population
    Ex.
    List of individuals who rented a comedy from Netflix in the past year
  • Purpose of sampling is to obtain sufficient information to draw a valid inference about a population
64
Q

Statistics

A

Any function of the random variables constituting a random sample is called a statistic

65
Q

Probability density function (PDF)

A

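Statistical expression that defines the probability distribution (the relative likelihood of outcomes) of a continuous random variable; probabilities are areas under the curve. The counterpart for a discrete random variable is the probability mass function (PMF)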

66
Q

Skewness

A

Measure of asymmetry of a distribution

67
Q

Variance

A

Average of the squared deviations from the mean

68
Q

Chebyshev’s Theorem

A

For any distribution, the proportion of values that lie within K standard deviations of the mean is at least

1 - (1/K^2)

K is any number greater than 1
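
A quick numeric check of the bound. The function name `chebyshev_lower_bound` is made up for this sketch.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of any distribution within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

for k in (2, 4):
    print(k, chebyshev_lower_bound(k))
# 2 0.75
# 4 0.9375
```

So at least 75% of any distribution lies within 2 standard deviations of the mean, whatever its shape.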

69
Q

Standard Deviation

A

Square root of the variance

70
Q

Population

A

Set of all items of interest for a particular decision or investigation

Ex.
- All former Texas A&M ID graduates
- All subscribers to Netflix

71
Q

Sample

A

Subset of the population

Ex. List of individuals who rented a comedy from Netflix in the past year

Purpose of a sample is to obtain sufficient information to draw a valid inference about a population

72
Q

Correlation

A

Used to determine how a change in one variable is associated with a change in another (correlation alone does not establish that one causes the other)

73
Q

Measures of Association

A
  • Both covariance and correlation measure the linear relationship and the dependency between two variables
  • Covariance indicates the direction of the linear relationship between variables
  • Correlation measures both the strength and direction of the linear relationship between two variables
  • Correlation values are standardized
  • Covariance values are not standardized
74
Q

Correlation Coefficient

A

Tells us how two variables are related. To find it:

  1. Calculate how the two variables change together (covariance)
  2. Divide the covariance by the product of the two variables' standard deviations (their spread)
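
The two steps can be sketched as below, using the sample (n - 1) versions of covariance and standard deviation. The function names and the `price`/`demand` data are assumptions for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance: how two variables change together."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))  # covariance of x with itself is its variance
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

price = [1, 2, 3, 4, 5]     # made-up data
demand = [10, 8, 6, 4, 2]   # falls by 2 for every unit increase in price

print(covariance(price, demand))   # -5.0
print(correlation(price, demand))  # approximately -1 (perfect negative linear relationship)
```

The covariance (-5.0) depends on the units of the data, while the correlation is standardized to lie between -1 and 1.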
75
Q

What does the correlation coefficient mean?

A

It is a standardized number between -1 and 1, which makes it easy to see how strongly the variables are connected. A value close to 1 means a strong positive relationship, and a value close to -1 means a strong negative relationship

76
Q

Statistical analysis

A

All about data

77
Q

Sampling

A

Foundation of statistical analysis

78
Q

Sampling Plan

A

Description of the approach that is used to obtain samples from a population prior to any data collection activity

79
Q

Sampling plan states

A
  • Its objectives
  • Target population
  • Population Frame
  • Operational procedures for data collection
  • Statistical tools for data analysis
80
Q

Subjective sampling method

A

Judgment sampling
- Expert judgment is used

Convenience sampling
- collect sample based on convenience

81
Q

Probability Sampling method

A

Simple Random Sampling
- selecting items from population so that every subset of given size has equal chance of being selected

82
Q

Systematic (periodic) sampling (statistical sampling)

A
  • selects every nth item from population
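
A minimal sketch of systematic sampling. The function name `systematic_sample` and the `start` parameter (often chosen at random in practice) are assumptions for the example.

```python
def systematic_sample(population, n, start=0):
    """Select every n-th item from the population, beginning at index `start`."""
    return population[start::n]

population = list(range(1, 21))  # items numbered 1 to 20 (made-up)
print(systematic_sample(population, 5))           # [1, 6, 11, 16]
print(systematic_sample(population, 5, start=2))  # [3, 8, 13, 18]
```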
83
Q

Stratified sampling (statistical sampling)

A
  • Applied to population divided into subsets and allocates an appropriate proportion of samples to each subset
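
Proportional allocation across strata might be sketched like this. The function name `stratified_sample`, the strata, and the fixed `seed` are assumptions; with less convenient sizes, rounding can make the total differ slightly from the requested sample size.

```python
import random

def stratified_sample(strata, total_sample_size, seed=0):
    """Draw from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    population_size = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        k = round(total_sample_size * len(items) / population_size)
        sample[name] = rng.sample(items, k)
    return sample

# Made-up population of 100 people in three strata.
strata = {
    "undergrad": [f"u{i}" for i in range(60)],
    "graduate": [f"g{i}" for i in range(30)],
    "faculty": [f"f{i}" for i in range(10)],
}
chosen = stratified_sample(strata, total_sample_size=10)
print({name: len(items) for name, items in chosen.items()})
# {'undergrad': 6, 'graduate': 3, 'faculty': 1}
```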
84
Q

Cluster sampling (statistical sampling)

A
  • Divide population into clusters and sample a set of clusters
85
Q

Sampling from a continuous process (statistical sampling)

A
  • Fix the time and select ‘n’ items after that time, or select ‘n’ times at random and select the next item produced after each of these times
86
Q

Sample Data

A

Provides basis for many useful analyses to support decision making

87
Q

Estimation

A

Assess the value of unknown population parameters (mean, proportion, population variance) using sample data

88
Q

Unbiased estimator of the population mean μ is the sample mean

A

If sampling is done randomly and correctly, the sample mean will provide a good estimate of the true population mean

89
Q

Unbiased estimator of the population variance σ^2 is the sample variance s^2

A

Using (n-1) instead of n compensates for the fact that we’re estimating based on the sample and accounts for variability in that estimate
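
The difference between dividing by n and by (n - 1) can be seen on a small sample. The helper `sample_variance` and the `sample` data are made up for this sketch; `statistics.variance` and `statistics.pvariance` are the standard-library equivalents.

```python
import statistics

def sample_variance(values):
    """Unbiased estimate of the population variance: divide by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data; mean is 5, squared deviations sum to 32

print(statistics.pvariance(sample))  # divides by n:       32 / 8
print(sample_variance(sample))       # divides by (n - 1): 32 / 7
print(statistics.variance(sample))   # same (n - 1) formula from the standard library
```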

90
Q

Point Estimate

A

Single number derived from the sample to estimate the population parameter

  • If the long-term average of point estimates from repeated samples equals the true value of the population parameter, the estimator is called an unbiased estimator
91
Q

Sampling error

A

Difference between the observed value of a statistic and the quantity it is intended to estimate
- Ex. Any difference between the sample mean and the population mean

92
Q

Causes of Sampling error

A

Sampling errors: The sample is NOT representative of the population as a whole

Non-sampling errors: Systematic errors such as asking the so-called leading questions during an interview

93
Q

Central limit theorem

A

If the sample size is large enough, the sampling distribution of the mean:

  • Is approximately normally distributed regardless of the distribution of the population
  • Has a mean equal to the population mean

  • If the population is normally distributed, then the sampling distribution of the mean is also normally distributed for any sample size
  • The theorem is one of the most practical results in statistics
94
Q

Sampling distribution

A

The sampling distribution of the mean is the distribution of the means of all possible samples of fixed size n from some population

95
Q

Standard error of the mean

A

Standard deviation of the sampling distribution of the mean
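
For sampling with replacement, the standard error of the mean equals σ / √n. The sketch below checks this by listing every possible sample of size 2 from a tiny made-up population; the variable names are assumptions for the example.

```python
import itertools
import math
import statistics

population = [2, 4, 6, 8]  # made-up population; mean is 5
n = 2                      # sample size

# Means of every possible ordered sample of size n, drawn with replacement.
sample_means = [sum(s) / n for s in itertools.product(population, repeat=n)]

sigma = statistics.pstdev(population)            # population standard deviation
standard_error = statistics.pstdev(sample_means) # std dev of the sampling distribution

print(statistics.mean(sample_means))  # 5.0, equal to the population mean
print(standard_error)                 # matches the line below
print(sigma / math.sqrt(n))
```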

96
Q

Confidence intervals

A
  • Provide a range for a population characteristic based on sample
  • Provide a way of assessing the accuracy of a point estimate
97
Q

Level of confidence

A

1-alpha

98
Q

Interval Estimates

A

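Point estimate ± margin of error (x% ± y%)

Gallup poll reports 56% of voters support a certain candidate with a margin of error of ± 3%
- The interval estimate is [53%, 59%]; since the whole interval lies above 50%, we have a lot of confidence the candidate would win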
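
The interval in the poll example, and where a margin of that size could come from, can be sketched as follows. The sample size n = 1000 and the 95% confidence level are hypothetical (the flashcard gives neither), and the normal approximation for a proportion is assumed.

```python
import math
from statistics import NormalDist

p_hat = 0.56   # reported sample proportion
margin = 0.03  # reported margin of error
interval = (p_hat - margin, p_hat + margin)
print(round(interval[0], 2), round(interval[1], 2))  # 0.53 0.59

# Hypothetical: the margin a sample of n = 1000 would give at 95% confidence.
n = 1000
alpha = 0.05                             # level of confidence is 1 - alpha
z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
margin_from_n = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(round(margin_from_n, 3))  # 0.031
```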

99
Q

T-distribution

A

Used for confidence intervals when the population standard deviation is unknown

  • Only parameter is degrees of freedom