Exam One Flashcards
What is Analytics
Transforms data into insight for making decisions; analytics “informs” decisions
What do data analysts do?
Collect and interpret data
- Analyze results
- Report results back to the relevant members
- Identify patterns and trends in data sets
- Work alongside teams within the business or the management team to establish business needs
Applications of Business Analytics
- Customer relationship
- Sports game strategies
- Pricing decisions
- Health care
- Human resource planning
- Supply Chain Management
- Finance and marketing
Importance of Business Analytics
- Profitability of businesses
- Revenue of businesses
- Shareholder return
- Enhances understanding of the data
- Vital to remain competitive
- Enables creation of informative reports
Descriptive analytics
- Uses data to understand past and present
- Summarizes data into meaningful charts and reports
- Identify patterns and trends in data
(Pie chart showing sales of product X and Y by region)
Predictive analytics
- Analyzes past performance
- Extrapolating to future
- Predicts risk
(Linear demand Prediction model. As price increases, demand falls line chart)
Prescriptive analytics
- Uses optimization techniques to identify best alternatives
- Often combined with predictive analytics to account for risk
For analysis and Decision making, you need
Metrics to quantify performance
Measures are the values of metrics
Discrete metrics involve counting (on time or not, number of on-time deliveries)
Continuous metrics are measured on a continuum (Delivery time, package weight, purchase price)
Categorical data
Data that helps sort things into groups or types. Doesn’t involve numbers but rather labels or names
Ordinal Data
Involves categories that can be arranged in a specific order or rank, e.g., rating a restaurant experience as “bad”, “okay”, “good”. You know that one is better than another, but not by how much.
Interval data
Has order and measurable differences between values, but no true zero point. An example is temperature in degrees (Celsius or Fahrenheit), where zero does not mean “none”.
Ratio
It has all the features of interval but also has true zero. With ratio you can add, subtract, and use comparisons like “twice as much”.
Good decision making
requires a mixture of skills: creative development and identification of options, clarity of judgment, firmness of decision, effective implementation
Steps to problem solving
- Recognize problem
- Define problem
- Structure the problem
- Analyze the problem (Role of BA)
- Interpreting results and making decisions
- Implement the solution
Recognizing the Problem
A problem exists when there is a gap between what is happening and what we think should be happening
(Distribution costs being too high)
Defining the problem
Clearly defining the problem
ex. High distribution costs stem from:
- Inefficiencies in routing trucks
- Poor location of distribution centers
- External factors such as increasing fuel costs
Structuring the problem
- Stating the goals and objectives (minimizing the total delivered costs of the product)
- Characterizing the possible decisions (new manufacturing plants, new warehouse locations)
- Identifying any constraints or restrictions (Deliver orders within 48 hrs)
Analyzing the Problem
Identifying and applying appropriate BA techniques
Interpreting results and Making Decisions
- Managers interpret results from the analysis phase
- Incorporate subjective judgment as needed
- Understand limitations and model assumptions
- Make a decision utilizing the information
Implementing the solution
- Translate the results of the model back to the real world
- Make solution work in the organization by:
– Providing adequate resources
– Motivating Employees
– Eliminating resistance to change
– Modifying organizational policies
– Developing Trust
Experiment (random)
Process of observation that leads to a single outcome that cannot be predicted with certainty
Sample point
The most basic outcome of a random experiment
Sample Space
Collection of all possible outcomes (Depends on experimenter)
Event
Set of outcomes of a probability experiment
Steps for calculating probability
- Define experiment; describe the process used to make an observation and the type of observation that will be recorded
- List sample points
- Assign probabilities to sample points
- Determine collection of sample points contained in the event of interest
- Sum the sample-point probabilities to get the probability of the event
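The five steps above can be sketched with a simple dice experiment (a minimal Python sketch; the experiment and the event “sum equals 7” are made up for illustration):

```python
from itertools import product
from fractions import Fraction

# Steps 1-2: define the experiment (roll two dice) and list the sample points
sample_space = list(product(range(1, 7), repeat=2))  # 36 outcomes

# Step 3: assign probabilities (all outcomes equally likely here)
p = Fraction(1, len(sample_space))

# Step 4: collect the sample points contained in the event of interest
event = [pt for pt in sample_space if sum(pt) == 7]

# Step 5: sum the sample-point probabilities to get the event probability
prob = p * len(event)
print(prob)  # 1/6
```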
Union
Outcomes in either events A or B or both
- Denoted by ∪: A∪B
- ‘Or’ Statement
Intersection
Outcomes in both events A and B
- ‘AND’ Statement
- Denoted by ∩: A∩B
P(A|B)
P(A|B) = P(A∩B) / P(B), the conditional probability of A given B
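A quick sketch of the formula on a made-up dice example (events A and B are chosen only for illustration):

```python
from fractions import Fraction
from itertools import product

# Roll two fair dice. A = "sum is 8", B = "first die shows an even number".
space = list(product(range(1, 7), repeat=2))
A = {pt for pt in space if sum(pt) == 8}
B = {pt for pt in space if pt[0] % 2 == 0}

p_B = Fraction(len(B), len(space))            # P(B)
p_A_and_B = Fraction(len(A & B), len(space))  # P(A ∩ B)
p_A_given_B = p_A_and_B / p_B                 # P(A|B) = P(A∩B)/P(B)
print(p_A_given_B)  # 1/6
```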
Data preprocessing
- Transforming raw data into an understandable format
- Helps us understand the data and supports knowledge discovery at the same time
Why is data preprocessing needed?
Real-world data tends to be incomplete, noisy, and inconsistent
- leads to poor-quality data and models built on the data
It provides operations that help organize data into a proper form for better understanding in the data mining process
Examples of poor-quality data
Incomplete - Lacking attribute values, lacking certain attributes of interest or containing only aggregate data
Noisy - Contains too many outliers
Intentional - Disguised missing data
Why preprocess data?
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Data Cleaning
- Handling missing data
- Outlier detection and removal
- Noise reduction
Data Transformation
- Scaling
- Smoothing
- Aggregation
- Generalization
Data reduction
- Feature selection
- Dimensionality reduction
- Numerosity reduction
Handling Imbalance
- Oversampling
- Under-sampling
Data Integration
Combining tables
Tasks of data cleaning
- Fill in missing values
- Identify outliers
- Smooth out noisy data
- Correct inconsistent data
Handling missing data
Data is not always available
- many tuples have no recorded value for several attributes
Missing data may be due to
- Equipment malfunction
- Inconsistent with other recorded data thus deleted
- Data not entered due to misunderstanding
- Certain data may not be considered important at the time of entry
- Missing data may need to be inferred
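A minimal sketch of one common fix, mean imputation, assuming missing values are represented as None (the attribute name and numbers are invented):

```python
# Hypothetical attribute with missing entries marked as None
delivery_times = [2.0, 3.5, None, 4.0, None, 2.5]

# Compute the mean over the observed (non-missing) values
observed = [v for v in delivery_times if v is not None]
mean = sum(observed) / len(observed)

# Fill each missing value with the mean of the observed values
filled = [v if v is not None else mean for v in delivery_times]
print(filled)  # [2.0, 3.5, 3.0, 4.0, 3.0, 2.5]
```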
Causes of outliers
- Experimental errors
- Measurement errors (instrument errors)
- Data entry errors (human errors)
- Data processing errors (data manipulation or data set unintended mutations)
- Sampling errors (extracting or mixing data from wrong or various sources)
- Natural (not an error, novelties in data)
Outlier detection and Removal
- Z-score or Extreme Value Analysis
- Interquartile Range Method
- Probabilistic and Statistical modeling
- Linear Regression Models
- Proximity Based Models
- Information Theory Models
- High Dimensional Outlier Detection Methods
Z-score removal
- Very effective when values in the feature fit a Gaussian distribution
- Easy to implement
- It is useful for low-dimensional feature set
- Not recommended when data cannot be assumed to be parametric
- Eliminate data points with a z-score greater than 3 or less than -3
- Under normality, this removes about 0.27% of data points
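A minimal sketch of z-score removal on invented data (the cutoff |z| > 3 follows the rule above):

```python
import statistics

# Hypothetical feature values with one extreme point (100)
data = [9, 10, 11, 10, 9, 11, 10, 12, 10, 9,
        11, 10, 10, 9, 11, 12, 10, 9, 11, 100]

mu = statistics.mean(data)
sigma = statistics.pstdev(data)  # population standard deviation

# Keep only points whose z-score lies within [-3, 3]
kept = [x for x in data if abs((x - mu) / sigma) <= 3]
print(kept)
```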
Noise reduction can be handled by:
Binning :
- First sort data and partition it into (equal-frequency) bins
- Then can smooth by bin means, median, and bin boundaries
Regression:
- Smooth by fitting the data into regression functions
Clustering:
- Detect and remove outliers
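Binning followed by smoothing by bin means can be sketched like this (equal-frequency bins of size 3; the values are invented):

```python
# Equal-frequency binning, then smoothing by bin means
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    # Replace every value in the bin with the bin mean
    smoothed.extend([mean] * len(bin_vals))
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```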
Scaling
Normalization - the process of converting numerical values into a new range using a mathematical function
Two primary reasons why to normalize data
Make two variables in different scales comparable
Some models may need the data to be normalized before modeling
Min-max normalization
v’ = (v-mina)/(maxa - min a)
Smoothing
Statistical technique designed to detect trends in the presence of noisy data, assuming that the trend is smooth
Different types of smoothing
- Bin Smoothing
- Kernels
- Local weighted regression
Generalization
Process of replacing detailed data values with broader, higher-level categories in a database
Ex:
Age groups instead of age
Income levels instead of income
State instead of county
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or nearly the same) analytical results
Why data reduction?
Complex data analysis may take a very long time to run on the complete data set
Additional data does not necessarily mean better results
Feature Selection
Select the most relevant features (or construct new ones from the given features) to make the data mining process more efficient
Dimensionality Reduction
Used to reduce the amount of features
Numerosity Reduction
Replace original data by a smaller form of data representation
- Parametric - Regression models
- Non-parametric - histograms, data sampling and data cube aggregation
Why do feature selection?
- Improved model performance
- Interpretability and Simplicity
- Identification of important features
Feature selection methods:
Pearson Correlation Coefficient:
- Measures linear relationship between two continuous variables
Chi-squared test:
- Used to test if two categorical variables are independent
Analysis of Variance
- Used to compare one categorical and one continuous variable
- Tests if the mean of Variable 1 in different groups of Variable 2 are equal
Data Integration
Combining data from different sources to provide a unified view or dataset
Data imbalance
Uneven distribution of classes in a dataset
Ex.
- Fraud detection data
- Spam classification data
- Medical diagnosis data
Why data imbalance is a problem?
- Bias towards the majority class
- Poor generalization
- Misleading metrics
Popular data-level approaches to handle imbalance:
- Over-sampling minority class
- Under-sampling majority class
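Random over-sampling of the minority class can be sketched as follows (the toy dataset and its 95/5 split are invented):

```python
import random

random.seed(0)

# Imbalanced toy dataset of (feature, label) pairs: label 1 is the minority
data = [(x, 0) for x in range(95)] + [(x, 1) for x in range(5)]

minority = [d for d in data if d[1] == 1]
majority = [d for d in data if d[1] == 0]

# Over-sample the minority class with replacement until the classes balance
oversampled = minority + random.choices(minority, k=len(majority) - len(minority))
balanced = majority + oversampled
print(len(majority), len(oversampled))  # 95 95
```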
Steps in machine learning data preprocessing
1- import libraries
2 - Import data-set
3 - Check out the missing values
4 - See the categorical values
5 - Splitting the data set into Training and Test Set
6 - Feature engineering (scaling, selection, etc.)
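Steps 3 and 5 can be sketched without any external library (the toy rows and the 80/20 split are illustrative choices):

```python
import random

random.seed(42)

# Toy dataset: rows of (feature, label); names and values are illustrative
rows = [(i, i % 2) for i in range(100)]

# Step 3: check for missing values (none in this toy data)
assert all(None not in row for row in rows)

# Step 5: shuffle, then split 80/20 into training and test sets
random.shuffle(rows)
split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]
print(len(train), len(test))  # 80 20
```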
Population
Set of all items of interest for a particular decision or investigation
Ex.
- All former Texas A&M ID graduates
- All subscribers to Netflix
Sample
Subset of the population
Ex. List of individuals who rented a comedy from Netflix in the past year
Purpose of sampling is to obtain sufficient information to draw a valid inference about a population
Statistics
Any function of the random variables constituting a random sample is called a statistic
Probability density function (PDF)
Statistical expression that defines a probability distribution (the likelihood of an outcome) for a continuous random variable; the discrete analogue is the probability mass function (PMF)
Skewness
Measure of asymmetry of a distribution
Variance
Average of the squared deviations from the mean
Chebyshev’s Theorem
At least 1 - (1/K^2) of any distribution lies within K standard deviations of the mean
1-(1/K^2)
K is any positive number greater than 1
Standard Deviation
Square root of the variance
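A small sketch tying variance, standard deviation, and Chebyshev's bound together (the data values are a made-up example):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # invented small sample

var = statistics.pvariance(data)  # average squared deviation from the mean
sd = statistics.pstdev(data)      # square root of the variance
print(var, sd)                    # 4.0 2.0

# Chebyshev: at least 1 - 1/K^2 of any distribution lies within K std devs
K = 2
bound = 1 - 1 / K**2
print(bound)  # 0.75 -> at least 75% within 2 standard deviations
```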
Correlation
Used to determine when a change in one variable can result in a change in another
Measures of Association
- Both covariance and correlation measure the linear relationship and the dependency between two variables
- Covariance indicates the direction of the linear relationship between variables
- Correlation measures both the strength and direction of the linear relationship between two variables
- Correlation values are standardized
- Covariance values are not standardized
Correlation Coefficient
Tells us how two variables are related. To find it:
- Calculate how the two variables change together (covariance)
- Divide by the product of the two variables' standard deviations (to standardize)
What does the correlation coefficient mean
A standardized number between -1 and 1 that makes it easy to see how strongly the variables are connected. A value close to 1 means a strong positive relationship and a value close to -1 means a strong negative relationship
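The recipe above can be sketched directly (the x and y values are invented):

```python
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Covariance: how the two variables change together (sample, n - 1 divisor)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Standard deviations of each variable (same n - 1 divisor)
sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5
sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5

r = cov / (sx * sy)  # correlation coefficient
print(round(r, 4))   # 0.8
```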
Statistical analysis
All about data
Sampling
Foundation of statistical analysis
Sampling Plan
Description of the approach that is used to obtain samples from a population prior to any data collection activity
Sampling plan states
- Its objectives
- Target population
- Population Frame
- Operational procedures for data collection
- Statistical tools for data analysis
Subjective sampling method
Judgment sampling
- Expert judgment is used
Convenience sampling
- collect sample based on convenience
Probability Sampling method
Simple Random Sampling
- selecting items from population so that every subset of given size has equal chance of being selected
Systematic (periodic) sampling (statistical sampling)
- selects every nth item from population
Stratified sampling (statistical sampling)
- Applied to population divided into subsets and allocates an appropriate proportion of samples to each subset
Cluster sampling (statistical sampling)
- Divide population into clusters and sample a set of clusters
Sampling from a continuous process (statistical sampling)
- Fix a time and select the next 'n' items produced after that time, or select 'n' times at random and take the next item produced after each of those times
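Systematic sampling, for instance, can be sketched as follows (a population of 100 items and interval n = 10 are illustrative choices):

```python
import random

random.seed(1)

# Systematic sampling: select every nth item, starting from a random offset
population = list(range(1, 101))  # items 1..100
n = 10                            # sampling interval

start = random.randrange(n)
sample = population[start::n]
print(len(sample))  # 10 items, evenly spaced through the population
```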
Sample Data
Provides basis for many useful analyses to support decision making
Estimation
Assess the value of unknown population parameters (mean, proportion, population variance) using sample data
Unbiased estimator of the population mean μ is the sample mean
If sampling is done randomly and correctly, the sample mean will provide a good estimate of the true population mean
Unbiased estimator of the population variance σ² is the sample variance s²
Using (n-1) instead of n compensates for the fact that we’re estimating based on the sample and accounts for variability in that estimate
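A quick sketch showing the (n - 1) divisor (the sample values are invented; Python's statistics.variance uses the same divisor):

```python
import statistics

sample = [4, 7, 6, 3, 5]  # hypothetical sample

n = len(sample)
xbar = sum(sample) / n
ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations

s2 = ss / (n - 1)  # sample variance: divide by (n - 1), not n
print(s2, statistics.variance(sample))  # 2.5 2.5
```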
Point Estimate
Single number derived from the sample to estimate the population parameter
- If the long-term average of point estimates from population samples equals the true population parameter, the estimator is called an unbiased estimator
Sampling error
Difference between observed values of statistic and the quantity it is intended to estimate
- Any difference between sampling mean and population mean
Causes of Sampling error
Sampling errors: The sample is NOT representative of the population as a whole
Non-sampling errors: Systematic errors such as asking the so-called leading questions during an interview
Central limit theorem
If sample size is large enough, sampling distribution of the mean is:
- Approximately normally distributed regardless of the distribution of the population
- Has a mean equal to the population mean
- If population is normally distributed, then sampling distribution is also normally distributed for any sample size
- Theorem is one of the most practical results in statistics
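A small simulation illustrating the theorem (sample size 50 and 2000 repetitions are arbitrary choices; the population is uniform on [0, 1], clearly non-normal):

```python
import random
import statistics

random.seed(0)

# Draw 2000 samples of size 50 from a uniform population and record each mean
means = [statistics.mean(random.random() for _ in range(50))
         for _ in range(2000)]

# The sample means cluster tightly around the population mean 0.5
print(round(statistics.mean(means), 2))
```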
Sampling distribution of the mean
The distribution of the means of all possible samples of fixed size n from some population
Standard error of the mean
Standard deviation of the sampling distribution of the mean
Confidence intervals
- Provide a range for a population characteristic based on sample
- Provide a way of assessing the accuracy of a point estimate
Level of confidence
1-alpha
Interval Estimates
point estimate ± margin of error
Ex. A Gallup poll reports 56% of voters support a certain candidate with a margin of error of ±3%
- We have a lot of confidence the candidate would win, since the interval is [53%, 59%]
T-distribution
Used for confidence intervals when the population standard deviation is unknown
- Only parameter is degrees of freedom
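A sketch of a t-based confidence interval for the mean (the sample values are invented; the critical value t ≈ 2.262 for df = 9 at 95% confidence is taken from a standard t table):

```python
import statistics

# Hypothetical sample measurements
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)   # population sigma unknown, so use s
se = s / n ** 0.5              # standard error of the mean

# t critical value for df = n - 1 = 9 at 95% confidence (from a t table)
t_crit = 2.262
lo, hi = xbar - t_crit * se, xbar + t_crit * se
print(round(lo, 3), round(hi, 3))
```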