Exploratory Data Analysis for Machine Learning Flashcards

1
Q

In this overview, we will discuss

A

- Define Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).
- Explain how DL helps solve classical ML limitations.
- Explain key historical developments and the hype/AI-winter cycle.
- Differentiate modern AI from prior AI.
- Relate sample applications of AI.
2
Q

Artificial Intelligence (AI)

A

A program that can sense, reason, act, and adapt.

3
Q

Machine Learning

A

Algorithms whose performance improves as they are exposed to more data over time.

4
Q

Deep Learning

A

A subset of machine learning in which multilayered neural networks learn from vast amounts of data.

5
Q

Artificial Intelligence dictionary definition.

A

A branch of computer science dealing with the simulation of intelligent behavior in computers. (Merriam-Webster)

6
Q

Machine Learning

A

The study and construction of programs that are not explicitly programmed, but learn patterns as they are exposed to more data over time.

7
Q

Two types of Machine Learning

A

-Supervised Learning
-Unsupervised Learning

8
Q

Supervised Learning

A
- Dataset: has a target column.
- Goal: make predictions.
- Example: fraud detection.
9
Q

Unsupervised Learning

A

- Dataset: does not have a target column.
- Goal: find structure in the data.
- Example: customer segmentation.

10
Q

Machine Learning (example)

A
- Suppose you wanted to identify fraudulent credit card transactions.
- You could define features such as:
  - Transaction time
  - Transaction amount
  - Transaction location
  - Category of purchase
- The algorithm could learn what feature combinations suggest unusual activity.
11
Q

Deep Learning

A

Machine learning that involves using very complicated models called deep neural networks.

- The models determine the best representation of the original data; in classic machine learning, humans must do this.
12
Q

Deep Learning example

A

Classic machine learning: (1) feature detection, then (2) a machine learning classifier algorithm, producing the output (e.g. identifying "Arjun").

- Deep learning: steps 1 and 2 are combined into a single step, using a complex neural network model.
15
Q

History of AI

A

- AI has experienced several hype cycles, oscillating between periods of excitement (AI booms) and disappointment (AI winters).

- AI solutions include speech recognition, computer vision, assisted medical diagnosis, robotics, and others.

16
Q

Learning Goals

A

In this section, we will cover:
- Background and tools used in this course
- The machine learning workflow
- Machine learning vocabulary

17
Q

Background and Tools

A

Examples assume familiarity with:
- Python libraries (e.g. NumPy and Pandas) and Jupyter Notebooks.
- Basic statistics, including probability, calculating moments, and Bayes' rule.
18
Q

Examples use IPython (via JupyterLab/Notebook) with the following libraries:

A

- NumPy
- Pandas (we will usually read data into a Pandas DataFrame)
- Matplotlib
- Seaborn
- Scikit-Learn
- TensorFlow
- Keras

19
Q

Machine Learning Workflow

A

- Problem statement: what problem are you trying to solve?
- Data collection: what data do you need to solve it?
- Data exploration and preprocessing: how should you clean your data so your model can use it?
- Modeling: build a model to solve your problem.
- Validation: did you solve the problem?
- Decision making and deployment: communicate to stakeholders or put into production.

20
Q

Machine Learning Vocabulary

A

Target: the category or value that we are trying to predict.

Features: properties of the data used for prediction (explanatory variables).

Example/Observation: a single data point within the data (one row).

Label: the target value for a single data point.

21
Q

Modern AI

A

Factors that have contributed to the current state of Machine Learning are: bigger data sets, faster computers, open source packages, and a wide range of neural network architectures.

22
Q

Learning Goals

A

In this section, we will cover:
- Retrieving data from multiple data sources:
  - SQL databases
  - NoSQL databases
  - APIs
  - Cloud data sources
- Common issues that arise when importing data.
23
Q

Reading CSV Files

A

Comma-separated values (CSV) files consist of rows of data, with values separated by commas.
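
A minimal sketch of reading a CSV file into a Pandas DataFrame (the file name data.csv is a placeholder, not a dataset from the course):

    import pandas as pd

    # Read a comma-separated file into a DataFrame
    df = pd.read_csv("data.csv")

    # Inspect the first few rows and the inferred column types
    print(df.head())
    print(df.dtypes)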

24
Q

JSON Files

A

JavaScript Object Notation (JSON) files are a standard way to store data across platforms.
JSON files are very similar in structure to Python dictionaries.
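
A minimal sketch of loading a JSON file (the file name data.json is a placeholder); Pandas also offers pd.read_json for suitably structured files:

    import json
    import pandas as pd

    # Load the JSON file into Python objects (dicts and lists)
    with open("data.json") as f:
        records = json.load(f)

    # A list of flat records maps naturally onto a DataFrame
    df = pd.DataFrame(records)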

25
Q

SQL Databases

A

Structured Query Language (SQL) represents a set of relational databases with fixed schemas.

There are many types of SQL databases, which function similarly (with some subtle differences in syntax).

Examples of SQL databases:
- Microsoft SQL Server
- PostgreSQL
- MySQL
- AWS Redshift
- Oracle DB
- Db2 family

27
Q

Not-only SQL (NoSQL)

A

NoSQL databases are not relational and vary more in structure. Depending on the application, they may perform more quickly or reduce technical overhead. Most NoSQL databases store data in JSON format.

Examples of NoSQL databases:
- Document databases: MongoDB, CouchDB
- Key-value stores: Riak, Voldemort, Redis
- Graph databases: Neo4j, HyperGraphDB
- Wide-column stores: Cassandra, HBase

28
Q

APIs and Cloud Data Access

A

A variety of data providers make data available via Application Programming Interfaces (APIs), which make it easy to access such data via Python.

- There are also a number of datasets available online in various formats.
- One example is the UC Irvine (UCI) Machine Learning Repository.
- Here, we read one of its datasets into Pandas directly via the URL (see the sketch below).
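
A minimal sketch of reading an online dataset into Pandas directly via its URL; the URL below is a placeholder rather than a specific UCI dataset, and the header/column settings depend on the file:

    import pandas as pd

    # Placeholder URL: substitute the raw data file's actual address
    url = "https://example.com/path/to/dataset.csv"
    df = pd.read_csv(url, header=None)

    print(df.shape)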

29
Q

Reading SQL Data

A

- While this example uses sqlite3, there are several other packages available.
- The sqlite3 module creates a connection with the database.
- Data is read into Pandas by combining a query with this connection (see the sketch below).
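
A minimal sketch with sqlite3, assuming a hypothetical database file data.db containing a table named observations:

    import sqlite3
    import pandas as pd

    # Create a connection to the SQLite database file
    con = sqlite3.connect("data.db")

    # Combine a query with the connection to load results into a DataFrame
    df = pd.read_sql("SELECT * FROM observations", con)

    con.close()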
30
Q

Reading NoSQL Data

A

- This example uses the pymongo module to read data stored in MongoDB, although there are several other packages available.
- We first make a connection with the database (MongoDB needs to be running).
- Data is read into Pandas by combining a query with this connection.
- Here, the query should be replaced with a MongoDB query document (or {} to select all); see the sketch below.
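
A minimal sketch with pymongo, assuming MongoDB is running locally; the database name, collection name, and query are placeholders:

    import pandas as pd
    from pymongo import MongoClient

    # Connect to a locally running MongoDB instance
    client = MongoClient("mongodb://localhost:27017/")
    collection = client["my_database"]["my_collection"]

    # {} selects all documents; replace it with a query document to filter
    cursor = collection.find({})
    df = pd.DataFrame(list(cursor))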
31
Q

Data Cleaning for Machine Learning

A

Learning Goals:
In this section, we will cover:
- Why data cleaning is important for machine learning.
- Issues that arise with messy data.
- How to identify duplicate or unnecessary data.
- Policies for dealing with outliers.

32
Q

Why is data cleaning so important?

A
- Decisions and analytics are increasingly driven by data and models.

Key aspects of the machine learning workflow depend on cleaned data:
- Observations: an instance of the data (usually a point or row in a dataset).
- Labels: the output variable(s) being predicted.
- Algorithms: computer programs that estimate models based on available data.
- Features: information we have for each observation (variables).
- Model: the hypothesized relationship between the features and the target.

33
Q

Why is data cleaning so important?

A

Messy data can lead to a garbage-in, garbage-out effect and unreliable outcomes.

34
Q

The Main data problems companies face:

A
- Too much data
- Lack of data
- Bad data

Having data ready for ML and AI ensures you are ready to infuse AI across your organization.
35
Q

How can data be messy?

A
- Duplicate or unnecessary data
- Inconsistent text and typos
- Missing data
- Outliers
- Data sourcing issues:
  - Multiple systems
  - Different database types
  - On premises vs. in the cloud
- And more.
36
Q

Duplicate or unnecessary data

A

- Pay attention to duplicate values and research why there are multiple values.
- It is a good idea to look at the features you are bringing in and filter the data as necessary (but be careful not to filter too much if you may use those features later).
37
Q

Policies for Missing Data

A

- Remove the data: remove the affected rows entirely.
- Impute the data: replace missing values with substituted values, e.g. the most common value or the average value.
- Mask the data: create a category for missing values.
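
A minimal sketch of the three policies in Pandas, using a toy DataFrame with illustrative column names:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"amount": [10.0, np.nan, 7.5],
                       "city": ["Austin", None, "Boston"]})

    # Remove: drop any row containing a missing value
    removed = df.dropna()

    # Impute: fill missing numeric values with the column average
    imputed = df.copy()
    imputed["amount"] = imputed["amount"].fillna(imputed["amount"].mean())

    # Mask: treat missingness as its own category
    masked = df.copy()
    masked["city"] = masked["city"].fillna("Missing")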
38
Q

What are the pros and cons for each of these approaches?

A
39
Q

Remove the data Pros and cons

A

Pros:
- It will quickly clean your dataset without having to guess an appropriate replacement value.

Cons:
- If values are missing for many rows, we may lose too much information, or end up with a dataset that is biased toward whatever reason the data was not collected.

40
Q

Impute the data Pros and cons.

A

Pros:
- We don't lose full rows or columns that may be important for our model, as we would when removing entire rows.

Cons:
- We add another level of uncertainty to our model, since it is now based on estimates of what we think the true values of the missing data would have been.

41
Q

Outliers

A

- An outlier is an observation in data that is distant from most other observations.
- Typically, these observations are aberrations and do not accurately represent the phenomenon we are trying to explain through the model.
- If we do not identify and deal with outliers, they can have a significant impact on the model.
- It is important to remember that some outliers are informative and provide insights into the data.

42
Q

How to find outliers?

A
- Plots: histogram, density plot, box plot.
- Statistics: interquartile range (IQR), standard deviation.
- Residuals: standardized, deleted, studentized.
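
A minimal sketch of the interquartile-range rule with Pandas (the 1.5 multiplier is the usual convention; the data are toy values):

    import pandas as pd

    s = pd.Series([10, 12, 11, 13, 12, 95])  # one extreme value

    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Values outside the fences are flagged as potential outliers
    print(s[(s < lower) | (s > upper)])  # flags 95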
43
Q

Residuals

A

The difference between the actual and predicted values of the outcome variable; large residuals represent model failure for that observation.

44
Q

Approaches to calculating residuals:

A
- Standardized: residual divided by its standard error.
- Deleted: residual from fitting the model on all data excluding the current observation.
- Studentized: deleted residual divided by the residual standard error (based on all data, or all data excluding the current observation).
45
Q

Policies for outliers

A
- Remove them.
- Assign the mean or median value.
- Transform the variable.
- Predict what the value should be:
  - Using similar observations to predict likely values.
  - Using regression.
- Keep them, but focus on models that are resistant to outliers.

46
Q

Learning Goals

A
In this section, we will cover:
- Approaches to conducting exploratory data analysis (EDA)
- EDA techniques
- Sampling from DataFrames
- Producing EDA visualizations
47
Q

What is Exploratory Data Analysis?

A

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

48
Q

Why is EDA Useful?

A

- EDA allows us to get an initial feel for the data.
- This lets us determine whether the data makes sense, or whether further cleaning or more data is needed.
- EDA helps to identify patterns and trends in the data (these can be just as important as findings from modeling).

49
Q

Summary Statistics:

A

Average, Median, Min, Max, correlations, etc.

50
Q

Visualizations:

A

Histograms, Scatter Plots, Box Plots, etc.

51
Q

Tools for EDA

A

Data wrangling: Pandas
Visualization: Matplotlib, Seaborn

52
Q

EDA: Job Applicant Summary Statistics

A

Suppose we want to examine characteristics of job applicants:
- Average: we could look at the average of all interview scores (perhaps by city or job function).
- Most common: we could look at the most common words applicants use in application materials.
- Correlations: we could look at the correlations between technical assessments and years of experience (perhaps by type of experience).

53
Q

Sampling from DataFrames

A

There are many reasons to consider random samples from DataFrames:
- For large data, a random sample can make computation easier.
- We may want to train models on a random sample of the data.
- We may want to over- or under-sample observations when outcomes are uneven.
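
A minimal sketch of sampling rows from a DataFrame with Pandas (fractions, sizes, and the random_state are illustrative):

    import pandas as pd

    df = pd.DataFrame({"x": range(100)})

    # 10% random sample without replacement; random_state makes it reproducible
    sample = df.sample(frac=0.1, random_state=42)

    # Sampling with replacement, e.g. as part of an over-sampling scheme
    resampled = df.sample(n=200, replace=True, random_state=42)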

54
Q

Visualizations Libraries

A

Visualizations can be created in multiple ways:
- Matplotlib
- Pandas (via Matplotlib)
- Seaborn
  - Statistically focused plotting methods.
  - Global style preferences that are incorporated by Matplotlib.
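
A minimal sketch showing the same histogram drawn three ways (the column name is illustrative; sns.histplot assumes a recent Seaborn version):

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    df = pd.DataFrame({"score": [1, 2, 2, 3, 3, 3, 4, 4, 5]})

    df["score"].plot(kind="hist")      # Pandas (via Matplotlib)
    plt.figure()
    plt.hist(df["score"])              # Matplotlib directly
    plt.figure()
    sns.histplot(data=df, x="score")   # Seaborn
    plt.show()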

55
Q

Feature Engineering and Variable Transformation

A
56
Q

Learning Goals (in this section, we will cover):

A
- Feature engineering and variable transformation
- Feature encoding
- Feature scaling
57
Q

Transforming Data : Background

A

- Models used in machine learning workflows often make assumptions about the data.
- A common example is the linear regression model, which assumes a linear relationship between the observations and the target (outcome) variable.
- An example of a linear model relating (feature) variables x₁ and x₂ to the target (label) variable y is:

y_β(x) = β₀ + β₁x₁ + β₂x₂

Here, β = (β₀, β₁, β₂) represents the model's parameters.

58
Q

Transformation of Data Distributions

A

- Predictions from linear regression models assume residuals are normally distributed.
- Features and predicted data are often skewed (distorted away from the center).
- Data transformations can address this issue.

59
Q

Log Features

A

- Log transformations can be useful for linear regression:

y_β(x) = β₀ + β₁ log(x)

- The linear regression model still involves a linear combination of features.

60
Q

Polynomial Features

A

- We can capture higher-order relationships in the data by adding polynomial features:

y_β(x) = β₀ + β₁x + β₂x²

- This allows us to keep using the same linear model, even with higher-order polynomials:

y_β(x) = β₀ + β₁x + β₂x² + β₃x³
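
A minimal sketch with scikit-learn: expanding x into [x, x²] so the same linear model can fit a curved relationship (toy data):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([1.0, 4.0, 9.0, 16.0])  # quadratic relationship

    # include_bias=False because LinearRegression fits the intercept itself
    X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

    model = LinearRegression().fit(X_poly, y)
    print(model.intercept_, model.coef_)  # roughly 0 and [0, 1]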
61
Q

Variable selection

A

Variable selection involves choosing the set of features to include in the model.
- Variables must often be transformed before they can be included in models. In addition to log and polynomial transformations, this can involve:
  - Encoding: converting non-numeric features to numeric features.
  - Scaling: converting the scale of numeric data so features are comparable.
- The appropriate method of scaling or encoding depends on the type of feature.

62
Q

Types of Features

A

Encoding is often applied to categorical features, which take non-numeric values.
Two primary types:
- Nominal: categorical variables that take values in unordered categories (e.g. Red, Blue, Green; True, False).
- Ordinal: categorical variables that take values in ordered categories (e.g. High, Medium, Low).
63
Q

Feature encoding: Approaches

A

There are several common approaches to encoding variables:
- Binary encoding: converts a variable to 0 or 1; suitable for variables that take two possible values (e.g. True, False).
- One-hot encoding: converts a variable that takes multiple values into a set of binary (0/1) variables, one per category. This creates several new variables.
- Ordinal encoding: converts ordered categories to numerical values, usually by creating one variable that takes integer values, one per category (e.g. 0, 1, 2, 3).
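
A minimal sketch of the three encodings using Pandas and scikit-learn (column names and category order are illustrative):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({"member": [True, False, True],
                       "color": ["Red", "Blue", "Green"],
                       "size": ["Low", "High", "Medium"]})

    # Binary encoding: a two-valued variable becomes 0/1
    df["member"] = df["member"].astype(int)

    # One-hot encoding: one 0/1 column per color category
    df = pd.get_dummies(df, columns=["color"])

    # Ordinal encoding: ordered categories mapped to integers 0, 1, 2
    encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
    df["size"] = encoder.fit_transform(df[["size"]]).ravel()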
64
Q

Feature scaling

A

Feature scaling involves adjusting a variable's scale, which allows comparison of variables with different scales.
- Different continuous (numeric) features often have different scales.

65
Q

Feature Scaling: Approaches

A

There are many approaches to scaling features. Some of the more common approaches include:
- Standard scaling: converts features to standard normal variables (by subtracting the mean and dividing by the standard deviation).
- Min-max scaling: converts variables to continuous variables in the (0, 1) interval by mapping the minimum value to 0 and the maximum to 1. This type of scaling is sensitive to outliers.
- Robust scaling: is similar to min-max scaling, but instead maps the interquartile range (the 75th percentile value minus the 25th percentile value) to (0, 1). This means the scaled variable can take values outside the (0, 1) interval.
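
A minimal sketch of the three scalers in scikit-learn, applied to a single toy feature containing an outlier:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

    X = np.array([[1.0], [2.0], [3.0], [100.0]])

    X_standard = StandardScaler().fit_transform(X)  # zero mean, unit variance
    X_minmax = MinMaxScaler().fit_transform(X)      # mapped into [0, 1]
    X_robust = RobustScaler().fit_transform(X)      # centered on the median, scaled by the IQR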

66
Q

Estimation and Inference

A

Learning Goals
In this section, we will cover:
- Statistical estimation and inference
- Parametric and non-parametric approaches to modeling
- Common statistical distributions
- Frequentist vs. Bayesian statistics

67
Q

Estimation Vs Inference

A

Estimation: the application of an algorithm, for example taking an average.

Inference: putting an accuracy on the estimate (e.g. the standard error of an average).

68
Q

Machine Learning and Statistical Inference

A
- Machine learning and statistical inference are similar (a case of computer science borrowing from a long history in statistics).

- In both cases, we are using data to learn or infer qualities of the distribution that generated the data (often termed the data-generating process).

- We may care either about the whole distribution or just certain features of it (e.g. the mean).

- Machine learning applications that focus on understanding parameters and individual effects draw more on tools from statistical inference (some applications are focused only on the results).

69
Q

Example Customer Churn

A

- Customer churn occurs when a customer leaves a company.
- Data related to churn may include a target variable for whether or not the customer left.
- Features could include:
  - The length of time as a customer
  - The type and amount purchased
  - Other customer characteristics (age, location)
- Churn prediction is often approached by predicting a score for each individual that estimates the probability the customer will leave.
70
Q

Customer Churn: Estimation

A

- Estimation of the factors behind customer churn involves measuring the impact of each feature in predicting churn.
- Inference involves determining whether these measured impacts are statistically significant.
71
Q

Customer Churn: Example Dataset

A

IBM Cognos Customer Churn dataset:
- Data from a fictional telecommunications firm.
- Includes account type, customer characteristics, revenue per customer, a satisfaction score, and an estimate of customer lifetime value.
- Includes information on whether the customer churned (and some categories of churn type).

72
Q

Parametric Vs Non-parametric

A

- If inference is about trying to find out the data-generating process (DGP), then we can say that a statistical model (of the data) is a set of possible distributions or regressions.
- A parametric model is a particular type of statistical model: it is also a set of distributions or regressions, but one described by a finite number of parameters.
- Non-parametric statistics make weaker assumptions: in particular, we don't assume that the data belongs to any particular distribution (also called distribution-free inference).

This doesn't mean that we know nothing, though!

73
Q

Non-Parametric Inference example

A

An example of non-parametric inference is estimating the distribution of the data (its CDF, or cumulative distribution function) using a histogram.

- In this case, we are not specifying any parameters.

74
Q

Parametric models example

A

The normal distribution (parameterized by its mean and variance).

75
Q

Example: Customer Lifetime Value

A

- Customer lifetime value is an estimate of the customer's value to the company.
- Data related to customer lifetime value might include:
  - The expected length of time as a customer
  - The expected amount spent over time
- To estimate lifetime value, we make assumptions about the data.
- These assumptions can be parametric (assuming a specific distribution) or non-parametric.

76
Q

Parametric Models: Maximum Likelihood

A
- The most common way of estimating parameters in a parametric model is through maximum likelihood estimation (MLE).
- The likelihood function is related to probability and is a function of the parameters of the model.
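
A minimal sketch of maximum likelihood estimation for a normal model using SciPy (the simulated data and true parameters are illustrative):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=1000)  # simulated sample

    # norm.fit returns the maximum likelihood estimates of the mean and std
    mu_hat, sigma_hat = norm.fit(data)
    print(mu_hat, sigma_hat)  # close to the true values 5.0 and 2.0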
77
Q

Commonly used Distributions

A

1) Uniform
2) Gaussian/Normal
3) Log-normal
4) Exponential
5) Poisson

78
Q

Frequentist Vs Bayesian Statistics

A
- A frequentist is concerned with repeated observations in the limit.
- In either the frequentist or the Bayesian approach, we might, for example, estimate the probabilities of a certain number of customers arriving over a fixed period of time.
79
Q

Queueing Theory

A

The study of queues (lines) and how many servers are needed to match the size of the queue, e.g. how many cashiers a grocery store needs to check out its customers in a timely fashion.

80
Q

Frequentist Vs Bayesian Statistics

A

- Processes may have true frequencies, but we model probabilities as the long-run frequency over many repeats of an experiment.

81
Q

Frequentist Approach

A

1) Derive the probabilistic properties of a procedure. (There is a fixed value for a given probability in the population of our sample; we derive the estimate directly from the data with no external influence.)

2) Apply the probability directly to the observed data.

82
Q

Bayesian Approach

A

- A Bayesian describes parameters by probability distributions.

- Before seeing any data, a prior distribution (based on the experimenter's beliefs) is formulated.

- This prior distribution is then updated after seeing data (a sample from the distribution).

- After updating, the distribution is called the posterior distribution.

- We use much of the same math and the same formulas in both frequentist and Bayesian statistics.
- The element that differs is the interpretation.
- We will point out the differences in interpretation where appropriate.
83
Q

Hypothesis Testing

A

Learning Goals
- Overview of Hypothesis testing
- Bayesian approach to hypothesis testing
- An example of hypothesis testing involving coin-tossing

84
Q

Hypothesis

A

A hypothesis is a statement about a population parameter, such as the mean of our Poisson distribution: an estimate of the number of people who will join the line in our grocery store example in the next hour.

85
Q

We Create two hypotheses:

A
- The null hypothesis (H0)
- The alternative hypothesis (H1 or Ha)
- We choose which one to call the null depending on how the problem is set up.
86
Q

Hypothesis Testing: Decision Rules

A

A hypothesis testing procedure gives us a rule to decide:
- For which values of the test statistic do we accept H0?
- For which values of the test statistic do we reject H0 and accept H1?

- You may hear some people say that you can reject H0 but never accept H1.
- Here this doesn't matter very much, since we are using hypothesis testing to decide which of two paths to take in the project.
87
Q

Hypothesis Testing: Bayesian Approach

A

- In the Bayesian interpretation (example to follow), we don't get a decision boundary.

- Instead, we get updated (posterior) probabilities.

88
Q

Coin Tossing Example

A

You have two coins:
- Coin 1 has a 70% probability of coming up heads.
- Coin 2 has a 50% probability of coming up heads.

- Pick one coin without looking.
- Toss the coin 10 times and record the number of heads.
- Given the number of heads you see, which of the two coins did you toss?
89
Q

Hypothesis Testing: Bayesian Interpretation

A

In the Bayesian interpretation, we need priors for each hypothesis:
- In this case, we randomly chose the coin to flip.
- P(H1 = we chose coin 1) = 1/2 and P(H2 = we chose coin 2) = 1/2.
- We then update the priors after seeing the data, e.g. 3 heads (Bayes' rule); see the sketch below.

Since we have no way, before seeing the data, to determine which coin was chosen, we simply assign 1/2 to each.
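
A minimal sketch of the update via Bayes' rule, assuming we observed 3 heads in 10 tosses as in the example:

    from scipy.stats import binom

    # Likelihood of 3 heads in 10 tosses under each coin
    like1 = binom.pmf(3, 10, 0.7)  # coin 1: P(heads) = 0.7
    like2 = binom.pmf(3, 10, 0.5)  # coin 2: P(heads) = 0.5

    prior1 = prior2 = 0.5

    # Posterior is proportional to prior times likelihood
    post1 = prior1 * like1 / (prior1 * like1 + prior2 * like2)
    print(post1, 1 - post1)  # about 0.07 vs 0.93: the data favor coin 2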

90
Q

Bayesian Interpretation

A
- The priors are multiplied by the likelihood ratio, which does not depend on the priors.
- The likelihood ratio tells us how we should update the priors in reaction to seeing a given set of data.
91
Q

Type 1 vs Type 2 Error

A

Learning Goals
In this section, we will cover :
- Hypothesis testing terminology including Type-1 and Type-2 errors.
- Examples of Hypothesis tests in Practice.

92
Q

Neyman-Pearson Interpretation

A

The Neyman-Pearson paradigm (1933) is non-Bayesian; it gives an up-or-down decision on H0 vs. H1.

93
Q

Neyman-Pearson Interpretation example

A
- We toss a coin, and our null hypothesis is that we are working with a fair coin, so there is a 50-50 probability it lands on heads.
- The alternative hypothesis is that it is not a fair coin, i.e. the probability of heads is not 50%.
94
Q

Type-1 error (Case study)

A

A Type 1 error in this case is incorrectly rejecting the null: we are indeed working with a fair coin, but given our sample data we mistakenly decide to reject the null hypothesis that the coin is fair.

95
Q

Type 2 error

A

A Type 2 error is incorrectly accepting the null: we are actually working with a biased coin, but given our data we accept (or fail to reject) the hypothesis that the coin is fair.

96
Q

Example customer churn

A

- Customer churn occurs when a customer leaves a company.
- Data related to churn may include a target variable for whether or not the customer left.
- Features could include:
  - The length of time as a customer
  - The type and amount purchased
  - Other customer characteristics
- Churn prediction is often approached by predicting a score for each individual that estimates the probability the customer will leave.
97
Q

Customer Churn: Type 1 VS Type 2 Error

A

- Suppose we use data on customer characteristics to predict who will churn over the next year.
- In our data, customers who have been with the company for longer are less likely to churn.
- This could be due to an underlying effect, or due to chance:
  - A Type 1 error occurs when this effect is due to chance, but we find it to be significant in the model.
  - A Type 2 error occurs when we ascribe the effect to chance, but the effect is non-coincidental.
98
Q

Hypothesis Testing: Terminology

A

The likelihood ratio is used as a test statistic: we use it to decide whether to accept or reject H0.

99
Q

The rejection region

A

is the set of values of the test statistic that lead to rejection of H0.

100
Q

The Acceptance region

A

is the set of values of the test statistic that lead to acceptance of H0.

101
Q

The Null distribution

A

is the test statistic's distribution when the null hypothesis is true.

102
Q

Hypothesis Testing: Marketing Intervention

A

Testing marketing intervention effectiveness:
- For a new direct mail marketing campaign to existing customers, the null hypothesis (H0) is that the campaign does not impact purchasing.
- The alternative hypothesis (H1) is that it has an impact.
103
Q

Hypothesis Testing: Website Layout

A

Testing a change in website layout:
- For a proposed change to a web layout, we may test a null hypothesis (H0) that the change has no impact on traffic.
- Here, we would look for evidence to reject the null in favor of an alternative hypothesis (H1: there is an impact on traffic).

104
Q

Hypothesis Testing: Product Quality/Size

A

Testing whether a product meets an expected size threshold:
- Suppose a product is produced in various factories, with expected size S.
- To confirm that the product size meets the standard within a margin of error, the company might:
  - Randomly sample from each production source.
  - Establish H0 (product size is not significantly different from S) and H1 (there is a significant deviation in product size).
  - Test whether H0 can be rejected in favor of H1 based on the observed mean and standard deviation.
105
Q

Significance level and P-Values

A

Learning Goals
In this section, we will cover:
- Hypothesis testing: significance level and p-values
- Power and sample size considerations.

106
Q

Significance Level and P-Values

A

- We know the distribution of the test statistic when the null hypothesis is true (the null distribution).
- To get a rejection region, we calculate the test statistic.
- We choose, before looking at the data, the level at which we will reject the null hypothesis.

107
Q

Significance Level and P-Values

A

A significance level (α) is a probability threshold below which the null hypothesis will be rejected.
- We must choose α before computing the test statistic! If we don't, we might be accused of p-hacking.
- Choosing α is somewhat arbitrary, but it is often 0.01 or 0.05.

108
Q

P-Value

A

The p-value is the probability, under the null distribution, of a result as extreme as or more extreme than what was actually observed. It is the smallest significance level at which the null hypothesis would be rejected.

109
Q

The Confidence interval

A

The set of values of the statistic for which we accept the null hypothesis.

110
Q

F-Statistic

A

- H0: the data can be modeled by setting all betas (coefficients) to zero.
- Reject the null if the p-value is small enough.

111
Q

Power and Sample size

A

- If you do many 5% significance tests looking for a significant result, the chances of making at least one Type-1 error increase.
- The probability of at least one Type-1 error is approximately 1 - (1 - 0.05)^(#tests).
- This is roughly 0.05 × (#tests) if you have 10 or fewer tests.
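
A minimal sketch of the calculation for 10 tests at the 5% level:

    n_tests = 10
    alpha = 0.05

    exact = 1 - (1 - alpha) ** n_tests  # about 0.40
    rough = alpha * n_tests             # rule-of-thumb approximation: 0.50
    print(exact, rough)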
112
Q

Power: Bonferroni Correction

A

- The Bonferroni correction says: choose the p-value threshold so that the probability of making a Type-1 error (assuming no effect) is 5%.
- Typically choose: p threshold = 0.05 / (#tests).
- The Bonferroni correction allows the probability of a Type-1 error to be controlled, but at the cost of power.
- Effects either need to be larger, or the tests need larger samples, to be detected.
- Best practice is to limit the number of comparisons to a few well-motivated cases.

113
Q

Here’s a summary of the key concepts in hypothesis testing:

A

-Hypothesis testing is a statistical method to determine if a claim (hypothesis) about a population is supported by sample data.

-The significance level (alpha) is a threshold used to determine if the evidence is strong enough to reject the null hypothesis. A common value for alpha is 0.05.

-The p-value is the probability of obtaining a result as extreme as, or more extreme than, the observed result, assuming the null hypothesis is true. A p-value less than or equal to alpha means we reject the null hypothesis.

-Power is the probability of correctly rejecting the null hypothesis when it is false. A higher power means we are more likely to detect a true effect.

-Sample size considerations are important because they can affect the significance level and power of a hypothesis test. A larger sample size generally leads to a higher power to detect a true effect, while a small sample size may increase the chance of a Type II error (failing to reject the null hypothesis when it is false).

115
Q

Overview of Hypothesis Testing:

A

Hypothesis testing is a statistical method used to determine if a claim (hypothesis) about a population is supported by sample data.
It involves formulating a null hypothesis (the claim being tested) and an alternative hypothesis (the claim we want to support).
We then collect data and use statistical tests to determine if the data provides enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
Bayesian Approach to Hypothesis Testing:

The Bayesian approach to hypothesis testing involves calculating the probability of the hypothesis being true given the observed data, rather than calculating the probability of the data given the hypothesis (as in traditional hypothesis testing).
It involves using prior knowledge or beliefs about the hypothesis, updating these beliefs based on the observed data, and calculating the posterior probability of the hypothesis.
Example of Hypothesis Testing Involving Coin-Tossing:

Suppose we want to test the hypothesis that a coin is fair (i.e., has a 50-50 chance of landing heads or tails).
We can formulate the null hypothesis as “the coin is fair” and the alternative hypothesis as “the coin is not fair”.
We then toss the coin a certain number of times and record the number of heads and tails.
We can use a statistical test (such as the chi-square test) to determine if the observed data provides enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
For example, if we toss the coin 100 times and get 60 heads and 40 tails, we can calculate a p-value (the probability of getting a result as extreme or more extreme than the observed result, assuming the null hypothesis is true) and compare it to the significance level (the threshold for rejecting the null hypothesis). If the p-value is less than or equal to the significance level (often set at 0.05), we reject the null hypothesis and conclude that the coin is not fair.

116
Q

Correlation Vs Causation

A

Learning Goals
In this section, we will cover:
- Correlation vs. causation
- Confounding variables
- Examples of spurious correlations.

117
Q

Does it rain more on cooler days?

A

- We associate rain with cold weather.
- Does it actually rain more when days are cooler?
- Maybe it depends on where you are:
  - Some places have summer monsoons, so maybe as it gets warmer there, it rains more.
  - Warmer weather increases evaporation, which can increase humidity. In warm weather, there is more water in the air to form precipitation. This mechanism would suggest warmer weather means more rain.
  - Cooler weather decreases the dew point (i.e. the air can hold less water). This suggests that if humid air enters and cools, it will turn into rain. This mechanism would suggest cooler weather means more rain.
118
Q

How are correlations important?

A

- If two variables X and Y are correlated, then X is useful for predicting Y.
- If we are trying to model Y, and we find things that correlate with Y, we may improve the modeling.
- We should be careful about changing X with the hope of changing Y.

X and Y can be correlated for different reasons:
- X causes Y (what we want), e.g. our marketing budget successfully leading to higher revenue.
- Y causes X (mixing up cause and effect).
- X and Y are both caused by something else (confounding).
- X and Y are not related; we just got lucky in the sample (spurious).

Note: confounding correlation is actually a subset of spurious correlation. We can think of a spurious correlation as any time two values really aren't related at all: maybe there is a confounding variable, or maybe it is just random that marketing spend and revenue both go up at the same time.

119
Q

Mixing up Cause and Effect

A

1) Student test scores are positively correlated with the amount of time studied.

This does not mean we should get students to study more by curving everyone's grades upward. It is more likely that studying helps students learn the material, so studying causes better performance.

2) Customer satisfaction is negatively correlated with customer service call volume.

This doesn't mean we should remove or hide the customer service numbers in the hope of improving customer satisfaction.

120
Q

Confounding Variables

A

- A confounding variable is something that causes both X and Y to change.

- With confounding, X and Y are correlated even though X doesn't cause Y and Y doesn't cause X.

121
Q

Example of Confounding Variable

A

1) The number of annual car accidents and the number of people named John are positively correlated (both are correlated with the population size).

2) The amount of ice cream sold and the number of drownings in a week are positively correlated (both are positively correlated with temperature).

3) The number of factories a chip manufacturer owns and the number of chips sold are positively correlated (but both are driven by demand from the market).