Exploratory Data Analysis for Machine Learning Flashcards

1
Q

In this overview, we will discuss

A

- Define Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).
- Explain how DL helps solve classical ML limitations.
- Explain key historical developments and the hype/AI-winter cycle.
- Differentiate modern AI from prior AI.
- Relate sample applications of AI.
2
Q

Artificial Intelligence (AI)

A

A program that can sense, reason, act, and adapt.

3
Q

Machine Learning

A

Algorithms whose performance improves as they are exposed to more data over time.

4
Q

Deep Learning

A

A subset of machine learning in which multilayered neural networks learn from vast amounts of data.

5
Q

Artificial Intelligence dictionary definition.

A

A branch of computer science dealing with the simulation of intelligent behavior in computers. (Merriam-Webster)

6
Q

Machine Learning

A

The study and construction of programs that are not explicitly programmed, but learn patterns as they are exposed to more data over time.

7
Q

Two types of Machine Learning

A

-Supervised Learning
-Unsupervised Learning

8
Q

Supervised Learning

A
- Dataset: has a target column.
- Goal: make predictions.
- Example: fraud detection.
9
Q

Unsupervised Learning

A

- Dataset: does not have a target column.
- Goal: find structure in the data.
- Example: customer segmentation.

10
Q

Machine Learning (example)

A
- Suppose you wanted to identify fraudulent credit card transactions.
- You could define features such as:
  - Transaction time
  - Transaction amount
  - Transaction location
  - Category of purchase
- The algorithm could learn what feature combinations suggest unusual activity.
11
Q

Deep Learning

A

Machine learning that involves using very complicated models called deep neural networks.

- The models determine the best representation of the original data; in classic machine learning, humans must do this.
12
Q

Deep Learning example

A

Classic machine learning: (1) feature detection, then (2) a machine learning classifier algorithm, producing the output (e.g. identifying "Arjun").

- Deep learning: steps 1 and 2 are combined into a single step, using a complex neural network model.
15
Q

History of AI

A

- AI has experienced several hype cycles, oscillating between periods of excitement (AI booms) and disappointment (AI winters).

- AI solutions include speech recognition, computer vision, assisted medical diagnosis, robotics, and others.

16
Q

Learning Goals

A

In this section, we will cover:
- Background and tools used in this course
- The machine learning workflow
- Machine learning vocabulary

17
Q

Background and Tools

A

Examples assume familiarity with:
- Python libraries (e.g. NumPy and Pandas) and Jupyter Notebooks.
- Basic statistics, including probability, calculating moments, and Bayes' rule.
18
Q

Examples use IPython (via JupyterLab/Notebook) with the following libraries:

A

- NumPy
- Pandas (we will usually read data into a Pandas DataFrame)
- Matplotlib
- Seaborn
- Scikit-Learn
- TensorFlow
- Keras

19
Q

Machine Learning Workflow

A

- Problem statement: what problem are you trying to solve?
- Data collection: what data do you need to solve it?
- Data exploration and preprocessing: how should you clean your data so your model can use it?
- Modeling: build a model to solve your problem.
- Validation: did you solve the problem?
- Decision making and deployment: communicate to stakeholders or put into production.

20
Q

Machine Learning Vocabulary

A

Target: the category or value that we are trying to predict.

Features: properties of the data used for prediction (explanatory variables).

Example/Observation: a single data point within the data (one row).

Label: the target value for a single data point.

21
Q

Modern AI

A

Factors that have contributed to the current state of Machine Learning are: bigger data sets, faster computers, open source packages, and a wide range of neural network architectures.

22
Q

Learning Goals

A

In this section, we will cover:
- Retrieving data from multiple data sources:
  - SQL databases
  - NoSQL databases
  - APIs
  - Cloud data sources
- Common issues that arise when importing data.
23
Q

Reading CSV Files

A

Comma-separated values (CSV) files consist of rows of data, with values separated by commas.
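
A minimal sketch of reading a CSV file into a Pandas DataFrame (the file name data.csv is a placeholder, not a dataset from the course):

    import pandas as pd

    # Read a comma-separated file into a DataFrame
    df = pd.read_csv("data.csv")

    # Inspect the first few rows and the inferred column types
    print(df.head())
    print(df.dtypes)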

24
Q

JSON Files

A

JavaScript Object Notation (JSON) files are a standard way to store data across platforms.
JSON files are very similar in structure to Python dictionaries.
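
A minimal sketch of loading a JSON file (the file name data.json is a placeholder); Pandas also offers pd.read_json for suitably structured files:

    import json
    import pandas as pd

    # Load the JSON file into Python objects (dicts and lists)
    with open("data.json") as f:
        records = json.load(f)

    # A list of flat records maps naturally onto a DataFrame
    df = pd.DataFrame(records)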

25
Q

SQL Databases

A

Structured Query Language (SQL) represents a set of relational databases with fixed schemas.

There are many types of SQL databases, which function similarly (with some subtle differences in syntax).

Examples of SQL databases:
- Microsoft SQL Server
- PostgreSQL
- MySQL
- AWS Redshift
- Oracle DB
- Db2 family

27
Q

Not-only SQL (NoSQL)

A

NoSQL databases are not relational and vary more in structure. Depending on the application, they may perform more quickly or reduce technical overhead. Most NoSQL databases store data in JSON format.

Examples of NoSQL databases:
- Document databases: MongoDB, CouchDB
- Key-value stores: Riak, Voldemort, Redis
- Graph databases: Neo4j, HyperGraphDB
- Wide-column stores: Cassandra, HBase

28
Q

APIs and Cloud Data Access

A

A variety of data providers make data available via Application Programming Interfaces (APIs), which make it easy to access such data via Python.

- There are also a number of datasets available online in various formats.
- One example is the UC Irvine (UCI) Machine Learning Repository.
- Here, we read one of its datasets into Pandas directly via the URL (see the sketch below).
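
A minimal sketch of reading an online dataset into Pandas directly via its URL; the URL below is a placeholder rather than a specific UCI dataset, and the header/column settings depend on the file:

    import pandas as pd

    # Placeholder URL: substitute the raw data file's actual address
    url = "https://example.com/path/to/dataset.csv"
    df = pd.read_csv(url, header=None)

    print(df.shape)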

29
Q

Reading SQL Data

A

- While this example uses sqlite3, there are several other packages available.
- The sqlite3 module creates a connection with the database.
- Data is read into Pandas by combining a query with this connection (see the sketch below).
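
A minimal sketch with sqlite3, assuming a hypothetical database file data.db containing a table named observations:

    import sqlite3
    import pandas as pd

    # Create a connection to the SQLite database file
    con = sqlite3.connect("data.db")

    # Combine a query with the connection to load results into a DataFrame
    df = pd.read_sql("SELECT * FROM observations", con)

    con.close()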
30
Q

Reading NoSQL Data

A

- This example uses the pymongo module to read data stored in MongoDB, although there are several other packages available.
- We first make a connection with the database (MongoDB needs to be running).
- Data is read into Pandas by combining a query with this connection.
- Here, the query should be replaced with a MongoDB query document (or {} to select all); see the sketch below.
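
A minimal sketch with pymongo, assuming MongoDB is running locally; the database name, collection name, and query are placeholders:

    import pandas as pd
    from pymongo import MongoClient

    # Connect to a locally running MongoDB instance
    client = MongoClient("mongodb://localhost:27017/")
    collection = client["my_database"]["my_collection"]

    # {} selects all documents; replace it with a query document to filter
    cursor = collection.find({})
    df = pd.DataFrame(list(cursor))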
31
Q

Data Cleaning for Machine Learning

A

Learning Goals:
In this section, we will cover:
- Why data cleaning is important for machine learning.
- Issues that arise with messy data.
- How to identify duplicate or unnecessary data.
- Policies for dealing with outliers.

32
Q

Why is data cleaning so important?

A
- Decisions and analytics are increasingly driven by data and models.

Key aspects of the machine learning workflow depend on cleaned data:
- Observations: an instance of the data (usually a point or row in a dataset).
- Labels: the output variable(s) being predicted.
- Algorithms: computer programs that estimate models based on available data.
- Features: information we have for each observation (variables).
- Model: the hypothesized relationship between the features and the target.

33
Q

Why is data cleaning so important?

A

Messy data can lead to a garbage-in, garbage-out effect and unreliable outcomes.

34
Q

The Main data problems companies face:

A
- Too much data
- Lack of data
- Bad data

Having data ready for ML and AI ensures you are ready to infuse AI across your organization.
35
Q

How can data be messy?

A
- Duplicate or unnecessary data
- Inconsistent text and typos
- Missing data
- Outliers
- Data sourcing issues:
  - Multiple systems
  - Different database types
  - On premises vs. in the cloud
- And more.
36
Q

Duplicate or unnecessary data

A

- Pay attention to duplicate values and research why there are multiple values.
- It is a good idea to look at the features you are bringing in and filter the data as necessary (but be careful not to filter too much if you may use those features later).
37
Q

Policies for Missing Data

A

- Remove the data: remove the affected rows entirely.
- Impute the data: replace missing values with substituted values, e.g. the most common value or the average value.
- Mask the data: create a category for missing values.
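
A minimal sketch of the three policies in Pandas, using a toy DataFrame with illustrative column names:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"amount": [10.0, np.nan, 7.5],
                       "city": ["Austin", None, "Boston"]})

    # Remove: drop any row containing a missing value
    removed = df.dropna()

    # Impute: fill missing numeric values with the column average
    imputed = df.copy()
    imputed["amount"] = imputed["amount"].fillna(imputed["amount"].mean())

    # Mask: treat missingness as its own category
    masked = df.copy()
    masked["city"] = masked["city"].fillna("Missing")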
38
Q

What are the pros and cons for each of these approaches?

A
39
Q

Remove the data Pros and cons

A

Pros:
- It will quickly clean your dataset without having to guess an appropriate replacement value.

Cons:
- If values are missing for many rows, we may lose too much information, or end up with a dataset that is biased toward whatever reason the data was not collected.

40
Q

Impute the data Pros and cons.

A

Pros:
- We don't lose full rows or columns that may be important for our model, as we would when removing entire rows.

Cons:
- We add another level of uncertainty to our model, since it is now based on estimates of what we think the true values of the missing data would have been.

41
Q

Outliers

A

- An outlier is an observation in data that is distant from most other observations.
- Typically, these observations are aberrations and do not accurately represent the phenomenon we are trying to explain through the model.
- If we do not identify and deal with outliers, they can have a significant impact on the model.
- It is important to remember that some outliers are informative and provide insights into the data.

42
Q

How to find outliers?

A
- Plots: histogram, density plot, box plot.
- Statistics: interquartile range (IQR), standard deviation.
- Residuals: standardized, deleted, studentized.
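
A minimal sketch of the interquartile-range rule with Pandas (the 1.5 multiplier is the usual convention; the data are toy values):

    import pandas as pd

    s = pd.Series([10, 12, 11, 13, 12, 95])  # one extreme value

    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Values outside the fences are flagged as potential outliers
    print(s[(s < lower) | (s > upper)])  # flags 95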
43
Q

Residuals

A

The difference between the actual and predicted values of the outcome variable; large residuals represent model failure for that observation.

44
Q

Approaches to calculating residuals:

A
- Standardized: residual divided by its standard error.
- Deleted: residual from fitting the model on all data excluding the current observation.
- Studentized: deleted residual divided by the residual standard error (based on all data, or all data excluding the current observation).
45
Q

Policies for outliers

A
- Remove them.
- Assign the mean or median value.
- Transform the variable.
- Predict what the value should be:
  - Using similar observations to predict likely values.
  - Using regression.
- Keep them, but focus on models that are resistant to outliers.

46
Q

Learning Goals

A
In this section, we will cover:
- Approaches to conducting exploratory data analysis (EDA)
- EDA techniques
- Sampling from DataFrames
- Producing EDA visualizations
47
Q

What is Exploratory Data Analysis?

A

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

48
Q

Why is EDA Useful?

A

- EDA allows us to get an initial feel for the data.
- This lets us determine whether the data makes sense, or whether further cleaning or more data is needed.
- EDA helps to identify patterns and trends in the data (these can be just as important as findings from modeling).

49
Q

Summary Statistics:

A

Average, Median, Min, Max, correlations, etc.

50
Q

Visualizations:

A

Histograms, Scatter Plots, Box Plots, etc.

51
Q

Tools for EDA

A

Data wrangling: Pandas
Visualization: Matplotlib, Seaborn

52
Q

EDA: Job Applicant Summary Statistics

A

Suppose we want to examine characteristics of job applicants:
- Average: we could look at the average of all interview scores (perhaps by city or job function).
- Most common: we could look at the most common words applicants use in application materials.
- Correlations: we could look at the correlations between technical assessments and years of experience (perhaps by type of experience).

53
Q

Sampling from DataFrames

A

There are many reasons to consider random samples from DataFrames:
- For large data, a random sample can make computation easier.
- We may want to train models on a random sample of the data.
- We may want to over- or under-sample observations when outcomes are uneven.
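
A minimal sketch of sampling rows from a DataFrame with Pandas (fractions, sizes, and the random_state are illustrative):

    import pandas as pd

    df = pd.DataFrame({"x": range(100)})

    # 10% random sample without replacement; random_state makes it reproducible
    sample = df.sample(frac=0.1, random_state=42)

    # Sampling with replacement, e.g. as part of an over-sampling scheme
    resampled = df.sample(n=200, replace=True, random_state=42)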

54
Q

Visualizations Libraries

A

Visualizations can be created in multiple ways:
- Matplotlib
- Pandas (via Matplotlib)
- Seaborn
  - Statistically focused plotting methods.
  - Global style preferences that are incorporated by Matplotlib.
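
A minimal sketch showing the same histogram drawn three ways (the column name is illustrative; sns.histplot assumes a recent Seaborn version):

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    df = pd.DataFrame({"score": [1, 2, 2, 3, 3, 3, 4, 4, 5]})

    df["score"].plot(kind="hist")      # Pandas (via Matplotlib)
    plt.figure()
    plt.hist(df["score"])              # Matplotlib directly
    plt.figure()
    sns.histplot(data=df, x="score")   # Seaborn
    plt.show()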

55
Q

Feature Engineering and Variable Transformation

A
56
Q

Learning Goals (in this section, we will cover):

A
- Feature engineering and variable transformation
- Feature encoding
- Feature scaling
57
Q

Transforming Data : Background

A

- Models used in machine learning workflows often make assumptions about the data.
- A common example is the linear regression model, which assumes a linear relationship between the observations and the target (outcome) variable.
- An example of a linear model relating (feature) variables x₁ and x₂ to the target (label) variable y is:

y_β(x) = β₀ + β₁x₁ + β₂x₂

Here, β = (β₀, β₁, β₂) represents the model's parameters.

58
Q

Transformation of Data Distributions

A

- Predictions from linear regression models assume residuals are normally distributed.
- Features and predicted data are often skewed (distorted away from the center).
- Data transformations can address this issue.

59
Q

Log Features

A

- Log transformations can be useful for linear regression:

y_β(x) = β₀ + β₁ log(x)

- The linear regression model still involves a linear combination of features.

60
Q

Polynomial Features

A

- We can capture higher-order relationships in the data by adding polynomial features:

y_β(x) = β₀ + β₁x + β₂x²

- This allows us to keep using the same linear model, even with higher-order polynomials:

y_β(x) = β₀ + β₁x + β₂x² + β₃x³
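
A minimal sketch with scikit-learn: expanding x into [x, x²] so the same linear model can fit a curved relationship (toy data):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([1.0, 4.0, 9.0, 16.0])  # quadratic relationship

    # include_bias=False because LinearRegression fits the intercept itself
    X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

    model = LinearRegression().fit(X_poly, y)
    print(model.intercept_, model.coef_)  # roughly 0 and [0, 1]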
61
Q

Variable selection

A

Variable selection involves choosing the set of features to include in the model.
- Variables must often be transformed before they can be included in models. In addition to log and polynomial transformations, this can involve:
  - Encoding: converting non-numeric features to numeric features.
  - Scaling: converting the scale of numeric data so features are comparable.
- The appropriate method of scaling or encoding depends on the type of feature.

62
Q

Types of Features

A

Encoding is often applied to categorical features, which take non-numeric values.
Two primary types:
- Nominal: categorical variables that take values in unordered categories (e.g. Red, Blue, Green; True, False).
- Ordinal: categorical variables that take values in ordered categories (e.g. High, Medium, Low).
63
Q

Feature encoding: Approaches

A

There are several common approaches to encoding variables:
- Binary encoding: converts a variable to 0 or 1; suitable for variables that take two possible values (e.g. True, False).
- One-hot encoding: converts a variable that takes multiple values into a set of binary (0/1) variables, one per category. This creates several new variables.
- Ordinal encoding: converts ordered categories to numerical values, usually by creating one variable that takes integer values, one per category (e.g. 0, 1, 2, 3).
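
A minimal sketch of the three encodings using Pandas and scikit-learn (column names and category order are illustrative):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({"member": [True, False, True],
                       "color": ["Red", "Blue", "Green"],
                       "size": ["Low", "High", "Medium"]})

    # Binary encoding: a two-valued variable becomes 0/1
    df["member"] = df["member"].astype(int)

    # One-hot encoding: one 0/1 column per color category
    df = pd.get_dummies(df, columns=["color"])

    # Ordinal encoding: ordered categories mapped to integers 0, 1, 2
    encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
    df["size"] = encoder.fit_transform(df[["size"]]).ravel()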
64
Q

Feature scaling

A

Feature scaling involves adjusting a variable's scale, which allows comparison of variables with different scales.
- Different continuous (numeric) features often have different scales.

65
Q

Feature Scaling: Approaches

A

There are many approaches to scaling features. Some of the more common approaches include:
- Standard scaling: converts features to standard normal variables (by subtracting the mean and dividing by the standard deviation).
- Min-max scaling: converts variables to continuous variables in the (0, 1) interval by mapping the minimum value to 0 and the maximum to 1. This type of scaling is sensitive to outliers.
- Robust scaling: is similar to min-max scaling, but instead maps the interquartile range (the 75th percentile value minus the 25th percentile value) to (0, 1). This means the scaled variable can take values outside the (0, 1) interval.
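
A minimal sketch of the three scalers in scikit-learn, applied to a single toy feature containing an outlier:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

    X = np.array([[1.0], [2.0], [3.0], [100.0]])

    X_standard = StandardScaler().fit_transform(X)  # zero mean, unit variance
    X_minmax = MinMaxScaler().fit_transform(X)      # mapped into [0, 1]
    X_robust = RobustScaler().fit_transform(X)      # centered on the median, scaled by the IQR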

66
Q

Estimation and Inference

A

Learning Goals
In this section, we will cover:
- Statistical estimation and inference
- Parametric and non-parametric approaches to modeling
- Common statistical distributions
- Frequentist vs. Bayesian statistics

67
Q

Estimation Vs Inference

A

Estimation: the application of an algorithm, for example taking an average.

Inference: putting an accuracy on the estimate (e.g. the standard error of an average).

68
Q

Machine Learning and Statistical Inference

A
- Machine learning and statistical inference are similar (a case of computer science borrowing from a long history in statistics).

- In both cases, we are using data to learn or infer qualities of the distribution that generated the data (often termed the data-generating process).

- We may care either about the whole distribution or just certain features of it (e.g. the mean).

- Machine learning applications that focus on understanding parameters and individual effects draw more on tools from statistical inference (some applications are focused only on the results).

69
Q

Example Customer Churn

A

- Customer churn occurs when a customer leaves a company.
- Data related to churn may include a target variable for whether or not the customer left.
- Features could include:
  - The length of time as a customer
  - The type and amount purchased
  - Other customer characteristics (age, location)
- Churn prediction is often approached by predicting a score for each individual that estimates the probability the customer will leave.
70
Q

Customer Churn: Estimation

A

- Estimation of the factors behind customer churn involves measuring the impact of each feature in predicting churn.
- Inference involves determining whether these measured impacts are statistically significant.
71
Q

Customer Churn: Example Dataset

A

IBM Cognos Customer Churn dataset:
- Data from a fictional telecommunications firm.
- Includes account type, customer characteristics, revenue per customer, a satisfaction score, and an estimate of customer lifetime value.
- Includes information on whether the customer churned (and some categories of churn type).

72
Q

Parametric Vs Non-parametric

A

- If inference is about trying to find out the data-generating process (DGP), then we can say that a statistical model (of the data) is a set of possible distributions or regressions.
- A parametric model is a particular type of statistical model: it is also a set of distributions or regressions, but one described by a finite number of parameters.
- Non-parametric statistics make weaker assumptions: in particular, we don't assume that the data belongs to any particular distribution (also called distribution-free inference).

This doesn't mean that we know nothing, though!

73
Q

Non-Parametric Inference example

A

An example of non-parametric inference is estimating the distribution of the data (its CDF, or cumulative distribution function) using a histogram.

- In this case, we are not specifying any parameters.

74
Q

Parametric models example

A

The normal distribution (parameterized by its mean and variance).

75
Q

Example: Customer Lifetime Value

A

- Customer lifetime value is an estimate of the customer's value to the company.
- Data related to customer lifetime value might include:
  - The expected length of time as a customer
  - The expected amount spent over time
- To estimate lifetime value, we make assumptions about the data.
- These assumptions can be parametric (assuming a specific distribution) or non-parametric.

76
Q

Parametric Models: Maximum Likelihood

A
- The most common way of estimating parameters in a parametric model is through maximum likelihood estimation (MLE).
- The likelihood function is related to probability and is a function of the parameters of the model.
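
A minimal sketch of maximum likelihood estimation for a normal model using SciPy (the simulated data and true parameters are illustrative):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=1000)  # simulated sample

    # norm.fit returns the maximum likelihood estimates of the mean and std
    mu_hat, sigma_hat = norm.fit(data)
    print(mu_hat, sigma_hat)  # close to the true values 5.0 and 2.0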
77
Q

Commonly used Distributions

A

1) Uniform
2) Gaussian/Normal
3) Log-normal
4) Exponential
5) Poisson

78
Q

Frequentist Vs Bayesian Statistics

A
- A frequentist is concerned with repeated observations in the limit.
- In either the frequentist or the Bayesian approach, we might, for example, estimate the probabilities of a certain number of customers arriving over a fixed period of time.
79
Q

Queueing Theory

A

The study of queues (lines) and how many servers are needed to match the size of the queue, e.g. how many cashiers a grocery store needs to check out its customers in a timely fashion.

80
Q

Frequentist Vs Bayesian Statistics

A

- Processes may have true frequencies, but we model probabilities as the long-run frequency over many repeats of an experiment.

81
Q

Frequentist Approach

A

1) Derive the probabilistic properties of a procedure. (There is a fixed value for a given probability in the population of our sample; we derive the estimate directly from the data with no external influence.)

2) Apply the probability directly to the observed data.

82
Q

Bayesian Approach

A

- A Bayesian describes parameters by probability distributions.

- Before seeing any data, a prior distribution (based on the experimenter's beliefs) is formulated.

- This prior distribution is then updated after seeing data (a sample from the distribution).

- After updating, the distribution is called the posterior distribution.

- We use much of the same math and the same formulas in both frequentist and Bayesian statistics.
- The element that differs is the interpretation.
- We will point out the differences in interpretation where appropriate.
83
Q

Hypothesis Testing

A

Learning Goals
- Overview of Hypothesis testing
- Bayesian approach to hypothesis testing
- An example of hypothesis testing involving coin-tossing

84
Q

Hypothesis

A

A hypothesis is a statement about a population parameter, such as the mean of our Poisson distribution: an estimate of the number of people who will join the line in our grocery store example in the next hour.

85
Q

We Create two hypotheses:

A
- The null hypothesis (H0)
- The alternative hypothesis (H1 or Ha)
- We choose which one to call the null depending on how the problem is set up.
86
Q

Hypothesis Testing: Decision Rules

A

A hypothesis testing procedure gives us a rule to decide:
- For which values of the test statistic do we accept H0?
- For which values of the test statistic do we reject H0 and accept H1?

- You may hear some people say that you can reject H0 but never accept H1.
- Here this doesn't matter very much, since we are using hypothesis testing to decide which of two paths to take in the project.
87
Q

Hypothesis Testing: Bayesian Approach

A

- In the Bayesian interpretation (example to follow), we don't get a decision boundary.

- Instead, we get updated (posterior) probabilities.

88
Q

Coin Tossing Example

A

You have two coins:
- Coin 1 has a 70% probability of coming up heads.
- Coin 2 has a 50% probability of coming up heads.

- Pick one coin without looking.
- Toss the coin 10 times and record the number of heads.
- Given the number of heads you see, which of the two coins did you toss?
89
Q

Hypothesis Testing: Bayesian Interpretation

A

In the Bayesian interpretation, we need priors for each hypothesis:
- In this case, we randomly chose the coin to flip.
- P(H1 = we chose coin 1) = 1/2 and P(H2 = we chose coin 2) = 1/2.
- We then update the priors after seeing the data, e.g. 3 heads (Bayes' rule); see the sketch below.

Since we have no way, before seeing the data, to determine which coin was chosen, we simply assign 1/2 to each.
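
A minimal sketch of the update via Bayes' rule, assuming we observed 3 heads in 10 tosses as in the example:

    from scipy.stats import binom

    # Likelihood of 3 heads in 10 tosses under each coin
    like1 = binom.pmf(3, 10, 0.7)  # coin 1: P(heads) = 0.7
    like2 = binom.pmf(3, 10, 0.5)  # coin 2: P(heads) = 0.5

    prior1 = prior2 = 0.5

    # Posterior is proportional to prior times likelihood
    post1 = prior1 * like1 / (prior1 * like1 + prior2 * like2)
    print(post1, 1 - post1)  # about 0.07 vs 0.93: the data favor coin 2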

90
Q

Bayesian Interpretation

A
- The priors are multiplied by the likelihood ratio, which does not depend on the priors.
- The likelihood ratio tells us how we should update the priors in reaction to seeing a given set of data.
91
Q

Type 1 vs Type 2 Error

A

Learning Goals
In this section, we will cover :
- Hypothesis testing terminology including Type-1 and Type-2 errors.
- Examples of Hypothesis tests in Practice.

92
Q

Neyman-Pearson Interpretation

A

The Neyman-Pearson paradigm (1933) is non-Bayesian; it gives an up-or-down decision on H0 vs. H1.

93
Q

Neyman-Pearson Interpretation example

A
- We toss a coin, and our null hypothesis is that we are working with a fair coin, so there is a 50-50 probability it lands on heads.
- The alternative hypothesis is that it is not a fair coin, i.e. the probability of heads is not 50%.
94
Q

Type-1 error (Case study)

A

A Type 1 error in this case is incorrectly rejecting the null: we are indeed working with a fair coin, but given our sample data we mistakenly decide to reject the null hypothesis that the coin is fair.

95
Q

Type 2 error

A

A Type 2 error is incorrectly accepting the null: we are actually working with a biased coin, but given our data we accept (or fail to reject) the hypothesis that the coin is fair.

96
Q

Example customer churn

A

- Customer churn occurs when a customer leaves a company.
- Data related to churn may include a target variable for whether or not the customer left.
- Features could include:
  - The length of time as a customer
  - The type and amount purchased
  - Other customer characteristics
- Churn prediction is often approached by predicting a score for each individual that estimates the probability the customer will leave.
97
Q

Customer Churn: Type 1 VS Type 2 Error

A

- Suppose we use data on customer characteristics to predict who will churn over the next year.
- In our data, customers who have been with the company for longer are less likely to churn.
- This could be due to an underlying effect, or due to chance:
  - A Type 1 error occurs when this effect is due to chance, but we find it to be significant in the model.
  - A Type 2 error occurs when we ascribe the effect to chance, but the effect is non-coincidental.
98
Q

Hypothesis Testing: Terminology

A

The likelihood ratio is used as a test statistic: we use it to decide whether to accept or reject H0.

99
Q

The rejection region

A

is the set of values of the test statistic that lead to rejection of H0.

100
Q

The Acceptance region

A

is the set of values of the test statistic that lead to acceptance of H0.

101
Q

The Null distribution

A

is the test statistic's distribution when the null hypothesis is true.

102
Q

Hypothesis Testing: Marketing Intervention

A

Testing marketing intervention effectiveness:
- For a new direct mail marketing campaign to existing customers, the null hypothesis (H0) is that the campaign does not impact purchasing.
- The alternative hypothesis (H1) is that it has an impact.
103
Q

Hypothesis Testing: Website Layout

A

Testing a change in website layout:
- For a proposed change to a web layout, we may test a null hypothesis (H0) that the change has no impact on traffic.
- Here, we would look for evidence to reject the null in favor of an alternative hypothesis (H1: there is an impact on traffic).

104
Q

Hypothesis Testing: Product Quality/Size

A

Testing whether a product meets an expected size threshold:
- Suppose a product is produced in various factories, with expected size S.
- To confirm that the product size meets the standard within a margin of error, the company might:
  - Randomly sample from each production source.
  - Establish H0 (product size is not significantly different from S) and H1 (there is a significant deviation in product size).
  - Test whether H0 can be rejected in favor of H1 based on the observed mean and standard deviation.
105
Q

Significance level and P-Values

A

Learning Goals
In this section, we will cover:
- Hypothesis testing: significance level and p-values
- Power and sample size considerations.

106
Q

Significance Level and P-Values

A

- We know the distribution of the test statistic when the null hypothesis is true (the null distribution).
- To get a rejection region, we calculate the test statistic.
- We choose, before looking at the data, the level at which we will reject the null hypothesis.

107
Q

Significance Level and P-Values

A

A significance level (α) is a probability threshold below which the null hypothesis will be rejected.
- We must choose α before computing the test statistic! If we don't, we might be accused of p-hacking.
- Choosing α is somewhat arbitrary, but it is often 0.01 or 0.05.

108
Q

P-Value

A

The p-value is the probability, under the null distribution, of a result as extreme as or more extreme than what was actually observed. It is the smallest significance level at which the null hypothesis would be rejected.

109
Q

The Confidence interval

A

The set of values of the statistic for which we accept the null hypothesis.

110
Q

F-Statistic

A

- H0: the data can be modeled by setting all betas (coefficients) to zero.
- Reject the null if the p-value is small enough.

111
Q

Power and Sample size

A

- If you do many 5% significance tests looking for a significant result, the chances of making at least one Type-1 error increase.
- The probability of at least one Type-1 error is approximately 1 - (1 - 0.05)^(#tests).
- This is roughly 0.05 × (#tests) if you have 10 or fewer tests.
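
A minimal sketch of the calculation for 10 tests at the 5% level:

    n_tests = 10
    alpha = 0.05

    exact = 1 - (1 - alpha) ** n_tests  # about 0.40
    rough = alpha * n_tests             # rule-of-thumb approximation: 0.50
    print(exact, rough)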
112
Q

Power: Bonferroni Correction

A

- The Bonferroni correction says: choose the p-value threshold so that the probability of making a Type-1 error (assuming no effect) is 5%.
- Typically choose: p threshold = 0.05 / (#tests).
- The Bonferroni correction allows the probability of a Type-1 error to be controlled, but at the cost of power.
- Effects either need to be larger, or the tests need larger samples, to be detected.
- Best practice is to limit the number of comparisons to a few well-motivated cases.

113
Q

Here’s a summary of the key concepts in hypothesis testing:

A

-Hypothesis testing is a statistical method to determine if a claim (hypothesis) about a population is supported by sample data.

-The significance level (alpha) is a threshold used to determine if the evidence is strong enough to reject the null hypothesis. A common value for alpha is 0.05.

-The p-value is the probability of obtaining a result as extreme as, or more extreme than, the observed result, assuming the null hypothesis is true. A p-value less than or equal to alpha means we reject the null hypothesis.

-Power is the probability of correctly rejecting the null hypothesis when it is false. A higher power means we are more likely to detect a true effect.

-Sample size considerations are important because they can affect the significance level and power of a hypothesis test. A larger sample size generally leads to a higher power to detect a true effect, while a small sample size may increase the chance of a Type II error (failing to reject the null hypothesis when it is false).

115
Q

Overview of Hypothesis Testing:

A

Hypothesis testing is a statistical method used to determine if a claim (hypothesis) about a population is supported by sample data.
It involves formulating a null hypothesis (the claim being tested) and an alternative hypothesis (the claim we want to support).
We then collect data and use statistical tests to determine if the data provides enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
Bayesian Approach to Hypothesis Testing:

The Bayesian approach to hypothesis testing involves calculating the probability of the hypothesis being true given the observed data, rather than calculating the probability of the data given the hypothesis (as in traditional hypothesis testing).
It involves using prior knowledge or beliefs about the hypothesis, updating these beliefs based on the observed data, and calculating the posterior probability of the hypothesis.
Example of Hypothesis Testing Involving Coin-Tossing:

Suppose we want to test the hypothesis that a coin is fair (i.e., has a 50-50 chance of landing heads or tails).
We can formulate the null hypothesis as “the coin is fair” and the alternative hypothesis as “the coin is not fair”.
We then toss the coin a certain number of times and record the number of heads and tails.
We can use a statistical test (such as the chi-square test) to determine if the observed data provides enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
For example, if we toss the coin 100 times and get 60 heads and 40 tails, we can calculate a p-value (the probability of getting a result as extreme or more extreme than the observed result, assuming the null hypothesis is true) and compare it to the significance level (the threshold for rejecting the null hypothesis). If the p-value is less than or equal to the significance level (often set at 0.05), we reject the null hypothesis and conclude that the coin is not fair.

116
Q

Correlation Vs Causation

A

Learning Goals
In this section, we will cover:
- Correlation vs. causation
- Confounding variables
- Examples of spurious correlations.

117
Q

Does it rain more on cooler days?

A

- We associate rain with cold weather.
- Does it actually rain more when days are cooler?
- Maybe it depends on where you are:
  - Some places have summer monsoons, so maybe as it gets warmer there, it rains more.
  - Warmer weather increases evaporation, which can increase humidity. In warm weather, there is more water in the air to form precipitation. This mechanism would suggest warmer weather means more rain.
  - Cooler weather decreases the dew point (i.e. the air can hold less water). This suggests that if humid air enters and cools, it will turn into rain. This mechanism would suggest cooler weather means more rain.
118
Q

How are correlations important?

A

- If two variables X and Y are correlated, then X is useful for predicting Y.
- If we are trying to model Y, and we find things that correlate with Y, we may improve the modeling.
- We should be careful about changing X with the hope of changing Y.

X and Y can be correlated for different reasons:
- X causes Y (what we want), e.g. our marketing budget successfully leading to higher revenue.
- Y causes X (mixing up cause and effect).
- X and Y are both caused by something else (confounding).
- X and Y are not related; we just got lucky in the sample (spurious).

Note: confounding correlation is actually a subset of spurious correlation. We can think of a spurious correlation as any time two values really aren't related at all: maybe there is a confounding variable, or maybe it is just random that marketing spend and revenue both go up at the same time.

119
Q

Mixing up Cause and Effect

A

1) Student test scores are positively correlated with the amount of time studied.

This does not mean we should get students to study more by curving everyone's grades upward. It is more likely that studying helps students learn the material, so studying causes better performance.

2) Customer satisfaction is negatively correlated with customer service call volume.

This doesn't mean we should remove or hide the customer service numbers in the hope of improving customer satisfaction.

120
Q

Confounding Variables

A

- A confounding variable is something that causes both X and Y to change.

- With confounding, X and Y are correlated even though X doesn't cause Y and Y doesn't cause X.

121
Q

Example of Confounding Variable

A

1) The number of annual car accidents and the number of people named John are positively correlated (both are correlated with the population size).

2) The amount of ice cream sold and the number of drownings in a week are positively correlated (both are positively correlated with temperature).

3) The number of factories a chip manufacturer owns and the number of chips sold are positively correlated (but both are driven by demand from the market).