DS interview questions Flashcards
Sources: https://www.edureka.co/blog/interview-questions/data-science-interview-questions/ and https://towardsdatascience.com/over-100-data-scientist-interview-questions-and-answers-c5a66186769a
50 small DT vs 1 big one
“Is a random forest a better model than a decision tree?”
- Yes, because a random forest is an ensemble method that combines many weak decision trees into one strong learner.
- It is generally more accurate, more robust, and less prone to overfitting than a single decision tree.
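A minimal sketch of this comparison with scikit-learn (the dataset here is just a stand-in for illustration, not from the source):

```python
# Compare a single decision tree to a random forest via cross-validation
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("Decision tree CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("Random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```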
Administrative datasets vs. experimental study datasets
Administrative datasets are typically
- datasets used by governments or other organizations for non-statistical reasons.
- Usually larger and more cost-efficient than experimental studies.
- Regularly updated assuming that the organization associated with the administrative dataset is active and functioning.
- May not capture all of the data that one may want and may not be in the desired format either.
- It is also prone to quality issues and missing entries.
What is A/B testing?
A/B testing is a form of two-sample hypothesis testing used to compare two versions of a single variable, the control and the variant. It is commonly used to improve and optimize user experience and marketing.
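As an illustration (the conversion counts below are made up, not from the source), a two-proportion z-test is one common way to compare a control and a variant:

```python
# Hypothetical A/B test: compare conversion rates of control (A) and variant (B)
from statsmodels.stats.proportion import proportions_ztest

conversions = [200, 240]   # converted users in A and B (made-up numbers)
visitors = [5000, 5000]    # total users shown each version

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the difference in conversion rate is unlikely to be due to chance.
```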
How would you analyze a dataset?
Use Exploratory Data Analysis (EDA) to clean, explore, and understand the data.
For example, plot a histogram of the duration of calls to see the underlying distribution.
What is bias-variance trade-off?
Bias: Bias is an error introduced in your model due to oversimplification of the machine learning algorithm. It can lead to underfitting. A high-bias model makes simplified assumptions about the target function, which makes it easier to learn but less able to capture the true structure of the data.
- Low bias machine learning algorithms — Decision Trees, k-NN and SVM
- High bias machine learning algorithms — Linear Regression, Logistic Regression
Variance: Variance is an error introduced in your model due to an overly complex machine learning algorithm; the model also learns noise from the training data set and performs badly on the test data set. It can lead to high sensitivity and overfitting.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens until a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.
Bias-Variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
- The k-nearest neighbour algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbours that contribute to the prediction and in turn increases the bias of the model.
- The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by tuning the C parameter, which controls how heavily margin violations in the training data are penalized; allowing more violations (a smaller penalty) increases the bias but decreases the variance.
There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease bias.
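A quick illustration of the k-NN side of this trade-off (a sketch on a stand-in dataset, not from the source): very small k fits the training data almost perfectly (high variance), while large k smooths predictions and raises bias.

```python
# Train vs. test accuracy of k-NN as k grows (stand-in dataset, illustrative only)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, round(knn.score(X_train, y_train), 3), round(knn.score(X_test, y_test), 3))
# k=1 typically scores ~1.0 on the training data (high variance); very large k underfits (high bias).
```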
How do you control for biases?
Two common approaches are randomization, where participants are assigned to conditions by chance, and random sampling, in which each member of the population has an equal probability of being chosen.
When you sample, what bias are you inflicting?
- Sampling bias: a biased sample caused by non-random sampling
- Undercoverage bias: when some members of the population are inadequately represented in the sample
- Survivorship bias: error of overlooking observations that did not make it past a form of selection process.
- Selection Bias - sample obtained is not representative of the population intended to be analysed.
Unbalanced Binary Classification
- First, you want to reconsider the metrics that you’d use to evaluate your model. The accuracy of your model might not be the best metric to look at, and I’ll use an example to explain why. Let’s say 99 bank withdrawals were not fraudulent and 1 withdrawal was. If your model simply classified every instance as “not fraudulent”, it would have an accuracy of 99%! Therefore, you may want to consider using metrics like precision and recall.
- Another method to improve unbalanced binary classification is to increase the cost of misclassifying the minority class. By increasing that penalty, the model should classify the minority class more accurately.
- Lastly, you can improve the balance of classes by oversampling the minority class or by undersampling the majority class.
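A minimal sketch of two of these ideas, class weights and non-accuracy metrics (the synthetic data and parameters are stand-ins, not from the source):

```python
# Handling class imbalance with class weights and evaluation beyond accuracy
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 99% negative, 1% positive), purely for illustration
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" increases the penalty for misclassifying the minority class
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # report precision/recall, not just accuracy
```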
Boosting
Boosting is an ensemble method to improve a model by reducing its bias and variance, ultimately converting weak learners to strong learners.
The general idea is to train a weak learner and sequentially iterate and improve the model by learning from the previous learner.
- AdaBoost - reweights the training samples so that each new weak learner focuses on the examples the previous learners misclassified
- Gradient Boosting - fits each new learner to the residual errors (the gradient of the loss) of the current ensemble
- XGBoost - an optimized, regularized implementation of gradient boosting that parallelizes tree construction for speed (see the sketch below)
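A sketch comparing two boosting implementations available in scikit-learn (the dataset is a stand-in for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(n_estimators=200, random_state=0)          # reweights misclassified samples
gbm = GradientBoostingClassifier(n_estimators=200, random_state=0)  # fits each tree to the residuals

print("AdaBoost :", cross_val_score(ada, X, y, cv=5).mean())
print("GradBoost:", cross_val_score(gbm, X, y, cv=5).mean())
# XGBoost (the separate `xgboost` package) exposes a similar fit/predict interface.
```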
Boxplot vs Histogram
Boxplots and histograms are visualizations used to show the distribution of the data.
Histograms
- bar-chart-style plots that show the frequency of a numerical variable’s values and are used to approximate the probability distribution of the given variable.
- It allows you to quickly understand the shape of the distribution, the variation, and potential outliers.
Boxplots
- you can gather other information like the quartiles, the range, and outliers.
- useful when you want to compare multiple charts at the same time because they take up less space than histograms.
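A minimal sketch showing the same (made-up, skewed) data both ways:

```python
# The same data as a histogram and as a boxplot (synthetic data for illustration)
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(0).exponential(scale=2.0, size=1000)  # made-up skewed data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(data, bins=30)        # shape of the distribution
ax1.set_title("Histogram")
ax2.boxplot(data, vert=False)  # median, quartiles, and outliers at a glance
ax2.set_title("Boxplot")
plt.show()
```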
Central Limit Theorem
CLT - sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger no matter what the shape of the population distribution.
The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals.
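A small simulation sketch of the CLT (synthetic data, for illustration only): means of samples drawn from a clearly non-normal population still look approximately normal.

```python
# Sample means of a skewed (exponential) population look approximately normal
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=1.0, size=100_000)               # clearly non-normal population
sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]

plt.hist(sample_means, bins=40)   # roughly bell-shaped around the population mean (1.0)
plt.title("Distribution of sample means (n = 50)")
plt.show()
```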
What is Cluster Sampling?
Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
For example, a researcher wants to survey the academic performance of high school students in Japan. He can divide the entire population of Japan into different clusters (cities). Then the researcher selects a number of clusters, depending on his research, through simple or systematic random sampling.
Collinearity / Multi-collinearity
Multicollinearity exists when an independent variable is highly correlated with another independent variable in a multiple regression equation. This can be problematic because it undermines the statistical significance of an independent variable.
You could use the Variance Inflation Factors (VIF) to determine if there is any multicollinearity between independent variables — a standard benchmark is that if the VIF is greater than 5 then multicollinearity exists.
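A sketch of computing VIFs with statsmodels (the small DataFrame below is made up purely for illustration):

```python
# Compute VIF for each predictor with statsmodels
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Made-up predictors; x2 is nearly a multiple of x1, so its VIF will be large
X = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6],
                  "x2": [2, 4, 6, 8, 10, 13],
                  "x3": [5, 3, 6, 2, 7, 4]})
X = add_constant(X)  # include an intercept term when computing VIF

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # a VIF greater than 5 suggests multicollinearity
```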
In a study of emergency room waiting times, investigators consider a new and the standard triage systems. To test the systems, administrators selected 20 nights and randomly assigned the new triage system to be used on 10 nights and the standard system on the remaining 10 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 3 hours with a variance of 0.60 while the average MWT for the old system was 5 hours with a variance of 0.68. Consider the 95% confidence interval estimate for the differences of the mean MWT associated with the new system. Assume a constant variance. What is the interval? Subtract in this order (New System — Old System).
Confidence interval = mean difference +/- t-score * standard error (pooled, two independent samples)
mean difference = new mean - old mean = 3 - 5 = -2
t-score = 2.101, given df = 18 (10 + 10 - 2) and a confidence level of 95%
standard error = sqrt((0.60*9 + 0.68*9)/(10 + 10 - 2)) * sqrt(1/10 + 1/10) ≈ 0.358
confidence interval = -2 +/- 2.101*0.358 = [-2.75, -1.25]
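The same calculation in a few lines of Python (SciPy used here only to look up the t quantile):

```python
# Worked calculation for the pooled two-sample t confidence interval above
from math import sqrt
from scipy import stats

n_new, n_old = 10, 10
mean_diff = 3 - 5                                            # new - old
pooled_var = (0.60 * 9 + 0.68 * 9) / (n_new + n_old - 2)     # 0.60 and 0.68 are given as variances
se = sqrt(pooled_var) * sqrt(1 / n_new + 1 / n_old)
t = stats.t.ppf(0.975, df=n_new + n_old - 2)                 # ≈ 2.101
print(mean_diff - t * se, mean_diff + t * se)                # ≈ (-2.75, -1.25)
```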
To further test the hospital triage system, administrators selected 200 nights and randomly assigned a new triage system to be used on 100 nights and a standard system on the remaining 100 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 4 hours with a standard deviation of 0.5 hours while the average MWT for the old system was 6 hours with a standard deviation of 2 hours. Consider the hypothesis of a decrease in the mean MWT associated with the new treatment. What does the 95% independent group confidence interval with unequal variances suggest vis a vis this hypothesis? (Because there’s so many observations per group, just use the Z quantile instead of the T.)
Assuming we subtract in this order (New System - Old System) and use the confidence interval formula for two independent samples:
mean difference = new mean - old mean = 4 - 6 = -2
z-score = 1.96 for a confidence level of 95%
standard error = sqrt((0.5²*99 + 2²*99)/(100 + 100 - 2)) * sqrt(1/100 + 1/100) ≈ 0.206
lower bound = -2 - 1.96*0.206 ≈ -2.40
upper bound = -2 + 1.96*0.206 ≈ -1.60
confidence interval ≈ [-2.40, -1.60]
Because the entire interval lies below 0, it suggests that the new triage system does decrease the mean MWT.
You are running for office and your pollster polled 100 people. Sixty of them claimed they will vote for you. Can you relax?
- Assume that there’s only you and one other opponent.
- Also, assume that we want a 95% confidence interval. This gives us a z-score of 1.96.
p-hat = 60/100 = 0.6
z* = 1.96
n = 100
This gives us a confidence interval of [50.4%, 69.6%]. Therefore, given a 95% confidence level, if you are okay with the worst-case scenario of (roughly) tying, then you can relax. Otherwise, you cannot relax until 61 out of 100 claimed they would vote for you.
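A quick check of this interval using the normal approximation for a proportion:

```python
# 95% confidence interval for the polled proportion
from math import sqrt

p_hat, n, z = 0.6, 100, 1.96
se = sqrt(p_hat * (1 - p_hat) / n)                     # standard error of the proportion
lower, upper = p_hat - z * se, p_hat + z * se
print(round(lower * 100, 1), round(upper * 100, 1))    # ≈ 50.4, 69.6 (percent)
```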
In a population of interest, a sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is a 95% Student’s T confidence interval for the mean brain volume in this new population?
Given a confidence level of 95% and degrees of freedom equal to 8, the t-score = 2.306
Confidence interval = 1100 +/- 2.306*(30/3)
Confidence interval = [1076.94, 1123.06]
A diet pill is given to 9 subjects over six weeks. The average difference in weight (follow up — baseline) is -2 pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the 95% T confidence interval to touch 0?
Upper bound = mean + t-score*(standard deviation/sqrt(sample size))
0 = -2 + 2.306*(s/3)
2 = 2.306 * s / 3
s = 2.601903
Therefore the standard deviation would have to be at least approximately 2.60 for the upper bound of the 95% T confidence interval to touch 0.
What is the difference between Point Estimates and Confidence Interval?
Point Estimation gives us a particular value as an estimate of a population parameter. Method of Moments and Maximum Likelihood estimator methods are used to derive Point Estimators for population parameters.
A confidence interval gives us a range of values which is likely to contain the population parameter. The confidence interval is generally preferred, as it tells us how likely this interval is to contain the population parameter. This likelihood or probability is called the confidence level or confidence coefficient and is represented by 1 - alpha, where alpha is the level of significance.
What are confounding variables?
A confounding variable, or a confounder, is
- a variable that influences both the dependent variable and the independent variable, causing a spurious association,
- a mathematical relationship in which two or more variables are associated but not causally related.
What is a confusion matrix?
The confusion matrix is a 2X2 table that contains 4 outputs provided by the binary classifier. Various measures, such as error-rate, accuracy, specificity, sensitivity, precision and recall are derived from it.
A binary classifier predicts all data instances of a test data set as either positive or negative. This produces four outcomes:
- True-positive(TP) — Correct positive prediction
- False-positive(FP) — Incorrect positive prediction
- True-negative(TN) — Correct negative prediction
- False-negative(FN) — Incorrect negative prediction
Basic measures derived from the confusion matrix
- Error Rate = (FP+FN)/(P+N)
- Accuracy = (TP+TN)/(P+N)
- Sensitivity(Recall or True positive rate) = TP/P = TP/(TP+FN)
- Specificity(True negative rate) = TN/N = TN/(TN+FP)
- Precision(Positive predicted value) = TP/(TP+FP)
- F-Score (weighted harmonic mean of precision and recall) = (1+b²)(Precision*Recall)/(b²*Precision + Recall), where b is commonly 0.5, 1, or 2.
F1-Score = (2 * Precision * Recall) / (Precision + Recall)
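A sketch of these measures with scikit-learn (the labels below are made up for illustration):

```python
# Confusion matrix and derived metrics
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```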
Convex vs Non-Convex cost function
A convex function is one where a line drawn between any two points on the graph lies on or above the graph. It has one minimum.
A non-convex function is one where a line drawn between two points on the graph may fall below the graph. It is often characterized as “wavy” and can have multiple local minima.
When a cost function is non-convex, it means that there’s a likelihood that the function may find local minima instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.
What is correlation and covariance in statistics?
Both Correlation and Covariance establish the relationship and also measure the dependency between two random variables.
Correlation: Correlation measures how strongly two variables are related. It is a normalized, unitless measure of the relationship, always between -1 and 1, which makes it easy to compare across pairs of variables.
Covariance: Covariance measures the extent to which two random variables change together, i.e. whether increases in one variable are accompanied by increases (or decreases) in the other. Its value depends on the scales of the variables, so it indicates the direction of the linear relationship but not its strength.
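A minimal sketch of the difference (made-up numbers):

```python
# Covariance vs. correlation for two made-up variables
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print("Covariance :", np.cov(x, y)[0, 1])       # scale-dependent: changes if the units change
print("Correlation:", np.corrcoef(x, y)[0, 1])  # unitless, always in [-1, 1]
```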
Cross-Validation
Cross-validation is essentially a technique used to assess how well a model performs on a new independent dataset. The simplest example of cross-validation is when you split your data into two groups: training data and testing data, where you use the training data to build the model and the testing data to test the model.
The goal of cross-validation is to limit problems like overfitting and get an insight on how the model will generalize to an independent data set.
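A minimal k-fold cross-validation sketch with scikit-learn (stand-in dataset and model, for illustration only):

```python
# 5-fold cross-validation
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

scores = cross_val_score(model, X, y, cv=5)   # train/test on 5 different splits
print(scores, scores.mean())
```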
How does data cleaning play a vital role in analysis?
Data cleaning can help in analysis because:
- Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
- Data Cleaning helps to increase the accuracy of the model in machine learning.
- It is a cumbersome process because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data generated by these sources.
- Data cleaning might take up to 80% of the total analysis time, making it a critical part of the analysis task.
Data Wrangling / Cleaning
- Data profiling: Almost everyone starts off by getting an understanding of their dataset. More specifically, you can look at the shape of the dataset with .shape and a description of your numerical variables with .describe().
- Data visualizations: Sometimes, it’s useful to visualize your data with histograms, boxplots, and scatterplots to better understand the relationships between variables and also to identify potential outliers.
- Syntax error: This includes making sure there’s no white space, making sure letter casing is consistent, and checking for typos. You can check for typos by using .unique() or by using bar graphs.
- Standardization or normalization: Depending on the dataset you’re working with and the machine learning method you decide to use, it may be useful to standardize or normalize your data so that the different scales of different variables don’t negatively impact the performance of your model.
- Handling null values: There are a number of ways to handle null values, including deleting rows with null values altogether, replacing null values with the mean/median/mode, replacing null values with a new category (e.g. unknown), predicting the values, or using machine learning models that can deal with null values.
- Other things include: removing irrelevant data, removing duplicates, and type conversion.
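A sketch of a few of these steps in pandas (the DataFrame and column names below are hypothetical):

```python
# A handful of common cleaning steps on a tiny made-up DataFrame
import pandas as pd

df = pd.DataFrame({"city": [" NYC", "nyc", "Boston", None],
                   "price": [100.0, 95.0, None, 80.0]})

print(df.shape)                                           # data profiling
print(df.describe())                                      # summary of numerical variables
df["city"] = df["city"].str.strip().str.lower()           # whitespace / letter-casing fixes
print(df["city"].unique())                                # spot typos and inconsistent categories
df["price"] = df["price"].fillna(df["price"].median())    # one way to handle null values
df = df.drop_duplicates()                                 # remove duplicate rows
```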
Decision Tree
Decision trees
- a popular model, used in operations research, strategic planning, and machine learning.
- Each split point in the tree is called a node, and (generally) the more nodes you have, the more accurate your decision tree will be. The last nodes of the decision tree, where a decision is made, are called the leaves of the tree.
- Decision trees are intuitive and easy to build but fall short when it comes to accuracy.
Dimension Reduction
Dimensionality reduction is the process of reducing the number of features in a dataset. This is important mainly in the case when you want to reduce variance in your model (overfitting).
Four advantages of dimensionality reduction:
- It reduces the time and storage space required
- Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model
- It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D
- It avoids the curse of dimensionality
Dimensionality Reduction before fitting
When the number of features is greater than the number of observations, performing dimensionality reduction first will generally improve the performance of an SVM.
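A sketch of dimensionality reduction before fitting, using PCA in a scikit-learn pipeline (stand-in dataset and parameters, for illustration only):

```python
# Reduce dimensionality with PCA before fitting a classifier
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), PCA(n_components=10), SVC())
print(cross_val_score(model, X, y, cv=5).mean())
```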
What are Eigenvectors and Eigenvalues?
Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.
An eigenvalue can be thought of as the strength of the transformation in the direction of its eigenvector, or the factor by which that direction is stretched or compressed.
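A minimal NumPy sketch (the matrix is made up for illustration):

```python
# Eigenvectors/eigenvalues of a small symmetric matrix
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
values, vectors = np.linalg.eig(A)
print(values)    # eigenvalues: how strongly each direction is stretched
print(vectors)   # eigenvectors (columns): directions that are only scaled, not rotated
```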
Experimental data vs Observational data
Observational data comes from observational studies, in which you observe certain variables and try to determine whether there is any correlation.
Experimental data comes from experimental studies, in which you control certain variables and hold them constant to determine whether there is any causality.
- An example of experimental design is the following: split a group up into two. The control group lives their lives normally. The test group is told to drink a glass of wine every night for 30 days. Then research can be conducted to see how wine affects sleep.
False Positive vs False Negative
False positive is an incorrect identification of the presence of a condition when it’s absent.
- predict positive when actual value is negative
- e.g. spam detection (a legitimate email incorrectly flagged as spam)
False negative is an incorrect identification of the absence of a condition when it’s actually present.
- predict negative when actual value is positive
- screening for cancer.
This is a subjective argument, but false positives can be worse than false negatives from a psychological point of view. For example, a false positive for winning the lottery could be a worse outcome than a false negative because people normally don’t expect to win the lottery anyways.
Suppose that diastolic blood pressures (DBPs) for men aged 35–44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. About what is the probability that a random 35–44 year old has a DBP less than 70?
Since 70 is one standard deviation below the mean, the answer is the area under the normal curve to the left of one standard deviation below the mean.
By the empirical rule, that area is approximately 2.3% + 13.6% = 15.9% (about 16%).
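A quick check of that value (SciPy used purely as an illustration):

```python
# P(DBP < 70) when DBP ~ N(80, 10)
from scipy import stats

print(stats.norm.cdf(70, loc=80, scale=10))   # ≈ 0.159
```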
How do you prove that males are on average taller than females, given only gender and height data?
- Null hypothesis: males and females are the same height on average
- Alternative hypothesis: the average height of males is greater than the average height of females.
- Collect a random sample of heights of males and females.
- Use a two-sample t-test (one-tailed) to determine whether you can reject the null hypothesis.
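A sketch of that test on made-up height samples (assumes a recent SciPy, which supports the `alternative` argument):

```python
# One-sided two-sample t-test on made-up height samples (cm)
from scipy import stats

males = [178, 182, 175, 181, 177, 185, 179, 174]
females = [165, 170, 162, 168, 171, 166, 169, 164]

t_stat, p_value = stats.ttest_ind(males, females, alternative="greater")
print(t_stat, p_value)   # a small p-value -> reject the null that the means are equal
```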
Kernel
- A kernel is a way of computing the dot product of two vectors x and y in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called a “generalized dot product” [2]
- The kernel trick is a method of using a linear classifier to solve a non-linear problem by transforming linearly inseparable data to linearly separable ones in a higher dimension.
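A minimal sketch of the kernel trick in action (synthetic circular data, purely for illustration): a linear SVM fails, an RBF-kernel SVM separates the classes.

```python
# Linear vs. RBF kernel on data that is not linearly separable
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
print("Linear kernel:", cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean())
print("RBF kernel   :", cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())
```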
What is the Law of Large Numbers?
The Law of Large Numbers is a theorem which states that as the number of trials increases, the average of the results becomes closer to the expected value.
It says that the sample mean, the sample variance, and the sample standard deviation converge to the quantities they are trying to estimate.
E.g. the proportion of heads in 100,000 flips of a fair coin should be closer to 0.5 than the proportion in 100 flips.
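A small simulation sketch of that coin-flip example (made-up seed, for illustration):

```python
# The running proportion of heads converges to 0.5 as the number of flips grows
import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100_000)   # 0 = tails, 1 = heads
for n in (100, 1_000, 100_000):
    print(n, flips[:n].mean())             # gets closer to 0.5 as n increases
```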
What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
Lift: lift is a measure of the performance of a targeting model measured against a random choice targeting model; in other words, lift tells you how much better your model is at predicting things than if you had no model.
KPI: stands for Key Performance Indicator, which is a measurable metric used to determine how well a company is achieving its business objectives. Eg. error rate.
Robustness: generally robustness refers to a system’s ability to handle variability and remain effective.
Model fitting: refers to how well a model fits a set of observations.
Design of experiments: also known as DOE, it is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect that variation. [4] In essence, an experiment aims to predict an outcome based on a change in one or more inputs (independent variables).
80/20 rule: also known as the Pareto principle; states that 80% of the effects come from 20% of the causes. Eg. 80% of sales come from 20% of customers.
Linear Model
Linear Model assumptions:
- Linear relationship - the relationship between the independent and dependent variables must be linear. The linearity assumption can best be tested with scatter plots and a fitted regression line.
- Multivariate normality - the variables (and in particular the residuals) should be approximately normally distributed; check with a histogram (bell curve) or a Q-Q plot.
- No or little multicollinearity - multicollinearity occurs when the independent variables are too highly correlated with each other. The simplest way to address the problem is to remove independent variables with high VIF values. Multicollinearity may be tested with three central criteria:
  - Correlation matrix - pairwise Pearson correlations between the independent variables should not be excessively high.
  - Tolerance - measures the influence of one independent variable on all other independent variables; the tolerance is calculated with an initial linear regression analysis.
  - Variance Inflation Factor (VIF) - VIF > 5 indicates that multicollinearity may be present; VIF > 10 indicates that multicollinearity is almost certainly present among the variables.
- No autocorrelation - autocorrelation occurs when the residuals are not independent of each other; use the Durbin-Watson test to check for it.
- Homoscedasticity - a scatter plot of residuals vs. fitted values is a good way to check whether the data are homoscedastic (meaning the residuals have equal variance across the regression line).
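A sketch of a couple of these checks with statsmodels (the data below is simulated purely for illustration):

```python
# Fit an OLS model and run a few residual-based diagnostics
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())                 # coefficients, R^2, etc.
print(durbin_watson(model.resid))      # a value near 2 suggests little autocorrelation
# Plotting model.resid against the fitted values is a quick visual check of homoscedasticity.
```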
Linear Regression assumptions
Same as Linear Model assumptions:
- The sample data used to fit the model is representative of the population
- The relationship between X and the mean of Y is linear
- The variance of the residual is the same for any value of X (homoscedasticity)
- Observations are independent of each other
- For any value of X, Y is normally distributed.
- Extreme violations of these assumptions will make the results meaningless.
- Small violations of these assumptions will result in a greater bias or variance of the estimate.
What is the difference between “long” and “wide” format data?
- In the wide-format, a subject’s repeated responses will be in a single row, and each response is in a separate column.
- In the long-format, each row is a one-time point per subject. You can recognize data in wide format by the fact that columns generally represent groups.
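A minimal pandas sketch of converting between the two formats (hypothetical column names):

```python
# Wide <-> long conversion
import pandas as pd

wide = pd.DataFrame({"subject": [1, 2], "week1": [5.0, 6.1], "week2": [5.4, 6.0]})

long = wide.melt(id_vars="subject", var_name="week", value_name="score")    # wide -> long
back_to_wide = long.pivot(index="subject", columns="week", values="score")  # long -> wide
print(long)
print(back_to_wide)
```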
Long-tailed Distribution
A long-tailed distribution is a type of heavy-tailed distribution that has a tail (or tails) that drop off gradually and asymptotically.
It’s important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values make up the majority of the population. This can ultimately change the way that you deal with outliers, and it also conflicts with some machine learning techniques that assume the data is normally distributed.
3 practical examples include the power law, the Pareto principle (more commonly known as the 80–20 rule), and product sales (i.e. best selling products vs others).
Markov Chains
A Markov chain is a mathematical system that experiences transitions from one state to another according to certain probabilistic rules. The defining characteristic of a Markov chain is that no matter how the process arrived at its present state, the possible future states are fixed. In other words, the probability of transitioning to any particular state is dependent solely on the current state and time elapsed.
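A minimal simulation sketch (the two-state weather chain and its transition probabilities are made up for illustration):

```python
# Simulate a two-state Markov chain ("sunny"/"rainy") from a transition matrix
import numpy as np

states = ["sunny", "rainy"]
P = np.array([[0.9, 0.1],    # P(next state | current = sunny)
              [0.5, 0.5]])   # P(next state | current = rainy)

rng = np.random.default_rng(0)
state = 0
for _ in range(10):
    state = rng.choice(2, p=P[state])   # the next state depends only on the current state
    print(states[state], end=" ")
```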
Mean Imputation
Mean imputation is the practice of replacing null values in a data set with the mean of the data.
- Mean imputation is generally bad practice because it doesn’t take feature correlation into account. For example, imagine we have a table showing age and fitness score, and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score from an age range of 15 to 80, then the eighty-year-old will appear to have a much higher fitness score than he actually should.
- Second, mean imputation reduces the variance of the data and increases bias in our data. This leads to a less accurate model and a narrower confidence interval due to a smaller variance.
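A minimal sketch of mean imputation in pandas (the tiny table echoes the age/fitness example above and is made up for illustration):

```python
# Mean imputation on a made-up column with a missing value
import pandas as pd

df = pd.DataFrame({"age": [15, 30, 45, 80], "fitness": [90.0, 75.0, 60.0, None]})
df["fitness_imputed"] = df["fitness"].fillna(df["fitness"].mean())
print(df)   # the 80-year-old gets the overall mean (75), likely an overestimate
```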