DS Interview Qs Flashcards
List the differences between supervised and unsupervised learning
Supervised Learning: Uses known and labeled data as input, has a feedback mechanism; the most common algorithms are decision trees, logistic regression, and support vector machines
Unsupervised Learning: Uses unlabeled data as input, has no feedback mechanism; the most common algorithms are k-means clustering, hierarchical clustering, and the apriori algorithm
How is logistic regression done?
Logistic regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function (the sigmoid)
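A minimal sketch of the idea in Python (the weights, bias, and feature values are made-up examples):

import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.8, -0.4]), 0.1   # illustrative weights and bias
x = np.array([2.0, 1.0])            # illustrative feature vector
print(sigmoid(w @ x + b))           # estimated P(y = 1 | x)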
Explain the steps in making a decision tree
- Take the entire dataset as input
- Calculate entropy of target variable as well as predictor attributes
- Calculate information gain of all attributes
- Choose the attribute with highest information gain as the root node
- Repeat the process on every branch until the decision node of each branch is finalized (see the entropy sketch below)
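A minimal sketch of steps 2 and 3 in Python, assuming a NumPy array of labels and a single categorical feature:

import numpy as np

def entropy(labels):
    # Shannon entropy (base 2) of a label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    # Entropy of the target minus the weighted entropy after splitting on the feature
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

y = np.array([0, 0, 1, 1, 1])
x = np.array(["a", "a", "b", "b", "b"])
print(information_gain(y, x))   # a perfect split: the gain equals entropy(y)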
How do you build a random forest model?
- Randomly select k features from total m features where k < m
- Among those k features, find the best split point and use it to create node d
- Split the node into daughter nodes using the best split
- Repeat steps 2 and 3 until leaf nodes are finalized
- Build the forest by repeating steps 1 to 4 n times to create n trees (see the scikit-learn sketch below)
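In practice these steps are handled internally by a library; a hedged sketch with scikit-learn's RandomForestClassifier (the dataset and parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = number of trees (n); max_features = k features tried per split
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))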
How can you avoid overfitting your model?
Three main methods to avoid overfitting:
- Keep the model simple, take into account fewer variables, thereby removing some of the noise in the training data
- Use cross-validation techniques such as k-fold cross-validation (applied before/during training to estimate generalization)
- Use regularization techniques such as LASSO that penalize certain model parameters if they're likely to cause overfitting (applied during training; a sketch of both follows)
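A short sketch of the last two techniques with scikit-learn (the dataset and alpha value are illustrative assumptions):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# LASSO (L1 regularization) shrinks some coefficients to exactly zero
lasso = Lasso(alpha=0.1)

# 5-fold cross-validation: average the score over held-out folds
scores = cross_val_score(lasso, X, y, cv=5)
print(scores.mean())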
Differentiate between univariate, bivariate, and multivariate analysis
Univariate: contains only one variable, purpose of univariate analysis is to describe the data and find patterns that exist within it, can draw conclusions using mean, median, mode, min, max, etc.
Bivariate: contains two variables, bivariate analysis deals with causes and relationships, purpose of analysis is to find out the relationship between the two variables, find proportions of one variable to another, used for description and predictions
Multivariate: contains more than two variables, the purpose of multivariate analysis is the same as bivariate analysis but with more variables, example: data about a house used to predict its price, can be descriptive, predictive, and prescriptive (change the variables to guess what the outcome is)
What are the feature selection methods to select the right variables?
Two main methods for feature selection:
1. Filter Method ("bad data in, bad answer out": cleaning and preprocessing the data before modeling)
- Linear Discriminant Analysis
- ANOVA
- Chi-Square (most common)
2. Wrapper Method (labor intensive)
- Forward Selection (start with no features and add them one at a time, testing at each step, until a good fit is reached)
- Backward Selection (start with all features and remove them one at a time, re-running the test, until the best fit is reached)
- Recursive Feature Elimination (recursively looks through all the features and how they pair together; see the sketch below)
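A sketch of Recursive Feature Elimination with scikit-learn (the estimator and the number of features to keep are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask over the original features: True = selected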
Write a program that prints the numbers 1 to 50. For multiples of 3 print Fizz, for multiples of 5 print Buzz, and for multiples of both 3 and 5 print FizzBuzz
for i in range(1, 51):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
You are given a dataset consisting of variables having more than 30% missing values. How will you deal with them?
Ways to handle missing data:
1. If the dataset is huge, we can simply remove the rows with missing data values. It's the quickest way, and we can use the rest of the data to predict the values
2. We can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python, e.g., df.fillna(df.mean()) (see the sketch below)
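A minimal pandas sketch of both options (the DataFrame is a made-up example):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 35, np.nan, 45]})

df_dropped = df.dropna()           # option 1: drop rows with missing values
df_filled = df.fillna(df.mean())   # option 2: fill with the column mean
print(df_filled)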
For the given points, how will you calculate the Euclidean distance in Python?
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
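A runnable version, assuming the two points are given as (x, y) pairs:

from math import sqrt

plot1 = (1, 3)
plot2 = (2, 5)
euclidean_distance = sqrt((plot1[0] - plot2[0]) ** 2 + (plot1[1] - plot2[1]) ** 2)
print(euclidean_distance)   # sqrt(5), about 2.236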
What is the angle between the hour and minute hands of a clock when the time is half past 6?
15 degrees. The minute hand points at the 6 (180°), while the hour hand has moved halfway from the 6 to the 7, i.e., 360°/24 = 15° past the 6, so the angle between them is 15°.
Explain dimensionality reduction and list its benefits
Def: Dimensionality reduction refers to the process of converting a dataset with vast dimensions into data with fewer dimensions (fields) that conveys similar information concisely
Benefits:
1. It helps compress the data and reduces the storage space required
2. It reduces computation time as less dimensions lead to less computing
3. It removes redundant features; for example, there is no point in storing a value in two different units (inches and feet). A PCA sketch follows this list.
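A hedged sketch of dimensionality reduction with PCA in scikit-learn (the dataset and component count are illustrative):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # 4 original dimensions
pca = PCA(n_components=2)           # compress to 2 dimensions
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)        # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained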
How will you calculate the eigenvalues and eigenvectors of a 3x3 matrix?
Find the eigenvalues λ by solving the characteristic equation det(A - λI) = 0 (a cubic in λ for a 3x3 matrix); then, for each eigenvalue λ, solve the linear system (A - λI)v = 0 to get the corresponding eigenvector v
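The result can be checked quickly in NumPy (the matrix A here is an arbitrary example):

import numpy as np

A = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 4.0, 9.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # roots of det(A - lambda * I) = 0
print(eigenvectors)   # columns are the corresponding eigenvectors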
How should you maintain your deployed model?
Steps:
1. Monitor: constant monitoring of all the models is needed to determine the performance accuracy of the model
2. Evaluate: evaluation metrics of the current model are calculated to determine if a new algorithm is needed
3. Compare: the new models are compared against each other to determine which model performs the best
4. Rebuild: the best-performing model is re-built on the current state of the data
What are recommender systems?
A recommender system predicts the “rating” or “preference” a user would give to a product
There are two types:
1. Collaborative Filtering: for example, Last.fm recommends tracks that are often played by other users with similar interests
2. Content-based Filtering: Pandora uses the properties of a song to recommend music with similar properties.
How do you find the RMSE and MSE in a linear regression model?
MSE = E[(Y - Y_hat)**2]
RMSE = sqrt(MSE)
where the expectation E is the average of the squared errors over all N observations: MSE = (1/N) * sum((Y_i - Y_hat_i)**2)
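A direct NumPy computation (y_true and y_pred are made-up example values):

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.0])

mse = np.mean((y_true - y_pred) ** 2)   # average squared error
rmse = np.sqrt(mse)
print(mse, rmse)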
If it rains on Saturday with probability 0.6 and on Sunday with probability 0.2, what is the probability that it rains this weekend?
Assuming the two days are independent: P(rain this weekend) = 1 - P(no rain Saturday) * P(no rain Sunday) = 1 - (1 - 0.6)(1 - 0.2) = 1 - 0.32 = 0.68
How can you select k for k-means?
We most commonly use the “Elbow Method”:
- The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k and compute the WSS for each; the value of k at the bend (the "elbow") of the WSS curve is a good choice
- Within sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid (see the sketch below)
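A sketch of the elbow method with scikit-learn (the dataset and range of k are illustrative); print or plot the inertia against k and look for the bend:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)   # inertia_ is the within-cluster sum of squares (WSS)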
What is the significance of the p-value?
p-value typically <= 0.05: indicates strong evidence against the null hypothesis, so you reject the null hypothesis
p-value typically > 0.05: indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis
p-value at the 0.05 cut-off: considered marginal (could go either way)
How can outlier values be treated?
- You can drop outliers only if they are garbage values
  - ex: height of an adult = 'abc'
- If the outliers have extreme values, they can be removed
  - ex: most values are 0-10, but one point is 100
If you cannot drop outliers, try the following:
1. Try a different model, data detected as outliers by linear models can be fit by non-linear models
2. Try normalizing the data, this way the extreme data points are pulled to a similar range
3. You can use algorithms which are less affected by outliers, example: random forest
How can you say that a time series data is stationary?
We can say that a time series is stationary when its mean and variance are constant over time (picture a waveform with a consistent amplitude and wavelength along the x-axis)
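Beyond eyeballing the series, one common check is the Augmented Dickey-Fuller test from statsmodels (the random-walk series below is a made-up example of non-stationary data):

import numpy as np
from statsmodels.tsa.stattools import adfuller

series = np.cumsum(np.random.randn(500))   # a random walk is not stationary
statistic, p_value, *rest = adfuller(series)
print(p_value)   # a large p-value: fail to reject "non-stationary"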
How can you calculate accuracy using a confusion matrix?
Accuracy = (True positive + True Negative)/ Total Observations
Write the equations for precision and recall rate
Precision = True Positive/ (True Positive + False Positive)
Recall Rate = True Positive / (True Positive + False Negative)
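A worked example computing accuracy, precision, and recall from confusion-matrix counts (the counts themselves are made up):

TP, FP, TN, FN = 40, 10, 45, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 0.85
precision = TP / (TP + FP)                   # 0.80
recall = TP / (TP + FN)                      # about 0.889
print(accuracy, precision, recall)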
If a drawer contains 12 red socks, 16 blue socks, and 20 white socks, how many must you pull out to be sure of a matching pair?
You must pull out 4. With three colors, the worst case is drawing one sock of each color in the first three pulls; the fourth sock is then guaranteed to match one of them (pigeonhole principle)
"People who bought this, also bought…" recommendations seen on Amazon are a result of which algorithm?
This recommendation engine uses Collaborative Filtering, not Content-based Filtering
Collaborative Filtering: exploits the behavior of other users and their purchase history in terms of ratings, selection, etc. It makes predictions on what might interest a person based on preferences of many other users. In this algorithm, features of the items are not known.
Write a SQL query to list all orders with customer information
Given an Order table which contains OrderId, CustomerId, OrderNumber, TotalAmount
Given a Customer table which contains Id, FirstName, LastName, City, Country
SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM Order
JOIN Customer
ON Order.CustomerId = Customer.Id
What is the SQL query order?
The logical processing order of a SQL query is: FROM (including JOINs) -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT
You are given a dataset on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?
Cancer detection results in IMBALANCED DATA
In an imbalanced dataset, accuracy should not be used as a measure of performance, because it is important to focus on the remaining 4%: the people who are wrongly diagnosed. A wrong diagnosis is a major concern because there can be people who have cancer but were not predicted to. Instead, evaluate the model with class-sensitive metrics such as precision, recall (sensitivity), and the F1 score, and consider rebalancing the training data (e.g., by oversampling the minority class).
Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
1. K-means clustering
2. Linear regression
3. K-NN
4. Decision trees
K-NN, because it imputes from the nearest neighbors and so can handle both categorical variables (majority vote) and continuous variables (average of the neighbors)
Given a box of matches and two ropes A and B, each of which takes exactly 60 minutes to burn (though not necessarily at a uniform rate, and not necessarily identical to each other), measure a period of 45 minutes
Light A from both ends and B from one end
When A is finished burning, we know that 30 minutes have elapsed and B has 30 minutes remaining. Light B from the other end as well, and it will take 15 more minutes to burn, for a total of 45 minutes.
Below are the 8 actual values of the target variable in the training file: [0, 0, 0, 1, 1, 1, 1, 1]. What is the entropy of the target variable?
Entropy = -(5/8 log2(5/8) + 3/8 log2(3/8)), about 0.954
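The same value computed in Python (3 zeros and 5 ones out of 8 values, log base 2):

from math import log2

p1, p0 = 5 / 8, 3 / 8
entropy = -(p1 * log2(p1) + p0 * log2(p0))
print(entropy)   # about 0.954 bits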
We want to predict the probability of death from heart disease based on three risk factors: age, gender, blood cholesterol. What is the most appropriate algorithm for this use case?
Logistic Regression
After studying the behavior of a population, you have identified four specific individual types who are valuable to your study. You would like to find all users who are most similar to each individual type. What algorithm is most appropriate for this study?
K-means clustering
We are looking to group people together by four specific similarities, which indicates the value of k (k = 4)
You have run the association rules algorithm on your dataset and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?
{grape, apple} must be a frequent item set
Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine whether offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use?
One-way ANOVA
What do you understand about true positive rate and false positive rate?
The True Positive Rate (TPR) defines the probability that an actual positive will turn out to be positive. It is calculated as the ratio of the true positives to all actual positives (true positives plus false negatives): TPR = TP / (TP + FN)
The False Positive Rate (FPR) defines the probability that an actual negative result will be shown as positive (a false alarm). It is calculated as the ratio of the false positives to all actual negatives (true negatives plus false positives): FPR = FP / (TN + FP)
What is the ROC curve?
The graph of the True Positive Rate on the y-axis against the False Positive Rate on the x-axis is called the ROC curve, and it is used in binary classification. The area under the ROC curve (AUC) ranges between 0 and 1; a completely random model, represented by the diagonal straight line, has an AUC of 0.5, and the amount by which a model's curve deviates from that line indicates how good the model is.
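A sketch of computing the ROC points and AUC with scikit-learn (the labels and scores are made-up examples):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print(roc_auc_score(y_true, y_score))               # area under the curve (AUC)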
What is a Confusion Matrix?
The Confusion Matrix is a summary of the prediction results for a classification problem. It is an n x n table that is used to describe and evaluate the performance of a classification model.