DS Interview Qs Flashcards
List the differences between supervised and unsupervised learning
Supervised Learning: Uses known and labeled data as input, has a feedback mechanism; the most common algorithms are decision trees, logistic regression, and support vector machines
Unsupervised Learning: Uses unlabeled data as input, has no feedback mechanism; the most common algorithms are k-means clustering, hierarchical clustering, and the apriori algorithm
How is logistic regression done?
Logistic regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function (the sigmoid)
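A minimal sketch of the idea in Python (the weights, bias, and feature values are made-up examples):

import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.8, -0.4]), 0.1   # illustrative weights and bias
x = np.array([2.0, 1.0])            # illustrative feature vector
print(sigmoid(w @ x + b))           # estimated P(y = 1 | x)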
Explain the steps in making a decision tree
- Take the entire dataset as input
- Calculate entropy of target variable as well as predictor attributes
- Calculate information gain of all attributes
- Choose the attribute with highest information gain as the root node
- Repeat the process on every branch until the decision node of each branch is finalized (see the entropy sketch below)
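A minimal sketch of steps 2 and 3 in Python, assuming a NumPy array of labels and a single categorical feature:

import numpy as np

def entropy(labels):
    # Shannon entropy (base 2) of a label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    # Entropy of the target minus the weighted entropy after splitting on the feature
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

y = np.array([0, 0, 1, 1, 1])
x = np.array(["a", "a", "b", "b", "b"])
print(information_gain(y, x))   # a perfect split: the gain equals entropy(y)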
How do you build a random forest model?
- Randomly select k features from total m features where k < m
- Among those k features, find the best split point and use it to create node d
- Split the node into daughter nodes using the best split
- Repeat steps 2 and 3 until leaf nodes are finalized
- Build the forest by repeating steps 1 to 4 n times to create n trees (see the scikit-learn sketch below)
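In practice these steps are handled internally by a library; a hedged sketch with scikit-learn's RandomForestClassifier (the dataset and parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = number of trees (n); max_features = k features tried per split
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))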
How can you avoid overfitting your model?
Three main methods to avoid overfitting:
- Keep the model simple, take into account fewer variables, thereby removing some of the noise in the training data
- Use cross-validation techniques such as k-fold cross-validation (applied before/during training to estimate generalization)
- Use regularization techniques such as LASSO that penalize certain model parameters if they're likely to cause overfitting (applied during training; a sketch of both follows)
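A short sketch of the last two techniques with scikit-learn (the dataset and alpha value are illustrative assumptions):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# LASSO (L1 regularization) shrinks some coefficients to exactly zero
lasso = Lasso(alpha=0.1)

# 5-fold cross-validation: average the score over held-out folds
scores = cross_val_score(lasso, X, y, cv=5)
print(scores.mean())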
Differentiate between univariate, bivariate, and multivariate analysis
Univariate: contains only one variable, purpose of univariate analysis is to describe the data and find patterns that exist within it, can draw conclusions using mean, median, mode, min, max, etc.
Bivariate: contains two variables, bivariate analysis deals with causes and relationships, purpose of analysis is to find out the relationship between the two variables, find proportions of one variable to another, used for description and predictions
Multivariate: contains more than two variables, the purpose of multivariate analysis is the same as bivariate analysis but with more variables, example: data about a house used to predict its price, can be descriptive, predictive, and prescriptive (change the variables to guess what the outcome is)
What are the feature selection methods to select the right variables?
Two main methods for feature selection:
1. Filter Method ("bad data in, bad answer out": cleaning and preprocessing the data before modeling)
- Linear Discriminant Analysis
- ANOVA
- Chi-Square (most common)
2. Wrapper Method (labor intensive)
- Forward Selection (start with no features and add them one at a time, testing at each step, until a good fit is reached)
- Backward Selection (start with all features and remove them one at a time, re-running the test, until the best fit is reached)
- Recursive Feature Elimination (recursively looks through all the features and how they pair together; see the sketch below)
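A sketch of Recursive Feature Elimination with scikit-learn (the estimator and the number of features to keep are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask over the original features: True = selected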
Write a program that prints the numbers 1 to 50. For multiples of 3 print Fizz, for multiples of 5 print Buzz, and for multiples of both 3 and 5 print FizzBuzz
for i in range(1, 51):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
You are given a dataset consisting of variables having more than 30% missing values. How will you deal with them?
Ways to handle missing data:
1. If the dataset is huge, we can simply remove the rows with missing data values. It's the quickest way, and we can use the rest of the data to predict the values
2. We can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python, e.g., df.fillna(df.mean()) (see the sketch below)
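A minimal pandas sketch of both options (the DataFrame is a made-up example):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 35, np.nan, 45]})

df_dropped = df.dropna()           # option 1: drop rows with missing values
df_filled = df.fillna(df.mean())   # option 2: fill with the column mean
print(df_filled)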
For the given points, how will you calculate the Euclidean distance in Python?
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
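A runnable version, assuming the two points are given as (x, y) pairs:

from math import sqrt

plot1 = (1, 3)
plot2 = (2, 5)
euclidean_distance = sqrt((plot1[0] - plot2[0]) ** 2 + (plot1[1] - plot2[1]) ** 2)
print(euclidean_distance)   # sqrt(5), about 2.236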
What is the angle between the hour and minute hands of a clock when the time is half past 6?
15 degrees. The minute hand points at the 6 (180°), while the hour hand has moved halfway from the 6 to the 7, i.e., 360°/24 = 15° past the 6, so the angle between them is 15°.
Explain dimensionality reduction and list its benefits
Def: Dimensionality reduction refers to the process of converting a dataset with vast dimensions into data with fewer dimensions (fields) that conveys similar information concisely
Benefits:
1. It helps compress the data and reduces the storage space required
2. It reduces computation time as less dimensions lead to less computing
3. It removes redundant features; for example, there is no point in storing a value in two different units (inches and feet). A PCA sketch follows this list.
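A hedged sketch of dimensionality reduction with PCA in scikit-learn (the dataset and component count are illustrative):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # 4 original dimensions
pca = PCA(n_components=2)           # compress to 2 dimensions
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)        # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained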
How will you calculate the eigenvalues and eigenvectors of a 3x3 matrix?
Find the eigenvalues λ by solving the characteristic equation det(A - λI) = 0 (a cubic in λ for a 3x3 matrix); then, for each eigenvalue λ, solve the linear system (A - λI)v = 0 to get the corresponding eigenvector v
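The result can be checked quickly in NumPy (the matrix A here is an arbitrary example):

import numpy as np

A = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 4.0, 9.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # roots of det(A - lambda * I) = 0
print(eigenvectors)   # columns are the corresponding eigenvectors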
How should you maintain your deployed model?
Steps:
1. Monitor: constant monitoring of all the models is needed to determine the performance accuracy of the model
2. Evaluate: evaluation metrics of the current model are calculated to determine if a new algorithm is needed
3. Compare: the new models are compared against each other to determine which model performs the best
4. Rebuild: the best-performing model is re-built on the current state of the data
What are recommender systems?
A recommender system predicts the “rating” or “preference” a user would give to a product
There are two types:
1. Collaborative Filtering: for example, Last.fm recommends tracks that are often played by other users with similar interests
2. Content-based Filtering: Pandora uses the properties of a song to recommend music with similar properties.
How do you find the RMSE and MSE in a linear regression model?
MSE = E[(Y - Y_hat)**2]
RMSE = sqrt(MSE)
where the expectation E is the average of the squared errors over all N observations: MSE = (1/N) * sum((Y_i - Y_hat_i)**2)
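A direct NumPy computation (y_true and y_pred are made-up example values):

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.0])

mse = np.mean((y_true - y_pred) ** 2)   # average squared error
rmse = np.sqrt(mse)
print(mse, rmse)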
If it rains on Saturday with probability 0.6 and on Sunday with probability 0.2, what is the probability that it rains this weekend?
Assuming the two days are independent: P(rain this weekend) = 1 - P(no rain Saturday) * P(no rain Sunday) = 1 - (1 - 0.6)(1 - 0.2) = 1 - 0.32 = 0.68
How can you select k for k-means?
We most commonly use the “Elbow Method”:
- The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k and compute the WSS for each; the value of k at the bend (the "elbow") of the WSS curve is a good choice
- Within sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid (see the sketch below)
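A sketch of the elbow method with scikit-learn (the dataset and range of k are illustrative); print or plot the inertia against k and look for the bend:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)   # inertia_ is the within-cluster sum of squares (WSS)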
What is the significance of the p-value?
p-value typically <= 0.05: indicates strong evidence against the null hypothesis, so you reject the null hypothesis
p-value typically > 0.05: indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis
p-value at the 0.05 cut-off: considered marginal (could go either way)
How can outlier values be treated?
- You can drop outliers only if they are garbage values
  - ex: height of an adult = 'abc'
- If the outliers have extreme values, they can be removed
  - ex: most values are 0-10, but one point is 100
If you cannot drop outliers, try the following:
1. Try a different model, data detected as outliers by linear models can be fit by non-linear models
2. Try normalizing the data, this way the extreme data points are pulled to a similar range
3. You can use algorithms which are less affected by outliers, example: random forest
How can you say that a time series data is stationary?
We can say that a time series is stationary when its mean and variance are constant over time (picture a waveform with a consistent amplitude and wavelength along the x-axis)
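Beyond eyeballing the series, one common check is the Augmented Dickey-Fuller test from statsmodels (the random-walk series below is a made-up example of non-stationary data):

import numpy as np
from statsmodels.tsa.stattools import adfuller

series = np.cumsum(np.random.randn(500))   # a random walk is not stationary
statistic, p_value, *rest = adfuller(series)
print(p_value)   # a large p-value: fail to reject "non-stationary"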
How can you calculate accuracy using a confusion matrix?
Accuracy = (True positive + True Negative)/ Total Observations
Write the equations for precision and recall rate
Precision = True Positive/ (True Positive + False Positive)
Recall Rate = True Positive / (True Positive + False Negative)
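A worked example computing accuracy, precision, and recall from confusion-matrix counts (the counts themselves are made up):

TP, FP, TN, FN = 40, 10, 45, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 0.85
precision = TP / (TP + FP)                   # 0.80
recall = TP / (TP + FN)                      # about 0.889
print(accuracy, precision, recall)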
If a drawer contains 12 red socks, 16 blue socks, and 20 white socks, how many must you pull out to be sure of a matching pair?
You must pull out 4. With three colors, the worst case is drawing one sock of each color in the first three pulls; the fourth sock is then guaranteed to match one of them (pigeonhole principle)
"People who bought this, also bought…" recommendations seen on Amazon are a result of which algorithm?
This recommendation engine uses Collaborative Filtering, not Content-based Filtering
Collaborative Filtering: exploits the behavior of other users and their purchase history in terms of ratings, selection, etc. It makes predictions on what might interest a person based on preferences of many other users. In this algorithm, features of the items are not known.
Write a SQL query to list all orders with customer information
Given an Order table which contains OrderId, CustomerId, OrderNumber, TotalAmount
Given a Customer table which contains Id, FirstName, LastName, City, Country
SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM Order
JOIN Customer
ON Order.CustomerId = Customer.Id
What is the SQL query order?
The logical processing order of a SQL query is: FROM (including JOINs) -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT
You are given a dataset on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?
Cancer detection results in IMBALANCED DATA
In an imbalanced dataset, accuracy should not be used as a measure of performance, because it is important to focus on the remaining 4%: the people who are wrongly diagnosed. A wrong diagnosis is a major concern because there can be people who have cancer but were not predicted to. Instead, evaluate the model with class-sensitive metrics such as precision, recall (sensitivity), and the F1 score, and consider rebalancing the training data (e.g., by oversampling the minority class).
Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
1. K-means clustering
2. Linear regression
3. K-NN
4. Decision trees
K-NN, because it imputes from the nearest neighbors and so can handle both categorical variables (majority vote) and continuous variables (average of the neighbors)
Given a box of matches and two ropes A and B, each of which takes exactly 60 minutes to burn (though not necessarily at a uniform rate, and not necessarily identical to each other), measure a period of 45 minutes
Light A from both ends and B from one end
When A is finished burning, we know that 30 minutes have elapsed and B has 30 minutes remaining. Light B from the other end as well, and it will take 15 more minutes to burn, for a total of 45 minutes.
Below are the 8 actual values of the target variable in the training file: [0, 0, 0, 1, 1, 1, 1, 1]. What is the entropy of the target variable?
Entropy = -(5/8 log2(5/8) + 3/8 log2(3/8)), about 0.954
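The same value computed in Python (3 zeros and 5 ones out of 8 values, log base 2):

from math import log2

p1, p0 = 5 / 8, 3 / 8
entropy = -(p1 * log2(p1) + p0 * log2(p0))
print(entropy)   # about 0.954 bits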
We want to predict the probability of death from heart disease based on three risk factors: age, gender, blood cholesterol. What is the most appropriate algorithm for this use case?
Logistic Regression
After studying the behavior of a population, you have identified four specific individual types who are valuable to your study. You would like to find all users who are most similar to each individual type. What algorithm is most appropriate for this study?
K-means clustering
We are looking to group people together by four specific similarities, which indicates the value of k (k = 4)
You have run the association rules algorithm on your dataset and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?
{grape, apple} must be a frequent item set
Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine whether offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use?
One-way ANOVA
What do you understand about true positive rate and false positive rate?
The True Positive Rate (TPR) defines the probability that an actual positive will turn out to be positive. It is calculated as the ratio of the true positives to all actual positives (true positives plus false negatives): TPR = TP / (TP + FN)
The False Positive Rate (FPR) defines the probability that an actual negative result will be shown as positive (a false alarm). It is calculated as the ratio of the false positives to all actual negatives (true negatives plus false positives): FPR = FP / (TN + FP)
What is the ROC curve?
The graph of the True Positive Rate on the y-axis against the False Positive Rate on the x-axis is called the ROC curve, and it is used in binary classification. The area under the ROC curve (AUC) ranges between 0 and 1; a completely random model, represented by the diagonal straight line, has an AUC of 0.5, and the amount by which a model's curve deviates from that line indicates how good the model is.
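A sketch of computing the ROC points and AUC with scikit-learn (the labels and scores are made-up examples):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print(roc_auc_score(y_true, y_score))               # area under the curve (AUC)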
What is a Confusion Matrix?
The Confusion Matrix is a summary of the prediction results for a classification problem. It is an n x n table that is used to describe and evaluate the performance of a classification model.