Top 50 Questions Flashcards

Question

A p-value at the 0.05 cutoff can be interpreted as?

Answer 1

marginal, meaning it could go either way.

Answer 2

You can drop outliers only if they are garbage values. Outliers with extreme values can also be removed. For example, if all the data points are clustered between 0 and 10 but one point lies at 100, we can remove that point. If you cannot drop outliers, you can try the following:
* Try a different model. Data detected as outliers by linear models can be fit by nonlinear models, so be sure you are choosing the correct model.
* Try normalizing the data. This way, the extreme data points are pulled into a similar range.
* Use algorithms that are less affected by outliers, such as random forests.

Answer 3

It is stationary when the variance and mean of the series are constant with time.

Answer 4

You can see the values for total data, actual values, and predicted values. The formula for accuracy is:
**Accuracy = (True Positive + True Negative) / Total Observations** = (262 + 347) / 650 = 609 / 650 ≈ 0.94

Answer 5

**Precision = (True Positive) / (True Positive + False Positive)** = 262 / 277 ≈ 0.95
**Recall Rate = (True Positive) / (True Positive + False Negative)** = 262 / 288 ≈ 0.91
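As a quick check of the arithmetic, here is a minimal sketch that recomputes accuracy, precision, and recall from the four confusion-matrix counts. The false positive and false negative counts (15 and 26) are not stated directly on the cards; they are inferred here from the denominators 277 and 288.

```python
# Counts taken from the worked answers; fp and fn are inferred
# (277 - 262 and 288 - 262), which is an assumption of this sketch.
tp, tn = 262, 347
fp, fn = 15, 26

total = tp + tn + fp + fn          # 650 observations
accuracy = (tp + tn) / total       # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted positives that are real
recall = tp / (tp + fn)            # share of real positives that were found

print(f"{accuracy:.3f} {precision:.3f} {recall:.3f}")
```

The four counts sum to 650, which matches the total used in the accuracy formula.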

Answer 6

The recommendation engine is accomplished with collaborative filtering. Collaborative filtering draws on the behavior of other users and their purchase history in terms of ratings, selections, etc. The engine predicts what might interest a person based on the preferences of other users. In this algorithm, item features are unknown.

Answer 7

Cancer detection results in imbalanced data. In an imbalanced dataset, accuracy should not be used as a measure of performance. It is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection and can greatly improve a patient's prognosis. Hence, to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F-measure to determine the class-wise performance of the classifier.

Answer 8

* K-means clustering
* Linear regression
* K-NN (k-nearest neighbor)
* Decision trees

The K-nearest neighbor algorithm can be used because it can compute the nearest neighbor, and if a value is missing, it simply computes the nearest neighbor based on all the other features. When you're dealing with K-means clustering or linear regression, you need to handle missing values in your pre-processing; otherwise, they'll crash. Decision trees have the same problem, although there is some variance.

Answer 9

The formula for calculating the entropy is:
**Entropy = -[(p/n) log(p/n) + ((n - p)/n) log((n - p)/n)]**
Putting p = 5 and n = 8, we get Entropy = A = -(5/8 log(5/8) + 3/8 log(3/8))
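To put a number on this expression, here is a minimal sketch that evaluates the entropy for 5 positives out of 8 points. The card does not state the base of the logarithm; base 2 is assumed here, and the label list is only an illustrative way of encoding p = 5 and n = 8.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits (log base 2 assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Illustrative target with p = 5 ones out of n = 8 points.
target = [1] * 5 + [0] * 3
value = entropy(target)
print(round(value, 3))
```

With base 2 the result is a little under 1 bit, which is expected because the two classes are close to evenly split.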

Answer 10

The most appropriate algorithm for this case is logistic regression.

Answer 11

Because we want to group people by four different similarities, the number of groups gives the value of k (k = 4). Therefore, K-means clustering (answer A) is the most appropriate algorithm for this study.

Answer 12

The answer is: {grape, apple} must be a frequent itemset.

Answer 13

The answer is: One-way ANOVA

Answer 14

A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that's easy to analyze.

Answer 15

1. Take the entire data set as input.
2. Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
3. Apply the split to the input data (divide step).
4. Re-apply steps one and two to the divided data.
5. Stop when you meet any stopping criteria.
6. This step is called pruning. Clean up the tree if you went too far doing splits.
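The steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes numeric features, uses Gini impurity as the measure of class separation (the card does not name one), stops at a hypothetical `max_depth`, and leaves out the pruning step.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Step 2: find the (feature, threshold) test that lowers impurity most."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best  # None when no split improves on the current node

def build_tree(rows, labels, depth=0, max_depth=3):
    """Steps 1 to 5: split recursively until a stopping criterion is met."""
    split = best_split(rows, labels)
    if split is None or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Made-up one-feature data set for illustration.
rows = [[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]]
labels = ["a", "a", "a", "b", "b", "b"]
tree = build_tree(rows, labels)
```

On this toy data a single split at 3.0 separates the two classes, so both children become leaves immediately.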

Answer 16

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if removing it from the problem-fault sequence prevents the final undesirable event from recurring.

Answer 17

Logistic regression is also known as the logit model. It is a technique used to predict a binary outcome from a linear combination of predictor variables.
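As a rough illustration of "a linear combination of predictor variables", the sketch below passes a weighted sum through the logistic (sigmoid) function to get a probability for the positive class. The function name, weights, and bias are hypothetical values chosen for the example, not fitted parameters.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # linear combination
    return 1.0 / (1.0 + math.exp(-z))                         # logistic function

# Hypothetical coefficients, for illustration only.
weights, bias = [0.8, -0.4], 0.1
p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0   # threshold the probability to get a binary outcome
```

The 0.5 threshold is a common default; a different cutoff can be used when the costs of false positives and false negatives differ.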

Answer 18

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.

Answer 19

*Cross-validation* is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
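One common form is k-fold cross-validation, where each fold takes a turn as the validation set. The sketch below only generates the index splits; the helper name `k_fold_indices` is made up for this example, and shuffling the data first is left out for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    # A model would be fit on `train` and scored on `validation` here.
    print(len(train), len(validation))
```

Averaging the validation scores across the folds gives the estimate of how the model is likely to perform on unseen data.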

Answer 20

Most recommender systems use this filtering process to find patterns and information through collaboration among multiple viewpoints, numerous data sources, and several agents.

Answer 21

They do not, because in some cases they reach a local minimum or local optimum and never reach the global optimum. Where they end up is governed by the data and the starting conditions.
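The dependence on starting conditions is easy to see on a small non-convex function. The sketch below runs plain gradient descent on f(x) = x^4 - 3x^2 + x, a function chosen only for illustration, from two different starting points; the learning rate and step count are arbitrary assumptions.

```python
def f(x):
    """Illustrative non-convex function with two separate minima."""
    return x ** 4 - 3 * x ** 2 + x

def grad(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=2000):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

left = descend(-2.0)    # settles near x = -1.30, the global minimum
right = descend(2.0)    # settles near x = 1.13, a local minimum only
print(round(left, 2), round(right, 2), f(left) < f(right))
```

Both runs converge, but to different points, and only the run that starts on the left finds the lower of the two minima.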

Answer 22

This is statistical hypothesis testing for randomized experiments with two variants, A and B. The objective of A/B testing is to identify which changes to a web page maximize or increase an outcome of interest.

Answer 23

* The assumption of linearity of the errors
* It can't be used for count outcomes or binary outcomes
* There are overfitting problems that it can't solve

Answer 24

It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation converge to the quantities they are trying to estimate.
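A small simulation makes the convergence concrete. The sketch below averages rolls of a fair six-sided die, whose expected value is 3.5; the die example, the fixed seed, and the sample sizes are all choices made for this illustration.

```python
import random

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers

def mean_of_die_rolls(n_rolls):
    """Sample mean of n_rolls rolls of a fair six-sided die."""
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)
    return total / n_rolls

small = mean_of_die_rolls(10)        # can sit far from 3.5
large = mean_of_die_rolls(100_000)   # should sit very close to 3.5
print(small, large)
```

With only 10 rolls the sample mean can be well off the expected value, while with 100,000 rolls it lands within a small fraction of it.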

Answer 25

These are extraneous variables in a statistical model that correlate directly or inversely with both the dependent and the independent variable. An estimate that fails to account for the confounding factor is misleading.

Answer 26

It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes, star schemas involve several layers of summarization to recover information faster.

Answer 27

You will want to update an algorithm when:
* You want the model to evolve as data streams through infrastructure
* The underlying data source is changing
* There is a case of non-stationarity

Answer 28

An eigenvalue is the factor by which a particular linear transformation flips, compresses, or stretches along the direction of the corresponding eigenvector.

Answer 29

Eigenvectors are for understanding linear transformations: they are the directions that a transformation only scales, without rotating. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix.
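For a 2x2 symmetric matrix, such as a small covariance matrix, the eigenvalues and eigenvectors have a closed form. The sketch below is a minimal illustration under that assumption, with a nonzero off-diagonal entry; the matrix values are made up, and a numerical library would normally be used for anything larger.

```python
import math

def eig_symmetric_2x2(a, b, c):
    """Eigenpairs of [[a, b], [b, c]] with b != 0, largest eigenvalue first."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    pairs = []
    for lam in (mean + radius, mean - radius):
        vx, vy = b, lam - a              # solves (a - lam) * vx + b * vy = 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

# Made-up covariance-like matrix [[2, 1], [1, 2]].
pairs = eig_symmetric_2x2(2.0, 1.0, 2.0)
for lam, (vx, vy) in pairs:
    # Applying the matrix to an eigenvector only rescales it by lam.
    print(lam, (2.0 * vx + 1.0 * vy, 1.0 * vx + 2.0 * vy), (lam * vx, lam * vy))
```

For this matrix the eigenvalues are 3 and 1, with eigenvectors along the diagonal directions (1, 1) and (1, -1).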

Answer 30

Resampling is done in any of these cases:
* Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points
* Substituting labels on data points when performing significance tests
* Validating models by using random subsets (bootstrapping, cross-validation)

Answer 31

Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.

Answer 32

1. Selection bias 2. Undercoverage bias 3. Survivorship bias

Answer 33

Survivorship bias is the logical error of focusing on aspects that support surviving a process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.

Answer 34

The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are:
1. Build several decision trees on bootstrapped training samples of data
2. On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors
3. Rule of thumb: at each split, m = √p
4. Predictions: by majority rule
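The three ingredients that distinguish these steps from a single decision tree are bootstrap sampling, per-split feature subsampling, and majority voting. The sketch below shows only those pieces, not a full forest; the helper names, the seed, and the toy inputs are assumptions made for illustration.

```python
import math
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def bootstrap_sample(rows):
    """Step 1: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def candidate_features(p):
    """Steps 2 and 3: pick m = round(sqrt(p)) of the p feature indices."""
    m = max(1, round(math.sqrt(p)))
    return sorted(rng.sample(range(p), m))

def majority_vote(predictions):
    """Step 4: combine the class predicted by each tree."""
    return Counter(predictions).most_common(1)[0][0]

rows = list(range(10))                    # stand-in for 10 training rows
sample = bootstrap_sample(rows)           # training set for one tree
features = candidate_features(p=9)        # split candidates for one split
vote = majority_vote(["spam", "ham", "spam"])
```

In a full implementation each tree would be grown on its own bootstrap sample, drawing a fresh set of candidate features at every split, and the per-tree predictions would then be passed to the vote.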


(58 cards)