Top 50 Questions Flashcards
What are the differences between supervised and unsupervised learning?
Supervised:
- uses known and labeled data as input
- has a feedback mechanism
Unsupervised:
- uses unlabeled data as input
- has no feedback mechanism
What are the most commonly used learning algorithms for supervised learning?
decision trees, logistic regression, and support vector machines (SVM)
What are the most commonly used learning algorithms for unsupervised learning?
k-means clustering, hierarchical clustering, and the Apriori algorithm.
How is logistic regression done?
Logistic regression measures the relationship between the dependent variable (the label we want to predict) and one or more independent variables (our features) by estimating the probability of the label with the logistic (sigmoid) function.
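A minimal Python sketch of the prediction step, assuming a model whose coefficients are already known. The names predict_proba, weights, and bias are illustrative, and the coefficient values are made up rather than fitted to any data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Estimated P(label = 1 | features): a linear score passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients, chosen only for illustration (not fitted to data).
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba([2.0, 1.0], weights, bias)  # linear score z = 0.1 + 1.6 - 0.5 = 1.2
label = 1 if p >= 0.5 else 0                  # 0.5 is the usual default threshold
```

Fitting the coefficients (for example by maximum likelihood) is a separate step that this sketch leaves out.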
Explain the steps in making a decision tree?
- Take the entire data set as input
- Calculate the entropy of the target variable, as well as of the predictor attributes
- Calculate the information gain of each attribute (how much splitting on that attribute reduces the entropy of the target)
- Choose the attribute with the highest information gain as the root node
- Repeat the same procedure on every branch until the decision node of each branch is finalized
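The entropy and information gain steps above can be sketched in Python as follows. The toy rows and the attribute names (salary, commute, accept) are hypothetical and serve only to show how the root attribute is chosen.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical toy data set.
rows = [
    {"salary": "high", "commute": "short", "accept": "yes"},
    {"salary": "high", "commute": "long", "accept": "yes"},
    {"salary": "low", "commute": "short", "accept": "no"},
    {"salary": "low", "commute": "long", "accept": "no"},
]

gains = {a: information_gain(rows, a, "accept") for a in ("salary", "commute")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
```

Here salary separates the classes perfectly, so it has the highest gain and becomes the root; the same procedure would then be repeated on each branch.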
According to the example decision tree, when is an offer accepted?
- Salary is greater than $50,000
- The commute is less than an hour
- Incentives are offered
How do you build a random forest model?
Steps to build a random forest model:
- Randomly select ‘k’ features from a total of ‘m’ features where k << m
- Among the ‘k’ features, calculate the node D using the best split point
- Split the node into daughter nodes using the best split
- Repeat steps two and three until leaf nodes are finalized
- Build the forest by repeating steps one to four ‘n’ times to create ‘n’ trees
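A heavily simplified Python sketch of these steps. Two assumptions are not in the steps above: each tree is a one-split stump, so the k features are drawn once per tree, and each tree is trained on a bootstrap sample of the rows. All function names and the data are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def best_stump(rows, labels, feature_ids):
    """Best single split (feature, threshold, left_label, right_label) by error count."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no valid split: fall back to a constant prediction
        m = majority(labels)
        return (feature_ids[0], float("inf"), m, m)
    return best[1:]

def build_forest(rows, labels, n_trees, k, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and k random features."""
    rng = random.Random(seed)
    m = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample (assumption)
        feature_ids = rng.sample(range(m), k)           # k of the m features
        forest.append(best_stump([rows[i] for i in idx], [labels[i] for i in idx], feature_ids))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    return majority([l if row[f] <= t else r for f, t, l, r in forest])

# Hypothetical, clearly separable data: two features, two classes.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = build_forest(rows, labels, n_trees=25, k=1)
```

A full implementation would grow deeper trees and redraw the k features at every node; the stumps only keep the sketch short.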
How can you avoid the overfitting of a random forest model?
- Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data
- Use cross-validation techniques, such as k-fold cross-validation
- Use regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting
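As an illustration of k-fold cross-validation, here is a minimal Python sketch that only generates the train/validation index splits. The function name k_fold_indices is hypothetical, and the data is assumed to be shuffled already; in practice you would shuffle first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))  # fold sizes 4, 3, 3
```

The model is then trained on each train split and scored on the matching validation split, and the k scores are averaged.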
_______ data contains only one variable. The purpose of the _________ analysis is to describe the data and find patterns that exist within it.
Univariate
Example: height of students
The patterns can be studied by drawing conclusions using mean, median, mode, dispersion or range, minimum, maximum, etc.
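A short Python sketch of these summary statistics using the standard library. The heights are hypothetical values in centimeters.

```python
import statistics

heights_cm = [150, 155, 155, 160, 165, 170, 175]  # hypothetical student heights

summary = {
    "mean": statistics.mean(heights_cm),
    "median": statistics.median(heights_cm),
    "mode": statistics.mode(heights_cm),
    "min": min(heights_cm),
    "max": max(heights_cm),
    "range": max(heights_cm) - min(heights_cm),
    "stdev": statistics.stdev(heights_cm),  # sample standard deviation (dispersion)
}
```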
_______ data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.
Bivariate
Example: temperature and ice cream sales in the summer season.
Here, temperature and sales are positively related: the hotter the temperature, the better the sales.
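One common way to quantify such a relationship is the Pearson correlation coefficient. The sketch below computes it by hand; the temperature and sales figures are hypothetical, not taken from any real table.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

temperature_c = [20, 25, 30, 35]     # hypothetical daily temperatures
ice_cream_sales = [200, 270, 310, 400]  # hypothetical sales figures

r = pearson_r(temperature_c, ice_cream_sales)  # close to +1: strong positive relationship
```

A value near +1 means the two variables rise together; it does not by itself show that one causes the other.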
_________ data involves three or more variables and is categorized under __________. It is similar to bivariate data but contains more than one dependent variable.
Multivariate
Example: data for house price prediction
The patterns can be studied by drawing conclusions using mean, median, and mode, dispersion or range, minimum, maximum, etc. You can start describing the data and using it to guess what the price of the house will be.
What are the feature selection methods used to select the right variables?
There are two main families of feature selection methods: filter methods and wrapper methods.
Cite three filter feature selection methods?
- Linear discriminant analysis (LDA)
- ANOVA
- Chi-Square
The best analogy for selecting features is “bad data in, bad answer out.” When we’re limiting or selecting the features, it’s all about cleaning up the data coming in.
Cite three wrapper feature selection methods?
- Forward Selection: We test one feature at a time and keep adding them until we get a good fit
- Backward Selection: We test all the features and start removing them to see what works better
- Recursive Feature Elimination: Repeatedly fits a model and removes the weakest features until the desired number remains
Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
In your choice of language, write a program that prints the numbers ranging from one to 16.
But for multiples of three, print “Fizz” instead of the number, and for the multiples of five, print “Buzz.” For numbers which are multiples of both three and five, print “FizzBuzz”
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
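One possible Python solution, as a minimal sketch. The helper name fizzbuzz and the capitalization of the words follow the question text and are otherwise arbitrary.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:  # multiple of both three and five
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

lines = [fizzbuzz(n) for n in range(1, 17)]  # one to 16 inclusive
print("\n".join(lines))
```

The check for 15 comes first; otherwise multiples of both numbers would be caught by the check for three.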
You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?
- If the data set is large, we can simply remove the rows with missing data values.
- For smaller data sets, we can substitute missing values with the mean of the remaining data using a pandas DataFrame in Python, for example df.fillna(df.mean(numeric_only=True)).
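Both strategies can be sketched without any dependencies, assuming missing values are stored as None and every column has at least one observed value. The function names and the data are hypothetical.

```python
def drop_incomplete_rows(rows):
    """Strategy for large data sets: keep only rows with no missing values."""
    return [r for r in rows if None not in r]

def impute_column_means(rows):
    """Strategy for small data sets: replace each None with its column mean."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[c] if v is None else v for c, v in enumerate(r)] for r in rows]

# Hypothetical data with missing entries marked as None.
rows = [[1.0, 10.0], [None, 20.0], [3.0, None]]

dropped = drop_incomplete_rows(rows)
imputed = impute_column_means(rows)
```

Dropping rows discards information, so it is only reasonable when plenty of complete rows remain.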
For the given points, how will you calculate the Euclidean distance in Python?
plot1 = [1,3]
plot2 = [2,5]
The Euclidean distance can be calculated as follows:
from math import sqrt  # sqrt must be imported; it is not a built-in
euclidean_distance = sqrt( (plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2 )  # sqrt(5) ≈ 2.236
What is dimensionality reduction, and what are its benefits?
Dimensionality reduction is the process of converting a data set with a large number of dimensions (fields) into one with fewer dimensions that conveys similar information concisely.
This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there’s no point in storing a value in two different units (meters and inches).
How should you maintain a deployed model?
- monitor
- evaluate
- compare
- rebuild
What are recommender systems?
A recommender system predicts what a user would rate a specific product based on their preferences. It can be split into two different areas:
- Collaborative Filtering
As an example, Last.fm recommends tracks that other users with similar interests play often. This is also commonly seen on Amazon after making a purchase; customers may notice the following message accompanied by product recommendations: “Users who bought this also bought…”
- Content-based Filtering
As an example: Pandora uses the properties of a song to recommend music with similar properties. Here, we look at content, instead of looking at who else is listening to music.
How do you find RMSE and MSE in a linear regression model?
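MSE is the Mean Squared Error: the average of the squared differences between the actual and predicted values, MSE = (1/n) * sum((actual_i - predicted_i)^2).
RMSE is the Root Mean Squared Error: the square root of the MSE, which puts the error back in the units of the target variable.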
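A minimal Python sketch of both metrics. The actual and predicted values are hypothetical and stand in for the output of a fitted regression model.

```python
import math

def mse(actual, predicted):
    """Mean squared error: average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

actual = [3.0, 5.0, 7.0]     # hypothetical true values
predicted = [2.0, 5.0, 9.0]  # hypothetical model predictions

mse_value = mse(actual, predicted)    # (1 + 0 + 4) / 3
rmse_value = rmse(actual, predicted)
```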
How can you select k for k-means?
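We use the elbow method to select k for k-means clustering. The idea is to run k-means on the data set for a range of values of k (the number of clusters) and compute the within-cluster sum of squares (WSS) for each.
WSS is defined as the sum of the squared distances between each member of a cluster and its centroid. When WSS is plotted against k, the point where the curve bends (the elbow), after which a larger k brings little improvement, is chosen as k.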
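A small Python sketch of the elbow computation, simplified to one-dimensional data and a deterministic starting point so that it is easy to follow. The function names, the initialization rule, and the data are all assumptions made for illustration.

```python
def kmeans_1d(points, k, max_iter=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, clusters)."""
    pts = sorted(points)
    n = len(pts)
    # Deterministic start (an assumption): centroids at evenly spaced quantiles.
    centroids = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def wss(centroids, clusters):
    """Within-cluster sum of squares: squared distance of each point to its centroid."""
    return sum((p - c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster)

# Hypothetical data with three obvious groups.
data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
wss_by_k = {k: wss(*kmeans_1d(data, k)) for k in range(1, 5)}
```

On this data WSS falls sharply up to k = 3 and only slightly afterwards, which is the elbow the method looks for.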
What is the significance of p-value of less than 0.05?
A p-value ≤ 0.05 is conventionally treated as statistically significant.
It indicates evidence against the null hypothesis strong enough to reject it at the 5% significance level.