Data Science basics Flashcards
Feature scaling
In practice, we often encounter different types of variables in the same dataset. A significant issue is that the ranges of these variables may differ a lot, and using the original scales may place more weight on the variables with a large range.
The goal of applying feature scaling is to bring the features onto roughly the same scale, so that each feature is equally important and the data is easier for most ML algorithms to process.
Standardization
aka Z-score normalization: features are rescaled so that the mean is 0 and the standard deviation is 1
can be done with the sklearn library (e.g., StandardScaler)
assumes that your data has a Gaussian distribution; useful when your data has varying scales and the algorithm you are using does make assumptions about your data having a Gaussian distribution, e.g., linear and logistic regression
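A minimal sketch using scikit-learn's StandardScaler; the feature matrix X is made-up example data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])  # made-up features on very different scales

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0, std 1

print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]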
Normalization
the goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
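A minimal sketch of min-max normalization with scikit-learn's MinMaxScaler, which rescales each feature to the [0, 1] range; X is again made-up example data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)  # each column now spans [0, 1]
print(X_norm)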
Feature selection
Feature selection is the process where you automatically or manually select the features that contribute most to the prediction variable or output you are interested in.
· Reduces Overfitting: less redundant data means less opportunity to make decisions based on noise.
· Improves Accuracy: less misleading data means modeling accuracy improves.
· Reduces Training Time: fewer features mean lower algorithm complexity, so algorithms train faster.
Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable.
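A minimal sketch of univariate selection, assuming scikit-learn's SelectKBest with the chi-squared test, on the bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 best features
X_selected = selector.fit_transform(X, y)

print(selector.scores_)  # chi2 score per feature
print(X_selected.shape)  # (150, 2)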
Feature Importance
Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable.
built into tree-based algos
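A minimal sketch using a random forest's feature_importances_ attribute in scikit-learn, again on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# one importance score per feature; higher means more relevant
print(model.feature_importances_)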
Correlation Matrix with Heatmap
Correlation states how the features are related to each other or to the target variable.
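A minimal sketch of a correlation matrix heatmap, assuming pandas and seaborn; the DataFrame df here is made-up example data:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# made-up example data
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 62],
    "income": [40, 55, 80, 90, 110],
    "target": [0, 0, 1, 1, 1],
})

corr = df.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()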
Supervised learning
is a machine learning approach that’s defined by its use of labeled datasets. These datasets are designed to train or “supervise” algorithms into classifying data or predicting outcomes accurately. Using labeled inputs and outputs, the model can measure its accuracy and learn over time.
ex) decision trees, random forest, linear regression, logistic regression
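A minimal sketch of supervised learning with one of the listed algorithms (logistic regression), training on labeled iris data and scoring accuracy on a held-out split:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)         # learn from labeled examples
print(model.score(X_test, y_test))  # accuracy on unseen labeled data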
Unsupervised learning
uses machine learning algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”)
ex) k-means clustering, association (customers who bought this also bought)
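A minimal sketch of unsupervised learning with k-means, clustering the iris features while deliberately ignoring the labels:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # labels are discarded

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # cluster assignment per sample

print(labels[:10])
print(kmeans.cluster_centers_)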
Null hypothesis
A statement in which no difference or effect is expected. Assumes no meaningful relationship between two variables.
Alternate Hypothesis
A statement that some difference or effect is expected. Accepting the alternative hypothesis will lead to changes in opinions or actions. It is the opposite of the null hypothesis.
Type I error
occurs when the sample results lead to the rejection of the null hypothesis when it is in fact true. Type I errors are equivalent to false positives.
Type II error
occurs when, based on the sample results, the null hypothesis is not rejected when it is in fact false. Type II errors are equivalent to false negatives.
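A minimal sketch of the decision that produces these errors, using a two-sample t-test from scipy; the two samples are made-up draws from the same population, so the null hypothesis is actually true here:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)  # same population,
group_b = rng.normal(loc=0.0, scale=1.0, size=50)  # so H0 is true

t_stat, p_value = stats.ttest_ind(group_a, group_b)
alpha = 0.05  # tolerated Type I error rate

if p_value < alpha:
    # rejecting a true H0 would be a Type I error (false positive)
    print("Reject H0")
else:
    # failing to reject a false H0 would be a Type II error (false negative)
    print("Fail to reject H0")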