Data Science basics Flashcards
Feature scaling
In practice, we often encounter different types of variables in the same dataset. A significant issue is that the ranges of these variables may differ a lot, and using the original scales may place more weight on the variables with a large range.
The goal of applying feature scaling is to bring the features onto roughly the same scale, so that each feature is equally important and the data is easier for most ML algorithms to process.
Standardization
aka Z-score normalization: features are rescaled so that the mean is 0 and the standard deviation is 1
can be done with the sklearn library (e.g., StandardScaler)
assumes that your data has a Gaussian distribution; useful when your data has varying scales and the algorithm you are using does make assumptions about your data having a Gaussian distribution, e.g., linear and logistic regression
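A minimal sketch using scikit-learn's StandardScaler; the feature matrix X is made-up example data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])  # made-up features on very different scales

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0, std 1

print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]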
Normalization
the goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
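A minimal sketch of min-max normalization with scikit-learn's MinMaxScaler, which rescales each feature to the [0, 1] range; X is again made-up example data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)  # each column now spans [0, 1]
print(X_norm)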
Feature selection
Feature selection is the process where you automatically or manually select the features that contribute most to the prediction variable or output you are interested in.
· Reduces Overfitting: less redundant data means less opportunity to make decisions based on noise.
· Improves Accuracy: less misleading data means modeling accuracy improves.
· Reduces Training Time: fewer features mean lower algorithm complexity, so algorithms train faster.
Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable.
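A minimal sketch of univariate selection, assuming scikit-learn's SelectKBest with the chi-squared test, on the bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 best features
X_selected = selector.fit_transform(X, y)

print(selector.scores_)  # chi2 score per feature
print(X_selected.shape)  # (150, 2)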
Feature Importance
Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable.
built into tree-based algos
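A minimal sketch using a random forest's feature_importances_ attribute in scikit-learn, again on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# one importance score per feature; higher means more relevant
print(model.feature_importances_)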
Correlation Matrix with Heatmap
Correlation states how the features are related to each other or to the target variable.
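A minimal sketch of a correlation matrix heatmap, assuming pandas and seaborn; the DataFrame df here is made-up example data:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# made-up example data
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 62],
    "income": [40, 55, 80, 90, 110],
    "target": [0, 0, 1, 1, 1],
})

corr = df.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()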
Supervised learning
is a machine learning approach that’s defined by its use of labeled datasets. These datasets are designed to train or “supervise” algorithms into classifying data or predicting outcomes accurately. Using labeled inputs and outputs, the model can measure its accuracy and learn over time.
ex) decision trees, random forest, linear regression, logistic regression
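A minimal sketch of supervised learning with one of the listed algorithms (logistic regression), training on labeled iris data and scoring accuracy on a held-out split:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)         # learn from labeled examples
print(model.score(X_test, y_test))  # accuracy on unseen labeled data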
Unsupervised learning
uses machine learning algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”)
ex) k-means clustering, association (customers who bought this also bought)
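A minimal sketch of unsupervised learning with k-means, clustering the iris features while deliberately ignoring the labels:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # labels are discarded

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # cluster assignment per sample

print(labels[:10])
print(kmeans.cluster_centers_)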
Null hypothesis
A statement in which no difference or effect is expected. Assumes no meaningful relationship between two variables.
Alternate Hypothesis
A statement that some difference or effect is expected. Accepting the alternative hypothesis will lead to changes in opinions or actions. It is the opposite of the null hypothesis.
Type I error
occurs when the sample results lead to the rejection of the null hypothesis when it is in fact true. Type I errors are equivalent to false positives.
Type II error
occurs when, based on the sample results, the null hypothesis is not rejected when it is in fact false. Type II errors are equivalent to false negatives.
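A minimal sketch of the decision that produces these errors, using a two-sample t-test from scipy; the two samples are made-up draws from the same population, so the null hypothesis is actually true here:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)  # same population,
group_b = rng.normal(loc=0.0, scale=1.0, size=50)  # so H0 is true

t_stat, p_value = stats.ttest_ind(group_a, group_b)
alpha = 0.05  # tolerated Type I error rate

if p_value < alpha:
    # rejecting a true H0 would be a Type I error (false positive)
    print("Reject H0")
else:
    # failing to reject a false H0 would be a Type II error (false negative)
    print("Fail to reject H0")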