Machine Learning Revision Flashcards
What are the benefits of preprocessing data / data wrangling?
- Transform raw data into a state which the machine can understand and interpret easily.
- Remove redundant information (noise).
- Spot outliers and deal with them to make sure the training is effective and not skewed.
- Real-world data is not perfect to begin with.
What is a feature?
A feature is an individual measurable property or characteristic of a phenomenon being observed. Features can be categorical or numerical.
Name some common steps for data preprocessing in DS/ML?
- Taking care of the missing data
- Encoding Categorical Data
- Feature Scaling
Which library to use for taking care of missing data? Recall the entire code.
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, a:b])                    # learn the column means on numeric columns a..b
X[:, a:b] = imputer.transform(X[:, a:b])  # replace the NaNs with those means
What are some strategies to take care of missing data apart from averaging?
- Mean, median, mode imputation
- Deleting missing data (dropping the full row, or dropping the whole variable (not recommended))
- Time-series-specific methods: Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB), linear interpolation (not good for data with seasonal oscillations), seasonal adjustment + linear interpolation
- Use regression
- Multiple imputation (explained in a later card)
- Use k-NN classification to impute categorical variables (see the sketch after this list)
- Imputation using deep learning, e.g. Datawig
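A minimal sketch of k-NN imputation using scikit-learn's KNNImputer (note that KNNImputer works on numeric features, so categorical variables would need to be encoded first; the toy matrix below is made up):
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
imputer = KNNImputer(n_neighbors=2)  # impute from the 2 nearest complete rows
X = imputer.fit_transform(X)         # the NaN becomes the average of its neighbours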
What is multiple imputation method for taking care of missing data?
Multiple imputation:
- Imputation: impute the missing entries of the incomplete data set m times (e.g. m = 3). Note that imputed values are drawn from a distribution; simulating simple random draws doesn't account for uncertainty in the model parameters, so a better approach is Markov Chain Monte Carlo (MCMC) simulation. This step results in m complete data sets.
- Analysis: analyze each of the m completed data sets.
- Pooling: integrate the m analysis results into a final result. (See the sketch after this card.)
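A sketch of how multiple imputation can be approximated in scikit-learn with the experimental IterativeImputer (sample_posterior=True draws imputed values from a predictive distribution, MICE-style; the toy data and m = 3 are illustrative):
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0]])
m = 3
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(m)
]  # m complete data sets: analyze each, then pool the results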
Types of missing data?
3 types: MCAR, MAR, NMAR:-
Missing completely at random (MCAR): the fact that a value is missing has nothing to do with its hypothetical value or with the values of other variables.
Missing at random (MAR): the propensity for a data point to be missing is not related to the missing data itself, but it is related to some of the observed data.
Not missing at random (NMAR): the missingness depends on the missing value itself, e.g. people with high salaries often don't want to reveal their salaries.
Difference between MCAR and MAR:
For example, if high school GPA data is missing randomly across all schools in a district, that data will be considered MCAR. However, if data is randomly missing for students in specific schools of the district, then the data is MAR.
How to encode categorical data? Which library to use? Provide code:-
OneHotEncoding: transforms n categories into an n-dimensional binary vector.
Encoding independent variables:-
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), columns_to_encode)],  # e.g. [0]
    remainder='passthrough')  # keep the remaining columns unchanged
X = np.array(ct.fit_transform(X))
Encoding a yes/no variable:-
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
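A toy illustration of the one-hot step above (the country names are made up); three categories become three binary columns, ordered alphabetically:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
print(enc.fit_transform(np.array([['France'], ['Spain'], ['Germany']])).toarray())
# [[1. 0. 0.]   France  -> column 0
#  [0. 0. 1.]   Spain   -> column 2
#  [0. 1. 0.]]  Germany -> column 1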
What are the two types of feature scaling? Give their formulas. Code one of them.
Standardisation and normalisation
Standardisation: X = (X - mean(X)) / sd(X)
Normalisation: X = (X - min(X)) / (max(X) - min(X))
Standardisation:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
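For completeness, normalisation uses MinMaxScaler from the same module:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()      # rescales each feature to [0, 1]
X = sc.fit_transform(X)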
Benefits of splitting the dataset into train and test sets
To make sure the model hasn't overfit: it is evaluated on data it has never seen.
Code to split the dataset to train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # test_size: fraction between 0 and 1; random_state is optional
What are some types of Regression?
- Simple Linear
- Multiple Linear
- Polynomial Linear
- Support Vector
- Decision Tree
- Random Forest
Train Linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
NOTE: the class is called LinearRegression; there is no "LinearRegressor" in scikit-learn.
What are the five methods of building models?
- All in
- Backward Elimination
- Forward Selection
- Bidirectional Elimination
- Score Comparison
Steps for Backward Elimination
- Step 1: Select a significance level to stay in the model (default: 5%).
- Step 2: Fit the full model with all possible predictors.
- Step 3: Consider the predictor with the highest p-value. If p > significance level, go to step 4; otherwise finish.
- Step 4: Remove that predictor.
- Step 5: Fit the model without this variable, then go back to step 3. (See the sketch after these steps.)
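A minimal sketch of these steps using statsmodels OLS p-values (assumes X and y are numeric NumPy arrays and 0.05 is the chosen significance level):
import numpy as np
import statsmodels.api as sm

SL = 0.05                                    # step 1: significance level to stay
X_opt = sm.add_constant(X)                   # add intercept column
while True:
    model = sm.OLS(y, X_opt).fit()           # step 2/5: fit the current model
    worst = int(np.argmax(model.pvalues))    # step 3: predictor with highest p-value
    if model.pvalues[worst] <= SL:
        break                                # all predictors significant: finish
    X_opt = np.delete(X_opt, worst, axis=1)  # step 4: remove it, then refit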
Train Multiple linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
NOTE: identical to simple linear regression; the same LinearRegression class handles multiple features automatically.
Train Polynomial Linear Regression
Idea: treat polynomial regression as multiple linear regression with x2 = x^2, x3 = x^3, ... So we can create a matrix of powered features.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
Reshape a 1D array to a 2D array
Let y be a 1D array.
y = y.reshape(len(y), 1)
Train the Support Vector Regression Model
Note: SVR requires the data to be standardised first (see the scaling sketch after this card).
from sklearn.svm import SVR
regressor = SVR(kernel='rbf')  # select your own kernel
regressor.fit(X, y)
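A sketch of the scaling workflow around SVR (X_new, the points to predict on, is hypothetical):
from sklearn.preprocessing import StandardScaler

sc_X, sc_y = StandardScaler(), StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()  # y must be 2D for the scaler
regressor.fit(X_scaled, y_scaled)
# un-scale predictions back to the original units:
y_pred = sc_y.inverse_transform(
    regressor.predict(sc_X.transform(X_new)).reshape(-1, 1))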
Train the Decision Tree Regression model
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(X, y)
Which regression model(s) requires the data to be standardised?
SVR
Train the Random Forest Regression model
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10)  # n_estimators = number of trees
regressor.fit(X, y)
Evaluating the Regression models
R squared (R^2); a better metric is adjusted R^2. Code:-
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
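Adjusted R^2 is not built into scikit-learn; a sketch computing it from r2_score (n = number of test samples, p = number of predictors):
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
n, p = X_test.shape                            # n samples, p predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalises extra predictors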
Train Logistic Regression
Data should be scaled first.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
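A typical follow-up (a sketch, assuming X_test and y_test come from an earlier train/test split): predict on the test set and inspect the confusion matrix and accuracy.
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # rows: actual class, columns: predicted class
print(accuracy_score(y_test, y_pred))    # fraction of correct predictions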