4 Flashcards
SVM: SVM maximizes the
Margin from both clusters
SVM: the margin is
The space between the closest data point and the line
SVM: SVM is an
Algorithm that separates data with a line
SVM: SVM prioritizes
Correct classification over maximizing margin
SVM: To create your SVM classifier, type
from sklearn.svm import SVC
clf = SVC(kernel="linear")
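A minimal sketch of the classifier above in use, assuming scikit-learn is installed. The four training points and their labels are made up for illustration and are not from the original cards.

```python
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values for illustration).
features = [[0, 0], [1, 1], [4, 4], [5, 5]]
labels = [0, 0, 1, 1]

clf = SVC(kernel="linear")
clf.fit(features, labels)

# Two new points, one clearly on each side of the boundary.
predictions = clf.predict([[0.5, 0.5], [4.5, 4.5]])
print(predictions)
```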
SVM: SVM might work poorly if
There are many more features than samples, because it can then overfit
SVM: Can an SVM create a non linear decision boundary?
Yes, using the kernel trick
SVM: SVM stands for
Support vector machine
Python: To run a script from within a script but pull in variables from one script to the next, the best way is to
Use import
datetime: To switch the position of month and day by converting a date string and then replace slashes with dashes, type
from datetime import datetime
date = datetime.strptime("05/01/15", "%m/%d/%y")
date.strftime("%d-%m-%y")
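A short worked run of the conversion above, with the intermediate values named so the swap is visible. The variable names are only for illustration.

```python
from datetime import datetime

date_string = "05/01/15"                          # month/day/year
parsed = datetime.strptime(date_string, "%m/%d/%y")
swapped = parsed.strftime("%d-%m-%y")             # day-month-year with dashes
print(swapped)
```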
SVM: For a linear kernel, a gamma of 1.0 will produce a
Straight decision boundary, because the linear kernel ignores gamma
SVM: Some important parameters for SVM are
Kernel
Gamma
C
SVM: Some common kernels are
rbf and linear
SVM: The C parameter controls
The degree to which the decision boundary will curve to contain all of the training data
SVM: Increasing the value of the C parameter will
Increase the number of training points that fall on the correct side of the decision boundary, but may overfit to the point that the model is no longer predictive
ML: For a very large data set with a lot of noise and without a clear decision boundary, the better model than SVM is
Naive Bayes
Pandas: Series.replace({"value": "value2"}) replaces
every occurrence
ML: Accuracy improves in proportion to
Training data size, with diminishing returns beginning after 700 samples.
ML: To see if your model's prediction accuracy would benefit from more samples you could
Slice your training data into 4 sections and test the accuracy of each cumulative sum: 200, 400, 600, 800. The improvement in accuracy should start suffering diminishing returns at higher sample sizes.
Python: One way to schedule a script is to
import time
from datetime import datetime
while True:
    if int(datetime.now().strftime("%M")) == 30:
        print("30 Minutes!")
        time.sleep(61)
    time.sleep(2)
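A possible variation on the polling loop: compute how long to sleep until the clock next shows minute 30, then sleep once. The helper name `seconds_until_minute` is hypothetical, and this is only a sketch of the idea.

```python
from datetime import datetime, timedelta

def seconds_until_minute(now, minute=30):
    """Seconds from `now` until the clock next shows the given minute."""
    target = now.replace(minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(hours=1)   # already reached it this hour
    return (target - now).total_seconds()

# A fixed time is used here so the result is reproducible.
wait = seconds_until_minute(datetime(2015, 1, 5, 10, 12, 0))
print(wait)
```

In a real script the loop body would call time.sleep(seconds_until_minute(datetime.now())) and then do the work.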
ML: Usually, to increase accuracy it is better to have
More data rather than a more finely tuned algorithm.
datetime: To return the current date, type
import datetime
datetime.datetime.now().strftime("%d/%m/%y")
ML: A high information gain feature is one that is
very common in one classification and not in others
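A rough sketch of the idea, assuming the usual entropy-based definition of information gain used by decision trees. The labels and the two features are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

labels = ["spam", "spam", "ham", "ham"]
informative = [1, 1, 0, 0]     # present only in spam
uninformative = [1, 0, 1, 0]   # equally common in both classes

high_gain = information_gain(informative, labels)
low_gain = information_gain(uninformative, labels)
```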
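Pandas: To change a newly uploaded df’s dates to proper format and also turn numbers into numeric types (the old df.convert_objects method has been removed from pandas), type
df["Number column"] = pandas.to_numeric(df["Number column"], errors="coerce")
df["Date column"] = pandas.to_datetime(df["Date column"])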
Python: To stop a for loop, use the command
break
Python: To test if a number is even, type
number % 2 == 0
ML: The main types of feature data are
Numerical, Categorical, Time series, Text
ML: Continuous supervised learning means that
The output is not binary or categorical but a part of a range
ML: Discrete supervised learning means that
The output is binary or categorical but not a part of a range
Math: The intercept is
The value of y when x is 0
Math: The formula for a regression is
y = m*x+b
Math: Slope can be calculated as
change in y divided by change in x
Math: A coefficient is
The number that multiplies a variable.
Math: With regard to regressions, the coefficient is the
number multiplying x, which determines the slope
Math: A scalar value is a
position on a scale. A quantity.
Math: The regression line is the
line that minimizes the sum of the squared distances between the line and the data points.
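The regression cards above can be tied together with a small plain-Python sketch that finds the slope m and intercept b of the least-squares line. The function name `fit_line` and the data are illustrative only.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    m = covariance / variance    # change in y divided by change in x
    b = mean_y - m * mean_x      # value of y when x is 0
    return m, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                # exactly y = 2*x + 1
m, b = fit_line(xs, ys)
```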
LinearRegression: When you ask a fitted LinearRegression model for a prediction, you must pass it a
2D list (a list of samples, each a list of feature values), even for a single value, e.g. [[27]]
LinearRegression: To return the coefficient and the intercept of a fitted LinearRegression model, type
model.coef_
model.intercept_
Python: To switch the keys and values of a dictionary to be the values and keys using a dict comprehension, type
{value: key for key, value in my_dict.items()}
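A quick run of the comprehension above on made-up dictionaries. Note that if two keys share a value, only the last one survives the swap.

```python
my_dict = {"a": 1, "b": 2, "c": 3}
inverted = {value: key for key, value in my_dict.items()}

# Duplicate values collapse into a single key; the later entry wins.
with_duplicates = {"x": 1, "y": 1}
collapsed = {value: key for key, value in with_duplicates.items()}
```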
LinearRegression: To return the r-squared of a LinearRegression, type
model.score(x_test, y_test)
sklearn: Predictions are returned in an array, so you should type
model.predict([[27]])[0]
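A minimal end-to-end sketch of fitting and predicting, assuming scikit-learn is installed. The training data is hypothetical and chosen to follow y = 2*x + 1 exactly.

```python
from sklearn.linear_model import LinearRegression

# One row per sample, one column per feature.
x_train = [[1], [2], [3], [4]]
y_train = [3, 5, 7, 9]

model = LinearRegression()
model.fit(x_train, y_train)

prediction = model.predict([[27]])[0]   # 2D input, then unwrap the array
slope = model.coef_[0]
intercept = model.intercept_
```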
Numpy: To return the number from array([[25]]), type
array([[25]])[0][0]
LinearRegression: For LinearRegression, “Error” means
The actual y minus the predicted y.
LinearRegression: The algorithm used by LinearRegression is called
ordinary least squares
LinearRegression: Sum of squared error is not a good evaluation metric for a regression because
It increases when more data points are added, despite the fit being the same.
LinearRegression: The evaluation metric most used to describe the goodness of fit of a regression is called
r-squared
LinearRegression: The r-squared explains
How much of the variation in y is explained by x
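A small plain-Python sketch comparing the two metrics. Duplicating every data point leaves the quality of the fit unchanged, yet the sum of squared errors doubles while r-squared stays put. The numbers are invented for illustration.

```python
def sum_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sum_squared_error(actual, predicted) / total

actual = [3.0, 5.0, 8.0, 9.0]
predicted = [3.5, 5.0, 7.0, 9.5]    # hypothetical model output

sse_small = sum_squared_error(actual, predicted)
sse_large = sum_squared_error(actual * 2, predicted * 2)   # same points, twice
r2_small = r_squared(actual, predicted)
r2_large = r_squared(actual * 2, predicted * 2)
```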
Pandas: df.values returns a
numpy 2D array
Pandas: To have a df return a 2D numpy array, type
df.values
sklearn: To scale a pandas df to between 0 and 1, type
import pandas
from sklearn.preprocessing import MinMaxScaler
x = df.values
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)
df2 = pandas.DataFrame(x_scaled)
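With its default settings, MinMaxScaler rescales each column as (x - min) / (max - min). A plain-Python sketch of that formula for a single column follows; the function name and the values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range 0..1.

    A constant column (max == min) would divide by zero here.
    """
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

weights = [115, 140, 175]    # hypothetical column
scaled = min_max_scale(weights)
```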
SQL: After importing a new sql script, make sure to
right click and refresh on the left side in the schemas panel.
SQL: To return two specific columns from a db, type
SELECT table.column, table.column2 FROM table;
SQL: Before querying a db, make sure that the
Correct database is selected and bolded in the panel.
SQL: To return data that meets a criteria, type
SELECT * FROM table WHERE column = 2000;
SQL: In SQL an equivalence test only uses
one equals sign.
SQL: To return data that doesn’t meet a certain criteria, type
SELECT * FROM table WHERE column != 1000;
SQL: The test operators you can use in SQL are
=, !=, >, <, >=, <=
SQL: To query for two criteria, type
SELECT * FROM table WHERE column = 1000 AND column2 = 'Value';
SQL: To query for two optional criteria, type
SELECT * FROM table WHERE column = 1000 OR column2 = 'String';
SQL: To return rows for a range between two values, type
SELECT * FROM table WHERE column BETWEEN 1000 AND 2000;
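The WHERE, AND, OR and BETWEEN cards can be tried without a database server by using Python's built-in sqlite3 module. This is only a sketch: the `films` table and its rows are made up, and SQLite stands in for whatever database the cards were written against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, year INTEGER, rating TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("Alpha", 1999, "PG"), ("Beta", 2000, "R"),
     ("Gamma", 2000, "PG"), ("Delta", 2005, "R")],
)

def titles(query):
    """Run a query and return the sorted first column of every row."""
    return sorted(row[0] for row in conn.execute(query))

both = titles("SELECT title FROM films WHERE year = 2000 AND rating = 'PG'")
either = titles("SELECT title FROM films WHERE year = 2000 OR rating = 'R'")
in_range = titles("SELECT title FROM films WHERE year BETWEEN 1999 AND 2000")
```

BETWEEN includes both end values, which is why the 1999 and 2000 films are all returned.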
SQL: The wildcard symbol is
%
SQL: The LIKE clause is
not case sensitive (with MySQL's default collation)
SQL: To return only rows that have a partial string in them, type
SELECT * FROM table WHERE column LIKE '%string%';
SQL: To sort rows by a column before returning them, type
SELECT * FROM table ORDER BY column ASC;
SQL: To sort by two columns before returning them, type
SELECT * FROM table ORDER BY column DESC, column2 ASC;
SQL: Databases start counting at row
Zero
SQL: To return only the first 10 rows of a table, type
SELECT * FROM table LIMIT 10;
SQL: To return only 10 rows of a table starting at the 20th row (OFFSET is the number of rows to skip), type
SELECT * FROM table LIMIT 10 OFFSET 19;
SQL: Sometimes clients implicitly
Add a limit of 1000 rows to your queries. This would not occur using a programming language.
SQL: A shorter way to return 10 rows of a table starting at the 20th row is
SELECT * FROM table LIMIT 19, 10;
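A sketch that checks both LIMIT forms with Python's built-in sqlite3 module, which accepts the same two syntaxes. The `numbers` table is hypothetical and holds the values 1 to 40 so the row positions are easy to read. OFFSET counts the rows to skip, so skipping 19 rows starts the result at the 20th row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(1, 41)])

# Long form: LIMIT <row count> OFFSET <rows to skip>
long_form = [row[0] for row in conn.execute(
    "SELECT n FROM numbers ORDER BY n LIMIT 10 OFFSET 19")]

# Short form: LIMIT <rows to skip>, <row count>
short_form = [row[0] for row in conn.execute(
    "SELECT n FROM numbers ORDER BY n LIMIT 19, 10")]
```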
SQL: To filter for rows with the value NULL, type
SELECT * FROM table WHERE column IS NULL;
SQL: To sort a column you are returning and also filter out any NULL values, type
SELECT * FROM table WHERE column IS NOT NULL ORDER BY column ASC;
ML: The output type of a supervised classification algorithm is
Binary or categorical (Discrete)
ML: The output type of a regression algorithm is
A number that is part of a range (Continuous)
ML: A regression with multiple input variables is called a
Multiple regression (often loosely called multivariate regression)
LinearRegression: To remove outliers, follow the procedure
Train, remove the ~10% of data points with the highest error, re-train on the remaining points
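The three steps can be sketched in plain Python with a simple least-squares line standing in for LinearRegression. The function names, the 10% default and the data (one planted outlier) are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return m, mean_y - m * mean_x

def remove_outliers(xs, ys, fraction=0.1):
    m, b = fit_line(xs, ys)                                   # 1. train
    errors = [abs(y - (m * x + b)) for x, y in zip(xs, ys)]
    keep = len(xs) - int(round(len(xs) * fraction))
    ranked = sorted(zip(errors, xs, ys))[:keep]               # 2. drop the worst
    kept_xs = [x for _, x, _ in ranked]
    kept_ys = [y for _, _, y in ranked]
    return kept_xs, kept_ys, fit_line(kept_xs, kept_ys)       # 3. re-train

xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 40                      # planted outlier; the true value is 11
kept_xs, kept_ys, (m, b) = remove_outliers(xs, ys)
```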
ML: The most popular clustering algorithm is called
k-means
k-means: k-means first assigns the centroids
randomly
k-means: k-means requires you to tell it the
number of centroids
k-means: k-means places the centroid by
moving it to reduce the total distance to the points assigned to it, until it sits in the middle of them
k-means: Make sure to adjust the default
n_clusters parameter
k-means: To control the number of iterations of assigning points to a centroid and then moving it to the middle, change the
max_iter parameter
k-means: To control how many times k-means initializes the algorithm to avoid bad clustering due to the randomness of the initial centroids, change the
n_init parameter
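The assign-then-move loop described in the k-means cards can be sketched in plain Python for one-dimensional data. This is not scikit-learn's implementation. The starting centroids are fixed rather than random so the run is reproducible, the length of the `centroids` list plays the role of n_clusters, and `max_iter` caps the number of assign/move rounds.

```python
def k_means(points, centroids, max_iter=10):
    """Assign each point to its nearest centroid, then move each centroid."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move every centroid to the mean of its points (keep it if empty).
        moved = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]
        if moved == centroids:      # nothing changed, so stop early
            break
        centroids = moved
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means(points, centroids=[0.0, 5.0])
```

Repeating the whole run from several different random starts and keeping the best result is what n_init controls.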
sklearn: The algorithms that require rescaling are
K-means and SVM(kernel="rbf"), because both depend on distances between points. Decision trees and plain linear regression are unaffected by scaling.
sklearn: The algorithm that counts the frequency of words is called
CountVectorizer
sklearn: bag of words will generally count plurals
as separate words, unless the text is stemmed first
Bag of words: There is a bias for long texts because
they have the opportunity to have a higher frequency of words.
Bag of words: To process text into a bag of words, type
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
text_list = [text_var, text_var2, text_var3]
vectorizer.fit(text_list)
bag_of_words = vectorizer.transform(text_list)
Bag of words: The CountVectorizer.transform method
counts the occurrences of the words.
Bag of words: To return the column index (feature number) assigned to a specific word, type
vectorizer.vocabulary_.get("word")
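A plain-Python stand-in for what the vectorizer produces, meant only to show the difference between a word's column index and its count. The texts are made up, and splitting on whitespace is a simplification of CountVectorizer's tokenizer.

```python
from collections import Counter

texts = ["the cat sat", "the cat saw the dog"]    # hypothetical documents
tokenized = [text.lower().split() for text in texts]

# Map each distinct word to a column index, in alphabetical order.
all_words = sorted(set(word for doc in tokenized for word in doc))
vocabulary = {word: index for index, word in enumerate(all_words)}

# One row per document, one column per word.
counts = []
for doc in tokenized:
    tally = Counter(doc)
    counts.append([tally[word] for word in vocabulary])

the_column = vocabulary.get("the")             # a column index, not a count
the_count_in_doc2 = counts[1][the_column]      # the number of occurrences
```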
Bag of words: To reduce noise in a bag of words, you should
remove stopwords
Bag of words: A stopword is
a high frequency, low information word e.g. “the”
nltk: To get a list of stopwords, type
import nltk
nltk.download() #Download all the files
from nltk.corpus import stopwords
sw = stopwords.words("english")
Bag of words: To bundle words that have the same root, type
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("word")
Bag of words: Before applying CountVectorizer or TfidfVectorizer, you should
use a stemmer
Bag of words: The TF-IDF representation
gives a higher weight to words that are rare in the corpus as a whole.
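A sketch of the textbook formula, tf * log(N / df), on made-up documents. TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ, but the effect is the same: a word found in every document scores zero here, while a rare word scores higher.

```python
from math import log

# Hypothetical, already tokenized documents.
docs = [["cat", "sat"], ["cat", "ran"], ["cat", "dog", "dog"]]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)                  # term frequency
    containing = sum(1 for d in docs if word in d)   # document frequency
    idf = log(len(docs) / containing)                # rarer means larger
    return tf * idf

common = tf_idf("cat", docs[2], docs)   # "cat" appears in every document
rare = tf_idf("dog", docs[2], docs)     # "dog" appears in one document only
```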
Pandas: When you use a list to return a column, df[["Column"]], it returns a
DataFrame
Pandas: To delete a column, type
del df["Column"]
Pandas: To return a boolean index based on values containing a partial string, type
df["Column name"].str.contains("string")
sklearn: Everything that goes to the estimators must be
a float
Pandas: To efficiently remove the labels column from the feature columns, type
y = df.pop("Column name")
Pandas: To automatically create dummy variables for any string in a df, type
x = pandas.get_dummies(df)
Pandas: To filter for column names based on partial string, type
df.filter(regex="Column String")
Pandas: Sometimes to replace a numpy boolean column with integers, you need to
turn it into a string first.
df["Boolean Column"].astype(str).replace({"True":1})