4 Flashcards

1
Q

SVM: SVM maximizes the

A

Margin from both clusters

2
Q

SVM: the margin is

A

The space between the closest data point and the line

3
Q

SVM: SVM is an

A

Algorithm that separates data with a line

4
Q

SVM: SVM prioritizes

A

Correct classification over maximizing margin

5
Q

SVM: To create your SVM classifier, type

A

from sklearn.svm import SVC

clf = SVC(kernel="linear")

6
Q

SVM: SVM might work poorly if

A

There are more features than samples

7
Q

SVM: Can an SVM create a non linear decision boundary?

A

Yes, using the kernel trick
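
As a rough illustration of the kernel trick, the sketch below compares a linear and an rbf kernel on two concentric rings, a shape no straight line can separate. The dataset (make_circles) and all variable names are assumptions chosen for the example, not part of the original card.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: the classes cannot be split by a straight line.
x, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel="linear").fit(x, y)
rbf_clf = SVC(kernel="rbf").fit(x, y)

# Accuracy on the training data, just to compare the two boundaries.
linear_acc = linear_clf.score(x, y)
rbf_acc = rbf_clf.score(x, y)
print(linear_acc, rbf_acc)
```

The rbf kernel should score noticeably higher here because its decision boundary can curve around the inner ring.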

8
Q

SVM: SVM stands for

A

Support vector machine

9
Q

Python: To run a script from within a script but pull in variables from one script to the next, the best way is to

A

Use import

10
Q

datetime: To switch the position of month and day by converting a date string and then replace slashes with dashes, type

A

from datetime import datetime
date = datetime.strptime("05/01/15", "%m/%d/%y")
date.strftime("%d-%m-%y")

11
Q

SVM: For a linear kernel, a gamma of 1.0 will produce a

A

Straight decision boundary (the linear kernel ignores gamma)

12
Q

SVM: Some important parameters for SVM are

A

Kernel
Gamma
C

13
Q

SVM: Some common kernels are

A

rbf and linear

14
Q

SVM: the c parameter controls

A

The degree to which the decision boundary will curve to contain all of the training data

15
Q

SVM: increasing the value of the c parameter will

A

Increase the number of training points that are classified correctly, but may overfit to the point that the model is no longer predictive

16
Q

ML: For a very large data set with a lot of noise and without a clear decision boundary, a better model than SVM is

A

Naive Bayes

17
Q

Pandas: the Series.replace({"value": "value2"}) replaces

A

every occurrence

18
Q

ML: Accuracy improves proportional to

A

Training data size, with diminishing returns beginning after 700 samples.

19
Q

ML: To see if your model's prediction accuracy would benefit from more samples you could

A

Slice your training data into 4 sections and test the accuracy at each cumulative size: 200, 400, 600, 800. The improvement in accuracy should start showing diminishing returns at higher sample sizes.
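
The slicing procedure above can be sketched as follows. The synthetic dataset, the GaussianNB model, and the variable names are assumptions for illustration; any estimator with fit and score would do.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: 800 training samples and 200 held-out test samples.
x, y = make_classification(n_samples=1000, n_features=10, random_state=0)
x_train, y_train = x[:800], y[:800]
x_test, y_test = x[800:], y[800:]

# Train on growing slices of the training set and score each on the same test set.
scores = {}
for size in (200, 400, 600, 800):
    model = GaussianNB().fit(x_train[:size], y_train[:size])
    scores[size] = model.score(x_test, y_test)

print(scores)
```

If the scores level off as the slice grows, more samples are unlikely to help much.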

20
Q

Python: One way to schedule a script is to

A
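import time
from datetime import datetime

# Poll the clock and act once whenever the minute reaches 30.
while True:
    if datetime.now().minute == 30:
        print("30 Minutes!")
        time.sleep(61)  # skip past minute 30 so it fires only once per hour
    time.sleep(2)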
21
Q

ML: Usually, to increase accuracy it is better to have

A

More data rather than a more finely tuned algorithm.

22
Q

datetime: To return the current date, type

A

import datetime

datetime.datetime.now().strftime("%d/%m/%y")

23
Q

ML: A high information gain feature is one that is

A

very common in one classification and not in others

24
Q

Pandas: To change all of a newly uploaded df’s dates to proper format and also turn numbers into numeric types, type

A

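df = df.convert_objects(convert_numeric=True)

(convert_objects was removed in pandas 1.0; on newer versions, apply pandas.to_numeric and pandas.to_datetime to the relevant columns instead.)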

25
Q

Python: To stop a for loop, use the command

A

break

26
Q

Python: To test if a number is even, type

A

number % 2 == 0

27
Q

ML: The main types of feature data are

A

Numerical, Categorical, Time series, Text

28
Q

ML: Continuous supervised learning means that

A

The output is not binary or categorical but a part of a range

29
Q

ML: Discrete supervised learning means that

A

The output is binary or categorical but not a part of a range

30
Q

Math: The intercept is

A

Value of the y axis when the x axis is 0

31
Q

Math: The formula for a regression is

A

y = m*x+b

32
Q

Math: Slope can be calculated as

A

change in y divided by change in x

33
Q

Math: A coefficient is

A

The number that multiplies a variable.

34
Q

Math: With regards to regressions, the coefficient is the

A

number that multiplies x and determines the slope

35
Q

Math: A scalar value is a

A

position on a scale. A quantity.

36
Q

Math: The regression line is the

A

line that minimizes the sum of the squared distances between the line and the data points.

37
Q

LinearRegression: When you ask a fitted LinearRegression model for a prediction, you must pass it a

A

2D list (one inner list per sample), even if it holds a single value

38
Q

LinearRegression: To return the coefficient and the intercept of a fitted LinearRegression model, type

A

model.coef_

model.intercept_

39
Q

Python: To switch the keys and values of a dictionary to be the values and keys using a dict comprehension, type

A

{value: key for key, value in my_dict.items()}

40
Q

LinearRegression: To return the r-squared of a LinearRegression, type

A

model.score(x_test, y_test)

41
Q

sklearn: Predictions are returned in an array,

so you should type

A

model.predict([[27]])[0]

42
Q

Numpy: To return the number from array([[25]]), type

A

array([[25]])[0][0]

43
Q

LinearRegression: For LinearRegression, “Error” means

A

The actual y minus the predicted y.

44
Q

LinearRegression: The algorithm used by LinearRegression is called

A

ordinary least squares

45
Q

LinearRegression: Sum of squared error is not a good evaluation metric for a regression because

A

It increases when more data points are added, despite the fit being the same.

46
Q

LinearRegression: The evaluation metric most used to describe the goodness of fit of a regression is called

A

r-squared

47
Q

LinearRegression: The r-squared explains

A

How much of the change in y is explained by the change in x
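
To make r-squared concrete, the sketch below computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, then compares it with model.score. The five data points are made up for the example.

```python
import numpy
from sklearn.linear_model import LinearRegression

# Made-up, nearly linear data (roughly y = 2x).
x = numpy.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = numpy.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(x, y)
predicted = model.predict(x)

ss_res = numpy.sum((y - predicted) ** 2)   # variation the line fails to explain
ss_tot = numpy.sum((y - y.mean()) ** 2)    # total variation in y
r_squared = 1 - ss_res / ss_tot

print(r_squared, model.score(x, y))
```

Both printed values should agree, since score() on a regressor returns the same quantity.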

48
Q

Pandas: df.values returns a

A

numpy 2D array

49
Q

Pandas: To have a df return a 2D numpy array, type

A

df.values

50
Q

sklearn: To scale a pandas df to between 0 and 1, type

A
from sklearn.preprocessing import MinMaxScaler
import pandas
x = df.values
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)
df2 = pandas.DataFrame(x_scaled)
51
Q

SQL: After importing a new sql script, make sure to

A

right click and refresh on the left side in the schemas panel.

52
Q

SQL: To return two specific columns from a db, type

A

SELECT table.column, table.column2 FROM table;

53
Q

SQL: Before querying a db, make sure that the

A

Correct database is selected and bolded in the panel.

54
Q

SQL: To return data that meets a criteria, type

A

SELECT * FROM table WHERE column = 2000;

55
Q

SQL: In SQL an equivalence test only uses

A

one equals sign.

56
Q

SQL: To return data that doesn’t meet a certain criteria, type

A

SELECT * FROM table WHERE column != 1000;

57
Q

SQL: The test operators you can use in SQL are

A

=, != (or <>), >, <, >=, <=

58
Q

SQL: To query for two criteria, type

A

SELECT * FROM table WHERE column = 1000 AND column2 = 'Value';

59
Q

SQL: To query for two optional criteria, type

A

SELECT * FROM table WHERE column = 1000 OR column2 = 'String';

60
Q

SQL: To return rows for a range between two values, type

A

SELECT * FROM table WHERE column BETWEEN 1000 AND 2000;

61
Q

SQL: The wildcard symbol is

A

%

62
Q

SQL: The LIKE clause is

A

not case sensitive

63
Q

SQL: To return only rows that have a partial string in them, type

A

SELECT * FROM table WHERE column LIKE '%string%';

64
Q

SQL: To sort columns before returning them, type

A

SELECT * FROM table ORDER BY column ASC;

65
Q

SQL: To sort by two columns before returning them, type

A

SELECT * FROM table ORDER BY column DESC, column2 ASC;

66
Q

SQL: Databases start counting at row

A

Zero

67
Q

SQL: To return only the first 10 rows of a table, type

A

SELECT * FROM table LIMIT 10;

68
Q

SQL: To return only 10 rows of a table starting at the 20th row, type

A

SELECT * FROM table LIMIT 10 OFFSET 19;

69
Q

SQL: Sometimes clients implicitly

A

Add a limit of 1000 rows to your queries. This would not occur using a programming language.

70
Q

SQL: A shorter way to return only the first 10 rows of a table starting at the 20th row is

A

SELECT * FROM table LIMIT 19, 10;

71
Q

SQL: To filter for rows with the value NULL, type

A

SELECT * FROM table WHERE column IS NULL;

72
Q

SQL: To sort a column you are returning and also filter out any NULL values, type

A

SELECT * FROM table WHERE column IS NOT NULL ORDER BY column ASC;

73
Q

ML: The output type of a supervised classification algorithm is

A

Binary or categorical (Discrete)

74
Q

ML: The output type of a regression algorithm is

A

A number that is part of a range (Continuous)

75
Q

ML: A regression with multiple input variables is called a

A

Multi-variate regression

76
Q

LinearRegression: To remove outliers, follow the procedure

A

Train, Remove ~10% data points with highest error, Re-train remaining points
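
A minimal sketch of that train, trim, re-train procedure, assuming synthetic data with a few injected outliers. The data, the 90% cut-off expressed as an index slice, and the variable names are illustrative choices.

```python
import numpy
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 3x + 5 with small noise and five large outliers.
rng = numpy.random.default_rng(0)
x = numpy.arange(100, dtype=float).reshape(-1, 1)
y = 3 * x.ravel() + 5 + rng.normal(0, 1, size=100)
y[::20] += 200

# Step 1: train on everything.
model = LinearRegression().fit(x, y)

# Step 2: drop the 10% of points with the highest absolute error.
errors = numpy.abs(y - model.predict(x))
keep = numpy.argsort(errors)[: int(len(y) * 0.9)]
x_clean, y_clean = x[keep], y[keep]

# Step 3: re-train on the remaining points.
model_clean = LinearRegression().fit(x_clean, y_clean)
print(model.coef_[0], model_clean.coef_[0])
```

The slope of the re-trained model should sit closer to the true value of 3 than the first fit.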

77
Q

ML: The most popular clustering algorithm is called

A

k-means

78
Q

k-means: k-means first assigns the centroids

A

randomly

79
Q

k-means: k-means requires you to tell it the

A

number of centroids

80
Q

k-means: k-means places the centroid by

A

repeatedly moving each centroid to the mean of the points currently assigned to it, which reduces the total distance to those points, until it sits in the middle of its cluster

81
Q

k-means: Make sure to adjust the default

A

n_clusters parameter

82
Q

k-means: To control the number of iterations of assigning points to a centroid and then moving it to the middle, change the

A

max_iter parameter

83
Q

k-means: To control how many times k-means initializes the algorithm to avoid bad clustering due to the randomness of the initial plot, change the

A

n_init parameter
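
The three k-means parameters from the cards above (n_clusters, max_iter, n_init) can be seen together in a short sketch. The two synthetic blobs and the specific parameter values are assumptions for illustration.

```python
import numpy
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs, centred near (0, 0) and (10, 10).
rng = numpy.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
x = numpy.vstack([blob_a, blob_b])

model = KMeans(
    n_clusters=2,   # number of centroids
    n_init=10,      # number of random restarts; the best run is kept
    max_iter=300,   # cap on assign-then-move iterations per run
    random_state=0,
).fit(x)

# Sort the centroids by their first coordinate so the order is predictable.
centers = model.cluster_centers_[numpy.argsort(model.cluster_centers_[:, 0])]
print(centers)
```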

84
Q

sklearn: The algorithms that require rescaling are

A

K-means and SVM(kernel="rbf"). For the rest don't bother.

85
Q

sklearn: The algorithm that counts the frequency of words is called

A

CountVectorizer

86
Q

sklearn: bag of words will generally count plurals

A

as separate words, unless the text is stemmed to the root word first

87
Q

Bag of words: There is a bias for long texts because

A

they have the opportunity to have a higher frequency of words.

88
Q

Bag of words: To process text into a bag of words, type

A

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
text_list = [text_var, text_var2, text_var3]
vectorizer.fit(text_list)  # learn the vocabulary
bag_of_words = vectorizer.transform(text_list)  # count each word per text

89
Q

Bag of words: The CountVectorizer.transform method

A

counts the occurrences of the words.

90
Q

Bag of words: To return the column index of a specific word in a fitted vectorizer (needed to look up its counts), type

A

vectorizer.vocabulary_.get("word")
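
A small sketch of how vocabulary_ relates to the counts: it maps each word to a column of the transformed matrix, and the counts live in that column. The two sample sentences are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

text_list = ["the dog barks", "the cat and the dog"]
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(text_list)

# vocabulary_ gives the column index of the word, not how often it occurs.
column = vectorizer.vocabulary_.get("the")

# Summing that column gives the total occurrences across all texts.
total_count = int(bag_of_words[:, column].sum())
print(column, total_count)
```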

91
Q

Bag of words: To reduce noise in a bag of words, you should

A

remove stopwords

92
Q

Bag of words: A stopword is

A

a high frequency, low information word e.g. “the”

93
Q

nltk: To get a list of stopwords, type

A

import nltk
nltk.download() #Download all the files
from nltk.corpus import stopwords
sw = stopwords.words("english")

94
Q

Bag of words: To bundle words that have the same root, type

A

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

95
Q

Bag of words: Before applying CountVectorizer or TfidfVectorizer, you should

A

use a stemmer

96
Q

Bag of words: The TFIDF representation

A

weights words that occur rarely across the corpus as a whole more highly.
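
To see the weighting in action, the sketch below builds a tiny made-up corpus in which "apple" appears in every document and "banana" in only one, then compares their weights within the first document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "apple" is in every document; "banana" is in only the first one.
corpus = ["apple banana", "apple cherry", "apple durian"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

first_doc = tfidf[0].toarray().ravel()
apple_weight = first_doc[vectorizer.vocabulary_["apple"]]
banana_weight = first_doc[vectorizer.vocabulary_["banana"]]
print(apple_weight, banana_weight)
```

Both words occur once in the first document, so the difference in weight comes only from how rare each word is across the corpus.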

97
Q

Pandas: When you use a list to return a column, df[["Column"]], it returns a

A

DataFrame

98
Q

Pandas: To delete a column, type

A

del df["Column"]

99
Q

Pandas: To return a boolean index based on values containing a partial string, type

A

df["Column name"].str.contains("string")

100
Q

sklearn: Everything that goes to the estimators must be

A

a float

101
Q

Pandas: To efficiently remove the labels column from the feature columns, type

A

y = df.pop("Column name")

102
Q

Pandas: To automatically create dummy variables for any string in a df, type

A

x = pandas.get_dummies(df)

103
Q

Pandas: To filter for column names based on partial string, type

A

df.filter(regex="Column String")

104
Q

Pandas: Sometimes to replace a numpy boolean column with integers, you need to

A

turn it into a string first.

df["Boolean Column"].astype(str).replace({"True": 1})

105
Q

sklearn: To select only powerful features, use the

A

SelectPercentile method

106
Q

sklearn: Tf-Idf stands for

A

term frequency–inverse document frequency

107
Q

Bag of words: When using bag of words it is recommended to use the vectorizer

A

from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer(stop_words="english")

108
Q

sklearn: The easiest model to get a good result from without much tuning is

A

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

109
Q

sklearn: To make the TfidfVectorizer ignore all words that appear in more than 50% of all corpus samples, type

A

TfidfVectorizer(stop_words="english", max_df=0.5)

110
Q

sklearn: High bias means

A

the model doesn't fit the training data well

111
Q

sklearn: High variance means

A

the model fits the training data too well and poorly predicts the test data due to overfitting

112
Q

sklearn: automatically finding the optimal number of features is called

A

regularization

113
Q

sklearn: the tradeoff between number of features and accuracy is calculated using the

A

lasso regression

114
Q

sklearn: All supervised learning algorithms require both

A

samples by features and labels

115
Q

Pandas: To see if there is a difference between the training and testing df’s columns, type

A

df_train.columns.difference(df_test.columns)

116
Q

sklearn: To binarize categorical data from a df, type

A

from sklearn.feature_extraction import DictVectorizer
vectorizer = DictVectorizer(sparse = False)
x_train = vectorizer.fit_transform(df_train.T.to_dict().values())
x_test = vectorizer.transform(df_test.T.to_dict().values())

117
Q

sklearn: To turn a new test data sample (that is in df form) into the binarized format after it has been fitted, type

A

vectorizer.transform(df_test.T.to_dict().values())

118
Q

DictVectorizer: DictVectorizer does not change

A

Numerical values. Only categorical.

119
Q

DictVectorizer: You should use the DictVectorizer

A

after scaling the numerical data, since it will not alter numbers and will fix the categorical data into binary.

120
Q

sklearn: To scale certain df columns to between 0 and 1, type

A

from sklearn.preprocessing import MinMaxScaler

dfTest[["A", "B"]] = MinMaxScaler().fit_transform(dfTest[["A", "B"]])

121
Q

DictVectorizer: To return a list of all the columns of a fitted DictVectorizer, type

A

vectorizer.get_feature_names_out()

(get_feature_names() in scikit-learn versions before 1.0)

122
Q

Python: When writing files to disk, to avoid overwriting them

A

Add the current datetime to the title

123
Q

PCA: PCA obtains its new coordinate system from only

A

Translation and rotation

124
Q

sklearn: To automatically test several different parameters and choose the best based on cross-validation, use

A

from sklearn.model_selection import GridSearchCV

(the sklearn.grid_search module was removed in scikit-learn 0.20)

125
Q

SQL: SQL syntax is split into the two sublanguages

A

DDL and DML

126
Q

SQL: DDL stands for

A

data definition language

127
Q

SQL: DDL deals with

A

schema, the structure of the tables

128
Q

SQL: DML stands for

A

data manipulation language

129
Q

SQL: DML deals with

A

creating, reading, updating, and deleting data.

130
Q

SQL: A database is a

A

container for groups of tables.

131
Q

SQL: to create a database, type

A

CREATE SCHEMA IF NOT EXISTS my_database_name DEFAULT CHARACTER SET utf8;

132
Q

SQL: To specify which database you want the following query to be run on, type

A

USE my_database_name;

133
Q

SQL: If a script trips an error it will

A

stop running, while a warning will allow it to continue

134
Q

sklearn: After model.predict() returns a number label, to get the name it was previously, I can

A

Get the dict that replaced the values with numbers.
Use a dict comprehension to swap the keys and values.
Call my_new_dict.get() on the reversed dict with the predicted number as the key.
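
The steps above can be sketched in plain Python. The label names and the predicted number are placeholders standing in for a real mapping and a real model.predict() result.

```python
# Mapping that was used to turn text labels into numbers before training.
label_to_number = {"cat": 0, "dog": 1, "bird": 2}

# Swap keys and values with a dict comprehension.
number_to_label = {number: label for label, number in label_to_number.items()}

# Placeholder for model.predict(sample)[0].
predicted_number = 1

predicted_label = number_to_label.get(predicted_number)
print(predicted_label)
```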

135
Q

PCA: PCA finds the

A

Centre of the data and the principal axis of variation

136
Q

Pandas: To print all of the unique values and their frequencies for every column, type

A

for column in df:
    print(df[column].value_counts())

137
Q

Pandas: To filter to see all of the rows that contain a numpy.nan, type

A

mask = df.isnull().any(axis=1)

df[mask]

138
Q

Pandas: To use .apply on multiple columns at once, type

A

df[["Column", "Column 2"]] = df[["Column", "Column 2"]].apply(lambda x: x*5)

139
Q

MinMaxScaler: To undo scaling use

A

inverse_transform

140
Q

cross_validation: If using cross validation, check accuracy score with

A

model.score(x_test, y_test)

141
Q

ML: The labels column for supervised classification can be

A

Text format as well as number

142
Q

ML: The sample you use for the prediction (if you are using DataFrameMapper) must be

A

a df with column labels, so that the DictVectorizer can know which columns each value belongs to

143
Q

ML: When feeding a sample to model.predict(), it is unnecessary to

A

Have all of the columns' data present, because the DictVectorizer will transform the format of the columns you feed it into the same format as the fitted training set.

144
Q

Pandas: When creating a DataFrame from a dict, remember to

A

put the values in a list.

145
Q

sklearn: Any columns that are not present when you enter the prediction will be

A

assumed to have a value of zero.

146
Q

sklearn: Before binarizing the training samples with DictVectorizer, remember to

A

fill any nan cells with something, otherwise the resulting array won’t work

147
Q

Pandas: When using .replace({}), remember to

A

either reassign the variable or add the parameter inplace=True

148
Q

Python: To find the nth occurrence of a character in a string, type

A

my_string.replace("", "", 2).find("")
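
One way to read the replace-then-find idea is sketched below: mask the first n - 1 occurrences with a same-length placeholder so positions do not shift, then find the next one. The helper name find_nth and the use of the null character as the placeholder are assumptions for this example, and it assumes the string itself contains no null characters.

```python
def find_nth(my_string, target, n):
    """Return the index of the nth non-overlapping occurrence of target, or -1."""
    # Mask the first n - 1 occurrences with a placeholder of equal length,
    # so the index of the remaining occurrences stays unchanged.
    masked = my_string.replace(target, "\0" * len(target), n - 1)
    return masked.find(target)


print(find_nth("a-b-c-d", "-", 2))
```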

149
Q

sklearn: It is appropriate to use a regression model when

A

the output is continuous

150
Q

PCA: With regards to PCA, variance means

A

how spread out the data is.

151
Q

PCA: What PCA does is it

A

compresses data points onto the lines of maximal variance, to minimize information loss, and uses those lines as the principal components, which are ranked by the variance they capture

152
Q

ML: To see the two underlying factors driving the data you can use

A

PCA

153
Q

PCA: PCA when applied to faces is also called

A

eigenfaces

154
Q

PCA: The format for a PCA model is

A

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(x_train)

155
Q

PCA: The eigenvalues show

A

how much of the variation each principal component has.

156
Q

PCA: To return the eigenvalues as a share of the total variance, type

A

pca.explained_variance_ratio_

157
Q

PCA: To return the principal components, the number of which you specified for the model, type

A

pca.components_
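
The PCA attributes from the cards above can be inspected on a small synthetic example. The two strongly correlated features below are made up so that one principal component carries almost all of the variance.

```python
import numpy
from sklearn.decomposition import PCA

# Two features where the second is almost an exact multiple of the first.
rng = numpy.random.default_rng(0)
feature_1 = rng.normal(size=200)
feature_2 = 2 * feature_1 + rng.normal(scale=0.1, size=200)
x_train = numpy.column_stack([feature_1, feature_2])

pca = PCA(n_components=2).fit(x_train)

print(pca.explained_variance_ratio_)  # share of variance per component
print(pca.components_)                # one row per principal axis
```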

158
Q

PCA: PCA can be used for

A

facial recognition, by using it as pre-processing for an SVM, as it finds the trends in the images.

159
Q

PCA: To use PCA as preprocessing for images that will be fed to an SVM, type

A

from sklearn.decomposition import PCA
pca = PCA(n_components=150, whiten=True, svd_solver="randomized")  # replaces RandomizedPCA, removed in scikit-learn 0.20
pca.fit(x_train)
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)

160
Q

ML: To feed a range of model parameters to model and have it self optimize using cross validation, use

A

GridSearchCV

161
Q

GridSearchCV: The format for using GridSearchCV is

A

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search before scikit-learn 0.20
from sklearn.svm import SVC
param_grid = {"C": [0.1, 0.5, 1.0], "gamma": [5, 8, 9], "kernel": ("linear", "rbf")}
model = GridSearchCV(SVC(), param_grid)
model.fit(x_train, y_train)
model.best_estimator_

162
Q

GridSearchCV: model.grid_scores_ returns

A

a triple with the parameter combination, its average score, and its separate scores by fold. (In scikit-learn 0.20 and later, grid_scores_ is replaced by the cv_results_ dict.)

163
Q

GridSearchCV: GridSearchCV automatically

A

refits to the best parameters

164
Q

Pandas: To rename all the columns by adding a string to the end of the original name, type

A

df.columns = df.columns + "my_string"

165
Q

datetime: The symbol for 2015 is

A

%Y

166
Q

datetime: The symbol for zero padded day of month 01 is

A

%d

167
Q

datetime: The symbol for zero padded month is

A

%m

168
Q

datetime: The symbol for locale's month text abbreviation is

A

%b

169
Q

Pandas: To replace a portion of the string values of a Series, type

A

df["Column"] = df["Column"].str.replace("before", "after")

170
Q

GridSearchCV: To return a list of all of the combinations of parameters GridSearchCV tried, type

A

for params in model.cv_results_["params"]:  # model.grid_scores_ before scikit-learn 0.20
    print(params)

171
Q

GridSearchCV: When passing the grid_params specifically the {“kernel”:.., the testing values can be

A

either in a list or in parentheses, but the rest must be in a list.

172
Q

GridSearchCV: In model.grid_scores_, the “mean” value stands for

A

average validation score

173
Q

GridSearchCV: GridSearchCV performs its cross validation on

A

folds within training set.

174
Q

SQL: To create a table with a text column that cannot insert null values, type

A

CREATE TABLE IF NOT EXISTS tablename (columnname TEXT NOT NULL);

175
Q

SQL: The default engine for mysql now is

A

InnoDB

176
Q

SQL: To specify which engine you want to use, type

A

CREATE TABLE IF NOT EXISTS table (columnname INTEGER) ENGINE InnoDB;

177
Q

SQL: To insert a new row into a table while listing the columns in a different order than the table defines them, type

A

INSERT INTO tablename (column_2, column_1) VALUES ('String', 100);

178
Q

SQL: To insert multiple rows into a table at once, type

A

INSERT INTO tablename (column, column_2) VALUES ('String', 100), (NULL, 2000);

179
Q

sklearn: You should only fit the preprocessing/feature extraction and fit the model to the

A

x_train. x_test should only receive a transform in preprocessing/feature extraction and predict/validate from the model.
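
A minimal sketch of that discipline with MinMaxScaler, using made-up numbers: the scaler learns its range from x_train only, and x_test is merely transformed, so test values can fall outside 0 to 1.

```python
import numpy
from sklearn.preprocessing import MinMaxScaler

x_train = numpy.array([[0.0], [5.0], [10.0]])
x_test = numpy.array([[2.5], [20.0]])

scaler = MinMaxScaler().fit(x_train)        # fit on the training data only
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)    # transform only, never refit

# inverse_transform undoes the scaling.
x_test_restored = scaler.inverse_transform(x_test_scaled)
print(x_test_scaled.ravel(), x_test_restored.ravel())
```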

180
Q

sklearn_pandas: To feed all of the columns to sklearn_pandas, type

A

mapper = DataFrameMapper([
("column1", sklearn.preprocessing.LabelBinarizer()),
(["column2"], sklearn.preprocessing.MinMaxScaler()),  # list selector passes the 2D input that scalers expect
("column3", sklearn.feature_extraction.text.TfidfVectorizer()),
("column4", None)
])

mapper.fit_transform(df.copy())

181
Q

To import sklearn_pandas, type

A

from sklearn_pandas import DataFrameMapper

from sklearn_pandas import cross_val_score

182
Q

sklearn_pandas: mapper.fit_transform(df.copy()) only transforms

A

The columns that you passed into the DataFrameMapper

183
Q

sklearn_pandas: If sklearn_pandas is asked to transform a sample with features that were not in the original fit, it will

A

ignore them, since there is no column to put them in.

184
Q

sklearn_pandas: When transforming a new df sample to be used for a prediction

A

all columns must be present, but the values can be “” so it will be ignored by the mapper.

185
Q

sklearn_pandas: The columns are ordered according to

A

the order given to the mapper.

186
Q

sklearn_pandas: For transformations that require multiple columns, type

A

mapper = DataFrameMapper([
(["column1", "column2"], sklearn.decomposition.PCA(1))
])

187
Q

Pandas: When switching a column from int to float use

A

df["Numeric"] = df["Numeric"].apply(lambda x: float(x))
instead of
df["Numeric"] = df["Numeric"].astype(float)

188
Q

sklearn: Do not fit the preprocessing or model on

A

the testing data, because new samples will not have the luxury of being fitted.

189
Q

sklearn: Since I am not supposed to fit preprocessing to the test data, the SelectPercentile and DataFrameMapper must happen

A

after the train_test_split, on just the training data

190
Q

sklearn_pandas: Remember to end each line with a

A

comma

191
Q

sklearn_pandas: When creating a pipeline with the mapper, make sure

A

the mapper is first so it transforms the columns into a format that is compatible with the following transformers

192
Q

sklearn: To return the accuracy score of a model or pipeline, type

A

model.score(x_test, y_test)

193
Q

sklearn: To create a pipeline, type

A

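from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
  ('column transformations', mapper),
  ('percenterizer', SelectPercentile()),
  ('classifier', MultinomialNB())
])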
194
Q

K-Fold CV is

A

A way to cross validate while using all your data rather than losing some data to the testing set. It involves creating multiple bins of samples, one of which is kept out of training to be tested against later, but then it alternates to allow the testing bin to be used for training while another bin becomes the testing one. It then averages the results of the tests on each bin.
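
The rotation of bins described above can be sketched with the current sklearn.model_selection API, where KFold takes the number of folds as n_splits. The synthetic data and the GaussianNB model are assumptions for illustration.

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

x, y = make_classification(n_samples=100, n_features=5, random_state=0)

scores = []
# Each fold takes one turn as the testing bin while the rest are used for training.
for train_index, test_index in KFold(n_splits=5).split(x):
    model = GaussianNB().fit(x[train_index], y[train_index])
    scores.append(model.score(x[test_index], y[test_index]))

# Average the results of the tests on each bin.
mean_score = numpy.mean(scores)
print(scores, mean_score)
```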

195
Q

sklearn: to import K-Fold CV, type

A

from sklearn.model_selection import KFold

(sklearn.cross_validation in scikit-learn versions before 0.20)

196
Q

K-Fold CV requires the two arguments,

A

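number of samples, and how many folds (in the old sklearn.cross_validation API; the sklearn.model_selection version takes only the number of folds, n_splits)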

197
Q

Python: __init__ is a

A

method that runs right when a class is instantiated

198
Q

Python: Instead of putting all of the parameters and defaults in def __init__(my_attribute=1, my_attribute2=2): you can just type

A
class Myclass:
    def __init__(self, **args):
        self.my_attribute = args.get("Name", "Bill")
        self.my_attribute2 = args.get("Age", 20)
199
Q

Python: Every method in a class must at least take the

A

self argument. def my_method(self):

200
Q

Python: The (self) argument represents

A

the data from the instance you are calling the method on.

201
Q

Python: Using self. in a class method allows you to

A

access the attributes of the instance you are calling the method on.

202
Q

Python: To create and call a basic class method, type

A
(in a file called my_class.py)
class Myclass:  
    def my_method(self):
        return "hi"

(console)
from my_class import Myclass
my_instance = Myclass()
my_instance.my_method()

203
Q

Python: To return an attribute of the current instance by calling a method, type

A
(in a file called my_class.py)
class Myclass:
    my_attribute = 10
    def my_method(self):
        return self.my_attribute

(console)
from my_class import Myclass
my_instance = Myclass()
my_instance.my_method()
204
Q

K-Fold CV: Does not

A

shuffle the samples.

205
Q

Python: To create a class that takes two arguments upon instantiation, and has a method, type

A
class Myclass:
    def __init__(self, arg_1="default1", arg_2="default2"):
        self.arg_1 = arg_1
        self.arg_2 = arg_2
    def my_method(self):
        return self.arg_1*2
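
A short usage sketch for the class in this card, showing that the arguments are supplied at instantiation, either positionally or by keyword. The instance and variable names are illustrative only.

```python
class Myclass:
    def __init__(self, arg_1="default1", arg_2="default2"):
        # Store the constructor arguments on the instance.
        self.arg_1 = arg_1
        self.arg_2 = arg_2

    def my_method(self):
        # Multiplying a string by 2 repeats it.
        return self.arg_1 * 2


default_instance = Myclass()                 # falls back to both defaults
custom_instance = Myclass("ab", arg_2="cd")  # positional and keyword arguments

default_result = default_instance.my_method()
custom_result = custom_instance.my_method()
print(default_result, custom_result)
```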
206
Q

Python: To return a random integer between 1 and 10, type

A

import random

random.randint(1,10)

207
Q

Selenium: To enlarge the browser window, type

A

my_browser.maximize_window()

208
Q

Python: a class is

A

a blueprint for creating objects (instances), defining their attributes and methods

209
Q

Python: When a class uses __init__ the arguments must be passed

A

at instantiation.

210
Q

Pandas: To split a df’s rows into training and testing sets, type

A
mask = numpy.random.rand(len(df)) < 0.8  # True for roughly 80% of rows
x_train = df[mask]
x_test = df[numpy.invert(mask)]
211
Q

Pandas: To remove the labels column from a df and place it into the y variable, type

A

y = df.pop("Labels")

212
Q

sklearn: To return the classes of a LabelBinarizer(), type

A

lb.classes_

213
Q

Pandas: To check what columns two df’s do not share in common, type

A

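df.columns.symmetric_difference(df2.columns)

Note: df.columns.difference(df2.columns) returns only the columns of df that are missing from df2.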

214
Q

Pandas: The first few lines I should write when checking out a new DataFrame are

A

df.head()
df.tail()
df.info()
df.describe()
for item in df.columns:
    print(df[item].value_counts())
df[df.isnull().any(axis=1)].sort_values(by="Column name", ascending=True)
df.corr()
df.columns

215
Q

Python: To return a random list item, type

A

import random

random.choice(my_list)

216
Q

Python: To capture any parameters that are passed into a class when being instantiated, but were not defined inside the class beforehand, type

A
class Myclass:
    def __init__(self, **args):
        self.my_attribute = args.get("arg", "default")
        for key, value in args.items():
            setattr(self, key, value)
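
A minimal sketch of how the **args pattern in this card behaves when the class is instantiated. The keyword colour is a made-up parameter that the class never declares.

```python
class Myclass:
    def __init__(self, **kwargs):
        # Known parameter with a fallback default.
        self.my_attribute = kwargs.get("arg", "default")
        # Every keyword passed in, declared or not, becomes an attribute.
        for key, value in kwargs.items():
            setattr(self, key, value)


instance = Myclass(arg="custom", colour="red")
plain = Myclass()
print(instance.my_attribute, instance.colour, plain.my_attribute)
```

Attributes created this way exist only on instances that received the keyword, so plain has no colour attribute.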
217
Q

Python: To create a subclass that extends and takes the attributes from a parent class, type

A

class Mysubclass(Myparentclass):
    pass

218
Q

Python: To create a subclass that extends a parent class but has a new attribute and overrides one of the parent's attributes, type

A
class Mysubclass(Myparentclass):
    new_attribute = 10
    overwritten_attribute = "new value"
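
To see the override in action, here is a small sketch with a hypothetical parent class. The names inherited_attribute and describe are invented for the example.

```python
class Myparentclass:
    inherited_attribute = "from parent"
    overwritten_attribute = "old value"

    def describe(self):
        # Looks the attribute up on the instance's own class first.
        return self.overwritten_attribute


class Mysubclass(Myparentclass):
    new_attribute = 10
    overwritten_attribute = "new value"


child = Mysubclass()
parent = Myparentclass()
print(child.describe(), parent.describe(), child.inherited_attribute)
```

The inherited method picks up the subclass value because attribute lookup starts at the subclass and only then falls back to the parent.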
219
Q

Pandas: To load a CSV with no header row and set the column names on load, type

A

df = pandas.read_csv("/path/file.csv", header=None, names=["Column_name"])

220
Q

Pandas: To load a CSV with no header row and set the column names on load, type

A

df = pandas.read_csv("/users/student/desktop/neg_list.csv", header=None, names=["URL"])

221
Q

Python: To use an attribute within a class you must

A

prefix it with self. as in self.my_attribute

222
Q

Python: When passing values to a custom transformer,

A
do it in the instantiation or in the pipeline, not in the methods. Then in the class type:
from sklearn.base import TransformerMixin
class Mytransformer(TransformerMixin):
    def __init__(self, **args):
        for key, value in args.items():
            setattr(self, key, value)

then, when instantiating, type:
my_instance = Mytransformer(param="value")

223
Q

Python: When creating a Pipeline of DataFrameMapper

A

end lines with a comma

224
Q

Python: To create an alarm that raises an exception after a certain number of seconds, type

A

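import signal

def signal_handler(signum, frame):
    raise Exception("Timed out!")
signal.signal(signal.SIGALRM, signal_handler)  # SIGALRM is Unix-only

signal.alarm(5)  # raise the exception after 5 seconds
try:
    my_slow_function()  # placeholder for the code being timed
except Exception:
    pass
finally:
    signal.alarm(0)  # cancel the alarm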
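
The alarm pattern in this card can be wrapped in a context manager so the alarm is always cancelled. This is a sketch under a few assumptions: a Unix system (SIGALRM is not available on Windows), code running in the main thread, and the names TimedOut and time_limit, which are invented here.

```python
import signal
import time
from contextlib import contextmanager


class TimedOut(Exception):
    """Raised when the alarm fires (hypothetical name for this sketch)."""


@contextmanager
def time_limit(seconds):
    def handler(signum, frame):
        raise TimedOut("Timed out!")

    previous = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)  # schedule SIGALRM after this many whole seconds
    try:
        yield
    finally:
        signal.alarm(0)                          # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


timed_out = False
try:
    with time_limit(1):
        time.sleep(3)  # stand-in for slow work
except TimedOut:
    timed_out = True
print(timed_out)
```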

225
Q

Pandas: To use the map function and a dictionary to change categorical values to 0 and 1, type

A

df['column'] = df['column'].map({'category1': 0, 'category2': 1})

226
Q

numpy: To concatenate two 2d arrays across the second axis, type

A

numpy.c_[my_numpy_array, my_numpy_array2]

227
Q

ml: The general type of algorithm that counts the frequency of unique words is called

A

bag of words

228
Q

python: if "key_name" does not exist, executing my_dict.get("key_name") will

A

return None

229
Q

python: To make an anonymous function, type

A

my_func = lambda x: x*2
my_func(2)
=> 4

230
Q

python: This reduce function…

from functools import reduce
reduce(lambda x,y: x+y, [7,1,2,5])

A

applies the function to the first two items of the iterable, then repeatedly calls the same function with the running result as the first argument and the next item of the iterable as the second, until a single value remains (here 7+1=8, 8+2=10, 10+5=15).
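
To make the pairing visible, this sketch replaces the lambda with a named function that records each pair of arguments it receives. The names add and steps are illustrative.

```python
from functools import reduce

steps = []


def add(x, y):
    # x is the running result, y is the next item from the list.
    steps.append((x, y))
    return x + y


total = reduce(add, [7, 1, 2, 5])
print(steps)
print(total)
```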

231
Q

ml: A hidden neuron is

A

a neuron that is neither an input nor an output. Usually an inner neuron that receives output from another neuron.

232
Q

ml: Recurrent neural nets are

A

models of artificial neural networks in which feedback loops are possible by having neurons which fire for some limited duration of time, before becoming quiescent.

233
Q

ml: feedforward neural networks are

A

neural networks where the output from one layer is used as input to the next layer

234
Q

ml: neural networks where the output from one layer is used as input to the next layer are called

A

feedforward neural networks

235
Q

ml: Sigmoid neurons are similar to perceptrons, but

A

modified so that small changes in their weights and bias cause only a small change in their output

236
Q

ml: A perceptron is

A

an artificial neuron that takes a number of inputs and returns a single binary output. The output is 1 if the weighted sum of the inputs exceeds a set threshold, and 0 otherwise.
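
The perceptron and sigmoid neuron cards can be compared with a small sketch. It assumes the common convention that the bias equals the negative of the threshold, and the weights and bias values are made up for illustration.

```python
import math


def weighted_sum(inputs, weights, bias):
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def perceptron(inputs, weights, bias):
    # Fires (1) only when weighted sum + bias is positive, where bias = -threshold.
    return 1 if weighted_sum(inputs, weights, bias) > 0 else 0


def sigmoid_neuron(inputs, weights, bias):
    # Same weighted sum, squashed smoothly into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-weighted_sum(inputs, weights, bias)))


inputs, weights = [1, 1], [0.6, 0.4]
bias_a, bias_b = -0.95, -1.05  # a small change in the bias

perceptron_outputs = (perceptron(inputs, weights, bias_a),
                      perceptron(inputs, weights, bias_b))
sigmoid_outputs = (sigmoid_neuron(inputs, weights, bias_a),
                   sigmoid_neuron(inputs, weights, bias_b))
print(perceptron_outputs)
print(sigmoid_outputs)
```

The small bias change flips the perceptron from 1 to 0, while the sigmoid output moves only slightly, from a little above 0.5 to a little below it.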

237
Q

ml: The most common modern artificial neuron used in deep neural networks is

A

the sigmoid neuron (in classic introductions; ReLU units are now more common in deep networks)

238
Q

Pandas: The escape sequence that signifies a tab delimiter is

A

\t

239
Q

Pandas: To import a csv into a variable, type

A

data_frame = pandas.read_csv("pathfromcurrent/directory.csv")

Note: For a path relative to the current directory, do not start the path with a slash.

240
Q

Pandas: To return the first n rows of a DataFrame, type

A

df.head(4)

241
Q

Pandas: To check the type of data something is, type

A

type(var_name)

242
Q

Pandas: To print the key and value for every item in a groupby using a for loop, type

A

for key, value in groupby_var_name:
    print(key)
    print(value)

243
Q

Pandas: A groupby is a

A

Dictionary like structure wherein…

244
Q

Pandas: To create an empty dataframe in a variable, type

A

my_dataframe = pandas.DataFrame()

245
Q

Pandas: To write a DataFrame to a CSV file, type

A

data_frame_var.to_csv("myfolder/name.csv", header=False)

246
Q

IPYNB: To turn a notebook into an executable python script, type

A

jupyter nbconvert --to python ~/sup.ipynb into the console (older installs: ipython nbconvert --to python ~/sup.ipynb)

247
Q

Pandas: A pandas Series exhibits both list and dictionary behavior because

A

Its data can be accessed both by position, like a list, and by index label (key), like a dictionary.

248
Q

Pandas: To create a pandas Series, type

A

series_var = pandas.Series(my_list_or_dict)

249
Q

Pandas: To return the values of a Series, type

A

my_series.values

250
Q

Pandas: To return the indexes of a Series, type

A

my_series.index

251
Q

Pandas: To use a list variable as the values of a Series and another list variable as the index of a Series, type

A

my_series = pandas.Series(my_values_list, index=list_idx)

252
Q

Pandas: Remember when you use the drop method to

A

Select only what you want to drop, not what you want to keep

253
Q

ipython: To make a script run before every opening of a notebook

A

place it in /users/username/.ipython/profile_default/startup

254
Q

javascript: Using session cookies gives your browser

A

ambient authority. Every request from your browser to the bank's server automatically carries your cookie, even when the request is triggered while you are visiting another site.

255
Q

python: To turn a string from python into an html file, type

A

html_file = open("filename.html", "w")
html_file.write("string")
html_file.close()

256
Q

xml: To parse an XML string that came from a request, type

A

import xml.etree.ElementTree as ET

root = ET.fromstring(r.text)

then loop over it like a nested list, collecting each element's text into rows

nd_array = []
for calls in root:
    for call in calls:
        array = [attr.text for attr in call]
        nd_array.append(array)
257
Q

xml: Each XML element object has the attributes

A

.text
.tag
.attrib
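
A small self-contained sketch of .tag, .text and .attrib, using an inline XML string in place of a real response. The document structure and field names are invented, and it is one level shallower than the nested loop in the previous card.

```python
import xml.etree.ElementTree as ET

# Stand-in for r.text from a request; the structure is made up.
xml_text = """<calls>
  <call id="1"><name>Ann</name><minutes>5</minutes></call>
  <call id="2"><name>Bob</name><minutes>12</minutes></call>
</calls>"""

root = ET.fromstring(xml_text)

# One row per <call>, holding the text of each child element.
rows = [[field.text for field in call] for call in root]

first_call = root[0]
print(root.tag)             # name of the root element
print(first_call.attrib)    # attributes as a dictionary
print(first_call[0].tag, first_call[0].text)
print(rows)
```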

258
Q

python: To create a virtual environment with python2, type

A

virtualenv -p /usr/bin/python2.7 path/to/env_folder/

to activate:
source path/to/env_folder/bin/activate

to deactivate:
deactivate

259
Q

python: To accept arguments into a script from the command line as a list, type

A

import sys

sys.argv
note: sys.argv[0] is the script name; to get the first argument, use sys.argv[1]
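
A brief sketch of how the sys.argv list is laid out. The helper script_arguments and the sample file names are hypothetical, and the function takes the list as a parameter so it can be tried without the command line.

```python
import sys


def script_arguments(argv):
    """Return only the user-supplied arguments, skipping the script name."""
    return argv[1:]


# What sys.argv would look like for: python my_script.py input.csv 10
example = script_arguments(["my_script.py", "input.csv", "10"])
print(example)

# The real arguments for the current run (always strings).
print(script_arguments(sys.argv))
```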

260
Q

python: To find all the matches of a filename in a directory recursively, type

A

import os
import fnmatch
matches = []
for root, dirnames, filenames in os.walk("/users/student/Downloads/www.dmv.org"):
    for filename in fnmatch.filter(filenames, '*.css'):
        matches.append(os.path.join(root, filename))
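
The same recursive search can be tried safely against a throwaway directory. In this sketch the function name find_files and the sample file names are made up, and the results are sorted so the order is predictable.

```python
import fnmatch
import os
import tempfile


def find_files(top, pattern):
    """Return every path under top whose filename matches pattern."""
    matches = []
    for root, dirnames, filenames in os.walk(top):
        for filename in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(root, filename))
    return sorted(matches)


with tempfile.TemporaryDirectory() as top:
    # Build a tiny directory tree to search.
    os.makedirs(os.path.join(top, "css", "nested"))
    for relative in ["index.html", "css/site.css", "css/nested/print.css"]:
        open(os.path.join(top, *relative.split("/")), "w").close()

    found = [os.path.relpath(path, top) for path in find_files(top, "*.css")]

print(found)
```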