4 Flashcards

1
Q

SVM: SVM maximizes the

A

The margin, the distance between the separating line and both clusters

2
Q

SVM: the margin is

A

The distance between the line and the closest data points on either side

3
Q

SVM: SVM is an

A

Algorithm that separates classes of data with a line (a hyperplane)

4
Q

SVM: SVM prioritizes

A

Correct classification over maximizing margin

5
Q

SVM: To create your SVM classifier, type

A

from sklearn.svm import SVC

clf = SVC(kernel="linear")
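A minimal usage sketch (the toy features and labels are made up for illustration):
```
from sklearn.svm import SVC

features_train = [[1, 1], [2, 1], [8, 9], [9, 8]]  # made-up 2D points
labels_train = [0, 0, 1, 1]

clf = SVC(kernel="linear")
clf.fit(features_train, labels_train)
clf.predict([[2, 2]])  # => array([0])
```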

6
Q

SVM: SVM might work poorly if

A

There are more features than samples

7
Q

SVM: Can an SVM create a non-linear decision boundary?

A

Yes, using the kernel trick

8
Q

SVM: SVM stands for

A

Support vector machine

9
Q

Python: To run a script from within a script but pull in variables from one script to the next, the best way is to

A

Use import

10
Q

datetime: To switch the position of month and day by converting a date string and then replace slashes with dashes, type

A

from datetime import datetime
date = datetime.strptime("05/01/15", "%m/%d/%y")
date.strftime("%d-%m-%y")

11
Q

SVM: For a linear kernel, a gamma of 1.0 will produce a

A

Straight decision boundary (gamma has no effect with a linear kernel)

12
Q

SVM: Some important parameters for SVM are

A

kernel
gamma
C

13
Q

SVM: Some common kernels are

A

rbf and linear

14
Q

SVM: the C parameter controls

A

How much the decision boundary will curve to classify all of the training data correctly, trading a smooth boundary for fitting every training point

15
Q

SVM: increasing the value of the C parameter will

A

Classify more training points correctly within their decision boundary, but it may overfit to the point that the model is no longer predictive
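A sketch of how one might see the effect of C, assuming an existing train/test split (x_train, y_train, x_test, y_test):
```
from sklearn.svm import SVC

for c in [0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="rbf", C=c)
    clf.fit(x_train, y_train)
    print(c, clf.score(x_test, y_test))  # watch test accuracy fall once C overfits
```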

16
Q

ML: For a very large data set with a lot of noise and without a clear decision boundary, a better model than SVM is

A

Naive bayes

17
Q

Pandas: the Series.replace({"value": "value2"}) replaces

A

every occurrence

18
Q

ML: Accuracy improves in proportion to

A

Training data size, with diminishing returns beginning after 700 samples.

19
Q

ML: To see if your model's prediction accuracy would benefit from more samples, you could

A

Slice your training data into 4 cumulative sections (200, 400, 600, 800 samples) and test the accuracy of each. The improvement in accuracy should show diminishing returns at the higher sample sizes.
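A sketch of that idea, assuming x_train/y_train hold 800 samples and clf is any classifier:
```
for n in [200, 400, 600, 800]:
    clf.fit(x_train[:n], y_train[:n])
    print(n, clf.score(x_test, y_test))  # gains should flatten at the larger slices
```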

20
Q

Python: One way to schedule a script is to

A
import time
from datetime import datetime

while True:
    if int(datetime.now().strftime("%M")) == 30:
        print("30 Minutes!")
        time.sleep(61)  # sleep past the minute so it only fires once
    time.sleep(2)
21
Q

ML: Usually, to increase accuracy it is better to have

A

More data rather than a more finely tuned algorithm.

22
Q

datetime: To return the current date, type

A

import datetime

datetime.datetime.now().strftime("%d/%m/%y")

23
Q

ML: A high information gain feature is one that is

A

very common in one classification and not in others

24
Q

Pandas: To change all of a newly uploaded df’s dates to proper format and also turn numbers into numeric types, type

A

df = df.convert_objects(convert_numeric=True)

25
Python: To stop a for loop, use the command
break
26
Python: To test if a number is even, type
number % 2 == 0
27
ML: The main types of feature data are
Numerical, Categorical, Time series, Text
28
ML: Continuous supervised learning means that
The output is not binary or categorical but a part of a range
29
ML: Discrete supervised learning means that
The output is binary or categorical but not a part of a range
30
Math: The intercept is
The value of y when x is 0
31
Math: The formula for a regression is
y = m*x+b
32
Math: Slope can be equated by
change in y divided by change in x
33
Math: A coefficient is
The number that multiplies a variable.
34
Math: With regards to regressions, the coefficient is the
the number multiplying x, which determines the slope
35
Math: A scalar value is a
position on a scale. A quantity.
36
Math: The regression line is the
line that minimizes the sum of the squared distances between the line and the data points.
37
LinearRegression: When you ask a fitted LinearRegression model for a prediction, you must pass it a
List, even if it's a list with one value
38
LinearRegression: To return the coefficient and the intercept of a fitted LinearRegression model, type
model.coef_ | model.intercept_
39
Python: To switch the keys and values of a dictionary to be the values and keys using a dict comprehension, type
{value: key for key, value in my_dict.items()}
40
LinearRegression: To return the r-squared of a LinearRegression, type
model.score(x_test, y_test)
41
sklearn: Predictions are returned in an array, so you should type
model.predict([27])[0]
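A sketch tying the LinearRegression cards together (the ages/net_worths variable names are made up):
```
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(ages_train, net_worths_train)

print(model.coef_)       # slope
print(model.intercept_)  # intercept
print(model.score(ages_test, net_worths_test))  # r-squared
print(model.predict([27])[0])  # pass a list, even for one value
```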
42
Numpy: To return the number from array([[25]]), type
array([[25]])[0][0]
43
LinearRegression: For LinearRegression, "Error" means
The actual y minus the predicted y.
44
LinearRegression: The algorithm used by LinearRegression is called
ordinary least squares
45
LinearRegression: Sum of squared error is not a good evaluation metric for a regression because
It increases when more data points are added, despite the fit being the same.
46
LinearRegression: The evaluation metric most used to describe the goodness of fit of a regression is called
r-squared
47
LinearRegression: The r-squared explains
How much of the change in y is explained by the change in x
48
Pandas: df.values returns a
numpy 2D array
49
Pandas: To have a df return a 2D numpy array, type
df.values
50
sklearn: To scale a pandas df to between 0 and 1, type
```
from sklearn.preprocessing import MinMaxScaler
import pandas

x = df.values
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)
df2 = pandas.DataFrame(x_scaled)
```
51
SQL: After importing a new sql script, make sure to
right click and refresh on the left side in the schemas panel.
52
SQL: To return two specific columns from a db, type
SELECT table.column, table.column2 FROM table;
53
SQL: Before querying a db, make sure that the
Correct database is selected and bolded in the panel.
54
SQL: To return data that meets a criteria, type
SELECT * FROM table WHERE column = 2000;
55
SQL: In SQL an equivalence test only uses
one equals sign.
56
SQL: To return data that doesn't meet a certain criteria, type
SELECT * FROM table WHERE column != 1000;
57
SQL: The test operators you can use in SQL are
=, !=, >, <, >=, <=
58
SQL: To query for two criteria, type
SELECT * FROM table WHERE column = 1000 AND column2 = "Value";
59
SQL: To query for two optional criteria, type
SELECT * FROM table WHERE column = 1000 OR column2 = "String";
60
SQL: To return rows for a range between two values, type
SELECT * FROM table WHERE column BETWEEN 1000 AND 2000;
61
SQL: The wildcard symbol is
%
62
SQL: The LIKE clause is
not case sensitive
63
SQL: To return only rows that have a partial string in them, type
SELECT * FROM table WHERE column LIKE "%string%";
64
SQL: To sort columns before returning them, type
SELECT * FROM table ORDER BY column ASC;
65
SQL: To sort by two columns before returning them, type
SELECT * FROM table ORDER BY column DESC, column2 ASC;
66
SQL: Databases start counting at row
Zero
67
SQL: To return only the first 10 rows of a table, type
SELECT * FROM table LIMIT 10;
68
SQL: To return only 10 rows of a table starting at the 20th row, type
SELECT * FROM table LIMIT 10 OFFSET 21;
69
SQL: Sometimes clients implicitly
Add a LIMIT of 1000 rows to your queries. This would not occur when querying from a programming language.
70
SQL: A shorter way to return only the first 10 rows of a table starting at the 20th row is
SELECT * FROM table LIMIT 21, 10;
71
SQL: To filter for rows with the value NULL, type
SELECT * FROM table WHERE column IS NULL;
72
SQL: To sort a column you are returning and also filter out any NULL values, type
SELECT * FROM table WHERE column IS NOT NULL ORDER BY column ASC;
73
ML: The output type of a supervised classification algorithm is
Binary or categorical (Discrete)
74
ML: The output type of a regression algorithm is
A number that is part of a range (Continuous)
75
ML: A regression with multiple input variables is called a
Multivariate regression
76
LinearRegression: To remove outliers, follow the procedure
Train; remove the ~10% of data points with the highest error; re-train on the remaining points
77
ML: The most popular clustering algorithm is called
k-means
78
k-means: k-means first assigns the centroids
randomly
79
k-means: k-means requires you to tell it the
number of centroids
80
k-means: k-means places the centroid by
moving it to the mean of the points assigned to it, reducing the total distance to those points, until it sits in the middle of the cluster
81
k-means: Make sure to adjust the default
n_clusters parameter
82
k-means: To control the number of iterations of assigning points to a centroid and then moving it to the middle, change the
max_iter parameter
83
k-means: To control how many times k-means initializes the algorithm to avoid bad clustering due to the randomness of the initial plot, change the
n_init parameter
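A sketch combining the three parameters (the values shown are illustrative):
```
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, max_iter=300, n_init=10)
kmeans.fit(x)
print(kmeans.labels_)  # cluster assignment for each sample
```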
84
sklearn: The algorithms that require rescaling are
K-means and SVC(kernel="rbf"). For the rest, don't bother.
85
sklearn: The algorithm that counts the frequency of words is called
CountVectorizer
86
sklearn: bag of words will generally count plurals
separately, not as the root word (no stemming is applied; use a stemmer to merge them)
87
Bag of words: There is a bias for long texts because
they have the opportunity to have a higher frequency of words.
88
Bag of words: To process text into a bag of words, type
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
text_list = [text_var, text_var2, text_var3]
vectorizer.fit(text_list)
bag_of_words = vectorizer.transform(text_list)
89
Bag of words: The CountVectorizer.transform method
counts the occurrences of the words.
90
Bag of words: To return the feature number of a specific word, type
vectorizer.vocabulary_.get("word")
91
Bag of words: To reduce noise in a bag of words, you should
remove stopwords
92
Bag of words: A stopword is
a high frequency, low information word e.g. "the"
93
nltk: To get a list of stopwords, type
import nltk
nltk.download()  # download all the files
from nltk.corpus import stopwords
sw = stopwords.words("english")
94
Bag of words: To bundle words that have the same root, type
from nltk.stem.snowball import SnowballStemmer | stemmer = SnowballStemmer("english")
95
Bag of words: Before applying CountVectorizer or TfidfVectorizer, you should
use a stemmer
96
Bag of words: The TFIDF representation
weighs words that are rare in the corpus as a whole more highly.
97
Pandas: When you use a list to return a column, df[["Column"]], it returns a
DataFrame
98
Pandas: To delete a column, type
del df["Column"]
99
Pandas: To return a boolean index based on values containing a partial string, type
df["Column name"].str.contains("string")
100
sklearn: Everything that goes to the estimators must be
a float
101
Pandas: To efficiently remove the labels column from the feature columns, type
y = df.pop("Column name")
102
Pandas: To automatically create dummy variables for any string in a df, type
x = pandas.get_dummies(df)
103
Pandas: To filter for column names based on partial string, type
df.filter(regex="Column String")
104
Pandas: Sometimes, to replace a numpy boolean column with integers, you need to
turn it into a string first. | df["Boolean Column"].astype(str).replace({"True":1})
105
sklearn: To select only powerful features, use the
SelectPercentile method
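A usage sketch (percentile=10 and the f_classif scoring function are illustrative choices):
```
from sklearn.feature_selection import SelectPercentile, f_classif

selector = SelectPercentile(f_classif, percentile=10)  # keep the top 10% of features
x_train_reduced = selector.fit_transform(x_train, y_train)
x_test_reduced = selector.transform(x_test)  # transform only; never fit on test data
```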
106
sklearn: Tf-Idf stands for
term frequency–inverse document frequency
107
Bag of words: When using bag of words it is recommended to use the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer | TfidfVectorizer(stop_words="english")
108
sklearn: The easiest model to get a good result from without much tuning is
from sklearn.ensemble import RandomForestClassifier | model = RandomForestClassifier()
109
sklearn: To make the TfidfVectorizer ignore all words that appear in 50% of all corpus samples, type
TfidfVectorizer(stop_words="english", max_df=0.5)
110
sklearn: High bias means
the model doesn't fit the training data well
111
sklearn: High variance means
the model fits the training data too well and poorly predicts the test data due to overfitting
112
sklearn: automatically finding the optimal number of features is called
regularization
113
sklearn: the tradeoff between number of features and accuracy is calculated using the
lasso regression
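A lasso sketch (alpha is illustrative):
```
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(x_train, y_train)
print(model.coef_)  # coefficients driven to 0.0 mark discarded features
```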
114
sklearn: All supervised learning algorithms require both
a samples-by-features matrix and labels
115
Pandas: To see if there is a difference between the training and testing df's columns, type
df_train.columns.difference(df_test.columns)
116
sklearn: To binarize categorical data from a df, type
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer(sparse=False)
x_train = vectorizer.fit_transform(df_train.T.to_dict().values())
x_test = vectorizer.transform(df_test.T.to_dict().values())
117
sklearn: To turn a new test data sample (that is in df form) into the binarized format after it has been fitted, type
vectorizer.transform(df_test.T.to_dict().values())
118
DictVectorizer: DictVectorizer does not change
Numerical values. Only categorical.
119
DictVectorizer: You should use the DictVectorizer
after scaling the numerical data, since it will not alter numbers and will fix the categorical data into binary.
120
sklearn: To scale certain df columns to between 0 and 1, type
from sklearn.preprocessing import MinMaxScaler | dfTest[['A','B']] = dfTest[['A','B']].apply(lambda x: MinMaxScaler().fit_transform(x))
121
DictVectorizer: To return a list of all the columns of a fitted DictVectorizer, type
vectorizer.get_feature_names()
122
Python: When writing files to the computer, to avoid overwriting them
Add the current datetime to the title
123
PCA: PCA obtains its new coordinate system from only
Translation and rotation
124
sklearn: To automatically test several different parameters and choose the best based on cross-validation, use
from sklearn import grid_search
125
SQL: SQL syntax is split into the two data types
DDL and DML
126
SQL: DDL stands for
data definition language
127
SQL: DDL deals with
schema, the structure of the tables
128
SQL: DML stands for
data manipulation language
129
SQL: DML deals with
creating, reading, updating, and deleting data.
130
SQL: A database is a
container for groups of tables.
131
SQL: to create a database, type
CREATE SCHEMA IF NOT EXISTS my_database_name DEFAULT CHARACTER SET utf8;
132
SQL: To specify which database you want the following query to be run on, type
USE my_database_name;
133
SQL: If a script trips an error it will
stop running, while a warning will allow it to continue
134
sklearn: After model.predict() returns a number label, to get the name it was previously, I can
Take the dict that replaced the names with numbers, reverse its keys and values with a dict comprehension, and then .get() the number from the reversed dict to recover the original name.
135
PCA: PCA finds the
Centre of the data and the principal axis of variation
136
Pandas: To print all of the unique values and their frequencies for every column, type
for column in df: | print(df[column].value_counts())
137
Pandas: To filter to see all of the rows that contain a numpy.nan, type
mask = df.isnull().any(axis=1) | df[mask]
138
Pandas: To use .apply on multiple columns at once, type
df[["Column","Column 2"]] = df[["Column","Column 2"]].apply(lambda x: x*5)
139
MinMaxScaler: To undo scaling, use
inverse_transform
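Sketch:
```
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)
x_restored = scaler.inverse_transform(x_scaled)  # back to the original values
```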
140
cross_validation: If using cross validation, check accuracy score with
model.score(x_test, y_test)
141
ML: The labels column for supervised classification can be
Text format as well as number
142
ML: The sample you use for the prediction (if you are using DataFrameMapper) must be
a df with column labels, so that the DictVectorizer can know which columns each value belongs to
143
ML: When feeding a sample to model.predict(), it is unnecessary to
Have all of the columns' data present, because the DictVectorizer will transform the format of the columns you feed it to the same format as the fitted training set.
144
Pandas: When creating a DataFrame from a dict, remember to
put the values in a list.
145
sklearn: Any columns that are not present when you enter the prediction will be
assumed to have a value of zero.
146
sklearn: Before binarizing the training samples with DictVectorizer, remember to
fill any nan cells with something, otherwise the resulting array won't work
147
Pandas: When using .replace({}), remember to
either reassign the variable or add the parameter inplace=True
148
Python: To find the nth occurrence of a character in a string, type
my_string.replace("_", "", 2).find("_") | Removing the first n-1 occurrences (here 2) makes .find() land on the nth (here the 3rd); note that the returned index is relative to the shortened string.
149
sklearn: It is appropriate to use a regression model when
the output is continuous
150
PCA: With regards to PCA, variance means
how spread out the data is.
151
PCA: What PCA does is it
Projects data points onto the directions of maximal variance, minimizing information loss, and uses those directions as principal components, ranked by the variance they explain
152
ML: To see the two underlying factors driving the data you can use
PCA
153
PCA: PCA, when applied to faces, is also called
eigenfaces
154
PCA: The format for a PCA model is
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(x_train)
155
PCA: The eigenvalues show
how much of the variation each principal component has.
156
PCA: To return the eigenvalues, type
pca.explained_variance_ratio_
157
PCA: To return the principal components, the number of which you specified for the model, type
pca.components_
158
PCA: PCA can be used for
facial recognition, by using it as pre-processing for an SVM, as it finds the trends in the images.
159
PCA: The format for using PCA as preprocessing for images to be used in an SVM, type
from sklearn.decomposition import RandomizedPCA

pca = RandomizedPCA(n_components=150, whiten=True)
pca.fit(x_train)
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)
160
ML: To feed a range of model parameters to model and have it self optimize using cross validation, use
GridSearchCV
161
GridSearchCV: The format for using GridSearchCV is
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 0.5, 1.0], "gamma": [5, 8, 9], "kernel": ("linear", "rbf")}
model = GridSearchCV(SVC(), param_grid)
model.fit(x_train, y_train)
model.best_estimator_
162
GridSearchCV: model.grid_scores_ returns
a triple with the parameter combination, its average score, and its separate scores by fold.
163
GridSearchCV: GridSearchCV automatically
refits to the best parameters
164
Pandas: To rename all the columns by adding a string to the end of the original name, type
df.columns = df.columns + "my_string"
165
datetime: The symbol for 2015 is
%Y
166
datetime: The symbol for zero padded day of month 01 is
%d
167
datetime: The symbol for zero padded month is
%m
168
datetime: The symbol for the locale's abbreviated month name is
%b
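The four symbols together (a sketch):
```
from datetime import datetime

datetime(2015, 1, 5).strftime("%Y %m %d %b")  # => "2015 01 05 Jan"
```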
169
Pandas: To replace part of a string in the values of a Series, type
df["Column"] = df["Column"].str.replace("before","after")
170
GridSearchCV: To return a list of all of the combinations of parameters GridSearchCV tried, type
for item in model.grid_scores_: | print(item)
171
GridSearchCV: When passing the param_grid, specifically the "kernel" entry, the testing values can be
either in a list or in a tuple (parentheses), but the rest must be in a list.
172
GridSearchCV: In model.grid_scores_, the "mean" values stands for
average validation score
173
GridSearchCV: GridSearchCV performs its cross validation on
folds within the training set.
174
SQL: To create a table with a text column that cannot insert null values, type
CREATE TABLE IF NOT EXISTS tablename (columnname TEXT NOT NULL);
175
SQL: The default engine for mysql now is
InnoDB
176
SQL: To specify which engine you want to use, type
CREATE TABLE IF NOT EXISTS table (columnname INTEGER) ENGINE InnoDB;
177
SQL: To insert a new row into a table while listing the values in a different order, type
INSERT INTO tablename (column_2, column_1) VALUES ("String", 100);
178
SQL: To insert multiple rows into a table at once, type
INSERT INTO tablename (column, column_2) VALUES ("String", 100), (NULL, 2000);
179
sklearn: You should only fit the preprocessing/feature extraction and the model on
x_train. x_test should only be transformed by the preprocessing/feature extraction and used for predict/validation with the model.
180
sklearn_pandas: To feed all of columns to sklearn_pandas, type
mapper = DataFrameMapper([
    ("column1", sklearn.preprocessing.LabelBinarizer()),
    ("column2", sklearn.preprocessing.MinMaxScaler()),
    ("column3", sklearn.feature_extraction.text.TfidfVectorizer()),
    ("column4", None)
])
mapper.fit_transform(df.copy())
181
sklearn_pandas: To import sklearn_pandas, type
from sklearn_pandas import DataFrameMapper | from sklearn_pandas import cross_val_score
182
sklearn_pandas: mapper.fit_transform(df.copy()) only transforms
The columns that you passed into the DataFrameMapper
183
sklearn_pandas: If sklearn_pandas is asked to transform a sample with features that were not in the original fit, it will
ignore them, since there is no column to put them in.
184
sklearn_pandas: When transforming a new df sample to be used for a prediction
all columns must be present, but the values can be "" so it will be ignored by the mapper.
185
sklearn_pandas: The columns are ordered according to
the order given to the mapper.
186
sklearn_pandas: For transformations that require multiple columns, type
mapper = DataFrameMapper([
    (["column1", "column2"], sklearn.decomposition.PCA(1))
])
187
Pandas: When switching a column from int to float use
df["Numeric"] = df["Numeric"].apply(lambda x: x.astype(float))) instead of df["Numeric"] = df["Numeric"].astype(float)
188
sklearn: Do not fit the preprocessing or model on
the testing data, because new samples will not have the luxury of being fitted.
189
sklearn: Since I am not supposed to fit preprocessing to the test data, the SelectPercentile and DataFrameMapper must happen
after the train_test_split, on just the training data
190
sklearn_pandas: Remember to end each line with a
comma
191
sklearn_pandas: When creating a pipeline with the mapper, make sure
the mapper is first so it transforms the columns into a format that is compatible with the following transformers
192
sklearn: To return the accuracy score of a model or pipeline, type
model.score(x_test, y_test)
193
sklearn: To create a pipeline, type
from sklearn.pipeline import Pipeline
```
pipeline = Pipeline([
    ("column transformations", mapper),
    ("percenterizer", SelectPercentile()),
    ("classifier", MultinomialNB())
])
```
194
K-Fold CV is
A way to cross-validate while using all of your data rather than losing some to the testing set. The samples are split into several bins; one bin is held out of training to be tested against, then the roles alternate so that every bin is used for training and serves once as the testing bin. The test results from all bins are then averaged.
195
sklearn: to import K-Fold CV, type
from sklearn.cross_validation import KFold
196
K-Fold CV requires two arguments,
the number of samples and the number of folds
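A sketch, assuming x and y are numpy arrays (n_folds=4 is illustrative; note that K-Fold does not shuffle, see card 204):
```
from sklearn.cross_validation import KFold

kf = KFold(len(x), n_folds=4)
for train_idx, test_idx in kf:
    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```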
197
Python: __init__ is a
method that runs right when a class is instantiated
198
Python: Instead of putting all of the parameters and defaults in def __init__(self, my_attribute=1, my_attribute2=2): you can just type
```
class Myclass:
    def __init__(self, **args):
        self.my_attribute = args.get("Name", "Bill")
        self.my_attribute2 = args.get("Age", 20)
```
199
Python: Every method in a class must at least take the
self argument. def my_method(self):
200
Python: The (self) argument represents
the data from the instance you are calling the method on.
201
Python: Using self. in a class method allows you to
access the attributes of the instance you are calling the method on.
202
Python: To create and call a basic class method, type
```
(in a file called my_class.py)
class Myclass:
    def my_method(self):
        return "hi"
```
(console)
from my_class import Myclass
my_instance = Myclass()
my_instance.my_method()
203
Python: To return an attribute of the current instance by calling a method, type
```
class Myclass:
    my_attribute = 10
    def my_method(self):
        return self.my_attribute
```
```
from my_class import Myclass
my_instance = Myclass()
my_instance.my_method()
```
204
K-Fold CV: Does not
shuffle the samples.
205
Python: To create a class that takes two arguments upon instantiation, and has a method, type
```
class Myclass:
    def __init__(self, arg_1="default1", arg_2="default2"):
        self.arg_1 = arg_1
        self.arg_2 = arg_2

    def my_method(self):
        return self.arg_1 * 2
```
206
Python: To return a random integer between 1 and 10, type
import random | random.randint(1,10)
207
Selenium: To enlarge the browser window, type
my_browser.maximize_window()
208
Python: a class is
an object with attributes and methods
209
Python: When a class uses __init__ the arguments must be passed
at instantiation.
210
Pandas: To split a df's rows into training and testing sets, type
```
mask = numpy.random.rand(len(df)) < 0.8
x_train = df[mask]
x_test = df[numpy.invert(mask)]
```
211
Pandas: To remove the labels column from a df and place it into the y variable, type
y = df.pop("Labels")
212
sklearn: To return the classes of a LabelBinarizer(), type
lb.classes_
213
Pandas: To check what columns two df's do not share in common, type
df.columns.difference(df2.columns)
214
Pandas: The first few lines I should write when checking out a new DataFrame are
df.head()
df.tail()
df.info()
df.describe()
for item in df.columns: print(df[item].value_counts())
df[df.isnull().any(axis=1)].sort_index(ascending=True, by="Column name")
df.corr()
df.columns
215
Python: To return a random list item, type
import random | random.choice(my_list)
216
Python: To capture any parameters that are passed into a class when being instantiated, but were not defined inside the class beforehand, type
```
class Myclass:
    def __init__(self, **args):
        self.my_attribute = args.get("arg", "default")
        for key, value in args.items():
            setattr(self, key, value)
```
217
Python: To create a subclass that extends and takes the attributes from a parent class, type
class Mysubclass(Myparentclass):
218
Python: To create a subclass that extends a parent class but has a new attributes and overrides one of the parents attributes, type
```
class Mysubclass(Myparentclass):
    new_attribute = 10
    overwritten_attribute = "new value"
```
219
Pandas: To upload a csv with no header and set it upon upload, type
df = pandas.read_csv("/path/file.csv", header=None, names=["Column_name"])
220
Pandas: To upload a csv with no header and set it upon upload, type
df = pandas.read_csv("/users/student/desktop/neg_list.csv", header=None, names=["URL"])
221
Python: To use an attribute within a class you must
start with self.my_attribute
222
Python: When passing values to a custom transformer,
Do it in the instantiation or in the pipeline, not in the methods. Then in the class, type:
```
class Mytransformer(TransformerMixin):
    def __init__(self, **args):
        for key, value in args.items():
            setattr(self, key, value)
```
Then instantiate: my_instance = Mytransformer(param="value")
223
Python: When creating a Pipeline or a DataFrameMapper,
end lines with a comma
224
Python: To create an alarm that raises an exception after a certain number of seconds, type
```
import signal

def signal_handler(signum, frame):
    raise Exception("Timed out!")

signal.signal(signal.SIGALRM, signal_handler)

signal.alarm(5)  # raise SIGALRM in 5 seconds
try:
    long_call()  # placeholder for the code being timed
except Exception:
    pass
signal.alarm(0)  # cancel the alarm if the call finished in time
```
225
Pandas: To use the map function and a dictionary to change categorical values to 0 and 1, type
df['column'] = df['column'].map({'category1': 0, 'Category2':1})
226
numpy: To concatenate two 2d arrays across the second axis, type
numpy.c_[my_numpy_array, my_numpy_array2]
227
ml: The general type of algorithm that counts the frequency of unique words is called
bag of words
228
python: if "key_name" does not exist, executing my_dict.get("key_name") will
return None
229
python: To make an anonymous function, type
my_func = lambda x: x*2 | my_func(2) => 4
230
python: This reduce function... reduce(lambda x,y: x+y, [7,1,2,5])
Uses the first two items in the iterable as the function's arguments, then uses their return value as the first argument to the same function, with the next item in the iterable as the second argument, repeating until the iterable is exhausted.
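Worked example:
```
from functools import reduce  # built in on Python 2; imported in Python 3

reduce(lambda x, y: x + y, [7, 1, 2, 5])  # ((7 + 1) + 2) + 5 = 15
```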
231
ml: A hidden neuron is
a neuron that is neither an input nor an output; usually an inner neuron that receives output from another neuron.
232
ml: Recurrent neural nets are
models of artificial neural networks in which feedback loops are possible by having neurons which fire for some limited duration of time, before becoming quiescent.
233
ml: feedforward neural networks are
neural networks where the output from one layer is used as input to the next layer
234
ml: neural networks where the output from one layer is used as input to the next layer are called
feedforward neural networks
235
ml: Sigmoid neurons are similar to perceptrons, but
modified so that small changes in their weights and bias cause only a small change in their output
236
ml: A perceptron is
an artificial neuron that takes a number of inputs and returns an output. The output depends on if the weighted sum of each input passes a set threshold.
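A minimal perceptron sketch (the weights and threshold are made up):
```
def perceptron(inputs, weights, threshold):
    # fire (return 1) only if the weighted sum of the inputs passes the threshold
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

perceptron([1, 0, 1], [0.6, 0.2, 0.3], 0.5)  # => 1
```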
237
ml: The most common modern artificial neuron used in deep neural networks is
the sigmoid neuron
238
Pandas: The symbology that signifies a tab delimiter is
\t
239
Pandas: To import a csv into a variable, type
data_frame = pandas.read_csv("pathfromcurrent/directory.csv")
Note: Do not start the path with a slash.
240
Pandas: To return the first n rows of a DataFrame, type
df.head(4)
241
Pandas: To check the type of data something is, type
type(var_name)
242
Pandas: To print the key and value for every item in a groupby using a for loop, type
for key, value in groupby_var_name:
    print(key)
    print(value)
243
Pandas: A groupby is a
Dictionary like structure wherein...
244
Pandas: To create an empty dataframe in a variable, type
my_dataframe = pandas.DataFrame()
245
Pandas: To have a data frame print to a csv, type
data_frame_var.to_csv("myfolder/name.csv", header=0)
246
IPYNB: To turn a notebook into an executable python script, type
Type ipython nbconvert --to python ~/sup.ipynb into the console
247
Pandas: A pandas Series exhibits both list and dictionary behavior because
Its data can be accessed both through an index and through a key.
248
Pandas: To create a pandas Series, type
series_var = pandas.Series(my_list_or_dict)
249
Pandas: To return the values of a Series, type
my_series.values
250
Pandas: To return the indexes of a Series, type
my_series.index
251
Pandas: To use a list variable as the values of a Series and another list variable as the index of a Series, type
my_series = pandas.Series(my_values_list, index=list_idx)
252
Pandas: Remember when you use the drop method to
Slice for only what you want to drop, not keep
253
ipython: To make a script run before every opening of a notebook
place it in /users/username/.ipython/profile_default/startup
254
javascript: Using session cookies gives your browser
ambient authority. Every request from your browser to the bank's server automatically carries your cookie. Even when you are visiting another site.
255
python: To turn a string from python into an html file, type
open("filename.html","w") html_file.write("string") html_file.close()
256
xml: To parse an XML string that came from a request, type
import xml.etree.ElementTree as ET
root = ET.fromstring(r.text)
Then parse it like an nd_array:
```
nd_array = []
for calls in root:
    for call in calls:
        array = [attr.text for attr in call]
        nd_array.append(array)
```
257
xml: Each xml object has the attributes
.text .tag .attrib
258
python: To create a virtual environment with python2, type
virtualenv -p /usr/bin/python2.7 path/to/env_folder/
To activate: source path/to/env_folder/bin/activate
To deactivate: deactivate
259
python: To accept arguments into a script from the command line as a list, type
import sys | sys.argv | Note: the first argument is sys.argv[1]
260
python: To find all the matches of a filename in a directory recursively, type
import os
import fnmatch

matches = []
for root, dirnames, filenames in os.walk("/users/student/Downloads/www.dmv.org"):
    for filename in fnmatch.filter(filenames, '*.css'):
        matches.append(os.path.join(root, filename))