4 Flashcards
SVM: SVM maximizes the
Margin between both classes
SVM: the margin is
The space between the closest data point and the line
SVM: SVM is an
Algorithm that separates data with a line
SVM: SVM prioritizes
Correct classification over maximizing margin
SVM: To create your SVM classifier, type
from sklearn.svm import SVC
clf = SVC(kernel="linear")
SVM: SVM might work poorly if
There are more features than samples
SVM: Can an SVM create a non linear decision boundary?
Yes, using the kernel trick
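A minimal sketch of this in sklearn (the toy data and values are illustrative, not from the deck):
from sklearn.svm import SVC
# XOR-like toy data: not separable by a straight line
x_train = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_train = [0, 0, 1, 1]
clf = SVC(kernel="rbf")  # the rbf kernel lets the decision boundary curve
clf.fit(x_train, y_train)
clf.predict([[0.9, 0.1]])  # point nearest (1, 0), expected class 1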
SVM: SVM stands for
Support vector machine
Python: To run a script from within a script but pull in variables from one script to the next, the best way is to
Use import
datetime: To switch the position of month and day by converting a date string and then replace slashes with dashes, type
from datetime import datetime
date = datetime.strptime("05/01/15", "%m/%d/%y")
date.strftime("%d-%m-%y")
SVM: For a linear kernel, a gamma of 1.0 will produce a
Straight decision boundary
SVM: Some important parameters for SVM are
Kernel
Gamma
C
SVM: Some common kernels are
rbf and linear
SVM: the c parameter controls
The degree to which the decision boundary will curve to contain all of the training data
SVM: increasing the value of the c parameter will
Increase the number of training points classified correctly, but may overcompensate to the point that the boundary is no longer predictive
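A hedged sketch of where these parameters go (the values shown are illustrative, not tuned):
from sklearn.svm import SVC
# assumes x_train and y_train already exist
clf = SVC(kernel="rbf", C=10.0, gamma=1.0)  # higher C: boundary bends to fit more training points
clf.fit(x_train, y_train)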
ML: For a very large data set with a lot of noise and without a clear decision boundary, the better model than SVM is
Naive bayes
Pandas: the Series.replace({"value": "value2"}) replaces
every occurrence
ML: Accuracy improves proportional to
Training data size, with diminishing returns (in this example, beginning after ~700 samples).
ML: To see if your models prediction accuracy would benefit from more samples you could
Slice your training data into 4 cumulative sections (e.g., 200, 400, 600, 800 samples) and test the accuracy after training on each, as in the sketch below. The improvement in accuracy should show diminishing returns at the higher sample sizes.
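A rough sketch of that procedure, assuming x_train, y_train, x_test, and y_test already exist and support slicing:
from sklearn.svm import SVC
for size in [200, 400, 600, 800]:
    clf = SVC(kernel="linear")
    clf.fit(x_train[:size], y_train[:size])  # cumulative slice of the training data
    print(size, clf.score(x_test, y_test))   # accuracy gains should shrink at larger sizes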
Python: One way to schedule a script is to
from datetime import datetime
import time

while True:
    if int(datetime.now().strftime("%M")) == 30:
        print("30 Minutes!")
        time.sleep(61)  # sleep past the minute so the message fires only once
    time.sleep(2)
ML: Usually, to increase accuracy it is better to have
More data rather than a more finely tuned algorithm.
datetime: To return the current date, type
import datetime
datetime.datetime.now().strftime("%d/%m/%y")
ML: A high information gain feature is one that is
very common in one classification and not in others
Pandas: To change all of a newly uploaded df's dates to proper format and also turn numbers into numeric types, type
df = df.convert_objects(convert_numeric=True)
Note: convert_objects is deprecated in newer pandas; pandas.to_numeric is the modern replacement.
Python: To stop a for loop, use the command
break
Python: To test if a number is even, type
number % 2 == 0
ML: The main types of feature data are
Numerical, Categorical, Time series, Text
ML: Continuous supervised learning means that
The output is not binary or categorical but a part of a range
ML: Discrete supervised learning means that
The output is binary or categorical but not a part of a range
Math: The intercept is
The value of y when x is 0
Math: The formula for a regression is
y = m*x+b
Math: Slope can be calculated by
change in y divided by change in x
Math: A coefficient is
The number that multiplies a variable.
Math: With regards to regressions, the coefficient is the
number multiplying x, which determines the slope
Math: A scalar value is a
position on a scale. A quantity.
Math: The regression line is the
line that minimizes the sum of the squared distances between the line and the data points.
LinearRegression: When you ask a fitted LinearRegression model for a prediction, you must pass it a
List, even if it’s a list with one value
LinearRegression: To return the coefficient and the intercept of a fitted LinearRegression model, type
model.coef_
model.intercept_
Python: To switch the keys and values of a dictionary to be the values and keys using a dict comprehension, type
{value: key for key, value in my_dict.items()}
LinearRegression: To return the r-squared of a LinearRegression, type
model.score(x_test, y_test)
sklearn: Predictions are returned in an array,
so you should type
model.predict([27])[0]
Numpy: To return the number from array([[25]]), type
array([[25]])[0][0]
LinearRegression: For LinearRegression, “Error” means
The actual y minus the predicted y.
LinearRegression: The algorithm used by LinearRegression is called
ordinary least squares
LinearRegression: Sum of squared error is not a good evaluation metric for a regression because
It increases when more data points are added, even if the fit is the same.
LinearRegression: The evaluation metric most used to describe the goodness of fit of a regression is called
r-squared
LinearRegression: The r-squared explains
How much of the change in y is explained by the change in x
Pandas: df.values returns a
numpy 2D array
Pandas: To have a df return a 2D numpy array, type
df.values
sklearn: To scale a pandas df to between 0 and 1, type
from sklearn.preprocessing import MinMaxScaler
import pandas
x = df.values
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)
df2 = pandas.DataFrame(x_scaled)
SQL: After importing a new sql script, make sure to
right click and refresh on the left side in the schemas panel.
SQL: To return two specific columns from a db, type
SELECT table.column, table.column2 FROM table;
SQL: Before querying a db, make sure that the
Correct database is selected and bolded in the panel.
SQL: To return data that meets a criteria, type
SELECT * FROM table WHERE column = 2000;
SQL: In SQL an equivalence test only uses
one equals sign.
SQL: To return data that doesn’t meet a certain criteria, type
SELECT * FROM table WHERE column != 1000;
SQL: The test operators you can use in SQL are
=, !=, >, <, >=, <=
SQL: To query for two criteria, type
SELECT * FROM table WHERE column = 1000 AND column2 = "Value";
SQL: To query for two optional criteria, type
SELECT * FROM table WHERE column = 1000 OR column2 = "String";
SQL: To return rows for a range between two values, type
SELECT * FROM table WHERE column BETWEEN 1000 AND 2000;
SQL: The wildcard symbol is
%
SQL: The LIKE clause is
not case sensitive
SQL: To return only rows that have a partial string in them, type
SELECT * FROM table WHERE column LIKE "%string%";
SQL: To sort columns before returning them, type
SELECT * FROM table ORDER BY column ASC;
SQL: To sort by two columns before returning them, type
SELECT * FROM table ORDER BY column DESC, column2 ASC;
SQL: Databases start counting at row
Zero
SQL: To return only the first 10 rows of a table, type
SELECT * FROM table LIMIT 10;
SQL: To return only 10 rows of a table starting at the 20th row, type
SELECT * FROM table LIMIT 10 OFFSET 20;
SQL: Sometimes clients implicitly
Add a LIMIT of 1000 rows to your queries (common in GUI clients such as MySQL Workbench). This does not happen when querying from a programming language.
SQL: A shorter way to return only the first 10 rows of a table starting at the 20th row is
SELECT * FROM table LIMIT 20, 10;
SQL: To filter for rows with the value NULL, type
SELECT * FROM table WHERE column IS NULL;
SQL: To sort a column you are returning and also filter out any NULL values, type
SELECT * FROM table WHERE column IS NOT NULL ORDER BY column ASC;
ML: The output type of a supervised classification algorithm is
Binary or categorical (discrete)
ML: The output type of a regression algorithm is
A number that is part of a range (Continuous)
ML: A regression with multiple input variables is called a
Multivariate regression
LinearRegression: To remove outliers, follow the procedure
Train, Remove ~10% data points with highest error, Re-train remaining points
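A rough sketch of that procedure, assuming x_train and y_train are numpy arrays (the ~10% cutoff follows the card):
import numpy
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)
errors = numpy.abs(y_train - model.predict(x_train))  # error = actual y minus predicted y
keep = errors.argsort()[:int(len(errors) * 0.9)]      # indices of the ~90% lowest-error points
model.fit(x_train[keep], y_train[keep])               # re-train on the remaining points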
ML: The most popular clustering algorithm is called
k-means
k-means: k-means first assigns the centroids
randomly
k-means: k-means requires you to tell it the
number of centroids
k-means: k-means places the centroid by
alternating between assigning each point to its nearest centroid and moving the centroid to the mean (middle) of its assigned points
k-means: Make sure to adjust the default
n_clusters parameter
k-means: To control the number of iterations of assigning points to a centroid and then moving it to the middle, change the
max_iter parameter
k-means: To control how many times k-means initializes the algorithm to avoid bad clustering due to the randomness of the initial plot, change the
n_init parameter
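Putting the three k-means parameters together (n_clusters=3 is illustrative; 300 and 10 happen to be sklearn's defaults):
from sklearn.cluster import KMeans
# assumes x_train already exists
model = KMeans(n_clusters=3, max_iter=300, n_init=10)
model.fit(x_train)
labels = model.predict(x_train)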
sklearn: The algorithms that require rescaling are
K-means and SVC(kernel="rbf"). For the rest, don't bother.
sklearn: The algorithm that counts the frequency of words is called
CountVectorizer
sklearn: bag of words will generally count plurals
as separate words from the root word, unless you stem first
Bag of words: There is a bias for long texts because
they have the opportunity to have a higher frequency of words.
Bag of words: To process text into a bag of words, type
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
text_list = [text_var, text_var2, text_var3]
vectorizer.fit(text_list)
bag_of_words = vectorizer.transform(text_list)
Bag of words: The CountVectorizer.transform method
counts the occurrences of the words.
Bag of words: To return the vocabulary (feature) index of a specific word, type
vectorizer.vocabulary_.get("word")
Bag of words: To reduce noise in a bag of words, you should
remove stopwords
Bag of words: A stopword is
a high frequency, low information word e.g. “the”
nltk: To get a list of stopwords, type
import nltk
nltk.download() #Download all the files
from nltk.corpus import stopwords
sw = stopwords.words("english")
Bag of words: To bundle words that have the same root, type
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
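Usage, continuing from the stemmer above:
stemmer.stem("running")  # => "run"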
Bag of words: Before applying CountVectorizer or TfidfVectorizer, you should
use a stemmer
Bag of words: The TFIDF representation
weights words that are rare across the corpus as a whole more highly.
Pandas: When you use a list to return a column, df[["Column"]], it returns a
DataFrame
Pandas: To delete a column, type
del df["Column"]
Pandas: To return a boolean index based on values containing a partial string, type
df["Column name"].str.contains("string")
sklearn: Everything that goes to the estimators must be
a float
Pandas: To efficiently remove the labels column from the feature columns, type
y = df.pop("Column name")
Pandas: To automatically create dummy variables for any string in a df, type
x = pandas.get_dummies(df)
Pandas: To filter for column names based on partial string, type
df.filter(regex="Column String")
Pandas: Sometimes, to replace a numpy boolean column with integers, you need to
turn it into a string first.
df["Boolean Column"].astype(str).replace({"True": 1, "False": 0})
sklearn: To select only powerful features, use the
SelectPercentile method
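A minimal sketch, assuming x_train, y_train, and x_test exist (percentile=10 is illustrative):
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=10)  # keep the top 10% of features by score
x_train_sel = selector.fit_transform(x_train, y_train)
x_test_sel = selector.transform(x_test)  # transform only; never fit on the test set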
sklearn: Tf-Idf stands for
term frequency–inverse document frequency
Bag of words: When using bag of words it is recommended to use the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVectorizer(stop_words="english")
sklearn: The easiest model to get a good result from without much tuning is
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
sklearn: To make the TfidfVectorizer ignore all words that appear in 50% of all corpus samples, type
TfidfVectorizer(stop_words="english", max_df=0.5)
sklearn: High bias means
the model doesn't fit the training data well
sklearn: High variance means
the model fits the training data too well and poorly predicts the test data due to overfitting
sklearn: automatically finding the optimal number of features is called
regularization
sklearn: the tradeoff between number of features and accuracy is calculated using the
lasso regression
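A minimal sketch (alpha=0.1 is illustrative; a larger alpha zeroes out more coefficients):
from sklearn.linear_model import Lasso
reg = Lasso(alpha=0.1)
reg.fit(x_train, y_train)  # assumes x_train and y_train already exist
print(reg.coef_)  # features the penalty removed have a coefficient of 0.0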
sklearn: All supervised learning algorithms require both
a samples × features matrix and labels
Pandas: To see if there is a difference between the training and testing df’s columns, type
df_train.columns.difference(df_test.columns)
sklearn: To binarize categorical data from a df, type
from sklearn.feature_extraction import DictVectorizer
vectorizer = DictVectorizer(sparse=False)
x_train = vectorizer.fit_transform(df_train.T.to_dict().values())
x_test = vectorizer.transform(df_test.T.to_dict().values())
sklearn: To turn a new test data sample (that is in df form) into the binarized format after it has been fitted, type
vectorizer.transform(df_test.T.to_dict().values())
DictVectorizer: DictVectorizer does not change
Numerical values. Only categorical.
DictVectorizer: You should use the DictVectorizer
after scaling the numerical data, since it will not alter numbers and will fix the categorical data into binary.
sklearn: To scale certain df columns to between 0 and 1, type
from sklearn.preprocessing import MinMaxScaler
dfTest[['A','B']] = dfTest[['A','B']].apply(lambda x: MinMaxScaler().fit_transform(x))
DictVectorizer: To return a list of all the columns of a fitted DictVectorizer, type
vectorizer.get_feature_names()
Python: When writing files to computer, to avoid overwriting it
Add the current datetime to the title
PCA: PCA obtains its new coordinate system from only
Translation and rotation
sklearn: To automatically test several different parameters and choose the best based on cross-validation, use
from sklearn import grid_search
SQL: SQL syntax is split into the two data types
DDL and DML
SQL: DDL stands for
data definition language
SQL: DDL deals with
schema, the structure of the tables
SQL: DML stands for
data manipulation language
SQL: DML deals with
creating, reading, updating, and deleting data.
SQL: A database is a
container for groups of tables.
SQL: to create a database, type
CREATE SCHEMA IF NOT EXISTS my_database_name DEFAULT CHARACTER SET utf8;
SQL: To specify which database you want the following query to be run on, type
USE my_database_name;
SQL: If a script trips an error it will
stop running, while a warning will allow it to continue
sklearn: After model.predict() returns a number label, to get the name it was previously, I can
get the dict that originally replaced the values with numbers,
use a dict comprehension to reverse its keys and values, then
call .get(number_label) on the reversed dict, as in the sketch below.
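A sketch with a hypothetical label dict (the names and values are illustrative):
label_map = {"cat": 0, "dog": 1}  # hypothetical dict used earlier to encode the labels
reversed_map = {value: key for key, value in label_map.items()}
number_label = model.predict(sample)[0]  # assumes a fitted model and a prepared sample
reversed_map.get(number_label)  # => "cat" or "dog"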
PCA: PCA finds the
Centre of the data and the principal axis of variation
Pandas: To print all of the unique values and their frequencies for every column, type
for column in df:
    print(df[column].value_counts())
Pandas: To filter to see all of the rows that contain a numpy.nan, type
mask = df.isnull().any(axis=1)
df[mask]
Pandas: To use .apply on multiple columns at once, type
df[["Column", "Column 2"]] = df[["Column", "Column 2"]].apply(lambda x: x*5)
MinMaxScale: To undo scaling use
inverse_transform
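A one-line sketch, assuming the fitted scaler and x_scaled from the MinMaxScaler card above:
x_original = scaler.inverse_transform(x_scaled)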
cross_validation: If using cross validation, check accuracy score with
model.score(x_test, y_test)
ML: The labels column for supervised classification can be
Text format as well as number
ML: The sample you use for the prediction (if you are using DataFrameMapper) must be
a df with column labels, so that the DictVectorizer knows which column each value belongs to
ML: When feeding a sample to model.predict(), it is unnecessary to
Have all of the columns' data present, because the DictVectorizer will transform the format of the columns you feed it to the same format as the fitted training set.
Pandas: When creating a DataFrame from a dict, remember to
put the values in a list.
sklearn: Any columns that are not present when you enter the prediction will be
assumed to have a value of zero.
sklearn: Before binarizing the training samples with DictVectorizer, remember to
fill any NaN cells with something, otherwise the resulting array won't work
Pandas: When using .replace({}), remember to
either reassign the variable or add the parameter inplace=True
Python: To find the nth occurrence of a character in a string, type
my_string.replace("", "", 2).find("")
sklearn: It is appropriate to use a regression model when
the output is continuous
PCA: With regards to PCA, variance means
how spread out the data is.
PCA: What PCA does is it
Projects data points onto the directions of maximal variance, to minimize information loss, and uses those directions as the principal components, ranked by the variance they explain.
ML: To see the two underlying factors driving the data you can use
PCA
PCA: PCA, when applied to faces, is also called
eigenfaces
PCA: The format for a PCA model is
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(x_train)
PCA: The eigenvalues show
how much of the variation each principal component has.
PCA: To return the eigenvalues, type
pca.explained_variance_ratio_
PCA: To return the principal components, the number of which you specified for the model, type
pca.components_
PCA: PCA can be used for
facial recognition, by using it as pre-processing for an SVM, as it finds the trends in the images.
PCA: The format for using PCA as preprocessing for images to be used in an SVM, type
from sklearn.decomposition import RandomizedPCA
pca = RandomizedPCA(n_components=150, whiten=True)
pca.fit(x_train)
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)
ML: To feed a range of model parameters to model and have it self optimize using cross validation, use
GridSearchCV
GridSearchCV: The format for using GridSearchCV is
from sklearn.grid_search import GridSearchCV
param_grid = {"C": [0.1, 0.5, 1.0], "gamma": [5, 8, 9], "kernel": ("linear", "rbf")}
model = GridSearchCV(SVC(), param_grid)
model.fit(x_train, y_train)
model.best_estimator_
GridSearchCV: model.grid_scores_ returns
a triple with the parameter combination, its average score, and its per-fold scores.
GridSearchCV: GridSearchCV automatically
refits to the best parameters
Pandas: To rename all the columns by adding a string to the end of the original name, type
df.columns = df.columns + "my_string"
datetime: The symbol for 2015 is
%Y
datetime: The symbol for zero padded day of month 01 is
%d
datetime: The symbol for zero padded month is
%m
datetime: The symbol for the locale's abbreviated month name is
%b
Pandas: To replace part of a string in the values of a Series, type
df["Column"] = df["Column"].str.replace("before", "after")
GridSearchCV: To return a list of all of the combinations of parameters GridSearchCV tried, type
for item in model.grid_scores_:
    print(item)
GridSearchCV: When passing the param_grid, specifically for "kernel", the testing values can be
either in a list or in a tuple, but the rest must be in a list.
GridSearchCV: In model.grid_scores_, the “mean” values stands for
average validation score
GridSearchCV: GridSearchCV performs its cross validation on
folds within the training set.
SQL: To create a table with a text column that cannot insert null values, type
CREATE TABLE IF NOT EXISTS tablename (columnname TEXT NOT NULL);
SQL: The default engine for mysql now is
InnoDB
SQL: To specify which engine you want to use, type
CREATE TABLE IF NOT EXISTS table (columnname INTEGER) ENGINE InnoDB;
SQL: To insert a new row into a table while listing the values in a different order, type
INSERT INTO tablename (column_2, column_1) VALUES (“String”, 100);
SQL: To insert multiple rows into a table at once, type
INSERT INTO tablename (column, column_2) VALUES (“String”, 100), (NULL, 2000);
sklearn: You should only fit the preprocessing/feature extraction and the model on
x_train. x_test should only receive a transform in preprocessing/feature extraction and predict/validate from the model.
sklearn_pandas: To feed all of the columns to sklearn_pandas, type
mapper = DataFrameMapper([
    ("column1", sklearn.preprocessing.LabelBinarizer()),
    ("column2", sklearn.preprocessing.MinMaxScaler()),
    ("column3", sklearn.feature_extraction.text.TfidfVectorizer()),
    ("column4", None),
])
mapper.fit_transform(df.copy())
sklearn_pandas: To import sklearn_pandas, type
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import cross_val_score
sklearn_pandas: mapper.fit_transform(df.copy()) only transforms
The columns that you passed into the DataFrameMapper
sklearn_pandas: If sklearn_pandas is asked to transform a sample with features that were not in the original fit, it will
ignore them, since there is no column to put them in.
sklearn_pandas: When transforming a new df sample to be used for a prediction
all columns must be present, but the values can be “” so it will be ignored by the mapper.
sklearn_pandas: The columns are ordered according to
the order given to the mapper.
sklearn_pandas: For transformations that require multiple columns, type
mapper = DataFrameMapper([
    (["column1", "column2"], sklearn.decomposition.PCA(1)),
])
Pandas: When switching a column from int to float use
df["Numeric"] = df["Numeric"].apply(lambda x: float(x))
instead of
df["Numeric"] = df["Numeric"].astype(float)
sklearn: Do not fit the preprocessing or model on
the testing data, because new samples will not have the luxury of being fitted.
sklearn: Since I am not supposed to fit preprocessing to the test data, the SelectPercentile and DataFrameMapper must happen
after the train_test_split, on just the training data
sklearn_pandas: Remember to end each line with a
comma
sklearn_pandas: When creating a pipeline with the mapper, make sure
the mapper is first so it transforms the columns into a format that is compatible with the following transformers
sklearn: To return the accuracy score of a model or pipeline, type
model.score(x_test, y_test)
sklearn: To create a pipeline, type
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ("column transformations", mapper),
    ("percenterizer", SelectPercentile()),
    ("classifier", MultinomialNB()),
])
K-Fold CV is
A way to cross-validate while using all of your data, rather than losing some to the testing set. The samples are split into multiple bins (folds); one bin is held out of training to be tested against, then the roles alternate so every bin serves as the testing bin once while the others are used for training. The test results from each bin are then averaged.
sklearn: to import K-Fold CV, type
from sklearn.cross_validation import KFold
K-Fold CV requires the two arguments,
number of samples, and how many folds
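A minimal sketch using the old sklearn.cross_validation API this deck imports, assuming x and y are numpy arrays:
from sklearn.cross_validation import KFold
kf = KFold(len(x), n_folds=4)  # number of samples, number of folds
for train_idx, test_idx in kf:
    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]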
Python: __init__ is a
method that runs right when a class is instantiated
Python: Instead of putting all of the parameters and defaults in def __init__(self, my_attribute=1, my_attribute2=2): you can just type
class Myclass:
    def __init__(self, **args):
        self.my_attribute = args.get("Name", "Bill")
        self.my_attribute2 = args.get("Age", 20)
Python: Every method in a class must at least take the
self argument. def my_method(self):
Python: The (self) argument represents
the data from the instance you are calling the method on.
Python: Using self. in a class method allows you to
access the attributes of the instance you are calling the method on.
Python: To create and call a basic class method, type
(in a file called my_class.py)
class Myclass:
    def my_method(self):
        return "hi"
(console)
from my_class import Myclass
my_instance = Myclass()
my_instance.my_method()
Python: To return an attribute of the current instance by calling a method, type
class Myclass:
    my_attribute = 10
    def my_method(self):
        return self.my_attribute

(console)
from my_class import Myclass
my_instance = Myclass()
my_instance.my_method()
K-Fold CV: Does not
shuffle the samples.
Python: To create a class that takes two arguments upon instantiation, and has a method, type
class Myclass:
    def __init__(self, arg_1="default1", arg_2="default2"):
        self.arg_1 = arg_1
        self.arg_2 = arg_2
    def my_method(self):
        return self.arg_1 * 2
Python: To return a random integer between 1 and 10, type
import random
random.randint(1,10)
Selenium: To enlarge the browser window, type
my_browser.maximize_window()
Python: a class is
an object with attributes and methods
Python: When a class uses __init__ the arguments must be passed
at instantiation.
Pandas: To split a df’s rows into training and testing sets, type
mask = numpy.random.rand(len(df)) < 0.8  # boolean mask: ~80% True
x_train = df[mask]
x_test = df[~mask]
Pandas: To remove the labels column from a df and place it into the y variable, type
y = df.pop("Labels")
sklearn: To return the classes of a LabelBinarizer(), type
lb.classes_
Pandas: To check what columns two df’s do not share in common, type
df.columns.difference(df2.columns)
Pandas: The first few lines I should write when checking out a new DataFrame are
df.head()
df.tail()
df.info()
df.describe()
for item in df.columns:
    print(df[item].value_counts())
df[df.isnull().any(axis=1)].sort_index(ascending=True, by="Column name")
df.corr()
df.columns
Python: To return a random list item, type
import random
random.choice(my_list)
Python: To capture any parameters that are passed into a class when being instantiated, but were not defined inside the class beforehand, type
class Myclass:
    def __init__(self, **args):
        self.my_attribute = args.get("arg", "default")
        for key, value in args.items():
            setattr(self, key, value)
Python: To create a subclass that extends and takes the attributes from a parent class, type
class Mysubclass(Myparentclass):
Python: To create a subclass that extends a parent class but has a new attribute and overrides one of the parent's attributes, type
class Mysubclass(Myparentclass):
    new_attribute = 10
    overwritten_attribute = "new value"
Pandas: To upload a csv with no header and set it upon upload, type
df = pandas.read_csv("/path/file.csv", header=None, names=["Column_name"])
Python: To use an attribute within a class you must
start with self.my_attribute
Python: When passing values to a custom transformer,
do it at instantiation or in the pipeline, not in the methods. Then, in the class, type:
class Mytransformer(TransformerMixin):
    def __init__(self, **args):
        for key, value in args.items():
            setattr(self, key, value)
To pass a value when instantiating the transformer, type:
my_instance = Mytransformer(param="value")
Python: When creating a Pipeline or a DataFrameMapper,
end lines with a comma
Python: To create an alarm that raises an exception after a certain number of seconds, type
import signal
import time

def signal_handler(signum, frame):
    raise Exception("Timed out!")

signal.signal(signal.SIGALRM, signal_handler)
signal.alarm(5)  # raise the exception after 5 seconds
try:
    time.sleep(10)  # stands in for the code that might hang
except Exception:
    pass
signal.alarm(0)  # cancel the alarm once the code finishes
Pandas: To use the map function and a dictionary to change categorical values to 0 and 1, type
df['column'] = df['column'].map({'category1': 0, 'Category2': 1})
numpy: To concatenate two 2d arrays across the second axis, type
numpy.c_[my_numpy_array, my_numpy_array2]
ml: The general type of algorithm that counts the frequency of unique words is called
bag of words
python: if “key_name” does not exist, executing my_dict.get(“key_name”) will
return None
python: To make an anonymous function, type
my_func = lambda x: x*2
my_func(2)
=> 4
python: This reduce function…
reduce(lambda x,y: x+y, [7,1,2,5])
uses the first two items of the iterable as the function's arguments, then feeds the returned value back in as the first argument with the next item of the iterable as the second argument, repeating until the iterable is exhausted.
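Tracing it step by step (in Python 3, reduce lives in functools):
from functools import reduce
reduce(lambda x, y: x + y, [7, 1, 2, 5])  # ((7 + 1) + 2) + 5 => 15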
ml: A hidden neuron is
a neuron that is neither an input nor an output. Usually an inner neuron that receives output from another neuron.
ml: Recurrent neural nets are
models of artificial neural networks in which feedback loops are possible by having neurons which fire for some limited duration of time, before becoming quiescent.
ml: feedforward neural networks are
neural networks where the output from one layer is used as input to the next layer
ml: neural networks where the output from one layer is used as input to the next layer are called
feedforward neural networks
ml: Sigmoid neurons are similar to perceptrons, but
modified so that small changes in their weights and bias cause only a small change in their output
ml: A perceptron is
an artificial neuron that takes a number of inputs and returns an output. The output depends on if the weighted sum of each input passes a set threshold.
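A toy sketch of that definition (the weights, threshold, and inputs are made up):
def perceptron(inputs, weights, threshold):
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0  # fires only if the sum passes the threshold

perceptron([1, 0, 1], [0.5, 0.2, 0.4], 0.8)  # => 1, since 0.5 + 0.4 = 0.9 > 0.8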
ml: The most common modern artificial neuron used in deep neural networks is
sigmoid neuron
Pandas: The symbology that signifies a tab delimiter is
\t
Pandas: To import a csv into a variable, type
data_frame = pandas.read_csv("pathfromcurrent/directory.csv")
Note: Do not start the path with a slash.
Pandas: To return the first n rows of a DataFrame, type
df.head(4)
Pandas: To check the type of data something is, type
type(var_name)
Pandas: To print the key and value for every item in a groupby using a for loop, type
for key, value in groupby_var_name:
    print(key)
    print(value)
Pandas: A groupby is a
A dictionary-like structure wherein…
Pandas: To create an empty dataframe in a variable, type
my_dataframe = pandas.DataFrame()
Pandas: To have a data frame print to a csv, type
data_frame_var.to_csv("myfolder/name.csv", header=0)
IPYNB: To turn a notebook into an executable python script, type
ipython nbconvert --to python ~/sup.ipynb (typed into the console)
Pandas: A pandas Series exhibits both list and dictionary behavior because
Its data can be accessed through index and key.
Pandas: To create a pandas Series, type
series_var = pandas.Series(my_list_or_dict)
Pandas: To return the values of a Series, type
my_series.values
Pandas: To return the indexes of a Series, type
my_series.index
Pandas: To use a list variable as the values of a Series and another list variable as the index of a Series, type
my_series = pandas.Series(my_values_list, index=list_idx)
Pandas: Remember when you use the drop method to
Slice for only what you want to drop, not keep
ipython: To make a script run before every opening of a notebook
place it in /users/username/.ipython/profile_default/startup
javascript: Using session cookies gives your browser
ambient authority. Every request from your browser to the bank’s server automatically carries your cookie. Even when you are visiting another site.
python: To turn a string from python into an html file, type
html_file = open("filename.html", "w")
html_file.write(“string”)
html_file.close()
xml: To parse an XML string that came from a request, type
import xml.etree.ElementTree as ET
root = ET.fromstring(r.text)
then parse it like an nd_array
nd_array = []
for calls in root:
    for call in calls:
        array = [attr.text for attr in call]
        nd_array.append(array)
xml: Each xml object has the attributes
.text
.tag
.attrib
python: To create a virtual environment with python2, type
virtualenv -p /usr/bin/python2.7 path/to/env_folder/
to activate:
source path/to/env_folder/bin/activate
to deactivate:
deactivate
python: To accept arguments into a script from the command line as a list, type
import sys
sys.argv
Note: sys.argv[0] is the script name; the first passed argument is sys.argv[1].
python: To find all the matches of a filename in a directory recursively, type
import os
import fnmatch
matches = []
for root, dirnames, filenames in os.walk("/users/student/Downloads/www.dmv.org"):
    for filename in fnmatch.filter(filenames, "*.css"):
        matches.append(os.path.join(root, filename))