4 Flashcards

Question

Python: To stop a for loop, use the command

Answer 1

number % 2 == 0

Answer 2

Numerical, Categorical, Time series, Text

Answer 3

The output is not binary or categorical but a part of a range

Answer 4

The output is binary or categorical but not a part of a range

Answer 5

Value of the y axis when the x axis is 0

Answer 6

change in y divided by change in x

Answer 7

The number that multiplies a variable.

Answer 8

the number multiplying x, which determines the slope

Answer 9

position on a scale. A quantity.

Answer 10

line that minimizes the sum of the squared distances between the line and the data points.
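
As a rough illustration of the definition above, the slope and intercept of such a line can be computed directly from the data. This is a minimal sketch in plain Python; the function name `fit_line` and the sample data are made up for the example and are not from the deck.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line minimizing squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Here the slope is the change in y divided by the change in x, and the intercept is the value of y when x is 0.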

Answer 11

List, even if it's a list with one value

Answer 12

model.coef_ | model.intercept_

Answer 13

{value: key for key, value in my_dict.items()}

Answer 14

model.score(x_test, y_test)

Answer 15

model.predict([[27]])[0]

Answer 16

array([[25]])[0][0]

Answer 17

The actual y minus the predicted y.

Answer 18

ordinary least squares

Answer 19

It increases when more data points are added, even if the fit is the same.

Answer 20

How much of the change in y is explained by the change in x
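
This answer reads like the usual description of R squared, so here is a small sketch of how that score is commonly computed by hand. The function name `r_squared` and the sample values are assumptions for illustration only.

```python
def r_squared(y_true, y_pred):
    """1 minus (sum of squared residuals / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    # residual = actual y minus predicted y
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Perfect predictions explain all of the variation in y
perfect = r_squared([1, 2, 3], [1, 2, 3])
# Always predicting the mean explains none of it
baseline = r_squared([1, 2, 3], [2, 2, 2])
print(perfect, baseline)
```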

Answer 21

numpy 2D array

Answer 22

```
from sklearn.preprocessing import MinMaxScaler
import pandas

x = df.values  # underlying numpy array
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)  # scale every column to the range [0, 1]
df2 = pandas.DataFrame(x_scaled)
```

Answer 23

right click and refresh on the left side in the schemas panel.

Answer 24

SELECT table.column, table.column2 FROM table;

Answer 25

Correct database is selected and bolded in the panel.

Answer 26

SELECT * FROM table WHERE column = 2000;

Answer 27

one equals sign.

Answer 28

SELECT * FROM table WHERE column != 1000;

Answer 29

=, !=, >, <, >=, <=

Answer 30

SELECT * FROM table WHERE column = 1000 AND column2 = "Value";

Answer 31

SELECT * FROM table WHERE column = 1000 OR column2 = "String";

Answer 32

SELECT * FROM table WHERE column BETWEEN 1000 AND 2000;

Answer 33

not case sensitive

Answer 34

SELECT * FROM table WHERE column LIKE "%string%";

Answer 35

SELECT * FROM table ORDER BY column ASC;

Answer 36

SELECT * FROM table ORDER BY column DESC, column2 ASC;

Answer 37

SELECT * FROM table LIMIT 10;

Answer 38

SELECT * FROM table LIMIT 10 OFFSET 21;

Answer 39

Add a limit of 1000 rows to your queries. This would not occur using a programming language.

Answer 40

SELECT * FROM table LIMIT 21, 10;

Answer 41

SELECT * FROM table WHERE column IS NULL;

Answer 42

SELECT * FROM table WHERE column IS NOT NULL ORDER BY column ASC;

Answer 43

Binary or categorical (Discrete)

Answer 44

A number that is part of a range (Continuous)

Answer 45

Multi-variate regression

Answer 46

Train, Remove ~10% data points with highest error, Re-train remaining points
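
The three steps above can be sketched with a simple straight-line fit. This is only an illustration under assumed names (`fit_line`, `remove_outliers`) and made-up data; the deck does not say which model is being trained.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def remove_outliers(xs, ys, fraction=0.1):
    """Train, then drop the given fraction of points with the highest error."""
    slope, intercept = fit_line(xs, ys)
    scored = [(abs(y - (slope * x + intercept)), x, y) for x, y in zip(xs, ys)]
    scored.sort(key=lambda item: item[0])  # smallest error first
    n_keep = len(scored) - int(round(len(scored) * fraction))
    kept = scored[:n_keep]
    return [x for _, x, _ in kept], [y for _, _, y in kept]


# Hypothetical data: y = 2x, except for one bad point at x = 5
xs = list(range(10))
ys = [2 * x for x in xs]
ys[5] = 100

kept_xs, kept_ys = remove_outliers(xs, ys, fraction=0.1)
slope, intercept = fit_line(kept_xs, kept_ys)  # re-train on the remaining points
print(slope, intercept)
```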

Answer 47

number of centroids

Answer 48

trying to reduce the total distance from half the points of a until it is in the middle

Answer 49

n_clusters parameter

Answer 50

max_iter parameter

Answer 51

n_init parameter

Answer 52

K-means and SVM(kernel="rbf"). For the rest don't bother.

Answer 53

CountVectorizer

Answer 54

as the root word

Answer 55

they have the opportunity to have a higher frequency of words.

Answer 56

from sklearn.feature_extraction.text import CountVectorizer | vectorizer = CountVectorizer() | text_list = [text_var, text_var2, text_var3] | vectorizer.fit(text_list) | bag_of_words = vectorizer.transform(text_list)

Answer 57

counts the occurrences of the words.
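
To make the counting concrete, here is a minimal plain-Python sketch of a bag of words. It is not how CountVectorizer is implemented; the function name `bag_of_words`, the whitespace tokenization, and the sample sentences are assumptions for the example.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, rows) where each row holds word counts for one document."""
    tokenized = [doc.lower().split() for doc in documents]
    # one column per distinct word, in alphabetical order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


vocabulary, rows = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocabulary)
print(rows)
```

Word order is discarded: each document becomes a row of counts, one column per vocabulary word.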

Answer 58

vectorizer.vocabulary_.get("word")

Answer 59

remove stopwords

Answer 60

a high frequency, low information word e.g. "the"

Answer 61

import nltk | nltk.download()  # download all the files | from nltk.corpus import stopwords | sw = stopwords.words("english")

Answer 62

from nltk.stem.snowball import SnowballStemmer | stemmer = SnowballStemmer("english")

Answer 63

use a stemmer

Answer 64

weights words that are rare in the corpus as a whole more highly.
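
A small sketch of that weighting, using the plain textbook formula (term frequency times the log of inverse document frequency). scikit-learn's TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ; the function name `tf_idf` and the sample documents are assumptions.

```python
import math


def tf_idf(documents):
    """Return one {word: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # number of documents each word appears in
    doc_freq = {}
    for tokens in tokenized:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    scores = []
    for tokens in tokenized:
        row = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)
            idf = math.log(n_docs / doc_freq[word])  # 0 when the word is in every document
            row[word] = tf * idf
        scores.append(row)
    return scores


scores = tf_idf(["the cat sat", "the dog barked", "the cat purred"])
print(scores[0])
```

In this example "the" appears in every document and scores zero, while "sat" appears in only one document and scores higher than "cat".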

Answer 65

del df["Column"]

Answer 66

df["Column name"].str.contains("string")

Answer 67

y = df.pop("Column name")

Answer 68

x = pandas.get_dummies(df)

Answer 69

df.filter(regex="Column String")

Answer 70

turn it into a string first. | df["Boolean Column"].astype(str).replace({"True":1})

Answer 71

SelectPercentile method

Answer 72

term frequency–inverse document frequency

Answer 73

from sklearn.feature_extraction.text import TfidfVectorizer | TfidfVectorizer(stop_words="english")

Answer 74

from sklearn.ensemble import RandomForestClassifier | model = RandomForestClassifier()

Answer 75

TfidfVectorizer(stop_words="english", max_df=0.5)

Answer 76

the model doesn't fit the training data well

Answer 77

the model fits the training data too well and poorly predicts the test data due to overfitting

Answer 78

regularization

Answer 79

lasso regression

Answer 80

samples by features and labels

Answer 81

df_train.columns.difference(df_test.columns)

Answer 82

from sklearn.feature_extraction import DictVectorizer | vectorizer = DictVectorizer(sparse=False) | x_train = vectorizer.fit_transform(df_train.T.to_dict().values()) | x_test = vectorizer.transform(df_test.T.to_dict().values())

Answer 83

vectorizer.transform(df_test.T.to_dict().values())

Answer 84

Numerical values. Only categorical.

Answer 85

after scaling the numerical data, since it will not alter numbers and will fix the categorical data into binary.

Answer 86

from sklearn.preprocessing import MinMaxScaler | dfTest[['A','B']] = MinMaxScaler().fit_transform(dfTest[['A','B']])

Answer 87

vectorizer.get_feature_names()

Answer 88

Add the current datetime to the title

Answer 89

Translation and rotation

Answer 90

from sklearn import grid_search

Answer 91

DDL and DML

Answer 92

data definition language

Answer 93

schema, the structure of the tables

Answer 94

data manipulation language

Answer 95

creating, reading, updating, and deleting data.

Answer 96

container for groups of tables.

Answer 97

CREATE SCHEMA IF NOT EXISTS my_database_name DEFAULT CHARACTER SET utf8;

Answer 98

USE my_database_name;

Answer 99

stop running, while a warning will allow it to continue

Answer 100

get the dict that replaced the values with numbers use dict comprehension to reverse the keys with values my_new_dict.get the key from the reversed dict.
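
The steps above might look like this in practice. The dictionary contents and variable names are hypothetical.

```python
# The dict that was used to replace the values with numbers (made-up example)
label_to_number = {"red": 0, "green": 1, "blue": 2}

# Dict comprehension that swaps the keys and values
number_to_label = {number: label for label, number in label_to_number.items()}

# Look up the original label from a number, e.g. a model's prediction
predicted = 2
original_label = number_to_label.get(predicted)
print(original_label)
```

This only works cleanly when every value in the original dict is unique, since duplicate values would collapse into one key.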

Answer 101

Centre of the data and the principal axis of variation

Answer 102

for column in df: | print(df[column].value_counts())

Answer 103

mask = df.isnull().any(axis=1) | df[mask]

Answer 104

df[["Column","Column 2"]] = df[["Column","Column 2"]].apply(lambda x: x*5)

Answer 105

inverse_transform

Answer 106

model.score(x_test, y_test)

Answer 107

Text format as well as number

Answer 108

a df with column labels, so that the DictVectorizer can know which columns each value belongs to

Answer 109

Have all of the columns' data present, because the DictVectorizer will transform the format of the columns you feed it to the same format as the fitted training set.

Answer 110

put the values in a list.

Answer 111

assumed to have a value of zero.

Answer 112

fill any nan cells with something, otherwise the resulting array won't work

Answer 113

either reassign the variable or add the parameter inplace=True

Answer 114

my_string.replace("_", "", 2).find("_")

Answer 115

the output is continuous

Answer 116

how spread out the data is.

Answer 117

compresses data points onto the lines of maximal variance, to minimize information loss, and uses those lines as the principal components, which are ranked by how much variance they capture
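
For two-dimensional data the first principal component can be worked out by hand, which may help make the idea concrete. This sketch uses the closed-form eigenvector of a 2x2 covariance matrix rather than scikit-learn; the function name `first_principal_component` and the sample points are assumptions.

```python
import math


def first_principal_component(points):
    """Return (centre, direction, explained_ratio) for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # entries of the covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # angle of the axis of maximal variance (leading eigenvector)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # largest eigenvalue divided by total variance
    total = sxx + syy
    largest = (total + math.hypot(sxx - syy, 2 * sxy)) / 2
    return (mean_x, mean_y), direction, largest / total


# Hypothetical points lying exactly on the line y = x
centre, direction, ratio = first_principal_component([(0, 0), (1, 1), (2, 2), (3, 3)])
print(centre, direction, ratio)
```

Because these points sit on a single line, one component along that line captures all of the variance, so the explained ratio is 1.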

Answer 118

eigenfaces

Answer 119

from sklearn.decomposition import PCA | pca = PCA(n_components=2) | pca.fit(x_train)

Answer 120

how much of the variation each principal component has.

Answer 121

pca.explained_variance_ratio_

Answer 122

pca.components_

Answer 123

facial recognition, by using it as pre-processing for an SVM, as it finds the trends in the images.

Answer 124

pca = RandomizedPCA(n_components=150, whiten=True) | pca.fit(x_train) | x_train_pca = pca.transform(x_train) | x_test_pca = pca.transform(x_test)

Answer 125

GridSearchCV

Answer 126

from sklearn.grid_search import GridSearchCV | from sklearn.svm import SVC | param_grid = {"C": [0.1, 0.5, 1.0], "gamma": [5, 8, 9], "kernel": ('linear', 'rbf')} | model = GridSearchCV(SVC(), param_grid) | model.fit(x_train, y_train) | model.best_estimator_

Answer 127

a triple with the parameter combination, its average score, and its separate scores by fold.

Answer 128

refits to the best parameters

Answer 129

df.columns = df.columns + "my_string"

Answer 130

df["Column"] = df["Column"].str.replace("before","after")

Answer 131

for item in model.grid_scores_: | print(item)

Answer 132

either in a list or in parentheses, but the rest must be in a list.

Answer 133

average validation score

Answer 134

folds within training set.

Answer 135

CREATE TABLE IF NOT EXISTS tablename (columnname TEXT NOT NULL);

Answer 136

CREATE TABLE IF NOT EXISTS table (columnname INTEGER) ENGINE InnoDB;

Answer 137

INSERT INTO tablename (column_2, column_1) VALUES ("String", 100);

Answer 138

INSERT INTO tablename (column, column_2) VALUES ("String", 100), (NULL, 2000);

Answer 139

x_train. x_test should only receive a transform in preprocessing/feature extraction and predict/validate from the model.

Answer 140

mapper = DataFrameMapper([ ("column1", sklearn.preprocessing.LabelBinarizer()), ("column2", sklearn.preprocessing.MinMaxScaler()), ("column3", sklearn.feature_extraction.text.TfidfVectorizer()), ("column4", None) ]) mapper.fit_transform(df.copy())

Answer 141

from sklearn_pandas import DataFrameMapper | from sklearn_pandas import cross_val_score

Answer 142

The columns that you passed into the DataFrameMapper

Answer 143

ignore them, since there is no column to put them in.

Answer 144

all columns must be present, but the values can be "" so it will be ignored by the mapper.

Answer 145

the order given to the mapper.

Answer 146

mapper = DataFrameMapper([ (["column1", "column2"], sklearn.decomposition.PCA(1)) ])

Answer 147

df["Numeric"] = df["Numeric"].apply(lambda x: x.astype(float)) instead of df["Numeric"] = df["Numeric"].astype(float)

Answer 148

the testing data, because new samples will not have the luxury of being fitted.

Answer 149

after the train_test_split, on just the training data

Answer 150

the mapper is first so it transforms the columns into a format that is compatible with the following transformers

Answer 151

model.score(x_test, y_test)

Answer 152

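```
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('column transformations', mapper),
    ('percenterizer', SelectPercentile()),
    ('classifier', MultinomialNB())
])
```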

Answer 153

A way to cross validate while using all your data rather than losing some data to the testing set. It involves creating multiple bins of samples, one of which is kept out of training to be tested against later, but then it alternates to allow the testing bin to be used for training while another bin becomes the testing one. It then averages the results of the tests on each bin.
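
The rotation of bins described above can be sketched in plain Python. This is not scikit-learn's KFold; the function name `k_fold_indices` is an assumption, and a real run would train and score a model inside the loop and then average the scores.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices), using each fold as the test bin once."""
    indices = list(range(n_samples))
    # spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(6, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in a test bin exactly once and in a training set for all the other folds. The samples are taken in order here, so shuffling them first matters if the data is sorted.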

Answer 154

from sklearn.cross_validation import KFold

Answer 155

number of samples, and how many folds

Answer 156

method that runs right when a class is instantiated

Answer 157

```
class Myclass:
    def __init__(self, **args):
        # fall back to a default when the keyword is not supplied
        self.my_attribute = args.get("Name", "Bill")
        self.my_attribute2 = args.get("Age", 20)
```

Answer 158

self argument. def my_method(self):

Answer 159

the data from the instance you are calling the method on.

Answer 160

access the attributes of the instance you are calling the method on.

Answer 161

(in a file called my_class.py)
```
class Myclass:
    def my_method(self):
        return "hi"
```
(console)
```
from my_class import Myclass
my_instance = Myclass()
my_instance.my_method()
```

Answer 162

```
class Myclass:
    my_attribute = 10

    def my_method(self):
        return self.my_attribute
```
```
from my_class import Myclass
my_instance = Myclass()
my_instance.my_method()
```

Answer 163

shuffle the samples.

Answer 164

```
class Myclass:
    def __init__(self, arg_1="default1", arg_2="default2"):
        self.arg_1 = arg_1
        self.arg_2 = arg_2

    def my_method(self):
        return self.arg_1*2
```

Answer 165

import random | random.randint(1,10)

Answer 166

my_browser.maximize_window()

Answer 167

an object with attributes and methods

Answer 168

at instantiation.

Answer 169

```
import numpy

# boolean mask: roughly 80% True (train), 20% False (test)
mask = numpy.random.rand(len(df)) < 0.8
x_train = df[mask]
x_test = df[numpy.invert(mask)]
```

Answer 170

y = df.pop("Labels")

Answer 171

lb.classes_

Answer 172

df.columns.difference(df2.columns)

Answer 173

df.head() | df.tail() | df.info() | df.describe() | for item in df.columns: print(df[item].value_counts()) | df[df.isnull().any(axis=1)].sort_values(by="Column name", ascending=True) | df.corr() | df.columns

Answer 174

import random | random.choice(my_list)

Answer 175

```
class Myclass:
    def __init__(self, **args):
        self.my_attribute = args.get("arg", "default")
        # turn every keyword argument into an attribute
        for key, value in args.items():
            setattr(self, key, value)
```

Answer 176

class Mysubclass(Myparentclass):

Answer 177

```
class Mysubclass(Myparentclass):
    new_attribute = 10
    overwritten_attribute = "new value"
```

Answer 178

df = pandas.read_csv("/path/file.csv", header=None, names=["Column_name"])

Answer 179

df = pandas.read_csv("/users/student/desktop/neg_list.csv", header=None, names=["URL"])

Answer 180

start with self.my_attribute

Answer 181

do it in the instantiation or in the pipeline, not in the methods. Then in the class type:
```
class Mytransformer(TransformerMixin):
    def __init__(self, **args):
        for key, value in args.items():
            setattr(self, key, value)
```
in the transformer type: my_instance = Mytransformer(param="value")

Answer 182

end lines with a comma

Answer 183

```
import signal

def signal_handler(signum, frame):
    raise Exception("Timed out!")

signal.signal(signal.SIGALRM, signal_handler)

# inside a loop (hence the `continue`):
try:
    signal.alarm(5)  # raise after 5 seconds
except:
    continue
signal.alarm(0)  # cancel the alarm
```

Answer 184

df['column'] = df['column'].map({'category1': 0, 'Category2':1})

Answer 185

numpy.c_[my_numpy_array, my_numpy_array2]

Answer 186

bag of words

Answer 187

return None

Answer 188

my_func = lambda x: x*2 | my_func(2)  # => 4

Answer 189

uses the first two items in the iterable as the function's arguments, then uses the return value as the first argument of the same function, with the next item in the iterable as the second argument, and so on until the iterable is exhausted.
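
Assuming this card describes Python's `reduce` (in `functools` in Python 3), a short sketch of that behaviour follows. The list and the lambda are made up for the example.

```python
from functools import reduce

# Steps: (1 + 2) = 3, then (3 + 3) = 6, then (6 + 4) = 10
total = reduce(lambda accumulated, item: accumulated + item, [1, 2, 3, 4])
print(total)
```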

Answer 190

a neuron that is neither an input nor an output. Usually an inner neuron that receives output from another neuron.

Answer 191

models of artificial neural networks in which feedback loops are possible by having neurons which fire for some limited duration of time, before becoming quiescent.

Answer 192

neural networks where the output from one layer is used as input to the next layer

Answer 193

feedforward neural networks

Answer 194

modified so that small changes in their weights and bias cause only a small change in their output
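
A minimal sketch of a neuron with that property, assuming the card refers to a sigmoid neuron. The function name `sigmoid_neuron` and the numbers are illustrative only.

```python
import math


def sigmoid_neuron(inputs, weights, bias):
    """Weighted sum plus bias, squashed smoothly into the range (0, 1)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))


before = sigmoid_neuron([1.0, 0.5], [0.4, -0.2], bias=0.1)
# Nudge one weight slightly: the output moves slightly too, instead of flipping
after = sigmoid_neuron([1.0, 0.5], [0.41, -0.2], bias=0.1)
print(before, after)
```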

Answer 195

an artificial neuron that takes a number of inputs and returns an output. The output depends on whether the weighted sum of the inputs passes a set threshold.
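
A minimal sketch of such a neuron, assuming binary output. The function name `perceptron`, the weights, and the threshold are chosen for illustration; here they happen to make it behave like a logical AND.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0


# With weights of 1 and a threshold of 1.5, both inputs must be on to fire
weights = [1, 1]
threshold = 1.5
outputs = [perceptron(pair, weights, threshold) for pair in ([0, 0], [0, 1], [1, 0], [1, 1])]
print(outputs)
```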

Answer 196

sigmoid neuron

Answer 197

data_frame = pandas.read_csv("pathfromcurrent/directory.csv") Note: Do not start the path with a slash.

Answer 198

df.head(4)

Answer 199

type(var_name)

Answer 200

for key, value in groupby_var_name: | print(key) | print(value)

Answer 201

Dictionary like structure wherein...

Answer 202

my_dataframe = pandas.DataFrame()

Answer 203

data_frame_var.to_csv("myfolder/name.csv", header=0)

Answer 204

ipython nbconvert --to python ~/sup.ipynb into the console

Answer 205

Its data can be accessed through index and key.

Answer 206

series_var = pandas.Series(my_list_or_dict)

Answer 207

my_series.values

Answer 208

my_series.index

Answer 209

my_series = pandas.Series(my_values_list, index=list_idx)

Answer 210

Slice for only what you want to drop, not keep

Answer 211

place it in /users/username/.ipython/profile_default/startup

Answer 212

ambient authority. Every request from your browser to the bank's server automatically carries your cookie. Even when you are visiting another site.

Answer 213

html_file = open("filename.html","w") | html_file.write("string") | html_file.close()

Answer 214

```
import xml.etree.ElementTree as ET

root = ET.fromstring(r.text)

# then parse it like an nd_array
nd_array = []
for calls in root:
    for call in calls:
        array = [attr.text for attr in call]
        nd_array.append(array)
```

Answer 215

.text .tag .attrib

Answer 216

virtualenv -p /usr/bin/python2.7 path/to/env_folder/ | to activate: source path/to/env_folder/bin/activate | to deactivate: deactivate

Answer 217

import sys | sys.argv | Note: sys.argv[0] is the script name, so the first argument is sys.argv[1]

Answer 218

import os import fnmatch matches = [] for root, dirnames, filenames in os.walk("/users/student/Downloads/www.dmv.org"): for filename in fnmatch.filter(filenames, '*.css'): matches.append(os.path.join(root, filename))

(260 cards)