random Flashcards
Python: When scraping a list from a site
remember that you need to loop over the list items and append each item's text to a new list, and not pass the soup_page object itself into pandas.
Pandas: To reference a df column by its position rather than its name, type
df[df.columns[0]] or df.iloc[:, 0]
(df.columns[0] on its own returns only the column's name.)
Pandas: To filter a column by partial string, type
mask = df["Column name"].str.contains("string")
Selenium: To scroll to the bottom of a page, type
my_browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
To import SelectPercentile, type
from sklearn.feature_selection import SelectPercentile
To set the SelectPercentile percentile, type
SelectPercentile(percentile=20)
sklearn: To create a transformer that turns a column from an integer to a float, type
from sklearn.base import TransformerMixin
class MyTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        # Nothing to learn, so fit just returns the transformer itself
        return self
    def transform(self, X, **transform_params):
        # Convert the whole column at once
        X["Numeric"] = X["Numeric"].astype(float)
        return X
sklearn: To import gradient boosting, type
from sklearn.ensemble import GradientBoostingClassifier
Pandas: To set the columns on read_csv and on a new DataFrame use
read_csv: names=[]
DataFrame: columns=[]
pyautogui: To click somewhere based on a screenshot’s center, type
pixel_x, pixel_y = pyautogui.locateCenterOnScreen("screenshot.png")
pyautogui.click(pixel_x, pixel_y)
pyautogui: To have a dialogue box pop up and confirm that you want to continue, type
pyautogui.confirm("Proceed?")
pyautogui: To find the pixel coordinates of the current mouse position, and then click them, type
current_x, current_y = pyautogui.position()
pyautogui.click(current_x, current_y)
pyautogui: To move the mouse, type
pyautogui.moveTo(100, 150)
pyautogui: To type characters, type
pyautogui.typewrite("My String", interval=0.25)
pyautogui: To take a screenshot and then save it, type
screenshot = pyautogui.screenshot()
screenshot.save("path/screenshot.png")
smtplib: To send a gmail email with an image attachment, type
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.image import MIMEImage
from email.mime.text import MIMEText
my_msg = MIMEMultipart()
my_msg["Subject"] = "My subject"
my_msg.attach(MIMEText("My body message text", "plain"))
fp = open("file/path.png", "rb")
file = MIMEImage(fp.read())
fp.close()
my_msg.attach(file)
server = smtplib.SMTP("smtp.gmail.com:587")
server.ehlo()
server.starttls()
server.login("me@gmail.com", "password")
server.sendmail("from@gmail.com", ["to@gmail.com"], my_msg.as_string())
server.quit()
smtplib: To send a gmail email with a csv attachment, type
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email.mime.text import MIMEText
from email import encoders
my_msg = MIMEMultipart()
my_msg["Subject"] = "My subject"
my_msg.attach(MIMEText("My body message text", "plain"))
fp = open("/path/filename.csv", "rb")
file = MIMEBase("application", "octet-stream")
file.set_payload(fp.read())
fp.close()
encoders.encode_base64(file)
file.add_header("Content-Disposition", "attachment", filename="filename.csv")
my_msg.attach(file)
server = smtplib.SMTP("smtp.gmail.com:587")
server.starttls()
server.login("me@gmail.com", "password")
server.sendmail("from@gmail.com", ["to@gmail.com"], my_msg.as_string())
server.quit()
Math: ROI is
net profit divided by cost: (revenue - cost) / cost. (Revenue divided by cost on its own is ROAS, return on ad spend.)
Python: To create a special method that returns a string when print is called on a class instance, type
class Myclass(Parentclass):
    def __str__(self):
        return "This string is returned when print(my_instance) is called"
Python: To create an __init__ method that prompts for an input upon instantiation, type
class Myclass:
    def __init__(self, **args):
        self.my_attribute = input("Prompt string")
Python: To create an __init__ method that calls a method that then prompts for an input, type
class Myclass:
    def __init__(self, **args):
        self.my_attribute = self.input_method()
    def input_method(self):
        my_attribute = input("Prompt string")
        return my_attribute
Python: To combine two columns together so their rows are both available in every iteration of a for loop, type
for item1, item2 in zip(df["column1"], df["column2"]):
    print(item1, item2)
Python: A generator expression is
like a list comprehension but written with parentheses. It produces its items lazily, one at a time, so it can be passed into a function without first building a list in memory.
Python: To write a generator expression, type
(item for item in my_list if item >5)
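A minimal sketch of the difference between the two forms, using a made-up my_list. It is only an illustration of the lazy behaviour, not a benchmark:

```python
my_list = [3, 8, 1, 12, 5, 7]  # hypothetical sample data

# A list comprehension builds the whole list in memory first.
as_list = [item for item in my_list if item > 5]

# A generator expression yields matching items one at a time, on demand.
as_generator = (item for item in my_list if item > 5)

# When it is the only argument to a function, the extra parentheses can be dropped.
total = sum(item for item in my_list if item > 5)

first_pass = list(as_generator)   # consumes the generator
second_pass = list(as_generator)  # empty: a generator can only be consumed once

print(as_list, total, first_pass, second_pass)
```

The second pass coming back empty is the main practical difference to remember.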
Python: To iteratively replace a list of characters with spaces, type
for item in [".", "?", "!"]:
    text = text.replace(item, " ")
Python: To remove most of the html, style and scripts from a page's source, type
soup_page = BeautifulSoup(page, "html.parser")
for script in soup_page.find_all(["script", "style"]):
    script.extract()
text = soup_page.get_text()
for item in [".", "?", "!", ",", " "]:
    text = text.replace(item, " ")
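Pandas: To append a row of data to an existing df that is empty while also setting its column labels, type
df = pandas.concat([df, pandas.DataFrame([{"column1": "value", "column2": "value", "column3": "value"}])], ignore_index=True)
(df.append(..., ignore_index=True) did this in older versions but was removed in pandas 2.0.)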
sklearn: A confusion matrix is a
square matrix (2x2 for two classes, n x n for n classes) with rows indexed by actual class and columns indexed by predicted class, that counts how many samples of each class were correctly or incorrectly predicted. The off-diagonal cells show how many false positives and false negatives there are.
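To make the layout concrete, here is a small sketch that builds a confusion matrix by hand with the standard library. The labels and predictions are made up, and this is not how sklearn implements it internally:

```python
from collections import Counter

# Hypothetical actual labels and model predictions
actual = ["cat", "cat", "dog", "dog", "dog", "bird"]
predicted = ["cat", "dog", "dog", "dog", "cat", "bird"]

labels = sorted(set(actual))  # ['bird', 'cat', 'dog']
pair_counts = Counter(zip(actual, predicted))

# Rows are the actual class, columns are the predicted class.
matrix = [[pair_counts[(a, p)] for p in labels] for a in labels]

for label, row in zip(labels, matrix):
    print(label, row)
```

The numbers on the diagonal are the correct classifications; everything else is a mistake of one kind or another.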
sklearn: Generally it seems for text it is best to
Not use select percentile
use a stemmer
remove irrelevant symbols and characters
use Tfidf instead of CountVect
sklearn: When choosing the training data it is pivotal to
not have mislabeled data
sklearn: When choosing the features, use
all the features you think have high information gain, and do whatever is necessary to get them into the dataset.
sklearn: It is very unlikely for NearestNeighbors
to outperform other models, and if it does it may be because the data has duplicates
Python: To open a file on windows in its default application, type
os.system("start /file/path.csv")
sklearn: The default settings for GridSearchCV should be
if using DataFrameMapper
sklearn_pandas.GridSearchCV(pipeline, param_grid=param_grid, verbose=3, scoring="accuracy", cv=10)
sklearn: When data is missing, it can be useful to
impute the data based on hints in the other columns. eg Mr. is associated with older age.
sklearn: To GridSearchCV the parameters of a model nested in a pipeline, type
import sklearn_pandas
param_grid = {"setname__parameter": [10, 20, 30]}
grid_model = sklearn_pandas.GridSearchCV(pipeline, param_grid=param_grid, verbose=3, scoring="accuracy", cv=10)
sklearn: To use sklearn_pandas.GridSearchCV, you cannot
have any custom transformer (I think)
adwords: In upgraded URLs, curly brackets with an underscore in the tracking template means,
That it is a variable name that must be assigned in one of the custom parameters.
adwords: Always export reports as
a normal CSV, not the Excel type
Pandas: To move a column (here "Conversions") to the front, type
df = df.reindex(columns=["Conversions"] + [item for item in df.columns if item != "Conversions"])
(reindex_axis did this in older versions but was removed in pandas 1.0.)
Python: To inherit from two parent classes, type
class Myclass(Parentclass1, Parentclass2):
GridSearchCV: To print all of the grid scores as GridSearchCV is working, type
verbose=3
sklearn_pandas.GridSearchCV(pipe, param_grid, verbose=3)
GridSearchCV: To choose the number of folds GridSearchCV creates, type
cv=10
sklearn_pandas.GridSearchCV(pipe, param_grid, verbose=3, scoring="accuracy", cv=10)
Python: To make a class that prompts for an input upon init and if it is not part of a list, it asks for the input again, type
class Myclass:
    def __init__(self, **args):
        self.my_attribute = self.get_attribute()
    def get_attribute(self):
        my_attribute = input("Attribute query?")
        # Ask again until the answer is one of the allowed values
        if my_attribute not in ["chosen attribute"]:
            return self.get_attribute()
        return my_attribute
Python: To perform addition on just one index of a list, type
my_list[2] += 1
or
my_list[2] = my_list[2] +1
Python: DRY means
grouping common operations into functions and common functionality in classes.
Python: To borrow classes from another python file in the same directory, type
from otherclassfile import Myotherclass
class Myclass(Myotherclass):
Python: To extend two classes, type
class Myclass(Otherclass, Otherclass2):
Python: To override a function that is inherited from a class just
create a new function in the current class with the same name
Python: Before inheriting a class make sure to
import it.
Python: In order to set attributes on each instance of a class, they are normally assigned
in def __init__(self, **args): as self.attribute_name = value
Python: All classes implicitly extend from
the object class
Python: The format for a set is
{1,2,3}
Python: A set automatically
removes any duplicate values. A set is unordered, so do not rely on the order of its items; use sorted(my_set) if you need them in ascending order.
Python: Can a set use a comprehension?
yes
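A short sketch of the set behaviour described in the cards above, using a made-up list. The variable names are only for illustration:

```python
my_list = [3, 1, 3, 2, 1]  # hypothetical data with duplicates

# Converting to a set drops the duplicates.
my_set = set(my_list)
unique_count = len(my_set)

# Sets are unordered, so sort explicitly when order matters.
ordered = sorted(my_set)

# A set comprehension uses curly brackets, like a set literal.
squares = {n * n for n in my_list}

# Note: {} creates an empty dict, so an empty set is written set().
empty = set()

print(unique_count, ordered, sorted(squares), type(empty).__name__)
```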
Python: When a tuple is a parameter into a class it requires
its own parentheses
Pandas: To set the maximum rows and columns to display, type
pandas.set_option("display.max_rows", 1000)
pandas.set_option("display.max_columns", 1000)
Python: A dependency is
an external module or package that the file you are running must import in order to work.
Python: To sum a list, type
sum(my_list)
sklearn: for RandomForestClassifier, the parameters you should GridSearchCV are
n_estimators, max_features, max_depth, min_samples_leaf
sklearn: If the desired output consists of one or more continuous variables
the task is called regression.
sklearn: In a confusion matrix it is ideal for the
main diagonal (top left to bottom right) to have the highest numbers, because those cells count correct classifications.
sklearn: Recall is
the rate at which samples that are in fact a certain class get classified as that class. "This class only gets classified correctly x percent of the time." When a sample is in fact a certain class, how often is it classified correctly?
“in fact”
Low recall means many false negatives for a class.
true positives/(false negatives + true positives)
sklearn: Precision is
the rate at which a classification, once made, is correct.
“When a classification is finally made for this sample we are x percent sure that it was made correctly”
It drops when samples of other classes get mistaken for this class.
“when classification is made”
Low precision means many false positives for a class.
true positives/(false positives + true positives)
sklearn: When using an unbalanced dataset with many samples of one class and few of another, it is better to use the evaluation metrics of
precision, recall, or f1 score (which combines both) on a class by class basis.
sklearn: f1 score is a combination of
precision score and recall score: their harmonic mean, 2 * precision * recall / (precision + recall).
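The three metrics can be worked out by hand from the counts for one class. A minimal sketch, assuming made-up counts and a hypothetical helper name precision_recall_f1 (it does not guard against zero denominators):

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 correct hits, 2 false alarms, 8 misses
precision, recall, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=8)
print(precision, recall, round(f1, 3))
```

Here precision is high but recall is low, and F1 lands between the two, closer to the lower value.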
GridSearchCV: To use GridSearchCV with the goal metric as f1 type
scoring="f1"
udacity: To predict the time of arrival for at&t techs, udacity
binned the times into sections of the day and used a NearestNeighbors classifier to predict based on locality.
ML: A spurious attribute is an attribute that
should not have any bearing on the label so any features with an information gain lower than the spurious attribute can be ignored.
ML: A plot where the prediction line is very wiggly suggests
overfitting
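ML: The definition of p-value is
the probability, assuming the null hypothesis is true, of getting a result at least as extreme as the one actually observed. A small p-value means the data would be surprising if the null hypothesis held.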
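As a worked illustration of a p-value in a simple case: the chance of seeing 9 or more heads in 10 flips if the coin is fair (the null hypothesis). The function name binomial_p_value and the coin scenario are assumptions made for this sketch:

```python
from math import comb

def binomial_p_value(heads, flips, p_heads=0.5):
    """One-sided p-value: probability of at least `heads` heads in `flips`
    flips, assuming the null hypothesis that each flip lands heads with
    probability p_heads."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

p = binomial_p_value(9, 10)
print(p)  # 11/1024, roughly 0.0107
```

A value this small says that 9 heads out of 10 would be unusual for a fair coin.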
Stats: Stratified sampling is when
The population is grouped by a characteristic, and then a number of samples is pulled from each group to represent it.
Stats: Cluster sampling is when
the population is divided into groups (clusters), and then samples are pulled only from one or a few randomly chosen groups.
Stats: Simple random sampling is when
All of the samples are pooled together and chosen at random, so each one has an equal chance of being picked. Usually a drawn sample is not returned to the pool.
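The three sampling schemes above can be sketched with the random module. The population, the region labels and the sample size of 10 are all made up for illustration:

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: (person_id, region) pairs, 60 north and 40 south
population = [(i, "north" if i < 60 else "south") for i in range(100)]

# Simple random sampling: draw from the whole pool at once.
simple = random.sample(population, 10)

# Stratified sampling: group by a characteristic, then sample every group.
groups = defaultdict(list)
for person in population:
    groups[person[1]].append(person)
stratified = []
for region, members in groups.items():
    share = round(10 * len(members) / len(population))  # proportional allocation
    stratified.extend(random.sample(members, share))

# Cluster sampling: pick one group at random and sample only inside it.
chosen_region = random.choice(list(groups))
cluster = random.sample(groups[chosen_region], 10)

print(len(simple), len(stratified), chosen_region)
```

The stratified sample always contains 6 northerners and 4 southerners, while the cluster sample comes from a single region.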
Console: Vim is a… and the command is
A text editor and the command is: vi my_file.py
Python: To sort a list in place, type
my_list.sort()
Python: To reverse the elements of a list in place, type
my_list.reverse()
Python: To return a count of the occurrences of a value in a list, type
my_list.count("value")
Python: To apply a lambda function to all of a list's items instead of using a list comprehension, type
list(map(lambda x: x*2, my_list))
Python: The map function returns
a map object (a lazy iterator), not a list. Wrap it in list() to get a list.
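A quick sketch of what that map object looks like in Python 3, using a made-up list:

```python
my_list = [1, 2, 3]  # hypothetical data

doubled = map(lambda x: x * 2, my_list)
type_name = type(doubled).__name__  # the class is literally called "map"

# Items are computed only when asked for.
first = next(doubled)
rest = list(doubled)  # whatever has not been consumed yet

print(type_name, first, rest)
```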
Python: To return a list of all of the files and folders in a directory, type
os.listdir("/users/student/desktop")
Python: When I see myself looping and appending, I should question whether
a list comprehension assigned to a variable would do.
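For comparison, here is the same task written both ways, with a made-up list of words:

```python
words = ["alpha", "Beta", "gamma", "Delta"]  # hypothetical data

# Looping and appending
with_loop = []
for word in words:
    if word[0].isupper():
        with_loop.append(word.lower())

# The same result as a list comprehension assigned to a variable
with_comprehension = [word.lower() for word in words if word[0].isupper()]

print(with_loop, with_comprehension)
```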
sklearn: In order to optimize GridSearchCV towards plain f1, recall or precision, you must
make the labels binary (1 and 0) only. For more than two classes, use an averaged scorer such as scoring="f1_macro".
sklearn: To make GridSearchCV run faster, add
n_jobs=-1 to the parameters
sklearn: To save the best GridSearchCV params to a variable, type
best_parameters = grid_search.best_params_
Python: If you edit a class, make sure to
re-instantiate the instance afterwards so it can take on the new attributes.
Pandas: df[“column”].str.contains(“string”) is
case sensitive
sklearn: To reattach the predictions to the samples, type
df["Prediction"] = pandas.Series(model_grid.predict(my_transformer(df_features)))
Selenium: to return the current url, type
my_browser.current_url
sklearn: for small datasets a classifier that can work well is
LinearSVC
sklearn: To return the best params and best score from a grid search, type
model_grid.best_params_
model_grid.best_score_
sklearn: To return the best params from a grid search, type
model_grid.best_params_
numpy: To transpose a numpy array, type
my_array = np.array([[1, 2], [3, 4]])
my_array.T
array([[1, 3],
       [2, 4]])
numpy: To merge two sets of columns, type
numpy.concatenate((a, b), axis=1)
marketing: A burst is
usually incentivized traffic for a short time.
marketing: Incentivized traffic is usually,
a sign up or download in exchange for a bribe like in game currency.
HTTP stands for
HyperText Transfer Protocol
HyperText is
text with links in it
Transfer protocol is
rules for getting data from one place to another
REST stands for
Representational State Transfer (and API stands for Application Programming Interface)
A stateless API means that
all information necessary to respond to a request is available in each individual request; no data, or state, is held by the server from request to request
axis=1 means
columns: the operation runs across the columns, i.e. along each row (axis=0 runs down the rows).
numpy: to check the data type of an array's elements, type
my_numpy_array.dtype
mysql: To delete a table, type
DROP TABLE tablename;
mysql: To delete multiple tables, type
DROP TABLE tablename1, tablename2;
mysql: To insert multiple rows into a table together, type
INSERT INTO tablename VALUES ("String 1", "String 2"), ("String 1", "String 2");
mysql: Strings that you are inserting into a table must be
surrounded by quotes
re: To test if a regex matches a string in the python interpreter, type
import re
re.match(r'^org/?P\w+/$', 'org/companyA')
re: To create the variable that will be parseable by re from a txt file, type
import re
file = open("my_file.txt", encoding="utf-8")
data = file.read()
file.close()
python: To chain multiple ands and ors into an if statement, type
if (True and True) and (False or True) or (False and False):
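A small sketch of how such a chain evaluates. The key point is that and binds more tightly than or, so parentheses can change the answer. The variable names are only for illustration:

```python
# The chained condition from the card, evaluated step by step:
# (True and True) -> True, (False or True) -> True, (False and False) -> False
result = (True and True) and (False or True) or (False and False)

# Without parentheses, "and" is evaluated before "or".
without_parens = True or False and False  # read as: True or (False and False)
with_parens = (True or False) and False

print(result, without_parens, with_parens)
```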
python: This prints
if True and False:
    print("Hi")
nothing
python: This prints
if True or False:
    print("Hi")
Hi
python: To turn ["A"] into ["A", "A", "A", "A"], type
["A"] * 4
pandas: For plots to display in the notebook, type
%pylab inline
Pandas: To change column names, use the command
rename, not replace
Excel: When doing vlookup, set approx match to
0
Pandas: to remove the index when sending df to string, type
df.to_string(index=False)
flask: In order to use a file from the templates directory in a view function, type
from flask import render_template
@app.route("/")
def my_view_function():
    return render_template("file.html")
flask: To open spots in the html template that are variable from the view, you need to
put {{ var_name }} in the template and pass the variable into render_template, like return render_template("file.html", var_name=var_name)
flask: when creating views remember to
set defaults for the variables that are supposed to be passed in.
flask: {{ var_name }} is used in
templates to pull a variable into it from the view.
flask: all html files must go in
the templates directory
flask: the symbols for variable and blocks are
variable: {{ var_name }}
block: {% block my_block %}{% endblock %}
flask: the html file you are extending must
have quotes around it
flask: html pages, after extending from layout.html and removing repeated html, look like
{% extends "layout.html" %}
{% block title %}{{ super() }} My Title Tag{% endblock %}
{% block body_content %}
<h1>This is the content of my body</h1>
{% endblock %}
flask: To have a views route redirect you to another view function, type
from flask import redirect
from flask import url_for
@app.route("/save")
def save():
    return redirect(url_for("view_function"))
flask: to make a view function only allow post methods to access it, type
@app.route("/save", methods=["POST"])
flask: to access the form data POSTed into a view, type
request.form
flask: To set a cookie by instantiating a make_response object, type
import json
from flask import make_response, redirect, request, url_for
@app.route("/save", methods=["POST"])
def save_view():
    # url_for takes the name of a view function, not a template file
    response = make_response(redirect(url_for("index")))
    response.set_cookie("cookie_name", json.dumps(dict(request.form.items())))
    return response
flask: to create a form whose action is to send a POST request to a view (which is later made to set a cookie), type
{% block my_body %}
<form action="{{ url_for('save') }}" method="POST">
<label>Form title</label>
<input type="text" name="name" value="" autofocus>
<input type="submit" value="default!">
</form>
{% endblock %}
flask: To be able to accept a form POST request, you must first
import request in the app file
from flask import request
flask: In flask the cookie is set upon
the response to the browser
flask: When setting a cookie with a POST request from a form, the value of the cookie becomes
a dict with the key as the name from the name parameter in the form, and the value as the value inputted into the form.
flask: Cookies set on a browser have both
a name (which you give) and a value which is a dict with the name and value from the form field.
flask: To create a view meant for setting a cookie, type
def get_cookies():
    try:
        cookie = json.loads(request.cookies.get("cookie1"))
    except (TypeError, ValueError):
        # No cookie yet, or it did not contain valid JSON
        cookie = {}
    return cookie
@app.route("/save", methods=["POST"])
def save():
    response = make_response(redirect(url_for("index")))
    cookie = get_cookies()
    cookie.update(dict(request.form.items()))
    response.set_cookie("cookie1", json.dumps(cookie))
    return response
flask: To make the default value of a form the value from a cookie, type
Create the function that returns the cookie in dict format.
def get_saved_data():
    try:
        data = json.loads(request.cookies.get("cookie name"))
    except (TypeError, ValueError):
        data = {}
    return data
Pass the cookie dict into the template.
@app.route("/")
def index():
    data = get_saved_data()
    return render_template("index.html", data=data)
Set the value in the form.
flask: In the view that just received a POST request, to get all the form keys and values, type
request.form.items()
flask: To use a for loop in flask, type
{% for item in my_list %}
<li><h2>{{ item }}</h2></li>
{% endfor %}
python: When you are inheriting from two parent classes, the order should be
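most important parent class last (mixins first). Python searches the listed parents from left to right, so an earlier parent's method overrides a later parent's method of the same name.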
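To see how the order of the parents affects which method wins, here is a minimal sketch with hypothetical classes Parent1, Parent2 and Child:

```python
class Parent1:
    def greet(self):
        return "Parent1"

class Parent2:
    def greet(self):
        return "Parent2"

# Parents are searched left to right, so Parent1.greet is found first.
class Child(Parent1, Parent2):
    pass

greeting = Child().greet()
lookup_order = [cls.__name__ for cls in Child.__mro__]

print(greeting, lookup_order)
```

Child.__mro__ shows the full lookup order, which is handy when the inheritance tree gets deeper than this.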
flask: The three main files that go in the directory are
app.py, templates, static
flask: In a flask app the css file is linked with
<link rel="stylesheet" href="../static/styles.css">
flask: To create a form meant for file uploads, type
<form action="" method="POST" enctype="multipart/form-data">
<input type="file" value="value" name="name">
</form>
flask: When I reference images from within an html view, it assumes
I am already in my static directory, so I can reference the files directly without changing levels.
flask: I must keep all static files without exception in
static/
flask: To upload a file, type
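import os
from flask import Flask, request, redirect, url_for
from werkzeug.utils import secure_filename
ALLOWED_EXTENSIONS = {'txt', 'pdf', 'png', 'jpg', 'jpeg', 'gif'}
app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = '/home/alpalalpal/mysite/static'
def allowed_file(filename):
    # Only accept names with an extension from the allowed set
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
@app.route('/4', methods=['GET', 'POST'])
def upload_file():
    if request.method == 'POST':
        file = request.files['file']
        if file and allowed_file(file.filename):
            filename = secure_filename(file.filename)
            file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
            # Assumes a view function named uploaded_file is defined elsewhere
            return redirect(url_for('uploaded_file', filename=filename))
    return '''
    <title>Upload new File</title>
    <h1>Upload new File</h1>
    <form method="post" enctype="multipart/form-data">
      <p><input type="file" name="file"><input type="submit" value="Upload"></p>
    </form>
    '''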
flask: Loops inside flask blocks do not require
a colon
pyautogui: To scroll down, type
pyautogui.scroll(-10)
python: list(range(1,2)) has
1 item
Outbrain: The number of characters allowed in ads is
150
Hadoop is
an open-source software framework written in Java for distributed storage and distributed processing of huge data sets.
Statically typed programming languages
do type checking, which is verifying and enforcing the constraints of types at compile-time as opposed to run-time.
MapReduce is
a programming model that lets you process and query data in parallel on a distributed cluster of computers: a map step emits key/value pairs and a reduce step combines the values for each key.
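The idea can be sketched on a single machine with a word count, the usual introductory example. This is plain Python, not Hadoop code, and the function names map_phase, shuffle and reduce_phase are made up for the sketch:

```python
from collections import defaultdict
from functools import reduce

documents = ["the cat sat", "the dog sat", "the cat ran"]  # hypothetical input

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(values):
    """Reduce: combine the values for one key into a single result."""
    return reduce(lambda a, b: a + b, values)

# On a real cluster each document could be mapped on a different node.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = {word: reduce_phase(values) for word, values in shuffle(mapped).items()}

print(counts)
```

Because each map call only looks at its own document and each reduce call only looks at one key, both steps can be spread across many computers.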
Big Data refers to at least
a terabyte of data
The four V’s of IBM’s definition of big data are
volume, variety, veracity and velocity
Apache Mahout is
a library of scalable machine-learning algorithms, implemented on Apache Hadoop
Hadoop: HDFS stands for
Hadoop Distributed File System
Hadoop: A cluster usually has
one heavy-duty computer and then 10-15 commodity computers
A node is a
single point in a network or single computer in a cluster.
os: To change your current working directory from within python, type
import os
os.chdir(r"C:\folder")
ipython: Sometimes when there is an unusual error it can be ameliorated by
restarting the kernel
cookies: For security purposes, cookies can only be accessed by
the site that placed them.
python: To create a decorator with no arguments, type
def log(func):
    def inner():
        print("string")
        return func()
    return inner
@log
def say_hello():
    return "Hello there!"
say_hello()
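The decorator above only works for functions that take no arguments. A common variation, sketched here with a hypothetical say_hello that takes a name, forwards any arguments and uses functools.wraps so the wrapped function keeps its own name:

```python
import functools

def log(func):
    @functools.wraps(func)  # keep func's name and docstring on the wrapper
    def inner(*args, **kwargs):
        print("Calling", func.__name__)
        # Pass along whatever arguments the caller supplied
        return func(*args, **kwargs)
    return inner

@log
def say_hello(name):
    return "Hello there, " + name + "!"

greeting = say_hello("Ada")
print(greeting)
```

Without functools.wraps, say_hello.__name__ would report "inner", which can make debugging confusing.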