misc learnings Flashcards
Python: in an interactive session (REPL, IPython, Jupyter), use “_” to get the output of the last evaluated expression
Iterators: don’t wrap list() around them while actually iterating, only beforehand to check that the contents look correct (then re-create the iterator, because list() uses it up). Why?
Iterators take up almost no memory; they’re just one-use disposable instructions to get “the next item” using next(it) (.next() was the Python 2 spelling). As soon as it moves on to the next item, the previous item can be freed from memory unless something else still references it.
iter() # create iterator out of any iterable (e.g., out of a list)
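“yield” is the generator’s equivalent of “return”: it hands back one value, then pauses the function until the next item is requested. A generator function is the easy way to write your own iterator.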
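A minimal sketch of the iterator notes above. The names countdown, it, and the sample lists are made up for illustration; the point is that an iterator is used up as it goes, and that a generator function produces values one at a time via yield.

```python
def countdown(n):
    """Generator: yields n, n-1, ..., 1 one value at a time."""
    while n > 0:
        yield n  # hand back one value, then pause here until next() is called again
        n -= 1

it = iter([10, 20, 30])   # iterator made from a list
first = next(it)          # consumes the first item
rest = list(it)           # consumes everything that is left
leftover = list(it)       # nothing left: the iterator is exhausted

countdown_values = list(countdown(3))
```

This is why list() is only for a quick sanity check: after it runs, the iterator has nothing left to give.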
What’s a function decorator and its syntax?
Meaning: modifies behavior of function in some way. E.g.:
from ediblepickle import checkpoint
@checkpoint # Modifies an API-reading function so that, if that exact query was already run and saved to disk, it reads the saved result instead of calling the API again
Syntax: “@…” sits on top of a function definition, like literalleh (the line right above the “def” line)
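To see what the “@” line is doing, here is a hand-written decorator. This is a sketch only: cache_results and slow_square are hypothetical names, and it caches in memory rather than on disk, so it is not how ediblepickle works internally, just the same general idea.

```python
import functools

def cache_results(func):
    """Hypothetical decorator: remember results per argument tuple (in memory)."""
    cache = {}

    @functools.wraps(func)          # keep the wrapped function's name and docstring
    def wrapper(*args):
        if args not in cache:
            wrapper.misses += 1     # count how often the real function runs
            cache[args] = func(*args)
        return cache[args]

    wrapper.misses = 0
    return wrapper

@cache_results                      # same as: slow_square = cache_results(slow_square)
def slow_square(x):
    return x * x

first = slow_square(4)              # runs the real function
second = slow_square(4)             # served from the cache
```

The decorator takes a function and returns a replacement function; the “@” syntax just does that reassignment for you.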
Syntax to read lines from standard input (stdin) / “inputs with a handle” (like in many HackerRank problems)?
import sys
_ = sys.stdin.readline() # reads the first line (trailing newline included); this works like an iterator, so the next identical call will actually read the 2nd line, etc.
sys.stdin.readlines() # reads all remaining lines into a list
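A small sketch of the pattern above. The input layout (a count on the first line, then one integer per line) is an assumption, and read_problem is a hypothetical helper; io.StringIO stands in for sys.stdin so the example runs without piped input.

```python
import io
import sys

def read_problem(stream=sys.stdin):
    """Read a count line, then that many integer lines (assumed input layout)."""
    n = int(stream.readline())                      # first line: how many values follow
    values = [int(line) for line in stream.readlines()[:n]]
    return n, values

fake_stdin = io.StringIO("3\n10\n20\n30\n")         # stand-in for real stdin
n, values = read_problem(fake_stdin)
```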
What’s a dictionary called in other languages?
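A “hash map” (or hash table). A hash function turns the key into a number that picks a slot (“bucket”) in an underlying array, so a lookup jumps straight to that slot instead of searching through every entry: O(1) on average. Two keys landing in the same slot is a “collision”, which the table has to handle.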
FLESH THIS OUT - IT’S IMPORTANT TO BE ABLE TO ANSWER HOW A HASH TABLE WORKS IN AN INTERVIEW. SEE TDI COURSE NOTES WEEK 3.1 + THE VIDEO OF DON BASICALLY CREATING A HASH PYTHON CLASS FROM SCRATCH.
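A first pass at fleshing this out: a toy hash map built from scratch. This is a sketch under simplifying assumptions (fixed number of buckets, no resizing, collisions handled by keeping a short list per bucket), not the course’s version and not how Python’s dict is implemented internally. SimpleHashMap and its method names are made up.

```python
class SimpleHashMap:
    """Toy hash map using separate chaining (fixed bucket count, no resizing)."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() turns the key into an integer; modulo maps it to a bucket index.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:          # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))          # new key (or a collision): append

    def get(self, key):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        raise KeyError(key)

m = SimpleHashMap()
m.put("apple", 1)
m.put("apple", 2)    # same key: replaces the old value
m.put("pear", 3)
```

Interview version: hash the key, jump to the bucket, and only scan the handful of entries that collided there.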
BeautifulSoup package is simply an HTML parser, not a web scraper. The Requests package does the fetching: it’s an HTTP client that downloads the page, which BeautifulSoup then parses.
Various Python syntax conventions:
“_” is a var that has to be assigned a value but isn’t used in subsequent code. E.g.:
for _,x in enumerate(my_list) if the number part of it won’t be used for anything.
Related: In sklearn, a trailing underscore (e.g., .coef_) indicates that these are things the model LEARNED and weren’t known before fitting it.
Use ALL CAPS for vars that are constants (e.g., a fixed URL, a fixed sleep time between get requests, etc)
pandas.read_html() !
Built-in HTML table reader (needs a parser library such as lxml installed); returns list of dfs, one for each table it finds in the HTML page. Works well with realgm, actually.
What do df.filter(), df.query(), and df.assign() do?
df.filter() # selects rows or columns by their LABELS (not by the data in them); supports items=, like=, and regex=
df.filter(items=['one', 'three'], axis=0)
df.query() # SQL-like way to filter by values, with the expression completely in quotes
df.query("CONTROL == 1 and MAIN == 1")
df.assign() # another way to create a new column; can be chained with other commands
df = df.assign(new_col=lambda df: df.zip_code.str.extract(r"(some_regex)", expand=False)) # str.extract needs a capture group (the parentheses)
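print("A" "B" "C") # adjacent string literals WITHOUT commas are joined into one string ("ABC"); put each piece on its own line inside the parentheses to write a long string without going beyond the screen or using triple quotes or line continuation chars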
NetworkX standard commands
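import networkx as nx
G = nx.Graph() # create empty (undirected) graph
G.add_edges_from(edge_weight_tuples) # Sample tuple: ('Alicia', 'Jerry Rey', {'weight': 1})
# Watch out for adding five “A-B”s and 3 “B-A”s; in an undirected graph they’re the same edge, so the later attributes will overwrite the earlier ones, NOT add to them.
G.nodes
G.edges
G.edges['Alicia', 'Jerry Rey'] # dict-like syntax: attribute dict for that edge. For one node’s neighbors, use G['Andrew Galperin']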
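One way around the overwrite gotcha is to total up the weights before building the graph. A sketch using only the standard library, with made-up pairs; the resulting list has the same shape as the sample tuple above and could then be passed to G.add_edges_from().

```python
from collections import Counter

# Hypothetical raw data: five "A-B" pairs and three "B-A" pairs.
raw_pairs = [("A", "B")] * 5 + [("B", "A")] * 3

# Sort each pair so ("B", "A") and ("A", "B") are counted under the same key.
weights = Counter(tuple(sorted(pair)) for pair in raw_pairs)

# Same shape as the sample tuple: (node, node, {"weight": total}).
edge_weight_tuples = [(u, v, {"weight": w}) for (u, v), w in weights.items()]
```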
“Heap” concept
Basically a binary tree branching downward in which every parent is smaller than or equal to its children (a “min-heap”), so the smallest element is always at the top. When a new element is introduced, it takes relatively little (~log n) computing power to slot it in the correct place.
Python “heapq” module (standard library) is for this. heapq.heapify() rearranges a list in place; it remains a list object but becomes specially re-ordered, with the smallest item at index 0.
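A quick sketch of heapq in action, with made-up numbers. heapq implements a min-heap, meaning the smallest item always sits at index 0.

```python
import heapq

nums = [5, 1, 8, 3]
heapq.heapify(nums)            # reorders in place; nums is still a plain list
smallest = nums[0]             # the heap root is always the smallest item

heapq.heappush(nums, 0)        # insert while keeping the heap ordering
popped = [heapq.heappop(nums) for _ in range(3)]   # pops come out smallest first
```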
What does “greedy” mean?
In regex context: an expression like “\d+” will match the “123” part of “abc123xyz”, not just the “1” part. It keeps going until it can’t no’ mo’, UNLESS you use a “lazy” modifier (a “?” after the quantifier, e.g., “\d+?”).
In other contexts: an algorithm that’s short-sighted and optimizes in a local way without considering the big picture.
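The regex meaning is easy to check directly. A small sketch with the same “abc123xyz” string, plus a made-up HTML snippet, which is where greedy matching tends to cause surprises.

```python
import re

greedy = re.search(r"\d+", "abc123xyz").group()     # grabs as many digits as it can
lazy = re.search(r"\d+?", "abc123xyz").group()      # stops as soon as it has one

html = "<b>one</b><b>two</b>"
greedy_tag = re.search(r"<b>.*</b>", html).group()  # runs to the LAST </b>
lazy_tag = re.search(r"<b>.*?</b>", html).group()   # stops at the FIRST </b>
```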
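Can use triple-quoted “long strings” (“”” “””) anywhere a string is allowed, even inside print() !
They’re also useful with re.compile()’s “verbose” mode (the re.VERBOSE flag), where the pattern can be spread over several lines, with # comments, without the whitespace or newlines counting as part of the regex.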
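A sketch of a verbose regex inside a triple-quoted string. The phone-number pattern and the name PHONE are just an illustration, not from the original notes.

```python
import re

PHONE = re.compile(r"""
    (\d{3})    # area code
    [-.\s]?    # optional separator
    (\d{3})    # prefix
    [-.\s]?    # optional separator
    (\d{4})    # line number
""", re.VERBOSE)

match = PHONE.search("call 555-867-5309 now")
parts = match.groups()
```

In verbose mode, a literal space or “#” in the pattern has to be escaped or put inside a character class, since both are otherwise ignored.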
What’s the StringIO package for?
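Stands for String Input/Output; in Python 3 it lives in the io module (io.StringIO). It wraps a (potentially super long) string in an in-memory file-like object, so anything that expects a file can read from it. E.g., feed it to csv.reader or pandas’ read_csv() to extract the 2nd “column” from CSV text.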
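A sketch of the CSV use case with the standard library’s csv module. The CSV text is made up.

```python
import csv
import io

csv_text = "name,team\nAlicia,Red\nJerry,Blue\n"

fake_file = io.StringIO(csv_text)        # behaves like an open text file
rows = list(csv.reader(fake_file))       # csv.reader accepts any file-like object

second_column = [row[1] for row in rows[1:]]   # skip the header row
```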
In ML, what is a “stateless” transformer?
One that transforms every input in the same predetermined way (e.g., take the log), without needing to learn anything from the data in a “fit” step. (Z-scoring is NOT stateless: it has to learn the mean and std first.) These are easy to create custom because they can be created from a function (sklearn’s FunctionTransformer). In contrast, a stateful transformer is normally written directly as a Class with fit() and transform().
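A plain-Python sketch of the distinction, no sklearn needed. log_transform and ZScoreTransformer are hypothetical names. The function needs nothing but its input; the class has to learn a mean and standard deviation in fit() before it can transform anything, which is what makes it stateful.

```python
import math
import statistics

def log_transform(values):
    """Stateless: the output depends only on the input values."""
    return [math.log1p(v) for v in values]

class ZScoreTransformer:
    """Stateful: must learn the mean and std from training data first."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)    # trailing underscore = learned in fit
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

train = [1.0, 2.0, 3.0]
scaler = ZScoreTransformer().fit(train)
scaled = scaler.transform([2.0, 3.0])            # uses the TRAINING mean and std
logged = log_transform([0.0])
```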
What does “non-parametric” mean?
Simply that the # of parameters isn’t fixed in advance; it can grow with the training data. E.g., a Decision Tree regressor is non-parametric because the # of nodes isn’t known until the model is trained (fitted).
What are the 3 variations on hyperparameter tuning?
- Grid search: brute force - tests every single combo of hyperparameter values. Can quickly get out of hand
- Randomized search: tests a fixed number of randomly sampled combos, so the cost stays capped no matter how big the grid is.
- Bayesian optimization: often the most efficient! Uses the outcome of each hyperparameter combo test to help choose the next hyperparameter combo. Essentially learns from its experience.
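A toy sketch of the first two variations in plain Python. The score function is a hypothetical stand-in for a cross-validated model score, and the hyperparameter names and values are made up; real projects would normally use a library for this.

```python
import itertools
import random

def score(lr, depth):
    """Hypothetical stand-in for a model's validation score (higher is better)."""
    return -(lr - 0.1) ** 2 - (depth - 4) ** 2

lrs = [0.01, 0.1, 1.0]
depths = [2, 4, 8]

# Grid search: evaluate every combination (3 x 3 = 9 evaluations).
grid = list(itertools.product(lrs, depths))
best_grid = max(grid, key=lambda combo: score(*combo))

# Randomized search: a fixed budget of randomly drawn combinations.
rng = random.Random(0)
samples = [(rng.choice(lrs), rng.choice(depths)) for _ in range(4)]
best_random = max(samples, key=lambda combo: score(*combo))
```

The grid always finds the best combo on the grid but costs 9 evaluations; the random version costs 4 regardless of grid size and may or may not land on the best one.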
Why is the sigmoid function, paired with log loss, used for logit? (at a high level)
The sigmoid squashes the model’s raw output into a value between 0 and 1. Log loss (cross-entropy) then penalizes “confident wrongness” a lot more than, say, SSE. That molds the model in a way that it’s less likely to predict the wrong thing out of 0 and 1 with high confidence.
(remember that a logit model’s output is actually a probability between 0 and 1 that gets thresholded, by default at 0.5, to give the 0/1 prediction. E.g., .02 means it’s very confident that it’s a 0)
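To put numbers on “confident wrongness”, here is a sketch comparing log loss (the loss normally paired with the sigmoid in logistic regression) with squared error for a single example whose true label is 1. The probabilities 0.4 and 0.02 are made up.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p):
    """Cross-entropy for one example with true label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(y, p):
    return (y - p) ** 2

# True label is 1 in both cases.
mildly_wrong = (log_loss(1, 0.4), squared_error(1, 0.4))
confidently_wrong = (log_loss(1, 0.02), squared_error(1, 0.02))
```

Squared error can never exceed 1 here, while log loss keeps growing as the predicted probability for the true class approaches 0.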
Python packages involved in creating an interactive web app from beginning (database) to end
Python-SQL interface: sqlalchemy and psycopg2 (in Jupyter magic: “%load_ext sql”, followed by “%%sql” at top of every cell that’ll be SQL code)
“.env” file: for storing & aliasing “secret” info while still sharing code, like the parts of DB or API connection URL that are usernames/pwords
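“SQL injection” attacks: if user input is pasted straight into a SQL string, that input can rewrite the query. The fix is to pass values as query parameters instead of formatting them into the string.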
Virtual environments: a collection of packages isolated from the rest of machine that are just for this project. “requirements.txt” stores the list of packages.
pip install -r requirements.txt
venv module: the standard-library tool that creates the virtual environment in the first place (python -m venv .venv); activate it before running pip install
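On the SQL injection item in the list above, a small runnable sketch using the standard library’s sqlite3 (not the sqlalchemy/psycopg2 stack named above, though the same idea applies there). The table, rows, and the malicious input are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a1"), ("bob", "b2")])

user_input = "nobody' OR '1'='1"     # hypothetical malicious input

# Unsafe: the input is pasted into the SQL text and changes the query's logic.
unsafe_sql = f"SELECT name FROM users WHERE name = '{user_input}'"
leaked = conn.execute(unsafe_sql).fetchall()

# Safe: the value is passed separately, so it is treated purely as data.
safe = conn.execute("SELECT name FROM users WHERE name = ?",
                    (user_input,)).fetchall()
```

The unsafe query comes back with every user because the WHERE clause became “name = 'nobody' OR '1'='1'”; the parameterized one looks for a user literally named that string and finds nobody.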
Altair is essentially the best simulation of D3 in Python!
Very easy to have interactive charts controlling/filtering other charts.
Generates a JSON spec (Vega-Lite) for each chart, which can be saved as a standalone HTML page or embedded within a larger web page!
COULD BE SUPER USEFUL FOR US IN CONFLUENCE OR EVEN POTENTIALLY FOR REPLACING OAC ALTOGETHER.
What’s the idea behind Quicksort algorithm? (one of 3 well-known sorting algorithms)
Recursive algorithm
Pick a “pivot” (here, the item in the rightmost position). Keep swapping items until the pivot sits in its final sorted position, w/everything to the left smaller than it and everything to the right larger than or equal to it (but not necessarily sorted within themselves). Keep repeating on each chunk left & right of the pivot until the chunks are 0 or 1 elements.
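The steps above can be sketched as follows. This is one common variant (rightmost pivot with the “Lomuto” partition scheme), sorting the list in place; other pivot choices and partition schemes exist.

```python
def quicksort(items, lo=0, hi=None):
    """In-place quicksort using the rightmost element of each chunk as the pivot."""
    if hi is None:
        hi = len(items) - 1
    if lo >= hi:                       # chunk of 0 or 1 elements: already sorted
        return items

    pivot = items[hi]
    boundary = lo                      # next slot for an item smaller than the pivot
    for i in range(lo, hi):
        if items[i] < pivot:
            items[i], items[boundary] = items[boundary], items[i]
            boundary += 1
    # Move the pivot between the "smaller" and "larger or equal" chunks.
    items[boundary], items[hi] = items[hi], items[boundary]

    quicksort(items, lo, boundary - 1)   # recurse on the left chunk
    quicksort(items, boundary + 1, hi)   # recurse on the right chunk
    return items

data = [7, 2, 9, 4, 2, 1]
sorted_data = quicksort(data)
```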
Pretty-printing multiple objects (dfs) from one block of code using display(HTML())
(can’t do it with regular print() because it starts looking janky)
from IPython.display import display, HTML
display(HTML(df.to_html()))
How to print all defined variables?
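%whos # IPython/Jupyter magic (plain “whos” also works when automagic is on); in a plain Python session, use dir() or globals()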