Machine Learning & AI Flashcards
9 breakthrough technologies 2022
- unhackable Internet: entangled particles transmitted end to end, unable to be read w/o disrupting content
- hyper-personalized medicine: takes large team to develop treatment for rare condition; solutions using digital speed
- digital currency: has potential if backed by real currency
- anti-aging drugs: senolytics remove senescent cells that create low-level inflammation and toxicity
- AI-discovered molecules: 10^60 chemical molecules possible; use machine learning to speed up process of finding possible drugs
- satellite mega-constellations: blanket the world with high speed internet or junk-ridden minefield
- tiny AI: lower carbon emission; increases speed; increases privacy (local storage)
- differential privacy: inject noise into user data to increase privacy
- climate change attribution: improving techniques to link weather to climate change; disentangling factors
adversarial inputs
- aim at misclassification in order to avoid detection
- e.g. malicious documents designed to evade antivirus, and emails attempting to evade spam filters
- mutated input: attackers actively minimize the classifier's detection rate using undetectable payloads; to develop robust detection systems:
- limit information leakage: don’t provide error codes or confidence values
- limit probing: limit how many payloads they can test (e.g. captcha)
- ensemble learning: combining various detection methods
self-attention

AutoML
- attempt to make machine learning available to people without strong expertise in the field
- automate repetitive tasks which enables a data scientist to focus more on the problem rather than the models
- automate data pipeline components - helps to avoid errors that might slip in with manual processes
baseline models
- calculate
- accuracy
- efficiency
- cost
- examples
- naive bayes
- linear regression
- Markov Model (NLP)
batch normalization
- normalize each layer's activations to N ~ (0, 1)
- this is where activations are the most dynamic
- do this to every layer
- keep an exponentially-smoothed (running) average of the batch statistics for test time
- removes the need for a bias term
- need the batch and running mean and variance
- regularizes like noise injection because the batch statistics add an error term to the normalization
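A minimal NumPy sketch of the batch-norm forward pass described above (training mode, 2-D batch of samples x features; names and the decay value are illustrative):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, running_mean, running_var, decay=0.9, eps=1e-5):
    # batch statistics (training mode)
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    # exponentially-smoothed running statistics, used at test time
    running_mean = decay * running_mean + (1 - decay) * mu
    running_var = decay * running_var + (1 - decay) * var
    # normalize to ~N(0, 1), then scale and shift (beta plays the role of the bias)
    X_hat = (X - mu) / np.sqrt(var + eps)
    out = gamma * X_hat + beta
    return out, running_mean, running_var

X = np.random.randn(32, 8) * 3 + 2          # a batch of 32 samples, 8 features
gamma, beta = np.ones(8), np.zeros(8)
out, rm, rv = batchnorm_forward(X, gamma, beta, np.zeros(8), np.ones(8))
```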
bayes rule
- p(y|x)=p(x|y)p(y)/p(x)
- likelihood*prior/normalizer
- prior p(y)=instances/total
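A tiny worked example of the rule with made-up spam-filter numbers (all probabilities hypothetical):

```python
# p(y|x) = p(x|y) * p(y) / p(x), with p(x) as the normalizer
p_y = {"spam": 0.3, "ham": 0.7}                    # priors: instances / total
p_x_given_y = {"spam": 0.8, "ham": 0.1}            # likelihood of seeing the word "free"

p_x = sum(p_x_given_y[y] * p_y[y] for y in p_y)    # normalizer
posterior = {y: p_x_given_y[y] * p_y[y] / p_x for y in p_y}
print(posterior)   # {'spam': ~0.774, 'ham': ~0.226}
```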
bidirectional RNN image classification
- transpose
- vertical RNN and horizontal RNN
- global maxpool on both
bidirectional RNN
- make another recurrent unit but read in reverse
- concat the forward and backward hidden states to get h(t) of size 2M
- h(t) = [h→(t), h←(t)]
- many-to-one case
- output is O = [h→(T), h←(1)] (the forward state at the last step and the backward state at the first step)
- could also take max
- is able to predict first item in sequence
bidirectional RNN architecture

black swan AI attack
- sooner or later, an attack will throw off your classifier
- develop incident response process
- have necessary controls to delay or halt processing when debugging
- know who to call
- use transfer learning to protect new products
- pretrained models or public datasets
- leverage anomaly detection
- e.g. abuse of a free tier to mine data (a change in how the platform is used)
cognitive computing
- perform specific, human-like tasks in an intelligent way using machine learning
- simulate human thought processes using a computerized model
- imitate the way the human brain functions
collaborative filtering
- user rank: s(i,j) = μ_i + Σ_{i'} w_{ii'}·(r_{i'j} − μ_{i'}) / Σ_{i'} |w_{ii'}|
- Pearson correlation (correlation between users' mean-centered ratings) - basically cosine similarity
- w_{ii'} = Σ_{j∈Ψ_{ii'}} (r_{ij} − μ_i)(r_{i'j} − μ_{i'}) / √(Σ_{j∈Ψ_i} (r_{ij} − μ_i)² · Σ_{j∈Ψ_{i'}} (r_{i'j} − μ_{i'})²)
- Ψ_{ii'}: the numerator sum is over items rated by both users
- Ψ_i, Ψ_{i'}: the denominator sums are over each respective user's rated items
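A minimal NumPy sketch of the score s(i,j) above, assuming a small dense ratings matrix with NaN for missing ratings (values illustrative):

```python
import numpy as np

def predict(R, i, j):
    """s(i,j) = mu_i + sum_i' w_ii' (r_i'j - mu_i') / sum_i' |w_ii'|"""
    mu = np.nanmean(R, axis=1)                       # each user's mean rating
    num, den = 0.0, 0.0
    for ip in range(R.shape[0]):
        if ip == i or np.isnan(R[ip, j]):
            continue
        common = ~np.isnan(R[i]) & ~np.isnan(R[ip])  # items rated by both users (psi_ii')
        if common.sum() < 2:
            continue
        xi, xip = R[i, common] - mu[i], R[ip, common] - mu[ip]
        w = xi @ xip / np.sqrt((xi ** 2).sum() * (xip ** 2).sum() + 1e-12)  # Pearson weight
        num += w * (R[ip, j] - mu[ip])
        den += abs(w)
    return mu[i] + num / den if den > 0 else mu[i]

R = np.array([[5, 4, np.nan, 1],
              [4, 5, 2, np.nan],
              [1, 2, 5, 4.0]])
print(predict(R, 0, 2))   # predicted rating of item 2 by user 0
```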
collaborative filtering formula

concatenation trick
- the weights of the input and hiddens can be concatenated
- size Mx(M+D)
convolution
- acts as a filter; blurring, edge detection, etc..
- denoted I*K(image convoluted with filter/kernel)
- continuous: (a∗w)(t) = ∫ a(τ)·w(t−τ) dτ; discrete: (a∗w)[n] = Σ_m a[m]·w[n−m]
- signal and kernel interchangeable
- can replace x with x’
- filters are the patterns we're looking for; spikes in the output mark their occurrences
convolution versus cross-correlation
convolution: (f*g)[n] = ∑f[m]g[n-m]
cross-correlation: (f♦g)[n] = ∑f[m]g[n+m]
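A quick NumPy illustration of the difference (1-D signals, illustrative values): cross-correlation is just convolution with a flipped kernel.

```python
import numpy as np

f = np.array([1.0, 2, 3])
g = np.array([0.0, 1, 0.5])

conv = np.convolve(f, g)                    # (f*g)[n] = sum_m f[m] g[n-m]
xcorr = np.correlate(f, g, mode="full")     # cross-correlation: no flip of the kernel
flipped = np.convolve(f, g[::-1])           # convolving with a reversed kernel gives cross-correlation

print(conv)      # [0.  1.  2.5 4.  1.5]
print(xcorr)     # [0.5 2.  3.5 3.  0. ]
print(flipped)   # same as xcorr: the only difference between the two ops is the kernel flip
```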
convolutional neural network
- imitates the visual cortex in the brain
- invariant to shifts in features of image (translational invariance)
- convolution->pooling->convolution->pooling->fully connected layer
- 3d image has 3d filter
- multiply the filter by the input elementwise; pooling downsamples (reduces the spatial size)
- the features that are found actually resemble things in the input (layer 1 shapes/edges, layer 2 body parts, layer 3 faces)
cross-entropy
- use for classification
- use for autoencoder
- entropy and variance both correlate with unpredictability
- -T.mean(T.log(y[T.arange(y.shape[0]), t]))
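The same quantity in plain NumPy (assuming `y` is an N x K matrix of predicted probabilities and `t` holds integer class labels):

```python
import numpy as np

y = np.array([[0.7, 0.2, 0.1],     # predicted class probabilities (N x K)
              [0.1, 0.8, 0.1]])
t = np.array([0, 1])               # integer targets

# mean negative log-likelihood of the correct class
loss = -np.mean(np.log(y[np.arange(y.shape[0]), t]))
print(loss)   # -(log 0.7 + log 0.8) / 2 ≈ 0.290
```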
d(sigmoid(x))/dx
y*(1-y)
d(tanh(x))/dx
1-y**2
data lake/warehouse
- lake: large-scale pool of raw data without a concrete purpose
- data warehouse: repository for structured, filtered data that has already been processed for a specific purpose.
- data lakes were born out of the need to harness big data and benefit from the raw, granular structured and unstructured data for machine learning
- data warehouses for analytics used by business users.
data pipeline
- encapsulate a number of processing steps required to prepare data for machine learning
- performing “data prep” operations such as cleansing data and handling missing data and outliers
- also transforming data into a form better suited for machine learning
- includes training or fitting a model and determining its accuracy
- automated so their steps may be performed on a continued basis
data poisoning
- feeding adversarial data into training to pollute the training set so the classifier performs worse
- model skewing:
- attackers pollute training data to shift learned boundary
- use sensible data sampling
- don't allow any one user too much influence over the system
- use weight decay
- compare the new classifier to the previous one
- dark launch: compare the two outputs on the same traffic
- backtesting: A/B testing on a fraction of traffic
- build a golden-standard dataset the classifier must accurately predict
- feedback weaponization: using feedback systems to attack legitimate users and content
- verify feedback before making decisions
- don't assume the apparent beneficiary is responsible: attackers cover their tracks to get legitimate users penalized
deeper is better
use fewer hidden units per layer and achieve better performance
DeepMind wavenet
- text to speech
- uses CNN
- is used by Google Assistant
dis/advantage full batch
- won't work on a big dataset due to memory size
- maximize likelihood over entire set
distribution ML pipeline tools
- Apache Spark
- Apache Airflow
- Kubeflow
docker containers
- lightweight, user-level environment (similar to a small virtual machine) that helps data scientists build, install, and run code
- built from a script
- ability to version control a data science environment
does data influence training time?
- training time per gradient step does not depend on the amount of data
- only the time to create (load and prepare) the dataset increases
dropout regularization
- drops random nodes during training
- hidden layer units are forced to rely on multiple inputs
- emulates ensembles
- approximates 2^(# neurons) networks
- at prediction time, multiply the layer outputs by p(keep)
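A minimal NumPy sketch of train-time masking and the p(keep) scaling at prediction time (shapes and p_keep are illustrative):

```python
import numpy as np

p_keep = 0.8
X = np.random.randn(4, 10)
W = np.random.randn(10, 5)

# training: drop random units by multiplying by a Bernoulli mask
mask = (np.random.rand(*X.shape) < p_keep)
train_out = (X * mask) @ W

# prediction: no mask, but scale by p_keep so expected activations match training
test_out = (X * p_keep) @ W
```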
edge analytics
- data collection and analysis where an analytical computation is performed on data at the point of collection (e.g. a sensor) instead of waiting for the data to be sent back to a centralized data store
- IoT model of connected devices has become more established
- filter for what information is worth sending to a central data store for later use.
ensemble network
- majority vote; better accuracy than one model; two methods (different features, same features)
- 100 features 1 million data points
- 10 networks of 10 features
exponentially-smoothed average
- z(t) = (1 − α)·z(t−1) + α·cost(t), with 0 < α < 1
- bias correction: ẑ(t) = z(t) / (1 − (1 − α)^(t+1)), counting t from 0
- more recent values matter more
- thus better for non-stationary data
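A short sketch of the smoothing plus bias correction above (the decay value α = 0.1 and the per-batch costs are illustrative):

```python
import numpy as np

alpha, z = 0.1, 0.0
costs = np.random.rand(100)                              # e.g. per-batch losses
smoothed = []
for t, c in enumerate(costs):
    z = (1 - alpha) * z + alpha * c                      # exponentially-smoothed average
    smoothed.append(z / (1 - (1 - alpha) ** (t + 1)))    # bias correction for early steps
```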
face map
Face maps provide the ability to combine different types of relationships into one that you can do math on
- THEOREM: if you combine relationships that are null or simpler, you reconstruct the larger system
- Co-limits
- Validating: measure the low-dimensional entropy and the high-dimensional one
- Can iteratively combine low-dimensional bad ones with high-dimensional good ones until you find the right one
fancy method
- NLP task
- max pool over the hidden states h(t) in the RNN
- determine if some sequence is in the data
- has 1 output
fastest AI
gated recurrent unit
- similar to the rated RNN: you have to choose between
- taking the old value
- or taking the new value
- if reset gate r(t) = 0, it's like beginning a new sequence
- but h(t) will be a combination of h(t-1) and ĥ(t)
- has same form as update gate
- is essentially a forgetting factor
- update gate weights previous
- z(t) = act(x(t)·Wxz + h(t−1)·Whz + bz)
geospatial analytics
- handle geographic information system (GIS) data (e.g. GPS data) and imagery (e.g. satellite photographs)
- uses geographic coordinates as well as identifier variables such as street address and zip code
- create geographic models and data visualizations for more accurate modeling and predictions.
GPU acceleration
- use GPUs alongside the CPU to accelerate computation
- GPU database accelerates certain database operations
graph
- represents a connection between a collection of entities (e.g. spending habits of consumers)
- vertex (nodes): node attributes such as age and height, and number of neighbors
- edge: a relationship between entities (e.g. customer and product) - has an edge identity and weights
graph database
- uses “graph theory” to store, map and query relationships of data elements
- collection of nodes and edges
- node represents an entity such as a product or customer
- have: a unique identifier, a set of outgoing edges and/or incoming edges, in addition to a set of key/value pairs
- edge represents a connection or relationship between two nodes
- have: unique identifier, a starting-place and/or an ending-place node along with a set of properties
graph database image

grid search
exhaustive method that loops through all possible combinations of hyperparameter values
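A minimal sketch with itertools (the grid and the stand-in score function are hypothetical; scikit-learn's GridSearchCV adds cross-validation on top of the same loop):

```python
from itertools import product

grid = {"lr": [1e-3, 1e-2, 1e-1], "hidden": [32, 64], "dropout": [0.2, 0.5]}

def score(params):                 # stand-in: replace with real train + validate
    return -abs(params["lr"] - 1e-2) - abs(params["dropout"] - 0.2)

best = max((dict(zip(grid, vals)) for vals in product(*grid.values())), key=score)
print(best)   # {'lr': 0.01, 'hidden': 32, 'dropout': 0.2}
```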
GRU architecture

how to fix overrepresentation of end
- only add end token n % of the time
- otherwise stop on second to last word
how to prevent or cause same prediction every time
- don't model the initial word probability distribution
- p(w(0)) = softmax(f(‘start’)), w(0) = randint(V, p=p(w(0)))
- output probability instead of deterministic output
increase CNN invariance
- modify training data
- orientation,
- color,
- size, etc…
- can be used to increase one portion of data size if there is class imbalance
indirect encoding
- field of neuroevolution
- analogous to pruning in lottery ticket
- expressive while reducing parameters
item-item vs user-user
- item-item
- choose items for user b/c liked similar items in the past
- user-user
- choose items for user b/c liked by similar users
- uses the same algorithms just transpose the ratings matrix
- comparing items as opposed to users provides more data per pair, and it is faster
k-fold cross validation
- split into k groups
- train on groups [1:k], test on group [0]
- train on groups [0]+[2:k], test on group [1]
- etc…
- use a t-test to compare scores across folds (e.g. between two models)
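A minimal sketch using scikit-learn's KFold (model and data are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(scores), np.std(scores))   # per-fold scores can feed a t-test between two models
```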
KL divergence
- use to compare the similarity of two distributions
- the gradient is the same as cross-entropy / negative log-likelihood
- KL and cross-entropy are interchangeable for backpropagation
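A NumPy check that KL and cross-entropy differ only by the entropy of the target, which is constant with respect to the prediction (distributions are illustrative):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])     # target distribution
q = np.array([0.5, 0.3, 0.2])     # predicted distribution

kl = np.sum(p * np.log(p / q))                   # KL(p || q)
cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
print(np.isclose(kl, cross_entropy - entropy))   # True: KL = H(p, q) - H(p)
```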
L1 regularization
encourages sparsity (=0)
L2 regularization
encourages small weights (≈0)
latency & sensors in python
- write in PyTorch or TensorFlow
- rewrite the model in C++ for deployment
- either rewrite completely or serialize it
- PyTorch: tracing (TorchScript)
- TensorFlow: graph mode
linear regression regularization
- J = Σ(y − ŷ)² + 0.5·λ·(‖W₁‖²_F + ‖W₂‖²_F + … + ‖W_L‖²_F)
- ‖·‖_F = Frobenius norm
- each W is a weight matrix
- punishes complexity
- improves error on unseen data
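A minimal sketch of the L2-penalized cost and its gradient step for plain linear regression (a single weight vector rather than the per-layer Frobenius norms above; data and learning rate are illustrative):

```python
import numpy as np

def ridge_cost(w, X, y, lam):
    err = y - X @ w
    return np.sum(err ** 2) + 0.5 * lam * np.sum(w ** 2)   # squared error + weight penalty

def ridge_grad(w, X, y, lam):
    return -2 * X.T @ (y - X @ w) + lam * w                # gradient of both terms

X, y = np.random.randn(50, 3), np.random.randn(50)
w, lr = np.zeros(3), 0.005
for _ in range(300):
    w -= lr * ridge_grad(w, X, y, 0.1)
print(w, ridge_cost(w, X, y, 0.1))
```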
local minimum
- a critical point is more likely a saddle point than a local minimum, so it is not really a problem
- gradient descent slides off saddle points
- very low probability that the loss curves upward in all dimensions at once
logarithmic sampling
sample hyperparameters on a log scale so you cover a broad range of magnitudes instead of values that are all close together
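A one-line sketch: sampling a learning rate log-uniformly instead of uniformly (range is illustrative):

```python
import numpy as np

# uniform in the exponent => log-uniform in the value: covers 1e-5 .. 1e-1 evenly per decade
lrs = 10 ** np.random.uniform(-5, -1, size=10)
print(np.sort(lrs))
```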
lottery ticket
- find a lucky (winning-ticket) sparse subnetwork inside a dense network
- present in LSTMs and transformers
- fit on small device
low-code/no-code
- ML applications w/ drag and drop components
- connect components to create a finished application
- many enterprise Business Intelligence (BI) platforms fall into this platform category
LSTM algorithm

LSTM architecture

LSTM params
- Input gate:
- params: Wxi, Whi, Wci, bi
- depends on: x(t), h(t-1), c(t-1)
- Candidate cell:
- params: Wxc, Whc, bc
- depends on: x(t), h(t-1)
- Forget gate:
- params: Wxf, Whf, Wcf, bf
- depends on: x(t), h(t-1), c(t-1)
- Output gate:
- params: Wxo, Who, Wco, bo
- depends on: x(t), h(t-1), c(t-1)
machine learning security
malicious manipulation of networks to worsen their output
main problem with AI
- don’t have common sense
- networks find correlation, not causation
- example research: causal bayesian network
major AI conferences
- NeurIPS (NIPS): neural networks, but not exclusively
- International Conference for Machine Learning (ICML): general machine learning
- International Conference on Learning Representations (ICLR): really the first conference focused on deep learning.
- Association for the Advancement of Artificial Intelligence (AAAI): more application based
- IEEE conferences
- International Joint Conference on NN (IJCNN)
- IEEE International Conference on Fuzzy Systems (FUZZ-IEEE)
- IEEE Congress on Evolutionary Computation (CEC)
Markov decision process
- set of all states: measurements
- set of all actions: actions the agent can do
- set of all rewards: received at each step
- state transition probabilities
- discount factor (gamma)
mixed data
multiple independent data types
- Numeric/continuous
- Categorical
- Image
model stealing
- steal prediction models and spam filters in order to optimize against them
- e.g. black-box probing of stock market prediction models
- model reconstruction: recreate model by probing the public API and refining own model using it as an oracle
- membership leakage: the attacker builds shadow models that are used to determine whether a given record was used to train the model
MSE versus cross-entropy
- MSE for regression
- assumes target continuous and normally distributed
- doesn't punish misclassification enough
- vanishing gradient
- cross-entropy for classification
- maximizes likelihood
- decision boundary is large
- converge faster
naive bayes
- generative classifier (models p(x|y) instead of the discriminative p(y|x))
- p(x|y) = Π_n p(x_n|y)
- naive because it assumes no covariance between features
- the independence assumption is more valid if you use PCA first (decorrelated features)
natural graphs
DEFINITION: a set of points that have an inherent relationship between them
- Co-occurrence: capture user behavior based on interactions with data
EXAMPLES:
- Citation Graph: capture relationship between articles to other articles using citations
- Natural Language: nodes represent entities and edges represent relationships between pairs of entities
neural structured learning (NSL)
GOAL: optimize the supervised loss AND a neighbor loss to keep structural similarity while learning; (bonus) requires less labeled data
- Graph regularization: train the ANN with a graph-regularized objective, harnessing labeled and unlabeled data
- Adversarial learning: generate adversarial neighbors that keep structural similarity to other samples
- Adversarial structures: as opposed to graphs, they are implicitly inferred
- use similarity between instances (using a pretrained embedder); if you don't have similar structures, create adversarial examples intended to mislead the network into an incorrect classification
- perturbations are usually generated by reversing the gradient
neuro-symbolic
- symbolics: better at abstraction and reasoning
- ml: better at scalability and pattern recognition
- hybrid: understanding causal relationships
neurons in brain
- 10^11 neurons in the human brain
- each is connected to 1,000 to 10,000 other neurons
node classification methods
- deepwalk: derive embeddings from truncated random walk from graph data
- graph CNN: apply convolutions over the graph; each additional layer expands the neighborhood of the network that is aggregated
- graph BERT: removed dependency on links
noise injection
- type of regularization
- N ~ (0, var << 1)
- injection to inputs and weights
NSL Architecture
olfactory machine learning
- uses 4 layers
- finds the maximum signal strength
- one neuron can represent multiple smells
- as opposed to the visual cortex, where one neuron maps to one pixel
- attempts the cocktail party problem
- narrows in on particular sound signals in a short period
- disentangles conversations
one-hot encoding
represent each category as a binary vector with a single 1 (not as integers like red = 1, blue = 2, green = 3, which would imply an order)
i.e. red = [1, 0, 0], blue = [0, 1, 0], green = [0, 0, 1]
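A minimal NumPy sketch (colors and ordering are illustrative):

```python
import numpy as np

colors = ["red", "blue", "green", "blue"]
classes = sorted(set(colors))                     # ['blue', 'green', 'red']
index = {c: i for i, c in enumerate(classes)}

one_hot = np.eye(len(classes))[[index[c] for c in colors]]
print(one_hot)
# [[0. 0. 1.]   red
#  [1. 0. 0.]   blue
#  [0. 1. 0.]   green
#  [1. 0. 0.]]  blue
```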
open-source libraries
- Tensorflow (Google): symbol math library; high level keras; c++ support
- Pytorch (Facebook): ML/DL
- Scikit-learn: classification, regression, and clustering algorithms
- Jax: basically NumPy with ML support and GPU/TPU acceleration
- PySpark: extremely fast cluster computing for Python using Spark on a standalone cluster
- Spark SQL for dataframes
- MLlib for ML (for Spark clusters - runs on Hadoop, Apache Mesos, Kubernetes)
overfit
- a really complex model with many parameters may not generalize well
- bias variance tradeoff
- simpler models overfit less
- regularization
- noise injection
- dropout
- batch normalization
p(drop/keep)
- probability of keeping or dropping a node
- typically p(keep) 0.8 for input layers and p(drop) 0.5 for dropout layers
- multiply by mask
parity problem
- application: data transmission in communication system
- sends bitstream to reciever
- if the number of 1's is odd, add parity bit 1
- if the number of 1's is even, add parity bit 0
- receiver recounts: if the total number of 1's is odd, there is an error
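A small sketch of this even-parity check (bit values illustrative):

```python
def add_parity_bit(bits):
    # even parity: append 1 if the count of 1's is odd, else 0, so the total count of 1's is even
    return bits + [sum(bits) % 2]

def has_error(received):
    # receiver check: an odd number of 1's means at least one bit flipped in transit
    return sum(received) % 2 == 1

sent = add_parity_bit([1, 0, 1, 1])      # -> [1, 0, 1, 1, 1]
print(has_error(sent))                   # False
corrupted = sent.copy(); corrupted[2] ^= 1
print(has_error(corrupted))              # True
```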
penalty term
term that grows as the weight grows
pooling
- a method used in CNNs to downsample (reduce the size of features)
- take the maximum/average in a block
pretrained models
- bigger model with more comprehensive pretraining data
- performs better at multiple downstream tasks
- fewer training examples
- saves money collecting more annotated samples
rated recurrent neural network (RRNN)
- Z = rate
- weight two things
- f(x(t), h(t-1)): output of RNN
- h(t-1): previous value of hidden state
- element-wise multiplication
- h(t) = (1-Z)°h(t-1) + Z°f(x(t), h(t-1))
- acts like a low pass filter
- Z can be calculated many ways
- weight param
- function of X
- z(t) = f(Wxz·x(t) + Whz·h(t−1) + bz)
recommendation interfaces
- news feed
- product feed
- similar product
- associated products
- search engine
- advertisements
recurrent unit (Elman unit)
- h(t) = f(Wh^T·h(t−1) + Wx^T·x(t) + b_h)
- y(t) = softmax(Wo^T·h(t) + b_o)
- relies on all previous states, no Markov assumption
- x(t): D, Wx: D x M
- h(t): M, Wh: M x M
- y(t): K, Wo: M x K
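A minimal NumPy sketch of one Elman step per time step, matching the shapes above (D, M, K and the random sequence are illustrative):

```python
import numpy as np

D, M, K = 4, 8, 3
Wx, Wh, bh = np.random.randn(D, M), np.random.randn(M, M), np.zeros(M)
Wo, bo = np.random.randn(M, K), np.zeros(K)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

h = np.zeros(M)
for x in np.random.randn(10, D):               # a length-10 input sequence
    h = np.tanh(h @ Wh + x @ Wx + bh)          # h(t) = f(Wh^T h(t-1) + Wx^T x(t) + bh)
    y = softmax(h @ Wo + bo)                   # y(t) = softmax(Wo^T h(t) + bo)
```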
recurrent vs markov model
- using softmax with an RNN is about optimizing the joint probability of the sequence
- probabilities of longer sequences don't go to 0 (no explicit product of per-step factors)
- the joint probability is implicit in the model
ReLU
- rectified linear unit
- biologically plausible due to asymmetry
- better gradient propagation (no pretraining prep needed for deep networks)
- however, a dying ReLU can occur when a unit's activation is always zero
same/half padding
- padding the input with 0 (~half the length of the filter) to allow the output size to equal input size
- filters usually odd size to allow both sides of input to have equal padding
saturated
the regions of a function (such as sigmoid) that are not dynamic (flat, near-zero gradient)
selecting CNN layers
- depth increases through layers, height and width decrease (reverse for type 2 - image in output)
- convolutional on the side of image
- fully-connected on the side of vector
- fully connected layers can all be the same size (research has found that it does not overfit)
self attention diagram 1

self-attention diagram 2

solving class imbalance
sample the smaller class with replacement until it is the same size as the bigger class
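A minimal NumPy sketch of oversampling the minority class with replacement (class counts are illustrative):

```python
import numpy as np

X = np.random.randn(1000, 5)
y = np.array([0] * 950 + [1] * 50)                 # heavily imbalanced labels

minority = np.where(y == 1)[0]
extra = np.random.choice(minority, size=950 - 50, replace=True)   # sample with replacement
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y_bal))   # [950 950]
```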
spiking neural networks
activation potentials resemble biological neuron
stochastic gradient descent (SGD)
- assume all data IID
- very slow
- error improves in the long run (although it may or may not improve on any single step)
- sample with replacement (sometimes)
structured v. unstructured data
- tabular data versus raw data
- data that looks like an SQL table, with inputs and IDs neatly defined, versus unstructured data like an image
supervised learning for recommender systems
- input demographics
- age, gender, religion
- occupation, education
- location, race
- predict the users reaction
- did they buy item
- did they click on the ad
truncated back propagation
- long sequences take a lot of time
- stop at certain number of time steps
types of RNN
- label for sequence
- label for each layer
- no labels
unrolled RNN

valid/full mode
- valid mode: output length N − K + 1
- full mode: output length N + K − 1 (input length N, kernel length K)
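A quick NumPy check of the two output lengths (N = 10, K = 3, values illustrative):

```python
import numpy as np

x = np.arange(10.0)          # N = 10
k = np.array([1.0, 0, -1])   # K = 3

print(np.convolve(x, k, mode="valid").shape)   # (8,)  = N - K + 1
print(np.convolve(x, k, mode="full").shape)    # (12,) = N + K - 1
```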
vanishing gradient/exploding gradient
- the derivative of sigmoid is at most 0.25
- repeated multiplication: a^n → 0 (vanishing) or ∞ (exploding)
- the network learns very slowly / not at all
weight initialization
- tanh: weight variance 1/M1 (or 1/(M1+M2))
- ReLU: weight variance 2/M1
- draw from N ~ (0, 1) and scale by the square root of the chosen variance
- should normalize the input: X = (X − µ)/std
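A short sketch of drawing from N(0, 1) and scaling to the variances above (layer sizes illustrative):

```python
import numpy as np

M1, M2 = 256, 128   # fan-in, fan-out

# draw from N(0, 1), then scale so the weight variance matches the rule above
W_tanh = np.random.randn(M1, M2) * np.sqrt(1.0 / M1)    # variance 1/M1 for tanh
W_relu = np.random.randn(M1, M2) * np.sqrt(2.0 / M1)    # variance 2/M1 for ReLU

X = np.random.rand(32, M1)
X = (X - X.mean(axis=0)) / X.std(axis=0)                # normalize the input: (X - mu) / std
```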
why initialize weights randomly
prevents symmetry (otherwise every unit in a layer learns the same thing - like having one unit!)
why sigmoid?
- monotonically increasing
- is the output of binary logistic regression (neuron like logistic)
zero-day inputs
rare, but they happen
- new product or feature launch: new functionality opens new attack surfaces
- increased incentives: e.g. abusing Google Cloud compute to mine bitcoin
image-as-graph
each pixel is a 3-D RGB vector (node) connected to its 8 "adjacent" neighbors
Text-as-graph
represent text as a sequence of tokens in a directed graph
Examples of Graphs
- Social networks: people as nodes and their relationship as an edge
- molecules: Atoms as nodes and covalent bonds as edges
- Citation networks: each paper is a node and an edge is a citation from one paper to another
- can even use a word embedding of the abstract as information about a node
Graph vs. Node vs. Edge TASK
- Graph: predict a property of the entire graph (e.g. smell of a molecule, overall label)
- Node: predict the identity or role of each node (e.g. member loyal to John or Jane, image segmentation, POS tagging)
- Edge: determine the relationship between objects in images
representing graphs in ML
- Tabular: often sparse
- node: feature matrix N x D (nodes x features)
- connectivity: adjacency matrix (N x N, mostly zeros)
- Lists: more computationally efficient
- Nodes = [[0, 1], [1, 0], [0, 0]]
- Edges = [[2, 1], [1, 1]]
- Adjacency list = [[1, 0], [2, 0]]
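A minimal sketch of the list representation above for a 3-node graph (values taken from the bullets; the dense adjacency matrix is shown only for contrast):

```python
import numpy as np

# 3 nodes, 2 features each (N x D feature matrix)
node_features = np.array([[0, 1],
                          [1, 0],
                          [0, 0]])

# connectivity as an adjacency list of [source, target] pairs, plus per-edge features
adjacency_list = [[1, 0], [2, 0]]          # node 1 -> node 0, node 2 -> node 0
edge_features = np.array([[2, 1],
                          [1, 1]])

# equivalent dense adjacency matrix: mostly zeros for large sparse graphs
A = np.zeros((3, 3), dtype=int)
for s, t in adjacency_list:
    A[s, t] = 1
```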
graph neural network
an optimizable transformation on all attributes of the graph (nodes, edges, global-context)
- graph-in, graph-out: architecture accepts graph as an input, with information loaded into its nodes, edges, and global-context, and progressively transforms the embeddings, without changing the connectivity of the input graph
simplest GNN
- Learn embeddings for graph attributes (nodes, edges, global) W/O changing connectivity
- Multilayer perceptron on each component
- N(n_f|n_0), E(e_f|e_0), G(g_f|g_0)
Large Language Generative Models Sizes
- params: Megatron-Turing NLG (530 B), GPT-3 & AI21 Jurassic (175 B)
- tokens: GPT-3 (300 B), Chinchilla (1.4 T)
- human feedback training
areas for improvement generative models 2022
- memory - recall previous information
- summarize - decompose info into elements
- Model/Data quality assurance:
- 4.6-17.2 T tokens (books, literature, articles)
- NOT blogs, webpages, social media, spoken content
- bigger is not better = more cost, impossible to beat, …
- text-to-video
affective AI
- emotion detection in image, video, text, sound waves
- current technologies average over samples, which leads to possible bias (e.g. cultural and sex differences)
2024 predictions
- humanoid robots - maybe a foundational model
- ML Ops - Weights & Biases (raised $200 million at a $1 billion valuation)
- ethics and law tossed in generative AI mix
- still lacking basic knowledge
- AI search using generative AI
- mixed modal foundational models
6 openAI competitors
whoever can run the cheapest and best for the job
- Adept AI - ~$660M (Greylock) - 20 employees - browser extension for instruction-driven web surfing
- AI21 - $750 million - 160 employees - customize models for automated copywriting, summarizing documents
- Anthropic - $4 billion - 80 employees - general-purpose LLMs free of bias and toxicity
- Character - 18 employees - create and interact with chatbots that can role-play
- Cohere - ~$1B (Tiger Global) - summarizing documents, copywriting, and search
- Inflection - $1.23B (Greylock) - communicate with machines to relay our thoughts and ideas
How to find badly labelled data?
- sort by model loss to find ambiguous samples, AND sort by confidence where the model and ground truth disagree
- Confident Learning: pruning noisy data based on prediction intervals (worse than the aforementioned method)