Machine Learning & AI Flashcards
9 breakthrough technologies 2022
- unhackable Internet: entangled particles transmitted end to end, unable to be read w/o disrupting content
- hyper-personalized medicine: takes large team to develop treatment for rare condition; solutions using digital speed
- digital currency: has potential if backed by real currency
- anti-aging drugs: senolytics - remove senescent cells that create low-level inflammation, creating toxicity, and also benefits of
- AI-discovered molecules: 10^60 chemical molecules possible; use machine learning to speed up process of finding possible drugs
- satellite mega-constellations: blanket the world with high speed internet or junk-ridden minefield
- tiny AI: lower carbon emission; increases speed; increases privacy (local storage)
- differential privacy: inject noise into user data to increase privacy
- climate change attribution: improving techniques to link weather to climate change; disentangling factors
adversarial inputs
- aim at misclassification in order to avoid detection
- e.g. - malicious documents designed to evade antivirus, and email attempting to evade spam filters
- mutated input: attackers actively minimize the classifier's detection rate using undetectable payloads. To develop detection systems:
- limit information leakage: don’t provide error codes or confidence values
- limit probing: limit how many payloads attackers can test (e.g. CAPTCHA)
- ensemble learning: combine various detection methods
self-attention
AutoML
- attempt to make machine learning available to people without strong expertise in the field,
- automate repetitive tasks which enables a data scientist to focus more on the problem rather than the models
- automate data pipeline components - helps to avoid errors that might slip in with manual processes
baseline models
- calculate
- accuracy
- efficiency
- cost
- examples
- naive bayes
- linear regression
- Markov Model (NLP)
batch normalization
- normalize activations to ~ N(0, 1)
- this is where activations are the most dynamic
- do this to every layer
- exponential-smoothed average
- removes the need for bias
- need batch and running mean and variance
- regularizes like noise injection because it adds error term to norm
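A minimal NumPy sketch of the batch normalization bullets above: normalize with batch statistics during training, keep exponentially smoothed running statistics, and use those at prediction time. The function names, `decay`, and `eps` are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var, decay=0.9, eps=1e-5):
    """Normalize a batch (rows = samples) and update the running statistics."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)
    # Exponentially smoothed averages, kept for use at prediction time.
    running_mean = decay * running_mean + (1 - decay) * batch_mean
    running_var = decay * running_var + (1 - decay) * batch_var
    # beta is a learned shift, which is why a separate bias term is redundant.
    return gamma * x_hat + beta, running_mean, running_var

def batch_norm_predict(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """At prediction time, normalize with the running statistics instead."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
gamma, beta = np.ones(4), np.zeros(4)
out, run_mean, run_var = batch_norm_train(x, gamma, beta, np.zeros(4), np.ones(4))
```

With `gamma = 1` and `beta = 0`, each column of `out` has roughly zero mean and unit variance.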
bayes rule
- p(y|x)=p(x|y)p(y)/p(x)
- likelihood*prior/normalizer
- prior p(y)=instances/total
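A small worked example of the rule above, assuming a toy spam/ham label set and made-up likelihood values (both are illustrative, not from the cards). The prior is estimated as instances/total, and the normalizer p(x) is the sum of likelihood*prior over classes.

```python
from collections import Counter

def posterior(likelihoods, priors):
    """Return p(y|x) for each class y, given p(x|y) and p(y)."""
    joint = {y: likelihoods[y] * priors[y] for y in priors}  # likelihood * prior
    evidence = sum(joint.values())                           # normalizer p(x)
    return {y: joint[y] / evidence for y in joint}

labels = ["spam", "ham", "ham", "ham"]           # toy training labels
counts = Counter(labels)
priors = {y: c / len(labels) for y, c in counts.items()}  # instances / total
likelihoods = {"spam": 0.8, "ham": 0.1}          # assumed p(x|y) for one input x
post = posterior(likelihoods, priors)
```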
bidirectional RNN image classification
- transpose
- vertical RNN and horizontal RNN
- global maxpool on both
bidirectional RNN
- make another recurrent unit but read in reverse
- concatenate the forward and backward hidden states, so h(t) has size 2M
- h(t) = [h→(t), h←(t)]
- many to one case
- output is O = [h→(T), h←(1)]
- could also take max
- is able to predict first item in sequence
bidirectional RNN architecture
black swan AI attack
- sooner or later, an attack will throw off your classifier
- develop incident response process
- have necessary controls to delay or halt processing when debugging
- know who to call
- use transfer learning to protect new products
- pretrained models or public datasets
- leverage anomaly detection
- e.g. - abuse of free tier to mine data (changes of uses of platform)
cognitive computing
- perform specific, human-like tasks in an intelligent way using machine learning
- simulate human thought processes using a computerized model
- imitate the way the human brain functions
collaborative filtering
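- user rank: s(i,j) = µ(i) + ∑ w(i,i’)*(r(i’,j) - µ(i’)) / ∑ |w(i,i’)|, with both sums over the other users i’
- pearson correlation (correlation b/w variables) - basically cosine similarity of mean-centered values
- w(i,i’) = ∑(x(i)-µ(x))(y(i)-µ(y)) / √(∑(x(i)-µ(x))^2 * ∑(y(i)-µ(y))^2)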
- Ψii’: numerator is over rating by both users
- Ψi, Ψi’: denominator is over respective user
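A rough sketch of user-user collaborative filtering along the lines of the card above. It assumes missing ratings are stored as NaN, and as a simplification it restricts both the numerator and the denominator of the weight to co-rated items; `pearson_weight` and `predict` are hypothetical names.

```python
import numpy as np

def pearson_weight(r_a, r_b):
    """Pearson-style weight between two users over the items both have rated."""
    common = ~np.isnan(r_a) & ~np.isnan(r_b)
    if common.sum() < 2:
        return 0.0
    da = r_a[common] - np.nanmean(r_a)   # deviations from each user's own mean
    db = r_b[common] - np.nanmean(r_b)
    denom = np.sqrt((da ** 2).sum() * (db ** 2).sum())
    return float((da * db).sum() / denom) if denom > 0 else 0.0

def predict(ratings, i, j):
    """Predict s(i, j): user i's mean plus weighted deviations of other users."""
    mu_i = np.nanmean(ratings[i])
    num, den = 0.0, 0.0
    for k in range(ratings.shape[0]):
        if k == i or np.isnan(ratings[k, j]):
            continue
        w = pearson_weight(ratings[i], ratings[k])
        num += w * (ratings[k, j] - np.nanmean(ratings[k]))
        den += abs(w)
    return mu_i if den == 0 else mu_i + num / den

ratings = np.array([
    [5.0, 4.0, np.nan],   # user 0 has not rated item 2
    [4.0, 3.0, 2.0],
    [1.0, 2.0, 5.0],
])
pred = predict(ratings, i=0, j=2)
```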
collaborative filtering formula
concatenation trick
- the weights of the input and hiddens can be concatenated
- size Mx(M+D)
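A quick check of the concatenation trick, assuming column-vector convention so the combined weight matrix has shape M x (M+D) as on the card. All array names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 3, 4                       # input size and hidden size
Wx = rng.normal(size=(M, D))      # input-to-hidden weights
Wh = rng.normal(size=(M, M))      # hidden-to-hidden weights
x = rng.normal(size=D)
h = rng.normal(size=M)

W = np.hstack([Wh, Wx])           # concatenated weights, shape M x (M+D)
hx = np.concatenate([h, x])       # concatenated [h; x], length M+D

separate = Wh @ h + Wx @ x        # two matrix products
combined = W @ hx                 # one matrix product, same result
```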
convolution
- acts as a filter; blurring, edge detection, etc..
- denoted I*K (image convolved with filter/kernel)
- (a*w)(t) = ∫a(τ)w(t-τ)dτ continuous or (a*w)[n] = Σa[m]w[n-m] discrete (sum over m)
- signal and kernel interchangeable
- can replace x with x’
- filters are the pattern we're looking for; spikes in the output mark occurrences
convolution versus cross-correlation
convolution: (f*g)[n] = ∑f[m]g[n-m]
cross-correlation: (f♦g)[n] = ∑f[m]g[n+m]
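A short NumPy illustration of the difference: for real-valued signals, cross-correlation is convolution with the kernel flipped. Note that `np.correlate` uses its own argument order and lag convention, which may not line up index-for-index with the formula on the card; the arrays here are arbitrary examples.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])   # signal
g = np.array([1.0, 0.0, -1.0])       # kernel

conv = np.convolve(f, g, mode="full")                  # flips g, then slides it
xcorr = np.correlate(f, g, mode="full")                # slides g without flipping
xcorr_via_conv = np.convolve(f, g[::-1], mode="full")  # flip g by hand first
```

`xcorr` and `xcorr_via_conv` agree, while `conv` differs because `g` is not symmetric.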
convolutional neural network
- imitates the visual cortex in the brain
- invariant to shifts in features of image (translational invariance)
- convolution->pooling->convolution->pooling->fully connected layer
- 3d image has 3d filter
- multiply filter by input elementwise; pooling loses a dimension
- the features that are found actually resemble things in input (layer 1 shape, layer 2 body part, layer 3 face)
cross-entropy
- use for classification
- use for autoencoder
- entropy and variance are correlated to unpredictability
- -T.mean(T.log(y[T.arange(y.shape[0]), t]))
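A NumPy equivalent of the Theano expression above, as a sketch: `y` is assumed to hold one row of predicted class probabilities per sample and `t` the integer target labels.

```python
import numpy as np

def cross_entropy(y, t):
    """Mean negative log-probability assigned to the correct class."""
    # y[np.arange(N), t] picks, for each row, the probability of its target class.
    return -np.mean(np.log(y[np.arange(y.shape[0]), t]))

y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])   # predicted probabilities, rows sum to 1
t = np.array([0, 1])              # target class indices
loss = cross_entropy(y, t)
```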
d(sigmoid(x))/dx
y*(1-y)
d(tanh(x))/dx
1-y**2
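Both derivative cards write the result in terms of the output y (y = sigmoid(x) or y = tanh(x)). A quick numerical check with central differences, using an arbitrary grid of x values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 13)
eps = 1e-6

y_sig = sigmoid(x)
y_tanh = np.tanh(x)

analytic_sig = y_sig * (1 - y_sig)   # y*(1-y)
analytic_tanh = 1 - y_tanh ** 2      # 1-y**2

# Central finite differences as an independent estimate of the derivatives.
numeric_sig = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
numeric_tanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
```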
data lake/warehouse
- lake: large-scale pool of raw data without a concrete purpose
- data warehouse: repository for structured, filtered data that has already been processed for a specific purpose.
- data lakes were born out of the need to harness big data and benefit from the raw, granular structured and unstructured data for machine learning
- data warehouses for analytics used by business users.
data pipeline
- encapsulate a number of processing steps required to prepare data for machine learning
- performing “data prep” operations such as cleansing data and handling missing data and outliers
- also transforming data into a form better suited for machine learning
- includes training or fitting a model and determining its accuracy
- automated so their steps may be performed on a continued basis
data poisoning
- feeding adversarial data to the classifier during training to pollute the training set so the model performs worse
- model skewing:
- attackers pollute training data to shift learned boundary
- use sensible data sampling
- don’t allow one user too much influence to system
- use weight decay
- compare classifier to previous version
- dark launch: compare two outputs on same traffic
- backtesting: A/B testing of fraction of traffic
- build golden standard dataset: classifier must accurately predict
- feedback weaponization: use feedback systems to attack legitimate users and content
- verify feedback before making decisions
- don’t assume benefactor is responsible: hackers cover tracks to penalize users
deeper is better
fewer hidden units per layer can achieve better performance
DeepMind wavenet
- text to speech
- uses CNN
- is used by Google Assistant
dis/advantage full batch
- won't work on big dataset due to size
- maximize likelihood over entire set
distributed ML pipeline tools
- Apache Spark
- Apache Airflow
- Kubeflow
docker containers
- lightweight, isolated user-level environment (similar to a small virtual machine) that helps data scientists build, install, and run code
- built from a script
- ability to version control a data science environment
does data influence training time?
- training time does not depend on amount of data
- only the creation time increases
dropout regularization
- drops random nodes during training
- hidden layer units rely on multiple input
- emulates ensembles
- approximates 2^(# neurons) networks
- multiply prediction layers by p(keep) at the end (at prediction time, when no nodes are dropped)
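A minimal sketch of the dropout scheme described above, assuming the non-inverted variant: random nodes are zeroed during training, and at prediction time the full layer is scaled by p(keep) so the expected activation matches. Function names are illustrative.

```python
import numpy as np

def dropout_train(h, p_keep, rng):
    """Zero each unit independently with probability 1 - p_keep."""
    mask = rng.random(h.shape) < p_keep
    return h * mask

def dropout_predict(h, p_keep):
    """Keep every unit but scale by p_keep to match the training-time expectation."""
    return h * p_keep

rng = np.random.default_rng(0)
h = np.ones((10000, 8))   # stand-in for a layer's activations
p_keep = 0.8
train_out = dropout_train(h, p_keep, rng)
pred_out = dropout_predict(h, p_keep)
```

Many libraries use the "inverted" form instead, dividing by p(keep) during training so that prediction needs no scaling.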
edge analytics
- data collection and analysis where an analytical computation is performed on data at the point of collection (e.g. a sensor) instead of waiting for the data to be sent back to a centralized data store
- IoT model of connected devices has become more established
- filter for what information is worth sending to a central data store for later use.
ensemble network
- majority vote; better accuracy than one model; two methods (different features, same features)
- 100 features 1 million data points
- 10 networks of 10 features
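A sketch of the two pieces mentioned above: splitting 100 features into 10 disjoint groups (one per network) and combining the networks' class predictions by majority vote. The prediction array is a made-up stand-in for real model outputs, and ties go to the lowest class index.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_models, n_samples) integer labels -> (n_samples,) winners."""
    n_classes = int(predictions.max()) + 1
    # Count votes per class for every sample (column); result is (n_classes, n_samples).
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)

rng = np.random.default_rng(0)
feature_groups = np.array_split(rng.permutation(100), 10)  # 10 networks of 10 features

predictions = np.array([[0, 1, 1],
                        [0, 1, 0],
                        [1, 1, 0]])   # 3 models, 3 samples (illustrative)
votes = majority_vote(predictions)
```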
exponentially-smoothed average
- z(t) = (1-α)*z(t-1) + α*cost(t), (0 < α < 1)
- bias-corrected: y(t) = z(t) / (1 - decay**(t+1)), with decay = 1-α and t starting at 0
- more recent values matter more
- thus better for non-stationary data
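A small sketch of the smoothed average with bias correction, assuming z starts at 0, t starts at 0, and decay = 1 - α. The function name and sample values are illustrative.

```python
def smoothed_average(values, alpha):
    """Exponentially smoothed average with bias correction for the zero start."""
    decay = 1.0 - alpha
    z = 0.0
    out = []
    for t, x in enumerate(values):
        z = decay * z + alpha * x               # recent values get more weight
        out.append(z / (1.0 - decay ** (t + 1)))  # undo the bias toward 0
    return out

ys = smoothed_average([10.0, 10.0, 10.0, 20.0], alpha=0.5)
```

On a constant input the corrected average equals that constant from the first step, and after the jump to 20 it moves further than the plain mean (12.5) would.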
face map
Face maps provide the ability to combine different types of relationships into one that you can do math on
- THEOREM: if you combine relationships that are null or simpler, you reconstruct the larger system
- Co-limits
- Validating: measure the low dimensional entropy and the high dimensional one
- Can iteratively combine low dimensional bad ones with high dimensional good ones until you find the right one
fancy method
- NLP task
- max pool hidden y in the RNN
- determine if some sequence is in the data
- has 1 output
fastest AI
gated recurrent unit
- similar to rate, you have to choose b/w
- taking the old value
- or taking the new value
- if reset gate r(t) = 0, it's like beginning a new sequence
- but, h(t) will be a combo of h(t-1) and hhat(t)
- has same form as update gate
- is essentially a forgetting factor
- update gate weights previous
- z(t) = act(x(t)Wxz + h(t-1)Whz + bz)
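A sketch of one GRU step in NumPy, following the gates described above. Conventions vary between references; this version assumes h(t) = (1 - z)*h(t-1) + z*hhat(t), sigmoid gates, and tanh for the candidate. The parameter names and `init_params` helper are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(D, M, rng):
    """Random weights for input size D and hidden size M (illustrative only)."""
    p = {}
    for gate in ("r", "z", "h"):
        p["Wx" + gate] = rng.normal(scale=0.1, size=(D, M))
        p["Wh" + gate] = rng.normal(scale=0.1, size=(M, M))
        p["b" + gate] = np.zeros(M)
    return p

def gru_step(x_t, h_prev, p):
    r = sigmoid(x_t @ p["Wxr"] + h_prev @ p["Whr"] + p["br"])  # reset gate
    z = sigmoid(x_t @ p["Wxz"] + h_prev @ p["Whz"] + p["bz"])  # update gate
    # With r = 0 the candidate ignores h_prev, like starting a new sequence.
    h_hat = np.tanh(x_t @ p["Wxh"] + (r * h_prev) @ p["Whh"] + p["bh"])
    # Blend of the old value and the new candidate.
    return (1.0 - z) * h_prev + z * h_hat

rng = np.random.default_rng(0)
D, M, T = 3, 4, 5
params = init_params(D, M, rng)
h = np.zeros(M)
for x_t in rng.normal(size=(T, D)):
    h = gru_step(x_t, h, params)
```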
geospatial analytics
- handle geographic information system (GIS) data (e.g. GPS data) and imagery (e.g. satellite photographs)
- uses geographic coordinates as well as identifier variables such as street address and zip code
- create geographic models and data visualizations for more accurate modeling and predictions.
GPU acceleration
- use GPUs alongside CPUs to speed up computation
- GPU database accelerates certain database operations
graph
- represents a connection between a collection of entities (e.g. spending habits of consumers)
- vertex (nodes): node attributes such as age and height, and number of neighbors
- edge: relationship between customer and product - edge identity and weights
graph database
- uses “graph theory” to store, map and query relationships of data elements
- collection of nodes and edges
- node represents an entity such as a product or customer
- have: a unique identifier, a set of outgoing edges and/or incoming edges, in addition to a set of key/value pairs
- edge represents a connection or relationship between two nodes
- have: unique identifier, a starting-place and/or an ending-place node along with a set of properties
graph database image
grid search
exhaustive method that loops through all possible combinations of the variables (hyperparameter values)
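A minimal sketch of grid search using only the standard library. `toy_score` is a made-up stand-in for training a model and returning a validation score; the grid values are arbitrary.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Try every combination in `grid` and return the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(learning_rate, hidden_units):
    # Hypothetical score that peaks at learning_rate=0.01, hidden_units=64.
    return -(learning_rate - 0.01) ** 2 - (hidden_units - 64) ** 2 / 1e4

grid = {"learning_rate": [0.001, 0.01, 0.1], "hidden_units": [32, 64, 128]}
best_params, best_score = grid_search(toy_score, grid)
```

The number of evaluations is the product of the list lengths (9 here), which is why the method gets expensive quickly as variables are added.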
GRU architecture
how to fix overrepresentation of end
- only add end token n % of the time
- otherwise stop on second to last word
how to prevent or cause same prediction every time
- don't model the initial word probability distribution
- p(w(0)) = softmax(f(‘start’)), w(0) = np.random.choice(V, p=p(w(0)))
- output probability instead of deterministic output
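A small sketch contrasting the two behaviours: taking the argmax of the first-word distribution gives the same word every time, while sampling from it gives varied starts. The random `logits` vector is a stand-in for f('start'), and V = 5 is an arbitrary vocabulary size.

```python
import numpy as np

def softmax(a):
    a = a - a.max()          # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(0)
V = 5                                   # vocabulary size (illustrative)
logits = rng.normal(size=V)             # stand-in for f('start')
p_w0 = softmax(logits)                  # p(w(0))

greedy = [int(np.argmax(p_w0)) for _ in range(10)]         # deterministic output
sampled = [int(rng.choice(V, p=p_w0)) for _ in range(10)]  # probabilistic output
```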
increase CNN invariance
- modify training data
- orientation,
- color,
- size, etc…
- can be used to increase one portion of data size if there is class imbalance
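A rough sketch of modifying training images with NumPy only, covering orientation (flip, 90-degree rotation) and a simple brightness change as a stand-in for color. It assumes square H x W x C images with values in [0, 1]; `augment` is a hypothetical helper.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped, rotated, and brightness-scaled copy."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                          # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by 0/90/180/270 degrees
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)  # brightness jitter
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                        # toy square RGB image
augmented = [augment(image, rng) for _ in range(4)]  # extra samples for a rare class
```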
indirect encoding
- field of neuroevolution
- analogous to pruning in lottery ticket
- expressive while reducing parameters