Machine Learning & AI Flashcards
9 breakthrough technologies 2022
- unhackable Internet: entangled particles transmitted end to end, unable to be read w/o disrupting content
- hyper-personalized medicine: takes large team to develop treatment for rare condition; solutions using digital speed
- digital currency: has potential if backed by real currency
- anti-aging drugs: senolytics remove senescent cells that create low-level inflammation and toxicity
- AI-discovered molecules: 10^60 chemical molecules possible; use machine learning to speed up process of finding possible drugs
- satellite mega-constellations: blanket the world with high speed internet or junk-ridden minefield
- tiny AI: lower carbon emission; increases speed; increases privacy (local storage)
- differential privacy: inject noise into user data to increase privacy
- climate change attribution: improving techniques to link weather to climate change; disentangling factors
adversarial inputs
- aim at misclassification in order to avoid detection
- e.g. malicious documents designed to evade antivirus, and emails attempting to evade spam filters
- mutated input: attackers actively minimize the classifier's detection rate using undetectable payloads; to develop robust detection systems:
- limit information leakage: don’t provide error codes or confidence values
- limit probing: limit how many payloads they can test (e.g. captcha)
- ensemble learning: combining various detection methods
self-attention

AutoML
- attempt to make machine learning available to people without strong expertise in the field
- automate repetitive tasks which enables a data scientist to focus more on the problem rather than the models
- automate data pipeline components - helps to avoid errors that might slip in with manual processes
baseline models
- calculate
- accuracy
- efficiency
- cost
- examples
- naive bayes
- linear regression
- Markov Model (NLP)
batch normalization
- normalize each layer's activations to N ~ (0, 1)
- this is where activations are the most dynamic
- do this to every layer
- keep an exponentially-smoothed (running) average of the batch statistics for test time
- removes the need for a bias term
- need the batch and running mean and variance
- regularizes like noise injection because the batch statistics add an error term to the normalization
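A minimal NumPy sketch of the batch-norm forward pass described above (training mode, 2-D batch of samples x features; names and the decay value are illustrative):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, running_mean, running_var, decay=0.9, eps=1e-5):
    # batch statistics (training mode)
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    # exponentially-smoothed running statistics, used at test time
    running_mean = decay * running_mean + (1 - decay) * mu
    running_var = decay * running_var + (1 - decay) * var
    # normalize to ~N(0, 1), then scale and shift (beta plays the role of the bias)
    X_hat = (X - mu) / np.sqrt(var + eps)
    out = gamma * X_hat + beta
    return out, running_mean, running_var

X = np.random.randn(32, 8) * 3 + 2          # a batch of 32 samples, 8 features
gamma, beta = np.ones(8), np.zeros(8)
out, rm, rv = batchnorm_forward(X, gamma, beta, np.zeros(8), np.ones(8))
```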
bayes rule
- p(y|x)=p(x|y)p(y)/p(x)
- likelihood*prior/normalizer
- prior p(y)=instances/total
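A tiny worked example of the rule with made-up spam-filter numbers (all probabilities hypothetical):

```python
# p(y|x) = p(x|y) * p(y) / p(x), with p(x) as the normalizer
p_y = {"spam": 0.3, "ham": 0.7}                    # priors: instances / total
p_x_given_y = {"spam": 0.8, "ham": 0.1}            # likelihood of seeing the word "free"

p_x = sum(p_x_given_y[y] * p_y[y] for y in p_y)    # normalizer
posterior = {y: p_x_given_y[y] * p_y[y] / p_x for y in p_y}
print(posterior)   # {'spam': ~0.774, 'ham': ~0.226}
```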
bidirectional RNN image classification
- transpose
- vertical RNN and horizontal RNN
- global maxpool on both
bidirectional RNN
- make another recurrent unit but read in reverse
- concat the forward and backward hidden states to get h(t) of size 2M
- h(t) = [h→(t), h←(t)]
- many-to-one case
- output is O = [h→(T), h←(1)] (the forward state at the last step and the backward state at the first step)
- could also take max
- is able to predict first item in sequence
bidirectional RNN architecture

black swan AI attack
- sooner or later, an attack will throw off your classifier
- develop incident response process
- have necessary controls to delay or halt processing when debugging
- know who to call
- use transfer learning to protect new products
- pretrained models or public datasets
- leverage anomaly detection
- e.g. abuse of a free tier to mine data (a change in how the platform is used)
cognitive computing
- perform specific, human-like tasks in an intelligent way using machine learning
- simulate human thought processes using a computerized model
- imitate the way the human brain functions
collaborative filtering
- user rank: s(i,j) = μ_i + Σ_{i'} w_{ii'}·(r_{i'j} − μ_{i'}) / Σ_{i'} |w_{ii'}|
- Pearson correlation (correlation between users' mean-centered ratings) - basically cosine similarity
- w_{ii'} = Σ_{j∈Ψ_{ii'}} (r_{ij} − μ_i)(r_{i'j} − μ_{i'}) / √(Σ_{j∈Ψ_i} (r_{ij} − μ_i)² · Σ_{j∈Ψ_{i'}} (r_{i'j} − μ_{i'})²)
- Ψ_{ii'}: the numerator sum is over items rated by both users
- Ψ_i, Ψ_{i'}: the denominator sums are over each respective user's rated items
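A minimal NumPy sketch of the score s(i,j) above, assuming a small dense ratings matrix with NaN for missing ratings (values illustrative):

```python
import numpy as np

def predict(R, i, j):
    """s(i,j) = mu_i + sum_i' w_ii' (r_i'j - mu_i') / sum_i' |w_ii'|"""
    mu = np.nanmean(R, axis=1)                       # each user's mean rating
    num, den = 0.0, 0.0
    for ip in range(R.shape[0]):
        if ip == i or np.isnan(R[ip, j]):
            continue
        common = ~np.isnan(R[i]) & ~np.isnan(R[ip])  # items rated by both users (psi_ii')
        if common.sum() < 2:
            continue
        xi, xip = R[i, common] - mu[i], R[ip, common] - mu[ip]
        w = xi @ xip / np.sqrt((xi ** 2).sum() * (xip ** 2).sum() + 1e-12)  # Pearson weight
        num += w * (R[ip, j] - mu[ip])
        den += abs(w)
    return mu[i] + num / den if den > 0 else mu[i]

R = np.array([[5, 4, np.nan, 1],
              [4, 5, 2, np.nan],
              [1, 2, 5, 4.0]])
print(predict(R, 0, 2))   # predicted rating of item 2 by user 0
```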
collaborative filtering formula

concatenation trick
- the weights of the input and hiddens can be concatenated
- size Mx(M+D)
convolution
- acts as a filter; blurring, edge detection, etc..
- denoted I*K(image convoluted with filter/kernel)
- continuous: (a∗w)(t) = ∫ a(τ)·w(t−τ) dτ; discrete: (a∗w)[n] = Σ_m a[m]·w[n−m]
- signal and kernel interchangeable
- can replace x with x’
- filters are the patterns we're looking for; spikes in the output mark their occurrences
convolution versus cross-correlation
convolution: (f*g)[n] = ∑f[m]g[n-m]
cross-correlation: (f♦g)[n] = ∑f[m]g[n+m]
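A quick NumPy illustration of the difference (1-D signals, illustrative values): cross-correlation is just convolution with a flipped kernel.

```python
import numpy as np

f = np.array([1.0, 2, 3])
g = np.array([0.0, 1, 0.5])

conv = np.convolve(f, g)                    # (f*g)[n] = sum_m f[m] g[n-m]
xcorr = np.correlate(f, g, mode="full")     # cross-correlation: no flip of the kernel
flipped = np.convolve(f, g[::-1])           # convolving with a reversed kernel gives cross-correlation

print(conv)      # [0.  1.  2.5 4.  1.5]
print(xcorr)     # [0.5 2.  3.5 3.  0. ]
print(flipped)   # same as xcorr: the only difference between the two ops is the kernel flip
```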
convolutional neural network
- imitates the visual cortex in the brain
- invariant to shifts in features of image (translational invariance)
- convolution->pooling->convolution->pooling->fully connected layer
- 3d image has 3d filter
- multiply the filter by the input elementwise; pooling downsamples (reduces the spatial size)
- the features that are found actually resemble things in the input (layer 1 shapes/edges, layer 2 body parts, layer 3 faces)
cross-entropy
- use for classification
- use for autoencoder
- entropy and variance both correlate with unpredictability
- -T.mean(T.log(y[T.arange(y.shape[0]), t]))
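The same quantity in plain NumPy (assuming `y` is an N x K matrix of predicted probabilities and `t` holds integer class labels):

```python
import numpy as np

y = np.array([[0.7, 0.2, 0.1],     # predicted class probabilities (N x K)
              [0.1, 0.8, 0.1]])
t = np.array([0, 1])               # integer targets

# mean negative log-likelihood of the correct class
loss = -np.mean(np.log(y[np.arange(y.shape[0]), t]))
print(loss)   # -(log 0.7 + log 0.8) / 2 ≈ 0.290
```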
d(sigmoid(x))/dx
y*(1-y)
d(tanh(x))/dx
1-y**2
data lake/warehouse
- lake: large-scale pool of raw data without a concrete purpose
- data warehouse: repository for structured, filtered data that has already been processed for a specific purpose.
- data lakes were born out of the need to harness big data and benefit from the raw, granular structured and unstructured data for machine learning
- data warehouses for analytics used by business users.
data pipeline
- encapsulate a number of processing steps required to prepare data for machine learning
- performing “data prep” operations such as cleansing data and handling missing data and outliers
- also transforming data into a form better suited for machine learning
- includes training or fitting a model and determining its accuracy
- automated so their steps may be performed on a continued basis
data poisoning
- feeding adversarial data into training to pollute the training set so the classifier performs worse
- model skewing:
- attackers pollute training data to shift learned boundary
- use sensible data sampling
- don't allow any one user too much influence over the system
- use weight decay
- compare the new classifier to the previous one
- dark launch: compare the two outputs on the same traffic
- backtesting: A/B testing on a fraction of traffic
- build a golden-standard dataset the classifier must accurately predict
- feedback weaponization: using feedback systems to attack legitimate users and content
- verify feedback before making decisions
- don't assume the apparent beneficiary is responsible: attackers cover their tracks to get legitimate users penalized
deeper is better
use fewer hidden units per layer and achieve better performance
DeepMind wavenet
- text to speech
- uses CNN
- is used by Google Assistant
dis/advantage full batch
- won't work on a big dataset due to memory size
- maximize likelihood over entire set
distribution ML pipeline tools
- Apache Spark
- Apache Airflow
- Kubeflow
docker containers
- lightweight, user-level environment (similar to a small virtual machine) that helps data scientists build, install, and run code
- built from a script
- ability to version control a data science environment
does data influence training time?
- training time per gradient step does not depend on the amount of data
- only the time to create (load and prepare) the dataset increases
dropout regularization
- drops random nodes during training
- hidden layer units are forced to rely on multiple inputs
- emulates ensembles
- approximates 2^(# neurons) networks
- at prediction time, multiply the layer outputs by p(keep)
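A minimal NumPy sketch of train-time masking and the p(keep) scaling at prediction time (shapes and p_keep are illustrative):

```python
import numpy as np

p_keep = 0.8
X = np.random.randn(4, 10)
W = np.random.randn(10, 5)

# training: drop random units by multiplying by a Bernoulli mask
mask = (np.random.rand(*X.shape) < p_keep)
train_out = (X * mask) @ W

# prediction: no mask, but scale by p_keep so expected activations match training
test_out = (X * p_keep) @ W
```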
edge analytics
- data collection and analysis where an analytical computation is performed on data at the point of collection (e.g. a sensor) instead of waiting for the data to be sent back to a centralized data store
- IoT model of connected devices has become more established
- filter for what information is worth sending to a central data store for later use.
ensemble network
- majority vote; better accuracy than one model; two methods (different features, same features)
- 100 features 1 million data points
- 10 networks of 10 features
exponentially-smoothed average
- z(t) = (1 − α)·z(t−1) + α·cost(t), with 0 < α < 1
- bias correction: ẑ(t) = z(t) / (1 − (1 − α)^(t+1)), counting t from 0
- more recent values matter more
- thus better for non-stationary data
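A short sketch of the smoothing plus bias correction above (the decay value α = 0.1 and the per-batch costs are illustrative):

```python
import numpy as np

alpha, z = 0.1, 0.0
costs = np.random.rand(100)                              # e.g. per-batch losses
smoothed = []
for t, c in enumerate(costs):
    z = (1 - alpha) * z + alpha * c                      # exponentially-smoothed average
    smoothed.append(z / (1 - (1 - alpha) ** (t + 1)))    # bias correction for early steps
```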
face map
Face maps provide the ability to combine different types of relationships into one that you can do math on
- THEOREM: if you combine relationships that are null or simpler, you reconstruct the larger system
- Co-limits
- Validating: measure the low-dimensional entropy and the high-dimensional one
- Can iteratively combine low-dimensional bad ones with high-dimensional good ones until you find the right one
fancy method
- NLP task
- max pool over the hidden states h(t) in the RNN
- determine if some sequence is in the data
- has 1 output
fastest AI
gated recurrent unit
- similar to the rated RNN: you have to choose between
- taking the old value
- or taking the new value
- if reset gate r(t) = 0, it's like beginning a new sequence
- but h(t) will be a combination of h(t-1) and ĥ(t)
- has same form as update gate
- is essentially a forgetting factor
- update gate weights previous
- z(t) = act(x(t)·Wxz + h(t−1)·Whz + bz)
geospatial analytics
- handle geographic information system (GIS) data (e.g. GPS data) and imagery (e.g. satellite photographs)
- uses geographic coordinates as well as identifier variables such as street address and zip code
- create geographic models and data visualizations for more accurate modeling and predictions.
GPU acceleration
- use GPUs alongside the CPU to accelerate computation
- GPU database accelerates certain database operations
graph
- represents a connection between a collection of entities (e.g. spending habits of consumers)
- vertex (nodes): node attributes such as age and height, and number of neighbors
- edge: a relationship between entities (e.g. customer and product) - has an edge identity and weights
graph database
- uses “graph theory” to store, map and query relationships of data elements
- collection of nodes and edges
- node represents an entity such as a product or customer
- have: a unique identifier, a set of outgoing edges and/or incoming edges, in addition to a set of key/value pairs
- edge represents a connection or relationship between two nodes
- have: unique identifier, a starting-place and/or an ending-place node along with a set of properties
graph database image

grid search
exhaustive method that loops through all possible combinations of hyperparameter values
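A minimal sketch with itertools (the grid and the stand-in score function are hypothetical; scikit-learn's GridSearchCV adds cross-validation on top of the same loop):

```python
from itertools import product

grid = {"lr": [1e-3, 1e-2, 1e-1], "hidden": [32, 64], "dropout": [0.2, 0.5]}

def score(params):                 # stand-in: replace with real train + validate
    return -abs(params["lr"] - 1e-2) - abs(params["dropout"] - 0.2)

best = max((dict(zip(grid, vals)) for vals in product(*grid.values())), key=score)
print(best)   # {'lr': 0.01, 'hidden': 32, 'dropout': 0.2}
```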
GRU architecture

how to fix overrepresentation of end
- only add end token n % of the time
- otherwise stop on second to last word
how to prevent or cause same prediction every time
- don't model the initial word probability distribution
- p(w(0)) = softmax(f(‘start’)), w(0) = randint(V, p=p(w(0)))
- output probability instead of deterministic output
increase CNN invariance
- modify training data
- orientation,
- color,
- size, etc…
- can be used to increase one portion of data size if there is class imbalance
indirect encoding
- field of neuroevolution
- analogous to pruning in lottery ticket
- expressive while reducing parameters
item-item vs user-user
- item-item
- choose items for user b/c liked similar items in the past
- user-user
- choose items for user b/c liked by similar users
- uses the same algorithms just transpose the ratings matrix
- comparing items as opposed to users provides more data per pair, and it is faster
k-fold cross validation
- split into k groups
- train on groups [1:k], test on group [0]
- train on groups [0]+[2:k], test on group [1]
- etc…
- use a t-test to compare scores across folds (e.g. between two models)
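A minimal sketch using scikit-learn's KFold (model and data are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(scores), np.std(scores))   # per-fold scores can feed a t-test between two models
```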
KL divergence
- use to compare the similarity of two distributions
- the gradient is the same as cross-entropy / negative log-likelihood
- KL and cross-entropy are interchangeable for backpropagation
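A NumPy check that KL and cross-entropy differ only by the entropy of the target, which is constant with respect to the prediction (distributions are illustrative):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])     # target distribution
q = np.array([0.5, 0.3, 0.2])     # predicted distribution

kl = np.sum(p * np.log(p / q))                   # KL(p || q)
cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
print(np.isclose(kl, cross_entropy - entropy))   # True: KL = H(p, q) - H(p)
```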
L1 regularization
encourages sparsity (=0)
L2 regularization
encourages small weights (≈0)
latency & sensors in python
- write in PyTorch or TensorFlow
- rewrite the model in C++ for deployment
- either rewrite completely or serialize it
- PyTorch: tracing (TorchScript)
- TensorFlow: graph mode
linear regression regularization
- J = Σ(y − ŷ)² + 0.5·λ·(‖W₁‖²_F + ‖W₂‖²_F + … + ‖W_L‖²_F)
- ‖·‖_F = Frobenius norm
- each W is a weight matrix
- punishes complexity
- improves error on unseen data
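A minimal sketch of the L2-penalized cost and its gradient step for plain linear regression (a single weight vector rather than the per-layer Frobenius norms above; data and learning rate are illustrative):

```python
import numpy as np

def ridge_cost(w, X, y, lam):
    err = y - X @ w
    return np.sum(err ** 2) + 0.5 * lam * np.sum(w ** 2)   # squared error + weight penalty

def ridge_grad(w, X, y, lam):
    return -2 * X.T @ (y - X @ w) + lam * w                # gradient of both terms

X, y = np.random.randn(50, 3), np.random.randn(50)
w, lr = np.zeros(3), 0.005
for _ in range(300):
    w -= lr * ridge_grad(w, X, y, 0.1)
print(w, ridge_cost(w, X, y, 0.1))
```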
local minimum
- a critical point is more likely a saddle point than a local minimum, so it is not really a problem
- gradient descent slides off saddle points
- very low probability that the loss curves upward in all dimensions at once
logarithmic sampling
sample hyperparameters on a log scale so you cover a broad range of magnitudes instead of values that are all close together
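A one-line sketch: sampling a learning rate log-uniformly instead of uniformly (range is illustrative):

```python
import numpy as np

# uniform in the exponent => log-uniform in the value: covers 1e-5 .. 1e-1 evenly per decade
lrs = 10 ** np.random.uniform(-5, -1, size=10)
print(np.sort(lrs))
```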
lottery ticket
- find a lucky (winning-ticket) sparse subnetwork inside a dense network
- present in LSTMs and transformers
- fit on small device
low-code/no-code
- ML applications w/ drag and drop components
- connect components to create a finished application
- many enterprise Business Intelligence (BI) platforms fall into this platform category
LSTM algorithm

LSTM architecture

LSTM params
- Input gate:
- params: Wxi, Whi, Wci, bi
- depends on: x(t), h(t-1), c(t-1)
- Candidate cell:
- params: Wxc, Whc, bc
- depends on: x(t), h(t-1)
- Forget gate:
- params: Wxf, Whf, Wcf, bf
- depends on: x(t), h(t-1), c(t-1)
- Output gate:
- params: Wxo, Who, Wco, bo
- depends on: x(t), h(t-1), c(t-1)
machine learning security
malicious manipulation of networks to worsen their output
main problem with AI
- don’t have common sense
- networks find correlation, not causation
- example research: causal bayesian network
major AI conferences
- NeurIPS (NIPS): neural networks, but not exclusively
- International Conference for Machine Learning (ICML): general machine learning
- International Conference on Learning Representations (ICLR): really the first conference focused on deep learning.
- Association for the Advancement of Artificial Intelligence (AAAI): more application based
- IEEE conferences
- International Joint Conference on NN (IJCNN)
- IEEE International Conference on Fuzzy Systems (FUZZ-IEEE)
- IEEE Congress on Evolutionary Computation (CEC)
Markov decision process
- set of all states: measurements
- set of all actions: actions the agent can do
- set of all rewards: received at each step
- state transition probabilities
- discount factor (gamma)
mixed data
multiple independent data types
- Numeric/continuous
- Categorical
- Image
model stealing
- steal prediction models and spam filters in order to optimize against them
- e.g. black-box probing of stock market prediction models
- model reconstruction: recreate model by probing the public API and refining own model using it as an oracle
- membership leakage: the attacker builds shadow models that are used to determine whether a given record was used to train the model
MSE versus cross-entropy
- MSE for regression
- assumes target continuous and normally distributed
- doesn't punish misclassification enough
- vanishing gradient
- cross-entropy for classification
- maximizes likelihood
- decision boundary is large
- converge faster
naive bayes
- generative classifier (models p(x|y) instead of the discriminative p(y|x))
- p(x|y) = Π_n p(x_n|y)
- naive because it assumes no covariance between features
- the independence assumption is more valid if you use PCA first (decorrelated features)
natural graphs
DEFINITION: a set of points that have an inherent relationship between them
- Co-occurrence: capture user behavior based on interactions with data
EXAMPLES:
- Citation Graph: capture relationship between articles to other articles using citations
- Natural Language: nodes represent entities and edges represent relationships between pairs of entities
neural structured learning (NSL)
GOAL: optimize the supervised loss AND a neighbor loss to keep structural similarity while learning; (bonus) requires less labeled data
- Graph regularization: train the ANN with a graph-regularized objective, harnessing labeled and unlabeled data
- Adversarial learning: generate adversarial neighbors that keep structural similarity to other samples
- Adversarial structures: as opposed to graphs, they are implicitly inferred
- use similarity between instances (using a pretrained embedder); if you don't have similar structures, create adversarial examples intended to mislead the network into an incorrect classification
- perturbations are usually generated by reversing the gradient
neuro-symbolic
- symbolics: better at abstraction and reasoning
- ml: better at scalability and pattern recognition
- hybrid: understanding causal relationships
neurons in brain
- 10^11 neurons in the human brain
- each is connected to 1,000 to 10,000 other neurons
node classification methods
- deepwalk: derive embeddings from truncated random walk from graph data
- graph CNN: apply convolutions over the graph; each additional layer expands the neighborhood of the network that is aggregated
- graph BERT: removed dependency on links
noise injection
- type of regularization
- N ~ (0, var << 1)
- injection to inputs and weights
NSL Architecture
olfactory machine learning
- uses 4 layers
- finds the maximum signal strength
- one neuron can represent multiple smells
- as opposed to the visual cortex, where one neuron maps to one pixel
- attempts the cocktail party problem
- narrows in on particular sound signals in a short period
- disentangles conversations
one-hot encoding
represent each category as a binary vector with a single 1 (not as integers like red = 1, blue = 2, green = 3, which would imply an order)
i.e. red = [1, 0, 0], blue = [0, 1, 0], green = [0, 0, 1]
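A minimal NumPy sketch (colors and ordering are illustrative):

```python
import numpy as np

colors = ["red", "blue", "green", "blue"]
classes = sorted(set(colors))                     # ['blue', 'green', 'red']
index = {c: i for i, c in enumerate(classes)}

one_hot = np.eye(len(classes))[[index[c] for c in colors]]
print(one_hot)
# [[0. 0. 1.]   red
#  [1. 0. 0.]   blue
#  [0. 1. 0.]   green
#  [1. 0. 0.]]  blue
```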
open-source libraries
- Tensorflow (Google): symbol math library; high level keras; c++ support
- Pytorch (Facebook): ML/DL
- Scikit-learn: classification, regression, and clustering algorithms
- Jax: basically NumPy with ML support and GPU/TPU acceleration
- PySpark: extremely fast cluster computing for Python using Spark on a standalone cluster
- Spark SQL for dataframes
- MLlib for ML (for Spark clusters - runs on Hadoop, Apache Mesos, Kubernetes)
overfit
- a really complex model with many parameters may not generalize well
- bias variance tradeoff
- simpler models overfit less
- regularization
- noise injection
- dropout
- batch normalization
p(drop/keep)
- probability of keeping or dropping a node
- typically p(keep) 0.8 for input layers and p(drop) 0.5 for dropout layers
- multiply by mask
parity problem
- application: data transmission in communication system
- sends bitstream to reciever
- if the number of 1's is odd, add parity bit 1
- if the number of 1's is even, add parity bit 0
- receiver recounts: if the total number of 1's is odd, there is an error
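A small sketch of this even-parity check (bit values illustrative):

```python
def add_parity_bit(bits):
    # even parity: append 1 if the count of 1's is odd, else 0, so the total count of 1's is even
    return bits + [sum(bits) % 2]

def has_error(received):
    # receiver check: an odd number of 1's means at least one bit flipped in transit
    return sum(received) % 2 == 1

sent = add_parity_bit([1, 0, 1, 1])      # -> [1, 0, 1, 1, 1]
print(has_error(sent))                   # False
corrupted = sent.copy(); corrupted[2] ^= 1
print(has_error(corrupted))              # True
```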
penalty term
term that grows as the weight grows
pooling
- a method used in CNNs to downsample (reduce the size of features)
- take the maximum/average in a block
pretrained models
- bigger model with more comprehensive pretraining data
- performs better at multiple downstream tasks
- fewer training examples
- saves money collecting more annotated samples
rated recurrent neural network (RRNN)
- Z = rate
- weight two things
- f(x(t), h(t-1)): output of RNN
- h(t-1): previous value of hidden state
- element-wise multiplication
- h(t) = (1-Z)°h(t-1) + Z°f(x(t), h(t-1))
- acts like a low pass filter
- Z can be calculated many ways
- weight param
- function of X
- z(t) = f(Wxz·x(t) + Whz·h(t−1) + bz)
recommendation interfaces
- news feed
- product feed
- similar product
- associated products
- search engine
- advertisements
recurrent unit (Elman unit)
- h(t) = f(Wh^T·h(t−1) + Wx^T·x(t) + b_h)
- y(t) = softmax(Wo^T·h(t) + b_o)
- relies on all previous states, no Markov assumption
- x(t): D, Wx: D x M
- h(t): M, Wh: M x M
- y(t): K, Wo: M x K
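A minimal NumPy sketch of one Elman step per time step, matching the shapes above (D, M, K and the random sequence are illustrative):

```python
import numpy as np

D, M, K = 4, 8, 3
Wx, Wh, bh = np.random.randn(D, M), np.random.randn(M, M), np.zeros(M)
Wo, bo = np.random.randn(M, K), np.zeros(K)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

h = np.zeros(M)
for x in np.random.randn(10, D):               # a length-10 input sequence
    h = np.tanh(h @ Wh + x @ Wx + bh)          # h(t) = f(Wh^T h(t-1) + Wx^T x(t) + bh)
    y = softmax(h @ Wo + bo)                   # y(t) = softmax(Wo^T h(t) + bo)
```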
recurrent vs markov model
- using softmax with an RNN is about optimizing the joint probability of the sequence
- probabilities of longer sequences don't go to 0 (no explicit product of per-step factors)
- the joint probability is implicit in the model
ReLU
- rectified linear unit
- biologically plausible due to asymmetry
- better gradient propagation (no pretraining prep needed for deep networks)
- however, a dying ReLU can occur when a unit's activation is always zero
same/half padding
- padding the input with 0 (~half the length of the filter) to allow the output size to equal input size
- filters usually odd size to allow both sides of input to have equal padding
saturated
the regions of a function (such as sigmoid) that are not dynamic (flat, near-zero gradient)
selecting CNN layers
- depth increases through layers, height and width decrease (reverse for type 2 - image in output)
- convolutional on the side of image
- fully-connected on the side of vector
- fully connected layers can all be the same size (research has found that it does not overfit)
self attention diagram 1

self-attention diagram 2

solving class imbalance
sample the smaller class with replacement until it is the same size as the bigger class
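A minimal NumPy sketch of oversampling the minority class with replacement (class counts are illustrative):

```python
import numpy as np

X = np.random.randn(1000, 5)
y = np.array([0] * 950 + [1] * 50)                 # heavily imbalanced labels

minority = np.where(y == 1)[0]
extra = np.random.choice(minority, size=950 - 50, replace=True)   # sample with replacement
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y_bal))   # [950 950]
```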
spiking neural networks
activation potentials resemble biological neuron
stochastic gradient descent (SGD)
- assume all data IID
- very slow
- error improves in the long run (although it may or may not improve on any single step)
- sample with replacement (sometimes)
structured v. unstructured data
- tabular data versus raw data
- data that looks like an SQL table, with inputs and IDs neatly defined, versus unstructured data like an image
supervised learning for recommender systems
- input demographics
- age, gender, religion
- occupation, education
- location, race
- predict the users reaction
- did they buy item
- did they click on the ad
truncated back propagation
- long sequences take a lot of time
- stop at certain number of time steps
types of RNN
- label for sequence
- label for each layer
- no labels
unrolled RNN

valid/full mode
- valid mode: output length N − K + 1
- full mode: output length N + K − 1 (input length N, kernel length K)
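A quick NumPy check of the two output lengths (N = 10, K = 3, values illustrative):

```python
import numpy as np

x = np.arange(10.0)          # N = 10
k = np.array([1.0, 0, -1])   # K = 3

print(np.convolve(x, k, mode="valid").shape)   # (8,)  = N - K + 1
print(np.convolve(x, k, mode="full").shape)    # (12,) = N + K - 1
```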
vanishing gradient/exploding gradient
- the derivative of sigmoid is at most 0.25
- repeated multiplication: a^n → 0 (vanishing) or ∞ (exploding)
- the network learns very slowly / not at all
weight initialization
- tanh: weight variance 1/M1 (or 1/(M1+M2))
- ReLU: weight variance 2/M1
- draw from N ~ (0, 1) and scale by the square root of the chosen variance
- should normalize the input: X = (X − µ)/std
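A short sketch of drawing from N(0, 1) and scaling to the variances above (layer sizes illustrative):

```python
import numpy as np

M1, M2 = 256, 128   # fan-in, fan-out

# draw from N(0, 1), then scale so the weight variance matches the rule above
W_tanh = np.random.randn(M1, M2) * np.sqrt(1.0 / M1)    # variance 1/M1 for tanh
W_relu = np.random.randn(M1, M2) * np.sqrt(2.0 / M1)    # variance 2/M1 for ReLU

X = np.random.rand(32, M1)
X = (X - X.mean(axis=0)) / X.std(axis=0)                # normalize the input: (X - mu) / std
```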
why initialize weights randomly
prevents symmetry (otherwise every unit in a layer learns the same thing - like having one unit!)
why sigmoid?
- monotonically increasing
- is the output of binary logistic regression (neuron like logistic)
zero-day inputs
rare, but they happen
- new product or feature launch: new functionality opens new attack surfaces
- increased incentives: e.g. abusing Google Cloud compute to mine bitcoin
image-as-graph
each pixel is a 3-D RGB vector (node) connected to its 8 "adjacent" neighbors
Text-as-graph
represent text as a sequence of tokens in a directed graph
Examples of Graphs
- Social networks: people as nodes and their relationship as an edge
- molecules: Atoms as nodes and covalent bonds as edges
- Citation networks: each paper is a node and an edge is a citation from one paper to another
- can even use a word embedding of the abstract as information about a node
Graph vs. Node vs. Edge TASK
- Graph: predict a property of the entire graph (e.g. smell of a molecule, overall label)
- Node: predict the identity or role of each node (e.g. member loyal to John or Jane, image segmentation, POS tagging)
- Edge: determine the relationship between objects in images
representing graphs in ML
- Tabular: often sparse
- node: feature matrix N x D (nodes x features)
- connectivity: adjacency matrix (N x N, mostly zeros)
- Lists: more computationally efficient
- Nodes = [[0, 1], [1, 0], [0, 0]]
- Edges = [[2, 1], [1, 1]]
- Adjacency list = [[1, 0], [2, 0]]
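A minimal sketch of the list representation above for a 3-node graph (values taken from the bullets; the dense adjacency matrix is shown only for contrast):

```python
import numpy as np

# 3 nodes, 2 features each (N x D feature matrix)
node_features = np.array([[0, 1],
                          [1, 0],
                          [0, 0]])

# connectivity as an adjacency list of [source, target] pairs, plus per-edge features
adjacency_list = [[1, 0], [2, 0]]          # node 1 -> node 0, node 2 -> node 0
edge_features = np.array([[2, 1],
                          [1, 1]])

# equivalent dense adjacency matrix: mostly zeros for large sparse graphs
A = np.zeros((3, 3), dtype=int)
for s, t in adjacency_list:
    A[s, t] = 1
```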
graph neural network
an optimizable transformation on all attributes of the graph (nodes, edges, global-context)
- graph-in, graph-out: architecture accepts graph as an input, with information loaded into its nodes, edges, and global-context, and progressively transforms the embeddings, without changing the connectivity of the input graph
simplest GNN
- Learn embeddings for graph attributes (nodes, edges, global) W/O changing connectivity
- Multilayer perceptron on each component
- N(n_f|n_0), E(e_f|e_0), G(g_f|g_0)
Large Language Generative Models Sizes
- params: Megatron-Turing NLG (530 B), GPT-3 & AI21 Jurassic (175 B)
- tokens: GPT-3 (300 B), Chinchilla (1.4 T)
- human feedback training
areas for improvement generative models 2022
- memory - recall previous information
- summarize - decompose info into elements
- Model/Data quality assurance:
- 4.6-17.2 T tokens (books, literature, articles)
- NOT blogs, webpages, social media, spoken content
- bigger is not better = more cost, impossible to beat, …
- text-to-video
affective AI
- emotion detection in image, video, text, sound waves
- current technologies average over samples, which leads to possible bias (e.g. cultural and sex differences)
2024 predictions
- humanoid robots - maybe a foundational model
- ML Ops - Weights & Biases (raised $200 million at a $1 billion valuation)
- ethics and law tossed in generative AI mix
- still lacking basic knowledge
- AI search using generative AI
- mixed modal foundational models
6 openAI competitors
whoever can run the cheapest and best for the job
- Adept AI - ~$660M (Greylock) - 20 employees - browser extension for instruction-driven web surfing
- AI21 - $750 million - 160 employees - customize models for automated copywriting, summarizing documents
- Anthropic - $4 billion - 80 employees - general-purpose LLMs free of bias and toxicity
- Character - 18 employees - create and interact with chatbots that can role-play
- Cohere - ~$1B (Tiger Global) - summarizing documents, copywriting, and search
- Inflection - $1.23B (Greylock) - communicate with machines to relay our thoughts and ideas
How to find badly labelled data?
- sort by model loss to find ambiguous samples, AND sort by confidence where the model and ground truth disagree
- Confident Learning: pruning noisy data based on prediction intervals (worse than the aforementioned method)