Lecture 3 Flashcards
What are the building blocks of AI?
Addition and multiplication
There are two sides to the “is AI intelligent” debate. What are they?
“It’s just math” (so no) vs. “look at the performance” (so yes)
i.e., how AI works vs. what it can do
What is the XOR problem?
Very simple (single-layer) models cannot learn the logical XOR function: “either this or that, but not both”
What is the solution to the XOR problem?
More complicated neural networks with a hidden layer (intermediate steps), as in the sketch below
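A minimal sketch with hand-picked weights (illustrative, not from the lecture) showing that one hidden layer is enough to compute XOR, which no single linear unit can:

```python
# Minimal sketch: hand-picked weights (illustrative, not from the lecture)
# showing a two-layer network computing XOR, which no single linear unit can.

def step(x):
    """Threshold activation: fires (1) when the weighted input is positive."""
    return 1 if x > 0 else 0

def xor_network(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # hidden unit 1: fires if x1 OR x2
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2: fires if x1 AND x2
    return step(h_or - h_and - 0.5)  # output: OR but not AND = XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_network(a, b))  # prints 0, 1, 1, 0
```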
What is a neural network?
A neural network is really just stacked logistic regressions plus intermediate steps (see the figure from the lecture and the sketch below)
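A minimal sketch of that idea, with made-up weights: each layer is a logistic regression run on the previous layer’s outputs.

```python
# Minimal sketch of "stacked logistic regression": each layer applies a
# weighted sum followed by a sigmoid, exactly like logistic regression,
# using the previous layer's outputs as inputs. Weights here are made up.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    """One 'logistic regression' step: weighted sum + sigmoid."""
    return sigmoid(W @ x + b)

x = np.array([0.5, 1.0])                   # two input predictors
W1, b1 = np.array([[0.8, -0.4], [0.3, 0.9]]), np.array([0.1, -0.2])
W2, b2 = np.array([[1.2, -0.7]]), np.array([0.0])

hidden = layer(x, W1, b1)        # intermediate predictors the model learns
output = layer(hidden, W2, b2)   # final prediction, a probability in (0, 1)
print(hidden, output)
```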
Why do neural networks work better than logistic regression?
Because they can do multiple things at once via intermediate steps
(e.g., the lecture example of first inferring genre and then using it as a predictor, i.e., considering other factors that have an effect)
What is the trade-off of using neural networks vs. logistic regression?
Neural networks need large amounts of data to work
Do the intermediate steps in a neural network have to be specified by hand?
No, the model learns these intermediate predictors itself, based on prior (training) data
What is the universal approximation theorem?
Any function mapping predictors to an outcome can be approximated by a (large enough) neural network
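A standard formal statement, for reference (more precise than the lecture’s informal phrasing; σ is a non-linear activation such as the sigmoid):

```latex
% Universal approximation theorem (one common form): a network with a single
% hidden layer of N units can get within any tolerance \varepsilon of a
% continuous function f on a bounded region K.
\forall \varepsilon > 0 \;\; \exists N,\, \alpha_i,\, w_i,\, b_i :\quad
\left| f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon
\quad \text{for all } x \in K
```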
Does an AI like ChatGPT work on this neural network model?
No, but it is one of the building blocks
How do language models work (simply)?
They predict the next word (based on the data they were trained on)
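A toy sketch of next-word prediction over a made-up corpus: count which word follows which, then predict the most frequent follower. Real language models use transformers, but the objective is the same.

```python
# Toy sketch of "predict the next word": a bigram counter over a made-up
# corpus. Illustrative only; real models predict from far richer context.
from collections import Counter, defaultdict

corpus = "the team wins the prize the team plays at home".split()

followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1          # count what follows each word

def predict_next(word):
    """Return the most frequent word seen after `word` in training data."""
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))   # -> "team" (seen twice after "the")
```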
Why do neural networks not work for language models?
They disregard the order of the words
What is the solution to neural networks not working for language?
Transformer models
What are transformer models used for?
Language-based models, but nowadays most AI systems (non-language ones included) have transformers as their basis
There are two sides to a transformer model. What are they for?
The left side (the encoder) turns the input words into numbers; the right side (the decoder) turns the numbers back into words
What are skip-connections?
The number strings (word representations) are copied and carried forward through the model, so that good information from early layers is not forgotten
In effect, you combine the early, simple representation with the later, complicated one to get the best result
The skip-connections are the add & norm part of the model. What do the add and norm individually do, though?
The add part is really the skip-connection: the simple and the complicated representations are added together
The norm part makes sure the result is not blown out of proportion: the combined values are rescaled (normalized) rather than kept at their full, ever-growing size
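A minimal numpy sketch of the add & norm step, assuming a generic stand-in sublayer: add the skip copy back in, then normalize so the values stay on a stable scale.

```python
# Minimal sketch of "add & norm" on one word's number string.
# `sublayer` stands in for the complicated part (attention or FFN).
import numpy as np

def layer_norm(x, eps=1e-5):
    """Rescale to mean 0 and unit variance so values stay in proportion."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def add_and_norm(x, sublayer):
    # add: the skip-connection combines the early (x) with the complicated
    # (sublayer(x)); norm: keep the sum from blowing up across many layers.
    return layer_norm(x + sublayer(x))

x = np.array([0.8, 1.1, -0.3])
print(add_and_norm(x, lambda v: 0.5 * v))   # toy sublayer for illustration
```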
What is the positionwise FFN?
The “basic” neural network introduced in the lecture (stacked logistic regression), applied to each word position separately
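A minimal sketch with made-up weights: the same small two-layer network is applied to each word’s string independently.

```python
# Minimal sketch of a position-wise feed-forward network: the same two-layer
# network is applied at each position (word) independently. Weights made up.
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 @ relu(W1 @ x + b1) + b2, per position."""
    hidden = np.maximum(0, W1 @ x + b1)   # ReLU nonlinearity
    return W2 @ hidden + b2

d_model, d_hidden = 2, 4
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_model)

sentence = [np.array([0.8, 1.1]), np.array([0.2, -0.5])]     # two word strings
outputs = [ffn(word, W1, b1, W2, b2) for word in sentence]   # one per position
print(outputs)
```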
What are embeddings?
Embeddings are numerical representations of word meanings, based on a word’s associations with other words
How do embeddings work?
Essentially a basic neural network that takes a single word as input and outputs multiple numbers giving the word’s association with other words in the language (based on previous learning)
[0.4, 0.4, 0] is a simplistic version of an embedding string, what do these numbers mean?
Strengths (probabilities) of the word’s association with other words
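A toy sketch with made-up numbers: an embedding is just a lookup from word to number string, and words with similar meanings end up with similar strings.

```python
# Toy sketch of embeddings (numbers made up): each word maps to a number
# string; words used in similar contexts end up with similar strings.
import numpy as np

embeddings = {
    "prize": np.array([0.4, 0.4, 0.0]),
    "price": np.array([0.1, 0.5, 0.7]),
    "award": np.array([0.5, 0.3, 0.1]),   # close to "prize" in meaning
}

def similarity(a, b):
    """Cosine similarity: higher means more associated meanings."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(similarity(embeddings["prize"], embeddings["award"]))  # high (~0.96)
print(similarity(embeddings["prize"], embeddings["price"]))  # lower (~0.49)
```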
What is positional encoding?
Each word gets an additional number string that encodes its position in the sentence
How does positional encoding work?
Wave lines on a graph: find the position the word occupies in the sentence, read off the wave values at that position on the graph, and assign the resulting string to the word
Between which numbers is positional encoding always?
-1 and 1 (the values are read off sine/cosine waves, which stay within this range)
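A minimal sketch of the standard sine/cosine scheme (the “wave lines” from the lecture); since sine and cosine only output values in [-1, 1], the encoding always stays between -1 and 1.

```python
# Minimal sketch of sinusoidal positional encoding (the "wave lines"):
# each position reads its values off sine/cosine waves of varying frequency,
# so every value lands between -1 and 1.
import numpy as np

def positional_encoding(position, d_model):
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))   # slower waves as i grows
        pe[i] = np.sin(position * freq)
        if i + 1 < d_model:
            pe[i + 1] = np.cos(position * freq)
    return pe

for pos in range(3):                      # first three sentence positions
    print(pos, positional_encoding(pos, d_model=4))
```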
How do positional encoding and embedding combine?
By element-wise addition: [1, 0.3] + [-0.2, 0.8] = [0.8, 1.1]
What is multi-head attention?
Enriching the numbers with the context of the other words in the sentence (e.g., whether “prize” or “price” is meant depends on context)
How does multi-head attention work? (short answer)
Check the word’s association with the other words in this specific sentence and put those associations into an equation
How does multi-head attention work? (long answer)
The word’s previous string is taken and three new strings are made from it: a query, a key, and a value. To score a word’s association with another word, you take the first word’s query and the other word’s key; that score goes into the equation as a weight. The value string you have left goes into the equation as the word’s meaning
Example: new “wins” = weight × (“wins” value) + weight × (“prize” value) + weight × (“home” value)
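A minimal numpy sketch of one attention head, with made-up query/key/value numbers for “wins”, “prize”, and “home”: query·key scores become weights via softmax, and the new string is the weighted sum of value strings, just like the equation above.

```python
# Minimal sketch of one attention head, with made-up query/key/value strings
# for the words "wins", "prize", "home". The new "wins" string is a weighted
# sum of value strings, with weights from query-key matches.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

Q = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.4]])  # queries, one per word
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # keys, one per word
V = np.array([[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]])  # values (word meanings)

d_k = K.shape[1]
scores = Q[0] @ K.T / np.sqrt(d_k)   # "wins" query against every key
weights = softmax(scores)            # how much "wins" attends to each word
new_wins = weights @ V               # weighted sum of value strings
print(weights, new_wins)
```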
The transformer model described so far is usually called a base model, when in reality a chat model is usually used. What is the difference?
The chat model has an “end token” that makes sure the input is seen as a whole statement, so the model generates new text (a reply) rather than just continuing the input
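A tiny illustration of the idea; “<end_of_input>” below is a hypothetical token name, since each real chat model defines its own special tokens.

```python
# Illustration only: "<end_of_input>" is a hypothetical token name; real chat
# models each define their own special tokens for marking turns.
user_message = "Explain transformers in one sentence."

base_model_input = user_message
# A base model may simply continue the text:
# "...in one sentence. Also explain embeddings in one sentence. Also..."

chat_model_input = user_message + " <end_of_input>"
# The end token marks the input as a complete statement, so the chat model
# generates a new reply instead of continuing the user's sentence.
```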
What is an evaluation type that models often depend on?
Human evaluation