lecture 6 Flashcards
Different forms of (static) word embeddings
(representing words mathematically)
- One-hot vector: Vector length = vocabulary size; value of each word is 1 at its index, 0 elsewhere
- TF-IDF: Multiply # of occurrences of a word in a document (term frequency, TF) by its inverse document frequency (IDF: total # of documents divided by # of documents containing the word, usually log-scaled); see the sketch after this list
- Skip-Gram: Predict surrounding/context words for a given input/target word (Word2Vec)
- CBOW: Predict target word given surrounding/context words (Word2Vec)
- FastText: Similar to Word2Vec but using n-gram variations of (sub)words
- GloVe: Similar to Word2Vec, but uses corpus statistics and a co-occurrence matrix to capture more global context
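A minimal pure-Python sketch of the TF-IDF computation described above (using the log-scaled IDF variant; the toy documents are made up for illustration):

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (made-up example)
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

def tf_idf(term, doc, docs):
    # term frequency: how often the term occurs in this document
    tf = Counter(doc)[term]
    # document frequency: in how many documents the term occurs
    df = sum(1 for d in docs if term in d)
    # inverse document frequency (log-scaled to dampen the ratio)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

print(tf_idf("cat", docs[0], docs))  # "cat" appears in 2 of 3 docs -> low IDF
print(tf_idf("mat", docs[0], docs))  # "mat" appears in 1 doc      -> higher score
```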
universal dependencies tagset classes
- open class
- closed class words
- other
UDT: open class
- adj: noun modifier describing properties (bnw)
- adv: adverb - verb modifiers of time, place, manner (bijwoord)
- noun: words for persons, places, things (znw)
- verb: words for actions and processes (werkwoord)
- propn: proper noun - name of a person, organization, place, etc.
- intj: interjection - exclamation, greeting, yes/no response
UDT: closed class words
- adp: adposition - spatial, temporal, etc. relation (voorzetsel)
- aux: auxiliary - helping verb marking tense (can, may, should, are)
- cconj: coordinating conjunction - joins two phrases (and, or, but)
- num: numeral (one, two, first)
- part: particle - preposition-like word used together with a verb (up, down, on, off)
- pron: pronoun - shorthand for referring to an entity or event (she, who, I, others)
- sconj: subordinating conjunction - joins a main clause with a subordinate clause (that, which)
UDT: other
- punct - punctuation
- sym - symbols like $
- x - other (asdf, qwfg)
Why might it be useful to predict upcoming words or assign probabilities to sentences?
predicting upcoming words and assigning probabilities to sentences underlies many language applications (speech recognition, spelling and grammar correction, machine translation, autocomplete): choosing the more probable word sequence makes these systems more accurate and natural to interact with
one-hot vector
- localist
- uniquely identifies each word with a sparse vector of zeros and a single one
word2vec + process
- learns word embeddings by predicting word contexts
- skip-gram algorithm
- start with large collection of text, essentially a vast list of words
- every word is represented by a vector
- go through each position t in a text, which has a center c and context words o
- use the similarity of the word vectors for c and o to calculate the probability of o given c (or c given o)
- keep adjusting word vectors to maximize this probability
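A minimal sketch of this process using the gensim library's skip-gram implementation (assuming gensim 4.x is installed; the toy sentences are made up, so the learned vectors will not be meaningful):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (invented for illustration)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "animals"],
]

# sg=1 selects the skip-gram algorithm (sg=0 would be CBOW);
# window is the context size, vector_size the embedding dimension d
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

vec = model.wv["cat"]                # the learned 50-dimensional vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity
```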
word embeddings
- distributed
- map words to dense vectors in a continuous vector space
cosine similarity
quantifies similarity between two vectors by calculating the cosine of the angle between them
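A minimal NumPy sketch of cosine similarity as defined above:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))   # 1.0: same direction
print(cosine_similarity(a, -b))  # -1.0: opposite direction
```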
representing words as discrete symbols
- localist, e.g., one-hot
- vector dimension = number of words in the vocabulary
problem with representing words as discrete symbols + solution
- this method makes words distinct from each other, but doesn't capture relationships between them
- any two one-hot vectors are orthogonal, so their dot product is 0 and they express no similarity
- bad solution: wordnet synonyms for similarity
- good solution: encode similarity in the vectors themselves
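A small NumPy illustration of the orthogonality problem with one-hot vectors (the toy vocabulary is made up for the example):

```python
import numpy as np

vocab = ["hotel", "motel", "cat"]  # toy vocabulary

def one_hot(word, vocab):
    # sparse vector: 1 at the word's index, 0 elsewhere
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

hotel, motel = one_hot("hotel", vocab), one_hot("motel", vocab)
# "hotel" and "motel" are semantically close, but their one-hot
# vectors are orthogonal: the dot product is always 0
print(np.dot(hotel, motel))  # 0.0
```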
what drives semantic similarity
- meaning: two concepts are close in terms of meaning (semantic closeness)
–> accidental & inadvertent
- world knowledge: two concepts have similar properties, often occur together, or occur in similar contexts
–> UPS & FedEx
- psychology: two concepts fit together in an overarching psychological schema or framework
–> millennial & avocado
how do we approximate semantic similarity
representing words by their context: distributional semantics
distributional semantics
a word’s meaning is given by the words that appear frequently close-by
‘you shall know a word by the company it keeps’
context in distributional semantics
the set of words that appear nearby a target word within a fixed-size window
word vectors (word embeddings/word representations)
- distributed representation
- we build a dense word vector for each word, chosen so that it’s similar to the vectors of words that appear in similar contexts
- measures similarity as the dot product
–> dot products for one-hot vectors = 0
properties of dense word embeddings
- they encode semantic and syntactic relationships
- we can probe relations between words using vector arithmetic
–> king - man + woman ≈ queen
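A hedged sketch of probing such relations with pre-trained vectors via gensim's downloader (assuming internet access, that the `glove-wiki-gigaword-100` model is available, and that the queried words are in its vocabulary):

```python
import gensim.downloader as api

# Downloads pre-trained GloVe vectors on first use (about 128 MB)
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# cosine similarity between two words (echoing the UPS & FedEx example above)
print(wv.similarity("ups", "fedex"))
```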
problem with word embeddings
they can reinforce and propagate biases present in the data they are trained on
skip-gram algorithm
- learning word vectors to predict the surrounding words
- randomly initialize word vector for each word in the vocabulary
- go through each position t (with c and o)
–> to identify context words, we define a window of size j, which means the model looks at the words at positions t - j up to t + j (excluding t itself) as the context (see the sketch below)
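A small pure-Python sketch of extracting (center, context) training pairs with a window of size j = 2 (toy sentence chosen for illustration):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every position t in the text."""
    pairs = []
    for t, center in enumerate(tokens):
        # context = words within `window` positions of t, excluding t itself
        start, end = max(0, t - window), min(len(tokens), t + window + 1)
        for o in range(start, end):
            if o != t:
                pairs.append((center, tokens[o]))
    return pairs

tokens = ["problems", "turning", "into", "banking", "crises"]
for center, context in skipgram_pairs(tokens, window=2):
    print(center, "->", context)
```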
word2vec objective function
- likelihood L(theta): maximize the likelihood of the context words (u, o) given the center word (v, c)
–> P(O|C)
- objective function J(theta): the average negative log-likelihood
–> the log of the likelihood, averaged over the corpus and multiplied by -1
–> we want to minimize the objective function
minimizing objective function <–> maximizing predictive accuracy
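In symbols (a standard formulation with corpus length T and window size m; the notation is assumed rather than quoted from the slides):

```latex
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t ; \theta)
\qquad
J(\theta) = -\frac{1}{T} \log L(\theta)
          = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t ; \theta)
```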
calculating L(theta) = P(O|C)
softmax function: maps all arbitrary values Xi to a probability distribution Pi (see the formula below)
- numerator: exp(dot product of the context vector u_o and the center vector v_c)
- divided by a normalization term (a sum over the whole vocabulary)
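The resulting probability (the standard skip-gram softmax, with u_o the context vector and v_c the center vector):

```latex
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```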
P(O|C): dot product
compares similarity of O and C.
a larger dot product means higher probability = higher similarity
P(O|C): exponentiation
exp of the dot product ensures a positive result
P(O|C): normalization
converts results into valid probabilities
= sum of all exponentiated dot products for all word pairs in the vocabulary
training the W2V skip-gram model
- gradually adjust all parameters in theta to minimize loss
- theta represents all model parameters (i.e., all word vectors) in one long vector. each word has 2 vectors: one as a center word (v, c) and one as a context word (u, o).
- with d-dimensional vectors, and V-many words, theta = 2dV parameters.
- optimize parameters with gradient descent
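The gradient-descent update (with learning rate alpha; standard form, not quoted from the slides):

```latex
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)
```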
redefining context windows
- maximum size of context window: different sizes can affect the quality and nature of the embedding
- weighting scheme: the model weights context words based on their distance from the target word
- relative position: symmetric, or left-/right-side only
- linguistic boundaries: e.g., sentence endings
sequence labeling
- assigning a categorical label to each element in a sequence of inputs (POS tags)
- pattern recognition task
- input: sequence of length n (x = x1…xn) (words)
- output: sequence of length n (y = y1…yn). each yi is a label of xi. (POS tags)
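A tiny concrete instance of this input/output format (the tags were chosen by hand for illustration):

```python
# input: sequence of words x = x1..xn
x = ["I", "love", "natural", "language", "processing"]
# output: one label yi per token xi (Universal Dependencies POS tags)
y = ["PRON", "VERB", "ADJ", "NOUN", "NOUN"]

for word, tag in zip(x, y):
    print(f"{word}\t{tag}")
```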
POS
- tags that allow us to label sequences in ways that mimic human understanding of them
- first step towards syntactic analysis
POS key questions
- given a sequence of tokens, how can we predict linguistic tags for each one
- how can we exploit the sequential nature of this data to model hidden features and structural dependencies that will help downstream tasks
motivation for POS tagging
- humans produce and process natural language sequentially
- but these sequences possess hidden structure (i.e., lexical, semantic, and syntactic structure) that constrains the possible sequences
- so, merely having the right words and tags isn't sufficient for proper interpretation and meaningful communication
- hidden structures allow humans to generalize to new infinite sequences (recursion)
how can we replicate human linguistic comprehension in NLP
- primary goal: given a sequence of tokens, predict (hidden) linguistic tags that allow for generalization across categories
- secondary goal: exploit additional hidden linguistic structure to make this task more accurate
–> i.e., the use of additional contextual and syntactic information that is not immediately apparent from the sequence of tokens alone
tasks that build on POS tagging
- dependency parsing
- semantic parsing
- coreference resolution
- information extraction
- question answering
sequence labelling tasks
- POS tagging
- named entity recognition (NER)
- BIO chunking
two approaches for mimicking human language
- machine learning: input –> feature extraction –> classification –> output
- deep learning: input –> feature extraction + classification (both automated) –> output
3 kinds of ML algorithms
- supervised
- unsupervised
- semi-supervised
feature engineering (ML)
- technique that leverages the data to create new features (variables) that aren't explicitly in the training set (applies to both supervised and unsupervised learning)
- goal: simplifying and speeding up data transformations and enhancing model accuracy
- use domain knowledge of the data to extract important linguistic information (features) and train a model on these
ML in practice: 2 tasks
- describing data with features a computer can understand
–> requires expert knowledge
- learning algorithm
–> optimizing weights on the features
sequence labelling classification can be:
- independent: each member is treated independently
- dependent: each member is dependent on other members for its label
data in sequence labelling
open and closed classes
- closed class
–> english uses closed class words to express syntactic relations between words. these give clues to open class words.
–> fixed membership
- open class
–> membership is larger and can grow
POS is a disambiguation task
though word classes do share semantic tendencies, POS is primarily defined by
- grammatical relationship with neighboring words
- morphological properties of affixes: these help distinguish between words that are morphologically related but serve different roles in sentences
extent of POS ambiguity
many high frequency words have more than one POS tag
around 50% of tokens are ambiguous
methods for POS
- lexical-based: assign each word the POS tag it occurs with most frequently in the training data
- rule-based: assign POS tags based on rules
- probabilistic
- deep learning
hidden markov models (HMM)
- based on augmenting the markov chain
- allow us to talk about observed events (words) and hidden events (POS tags)
- markov assumption: P(yi = a | yi-1)
–> strong assumption: if we want to make a prediction about a future state, only the current state matters
–> this simplifies computation by disregarding the entire history of previous states
- equivalent to a bigram model
markov chain
a model that tells us something about the probabilities of sequences of random variables (states), which take values from some set
- states are words (nodes)
- probabilities of transitioning from one state to another are transitions (edges)
- multiply starting probability by transition probabilities
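A minimal NumPy sketch of scoring a state sequence with a Markov chain (start distribution times transition probabilities; the states and numbers are invented for illustration):

```python
import numpy as np

states = ["HOT", "COLD"]
pi = np.array([0.8, 0.2])          # start distribution (sums to 1)
# transition matrix A: A[i, j] = P(next state = j | current state = i)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def sequence_probability(seq):
    """Multiply the starting probability by the transition probabilities."""
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, nxt in zip(idx, idx[1:]):
        p *= A[prev, nxt]
    return p

print(sequence_probability(["HOT", "HOT", "COLD"]))  # 0.8 * 0.7 * 0.3
```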
start distribution pi
vector representing the probability of starting in each possible state (i.e., its entries sum to 1)
HMM =
- hidden MC + observed variables
- transition matrix + emission matrix
viterbi algorithm (HMM)
- find optimal sequence of hidden states
- most likely weather (T) sequence for the observed mood (E) sequence
- find the hidden-state sequence with maximum joint probability; dynamic programming avoids explicitly computing the probability of every possible permutation (see the sketch below)
–> argmax P(x1, y1, ..., xn, yn)
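A compact NumPy sketch of the Viterbi algorithm for the weather/mood example (the transition and emission probabilities are invented for illustration):

```python
import numpy as np

hidden = ["HOT", "COLD"]            # hidden states (weather)
observed = ["happy", "grumpy"]      # observations (mood)

pi = np.array([0.6, 0.4])           # start distribution over hidden states
A = np.array([[0.7, 0.3],           # transition matrix: P(next | current)
              [0.4, 0.6]])
B = np.array([[0.8, 0.2],           # emission matrix: P(observation | state)
              [0.3, 0.7]])

def viterbi(obs_seq):
    obs = [observed.index(o) for o in obs_seq]
    n, k = len(obs), len(hidden)
    v = np.zeros((n, k))            # v[t, s] = best prob of a path ending in s at t
    back = np.zeros((n, k), dtype=int)

    v[0] = pi * B[:, obs[0]]
    for t in range(1, n):
        for s in range(k):
            scores = v[t - 1] * A[:, s] * B[s, obs[t]]
            back[t, s] = np.argmax(scores)
            v[t, s] = np.max(scores)

    # follow backpointers from the best final state
    path = [int(np.argmax(v[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [hidden[s] for s in reversed(path)]

print(viterbi(["happy", "happy", "grumpy"]))  # most likely weather sequence
```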
from ML to DL
- ML
- HMMs let us observe a sequence (x) and predict the labels of the sequence tokens (y)
- only considers the previous state, and often relies on hand-crafted features or language-specific resources for performance
- can be combined with other algorithms to consider the full context (backward/forward)
- DL
- can automatically learn complex features of the data and model multiple dependencies between tokens
- generalizable to out-of-vocabulary (OOV) words and languages
reasons for exploring deep learning
- large amounts of training data favor deep learning
- faster machines and multicore CPU/GPUs favor DL
- new models, algorithms, ideas
- better, more flexible learning of intermediate representations
- effective end-to-end joint system learning
- effective learning methods for using contexts and transferring between tasks
- improved performance
perceptron
- basic building block of a neural network
- takes one or more inputs and produces a single output
- each input is multiplied by a weight, and the products are summed together
- the sum is then passed through an activation function, which determines the output of the perceptron
- output = activation_function(weighted_sum_of_inputs + bias)
- weighted sum of inputs = dot product of input vector and weight vector
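A minimal NumPy sketch of a single perceptron's forward pass as described (a step activation is chosen for illustration):

```python
import numpy as np

def step(z):
    # activation function: outputs 1 if the input exceeds 0, else 0
    return 1.0 if z > 0 else 0.0

def perceptron(x, w, b):
    # weighted sum of inputs = dot product of input and weight vectors, plus bias
    return step(np.dot(x, w) + b)

# Example: a perceptron computing logical AND
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron(np.array(x, dtype=float), w, b))
```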
feed-forward network
- multiple layers of neurons
- with non-linear activation functions, can also solve problems that are not linearly separable (e.g., XOR), unlike a single perceptron
- the weights are what training optimizes; they store the network's knowledge in a distributed, unlabeled form
feed forward network applications
- text classification: sentiment analysis, language detection
- unsupervised learning: word2vec, dimension reduction
to train a NN you need
- training set: ordered pairs, each consisting of an input and its target output
- loss function: a function to be optimized (e.g., cross entropy)
- optimizer: a method for adjusting weights (e.g., gradient descent)
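A minimal sketch of these three ingredients in PyTorch (assuming PyTorch is installed; the random data here stands in for a real training set):

```python
import torch
from torch import nn

# training set: ordered (input, target) pairs -- random stand-in data here
X = torch.randn(100, 10)            # 100 examples, 10 features each
y = torch.randint(0, 2, (100,))     # binary class labels

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()                          # loss function to optimize
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)     # compare predictions with targets
    loss.backward()                 # compute gradients
    optimizer.step()                # adjust the weights
```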