Language Flashcards
encompasses all tasks in which an AI takes human language as input
Natural Language Processing
examples of tasks in Natural Language Processing
• automatic summarization
• information extraction
• language identification
• machine translation
• named entity recognition
• speech recognition
• text classification
• word sense disambiguation
sentence structure
Syntax
meaning of words or sentences
Semantics
system of rules for generating sentences in a language
Formal Grammar
a formal grammar in which text is abstracted away from its meaning to represent the structure of a sentence
Context-Free Grammar
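A minimal sketch of parsing with a context-free grammar, assuming the nltk library is available; the grammar rules and example sentence are illustrative, not taken from a particular source.

```python
import nltk

# Illustrative context-free grammar: rules for generating/parsing simple sentences
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> D N | N
VP -> V | V NP
D -> "the" | "a"
N -> "she" | "city" | "car"
V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)
sentence = "she saw the city".split()

# Print every structure (parse tree) the grammar assigns to the sentence
for tree in parser.parse(sentence):
    tree.pretty_print()
```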
a contiguous sequence of n items from a sample of text
n-gram
a contiguous sequence of n characters from a sample of text
character n-gram
a contiguous sequence of n words from a sample of text
word n-gram
a contiguous sequence of 1 item from a sample of text
unigram
a contiguous sequence of 2 items from a sample of text
bigram
a contiguous sequence of 3 items from a sample of text
trigram
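A minimal pure-Python sketch of extracting word n-grams; the helper name word_ngrams and the sample text are illustrative.

```python
def word_ngrams(text, n):
    """Return every contiguous sequence of n words from the text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "the quick brown fox jumps"
print(word_ngrams(sample, 1))  # unigrams: ('the',), ('quick',), ...
print(word_ngrams(sample, 2))  # bigrams: ('the', 'quick'), ('quick', 'brown'), ...
print(word_ngrams(sample, 3))  # trigrams: ('the', 'quick', 'brown'), ...
```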
task of splitting a sequence of characters into pieces (tokens)
Tokenization
the task of splitting a sequence of characters into words
word tokenization
the task of splitting a sequence of characters into sentences
sentence tokenization
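A rough sketch of word and sentence tokenization using only the standard library; real tokenizers handle many more edge cases (abbreviations, quotes, hyphens).

```python
import re

text = "NLP is fun. Tokenization splits text into pieces!"

# Sentence tokenization: split after sentence-ending punctuation (very naive)
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)  # ['NLP is fun.', 'Tokenization splits text into pieces!']

# Word tokenization: keep runs of word characters, dropping punctuation
words = re.findall(r"\w+", text)
print(words)      # ['NLP', 'is', 'fun', 'Tokenization', 'splits', ...]
```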
a model in which each item's probability depends on the items before it; used to generate text by predicting each next word from the preceding words
Markov Models
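A minimal sketch of text generation with a first-order (bigram) Markov model; the training text and starting word are illustrative.

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat ate the fish".split()

# Record, for each word, which words were observed immediately after it
transitions = defaultdict(list)
for current, following in zip(corpus, corpus[1:]):
    transitions[current].append(following)

# Generate text: repeatedly sample the next word given only the current word
word = "the"
output = [word]
for _ in range(8):
    followers = transitions.get(word)
    if not followers:
        break
    word = random.choice(followers)
    output.append(word)

print(" ".join(output))
```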
a model that represents text as an unordered collection of words.
Bag-of-words Model
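A bag-of-words sketch: the representation is just word counts, with word order discarded.

```python
from collections import Counter

document = "the movie was great and the acting was great"
bag = Counter(document.split())
print(bag)  # Counter({'the': 2, 'was': 2, 'great': 2, 'movie': 1, 'and': 1, 'acting': 1})
```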
adding a value α to each value in our distribution to smooth the data
additive smoothing
adding 1 to each value in our distribution, pretending that all values have been observed at least once
Laplace Smoothing
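A small sketch of additive smoothing, with Laplace smoothing as the α = 1 special case; the counts are illustrative.

```python
def smoothed_probabilities(counts, alpha):
    """Turn raw counts into probabilities after adding alpha to every count."""
    total = sum(counts.values()) + alpha * len(counts)
    return {word: (count + alpha) / total for word, count in counts.items()}

counts = {"great": 3, "terrible": 1, "boring": 0}    # "boring" was never observed
print(smoothed_probabilities(counts, alpha=1))       # Laplace smoothing
print(smoothed_probabilities(counts, alpha=0.5))     # additive smoothing with a smaller alpha
```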
task of finding relevant documents in response to a user query.
Information retrieval
models for discovering the topics of a set of documents
topic modeling
counting how many times a term appears in a document.
term frequency
words that have little meaning on their own, but are used to grammatically connect other words
function words
am, by, do, is, which, with, yet, …
function words
words that carry meaning independently
content words
algorithm, category, computer, …
content words
measure of how common or rare a word is across documents in a corpus
Inverse Document Frequency
ranking of what words are important in a document by multiplying term frequency (TF) by inverse document frequency (IDF)
tf-idf
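A from-scratch sketch of tf-idf for a tiny corpus, using idf = log(N / number of documents containing the term); the documents are illustrative.

```python
import math

documents = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dogs like the park".split(),
]

def tf_idf(term, document, documents):
    tf = document.count(term)                                  # term frequency
    containing = sum(1 for doc in documents if term in doc)    # document frequency
    idf = math.log(len(documents) / containing) if containing else 0.0
    return tf * idf

print(tf_idf("cat", documents[0], documents))  # appears in 2 of 3 documents: positive score
print(tf_idf("the", documents[0], documents))  # appears in every document: idf = 0, score 0
```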
task of extracting knowledge from documents
Information Extraction
each word is represented with a vector that has as many values as there are words in the vocabulary, with a 1 in that word's position and 0s everywhere else
One-Hot Representation
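A minimal one-hot sketch over a toy vocabulary.

```python
vocabulary = ["he", "wrote", "a", "book"]

def one_hot(word, vocabulary):
    """Vector with one value per vocabulary word: 1 at this word's position, 0 elsewhere."""
    return [1 if vocab_word == word else 0 for vocab_word in vocabulary]

print(one_hot("wrote", vocabulary))  # [0, 1, 0, 0]
print(one_hot("book", vocabulary))   # [0, 0, 0, 1]
```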
meaning is distributed across multiple values in a vector.
Distributed Representation
algorithm for generating distributed representations of words
word2vec
a neural network architecture for predicting context given a target word
Skip-Gram Architecture
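A sketch of training word2vec with the skip-gram architecture, assuming gensim 4.x is installed; the corpus and hyperparameters are illustrative.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "dog", "chased", "the", "cat"],
]

# sg=1 selects skip-gram: the network learns to predict context words given
# a target word, and its learned weights become distributed word vectors
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)

print(model.wv["cat"])               # the distributed representation of "cat"
print(model.wv.most_similar("cat"))  # words whose vectors are most similar
```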