NLP Flashcards
Corpus
A collection of text or speech data. Used for training and evaluating NLP models.
Word Sense Disambiguation
The process of identifying which meaning of a word is intended in context, when the word has multiple possible meanings.
BLEU
Bilingual Evaluation Understudy.
Metric used to measure the quality of a machine translation by its n-gram overlap with reference translations.
ROUGE
Recall-Oriented Understudy for Gisting Evaluation.
Metric used to measure the quality of machine summaries by their overlap with reference summaries.
Hidden Markov Model (HMM)
Statistical model in which a sequence of hidden states emits the observed data. Used for part-of-speech tagging and speech recognition.
Part-of-Speech tagging
Labelling each word in a sentence with its grammatical category (noun, verb, adverb, etc.).
Transfer Learning
Reusing a model trained on one task as the starting point for a different, usually related, task.
N-Gram
A sequence of N contiguous items (e.g. words or characters) from a text or speech sample.
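A minimal sketch of word-level N-gram extraction, assuming whitespace tokenisation. The function name `ngrams` and the example sentence are illustrative only.

```python
def ngrams(tokens, n):
    """Return every window of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative input, split on whitespace.
tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
```

With N = 2 the windows are called bigrams, with N = 3 trigrams. A token list shorter than N yields no N-grams.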
Lemmatization
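Reducing words to their dictionary base form (lemma) using vocabulary and morphological analysis, e.g. "mice" to "mouse". Similar to Stemming, but the result is always a valid word.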
Stemming
Reducing words to their stem by stripping affixes with heuristic rules, e.g. "running" to "run". Similar to Lemmatization, but the result need not be a valid word.
Named Entity Recognition
Locating named entities in a text and classifying them into predefined categories such as person, organisation and location.
Co-reference Resolution
Identifying which words in a text refer to the same entity.
Stop Word
A commonly used word (e.g. "the", "is", "of") which contributes little to a text's content and is often filtered out before processing.
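A minimal sketch of stop word filtering. The `STOP_WORDS` set here is a tiny hand-picked assumption for illustration, not a standard list.

```python
# Illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "on", "of", "and"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


filtered = remove_stop_words("The cat is on the mat".split())
print(filtered)
```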
Word Embeddings
Representation of words as dense real-valued vectors. Words with similar meanings are mapped to nearby points in the vector space.
Word2Vec
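Word vectorisation method. Learns embeddings with a shallow neural network that predicts a word from its context (CBOW) or the context from a word (skip-gram).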
GloVe
Word vectorisation method. Learns embeddings from global word co-occurrence statistics across the corpus.
BERT
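Bidirectional Encoder Representations from Transformers.
Pre-trained Transformer language model. Produces contextual embeddings, so the same word gets a different vector depending on the sentence it appears in.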
Bag-of-Words
Method for representing a text as the collection of its words and their counts, without regard to order or grammar.
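A minimal sketch of a Bag-of-Words representation using word counts, assuming lowercasing and whitespace tokenisation. The function name `bag_of_words` is illustrative.

```python
from collections import Counter


def bag_of_words(text):
    """Map each word to how often it occurs; word order is discarded."""
    return Counter(text.lower().split())


bag_a = bag_of_words("the cat saw the dog")
bag_b = bag_of_words("the dog saw the cat")
# Both sentences contain the same words, so the bags are identical.
print(bag_a == bag_b)
```

The example shows what the representation loses: the two sentences mean different things but produce the same bag.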
TF-IDF
Term Frequency - Inverse Document Frequency
Metric measuring how important a word is to a document in a corpus, relative to its frequency in the rest of the corpus.
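A minimal sketch of one common TF-IDF variant, assuming tf = (term count) / (document length) and idf = ln(number of documents / number of documents containing the term). Other weightings and smoothing schemes exist; the toy corpus below is made up for illustration.

```python
import math


def tf_idf(term, doc, corpus):
    """Score a term in one tokenised document against a tokenised corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf


# Illustrative corpus of three tokenised documents.
corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
print(tf_idf("the", corpus[0], corpus))  # appears in every document
print(tf_idf("dog", corpus[1], corpus))  # appears in only one document
```

A word that occurs in every document scores zero under this variant, while a word concentrated in one document scores highest.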
Latent Semantic Analysis
Analyses relationships between words and documents in a corpus to discover semantic structures, typically by applying singular value decomposition to a term-document matrix.
Latent Dirichlet Allocation
Generative probabilistic model identifying topics in a document corpus.
Generative Probabilistic Model
Perplexity
Metric measuring how well a probabilistic model predicts a sample. Lower perplexity indicates better performance.
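A minimal sketch of perplexity, assuming the usual definition as the exponential of the average negative log-probability the model assigns to each token. The probability lists are made-up inputs, not output from a real model.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability over the tokens."""
    n = len(token_probs)
    avg_neg_log = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log)


# A model that gives every token probability 1/4 is as uncertain
# as a uniform choice between four options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```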
Componential Semantics
Words represented by sets of semantic components which together describe the meaning of the word.
Distributional Semantics
Describing words according to the contexts they appear in.
Thematic Distance
Metric measuring the similarity of words based on the angle between their vectors.
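A minimal sketch of an angle-based similarity, assuming it is computed as the cosine of the angle between two word vectors. The vectors in the example are arbitrary illustrations, not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 2], [2, 4]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))  # perpendicular
```

Vectors pointing the same way score close to 1 regardless of their length, and perpendicular vectors score 0.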
Saltonian vector
Binary (one-hot) vector representation of a word. Zero everywhere except at the index of the word in the corpus's complete word list.
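A minimal sketch of building such a vector, assuming the word list is a Python list with one entry per distinct word. The three-word vocabulary is illustrative.

```python
def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec


# Illustrative word list; a real one has one entry per word in the corpus.
vocabulary = ["cat", "dog", "mat"]
print(one_hot("dog", vocabulary))
```

The vector is as long as the word list, which is why these representations become very large for realistic corpora.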
Vector size reduction
Word error rate
A metric measuring the word-level edit distance between a generated text and a reference text, relative to the length of the reference.
(S + D + I) / N
S: Substitutions
D: Deletions
I: Insertions
N: Number of words in reference text
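A minimal sketch of word error rate, assuming whitespace tokenisation and a non-empty reference. It computes the minimum total of substitutions, deletions and insertions with a standard edit-distance table, then divides by N. The function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(S + D + I) / N using word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("the cat sat", "the dog sat"))
```

Because insertions count as errors but N only counts reference words, the rate can exceed 1 when the generated text is much longer than the reference.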
Connectionist Temporal Classification
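A method for end-to-end Automatic Speech Recognition. Trains a model to map audio frames to a label sequence without frame-level alignments, using a blank symbol and summing over all possible alignments.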
Attention-based Encoder-Decoder Models
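A method for end-to-end Automatic Speech Recognition. An encoder turns the audio into hidden representations, and a decoder generates the output one token at a time while attending to them.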
Transducer Models (RNN-T)
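A method for end-to-end Automatic Speech Recognition. Extends Connectionist Temporal Classification with a prediction network over previous output labels, combined with the encoder through a joint network.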