NLP _ 01 Flashcards
What is most likely the first step of NLP?
cutting board
Text preprocessing
What is noise removal?
front of fridge
Stripping text of formatting (e.g., HTML tags).
What is Tokenization?
under sink door
breaking text into individual words
What is normalization?
Cleaning text data in any way other than noise removal and tokenization
What is stemming?
It is a blunt axe that chops off word prefixes and suffixes.
What is lemmatization?
coat closet
It is a scalpel that brings words down to their root forms.
What would I import to use regex?
import re
What python package could I use for NLP?
import nltk
what method of nltk would I use to tokenize text?
from nltk.tokenize import word_tokenize
Give an example of a list comprehension :
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]
How would you import WordNetLemmatizer?
from nltk.stem import WordNetLemmatizer
How would you import PorterStemmer?
from nltk.stem import PorterStemmer
By default, lemmatize() treats every word as a…?
Noun
Language models are probabilistic machine models of …?
language used for NLP comprehension tasks
Language models learn a …?
probability of word occurrence over a sequence of words and use it to estimate the relative likelihood of different phrases.
Common language models include:
Statistical models:
- bag of words (unigram model)
- n-gram models
Neural Language Modeling(NLM)
What is text similarity in NLP?
Text similarity is a facet of NLP concerned with the similarity between texts.
What are two popular text similarity metrics?
- Levenshtein distance
- cosine similarity
How would you describe the metric: Levenshtein distance?
It is defined as the minimum number of edit operations (deletions, insertions, or substitutions) required to transform one text into another.
Define the metric: Cosine similarity
It is defined as the cosine of the angle between two vectors. To determine the cosine similarity, text documents need to be converted into vectors.
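A minimal sketch of the idea, using raw word-count vectors built with `collections.Counter` (in practice you would likely use a library such as scikit-learn, and the whitespace tokenization here is an assumption for brevity):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Convert two texts to word-count vectors and return the cosine of the angle between them."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Dot product only needs the words the two vectors share
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b)
```

Identical texts score 1.0; texts with no shared words score 0.0.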
What are common forms of language prediction?
- Auto-suggest and suggested replies
Natural language processing is concerned with…?
enabling computers to interpret, analyze, and approximate the generation of human speech.
What is Parsing w.r.t NLP?
it is the process concerned with segmenting text based on syntax
What is Part-Of-Speech tagging
It identifies parts of speech (verbs, nouns, adjectives, etc.).
It helps computers understand the relationship between the words in a sentence?
A dependency grammar tree
What does a Dependency grammar tree help you understand?
The relationship between the words in a sentence.
What does NER stand for?
Named entity recognition
What does NER help identify?
Proper nouns (e.g., “Natalia” or “Berlin”) in a text. This can be a clue to figure out the topic of the text.
When you have ____ coupled with POS tagging, you can identify specific phrase chunks
Regex parsing
When you couple Regex parsing and POS tagging you can…?
identify specific phrase chunks.
A very common unigram model, a statistical language model, is commonly known as…?
front door
The Bag-Of-Words
Bag-of-words can be an excellent way of looking at language when you want to make predictions concerning…?
the topic or sentiment of a text
When grammar and word order are irrelevant, this is a good model.
What would I import to get word counts for the bag-of-words model?
from collections import Counter
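As a quick sketch, `Counter` turns a token list straight into a bag of words (the token list below is a made-up example):

```python
from collections import Counter

tokens = ["the", "squids", "jumped", "and", "the", "squids", "ran"]
bag_of_words = Counter(tokens)  # maps each word to its count
# e.g. bag_of_words["squids"] is 2; word order is discarded entirely
```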
How would I import a part-of-speech function for lemmatization?
from part_of_speech import get_part_of_speech
For parsing entire phrases or conducting language prediction, you will want a model that…?
pays attention to each word’s neighbors.
Unlike bag-of-words, the n-gram model considers a ….?
….sequence of some number (n) units and calculates the probability of each unit in a body of language given the preceding sequence of length n.
Because of this, n-gram probabilities with larger n values can
be impressive at language prediction.
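The n-gram extraction itself can be sketched with a sliding window (this only counts n-grams; a full n-gram model would also estimate conditional probabilities from these counts):

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of length n across the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(ngrams(tokens, 2))
# ("the", "cat"), ("cat", "sat"), ... each bigram mapped to its count
```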
What tactic can help with adjusting probabilities for unknown words, but is not always ideal?
Language smoothing
What is Language smoothing?
a tactic that can help adjust probabilities for unknown words, but it isn’t always ideal
For a model that more accurately predicts human language patterns, you want n
(your sequence length) …?
…to be as large as possible.
What happens if you make your n-grams too long?
The number of examples to train off of shrinks and you won’t have enough to train on.
What are the common neural language models (NLMs)?
- LSTMs
- Transformer models
What is Topic Modeling?
It is an area of NLP dedicated to uncovering latent, or hidden, topics within a body of language.
A common *technique* is to deprioritize the most common words and prioritize less frequently used terms as topics in a process known as…?
…term frequency-inverse document frequency (tf-idf)
What libraries in Python have modules to handle tf-idf?
gensim and sklearn
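gensim and sklearn provide ready-made tf-idf modules (e.g., sklearn's `TfidfVectorizer`), but the arithmetic is simple enough to sketch by hand. The documents below are made up, and this uses the plain log-idf variant; libraries often apply smoothing:

```python
import math

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.split() for doc in documents]

def tf_idf(term, doc_tokens, corpus):
    """Term frequency in one document, scaled down by how many documents contain the term.

    Assumes the term appears somewhere in the corpus (no smoothing).
    """
    tf = doc_tokens.count(term) / len(doc_tokens)
    doc_freq = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / doc_freq)
    return tf * idf

# "the" appears in two of three documents, so its idf is low;
# "cat" is unique to the first document and scores higher there.
```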
What is LDA or Latent Dirichlet allocation?
LDA is a statistical model that takes your documents and determines which words keep popping up together in the same contexts (i.e., documents).
What is word embedding?
The process of word-to-vector mapping
word-to-vector mapping is also called?
word embedding
If you would like to visualize the topic model results, you could use…?
word2vec:
- It is a great technique that can map out your topic model results spatially as vectors, so that similarly used words are closer together.
How is the Levenshtein distance calculated?
the distance calculated through the minimum number of insertions, deletions, and substitutions that would need to occur for one word to become another.
Define: Levenshtein distance
the minimal edit distance between two words.
What is phonetic similarity?
how much words or phrases sound the same.
Define: Lexical Similarity
window over kitchen sink
the degree to which texts use the same vocabulary and phrases
Define: Semantic similarity
the degree to which documents contain similar meaning or topics
Addressing ________ _________ - including spelling correction - is a major challenge within natural language processing
Text similarity
What is it called when documents/text contain similar meaning or topics?
Semantic similarity
What is it called when documents/texts use the same vocabulary and phrases to a high degree?
Window over kitchen sink
Lexical similarity
How would I import a tool to measure the Levenshtein distance?
from nltk.metrics import edit_distance
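nltk's edit_distance does this for you; the classic dynamic-programming computation behind it can be sketched in a few lines (insertions, deletions, and substitutions each cost 1):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into string b."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (free on a match)
        prev = curr
    return prev[-1]
```

The textbook example: levenshtein("kitten", "sitting") is 3.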
What Python module has a built-in function to check the Levenshtein distance?
nltk
What is the application of NLP concerned with predicting text given preceding text?
Language prediction
What is the first step to language prediction?
It is picking your language model
Bag of words alone is generally …?
not a great model for language prediction.
W.r.t. language prediction, if you go the n-gram route, you will most likely pick what model?
Magnetic knife holder
Markov chains
Define the language model: Markov chains
Magnetic knife holder
the model that predicts the statistical likelihood of each following word (or character) based on the training corpus.
Markov chains are memory-less and make statistical predictions based entirely on the current n-gram at hand.
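A tiny sketch of the memory-less idea, using single-word (bigram) states over a made-up corpus; a real model would use a large training corpus and sample from the counts rather than always taking the most common follower:

```python
from collections import Counter, defaultdict

def build_chain(tokens):
    """For each word, count which words immediately follow it in the corpus."""
    chain = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        chain[current][following] += 1
    return chain

tokens = "i like tea and i like biscuits".split()
chain = build_chain(tokens)
# Prediction uses only the current word, not any earlier history:
prediction = chain["i"].most_common(1)[0][0]
```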
What is a supervised machine learning algorithm that leverages a probabilistic theorem to make predictions and classifications?
Naive Bayes classifiers
Define :
sentiment analysis
determining whether a given block of language expresses negative or positive feelings.
Text preprocessing is a stage of…?
NLP focused on cleaning and preparing text for other NLP tasks
Parsing is an ….?
NLP technique concerned with breaking up text based on syntax
What are two Python libraries that can handle syntax parsing?
NLTK and spaCy
What are common text preprocessing steps?
- Noise removal
- Tokenization
- Normalization (including stemming and lemmatization)
Tokenization will… ?
break multi-word strings into smaller components
Normalization is a ….?
catch-all term for processing text data; this includes stemming and lemmatization
Noise removal is when we…?
remove unnecessary characters and formatting
Stemming is….?
a text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes)
Lemmatization is a ….?
Coat closet
a text preprocessing normalization task concerned with bringing words down to their root forms.
https://www.codecademy.com/learn/paths/data-science-nlp/tracks/dsnlp-text-preprocessing/modules/nlp-text-preprocessing/cheatsheet
Stopword Removal is the process of ….?
removing words from a string that don’t provide any information about the tone of a statement.
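A minimal sketch of the filtering step, using a tiny hand-picked stopword set; in practice you would load a full list, e.g. `nltk.corpus.stopwords.words('english')`:

```python
# A tiny hand-picked stopword set (assumption for this sketch);
# nltk ships a much more complete English list.
stop_words = {"a", "an", "the", "is", "are", "of", "to", "and"}

tokens = "the quick brown fox is jumping over the lazy dog".split()
filtered = [t for t in tokens if t not in stop_words]
# stopwords are dropped; content words survive
```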
Using part-of-speech can …?
improve the results of lemmatization
What are two common Python libraries used in text preprocessing?
NLTK and re
\_\_\_\_\_\_\_\_\_ is a technique that developers use in a variety of domains
Text cleaning
When you are text cleaning you may want to remove unwanted info such as:
1. ` ______??________`
2. Special Characters
3. Numeric digits
4. Leading, ending, and vertical whitespace
5. HTML formatting
Punctuation and accents
When you are text cleaning you may want to remove unwanted info such as:
1. Punctuation and accents
2. ` ______??________`
3. Numeric digits
4. Leading, ending, and vertical whitespace
5. HTML formatting
Special Characters
When you are text cleaning you may want to remove unwanted info such as:
1. Punctuation and accents
2. Special Characters
3. —??———
4. Leading, ending, and vertical whitespace
5. HTML formatting
Numeric digits
When you are text cleaning you may want to remove unwanted info such as:
1. Punctuation and accents
2. Special Characters
3. Numeric digits
4.—-??——
5. HTML formatting
Leading, ending, and vertical whitespace
When you are text cleaning you may want to remove unwanted info such as:
1. Punctuation and accents
2. Special Characters
3. Numeric digits
4. Leading, ending, and vertical whitespace
5.—???—
HTML formatting
The type of noise you need to remove from text usually depends on the ….?
source
e.g., a marketing journal vs. a medical journal
You can use the \_\_\_\_\_ method in Python’s regular expression library for most of your noise removal needs.
.sub()
The .sub() method has three required arguments:
- —?—
- replacement_text – text that replaces all matches in the input string
- input – the input string that will be edited by the .sub() method
Top of Fridge
pattern – a regular expression that is searched for in the input string. There must be an r preceding the string to indicate it is a raw string, which treats backslashes as literal characters.
The .sub() method has three required arguments:
- pattern – a regular expression that is searched for in the input string. There must be an r preceding the string to indicate it is a raw string, which treats backslashes as literal characters.
- —?—
- input – the input string that will be edited by the .sub() method
Top of fridge - ingredients
replacement_text – text that replaces all matches in the input string
The .sub() method has three required arguments:
- pattern – a regular expression that is searched for in the input string. There must be an r preceding the string to indicate it is a raw string, which treats backslashes as literal characters.
- replacement_text – text that replaces all matches in the input string
- —?—
top of fridge, ingredients
input – the input string that will be edited by the .sub() method
The .sub() method returns a…?
a string with all instances of the pattern replaced by the replacement_text.
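For instance, with a made-up input string, passing the three arguments in order (pattern, replacement_text, input):

```python
import re

text = "Customer ID: 555-0192"
# pattern r"\d" matches each digit; every match is replaced with "#"
cleaned = re.sub(r"\d", "#", text)
# cleaned is "Customer ID: ###-####"
```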
How could you remove the HTML tag <p> from a string?

```python
import re

text = "<p>This is a paragraph</p>"
result = re.sub(r'<.?p>', '', text)
print(result)  # This is a paragraph
```
What is it common practice to replace HTML tags with?
an empty string ''