Lecture 1 Flashcards
What are some applications of NLP?
Question answering, information extraction, sentiment analysis, machine translation, and language technology in general
What are hot topics in NLP that are still rather hard to solve?
Question answering, paraphrasing and summarization
What are some (unresolved) issues in NLP?
- ambiguity within sentences or questions
- non-standard English, e.g. in tweets
- idioms, e.g. "get cold feet"
- neologisms
- tricky entity names
Data mining vs. Text mining
Data mining is a process used to find and extract patterns within a large set of data. This process is often done as a first step of the project to prepare the data for further analysis.
Data mining is all about finding the connections between different data points.
Text mining is one of the automated techniques used in natural language processing that converts unstructured text to structured data that a computer can process and understand. By converting text to information, we can apply further analysis to the data to extract useful information.
What is the difference between a bag-of-words and a string of words?
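A bag of words is an unordered collection of the words used in a text, usually stored with a count per word. Word order and grammar are discarded.
A string of words is the running text itself, e.g. a sentence: the words keep their order and may repeat.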
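A minimal sketch of the contrast, assuming a bag of words is stored as per-word counts (a common convention) and using a naive whitespace tokenizer. The function name `bag_of_words` is hypothetical, chosen only for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count each word in the text, ignoring word order (naive whitespace split)."""
    tokens = text.lower().split()
    return Counter(tokens)


first = "the dog bit the man"
second = "the man bit the dog"

# As strings the two sentences differ, because order matters.
print(first == second)  # False

# As bags of words they are identical, because order is discarded.
print(bag_of_words(first) == bag_of_words(second))  # True
```

The two sentences mean different things, which is exactly the information a bag of words throws away.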
What are regular expressions?
A formal language for specifying text search strings
- It requires a pattern that we want to search for, and a corpus of texts to search through
- A regular expression search function will search through the corpus returning all texts that contain the pattern.
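The pattern-plus-corpus idea can be sketched with Python's standard `re` module. The corpus and the helper name `search_corpus` are made up for illustration.

```python
import re


def search_corpus(pattern, corpus):
    """Return every text in the corpus that contains a match for the pattern."""
    regex = re.compile(pattern)
    return [text for text in corpus if regex.search(text)]


corpus = ["The woodchuck slept.", "Two woodchucks met.", "No rodents here."]

# [Ww] matches either capitalization; s? makes the plural "s" optional.
print(search_corpus(r"[Ww]oodchucks?", corpus))
# ['The woodchuck slept.', 'Two woodchucks met.']
```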
What is text normalization?
The task of putting words/tokens in a standard format.
Text normalization is the process of transforming text into a single canonical form. Normalizing text before processing it guarantees that the input is consistent before any operations are performed on it.
When we normalize a natural language resource, we attempt to reduce the randomness in it, bringing it closer to a predefined “standard”.
What are common steps in text normalization?
- Segmenting/tokenizing words from running text
- Normalizing word formats
- Segmenting sentences in running text
What is the difference between stemming and lemmatization?
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
am, are, is –> be
car, cars, car’s, cars’ –> car
The result of this mapping of text will be something like:
the boy’s cars are different colors –> the boy car be differ color
However, the two differ in how they get there. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
Lemmatization is the task of determining that two words have the same root despite their surface differences. It usually refers to doing things properly, using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
What is case folding?
Reducing all letters to lower case.
It is used in applications like speech recognition and information retrieval.
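A minimal sketch of case folding over a list of tokens, using Python's built-in `str.lower()`. The helper name `fold_case` is hypothetical.

```python
def fold_case(tokens):
    """Reduce every token to lower case."""
    return [token.lower() for token in tokens]


print(fold_case(["The", "US", "Economy"]))
# ['the', 'us', 'economy']
```

Note that "US" and "us" become indistinguishable after folding, so whether to fold case depends on the application.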
What is the most common English stemming algorithm?
Porter's algorithm, which applies a series of rewrite rules in steps. Step 1a:
- sses → ss (caresses → caress)
- ies → i (ponies → poni)
- ss → ss (caress → caress)
- s → ø (cats → cat)
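The Step 1a rules listed above can be sketched as follows. This covers only that one step, not the full Porter stemmer, and the function name `porter_step_1a` is made up for illustration. The rules are tried in order, longest suffix first, and only the first match is applied.

```python
def porter_step_1a(word):
    """Apply only the Step 1a suffix rules listed above (not the full stemmer)."""
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss
    if word.endswith("ies"):
        return word[:-2]  # ies -> i
    if word.endswith("ss"):
        return word       # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]  # s -> (nothing)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The "ss → ss" rule looks like a no-op, but it stops the final "s → ø" rule from turning "caress" into "cares".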
What is sentence segmentation and what is a common problem with it?
The process of dividing written text into meaningful units, i.e. sentences.
"!" and "?" are relatively unambiguous sentence boundaries, but "." is very ambiguous: it can end a sentence or be part of an abbreviation, e.g. "Dr. Claas".
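A rough sketch of how the period ambiguity might be handled with a hand-written abbreviation list. The list `ABBREVIATIONS` and the function `split_sentences` are illustrative assumptions, and real segmenters use far richer rules or a trained classifier.

```python
# Illustrative, far from exhaustive; a real system needs a much larger list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e."}


def split_sentences(text):
    """Split on tokens ending in . ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_terminator = token[-1] in ".!?"
        if ends_with_terminator and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks a terminator
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Claas arrived. Was he late? No!"))
# ['Dr. Claas arrived.', 'Was he late?', 'No!']
```

This heuristic still fails when an abbreviation really does end a sentence, which is one reason the problem is often treated as a classification task.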
What are a few kinds of classifiers?
- Logistic regression
- neural networks
- SVMs
How can you determine how similar two text entities are?
Minimum edit distance algorithm
What is the minimum edit distance and what are its operations?
The minimum number of editing operations needed to transform one string into the other. The operations are:
- Insertion
- Deletion
- Substitution
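A minimal dynamic-programming sketch of minimum edit distance. It assumes insertion and deletion each cost 1. The substitution cost is left as a parameter (`sub_cost`, a name chosen here), because some formulations charge 2 for a substitution.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Return the minimum total cost of edits that turn source into target."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                            # deletion
                dist[i][j - 1] + 1,                            # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or match
            )
    return dist[n][m]


print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```

A smaller distance means the two strings are more similar, and a distance of 0 means they are identical.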