NLP-3 Flashcards
define text normalisation
Text Normalisation is a process that helps clean up textual data, reducing it to a form whose complexity is lower than that of the original data. It comprises five steps:
(a) Sentence Segmentation
(b) Tokenisation
(c) Removal of stop words, special characters and numbers
(d) Converting Text to same case
(e) Stemming/Lemmatisation
define corpus
A collection of textual data forms a corpus.
In text normalisation, we undergo several steps to normalise the text to a lower level. That is, we work on text from multiple documents, and the term used for the whole textual data from all the documents taken together is the corpus.
what is sentence segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is taken as
a different data so now the whole corpus gets reduced to sentences.
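A minimal sketch of this step in Python, splitting on sentence-ending punctuation (a toy segmenter; the regex and function name are illustrative, and real tools also handle abbreviations like "Dr."):

```python
import re

def segment_sentences(corpus):
    """Split a corpus into sentences on ., ! or ? followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', corpus.strip())
    return [s for s in parts if s]

corpus = "Raj likes to play cricket. He also enjoys chess! Does he study?"
print(segment_sentences(corpus))
# ['Raj likes to play cricket.', 'He also enjoys chess!', 'Does he study?']
```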
define token
Token is the term used for any word, number or special character occurring in a sentence.
what happens in tokenisation
After segmenting the sentences, each sentence is then further divided into tokens. Under tokenisation, every
word, number and special character is considered separately and each of them is now a separate
token.
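A small sketch of tokenisation using a regular expression (an illustrative pattern, not the notes' own method), which keeps words, numbers and special characters as separate tokens:

```python
import re

def tokenise(sentence):
    """Split a sentence into word, number and special-character tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenise("Raj scored 98 marks!"))
# ['Raj', 'scored', '98', 'marks', '!']
```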
define stopwords
Stopwords are the words which occur very frequently in the corpus but do not add any value to it.
why are grammatical words considered stopwords
Humans use grammar to make their sentences meaningful for the other person to understand. But
grammatical words do not add any essence to the information which is to be transmitted through the
statement hence they come under stopwords.
what happens in stopword removal
Stopwords like 'a', 'an', 'the', 'and', 'are', etc. occur the most in any given corpus but say very little or nothing about its context or
meaning. Hence, to make it easier for the computer to focus on meaningful terms, these words
are removed.
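Stopword removal can be sketched as a simple filter (the stopword list below is a tiny illustrative sample; real lists contain a few hundred words):

```python
STOPWORDS = {"a", "an", "the", "and", "are", "is", "to"}  # tiny sample list

def remove_stopwords(tokens):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "sun", "is", "a", "star"]))
# ['sun', 'star']
```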
what else is removed along with stop words
Along with these words, a lot of times our corpus might have special characters and/or numbers. Now
it depends on the type of corpus that we are working on whether we should keep them in it or not.
For example, if you are working on a document containing email IDs, then you might not want to
remove the special characters and numbers whereas in some other textual data if these characters do
not make sense, then you can remove them along with the stopwords.
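Whether to keep numbers can be made a switch, as in this sketch (the function and flag names are illustrative):

```python
import re

def remove_special(tokens, keep_numbers=False):
    """Drop special-character tokens; keep numbers only when asked to
    (e.g. for a corpus containing email IDs or scores)."""
    kept = []
    for t in tokens:
        if re.fullmatch(r"[A-Za-z]+", t):
            kept.append(t)                # plain words are always kept
        elif keep_numbers and t.isdigit():
            kept.append(t)                # numbers kept only on request
    return kept

print(remove_special(["score", ":", "98", "%"]))        # ['score']
print(remove_special(["score", ":", "98", "%"], True))  # ['score', '98']
```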
what is done in case conversion of corpus
After the stopwords removal, we convert the whole text into a similar case, preferably lower case.
This ensures that the case-sensitivity of the machine does not consider same words as different just
because of different cases.
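The effect is easy to see in a tiny illustrative snippet:

```python
# Without case conversion the machine would treat these as three
# different words; lower-casing collapses them into one.
tokens = ["Apple", "apple", "APPLE"]
lowered = [t.lower() for t in tokens]
print(lowered)            # ['apple', 'apple', 'apple']
print(len(set(lowered)))  # 1 distinct word instead of 3
```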
explain the process of stemming
Stemming is a process by which affixes are removed and a word is converted to its root/base form. The stemmed word may or may not be meaningful; stemming does not check whether the result is a valid word. It simply strips the
affixes, which makes it faster.
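A toy stemmer that just strips common suffixes (real stemmers such as Porter's use many more rules; this sketch only illustrates why the result may not be a meaningful word):

```python
def stem(word):
    """Crudely strip a common suffix without checking the result is a word."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("studies"))   # 'studi'  -- not a meaningful word
print(stem("studying"))  # 'study'
```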
explain the process of lemmatisation
Lemmatisation is a process by which affixes are removed and a word is converted to its base/root form. The word after affix removal (the lemma) is always meaningful. Because lemmatisation makes sure the lemma is a word with meaning, it takes longer to
execute than stemming.
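Lemmatisation can be sketched as a dictionary lookup (the tiny table below is hand-made for illustration; real lemmatisers use full dictionaries and part-of-speech information, which is why they are slower):

```python
# A toy lookup table mapping word forms to their meaningful lemma.
LEMMA_TABLE = {"studies": "study", "studying": "study", "went": "go"}

def lemmatise(word):
    """Return the lemma if known, otherwise the word itself."""
    return LEMMA_TABLE.get(word, word)

print(lemmatise("studies"))  # 'study'  -- always a meaningful word
```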
difference between lemmatisation and stemming
stemming
- the root word/stemmed word may or may not be meaningful
- takes less time to execute than lemmatisation
- studies - es = studi
- studying - ing = study
lemmatisation
- the root word/lemma is always meaningful
- takes slightly longer to execute than stemming
- studies - es = study
- studying - ing = study
Does the vocabulary of the corpus remain the same before and after text normalization? Give reasons.
No, it doesn’t. The process of text normalization reduces the corpus to the minimum vocabulary possible, as the machine doesn’t require grammatically correct sentences, only the essence of the corpus, to function.
In text normalization, stop words, special characters and numbers are removed.
In the processes of stemming and lemmatization, the affixes of words are removed and the word is converted to its base form.
Thus, the vocabulary after text normalization is decreased.
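The shrinking vocabulary can be demonstrated end to end with a small sketch chaining the steps above (the stopword list and the suffix rule are illustrative simplifications):

```python
import re

corpus = "The students are studying. The teacher is teaching!"
stopwords = {"the", "is", "are"}

tokens = re.findall(r"[a-z]+", corpus.lower())        # tokenise + lower-case
filtered = [t for t in tokens if t not in stopwords]  # drop stopwords
stems = [t[:-3] if t.endswith("ing") else t           # crude stemming
         for t in filtered]

print(len(set(tokens)), "words before ->", len(set(stems)), "after")
# 7 words before -> 4 after
```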