Text Documents and Pre-Processing Flashcards

1
Q

What is the document retrieval scenario?

(and the problems surrounding it)

A

Document retrieval is the process of retrieving documents relevant to a query from a corpus.

The problem is teaching the computer how to properly identify relevant documents, as it doesn’t know how to process text the way humans do.

E.g., searching for “cat” might not recognise “cats” or could incorrectly recognise “category.”
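
A minimal sketch of both failure modes (the example documents are made up):

```python
docs = ["My cats are asleep", "Category theory is hard", "The cat sat"]

# Naive substring matching wrongly matches "Category"
print([d for d in docs if "cat" in d.lower()])
# -> all three documents

# Naive whole-word matching misses "cats"
print([d for d in docs if "cat" in d.lower().split()])
# -> ['The cat sat']
```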

2
Q

What is document representation and why is it important?

A

Document representation refers to how documents are encoded so that they are easier for the computer to search. It is important because the computer doesn’t process text the way humans do, so the choice of representation determines how effectively documents can be searched and analysed.

3
Q

What is preprocessing, and what types of preprocessing are there?

A

Preprocessing involves transforming the document content into a clean and consistent format so it can be searched and analysed more easily.
* Document and sentence segmentation
* Word tokenization
* Text Normalisation
* Stemming
* Lemmatisation

4
Q

What is a corpus?

How are they stored?

A

A corpus is a collection of documents.

They can be stored as one document per file, as a single file with special characters as delimiters between documents, or as a single file with one document per line.

Documents in a corpus can be essays, tweets, books, articles, and so on.
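
A minimal sketch, assuming a hypothetical corpus stored with one document per line:

```python
# corpus.txt is a made-up file name; each line holds one document
with open("corpus.txt", encoding="utf-8") as f:
    documents = [line.strip() for line in f]

print(len(documents), "documents loaded")
```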

5
Q

What is sentence segmentation?

A

Sentence segmentation is the process of breaking the contents of a document down into a collection of sentences.

Punctuation and capitalisation can make this challenging (a full stop does not always end a sentence, e.g. in “Dr.”), but we can simply import tokenizers that are quite good at it and let them do the work for us.
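
A minimal sketch using NLTK (the resource name for the sentence tokenizer model may differ across NLTK versions):

```python
import nltk

nltk.download("punkt")  # one-time download of the sentence tokenizer model

text = "It was cold. Very cold! Did we go home? Yes."
print(nltk.sent_tokenize(text))
# ['It was cold.', 'Very cold!', 'Did we go home?', 'Yes.']
```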

6
Q

What is tokenization?

How can you do it in python?

A

Tokenization is the process of segmenting running text into “words.”

Here, “words” can be actual words, items of punctuation, or numerical quantities.

You can use Python’s built-in split function, or use NLTK’s tokenizer, as shown below.
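
A minimal sketch comparing the two (assuming NLTK is installed and its tokenizer model downloaded):

```python
import nltk

nltk.download("punkt")  # model used by word_tokenize

text = "I can't afford 2 cats, can you?"

# Built-in split: fast, but punctuation stays attached to words
print(text.split())
# ['I', "can't", 'afford', '2', 'cats,', 'can', 'you?']

# NLTK's tokenizer separates punctuation and contractions
print(nltk.word_tokenize(text))
# ['I', 'ca', "n't", 'afford', '2', 'cats', ',', 'can', 'you', '?']
```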

7
Q

What is Zipf’s Law?

A

Zipf’s law describes the statistical distribution of word frequencies in a corpus.

It suggests that
* roughly half of all words in a corpus occur only once.
* a word’s frequency is inversely proportional to its rank, so rank × frequency is roughly constant:
* if the most common word occurs 100 times, the second most common word will occur approximately 50 times,
* the third most common word will occur approximately 33 times, etc.

Words that occur only once are called hapax legomena (see the sketch below).
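
A minimal sketch of checking this on a tokenized corpus (corpus.txt is a made-up file name):

```python
from collections import Counter

tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(tokens)

# Under Zipf's law, rank * frequency stays roughly constant
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq, rank * freq)

# Hapax legomena: words that occur exactly once
hapaxes = [w for w, c in counts.items() if c == 1]
print(f"{len(hapaxes) / len(counts):.0%} of unique words occur only once")
```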

8
Q

What is case normalisation?

and why is it used?

A

Case normalisation is the process of ignoring case by converting all text to lower case, since case is often irrelevant in document analysis.

It increases word frequencies, because variants like “The” and “the” are now counted together, at the cost of sometimes increasing ambiguity (e.g. “Apple” the company becomes indistinguishable from “apple” the fruit).
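
A minimal sketch of the frequency effect:

```python
from collections import Counter

text = "The cat saw the dog. The dog ran."

counts = Counter(text.split())
print(counts["The"], counts["the"])  # 2 1 -- the same word, split by case

normalised = Counter(text.lower().split())
print(normalised["the"])  # 3 -- the counts are merged
```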

9
Q

What is number normalisation?

and why is it used?

A

Number normalisation is the process of replacing any and all numbers with a standardised string like “NUM”.

It’s done because the exact number is often irrelevant, and numbers contribute greatly to the hapax legomena, since most distinct numbers occur only once.
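
A minimal sketch using a regular expression (the exact pattern is an assumption; real pipelines vary):

```python
import re

text = "In 2023, 1,204 people paid £9.99 each."

# Replace every run of digits (with optional , or . separators) with NUM
normalised = re.sub(r"\d+(?:[.,]\d+)*", "NUM", text)
print(normalised)
# In NUM, NUM people paid £NUM each.
```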
