Week 3 - Text Mining I Flashcards

1
Q

What is text quantification?

A

Text quantification is a method for analysing large amounts of text and converting them into structured formats for numerical analysis.

2
Q

Why is text quantification necessary?

A

Machines and modelling approaches cannot process raw text directly, unlike data that is already numerical (e.g., crime statistics or trading data). Text therefore has to be converted into numbers first.

3
Q

How can you describe text quantitatively?

A

You can describe text quantitatively by measuring the ideology or sentiment (e.g., determining if it’s positive, negative or neutral).

Another way to describe text quantitatively is by identifying important topics and measuring similarities between texts.

4
Q

How does quantitative text analysis work?

A

Quantitative text analysis requires converting text into a numerical representation.

5
Q

How many features (dimensions) of text data are there, and what are they?

A
  1. Meta dimension: number of words and number of sentences in a text
  2. Syntactic dimension (structural components): word frequencies, number of verbs and nouns, the structure of a sentence
  3. Semantic dimension: meaning of words, relationships between words (e.g., related/similar words), sentiment, psycholinguistic features
  4. Text metrics: measuring readability (how easy a text is to understand/read) and lexical diversity (how many different words appear in a text)
6
Q

What does QUANTEDA stand for?

A

QUANTEDA: Quantitative Analysis of Textual Data (an R package)

It provides a comprehensive toolkit for processing and analysing text

7
Q

How many levels of text data are there, and what are some examples of each?

A
  1. character: ‘c’, ‘r’, ‘i’, ‘m’, ‘e’
  2. word/token: any word, e.g., ‘crime’, ‘science’
  3. sentence: ‘a winner is a dreamer who never gives up’
  4. types: distinct or unique words
  5. document: a unit of the text (e.g., website, email, chapter, essays, speech)
  6. corpus: a collection of documents (e.g., websites, emails, tweets, chapters, essays, speeches)
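
A minimal quanteda sketch illustrating these levels (the example sentences and object names below are invented for illustration):

library(quanteda)

doc1 <- "A winner is a dreamer who never gives up. Crime science applies data."
doc2 <- "Crime data can also be text."

nchar(doc1)          # characters
ntoken(tokens(doc1)) # tokens (words)
ntype(tokens(doc1))  # types (unique tokens)
nsentence(doc1)      # sentences
mini_corpus <- corpus(c(d1 = doc1, d2 = doc2)) # two documents form a corpus
summary(mini_corpus)
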
8
Q

How can you count meta features using quanteda?

A

install.packages("quanteda")
library(quanteda)

my_text <- "a winner"

ntoken(my_text) # count the words (tokens)
tokens(my_text) # tokenise the text

9
Q

What’s the difference between ntoken() and ntype()?

A

ntoken(): counts the number of tokens per document in a quanteda object (e.g., a tokens object or dfm)

ntype(): returns a vector of the counts of unique tokens (types) per document
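
A quick illustration of the difference (the example text is made up):

library(quanteda)
toks <- tokens("the cat sat on the mat")
ntoken(toks) # 6 tokens in total
ntype(toks)  # 5 types, because "the" appears twice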

10
Q

What is a TTR?

A

A type-token ratio (TTR) is the total number of UNIQUE words (types) divided by the total number of words (tokens) in a given segment of language.

It is the most well-known measure of lexical diversity.

11
Q

Can you give an example of how to calculate TTR based on this:

another_text <- "As you may have noticed.. shdiskwo."

A

Answer:

another_text_tokens <- tokens(another_text)

ntype(another_text_tokens) / ntoken(another_text_tokens)

12
Q

How can you compute ratios analogous to the TTR for:

  1. characters per word
  2. words per sentence
A

Answer:

  1. nchar(another_text) / ntoken(another_text)
  2. ntoken(another_text) / nsentence(another_text)
13
Q

What is term frequency in terms of text representation?

A

It is the most basic way to represent text by counting the frequency of its tokens (terms).

You can manually create a column for each term and count the term’s frequency. Such a representation of text is known as a bag-of-words.

This can also be applied to sequences of terms (known as n-grams).
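
A hedged sketch of a bag-of-words and an n-gram representation using quanteda (the example text is invented); tokens_ngrams() builds the term sequences and dfm() counts them:

library(quanteda)
toks <- tokens("crime science studies crime data")
dfm(toks)                              # bag-of-words: counts of single terms
bigrams <- tokens_ngrams(toks, n = 2)  # e.g. "crime_science", "science_studies", ...
dfm(bigrams)                           # bag of bigrams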

14
Q

What is DFM? and explain the breakdown of it.

A

A DFM, otherwise known as a Document-Feature Matrix, records the frequency of each token (feature) in each document.

This is a table (matrix) that describes how frequently terms occur in each document. Each row is a document, and each column is a feature (term). Each cell represents the number of appearances of that feature in that document.
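
A minimal sketch of constructing a DFM in quanteda (the document names and texts are invented):

library(quanteda)
texts <- c(doc1 = "crime is down, crime pays less",
           doc2 = "science explains crime")
dfmat <- dfm(tokens(texts, remove_punct = TRUE))
dfmat # rows = documents, columns = features, cells = frequency counts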

15
Q

How can you construct a corpus (I) for use with a DFM?

A
  1. Create a mini corpus when dealing with a collection of documents
  2. Makes it easy to use some functions in Quanteda

library(quanteda)
# install.packages("readtext")
library(readtext)

biden <- readtext("…")
trump <- readtext("…")

16
Q

How can you construct a corpus II from a character vector and display attributes?

A

presidents_corpus <- corpus(c(biden = biden$text, trump = trump$text))
summary(presidents_corpus)

17
Q

What is Zipf’s law?

A

Zipf’s law says, ‘given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table’
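
One way to see this empirically is to rank features by frequency with textstat_frequency() from quanteda.textstats (a sketch, assuming a document-feature matrix called dfmat already exists):

library(quanteda.textstats)
freqs <- textstat_frequency(dfmat) # columns include feature, frequency and rank
head(freqs)
plot(log(freqs$rank), log(freqs$frequency),
     xlab = "log rank", ylab = "log frequency") # roughly a straight line under Zipf's law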

18
Q

What is Zipf’s law in relation to the Brown Corpus?

A

The Brown Corpus is a collection of American English texts compiled from a wide variety of sources (about 1 million words). Its word frequencies follow Zipf’s law closely: a handful of very common words (e.g., ‘the’, ‘of’, ‘and’) account for a large share of all tokens.

From the Brown Corpus we note that relying on raw term frequency may overstate the importance of non-informative terms.

19
Q

What are considered as the important terms in a document?

A

Some terms occur very often in all documents but are not informative, while other terms contribute more meaning.

Ideally, we want to calculate an importance score for each term.

20
Q

How can you calculate what the important terms are in a document?

A

The Term Frequency (TF) metric measures how frequently a term t occurs within a document d. It can be defined as a proportion or as a raw count:

tf(t, d) = count(t, d) / (number of terms in d)   (proportion)

tf(t, d) = count(t, d)   (raw count)
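
In quanteda, both versions can be obtained by weighting a DFM with dfm_weight() (a sketch; dfmat is an assumed, pre-existing dfm):

library(quanteda)
dfm_weight(dfmat, scheme = "count") # raw counts (the default)
dfm_weight(dfmat, scheme = "prop")  # counts divided by the total number of terms in each document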

21
Q

How can you address the issue of filler words when it comes to TF metrics?

A

To address this, we want to give more ‘weight’ to the terms that are
1. locally important (within individual documents)
2. not overly common across the entire collection of documents (globally rare)

The above can be achieved using Term Frequency - Inverse Document Frequency (TF-IDF).

22
Q

What is Term Frequency - Inverse Document Frequency (TF-IDF)?

A

TF-IDF is a widely used statistical method in text mining and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., a corpus).

TF-IDF will be high if a word occurs frequently in a specific document but is rare across the entire corpus.

23
Q

What are the components of TF-IDF?

A

TF - IDF (Term Frequency - Inverse Document Frequency)

TF (Term Frequency) - measures how frequently a term (t) occurs in a document

IDF (Inverse Document Frequency) - measures the importance of a term across all documents

24
Q

Define Document Frequency (DF)

A

DF is the number of documents containing the term t:

df(t) = number of documents in which t occurs

In terms of DF, a term is more informative if it occurs in fewer documents. We would therefore like to

  1. give less weight to terms that appear in more documents
  2. give more weight to terms that appear in fewer documents

And that is why we need the IDF.
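
In quanteda, document frequencies can be inspected with docfreq() (a sketch; dfmat is an assumed dfm):

library(quanteda)
docfreq(dfmat) # number of documents containing each feature, i.e. df(t)
ndoc(dfmat)    # N, the number of documents in the corpus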

25
Q

Define IDF

A

IDF (Inverse Document Frequency) gives less weight to terms that occur in many documents of a given corpus:

idf(t) = N / df(t)
where N is the number of documents in the corpus

26
Q

How can you address the issue of really high idf’s (such as rare terms)?

A

You can address it using the log IDF.

Using a logarithm to scale the values and avoid extremes is useful:

idf(t) = log(N / df(t))

27
Q

Define Term Frequency - Inverse Document Frequency (TF-IDF)

A

TF-IDF combines term frequency and inverse document frequency to define a composite measure for a term in a document:

tf-idf(t, d) = tf(t, d) x idf(t)
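
quanteda implements this weighting via dfm_tfidf() (a sketch; dfmat is an assumed dfm, and the exact weighting defaults are documented in the package):

library(quanteda)
dfm_tfidf(dfmat) # tf-idf weighted dfm; by default the idf part is logged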

28
Q

How can you assign the term (t) a value in document (d)? What are the conditions?

A
  1. tf-idf is highest when t occurs frequently within a small number of documents overall (globally rare but locally common)
  2. tf-idf is lower when t occurs rarely in a document or when it occurs frequently in many documents
  3. tf-idf is lowest when t occurs frequently in most documents
29
Q

What is Lexical Diversity? And what is the metric used for it?

A

Lexical diversity helps us understand the complexity of a text (vocabulary diversity) and gain insight into the use of language (e.g., fraud messages have high lexical diversity).

The most common metric is the TTR (types/tokens).

30
Q

What is the main disadvantage of TTR?

A

TTR is highly sensitive to text length: the longer the text, the lower the chance that the next token will be a new type, so shorter texts tend to have fewer repeated terms and therefore higher TTRs.

31
Q

What are the other lexical diversity metrics?

A
  1. Herdan’s C (log TTR, 1960): log(total types) / log(total tokens)
  2. Guiraud’s Root TTR (1954): total types / sqrt(total tokens)
  3. Simpson’s D (1949): subsample tokens at random from the data, then compute the average
  4. Mean Segmental Type-Token Ratio (MSTTR, Johnson 1944): divide the text into segments and compute the mean TTR of the segments
  5. The Moving-Average Type-Token Ratio (MATTR, Covington and McFall 2008): calculates a TTR within a moving window through the text and uses the mean of the TTRs of all windows
32
Q

How to calculate lexical diversity in R?

A

In R: to calculate the lexical diversity of text(s), use the function textstat_lexdiv() in the quanteda.textstats package (library("quanteda.textstats")).
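
A hedged example (the text is invented; the measure names follow the quanteda.textstats documentation):

library(quanteda)
library(quanteda.textstats)
toks <- tokens("a winner is a dreamer who never gives up")
textstat_lexdiv(toks, measure = "TTR")       # type-token ratio
textstat_lexdiv(toks, measure = c("C", "R")) # Herdan's C and Guiraud's root TTR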

33
Q

Define Readability Metrics

A

Readability metrics are used to describe readability in terms of complexity (i.e., how hard a text is to read).

The most common approach is to use numeric readability metrics to estimate the readability of texts.

34
Q

What is FRE? Define.

A

FRE is shorthand for the Flesch Reading Ease (FRE) score (1948); the readability score uses a combination of word, sentence and syllable counts.

It is measured on a scale of 1 to 100, with 100 being the highest (very easy to read) and 1 the lowest (very complicated to read).

35
Q

What is the standard formula of FRE?

A

FRE = 206.835 - 1.015 x (total words/total sentences) - 84.6 x (total syllables/total words)

36
Q

What is the standard formula of FKGL?

A

FKGL = 0.39 x (total words/total sentences) + 11.8 x (total syllables/total words) - 15.59

The FKGL (Flesch-Kincaid Grade Level) is equivalent to a US grade level of education (roughly 0-18)

37
Q

Define The Coleman Liau Readability Index (CLI, 1975)

A

CLI uses the number of letters and the number of sentences.

The formula is:
CLI = (0.0588 x L) - (0.296 x S) - 15.8

Where L is the average number of letters per 100 words; and S is the average number of sentences per 100 words

The score is an approximate representation of the US grade level.

38
Q

Define the Automated Readability Index (ARI)

A

ARI uses characters per word and words per sentence.

ARI = 4.71 x (characters/words) + 0.5 x (words/sentences) - 21.43
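
All of these readability scores can be computed with textstat_readability() from quanteda.textstats (a sketch; the example text is invented and the measure names are taken from that package's documentation):

library(quanteda.textstats)
some_text <- "Readability metrics estimate how hard a text is to read."
textstat_readability(some_text, measure = c("Flesch", "Flesch.Kincaid", "ARI"))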

39
Q

What are some of the issues that arise when dealing with traditional readability metrics?

A
  1. The metrics were developed decades ago for different contexts (e.g., education research and applied psychology)
  2. Other features of text may indicate greater complexity (syntactic and grammatical structure, and rare words)
  3. Lack of uncertainty estimates (e.g., ‘what does it mean for one text to have an FRE of 70 and another of 75?’)
40
Q

What is Pre-processing?

A

Pre-processing is the step of preparing text for analysis.

Text usually needs to be cleaned or pre-processed before quantitative analysis.

41
Q

Define Pre-processing: text cleaning.

A

Removing unwanted punctuation, special characters, HTML and XML tags, Twitter-specific mark-up, URLs, extra white space, etc.

42
Q

Define Pre-processing: text normalisation.

A

Converting text to lower case (removing capitalisation), handling abbreviations, expanding acronyms, checking spelling, replacing slang words with their meanings, etc.

43
Q

Define Pre-processing: further processing.

A

Further processing includes stemming and lemmatisation to reduce words to their base forms, and removing stop words (e.g., ‘the’, ‘and’).
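
A minimal quanteda pipeline combining cleaning, normalisation and further processing (a sketch; the example text and the particular options chosen are illustrative):

library(quanteda)
raw_text <- "The suspects were RUNNING towards http://example.com !!"
toks <- tokens(raw_text, remove_punct = TRUE, remove_url = TRUE) # cleaning
toks <- tokens_tolower(toks)                                     # normalisation
toks <- tokens_remove(toks, stopwords("en"))                     # remove stop words
toks <- tokens_wordstem(toks)                                    # stemming
toks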

44
Q

Define data handling.

A

Removing duplicates and handling missing values.

45
Q

Define custom pre-processing.

A

Custom pre-processing requires your own specialist knowledge of the data and task.

46
Q

What is the difference between stems and lemmas?

A

Stemming removes suffixes from words (e.g., ‘s’ and ‘es’ endings) to reduce them to a common base (the stem).

Lemmatisation reduces different forms of a word to a common base form (the lemma) that has the same meaning, using knowledge of the word class.
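
A rough illustration of the difference: quanteda provides stemming via tokens_wordstem(), while lemmatisation is mimicked here with a tiny hand-made lookup passed to tokens_replace() (the lookup table and text are invented):

library(quanteda)
toks <- tokens("the mice were running")
tokens_wordstem(toks) # stemming: "running" becomes "run", but "mice" is left unchanged
tokens_replace(toks,
               pattern     = c("mice", "were", "running"),
               replacement = c("mouse", "be", "run")) # lemma-style lookup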

47
Q

What are the approaches a data scientist has when it comes to stopwords?

A

It is common for data scientists not to use a standard stopword list as-is, but to add their own stopwords or create their own stopword list.
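
A hedged sketch of extending the built-in English stopword list with corpus-specific terms (the extra words are arbitrary examples):

library(quanteda)
custom_stopwords <- c(stopwords("en"), "rt", "amp", "via") # add domain-specific filler terms
toks <- tokens("rt via crime is down amp")
tokens_remove(toks, custom_stopwords)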

48
Q

What are the aspects of pre-processing that you should be aware of?

A
  1. Pre-processing decisions may negatively impact your results
  2. There is no one-size-fits-all solution in data science
  3. Experiment, assess the impact of pre-processing, and ensure your decisions are reasonable for your task