Week 3 - Text Mining I Flashcards

1
Q

What is text quantification?

A

Text quantification is a method for analysing large amounts of text and converting them into structured formats for numerical analysis.

2
Q

Why is text quantification necessary?

A

Machines and modelling approaches cannot process raw text directly, unlike data that is already numerical (e.g., crime statistics or trading data). Text therefore has to be converted into numbers first.

3
Q

How can you describe text quantitatively?

A

You can describe text quantitatively by measuring the ideology or sentiment (e.g., determining if it’s positive, negative or neutral).

Another way to describe text quantitatively is by identifying important topics and measuring similarities between texts.

4
Q

How does quantitative text analysis work?

A

Quantitative text analysis requires converting text into a numerical representation.

5
Q

How many features (dimensions) of text data are there, and what are they?

A
  1. Meta dimension: number of words and number of sentences in a text
  2. Syntactic dimension (structural components): word frequencies, number of verbs and nouns, the structure of a sentence
  3. Semantic dimension: meaning of words, relationships between words (e.g., related/similar words), sentiment, psycholinguistic features
  4. Text metrics: measuring readability (how easy a text is to understand/read) and lexical diversity (how many different words appear in a text)
6
Q

What does QUANTEDA stand for?

A

QUANTEDA: Quantitative Analysis of Textual Data (an R package)

It provides a comprehensive toolkit for processing and analysing text

7
Q

How many levels of text data are there, and what are some examples of each?

A
  1. character: ‘c’, ‘r’, ‘i’, ‘m’, ‘e’
  2. word/token: any word, e.g., ‘crime’, ‘science’
  3. sentence: ‘a winner is a dreamer who never gives up’
  4. types: distinct or unique words
  5. document: a unit of the text (e.g., website, email, chapter, essays, speech)
  6. corpus: a collection of documents (e.g., websites, emails, tweets, chapters, essays, speeches)
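
A minimal quanteda sketch illustrating these levels (the example sentences and object names below are invented for illustration):

library(quanteda)

doc1 <- "A winner is a dreamer who never gives up. Crime science applies data."
doc2 <- "Crime data can also be text."

nchar(doc1)          # characters
ntoken(tokens(doc1)) # tokens (words)
ntype(tokens(doc1))  # types (unique tokens)
nsentence(doc1)      # sentences
mini_corpus <- corpus(c(d1 = doc1, d2 = doc2)) # two documents form a corpus
summary(mini_corpus)
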
8
Q

How can you count meta features using quanteda?

A

install.packages("quanteda")
library(quanteda)

my_text <- "a winner"

ntoken(my_text) # count the words (tokens)
tokens(my_text) # tokenise the text

9
Q

What’s the difference between ntoken() and ntype()?

A

ntoken(): counts the number of tokens per document in a quanteda object (e.g., a tokens object or dfm)

ntype(): returns a vector of the counts of unique tokens (types) per document
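
A quick illustration of the difference (the example text is made up):

library(quanteda)
toks <- tokens("the cat sat on the mat")
ntoken(toks) # 6 tokens in total
ntype(toks)  # 5 types, because "the" appears twice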

10
Q

What is a TTR?

A

A type-token ratio (TTR) is the total number of UNIQUE words (types) divided by the total number of words (tokens) in a given segment of language.

It is the most well-known measure of lexical diversity.

11
Q

Can you give an example of how to calculate TTR based on this:

another_text <- "As you may have noticed.. shdiskwo."

A

Answer:

another_text_tokens <- tokens(another_text)

ntype(another_text_tokens) / ntoken(another_text_tokens)

12
Q

How can you compute ratios analogous to the TTR for:

  1. characters per word
  2. words per sentence
A

Answer:

  1. nchar(another_text) / ntoken(another_text)
  2. ntoken(another_text) / nsentence(another_text)
13
Q

What is term frequency in terms of text representation?

A

It is the most basic way to represent text by counting the frequency of its tokens (terms).

You can manually create a column for each term and count the term’s frequency. Such a representation of text is known as a bag-of-words.

This can also be applied to sequences of terms (known as n-grams).
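
A hedged sketch of a bag-of-words and an n-gram representation using quanteda (the example text is invented); tokens_ngrams() builds the term sequences and dfm() counts them:

library(quanteda)
toks <- tokens("crime science studies crime data")
dfm(toks)                              # bag-of-words: counts of single terms
bigrams <- tokens_ngrams(toks, n = 2)  # e.g. "crime_science", "science_studies", ...
dfm(bigrams)                           # bag of bigrams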

14
Q

What is DFM? and explain the breakdown of it.

A

A DFM, otherwise known as a Document-Feature Matrix, records the frequency of each token (feature) in each document.

This is a table (matrix) that describes how frequently terms occur in each document. Each row is a document, and each column is a feature (term). Each cell represents the number of appearances of that feature in that document.
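
A minimal sketch of constructing a DFM in quanteda (the document names and texts are invented):

library(quanteda)
texts <- c(doc1 = "crime is down, crime pays less",
           doc2 = "science explains crime")
dfmat <- dfm(tokens(texts, remove_punct = TRUE))
dfmat # rows = documents, columns = features, cells = frequency counts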

15
Q

How can you construct a corpus (I) for use with a DFM?

A
  1. Create a mini corpus when dealing with a collection of documents
  2. Makes it easy to use some functions in Quanteda

library(quanteda)
# install.packages("readtext")
library(readtext)

biden <- readtext("…")
trump <- readtext("…")

16
Q

How can you construct a corpus II from a character vector and display attributes?

A

presidents_corpus <- corpus(c(biden = biden$text, trump = trump$text))
summary(presidents_corpus)

17
Q

What is Zipf’s law?

A

Zipf’s law says, ‘given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table’
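
One way to see this empirically is to rank features by frequency with textstat_frequency() from quanteda.textstats (a sketch, assuming a document-feature matrix called dfmat already exists):

library(quanteda.textstats)
freqs <- textstat_frequency(dfmat) # columns include feature, frequency and rank
head(freqs)
plot(log(freqs$rank), log(freqs$frequency),
     xlab = "log rank", ylab = "log frequency") # roughly a straight line under Zipf's law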

18
Q

What is Zipf’s law in relation to the Brown Corpus?

A

The Brown Corpus is a collection of American English texts compiled from a wide variety of sources (about 1 million words). Its word frequencies follow Zipf’s law closely: a handful of very common words (e.g., ‘the’, ‘of’, ‘and’) account for a large share of all tokens.

From the Brown Corpus we note that relying on raw term frequency may overstate the importance of non-informative terms.

19
Q

What are considered as the important terms in a document?

A

Some terms occur very often in all documents but are not informative, while other terms contribute more meaning.

Ideally, we want to calculate an importance score for each term.

20
Q

How can you calculate what the important terms are in a document?

A

The Term Frequency (TF) metric measures how frequently a term t occurs within a document d. It can be defined as a proportion or as a raw count:

tf(t, d) = count(t, d) / (number of terms in d)   (proportion)

tf(t, d) = count(t, d)   (raw count)
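
In quanteda, both versions can be obtained by weighting a DFM with dfm_weight() (a sketch; dfmat is an assumed, pre-existing dfm):

library(quanteda)
dfm_weight(dfmat, scheme = "count") # raw counts (the default)
dfm_weight(dfmat, scheme = "prop")  # counts divided by the total number of terms in each document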

21
Q

How can you address the issue of filler words when it comes to TF metrics?

A

To address this, we want to give more ‘weight’ to the terms that are
1. locally important (within individual documents)
2. not overly common across the entire collection of documents (globally rare)

The above can be achieved using Term Frequency - Inverse Document Frequency (TF-IDF).

22
Q

What is Term Frequency - Inverse Document Frequency (TF-IDF)?

A

TF-IDF is a widely used statistical method in text mining and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., a corpus).

TF-IDF will be high if a word occurs frequently in a specific document but is rare across the entire corpus.

23
Q

What are the components of TF-IDF?

A

TF - IDF (Term Frequency - Inverse Document Frequency)

TF (Term Frequency) - measures how frequently a term (t) occurs in a document

IDF (Inverse Document Frequency) - measures the importance of a term across all documents

24
Q

Define Document Frequency (DF)

A

DF is the number of documents containing the term t:

df(t) = number of documents in which t occurs

In terms of DF, a term is more informative if it occurs in fewer documents. We would therefore like to

  1. give less weight to terms that appear in more documents
  2. give more weight to terms that appear in fewer documents

And that is why we need the IDF.
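
In quanteda, document frequencies can be inspected with docfreq() (a sketch; dfmat is an assumed dfm):

library(quanteda)
docfreq(dfmat) # number of documents containing each feature, i.e. df(t)
ndoc(dfmat)    # N, the number of documents in the corpus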

25
Q

Define IDF

A

IDF (Inverse Document Frequency) gives less weight to terms that occur in many documents of a given corpus:

idf(t) = N / df(t)
where N is the number of documents in the corpus

26
Q

How can you address the issue of really high idf’s (such as rare terms)?

A

You can address it using the log IDF.

Using a logarithm to scale the values and avoid extremes is useful:

idf(t) = log(N / df(t))

27
Q

Define Term Frequency - Inverse Document Frequency (TF-IDF)

A

TF-IDF combines term frequency and inverse document frequency to define a composite measure for a term in a document:

tf-idf(t, d) = tf(t, d) x idf(t)
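
quanteda implements this weighting via dfm_tfidf() (a sketch; dfmat is an assumed dfm, and the exact weighting defaults are documented in the package):

library(quanteda)
dfm_tfidf(dfmat) # tf-idf weighted dfm; by default the idf part is logged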

28
Q

How can you assign the term (t) a value in document (d)? What are the conditions?

A
  1. tf-idf is highest when t occurs frequently within a small number of documents overall (globally rare but locally common)
  2. tf-idf is lower when t occurs rarely in a document or when it occurs frequently in many documents
  3. tf-idf is lowest when t occurs frequently in most documents
29
Q

What is Lexical Diversity? And what is the metric used for it?

A

Lexical diversity helps us understand the complexity of a text (vocabulary diversity) and gain insight into the use of language (e.g., fraud messages have high lexical diversity).

The most common metric is the TTR (types/tokens).

30
Q

What is the main disadvantage of TTR?

A

TTR is highly sensitive to text length: the longer the text, the lower the chance that the next token will be a new type, so shorter texts tend to have fewer repeated terms and therefore higher TTRs.

31
Q

What are the other lexical diversity metrics?

A
  1. Herdan’s C (log TTR, 1960): log(total types) / log(total tokens)
  2. Guiraud’s Root TTR (1954): total types / sqrt(total tokens)
  3. Simpson’s D (1949): subsample tokens at random from the data, then compute the average
  4. Mean Segmental Type-Token Ratio (MSTTR, Johnson 1944): divide the text into segments and compute the mean TTR of the segments
  5. The Moving-Average Type-Token Ratio (MATTR, Covington and McFall 2008): calculates a TTR within a moving window through the text and uses the mean of the TTRs of all windows
32
Q

How to calculate lexical diversity in R?

A

In R: to calculate the lexical diversity of text(s), use the function textstat_lexdiv() in the quanteda.textstats package (library("quanteda.textstats")).
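
A hedged example (the text is invented; the measure names follow the quanteda.textstats documentation):

library(quanteda)
library(quanteda.textstats)
toks <- tokens("a winner is a dreamer who never gives up")
textstat_lexdiv(toks, measure = "TTR")       # type-token ratio
textstat_lexdiv(toks, measure = c("C", "R")) # Herdan's C and Guiraud's root TTR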

33
Q

Define Readability Metrics

A

Readability metrics are used to describe readability in terms of complexity (i.e., how hard a text is to read).

The most common approach is to use numeric readability metrics to estimate the readability of texts.

34
Q

What is FRE? Define.

A

FRE is shorthand for the Flesch Reading Ease (FRE) score (1948); the readability score uses a combination of word, sentence and syllable counts.

It is measured on a scale of 1 to 100, with 100 being the highest (very easy to read) and 1 the lowest (very complicated to read).

35
Q

What is the standard formula of FRE?

A

FRE = 206.835 - 1.015 x (total words/total sentences) - 84.6 x (total syllables/total words)

36
Q

What is the standard formula of FKGL?

A

FKGL = 0.39 x (total words/total sentences) + 11.8 x (total syllables/total words) - 15.59

The FKGL (Flesch-Kincaid Grade Level) is equivalent to a US grade level of education (roughly 0-18)

37
Q

Define The Coleman Liau Readability Index (CLI, 1975)

A

CLI uses the number of letters and the number of sentences.

The formula is:
CLI = (0.0588 x L) - (0.296 x S) - 15.8

Where L is the average number of letters per 100 words; and S is the average number of sentences per 100 words

The score is an approximate representation of the US grade level.

38
Q

Define the Automated Readability Index (ARI)

A

ARI uses characters per word and words per sentence.

ARI = 4.71 x (characters/words) + 0.5 x (words/sentences) - 21.43
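
All of these readability scores can be computed with textstat_readability() from quanteda.textstats (a sketch; the example text is invented and the measure names are taken from that package's documentation):

library(quanteda.textstats)
some_text <- "Readability metrics estimate how hard a text is to read."
textstat_readability(some_text, measure = c("Flesch", "Flesch.Kincaid", "ARI"))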

39
Q

What are some of the issues that arise when dealing with traditional readability metrics?

A
  1. The metrics were developed decades ago for different contexts (e.g., education research and applied psychology)
  2. Other features of text may indicate greater complexity (syntactic and grammatical structure, and rare words)
  3. Lack of uncertainty estimates (e.g., ‘what does it mean for one text to have an FRE of 70 and another of 75?’)
40
Q

What is Pre-processing?

A

Pre-processing is the step of preparing text for analysis.

Text usually needs to be cleaned or pre-processed before quantitative analysis.

41
Q

Define Pre-processing: text cleaning.

A

Removing unwanted punctuation, special characters, HTML and XML tags, Twitter-specific mark-up, URLs, extra white space, etc.

42
Q

Define Pre-processing: text normalisation.

A

Converting text to lower case (removing capitalisation), handling abbreviations, expanding acronyms, checking spelling, replacing slang words with their meanings, etc.

43
Q

Define Pre-processing: further processing.

A

Further processing includes stemming and lemmatisation to reduce words to their base forms, and removing stop words (e.g., ‘the’, ‘and’).
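
A minimal quanteda pipeline combining cleaning, normalisation and further processing (a sketch; the example text and the particular options chosen are illustrative):

library(quanteda)
raw_text <- "The suspects were RUNNING towards http://example.com !!"
toks <- tokens(raw_text, remove_punct = TRUE, remove_url = TRUE) # cleaning
toks <- tokens_tolower(toks)                                     # normalisation
toks <- tokens_remove(toks, stopwords("en"))                     # remove stop words
toks <- tokens_wordstem(toks)                                    # stemming
toks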

44
Q

Define data handling.

A

Removing duplicates and handling missing values.

45
Q

Define custom pre-processing.

A

Custom pre-processing requires your own specialist knowledge of the data and task.

46
Q

What is the difference between stems and lemmas?

A

Stemming removes suffixes from words (e.g., ‘s’ and ‘es’ endings) to reduce them to a common base (the stem).

Lemmatisation reduces different forms of a word to a common base form (the lemma) that has the same meaning, using knowledge of the word class.
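
A rough illustration of the difference: quanteda provides stemming via tokens_wordstem(), while lemmatisation is mimicked here with a tiny hand-made lookup passed to tokens_replace() (the lookup table and text are invented):

library(quanteda)
toks <- tokens("the mice were running")
tokens_wordstem(toks) # stemming: "running" becomes "run", but "mice" is left unchanged
tokens_replace(toks,
               pattern     = c("mice", "were", "running"),
               replacement = c("mouse", "be", "run")) # lemma-style lookup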

47
Q

What are the approaches a data scientist has when it comes to stopwords?

A

It is common for data scientists not to use a standard stopword list as-is, but to add their own stopwords or create their own stopword list.
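
A hedged sketch of extending the built-in English stopword list with corpus-specific terms (the extra words are arbitrary examples):

library(quanteda)
custom_stopwords <- c(stopwords("en"), "rt", "amp", "via") # add domain-specific filler terms
toks <- tokens("rt via crime is down amp")
tokens_remove(toks, custom_stopwords)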

48
Q

What are the aspects of pre-processing that you should be aware of?

A
  1. Pre-processing decisions may negatively impact your results
  2. There is no one-size-fits-all solution in data science
  3. Experiment, assess the impact of pre-processing, and ensure your decisions are reasonable for your task