Week 4 - Text Mining II Flashcards

1
Q

What kinds of word sequences are there, and why should we consider sequences of words?

A

We need sequences of words because a sequence of words may contain more information than individual words. There are three kinds of word sequences:
1. compound words
2. phrases with more context
3. sentences for context

By tying a word to its surrounding words, we may retain a greater amount of information and a more precise understanding of context.

2
Q

What are N-grams?

A

An n-gram is a sequence of n consecutive elements extracted from a text. Elements can be words/tokens, syllables, characters, or symbols.
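
The idea can be sketched in pure Python (a minimal helper, not tied to any particular library):

```python
def ngrams(tokens, n):
    """Return all sequences of n consecutive elements from `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("text mining is fun".split(), 2))   # word bigrams
print(ngrams(list("mine"), 3))                   # character trigrams
```

The same function works for any element type, which is why n-grams can be built over words, characters, or symbols.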

3
Q

What are the criteria when selecting n?

A

Generally, a smaller n (bigram or trigram) works well. Four-grams and five-grams might be useful when you have large data sets.

4
Q

What is TF-IDF with n-grams?

A

It’s a bag-of-n-grams model (preserving more context). It is common to compute TF-IDF with n-grams as features to use in predictive modelling (e.g., machine learning).
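
A minimal pure-Python sketch of TF-IDF over bigram features (in practice you would typically use a library, e.g. scikit-learn's TfidfVectorizer with an `ngram_range` argument; the documents below are made up):

```python
import math
from collections import Counter

def bigrams(tokens):
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def tfidf(docs):
    """docs: list of token lists -> one {bigram: tf-idf score} dict per doc."""
    counts = [Counter(bigrams(d)) for d in docs]
    df = Counter(g for c in counts for g in c)      # document frequency
    n_docs = len(docs)
    scores = []
    for c in counts:
        total = sum(c.values())
        scores.append({g: (f / total) * math.log(n_docs / df[g])
                       for g, f in c.items()})
    return scores

vecs = tfidf(["the cat sat".split(), "the cat ran".split()])
# "the cat" occurs in every document, so its idf (and tf-idf) is 0
```

Bigrams that appear in every document get a score of 0, while distinctive bigrams are weighted up, giving a feature vector per document for a downstream model.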

5
Q

What is Key Words in Context (KWiC)?

A

Keywords-in-context (KWiC) displays concordance lines with the keyword in the middle along with the nearby words.

This helps gain insight into how a word or phrase is used in a corpus, how frequently, and in which contexts.
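
A toy KWiC function illustrates the idea (the `window` size and the bracket formatting are arbitrary choices; corpus tools such as NLTK's concordance view do this with aligned columns):

```python
def kwic(tokens, keyword, window=3):
    """Return concordance lines: `window` tokens either side of each hit,
    with the keyword shown in the middle."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

tokens = "the dog chased the cat and the cat ran away".split()
for line in kwic(tokens, "cat"):
    print(line)
```

Each occurrence of the keyword produces one line, so scanning the output shows both how often and in which contexts the word appears.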

6
Q

What is Part Of Speech (POS)?

A

POS identifies the lexical category of words. This classifies a word with its corresponding part of speech (e.g., nouns, verbs, pronouns, adjectives, adverbs, and many more).

7
Q

What is POS tagging?

A

POS tagging is the process of assigning to each word a tag for its corresponding part of speech.

The context of the word is required to identify its POS.
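
A toy illustration of why context matters (the rule "noun after a determiner" is a made-up heuristic, not a real tagger; spaCy and NLTK learn such decisions from data):

```python
# "book" can be a noun ("the book") or a verb ("will book"); without the
# surrounding words the tag is ambiguous.
DETERMINERS = {"a", "an", "the"}

def tag_book(tokens):
    tags = []
    for i, tok in enumerate(tokens):
        if tok.lower() == "book":
            prev = tokens[i - 1].lower() if i > 0 else ""
            tags.append("NOUN" if prev in DETERMINERS else "VERB")
        else:
            tags.append("X")   # other words left untagged in this toy example
    return tags

print(tag_book("I read the book".split()))       # ['X', 'X', 'X', 'NOUN']
print(tag_book("I will book a flight".split()))  # ['X', 'X', 'VERB', 'X', 'X']
```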

8
Q

What is the use/purpose of POS tagging?

A

POS tagging is used to compare the grammar of different texts, make grammar corrections, auto-complete, help to translate text from one language to another, and create more specific features in a document.

9
Q

How do you generate POS tags?

A

POS tags are generated using Python libraries such as spaCy and NLTK, and Java libraries such as OpenNLP and CoreNLP.

10
Q

What is Sentiment Analysis?

A

Sentiment analysis (or opinion mining) studies the opinions, attitudes, and emotions of a writer toward a subject matter (e.g., an entity or event).

11
Q

What are the potential applications for security and crime science?

A

Examining discussions, public opinion or reactions related to security and crime: predicting crime patterns, potential signals for online radicalisation, disinformation, hate-speech and harmful behaviour detection, phishing etc.

12
Q

What are the main steps in sentiment analysis?

A
  1. Tokenise the text
  2. Create a sentiment lexicon (or use an existing sentiment lexicon)
  3. Judge the sentiment of tokens
  4. Match tokens with sentiment lexicon
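
The steps above can be sketched with a tiny made-up lexicon (real lexicons contain thousands of judged words):

```python
# Steps 2-3: a lexicon of words judged and assigned sentiment scores.
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def sentiment_score(text, lexicon=LEXICON):
    tokens = text.lower().split()              # step 1: tokenise the text
    # step 4: match tokens with the lexicon; unmatched tokens contribute 0
    return sum(lexicon.get(tok, 0) for tok in tokens)

print(sentiment_score("a great film with a bad ending"))  # 2 + (-1) = 1
```

Summing matched scores gives an overall sentiment for the text; positive totals suggest positive sentiment, negative totals the opposite.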
13
Q

What are the specifics of creating a sentiment lexicon?

A

A sentiment lexicon is a dictionary of words where each word is assigned a corresponding sentiment score.

14
Q

Do all the words have a potential sentiment?

A

Not all words may carry sentiment (e.g., neutral words), so you may wish to focus on adjectives/adverbs.

Ready-made sentiment lexicons exist; a lexicon might be designed for a specific intended usage.

15
Q

What’s a limitation of syuzhet?

A

syuzhet matches tokens against the sentiment lexicon one by one, without considering context, so it does not account for valence shifters (e.g., “not good” still scores as positive because only “good” is matched).

16
Q

What is considered as Valence Shifters?

A

Valence shifters are words that change the sentiment of the words around them: negators, amplifiers, de-amplifiers, and adversative conjunctions.

Nearly all sentiment analysis packages in R fail to recognise valence shifters; sentimentr is an exception.

17
Q

What is the limitation of nuanced approaches?

A

Sentiment for each sentence:

  • needs punctuated data, but data is often not punctuated (e.g., texts on social media) or badly punctuated
  • without punctuation, the whole text is seen as one sentence
  • requires accurate sentence boundary disambiguation (sentence breaking), and the data might not be suitable (e.g., Twitter -> too short)
18
Q

What is the benefit of Sentiment Trajectory Analysis?

A

It can help in understanding the style and structure of texts, what makes a popular speech, and in detecting misinformation.

19
Q

What is the process for doing a sentiment trajectory?

A
  1. Parse the text into tokens
  2. Match the sentiment lexicon to each token

(match valence shifters to each context; apply valence shifter weights; build a naive context around the sentiment; return a modified sentiment)
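
A naive sketch of steps 2 onwards, with made-up lexicon and shifter lists (sentimentr's actual weighting scheme is more sophisticated; the weights 2 and 0.5 here are illustrative assumptions):

```python
# Made-up lexicon and valence-shifter lists, for illustration only.
LEXICON = {"good": 1, "bad": -1}
SHIFTERS = {"not": "negator", "very": "amplifier", "hardly": "deamplifier"}

def modified_sentiments(tokens, window=2):
    """For each sentiment word, inspect the `window` preceding tokens for
    valence shifters and modify the raw lexicon score accordingly."""
    out = []
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        score = LEXICON[tok]                       # raw sentiment from lexicon
        for prev in tokens[max(0, i - window):i]:  # naive context window
            kind = SHIFTERS.get(prev)
            if kind == "negator":
                score = -score                     # flip the polarity
            elif kind == "amplifier":
                score *= 2                         # strengthen (assumed weight)
            elif kind == "deamplifier":
                score *= 0.5                       # weaken (assumed weight)
        out.append(score)
    return out

tokens = "the food was not good but the service was very good".split()
print(modified_sentiments(tokens))  # [-1, 2]
```

The returned list of modified sentiments, in text order, is the raw material of a sentiment trajectory.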

20
Q

What is the ranking of the valence shifters?

A

There are 4 valence shifters:

1 = negator (not, never, …)
2 = amplifier (very, totally, …)
3 = deamplifier (hardly, barely, …)
4 = adversative conjunction (but, however, …)

21
Q

What is a sentiment trajectory?

A

A sentiment trajectory builds a ‘naive’ context around each sentiment word; this means the 2 words before and the 2 words after the sentiment word.

22
Q

In some cases of trajectories; the transcript has different lengths. How can you standardise the length?

A

The solution is length standardisation: we transform all sentiment values into a standardised vector of fixed length. We do this because a standard vector length makes comparisons across texts easier and more consistent.
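
A sketch of length standardisation by linear resampling (the target length and the example values are arbitrary):

```python
def standardise(values, length=100):
    """Linearly resample a sentiment trajectory to a fixed length."""
    n = len(values)
    if n == 1:
        return [float(values[0])] * length
    out = []
    for i in range(length):
        pos = i * (n - 1) / (length - 1)   # position in the original vector
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out

short = standardise([1, -1, 2], length=5)
# endpoints are preserved: short[0] == 1.0 and short[-1] == 2.0
```

Once every transcript's trajectory has the same length, the vectors can be compared, averaged, or clustered directly.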

23
Q

What’s an aspect of trajectory that you need to beware of?

A

Beware of the “filter” size: increasing the filter parameter adds more granularity. The default filter size is 5; compare this with a filter size of 20 to see the difference.