Lecture 10 Flashcards

1
Q

Text and Documents

A

In the field of text visualization, a document can be a word, a sentence, a paragraph, or a collection of any of these.

A collection of documents is known as a corpus (plural: corpora).

These documents are often considered atomic. For example, a tweet is an atomic document.

Text and documents are often minimally structured.

For text and documents, the most obvious task is searching for a word or phrase.
Sometimes we also perform sentiment analysis on the text around the matches to gauge people's attitudes toward a certain product or incident.

2
Q

Levels of Text Representation

A

There are three levels of text representation:

Lexical Level
Syntactic Level
Semantic Level

Each requires us to convert the unstructured text to some form of structured data.

3
Q

Lexical Level

A

The lexical level is concerned with transforming a string of characters into a sequence of atomic entities called tokens.

Tokens can include characters, character n-grams, words, word stems, phrases, etc., with all associated attributes.

A lexical analyzer processes the sequence of characters according to a given set of rules and produces a new sequence of tokens that can be used for further analysis.

Many types of rules can be used to extract tokens. The most common are finite state machines defined by regular expressions.
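
As a rough illustration of rule-based token extraction, the sketch below uses Python's `re` module to turn a string into typed tokens and to produce character n-grams. The rule set (`TOKEN_RULES`) and the helper names `tokenize` and `char_ngrams` are assumptions made for this example, not part of the lecture.

```python
import re

# Hypothetical token rules for illustration; each named group is one token type.
TOKEN_RULES = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)              # integers and decimals
  | (?P<WORD>[A-Za-z]+(?:'[A-Za-z]+)?)     # words, optionally with a contraction
  | (?P<PUNCT>[^\w\s])                     # any single punctuation character
""", re.VERBOSE)

def tokenize(text):
    """Return (type, value) pairs for each token found in text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RULES.finditer(text)]

def char_ngrams(word, n):
    """Return the character n-grams of a word, another kind of token."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

tokens = tokenize("It's 3.5 km away!")
trigrams = char_ngrams("token", 3)
```

Whitespace matches none of the rules, so it is skipped and acts only as a separator between tokens.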

4
Q

Syntactic Level

A

The syntactic level deals with identifying and tagging (annotating) each token’s function.

We assign various tags, such as sentence position or whether a word is a noun, adjective, etc.

Tokens can also have attributes such as whether they are singular or plural, or their proximity to other tokens.

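The process of extracting these annotations is called named entity recognition (NER). Strictly speaking, NER covers richer tags such as person, place, organization, or date, while grammatical tags such as noun or adjective come from part-of-speech tagging.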
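
To make the idea of token annotation concrete, here is a toy sketch that attaches a sentence index, a position, a grammatical tag, and a singular/plural attribute to each word. The `LEXICON` table, the `annotate` function, and the "ends in s" plural heuristic are all hypothetical simplifications; a real system would rely on a trained tagger.

```python
import re

# Tiny hand-written lexicon, purely hypothetical; a real tagger would use a
# trained model or a full dictionary.
LEXICON = {"the": "DET", "a": "DET", "quick": "ADJ", "lazy": "ADJ",
           "fox": "NOUN", "foxes": "NOUN", "dog": "NOUN", "dogs": "NOUN",
           "jumps": "VERB", "jump": "VERB", "over": "PREP"}

def annotate(text):
    """Attach sentence index, position, tag, and number to each word token."""
    annotations = []
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    for s_idx, sentence in enumerate(sentences):
        for pos, word in enumerate(re.findall(r"[A-Za-z]+", sentence)):
            tag = LEXICON.get(word.lower(), "UNK")
            # Crude heuristic (assumption): a noun ending in "s" is plural.
            number = None
            if tag == "NOUN":
                number = "plural" if word.lower().endswith("s") else "singular"
            annotations.append({"token": word, "sentence": s_idx,
                                "position": pos, "tag": tag, "number": number})
    return annotations

tagged = annotate("The quick fox jumps. Lazy dogs jump!")
```

Each annotation is a plain dictionary, so later stages can filter tokens by tag or by position within a sentence.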

5
Q

Semantic Level

A

The third and final level of text representation.

It encompasses the extraction of meaning and relationships between pieces of knowledge derived from the structures identified at the syntactic level.

The goal of this level is to define an analytical interpretation of the full text within a specific context, or sometimes even independent of context.

6
Q

Document Visualization

A

Single Document Visualization - Visualizations of a single text document.

Word Clouds
Word Tree
Text Arc
Arc Diagrams
Literature Fingerprinting

Document Collection Visualization - Works on a collection of documents. The goal is to place similar documents close to each other and dissimilar ones far apart.

Self-Organizing Maps
Themescapes
Document Cards
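
Placing similar documents close together presupposes some measure of document similarity. One common choice, used here only as an assumed example, is the cosine similarity of term-frequency vectors. The function names and the three-document corpus below are made up for illustration.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector of a document, using whitespace tokenization."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical mini corpus.
corpus = {
    "d1": "cats chase mice",
    "d2": "cats chase birds",
    "d3": "stocks fell sharply",
}
vectors = {name: tf_vector(text) for name, text in corpus.items()}
sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])  # share two of three terms
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])  # share no terms
```

A layout method would then try to put `d1` and `d2` near each other and `d3` farther away.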

7
Q

Word Cloud

A

Also known as text clouds or tag clouds.

These are layouts of raw tokens, colored and sized by their frequency within a single document.

Text clouds and their variations, such as Wordle, are examples of visualizations that use only term frequency and some layout algorithm to create the visualization.

The font size and darkness are proportional to the frequency of the word in the document.
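
The frequency-to-size mapping can be sketched as follows. The linear scaling between a minimum and a maximum point size, and the names `word_sizes`, `min_pt`, and `max_pt`, are assumptions for this example; actual word cloud tools may scale differently and also handle layout.

```python
import re
from collections import Counter

def word_sizes(text, min_pt=10, max_pt=40):
    """Map each word to a font size that grows linearly with its frequency."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    return {w: min_pt + (c - lo) * (max_pt - min_pt) / span
            for w, c in counts.items()}

sizes = word_sizes("data data data viz viz text")
```

The same scaled value could drive darkness instead of, or in addition to, font size.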

8
Q

Word Tree

A

The word tree visualization is a visual representation of both term frequency and context.

Size is used to represent the term or phrase frequency.

The root of the tree is a user-specified word or phrase of interest, and the branches represent the various contexts in which the word or phrase is used in the document.
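
The data structure behind such a tree can be sketched as a nested dictionary of the words that follow the root. This is a simplified illustration: `build_word_tree` and its `depth` parameter are hypothetical, the root is a single word rather than a phrase, and contexts are allowed to run across sentence boundaries.

```python
import re

def build_word_tree(text, root, depth=3):
    """Nested dict of the words following `root`; each node counts occurrences."""
    words = re.findall(r"[a-z']+", text.lower())
    tree = {"count": 0, "children": {}}
    for i, w in enumerate(words):
        if w != root:
            continue
        tree["count"] += 1
        node = tree
        # Walk down the tree along the next `depth` words, counting each step.
        for nxt in words[i + 1:i + 1 + depth]:
            node = node["children"].setdefault(nxt, {"count": 0, "children": {}})
            node["count"] += 1
    return tree

tree = build_word_tree("I have a dream. I have a plan. I have time.", "i", depth=2)
```

The `count` stored at each node is what a renderer would map to text size.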

9
Q

Text Arcs

A

Text Arc is a visual representation of how terms relate to the lines of text in which they appear.

Every word of the text is drawn in order around an ellipse as small lines with a slight offset at its start.

More frequently occurring words are drawn larger and brighter and inside the ellipse.
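
A minimal layout sketch follows, under the assumption that a repeated word is placed at the centroid of the ring positions where it occurs, which naturally pulls it inside the ellipse. The function `text_arc_layout` and the radii `rx` and `ry` are hypothetical names for this example.

```python
import math
from collections import defaultdict

def text_arc_layout(words, rx=1.0, ry=0.6):
    """Place words in order around an ellipse; repeated words get a centroid."""
    n = len(words)
    ring = []
    occurrences = defaultdict(list)
    for i, w in enumerate(words):
        angle = 2 * math.pi * i / n
        point = (rx * math.cos(angle), ry * math.sin(angle))
        ring.append((w, point))
        occurrences[w].append(point)
    centroids = {}
    for w, pts in occurrences.items():
        if len(pts) > 1:  # only repeated words are drawn inside the ellipse
            centroids[w] = (sum(x for x, _ in pts) / len(pts),
                            sum(y for _, y in pts) / len(pts))
    return ring, centroids

ring, centroids = text_arc_layout(["a", "b", "a", "c"])
```

Here the word `a` occurs on opposite sides of the ellipse, so its centroid lands near the center.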

10
Q

Arc Diagram

A

Arc diagrams are visualizations focused on displaying repetition in text.

Repeated subsequences are identified and connected by semicircular arcs.

The thickness of the arc represents the length of the subsequence, and the height of the arc represents the distance between the subsequences.

For example, an arc diagram of a minuet visualizes the piece's classic pattern.

The minuet contains two parts, each consisting of a long passage played twice.

The parts are loosely related, as shown by the bundle of thin arcs connecting the two main parts.

The overlap of the two main arcs shows that the end of the first passage is the same as the beginning of the second.
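
The arc data itself can be sketched as below. This is a deliberately simplified version that only looks for repeated runs of one fixed length and links consecutive occurrences; it is not the full matching procedure used by real arc diagrams, and `repeat_arcs` is a hypothetical helper.

```python
from collections import defaultdict

def repeat_arcs(sequence, length):
    """Connect consecutive occurrences of each repeated run of `length` items."""
    starts = defaultdict(list)
    for i in range(len(sequence) - length + 1):
        starts[tuple(sequence[i:i + length])].append(i)
    arcs = []
    for run, positions in starts.items():
        for a, b in zip(positions, positions[1:]):
            if b - a >= length:  # skip overlapping occurrences
                # Thickness follows the run length; a semicircular arc's
                # height is half the distance between the two occurrences.
                arcs.append({"run": run, "from": a, "to": b,
                             "thickness": length, "distance": b - a})
    return arcs

arcs = repeat_arcs(list("abcxabc"), 3)
```

The sequence can hold characters, words, or musical notes, since only equality between items matters.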

11
Q

Literature Fingerprinting

A

A method of visualizing the features used to characterize a text.

Instead of calculating just one feature value, we calculate a sequence of feature values per text, presenting it as a characteristic fingerprint of the document.

This allows us to “look inside” the document and analyze the development of the values across the text.

Moreover, the structural information of the document is used to visualize the document on different levels of resolution.
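
As a small sketch of computing a sequence of feature values rather than a single one, the code below measures average word length per block of words. The choice of feature and the names `fingerprint` and `block_size` are assumptions for illustration; any per-block text feature could be substituted, and `block_size` stands in for the level of resolution.

```python
import re

def fingerprint(text, block_size=50):
    """Average word length for each consecutive block of `block_size` words."""
    words = re.findall(r"[A-Za-z]+", text)
    values = []
    for start in range(0, len(words), block_size):
        block = words[start:start + block_size]
        values.append(sum(len(w) for w in block) / len(block))
    return values

values = fingerprint("aa bbbb cc dddddd e", block_size=2)
```

Mapping each value to a colored cell, in reading order, would give a fingerprint-style picture of how the feature develops across the text.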
