Lecture 10 Flashcards
Text and Documents
In the field of text visualization, a document can be a word, a sentence, a paragraph, or a collection of any of these.
A collection of documents is known as a corpus (plural: corpora).
These documents are often considered atomic. For example, a tweet is an atomic document.
Text and documents are often minimally structured.
For text and documents the most obvious task is searching for a phrase or word.
Sometimes, we perform sentiment analysis on the searched words to know the inclination of people towards a certain product or incident.
Levels of Text Representation
There are three levels of text representation:
Lexical Level
Syntactic Level
Semantic Level
Each requires us to convert the unstructured text to some form of structured data.
Lexical Level
The lexical level is concerned with transforming a string of characters into a sequence of atomic entities called tokens.
Tokens can include characters, character n-grams, words, word stems, phrases, etc., with all associated attributes.
A lexical analyzer processes the sequence of characters with a given set of rules into a new sequence of tokens that can be used for further analysis.
Many types of rules can be used to extract tokens. The most common are finite state machines defined by regular expressions.
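The rule-based tokenization described above can be sketched with Python's `re` module. This is a minimal illustration, not a complete lexical analyzer: the rule names (`NUMBER`, `WORD`, `PUNCT`), their patterns, and the helper functions `tokenize` and `char_ngrams` are assumptions made for this example.

```python
import re

# Hypothetical rule set: each rule is a (token type, regular expression) pair.
TOKEN_RULES = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD", r"[A-Za-z]+(?:'[A-Za-z]+)?"),
    ("PUNCT", r"[^\sA-Za-z\d]"),
]

# Combine the rules into one pattern with a named group per token type.
TOKEN_PATTERN = re.compile(
    "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_RULES)
)


def tokenize(text):
    """Return (type, lexeme, start offset) triples in order of appearance."""
    return [(m.lastgroup, m.group(), m.start())
            for m in TOKEN_PATTERN.finditer(text)]


def char_ngrams(word, n):
    """Return all character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    for token in tokenize("A tweet has 280 characters."):
        print(token)
    print(char_ngrams("token", 3))
```

Each token keeps its type and start offset as attributes, so later levels can refer back to the original text.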
Syntactic Level
The syntactic level deals with identifying and tagging (annotating) each token’s function.
We assign various tags, such as sentence position or whether a word is a noun, an adjective, etc.
Tokens can also have attributes such as whether they are singular or plural, or their proximity to other tokens.
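Assigning grammatical tags such as noun or adjective is called part-of-speech tagging; extracting richer tags such as person, place, organization, date, or money is called Named Entity Recognition (NER).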
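A small sketch of attaching attributes to tokens, limited to the ones that need no linguistic model: sentence position and proximity to other tokens. The functions `annotate` and `proximity` and the dictionary keys are hypothetical names for this example. Part-of-speech tags and named entities are deliberately left out, since they require a lexicon or a trained model.

```python
import re


def annotate(text):
    """Attach sentence index and in-sentence position to every word token."""
    annotated = []
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sent_idx, sentence in enumerate(sentences):
        words = re.findall(r"[A-Za-z]+", sentence)
        for pos, word in enumerate(words):
            annotated.append({
                "token": word,
                "index": len(annotated),      # position in the whole text
                "sentence": sent_idx,
                "position": pos,              # position inside the sentence
                "is_first": pos == 0,
                "is_last": pos == len(words) - 1,
            })
    return annotated


def proximity(annotated, a, b):
    """Smallest distance in tokens between any occurrence of a and of b."""
    idx_a = [t["index"] for t in annotated if t["token"].lower() == a.lower()]
    idx_b = [t["index"] for t in annotated if t["token"].lower() == b.lower()]
    if not idx_a or not idx_b:
        return None
    return min(abs(i - j) for i in idx_a for j in idx_b)


if __name__ == "__main__":
    tags = annotate("Tokens carry tags. Tags describe tokens.")
    for t in tags:
        print(t)
    print(proximity(tags, "tokens", "tags"))
```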
Semantic Level
Third and final level in text representation.
Encompasses the extraction of meaning and relationships between pieces of knowledge derived from the structures identified at the syntactic level.
The goal of this level is to define an analytical interpretation of the full text within a specific context, or sometimes even independent of context.
Document Visualization
Single Document Visualization - Visualizations of a single text document
Word Clouds
Word Tree
Text Arc
Arc Diagrams
Literature Fingerprinting
Document Collection Visualization - Works on a collection of documents; the goal is to place similar documents close to each other and dissimilar ones far apart.
Self-Organizing Maps
Themescapes
Document Cards
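Placing similar documents close together presupposes some measure of document similarity. The sketch below uses cosine similarity over raw term-frequency vectors. That choice is an assumption made for illustration and is not stated in the lecture as the measure used by any of the techniques listed above. The function names `tf_vector` and `cosine_similarity` are also hypothetical.

```python
import math
import re
from collections import Counter


def tf_vector(doc):
    """Term-frequency vector of a document (lowercased word counts)."""
    return Counter(re.findall(r"[a-z']+", doc.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    docs = [
        "text visualization of documents",
        "visualization of text documents",
        "self organizing maps",
    ]
    vectors = [tf_vector(d) for d in docs]
    print(cosine_similarity(vectors[0], vectors[1]))  # same words, high score
    print(cosine_similarity(vectors[0], vectors[2]))  # no shared words
```

A layout method would then try to turn high similarity into small on-screen distance.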
Word Cloud
Also known as text clouds or tag clouds
These are layouts of raw tokens, colored and sized by their frequency within a single document.
Text clouds and their variations, such as Wordle, are examples of visualizations that use only term frequency and some layout algorithm to create the visualization.
The font size and darkness are proportional to the frequency of the word in the document.
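The frequency-to-appearance mapping can be sketched as follows. The layout algorithm itself is not shown. The function names, the 48-point maximum, and the 0-to-1 darkness scale are assumptions for this example.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercased word occurs in the document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cloud_styles(freqs, max_pt=48):
    """Map each term to a font size and darkness proportional to its count.

    Assumption: the most frequent term gets max_pt and darkness 1.0.
    """
    top = max(freqs.values())
    return {
        term: {"size": max_pt * count / top, "darkness": count / top}
        for term, count in freqs.items()
    }


if __name__ == "__main__":
    freqs = term_frequencies("data data data text text cloud")
    for term, style in cloud_styles(freqs).items():
        print(term, style)
```

In practice a minimum font size is often enforced so that rare terms remain legible, at the cost of strict proportionality.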
Word Tree
The word tree visualization is a visual representation of both term frequencies and the contexts in which the terms appear.
Size is used to represent the term or phrase frequency.
The root of the tree is a user-specified word or phrase of interest, and the branches represent the various contexts in which the word or phrase is used in the document.
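The data structure behind a word tree can be sketched as a nested dictionary of the words that follow the root, with a count at each node that would drive the displayed size. This is a simplified sketch under stated assumptions: the root is a single lowercase word rather than a phrase, contexts are cut off after `depth` words, and sentence boundaries are ignored. The name `build_word_tree` is hypothetical.

```python
import re


def build_word_tree(text, root, depth=3):
    """Nested dict of the words that follow `root`, with occurrence counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tree = {"count": 0, "children": {}}
    for i, tok in enumerate(tokens):
        if tok != root:
            continue
        tree["count"] += 1
        node = tree
        # Walk down one branch per occurrence, creating nodes as needed.
        for nxt in tokens[i + 1:i + 1 + depth]:
            node = node["children"].setdefault(
                nxt, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return tree


if __name__ == "__main__":
    tree = build_word_tree("I have a dream. I have a plan. I have time.",
                           "have", depth=2)
    print(tree)
```

Branches that share a prefix (here "have a") are merged, so their counts add up along the shared path.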
Text Arcs
Text Arc is a visual representation of how terms relate to the lines of text in which they appear.
Every word of the text is drawn in order around an ellipse as small lines with a slight offset at its start.
More frequently occurring words are drawn larger and brighter and inside the ellipse.
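One way to compute such a layout is sketched below. Each word occurrence is placed at an evenly spaced angle on the ellipse, and each distinct word is then positioned at the average of its occurrence points, which pulls repeated words toward the interior. The averaging rule, the ellipse radii, and the name `text_arc_layout` are assumptions for this example rather than a description of a specific implementation.

```python
import math
import re
from collections import defaultdict


def text_arc_layout(text, rx=1.0, ry=0.6):
    """Return count and (x, y) position for every distinct word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    occurrences = defaultdict(list)
    for i, tok in enumerate(tokens):
        # The i-th word of the text sits at the i-th step around the ellipse.
        angle = 2 * math.pi * i / n
        occurrences[tok].append((rx * math.cos(angle), ry * math.sin(angle)))
    layout = {}
    for tok, pts in occurrences.items():
        layout[tok] = {
            "count": len(pts),  # would drive size and brightness
            "x": sum(x for x, _ in pts) / len(pts),
            "y": sum(y for _, y in pts) / len(pts),
        }
    return layout


if __name__ == "__main__":
    for word, info in text_arc_layout("a b a c").items():
        print(word, info)
```

A word that occurs once stays on the ellipse, while a word spread evenly through the text ends up near the center.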
Arc Diagram
Arc diagrams are visualizations focused on displaying repetition in text.
Repeated subsequences are identified and connected by semicircular arcs.
The thickness of the arc represents the length of the subsequence, and the height of the arc represents the distance between the subsequences.
The arc diagram on the next slide visualizes the classic pattern of a minuet.
It contains two parts, each consisting of a long passage played twice.
The parts are loosely related, as shown by the bundle of thin arcs connecting the two main parts.
The overlap of the two main arcs shows that the end of the first passage is the same as the beginning of the second.
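The mapping from repeated subsequences to arcs can be sketched as follows. This is a simplified matching rule, assumed for illustration: for every offset between two copies, it collects runs of matching elements, caps each run at the offset so the copies do not overlap, and keeps runs of at least `min_len`. It may report redundant arcs that a real arc diagram would filter out. The names `find_arcs` and `arc_geometry`, and the choice of half the offset as the height of a semicircle, are assumptions.

```python
def find_arcs(seq, min_len=2):
    """Return (start_a, start_b, length) for repeated, non-overlapping runs."""
    n = len(seq)
    arcs = []
    for d in range(1, n):                  # d = offset between the two copies
        run_start, run_len = None, 0
        for i in range(n - d + 1):         # the extra step flushes the last run
            matches = i < n - d and seq[i] == seq[i + d]
            if matches and run_len < d:
                if run_len == 0:
                    run_start = i
                run_len += 1
            else:
                if run_len >= min_len:
                    arcs.append((run_start, run_start + d, run_len))
                # A match here means the previous run hit the cap: start anew.
                run_start, run_len = (i, 1) if matches else (None, 0)
    return arcs


def arc_geometry(arc):
    """Thickness from subsequence length, height from the distance between copies."""
    start_a, start_b, length = arc
    return {"thickness": length, "height": (start_b - start_a) / 2}


if __name__ == "__main__":
    for arc in find_arcs("abcabcxyzxyz"):
        print(arc, arc_geometry(arc))
```

The input here is a string, but any sequence of comparable items, such as a list of notes, works the same way.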
Literature Fingerprinting
Method of visualizing features used to characterize text.
Instead of calculating just one feature value, we calculate a sequence of feature values per text, presenting it as a characteristic fingerprint of the document.
This allows us to “look inside” the document and analyze the development of the values across the text.
Moreover, the structural information of the document is used to visualize the document on different levels of resolution.
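A minimal sketch of computing a feature sequence instead of a single value. It assumes paragraphs separated by blank lines as the structural unit and uses average word length as the feature. Both choices, and the function names, are illustrative assumptions; any per-block measure could be substituted.

```python
import re


def avg_word_length(words):
    """Example feature: mean number of characters per whitespace-separated word."""
    return sum(len(w) for w in words) / len(words)


def paragraph_fingerprint(text, feature=avg_word_length):
    """One feature value per paragraph, in document order."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [feature(p.split()) for p in paragraphs]


def document_value(text, feature=avg_word_length):
    """The same feature at the coarsest resolution: one value per document."""
    return feature(text.split())


if __name__ == "__main__":
    sample = "aa bb\n\ncccc dddd\n\nee ffff"
    print(paragraph_fingerprint(sample))  # fine resolution
    print(document_value(sample))         # coarse resolution
```

Rendering each value in the sequence as a colored cell, in reading order, would give the fingerprint image. Switching between `paragraph_fingerprint` and `document_value` corresponds to changing the level of resolution.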