B02 Text Analytics I Flashcards
Define Text Analytics
Text analytics is the process of extracting high-quality insights from textual data. It is sometimes referred to as text mining.
Define this type of Text Analytics: Search and information retrieval
Search and information retrieval covers indexing, searching, and retrieving documents from large text databases with keyword queries.
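As a rough illustration of indexing and keyword search, the Python sketch below builds a tiny inverted index (word to document ids) and answers a query by intersecting the matching sets. The sample documents and the helper names `build_index` and `search` are made up for this example; real retrieval systems also add ranking, stemming, and more.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical mini-collection of documents.
docs = {
    1: "text mining extracts insights from text",
    2: "regular expressions describe patterns in text",
    3: "web pages are written in html",
}
index = build_index(docs)
print(sorted(search(index, "text patterns")))
```

Only a document that contains both keywords is returned, which is the simplest form of a keyword query.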
Define this type of Text Analytics: Document Clustering
Document clustering uses an unsupervised machine learning approach to group similar documents into clusters.
Define this type of Text Analytics: Document Classification
Document classification assigns a known set of labels to untagged documents using a model learned from documents with known labels.
Define this type of Text Analytics: Information Extraction
The goal of information extraction is to construct (or extract) structured data, such as names, places and organizations, from unstructured data.
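A minimal Python sketch of the idea, assuming a simplified case: instead of names, places and organizations (which usually need a trained entity recognizer), it pulls e-mail addresses and ISO-style dates out of free text with regular expressions. The sample sentence, the patterns, and the helper name `extract` are illustrative assumptions.

```python
import re

# Simplified patterns for illustration; real-world e-mail and date formats vary more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}")
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract(text):
    """Turn unstructured text into a small structured record."""
    return {
        "emails": EMAIL.findall(text),
        "dates": DATE.findall(text),
    }


text = ("Contact Ana at ana@example.com before 2024-03-15, "
        "or Bo at bo@example.org by 2024-04-01.")
record = extract(text)
print(record)
```

The result is a dictionary with one list per field, which can be loaded into a table for analysis.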
Define this type of Text Analytics: Syntactic Parsing
Syntactic parsing uses part-of-speech (POS) tagging techniques to identify the grammatical role of each word so that the words can be used in a grammatical or otherwise useful context.
Define this type of Text Analytics: Concept Extraction
Concept extraction is focused on grouping words, phrases and other lexical structures into semantically similar groups in order to understand the text.
Define Regular Expressions (Regex)
- Regular expressions (“regex”) provide us with a concise language for describing patterns in text by using a set of metacharacters ($ * + . ? [] ^ { } | ( ) ).
- Identifying patterns and manipulating text allow us to transform textual data from unstructured or semi-structured form to structured form for analysis.
The syntax for regular expressions can be grouped into four major categories:
1. Operators
2. Escape sequences
3. Anchors and repetitions
4. Character classes
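The four categories can be sketched with one small pattern each. This is a hedged illustration using Python's `re` module rather than R; the sample strings and variable names are made up for the example.

```python
import re

# 1. Operator: | means "either the left or the right alternative".
animals = re.findall(r"cat|dog", "cat, dog, bird")

# 2. Escape sequence: \t matches a tab character.
fields = re.split(r"\t", "name\tage")

# 3. Anchors and repetition: ^ is the start, $ is the end, + is "one or more".
only_a = bool(re.search(r"^a+$", "aaa"))
not_only_a = bool(re.search(r"^a+$", "aab"))

# 4. Character class: [0-9] matches any single digit.
numbers = re.findall(r"[0-9]+", "room 12, floor 3")

print(animals, fields, only_a, not_only_a, numbers)
```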
Regex
Operators:
Regex
Escape Sequences
-Some special characters in R cannot be referenced directly.
-For example, to refer to a pattern that contains a single quote ('), we precede the single quote with an escape character (\).
-Some common escape sequences include: single quote (\'), double quote (\"), newline (\n), carriage return (\r), tab character (\t) and even the backslash (\\).
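The card refers to R; as a hedged illustration, Python string literals accept the same backslash escapes listed above, so the sketch below shows each one in Python. The variable names and sample strings are made up for the example.

```python
single = 'It\'s'          # escaped single quote inside a single-quoted string
double = "say \"hi\""     # escaped double quote inside a double-quoted string
newline = "a\nb"          # newline between a and b
carriage = "a\rb"         # carriage return between a and b
tab = "a\tb"              # tab between a and b
backslash = "C:\\temp"    # two typed backslashes yield one literal backslash

print(single)
print(double)
print(len(tab))           # each escape sequence counts as a single character
print(backslash)
```

Each escape sequence is typed as two characters but stored as one, which is why the literal backslash itself has to be doubled.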
Regex
Anchors and Repetitions
Regex
Character Classes 1
Regex
Character Classes 2
Hypertext Markup Language (HTML)
-The standard markup language used for creating web pages.
-It describes the structure of web pages through the use of tags.
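To make the idea of tags concrete, here is a minimal Python sketch that uses the standard library's `html.parser` to list the opening tags and the visible text of a small page. The sample page, its tags, and the class name `TagCollector` are hypothetical choices for this example, not taken from the card.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening tag and every non-empty piece of text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# Hypothetical page used only for illustration.
page = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>"

collector = TagCollector()
collector.feed(page)
collector.close()
print(collector.tags)
print(collector.text)
```

The tags carry the structure of the page, while the text between them carries the content that text analytics usually works on.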