10 - Text Analytics Flashcards
Describe the three main types of data.
- Unstructured: PDF, JPEG, MP3, movies
- Semi-structured: CSV, JSON, XML
- Structured: Oracle, MSSQL, MySQL
What is Text Analytics?
Text mining, also called text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. Example uses:
- Recognizing texts that are relevant to a criminal case
- Recognizing relations in these texts in order to reveal whole relationship networks and planned activities
- Identifying and/or tracking fragmented texts
- Identifying or tracking hidden semantics
What are the challenges of Text Analytics?
- Ambiguity
- Non-Standard Language
- More Complex Languages Than English
Which two parts make up the general process of text analytics?
- Data Preparation: Text Pre-Processing, Text Transformation
- Data Learning: Feature Selection, Mining, Evaluation of Results
What are the five main steps of the Text Analytics Process?
- Text Pre-Processing
- Transformation
- Feature Selection
- Mining
- Evaluation
What is done in the Pre-Processing phase?
- Tokenization into words
- Stemming & Lemmatization
- Removing punctuation
- Removing stop words
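The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the `STOP_WORDS` set and the `preprocess` function are hypothetical names, the stop-word list is deliberately tiny, and stemming/lemmatization is left out here.

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "in", "and", "of", "to"}

def preprocess(text):
    """Lowercase, remove punctuation, tokenize on whitespace, drop stop words."""
    # Remove punctuation characters such as "," and "!"
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenize into words by splitting on whitespace
    tokens = cleaned.split()
    # Remove stop words
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cat is sitting on the mat, and the dog barks!"))
# ['cat', 'sitting', 'mat', 'dog', 'barks']
```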
What are Stemming & Lemmatization?
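Stemming reduces a word to its root / stem by stripping affixes with simple rules; the result does not have to be a real word.
Lemmatization reduces a word to its lemma, i.e. its dictionary base form, using vocabulary and morphological knowledge.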
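The difference can be shown with a toy example. The suffix list, the lemma table, and the functions `toy_stem` and `toy_lemmatize` are all made up for illustration; real stemmers and lemmatizers use far richer rules and dictionaries.

```python
# Hypothetical suffix list and lemma table, for illustration only.
SUFFIXES = ("ing", "ies", "ed", "es", "s")
LEMMAS = {"studies": "study", "better": "good", "ran": "run", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table of known lemmas; unknown words pass through."""
    return LEMMAS.get(word, word)

for w in ["studies", "better", "ran"]:
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
```

Here "studies" is stemmed to "stud" (not a word) but lemmatized to "study", while "better" is untouched by the stemmer and mapped to "good" by the lemma lookup.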
What does the abbreviation TF-IDF stand for and where is it used?
- TF - Term Frequency
- IDF - Inverse Document Frequency
Used in text data analysis to weight how important a term is to a document within a corpus.
Explain TF?
TF = Term Frequency
Measures how often a term occurs in a given document of the corpus. It increases as the number of occurrences of that term within the document increases. Each term has its own TF in each document.
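As a small sketch, TF can be computed per document from a token list. The normalization used here (count divided by document length) is one common variant and an assumption; raw counts are also used. The function name `term_frequency` is hypothetical.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total tokens} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

doc = ["data", "mining", "text", "data"]
tf = term_frequency(doc)
print(tf["data"])  # 0.5, because "data" is 2 of 4 tokens
print(tf["text"])  # 0.25
```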
Explain IDF?
IDF = Inverse Document Frequency
Weights terms by how rare they are across all documents in the corpus. Terms that occur in only a few documents have a high IDF score, while terms that occur in almost every document have a low one.
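A minimal sketch of IDF, assuming the common unsmoothed form idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. Other variants add smoothing; the function name and the sample corpus are made up for illustration.

```python
import math

def inverse_document_frequency(corpus):
    """Return {term: log(N / df)} for a corpus given as a list of token lists."""
    n_docs = len(corpus)
    df = {}
    for tokens in corpus:
        # Count each term at most once per document
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}

corpus = [["data", "mining"], ["data", "text"], ["data", "vector"]]
idf = inverse_document_frequency(corpus)
print(idf["data"])    # 0.0, the term appears in every document
print(idf["mining"])  # log(3), the term appears in one of three documents
```

Multiplying a term's TF in a document by its IDF gives the TF-IDF weight, so frequent-but-common words such as "data" here contribute nothing.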
What can be done with document vectors?
- Document Summarization
- Document Ranking
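Document ranking, for example, can be sketched by turning each document into a word-count vector over a shared vocabulary and scoring it against a query vector. The dot product used as the score is an assumption chosen for simplicity, and all function names and sample documents are hypothetical.

```python
def build_vocabulary(docs):
    """Sorted list of all distinct words across tokenized documents."""
    return sorted({word for doc in docs for word in doc})

def vectorize(tokens, vocab):
    """Count vector with one entry per vocabulary word."""
    return [tokens.count(word) for word in vocab]

def rank(query_tokens, docs):
    """Return document indices ordered from highest to lowest score."""
    vocab = build_vocabulary(docs)
    q = vectorize(query_tokens, vocab)
    # Score = dot product of query vector and document vector
    scores = [sum(x * y for x, y in zip(q, vectorize(doc, vocab))) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

docs = [
    "text mining extracts information from text".split(),
    "the weather is sunny today".split(),
    "data mining finds patterns in data".split(),
]
print(rank(["text", "mining"], docs))  # [0, 2, 1]
```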
What are the three basic measures for comparing texts in Text Mining?
- Euclidean Distance
- Cosine Similarity
- Jaccard Similarity
Limitation: applied to word counts or word sets, these measures do not recognize synonyms or related words (e.g. dog / puppy).
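The three measures can be sketched in plain Python as follows. Euclidean distance and cosine similarity operate on equal-length numeric vectors, Jaccard similarity on sets of tokens; the function names and the handling of empty inputs are assumptions made for this example.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def jaccard_similarity(s, t):
    """Size of the intersection divided by size of the union of two token sets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # assumption: two empty sets count as identical
    return len(s & t) / len(s | t)

print(euclidean_distance([0, 0], [3, 4]))        # 5.0
print(cosine_similarity([1, 0], [0, 1]))         # 0.0
print(jaccard_similarity(["dog"], ["puppy"]))    # 0.0
```

The last call illustrates the synonym limitation: "dog" and "puppy" share no tokens, so their similarity is zero even though the meanings are close.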
What are common techniques for Text Data Analysis?
- TF-IDF
- Ranking documents
- Distance & Similarity
- Vectorization