10 - Text Analytics Flashcards

Question 1

Q

What are the three main types of data?

Answer

A

Unstructured
- PDF, JPEG, MP3, Movies
Semi-structured
- CSV, JSON, XML
Structured
- Relational databases such as Oracle, MSSQL, MySQL

Question 2

Q

What is Text Analytics?

Answer

A

Text mining (also called text data mining, and roughly equivalent to text analytics) is the process of deriving high-quality information from text. It is used for:

Recognizing texts that are relevant to a criminal case
Recognizing relations within these texts in order to reveal entire relationship networks and planned activities
Identifying and/or tracking fragmented texts
Identifying or tracking hidden semantics

Question 3

Q

What are the challenges of Text Analytics?

Answer

A

Ambiguity
Non-Standard Language
More Complex Languages Than English

Question 4

Q

What are the two parts of the general text analytics process?

Answer

A

Data Preparation
- Text Pre-Processing, Text Transformation
Data Learning
- Feature Selection, Mining, Evaluation of Results

Question 5

Q

What are the five main steps of the Text Analytics Process?

Answer

A

Text Pre-Processing
Transformation
Feature Selection
Mining
Evaluation

Question 6

Q

What is done in the Pre-Processing phase?

Answer

A

Tokenization into words
Stemming & Lemmatization
Removing punctuation
Removing stop words
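
As a rough illustration of these steps, the following sketch lowercases a sentence, strips punctuation, tokenizes on whitespace, and drops stop words. The function name `preprocess` and the tiny `STOP_WORDS` list are assumptions made for this example, not part of the flashcards. Stemming and lemmatization are left out here, and punctuation is removed before tokenizing simply because that is easier with the standard library.

```python
import string

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    # Remove every ASCII punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Tokenization into words (whitespace split is the simplest possible tokenizer).
    tokens = no_punct.split()
    # Stop-word removal.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The cats are sitting in the garden, and the dog is barking!")
print(tokens)  # ['cats', 'sitting', 'garden', 'dog', 'barking']
```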

Question 7

Q

What are Stemming & Lemmatization?

Answer

A

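Stemming reduces a word to its root / stem by stripping affixes with heuristic rules. The resulting stem does not have to be a valid word.

Lemmatization reduces a word to its lemma, i.e. its dictionary form, using a vocabulary and morphological analysis.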
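
The difference can be seen in a toy sketch. The suffix list in `SUFFIXES` and the mini-dictionary `LEMMAS` are invented for this example. Real stemmers (for instance the Porter stemmer) and real lemmatizers use far richer rules and vocabularies.

```python
# Toy suffix rules, checked in order (assumption for this example).
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {"studies": "study", "better": "good", "ran": "run", "mice": "mouse"}

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["studies", "better", "mice"]:
    print(w, "-> stem:", stem(w), "| lemma:", lemmatize(w))
# studies -> stem: stud | lemma: study
# better -> stem: better | lemma: good
# mice -> stem: mice | lemma: mouse
```

The stem "stud" is not a dictionary word, while the lemma "study" is. The rule-based stemmer also cannot relate "better" to "good", which the dictionary lookup can.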

Question 8

Q

What does the abbreviation TF-IDF stand for and where is it used?

Answer

A

TF - Term Frequency
IDF - Inverse Document Frequency

Used in Text Data Analysis to weight how important a term is to a document within a corpus.

Question 9

Q

Explain TF.

Answer

A

TF = Term Frequency

TF measures how often a term occurs in a given document of the corpus. It increases with the number of occurrences of the term within that document. Each term has its own TF value in each document.
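
A minimal sketch of one common TF variant, assuming the raw count is normalized by the document length (other variants use raw or log-scaled counts). The function name `term_frequency` and the sample document are made up for illustration.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total_tokens} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

doc = ["dog", "barks", "dog", "runs"]
tf = term_frequency(doc)
print(tf)  # {'dog': 0.5, 'barks': 0.25, 'runs': 0.25}
```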

Question 10

Q

Explain IDF.

Answer

A

IDF = Inverse Document Frequency

IDF weights terms by how rare they are across all documents in the corpus. Terms that occur in only a few documents have a high IDF score.
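
A minimal sketch, assuming the plain formulation IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. Libraries often use smoothed variants. The function name and the three-document corpus are invented for this example.

```python
import math

def inverse_document_frequency(corpus):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        # set() so that a term is counted once per document.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

corpus = [
    ["dog", "barks"],
    ["dog", "runs"],
    ["cat", "sleeps"],
]
idf = inverse_document_frequency(corpus)
print(round(idf["dog"], 3), round(idf["cat"], 3))
```

"cat" appears in one of three documents and "dog" in two, so "cat" receives the higher IDF score.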

Question 11

Q

What can be done with vectors?

Answer

A

Document Summarization
Document Ranking

Question 12

Q

What are the three basic measures of text distance and similarity in Text Mining?

Answer

A

Euclidean Distance
Cosine Similarity
Jaccard Similarity

Limitation: These measures do not account for synonyms (e.g. dog / puppy).
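
The three measures can be sketched in a few lines of standard-library Python. The function names, the three-word vocabulary, and the sample vectors are assumptions chosen for this example. Returning 0.0 for empty inputs is a convention picked here, not a fixed rule.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by size of the union of two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Term-count vectors over the vocabulary ["dog", "barks", "puppy"].
v1 = [1, 1, 0]  # "dog barks"
v2 = [0, 1, 1]  # "puppy barks"

print(euclidean_distance(v1, v2))
print(cosine_similarity(v1, v2))
print(jaccard_similarity(["dog", "barks"], ["puppy", "barks"]))
```

The two sentences mean nearly the same thing, yet the cosine similarity is only about 0.5 and the Jaccard similarity about 0.33, because "dog" and "puppy" are treated as unrelated terms.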

Question 13

Q

What are common techniques for Text Data Analysis?

Answer

A

TF-IDF
Ranking documents
Distance & Similarity
Vectorization
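
These techniques are often combined. The following sketch vectorizes a small corpus with TF-IDF and ranks the documents by cosine similarity to a query. It assumes length-normalized TF, unsmoothed IDF of the form log(N / df), and non-empty documents. For simplicity the query is vectorized together with the corpus, which slightly changes the IDF values. All names and sample documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Turn tokenized, non-empty documents into TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({term for doc in corpus for term in doc})
    n_docs = len(corpus)
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

corpus = [
    ["stock", "market", "falls"],  # document 0
    ["cat", "sleeps"],             # document 1
    ["dog", "chases", "cat"],      # document 2
    ["dog", "chases", "ball"],     # query (last entry)
]
vocab, vectors = tfidf_vectors(corpus)
query, docs = vectors[-1], vectors[:-1]

# Rank document indices from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
print(ranking[0])  # 2
```

Document 2 ranks first because it shares "dog" and "chases" with the query. The other two documents share no terms with the query and score 0.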