10 - Text Analytics Flashcards

1
Q

Describe the three main types of data.

A
  • Unstructured
    • PDF, JPEG, MP3, Movies
  • Semi-structured
    • CSV, JSON, XML
  • Structured
    • Oracle, MSSQL, MySQL
2
Q

What is Text Analytics?

A

Text analytics, roughly equivalent to text mining (also called text data mining), is the process of deriving high-quality information from text. It is used for:

  • Recognition of texts that are relevant to a criminal case
  • Recognition of relations in these texts, in order to reveal whole relationship networks and planned activities
  • Identification and/or tracking of fragmented texts
  • Identification or tracking of hidden semantics
3
Q

What are the challenges of Text Analytics?

A
  • Ambiguity
  • Non-Standard Language
  • More Complex Languages Than English
4
Q

What are the two parts of the general text analytics process?

A
  • Data Preparation
    • Text Pre-Processing, Text Transformation
  • Data Learning
    • Feature Selection, Mining, Evaluation of Results
5
Q

What are the five main steps of the Text Analytics Process?

A
  1. Text Pre-Processing
  2. Transformation
  3. Feature Selection
  4. Mining
  5. Evaluation
6
Q

What is done in the Pre-Processing phase?

A
  1. Tokenization into words
  2. Stemming & Lemmatization
  3. Removal of punctuation
  4. Removal of stop words
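
A minimal sketch of three of these steps (tokenization, punctuation removal, stop-word removal), using only the Python standard library. The function name `preprocess` and the tiny `STOP_WORDS` set are assumptions made for illustration; real stop-word lists are much longer, and stemming / lemmatization is left out here to keep the example short.

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "on", "in", "and", "to"}

def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize text, strip punctuation, and drop stop words."""
    # Tokenize: lowercase and split on whitespace.
    tokens = text.lower().split()
    # Remove punctuation from both ends of each token, then drop empty tokens.
    tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [t for t in tokens if t]
    # Remove stop words.
    return [t for t in tokens if t not in stop_words]

print(preprocess("The cats are sitting on the mat, and the dog is barking!"))
# ['cats', 'sitting', 'mat', 'dog', 'barking']
```

Splitting on whitespace is the simplest possible tokenizer; it would not separate contractions or hyphenated words, which a dedicated tokenizer handles.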
7
Q

What are Stemming & Lemmatization?

A

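Stemming reduces a word to its root / stem by stripping affixes with heuristic rules; the result does not have to be a valid word.

Lemmatization maps a word to its lemma, i.e. its dictionary base form, using a vocabulary and morphological analysis.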
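
To make the difference concrete, here is a toy sketch. The suffix rules in `SUFFIXES` and the hand-written `LEMMAS` dictionary are hypothetical and far too small for real use; they only illustrate that a stemmer chops suffixes by rule while a lemmatizer looks up a dictionary form.

```python
# Toy suffix rules (assumption); real stemmers such as the Porter stemmer use many more.
SUFFIXES = ("ies", "ing", "ed", "s")

# Tiny hand-written lemma dictionary (hypothetical, for illustration only).
LEMMAS = {"studies": "study", "better": "good", "ran": "run", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for word in ["studies", "better", "ran", "walking"]:
    print(word, "->", toy_stem(word), "/", toy_lemmatize(word))
```

With these rules, "studies" stems to "stud" (not a real word) but lemmatizes to "study", and "better" is left unchanged by the stemmer while the lemma lookup returns "good".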

8
Q

What does the abbreviation TF-IDF stand for and where is it used?

A
  • TF - Term Frequency
  • IDF - Inverse Document Frequency

Used in text data analysis to weight how important a term is to a document within a corpus.

9
Q

Explain TF.

A

TF = Term Frequency

Measures how often a term occurs in a given document of the corpus. It increases with the number of occurrences of the term in that document. Each term has its own TF in each document.
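
A small sketch of one common way to compute TF: the term's count divided by the number of tokens in the document. The normalization by document length is an assumption (raw counts are also used), and `term_frequency` and the two sample documents are made up for illustration.

```python
from collections import Counter

def term_frequency(tokens):
    """Return each term's count divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Two tiny sample documents (hypothetical), already tokenized.
corpus = {
    "doc1": "the dog chased the cat".split(),
    "doc2": "the cat slept".split(),
}
tf = {name: term_frequency(tokens) for name, tokens in corpus.items()}

print(tf["doc1"]["the"])  # "the" is 2 of 5 tokens in doc1
print(tf["doc2"]["the"])  # "the" is 1 of 3 tokens in doc2
```

The same term ("the") gets a different TF in each document, which is the point of the last sentence on the card.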

10
Q

Explain IDF.

A

IDF = Inverse Document Frequency

Weights terms by how rare they are across all documents in the corpus. Terms that occur in few documents get a high IDF score; terms that occur in most documents get a low one.
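
A minimal sketch, assuming the plain variant idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. Other variants add smoothing terms; the function name and sample documents below are hypothetical.

```python
import math

def inverse_document_frequency(documents):
    """Return log(N / df) per term, where df counts the documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # set() so a term is counted once per document, however often it occurs.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Three tiny sample documents (hypothetical), already tokenized.
documents = [
    "the dog chased the cat".split(),
    "the cat slept".split(),
    "the bird sang".split(),
]
idf = inverse_document_frequency(documents)

print(idf["the"])   # occurs in every document, so the score is 0.0
print(idf["cat"])   # occurs in 2 of 3 documents
print(idf["bird"])  # occurs in 1 of 3 documents, the highest score here
```

Multiplying a term's TF in a document by its IDF gives the TF-IDF weight: frequent in this document, rare in the corpus.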

11
Q

What can be done with vectors?

A
  • Document Summarization
  • Document Ranking
12
Q

What are the three basic measures of text distance and similarity in Text Mining?

A
  • Euclidean Distance
  • Cosine Similarity
  • Jaccard Similarity

Limitations: These techniques compare surface terms only, so they do not capture synonyms (e.g. dog / puppy).
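
The three measures can be sketched in plain Python as follows. The function names and the two sample sentences are assumptions for illustration; the vectors are simple word counts over a shared vocabulary, and the zero-vector and empty-set return values are conventions chosen here.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_similarity(tokens_a, tokens_b):
    """Shared distinct terms divided by all distinct terms."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention: two empty documents count as identical
    return len(set_a & set_b) / len(set_a | set_b)

doc_a = "the dog barks".split()
doc_b = "the puppy barks".split()

# Count vectors over the shared vocabulary: barks, dog, puppy, the.
vocab = sorted(set(doc_a) | set(doc_b))
vec_a = [doc_a.count(term) for term in vocab]
vec_b = [doc_b.count(term) for term in vocab]

print(euclidean_distance(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_b))
print(jaccard_similarity(doc_a, doc_b))
```

The two sentences differ only in "dog" versus "puppy", yet all three measures treat those words as unrelated: Jaccard is 0.5 and cosine is about 0.67 rather than close to 1. This is the synonym limitation in practice.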

13
Q

What are common techniques for Text Data Analysis?

A
  • TF-IDF
  • Ranking documents
  • Distance & Similarity
  • Vectorization