Text Mining Flashcards

1
Q

What is the motivation for text mining?

A
  • 90% of the world's data is in an unstructured format

e.g. web pages, emails, corporate documents, scientific papers

2
Q

Define Text Mining

A

The extraction of implicit, previously unknown and potentially useful information from large amounts of textual resources

3
Q

What are some application areas for Text Mining?

A
  1. Classification of news stories
  2. SPAM detection
  3. Sentiment analysis
  4. Clustering of documents or web pages
4
Q

Give an example of a mixture of document clustering and classification

A

Google News first clusters news articles; afterwards, it classifies the clustered articles

5
Q

What is the goal of sentiment analysis?

A
  • To determine the polarity of a given text at the document, sentence, or feature/aspect level
  • Polarity values (positive, neutral, negative)
6
Q

For which area can you apply sentiment analysis?

A

On document level: analysis of a whole document (e.g. tweets about a president)

On feature/aspect level: analysis of product reviews (polarity values for different features within a review)

7
Q

Explain search log mining

A
  • Analysis of search queries issued by large user communities
8
Q

What are application areas for search log mining?

A

1) Search term auto-completion (association analysis)

2) Query topic detection (classification)

9
Q

What is information extraction?

A
  • The task of automatically extracting structured information from unstructured or semi-structured documents
10
Q

What are the subtasks of information extraction?

A

1) Named entity recognition (The parliament in Berlin …)
- Which parliament? Which Berlin? Two named entities; use the context of the text to resolve the ambiguity
2) Relationship extraction
- PERSON works for ORGANIZATION
3) Fact extraction
- CITY has population NUMBER

11
Q

What is the difference between Search/Query and Discovery?

A

Search/Query is goal-oriented: you know what you want

  • Structured data: Query processing
  • Text: Information retrieval

Discovery is opportunistic: you don't know in advance which patterns you will identify in your data

  • Structured data: Data Mining
  • Text: Text Mining
12
Q

Explain the text mining process

A
  • Similar to the Data Mining Process
    1 Text preprocessing (syntactic / semantic analysis)
    2 Feature Generation (bag of words)
    3 Feature Selection (Reduce large number)
    4 Data Mining (clustering, classification, association analysis)
    5 Interpretation / Evaluation
13
Q

Which techniques are used for text preprocessing?

A
  • Tokenization
  • Stopword Removal
  • Stemming
  • POS Tagging
14
Q

Which syntactic and linguistic text preprocessing techniques exist?

A
  • Simple Syntactic Processing
    • Text cleanup (remove punctuation / HTML tags)
    • Tokenization (break text into single words)
  • Advanced Linguistic Processing
    • Word Sense Disambiguation (determine the sense of a word / normalize synonyms / pronouns)
    • Part of Speech (POS) Tagging (determine the function of each term: nouns, verbs, …)
      (depending on the task, you might only be interested in nouns or verbs)
15
Q

Explain Stopword Removal

A
  • Many of the most frequently used words in English are likely to be useless
  • They are called stopwords (the, and, to, is, that)
  • Domain-specific stopword lists may also be constructed

You should remove stopwords:

  • To reduce the data set size (they account for 20-30% of the total word count)
  • To improve the effectiveness of text mining methods (they might confuse the mining algorithm)
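The two preprocessing steps above can be sketched in a few lines. This is a minimal illustration: the tiny stopword list and regex tokenizer are simplified stand-ins for the standard lists and tokenizers used in practice.

```python
import re

# Illustrative stopword list -- in practice a standard list (e.g. NLTK's)
# or a domain-specific one would be used.
STOPWORDS = {"the", "and", "to", "is", "that", "a", "of", "in"}

def tokenize(text):
    """Lowercase and split on non-letter characters (very simple tokenizer)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(tokenize("The parliament in Berlin is debating the budget.")))
# -> ['parliament', 'berlin', 'debating', 'budget']
```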
16
Q

What is Stemming?

A
  • Techniques to find the stem of a word
    words: user, users, used, using -> stem: use
    words: engineering, engineered -> stem: engineer

Usefulness for Text Mining:

  • Improves the effectiveness of text mining methods (similar words match)
  • Reduces the term vector size (may shrink the term vector by as much as 40-50%)
17
Q

What are the basic stemming rules?

A
  • Remove endings
    • if a word ends with a consonant other than s, followed by an s, then delete the s
    • if a word ends in es, drop the s
    • remove the ing from a word that ends with ing if the remaining word has more than one letter and is not th (thing)
    • if a word ends with ed preceded by a consonant, delete the ed unless this leaves only a single letter
  • Transform words
    • if a word ends with ies but not eies or aies, then
      ies -> y
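The rules above translate almost directly into code. This toy stemmer applies only the card's rules (it is not a full Porter stemmer, so its output can differ from the `use` / `engineer` examples on the previous card for some inputs).

```python
def simple_stem(word):
    """Toy stemmer applying the basic rules from the card."""
    # transform: ies -> y (but not eies or aies)
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # remove ing if the remainder has more than one letter and is not "th"
    if word.endswith("ing"):
        rest = word[:-3]
        return rest if len(rest) > 1 and rest != "th" else word
    # remove ed preceded by a consonant, unless only a single letter remains
    if word.endswith("ed") and len(word) > 3 and word[-3] not in "aeiou":
        return word[:-2]
    # word ends in es: drop the s
    if word.endswith("es"):
        return word[:-1]
    # consonant other than s, followed by s: drop the s
    if (word.endswith("s") and len(word) > 1
            and not word.endswith("ss") and word[-2] not in "aeiou"):
        return word[:-1]
    return word

print(simple_stem("users"), simple_stem("engineered"), simple_stem("cities"))
# -> user engineer city
```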
18
Q

What are feature generation methods?

A

1) Bag-of-Words

2) Word Embeddings

19
Q

Explain the Bag-of-words feature generation

A
  • Document is treated as bag of words (each word/term becomes a feature; order of words is ignored)
  • Document is represented as a vector
20
Q

Briefly explain the three different techniques for vector creation: binary term occurrence, term occurrence, and term frequency

A

1) Binary term occurrence: a Boolean attribute describes whether or not a term appears in the document (no matter how often)
2) Term occurrence: the number of occurrences of a term in the document (problematic for texts of different lengths)
3) Term frequency: the frequency with which a term appears (number of occurrences / number of words in the document; suitable for documents of different lengths)
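The three vector-creation techniques can be sketched as follows; the small example vocabulary and document are illustrative only.

```python
from collections import Counter

def binary_term_occurrence(tokens, vocab):
    """1 if the term appears in the document at all, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]

def term_occurrence(tokens, vocab):
    """Raw count of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def term_frequency(tokens, vocab):
    """Count divided by document length, normalizing for length."""
    counts = Counter(tokens)
    return [counts[term] / len(tokens) for term in vocab]

vocab = ["text", "mining", "data"]
doc = ["text", "mining", "text"]
print(binary_term_occurrence(doc, vocab))  # [1, 1, 0]
print(term_occurrence(doc, vocab))         # [2, 1, 0]
```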

21
Q

Explain the TF-IDF Term Weighting Schema for feature generation (Term frequency inverse document frequency)

A
  • Extension of term frequency that evaluates how important a word is to a corpus of documents
  • Multiplication of TF and IDF
    TF: Term Frequency
    IDF: total number of docs in the corpus / number of docs in which the term appears (usually dampened with a logarithm)
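A compact sketch of the scheme; note that the card defines IDF as the plain ratio N / df, while the logarithm used here is the commonly applied dampened variant.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for a list of tokenized documents:
    weight(t, d) = tf(t, d) * log(N / df(t))."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = [["data", "mining"], ["text", "mining"], ["mining", "mining", "data"]]
print(tf_idf(docs)[1])  # "text" (rare) gets weight; "mining" (in every doc) gets 0
```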
22
Q

How does the TF-IDF distribute weights to words?

A
  • Gives more weight to rare words (a term that appears in a small fraction of documents might be useful)
  • Gives less weight to common words (domain-specific stopwords)
23
Q

Explain the feature generation method Word Embeddings

A
  • Each word is represented as a vector of real numbers (distributed representation)
  • Semantically related words end up at similar locations in the vector space (embeddings deal better with synonyms)
  • Embeddings are calculated based on the assumption that similar words appear in similar contexts (distributional similarity)
24
Q

How can you conduct feature selection for text mining?

A
  • High-dimensional data makes learning difficult for some learners; two selection techniques help:
  • Pruning Document Vectors
  • Filter Tokens by POS Tags
25
Q

How can you prune document vectors?

A
  • Specify if and how too frequent or too infrequent words should be ignored
  • Options: percentage-based or absolute thresholds
  • Could lead to overfitting if only rare (infrequently occurring) terms are learned
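A minimal sketch of percentage-based pruning; the thresholds and counts are illustrative.

```python
def prune_vocabulary(doc_freq, n_docs, min_ratio=0.01, max_ratio=0.5):
    """Keep only the terms whose document frequency lies between the two
    percentage thresholds (absolute thresholds would work analogously)."""
    return {term for term, df in doc_freq.items()
            if min_ratio <= df / n_docs <= max_ratio}

# "the" occurs everywhere, "zyzzyva" almost nowhere -- both get pruned
print(prune_vocabulary({"the": 100, "mining": 30, "zyzzyva": 1}, 100, 0.05, 0.5))
# -> {'mining'}
```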
26
Q

How can you filter tokens by POS tags?

A
  • Sometimes you want to focus on certain classes of words
  • Adjectives for sentiment analysis (good, bad, great)
  • Nouns for text clustering
27
Q

Which methods can be used for pattern discovery in text mining?

A

1) Cluster Analysis

2) Classification

3) Association Analysis
28
Q

Explain Document Clustering (Goal, Applications, Main Question)

A

Goal: Given a set of documents and a similarity measure among documents, find clusters such that documents in one cluster are more similar to one another, and documents in separate clusters are less similar to one another

Applications:
  • Topical clustering of news stories
  • Email message thread identification

Main Question:
  • Which similarity measures are a good choice for comparing document vectors? (the similarity function depends on the vector creation method)
29
Q

Explain the Jaccard coefficient and with which vector creation method it works well

A
  • Works well for measuring the similarity of vectors with asymmetric binary attributes
  • Number of 1-1 matches / number of not-both-zero attribute values
  • Used together with binary term occurrence vectors: 1 represents the occurrence of a specific word, 0 represents its absence
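The definition above maps directly to code; a minimal sketch on binary term-occurrence vectors:

```python
def jaccard(u, v):
    """Jaccard coefficient for binary term-occurrence vectors:
    number of 1-1 matches / number of positions that are not both zero."""
    m11 = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    not_both_zero = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return m11 / not_both_zero if not_both_zero else 0.0

print(jaccard([1, 1, 0, 1], [1, 0, 0, 1]))  # 2 matches / 3 non-zero positions
```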
30
Q

Explain cosine similarity and with which vector creation method it works well

A
  • For comparing weighted document vectors (term-frequency or TF-IDF vectors)
  • Uses the vector dot product and the lengths of the vectors (the dot product only takes words into account that appear in both documents)
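A minimal implementation of the formula described above, dot product divided by the product of the vector lengths:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two weighted document vectors
    (e.g. term-frequency or TF-IDF vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [0, 1]))  # no shared terms -> 0.0
```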
31
Q

How does the combination of cosine similarity and TF-IDF work?

A

1) Represent documents as vectors of TF-IDF weights

2) Determine the similarity between the documents based on the TF-IDF vectors
32
Q

How do you determine embedding-based similarity?

A

1) Translate documents into embedding vectors (e.g. with doc2vec)

2) Calculate the similarity of the document embedding vectors (cosine similarity, neural nets)
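The two steps above can be sketched without a trained model by averaging word vectors as a simple stand-in for doc2vec; the two-dimensional embedding values here are invented for illustration (real embeddings are learned and have hundreds of dimensions).

```python
import math

# Toy word embeddings (illustrative values only)
EMBEDDINGS = {
    "car":   [0.9, 0.1],
    "auto":  [0.8, 0.2],
    "fruit": [0.1, 0.9],
    "apple": [0.2, 0.8],
}

def doc_embedding(tokens):
    """Average the word vectors -- a simple stand-in for doc2vec."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# "car auto" ends up far from "fruit apple" in the embedding space
print(cosine(doc_embedding(["car", "auto"]), doc_embedding(["fruit", "apple"])))
```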
33
Q

Explain Document Classification (Goal, Applications)

A

Goal: Given a collection of labeled documents (training data), find a model for the class that can assign a class to a previously unseen document as accurately as possible

Applications:
  • Topical classification of news stories or web pages
  • SPAM detection
  • Sentiment analysis
34
Explain Classification Methods for Document Classification
1) Naive bayes (can handle lots of features) 2) support vector machines (requires a good hyperparameter tuning) 3) Recurrent neural networks 4) KNN or random forest also works
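Of the methods above, Naive Bayes is compact enough to sketch. This is a minimal multinomial Naive Bayes with add-one smoothing, illustrative rather than production-ready:

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
            self.vocab.update(doc)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        V = len(self.vocab)
        best, best_lp = None, float("-inf")
        for c in self.classes:
            lp = self.priors[c]  # log prior + sum of smoothed log likelihoods
            for t in doc:
                lp += math.log((self.counts[c][t] + 1) / (self.totals[c] + V))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

nb = TinyNaiveBayes().fit(
    [["good", "great"], ["great", "fun"], ["bad", "awful"], ["bad", "boring"]],
    ["pos", "pos", "neg", "neg"])
print(nb.predict(["great", "good"]))  # -> pos
```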
35
Q

How would you implement a sentiment analysis?

A
  • Treat it as a supervised classification task (needs labeled training data)
  • Be careful when preprocessing: punctuation (smileys such as ":)"), visual markup, and the amount of capitalization might carry valuable features
  • Replace smileys or visual markup with sentiment words in preprocessing, e.g. ":)" -> great, "COOL" -> cool
36
Q

How can you obtain labeled data for sentiment analysis?

A
  • Labeling is expensive
  • Reviews from the web may be used as labeled data (e.g. the Amazon Product Data)
37
Q

How can you find selective words?

A
  • Weight words according to their correlation with the class label
  • Select the top-k words with the highest correlation
  • Helpful for all text classification tasks
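One way to realize this is Pearson correlation between a word's binary occurrence column and a binary class label; the word columns and labels below are illustrative.

```python
def class_correlation(word_presence, labels):
    """Pearson correlation between a binary word-occurrence column and a
    binary class label (both lists of 0/1)."""
    n = len(labels)
    mean_w = sum(word_presence) / n
    mean_y = sum(labels) / n
    cov = sum((w - mean_w) * (y - mean_y)
              for w, y in zip(word_presence, labels)) / n
    var_w = sum((w - mean_w) ** 2 for w in word_presence) / n
    var_y = sum((y - mean_y) ** 2 for y in labels) / n
    if var_w == 0 or var_y == 0:
        return 0.0
    return cov / (var_w ** 0.5 * var_y ** 0.5)

def top_k_words(word_columns, labels, k):
    """Select the k words whose occurrence correlates most with the label."""
    scored = sorted(word_columns.items(),
                    key=lambda kv: abs(class_correlation(kv[1], labels)),
                    reverse=True)
    return [word for word, _ in scored[:k]]

columns = {"great": [1, 1, 0, 0], "the": [1, 1, 1, 1], "bad": [0, 0, 1, 1]}
print(top_k_words(columns, [1, 1, 0, 0], 2))  # "the" carries no signal
```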
38
Q

How can sentiment lexicons help you with sentiment analysis?

A
  • They help the classifier generalize better, because the lexicons can contain words that might not appear in the training data
39
Q

What is the main challenge in Text Mining?

A
  • Preprocessing and vectorization (in order to be able to apply standard data mining algorithms)
40
Q

Which vectorization technique is most commonly used in practice?

A
  • Embeddings