Text Mining Flashcards

1
Q

What is the motivation for text mining?

A
  • 90% of the world's data is in an unstructured format

e.g. web pages, emails, corporate documents, scientific papers

2
Q

Define Text Mining

A

The extraction of implicit, previously unknown and potentially useful information from large amounts of textual resources

3
Q

What are some application areas for Text Mining?

A
  1. Classification of news stories
  2. SPAM detection
  3. Sentiment analysis
  4. Clustering of documents or web pages
4
Q

Give an example of a mixture of document clustering and classification

A

Google News first clusters related news articles. Afterwards it classifies the clustered articles

5
Q

What is the goal of sentiment analysis?

A
  • To determine the polarity of a given text at the document, sentence, or feature/aspect level
  • Polarity values (positive, neutral, negative)
6
Q

For which areas can you apply sentiment analysis?

A

On document level: analysis of a whole document (e.g. tweets about a president)

On feature/aspect level: analysis of product reviews (polarity values for different features within a review)

7
Q

Explain search log mining

A
  • Analysis of search queries issued by large user communities
8
Q

What are application areas for search log mining?

A

1) Search term auto-completion (association analysis)

2) Query topic detection (classification)

9
Q

What is information extraction?

A
  • The task of automatically extracting structured information from unstructured or semi-structured documents
10
Q

What are the subtasks of information extraction?

A

1) Named entity recognition (The parliament in Berlin …)
- Which parliament, which Berlin? Two named entities; use the context of the text to resolve the ambiguity
2) Relationship extraction
- PERSON works for ORGANIZATION
3) Fact extraction
- CITY has population NUMBER

11
Q

What is the difference between Search/Query and Discovery?

A

Search/Query is goal-oriented: you know what you want

  • Structured data: Query processing
  • Text: Information retrieval

Discovery is opportunistic: you don't know in advance which patterns you will identify in your data

  • Structured data: Data Mining
  • Text: Text Mining
12
Q

Explain the text mining process

A
  • Similar to the Data Mining Process
    1) Text preprocessing (syntactic / semantic analysis)
    2) Feature generation (bag of words)
    3) Feature selection (reduce the large number of features)
    4) Data mining (clustering, classification, association analysis)
    5) Interpretation / evaluation
13
Q

Which techniques are used for text preprocessing?

A
  • Tokenization
  • Stopword Removal
  • Stemming
  • POS Tagging
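
A minimal sketch of this preprocessing chain with NLTK (an assumed library choice; the punkt, stopwords, and averaged_perceptron_tagger data packages must be downloaded once via nltk.download):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The users were using the engineered search engine."

tokens = nltk.word_tokenize(text)                          # tokenization
tokens = [t.lower() for t in tokens if t.isalpha()]        # drop punctuation
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stopword removal
stems  = [PorterStemmer().stem(t) for t in tokens]         # stemming
tagged = nltk.pos_tag(nltk.word_tokenize(text))            # POS tagging

print(stems)   # e.g. ['user', 'use', 'engin', 'search', 'engin']
print(tagged)  # [('The', 'DT'), ('users', 'NNS'), ('were', 'VBD'), ...]
```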
14
Q

Which syntactic and linguistic text preprocessing techniques exist?

A
  • Simple Syntactic Processing
    • Text cleanup (remove punctuation / HTML tags)
    • Tokenization (break text into single words)
  • Advanced Linguistic Processing
    • Word Sense Disambiguation (determine the sense of a word / normalize synonyms / pronouns)
    • Part-of-Speech (POS) Tagging (determine the function of each term: nouns, verbs, ...)
    • (Depending on the task you might only be interested in nouns or verbs)
15
Q

Explain Stopword Removal

A
  • Many of the most frequently used words in English are likely to be useless
  • These are called stopwords (the, and, to, is, that)
  • Domain-specific stopword lists may be constructed

You should remove stopwords:

  • To reduce the data set size (they account for 20-30% of the total word count)
  • To improve the effectiveness of text mining methods (they might confuse the mining algorithm)
16
Q

What is Stemming?

A
  • Techniques to find the stem of a word
    words: user, users, used, using -> stem: use
    words: engineering, engineered -> stem: engineer

Usefulness for Text Mining:

  • Improve the effectiveness of text mining methods (matching of similar words)
  • Reduce the term vector size (by as much as 40-50%)
17
Q

What are the basic stemming rules?

A
  • Remove endings
    • if a word ends with a consonant other than s, followed by an s, then delete the s
    • if a word ends in es, drop the s
    • remove the ing from a word that ends with ing, unless the remaining word has only one letter or is th (so thing is kept)
    • if a word ends with ed preceded by a consonant, delete the ed unless this leaves only a single letter
  • Transform words
    • if a word ends with ies, but not eies or aies, then
      ies -> y
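
A toy Python sketch of exactly these rules (illustrative only; real systems use full algorithms such as Porter's stemmer):

```python
def basic_stem(word: str) -> str:
    """Toy stemmer implementing the rules above (illustrative only)."""
    # transform: ies -> y (but not eies or aies)
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"                  # cities -> city
    # es: drop only the s
    if word.endswith("es"):
        return word[:-1]                        # uses -> use
    # consonant other than s, followed by s: drop the s
    if word.endswith("s") and len(word) > 1 and word[-2] not in "aeious":
        return word[:-1]                        # users -> user
    # ing: drop it unless only one letter or "th" would remain
    if word.endswith("ing") and len(word[:-3]) > 1 and word[:-3] != "th":
        return word[:-3]                        # engineering -> engineer
    # ed preceded by a consonant: drop it unless a single letter remains
    if word.endswith("ed") and len(word) > 3 and word[-3] not in "aeiou":
        return word[:-2]                        # engineered -> engineer
    return word

print([basic_stem(w) for w in ["users", "uses", "engineering", "thing", "cities"]])
# ['user', 'use', 'engineer', 'thing', 'city']
```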
18
Q

What are feature generation methods?

A

1) Bag-of-Words

2) Word Embeddings

19
Q

Explain the Bag-of-words feature generation

A
  • Document is treated as bag of words (each word/term becomes a feature; order of words is ignored)
  • Document is represented as a vector
20
Q

Briefly explain the three different techniques for vector creation: binary term occurrence, term occurrence, and term frequency

A

1) Binary term occurrence: a Boolean attribute describes whether or not a term appears in the document (no matter how often)
2) Term occurrence: the number of occurrences of a term in the document (problematic for texts of different lengths)
3) Term frequency: the frequency with which a term appears (number of occurrences / number of words in the document; suitable for documents of different lengths)
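
A minimal sketch of the three variants using scikit-learn's CountVectorizer (an assumed implementation choice; any vectorizer works):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog chased the cat", "the cat sat"]

# 1) binary term occurrence: 1 if the term appears, no matter how often
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# 2) term occurrence: raw number of occurrences per document
counts = CountVectorizer().fit_transform(docs).toarray()

# 3) term frequency: occurrences divided by document length,
#    which makes documents of different lengths comparable
freqs = counts / counts.sum(axis=1, keepdims=True)
```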

21
Q

Explain the TF-IDF term weighting scheme for feature generation (term frequency / inverse document frequency)

A
  • Extension of term frequency that evaluates how important a word is to a document within a corpus of documents
  • Multiplication of TF and IDF
    TF: term frequency
    IDF: log( total number of docs in corpus / number of docs in which the term appears )
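
A minimal sketch in plain Python, assuming the common log-scaled IDF variant:

```python
import math

docs = [["data", "mining"], ["text", "mining", "mining"], ["text", "data"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)              # term frequency in this document
    df = sum(1 for d in corpus if term in d)     # number of docs containing the term
    idf = math.log(len(corpus) / df)             # inverse document frequency
    return tf * idf

print(tf_idf("mining", docs[1], docs))  # frequent in doc, appears in 2 of 3 docs
print(tf_idf("text", docs[1], docs))
```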
22
Q

How does the TF-IDF distribute weights to words?

A
  • Gives more weight to rare words (a term that appears in a small fraction of documents might be useful)
  • Gives less weight to common words (domain-specific stopwords)
23
Q

Explain the feature generation method Word Embeddings

A
  • Each word is represented as a vector of real numbers (distributed representation)
  • Semantically related words end up at similar location in the vector space (embeddings deal better with synonyms)
  • Embeddings are calculated based on the assumption that similar words appear in similar contexts (distributional similarity)
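
A small sketch with gensim's Word2Vec (an assumed library choice; gensim 4.x API, and a real model needs a far larger corpus):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens
sentences = [
    ["text", "mining", "extracts", "patterns", "from", "documents"],
    ["data", "mining", "extracts", "patterns", "from", "tables"],
]

# Learn 50-dimensional vectors from the word contexts
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

vec = model.wv["mining"]              # the word's dense vector
print(model.wv.most_similar("text"))  # neighbors in the embedding space
```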
24
Q

How can you conduct feature selection for text mining?

A
  • High-dimensional data makes learning difficult for some learners, so the number of features should be reduced
  • Pruning Document Vectors
  • Filter Tokens by POS Tags
25
Q

How can you prune document vectors?

A
  • Specify if and how too frequent or too infrequent words should be ignored
    Options:
  • Percentual (relative to the number of documents)
  • Absolute

-> Learning only from rare (infrequently occurring) terms could lead to overfitting
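
In scikit-learn, for example, both pruning options map onto the min_df / max_df parameters of a vectorizer (a float is read as a fraction of documents, an int as an absolute count):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Ignore terms that occur in fewer than 2 documents (absolute)
# or in more than 80% of all documents (percentual)
vectorizer = CountVectorizer(min_df=2, max_df=0.8)
```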

26
Q

How can you filter tokens by POS Tags?

A
  • Sometimes you want to focus on certain classes of words
  • Adjectives for sentiment analysis (good, bad, great)
  • Nouns for text clustering
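
A sketch with NLTK's POS tagger (tags follow the Penn Treebank tagset: JJ = adjective, NN/NNS = noun):

```python
import nltk  # assumes punkt and averaged_perceptron_tagger are downloaded

text = "The great camera takes bad pictures in low light"
tagged = nltk.pos_tag(nltk.word_tokenize(text))

adjectives = [w for w, tag in tagged if tag.startswith("JJ")]  # for sentiment analysis
nouns      = [w for w, tag in tagged if tag.startswith("NN")]  # for text clustering

print(adjectives)  # expected: ['great', 'bad', 'low']
print(nouns)       # expected: ['camera', 'pictures', 'light']
```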
27
Q

Which methods can be used for pattern discovery in text mining?

A

1) Cluster Analysis
2) Classification
3) Association Analysis

28
Q

Explain Document Clustering (Goal, Applications, Main Question)

A
  • Given a set of documents and a similarity measure among documents, find clusters such that documents in one cluster are more similar to one another, while documents in separate clusters are less similar to one another

Applications:

  • Topical clustering of news stories
  • Email message thread identification

Main Question:
- Which similarity measures are a good choice for comparing document vectors? (the similarity function depends on the vector creation method)

29
Q

Explain the Jaccard coefficient. With which vector creation method does it work well?

A
  • Works well for measuring the similarity of vectors with asymmetric binary attributes
  • Number of 1-1 matches / number of not-both-zero attribute values: J = M11 / (M01 + M10 + M11)
  • Used together with binary term occurrence vectors
    1 represents the occurrence of a specific word
    0 represents the absence of a specific word
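
A minimal sketch of the coefficient on binary term occurrence vectors:

```python
def jaccard(a, b):
    """Jaccard coefficient: number of 1-1 matches / not-both-zero attributes."""
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    not_both_zero = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return m11 / not_both_zero if not_both_zero else 0.0

d1 = [1, 1, 0, 1, 0]    # 1 = word occurs in the document, 0 = absent
d2 = [1, 0, 0, 1, 1]
print(jaccard(d1, d2))  # 2 matches / 4 not-both-zero = 0.5
```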
30
Q

Explain cosine similarity. With which vector creation method does it work well?

A
  • For comparing weighted document vectors (term frequency or TF-IDF vectors)
  • Uses the vector dot product and the lengths of the vectors: cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||); the dot product only takes words into account that appear in both documents
31
Q

How does the combination of cosine similarity and TF-IDF work?

A

1) Represent documents as vectors of TF-IDF weights

2) Determine the similarity between the documents based on the TF-IDF vectors
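
A sketch of this two-step combination with scikit-learn (an assumed implementation choice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "text mining finds patterns in documents",
    "data mining finds patterns in databases",
    "the weather is nice today",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # 1) TF-IDF weight vectors
sims = cosine_similarity(tfidf)                # 2) pairwise cosine similarities

print(sims.round(2))  # the first two documents should be most similar
```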

32
Q

How to determine the embedding-based similarity?

A

1) Translate documents into embedding vectors (e.g. with doc2vec)
2) Calculate similarity of document embedding vectors (cosine similarity, neural nets)
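
A small sketch with gensim's Doc2Vec, which the card mentions (gensim 4.x API; real corpora need many more documents):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["text", "mining", "finds", "patterns"], tags=[0]),
    TaggedDocument(words=["data", "mining", "finds", "patterns"], tags=[1]),
]

# 1) Train document embeddings
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# 2) Embed a new document and compare it to the training documents
vec = model.infer_vector(["text", "mining", "patterns"])
print(model.dv.most_similar([vec]))  # cosine similarity in embedding space
```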

33
Q

Explain Document Classification (Goal, Applications)

A

Goal: Given a collection of labeled documents (training data), find a model that assigns a class to a previously unseen document as accurately as possible

Applications:

  • Topical classification of news stories or web pages
  • SPAM detection
  • Sentiment analysis
34
Q

Explain Classification Methods for Document Classification

A

1) Naive Bayes (can handle lots of features)
2) Support vector machines (require good hyperparameter tuning)
3) Recurrent neural networks
4) KNN or random forests also work
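
A sketch of the first option, a Naive Bayes text classifier with scikit-learn (toy data for illustration):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled documents (training data)
docs   = ["cheap pills buy now", "meeting agenda attached",
          "win money now", "see you at lunch"]
labels = ["spam", "ham", "spam", "ham"]

# Vectorize, then train Naive Bayes (copes well with many sparse features)
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)

print(clf.predict(["buy cheap pills"]))  # expected: ['spam']
```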

35
Q

How would you implement a sentiment analysis?

A
  • Use a supervised classification approach (needs training data, i.e. pairs like (text, polarity label))
  • Be careful when preprocessing
    • Punctuation (smileys such as :) ), visual markup, and the amount of capitalization might carry valuable features
      -> Replace smileys or visual markup with sentiment words in preprocessing
      :) -> great, COOL -> cool cool
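
A sketch of such a replacement step with regular expressions (the concrete mapping is an illustrative assumption):

```python
import re

def preprocess_sentiment(text: str) -> str:
    text = re.sub(r"[:;]-?\)", " great ", text)   # :) ;-) -> sentiment word
    text = re.sub(r"[:;]-?\(", " bad ", text)     # :( -> sentiment word
    # duplicate fully capitalized words so the emphasis survives lowercasing
    text = re.sub(r"\b([A-Z]{2,})\b", lambda m: m.group(1) + " " + m.group(1), text)
    return text.lower()

print(preprocess_sentiment("COOL movie :)"))  # -> 'cool cool movie  great '
```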
36
Q

How can you obtain labeled data for sentiment analysis?

A
  • Labeling is expensive

- Reviews from the web may be used as labeled data (Amazon Product Data)

37
Q

How can you find selective words?

A
  • Weight words according to their correlation with class label
  • Select top-k words with highest correlation

-> Helpful for all text classification tasks
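
A sketch using the chi-squared statistic as the correlation measure, via scikit-learn's SelectKBest (an assumed implementation choice):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs   = ["great phone love it", "terrible battery broke",
          "love the screen", "broke after a week"]
labels = [1, 0, 1, 0]                    # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(docs)

# Keep only the k words that correlate most with the class label
selector = SelectKBest(chi2, k=3).fit(X, labels)
X_top = selector.transform(X)
```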

38
Q

How can sentiment lexicons help you with sentiment analysis?

A
  • Helps the classifier to generalize better because the lexicons can contain words that might not appear in the training data
39
Q

What is the main challenge in Text Mining?

A
  • Preprocessing and vectorization (in order to be able to apply standard data mining algorithms)
40
Q

Which vectorization technique is most commonly used in practice?

A
  • Embeddings