Text Mining Flashcards

1
Q

What is the motivation for text mining?

A
  • 90% of the world's data is in an unstructured format

e.g. web pages, emails, corporate documents, scientific papers

2
Q

Define Text Mining

A

The extraction of implicit, previously unknown and potentially useful information from large amounts of textual resources

3
Q

What are some application areas for Text Mining?

A
  1. Classification of news stories
  2. SPAM detection
  3. Sentiment analysis
  4. Clustering of documents or web pages
4
Q

Give an example of a mixture of document clustering and classification

A

Google News first clusters related news articles. Afterwards it classifies the clustered articles

5
Q

What is the goal of sentiment analysis?

A
  • To determine the polarity of a given text at the document, sentence, or feature/aspect level
  • Polarity values (positive, neutral, negative)
6
Q

For which areas can you apply sentiment analysis?

A

On document level: analysis of a whole document (e.g. tweets about a president)

On feature/aspect level: analysis of product reviews (polarity values for different features within a review)

7
Q

Explain search log mining

A
  • Analysis of search queries issued by large user communities
8
Q

What are application areas for search log mining?

A

1) Search term auto-completion (association analysis)

2) Query topic detection (classification)

9
Q

What is information extraction?

A
  • The task of automatically extracting structured information from unstructured or semi-structured documents
10
Q

What are the subtasks of information extraction?

A

1) Named entity recognition (The parliament in Berlin …)
- Which parliament, which Berlin? Two named entities; use the context of the text to resolve the ambiguity
2) Relationship extraction
- PERSON works for ORGANIZATION
3) Fact extraction
- CITY has population NUMBER

11
Q

What is the difference between Search/Query and Discovery?

A

Search/Query is goal-oriented: you know what you want

  • Structured data: Query processing
  • Text: Information retrieval

Discovery is opportunistic: you don't know in advance which patterns you will identify in your data

  • Structured data: Data Mining
  • Text: Text Mining
12
Q

Explain the text mining process

A
  • Similar to the Data Mining Process
    1) Text preprocessing (syntactic / semantic analysis)
    2) Feature generation (bag of words)
    3) Feature selection (reduce the large number of features)
    4) Data mining (clustering, classification, association analysis)
    5) Interpretation / evaluation
13
Q

Which techniques are used for text preprocessing?

A
  • Tokenization
  • Stopword Removal
  • Stemming
  • POS Tagging
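
A minimal sketch of this preprocessing chain with NLTK (an assumed library choice; the punkt, stopwords, and averaged_perceptron_tagger data packages must be downloaded once via nltk.download):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The users were using the engineered search engine."

tokens = nltk.word_tokenize(text)                          # tokenization
tokens = [t.lower() for t in tokens if t.isalpha()]        # drop punctuation
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stopword removal
stems  = [PorterStemmer().stem(t) for t in tokens]         # stemming
tagged = nltk.pos_tag(nltk.word_tokenize(text))            # POS tagging

print(stems)   # e.g. ['user', 'use', 'engin', 'search', 'engin']
print(tagged)  # [('The', 'DT'), ('users', 'NNS'), ('were', 'VBD'), ...]
```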
14
Q

Which syntactic and linguistic text preprocessing techniques exist?

A
  • Simple Syntactic Processing
    • Text cleanup (remove punctuation / HTML tags)
    • Tokenization (break text into single words)
  • Advanced Linguistic Processing
    • Word Sense Disambiguation (determine the sense of a word / normalize synonyms / pronouns)
    • Part-of-Speech (POS) Tagging (determine the function of each term: nouns, verbs, ...)
    • (Depending on the task you might only be interested in nouns or verbs)
15
Q

Explain Stopword Removal

A
  • Many of the most frequently used words in English are likely to be useless
  • These are called stopwords (the, and, to, is, that)
  • Domain-specific stopword lists may be constructed

You should remove stopwords:

  • To reduce the data set size (they account for 20-30% of the total word count)
  • To improve the effectiveness of text mining methods (they might confuse the mining algorithm)
16
Q

What is Stemming?

A
  • Techniques to find the stem of a word
    words: user, users, used, using -> stem: use
    words: engineering, engineered -> stem: engineer

Usefulness for Text Mining:

  • Improve the effectiveness of text mining methods (matching of similar words)
  • Reduce the term vector size (by as much as 40-50%)
17
Q

What are the basic stemming rules?

A
  • Remove endings
    • if a word ends with a consonant other than s, followed by an s, then delete the s
    • if a word ends in es, drop the s
    • remove the ing from a word that ends with ing, unless the remaining word has only one letter or is th (so thing is kept)
    • if a word ends with ed preceded by a consonant, delete the ed unless this leaves only a single letter
  • Transform words
    • if a word ends with ies, but not eies or aies, then
      ies -> y
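
A toy Python sketch of exactly these rules (illustrative only; real systems use full algorithms such as Porter's stemmer):

```python
def basic_stem(word: str) -> str:
    """Toy stemmer implementing the rules above (illustrative only)."""
    # transform: ies -> y (but not eies or aies)
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"                  # cities -> city
    # es: drop only the s
    if word.endswith("es"):
        return word[:-1]                        # uses -> use
    # consonant other than s, followed by s: drop the s
    if word.endswith("s") and len(word) > 1 and word[-2] not in "aeious":
        return word[:-1]                        # users -> user
    # ing: drop it unless only one letter or "th" would remain
    if word.endswith("ing") and len(word[:-3]) > 1 and word[:-3] != "th":
        return word[:-3]                        # engineering -> engineer
    # ed preceded by a consonant: drop it unless a single letter remains
    if word.endswith("ed") and len(word) > 3 and word[-3] not in "aeiou":
        return word[:-2]                        # engineered -> engineer
    return word

print([basic_stem(w) for w in ["users", "uses", "engineering", "thing", "cities"]])
# ['user', 'use', 'engineer', 'thing', 'city']
```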
18
Q

What are feature generation methods?

A

1) Bag-of-Words

2) Word Embeddings

19
Q

Explain the Bag-of-words feature generation

A
  • Document is treated as bag of words (each word/term becomes a feature; order of words is ignored)
  • Document is represented as a vector
20
Q

Briefly explain the three different techniques for vector creation: binary term occurrence, term occurrence, and term frequency

A

1) Binary term occurrence: a Boolean attribute describes whether or not a term appears in the document (no matter how often)
2) Term occurrence: the number of occurrences of a term in the document (problematic for texts of different lengths)
3) Term frequency: the frequency with which a term appears (number of occurrences / number of words in the document; suitable for documents of different lengths)
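
A minimal sketch of the three variants using scikit-learn's CountVectorizer (an assumed implementation choice; any vectorizer works):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog chased the cat", "the cat sat"]

# 1) binary term occurrence: 1 if the term appears, no matter how often
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# 2) term occurrence: raw number of occurrences per document
counts = CountVectorizer().fit_transform(docs).toarray()

# 3) term frequency: occurrences divided by document length,
#    which makes documents of different lengths comparable
freqs = counts / counts.sum(axis=1, keepdims=True)
```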

21
Q

Explain the TF-IDF term weighting scheme for feature generation (term frequency / inverse document frequency)

A
  • Extension of term frequency that evaluates how important a word is to a document within a corpus of documents
  • Multiplication of TF and IDF
    TF: term frequency
    IDF: log( total number of docs in corpus / number of docs in which the term appears )
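
A minimal sketch in plain Python, assuming the common log-scaled IDF variant:

```python
import math

docs = [["data", "mining"], ["text", "mining", "mining"], ["text", "data"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)              # term frequency in this document
    df = sum(1 for d in corpus if term in d)     # number of docs containing the term
    idf = math.log(len(corpus) / df)             # inverse document frequency
    return tf * idf

print(tf_idf("mining", docs[1], docs))  # frequent in doc, appears in 2 of 3 docs
print(tf_idf("text", docs[1], docs))
```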
22
Q

How does the TF-IDF distribute weights to words?

A
  • Gives more weight to rare words (a term that appears in a small fraction of documents might be useful)
  • Gives less weight to common words (domain-specific stopwords)
23
Q

Explain the feature generation method Word Embeddings

A
  • Each word is represented as a vector of real numbers (distributed representation)
  • Semantically related words end up at similar location in the vector space (embeddings deal better with synonyms)
  • Embeddings are calculated based on the assumption that similar words appear in similar contexts (distributional similarity)
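
A small sketch with gensim's Word2Vec (an assumed library choice; gensim 4.x API, and a real model needs a far larger corpus):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens
sentences = [
    ["text", "mining", "extracts", "patterns", "from", "documents"],
    ["data", "mining", "extracts", "patterns", "from", "tables"],
]

# Learn 50-dimensional vectors from the word contexts
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

vec = model.wv["mining"]              # the word's dense vector
print(model.wv.most_similar("text"))  # neighbors in the embedding space
```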
24
Q

How can you conduct feature selection for text mining?

A
  • High-dimensional data makes learning difficult for some learners, so the number of features should be reduced
  • Pruning Document Vectors
  • Filter Tokens by POS Tags
25
Q

How can you prune document vectors?

A
  • Specify if and how too frequent or too infrequent words should be ignored
    Options:
  • Percentual (relative to the number of documents)
  • Absolute

-> Learning only from rare (infrequently occurring) terms could lead to overfitting
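
In scikit-learn, for example, both pruning options map onto the min_df / max_df parameters of a vectorizer (a float is read as a fraction of documents, an int as an absolute count):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Ignore terms that occur in fewer than 2 documents (absolute)
# or in more than 80% of all documents (percentual)
vectorizer = CountVectorizer(min_df=2, max_df=0.8)
```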

26
Q

How can you filter tokens by POS Tags?

A
  • Sometimes you want to focus on certain classes of words
  • Adjectives for sentiment analysis (good, bad, great)
  • Nouns for text clustering
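
A sketch with NLTK's POS tagger (tags follow the Penn Treebank tagset: JJ = adjective, NN/NNS = noun):

```python
import nltk  # assumes punkt and averaged_perceptron_tagger are downloaded

text = "The great camera takes bad pictures in low light"
tagged = nltk.pos_tag(nltk.word_tokenize(text))

adjectives = [w for w, tag in tagged if tag.startswith("JJ")]  # for sentiment analysis
nouns      = [w for w, tag in tagged if tag.startswith("NN")]  # for text clustering

print(adjectives)  # expected: ['great', 'bad', 'low']
print(nouns)       # expected: ['camera', 'pictures', 'light']
```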
27
Q

Which methods can be used for pattern discovery in text mining?

A

1) Cluster Analysis
2) Classification
3) Association Analysis

28
Q

Explain Document Clustering (Goal, Applications, Main Question)

A
  • Given a set of documents and a similarity measure among documents, find clusters such that documents in one cluster are more similar to one another, while documents in separate clusters are less similar to one another

Applications:

  • Topical clustering of news stories
  • Email message thread identification

Main Question:
- Which similarity measures are a good choice for comparing document vectors? (the similarity function depends on the vector creation method)

29
Q

Explain the Jaccard coefficient. With which vector creation method does it work well?

A
  • Works well for measuring the similarity of vectors with asymmetric binary attributes
  • Number of 1-1 matches / number of not-both-zero attribute values: J = M11 / (M01 + M10 + M11)
  • Used together with binary term occurrence vectors
    1 represents the occurrence of a specific word
    0 represents the absence of a specific word
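
A minimal sketch of the coefficient on binary term occurrence vectors:

```python
def jaccard(a, b):
    """Jaccard coefficient: number of 1-1 matches / not-both-zero attributes."""
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    not_both_zero = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return m11 / not_both_zero if not_both_zero else 0.0

d1 = [1, 1, 0, 1, 0]    # 1 = word occurs in the document, 0 = absent
d2 = [1, 0, 0, 1, 1]
print(jaccard(d1, d2))  # 2 matches / 4 not-both-zero = 0.5
```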
30
Q

Explain cosine similarity. With which vector creation method does it work well?

A
  • For comparing weighted document vectors (term frequency or TF-IDF vectors)
  • Uses the vector dot product and the lengths of the vectors: cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||); the dot product only takes words into account that appear in both documents
31
Q

How does the combination of cosine similarity and TF-IDF work?

A

1) Represent documents as vectors of TF-IDF weights

2) Determine the similarity between the documents based on the TF-IDF vectors
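
A sketch of this two-step combination with scikit-learn (an assumed implementation choice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "text mining finds patterns in documents",
    "data mining finds patterns in databases",
    "the weather is nice today",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # 1) TF-IDF weight vectors
sims = cosine_similarity(tfidf)                # 2) pairwise cosine similarities

print(sims.round(2))  # the first two documents should be most similar
```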

32
Q

How to determine the embedding-based similarity?

A

1) Translate documents into embedding vectors (e.g. with doc2vec)
2) Calculate similarity of document embedding vectors (cosine similarity, neural nets)
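
A small sketch with gensim's Doc2Vec, which the card mentions (gensim 4.x API; real corpora need many more documents):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["text", "mining", "finds", "patterns"], tags=[0]),
    TaggedDocument(words=["data", "mining", "finds", "patterns"], tags=[1]),
]

# 1) Train document embeddings
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# 2) Embed a new document and compare it to the training documents
vec = model.infer_vector(["text", "mining", "patterns"])
print(model.dv.most_similar([vec]))  # cosine similarity in embedding space
```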

33
Q

Explain Document Classification (Goal, Applications)

A

Goal: Given a collection of labeled documents (training data), find a model that assigns a class to a previously unseen document as accurately as possible

Applications:

  • Topical classification of news stories or web pages
  • SPAM detection
  • Sentiment analysis
34
Q

Explain Classification Methods for Document Classification

A

1) Naive Bayes (can handle lots of features)
2) Support vector machines (require good hyperparameter tuning)
3) Recurrent neural networks
4) KNN or random forests also work
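
A sketch of the first option, a Naive Bayes text classifier with scikit-learn (toy data for illustration):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled documents (training data)
docs   = ["cheap pills buy now", "meeting agenda attached",
          "win money now", "see you at lunch"]
labels = ["spam", "ham", "spam", "ham"]

# Vectorize, then train Naive Bayes (copes well with many sparse features)
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)

print(clf.predict(["buy cheap pills"]))  # expected: ['spam']
```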

35
Q

How would you implement a sentiment analysis?

A
  • Use a supervised classification approach (needs training data, i.e. pairs like (text, polarity label))
  • Be careful when preprocessing
    • Punctuation (smileys such as :) ), visual markup, and the amount of capitalization might carry valuable features
      -> Replace smileys or visual markup with sentiment words in preprocessing
      :) -> great, COOL -> cool cool
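
A sketch of such a replacement step with regular expressions (the concrete mapping is an illustrative assumption):

```python
import re

def preprocess_sentiment(text: str) -> str:
    text = re.sub(r"[:;]-?\)", " great ", text)   # :) ;-) -> sentiment word
    text = re.sub(r"[:;]-?\(", " bad ", text)     # :( -> sentiment word
    # duplicate fully capitalized words so the emphasis survives lowercasing
    text = re.sub(r"\b([A-Z]{2,})\b", lambda m: m.group(1) + " " + m.group(1), text)
    return text.lower()

print(preprocess_sentiment("COOL movie :)"))  # -> 'cool cool movie  great '
```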
36
Q

How can you obtain labeled data for sentiment analysis?

A
  • Labeling is expensive

- Reviews from the web may be used as labeled data (Amazon Product Data)

37
Q

How can you find selective words?

A
  • Weight words according to their correlation with class label
  • Select top-k words with highest correlation

-> Helpful for all text classification tasks
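
A sketch using the chi-squared statistic as the correlation measure, via scikit-learn's SelectKBest (an assumed implementation choice):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs   = ["great phone love it", "terrible battery broke",
          "love the screen", "broke after a week"]
labels = [1, 0, 1, 0]                    # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(docs)

# Keep only the k words that correlate most with the class label
selector = SelectKBest(chi2, k=3).fit(X, labels)
X_top = selector.transform(X)
```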

38
Q

How can sentiment lexicons help you with sentiment analysis?

A
  • Helps the classifier to generalize better because the lexicons can contain words that might not appear in the training data
39
Q

What is the main challenge in Text Mining?

A
  • Preprocessing and vectorization (in order to be able to apply standard data mining algorithms)
40
Q

Which vectorization technique is most commonly used in practice?

A
  • Embeddings