Text Mining Flashcards
What is the motivation for text mining?
- 90% of the world's data is in unstructured format
e. g. web pages, emails, corporate documents, scientific papers
Define Text Mining
The extraction of implicit, previously unknown and potentially useful information from large amounts of textual resources
What are some application areas for Text Mining?
- Classification of news stories
- SPAM detection
- Sentiment analysis
- Clustering of documents or web pages
Give an example of a mixture of document clustering and classification
Google News first clusters related news articles. Afterwards it classifies the resulting clusters into news categories
What is the goal of sentiment analysis?
- To determine the polarity of a given text at the document, sentence, or feature/aspect level
- Polarity values (positive, neutral, negative)
For which area can you apply sentiment analysis?
On document level: analysis of a whole document (tweets about president)
On feature/aspect level: analysis of product reviews (polarity values for different features within a review)
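The document-level case can be sketched with a tiny lexicon-based counter (the word lists and function name below are illustrative assumptions, not part of the course material):

```python
# Illustrative lexicon-based polarity sketch: count positive vs. negative
# words from a tiny hand-made lexicon (real lexicons are much larger).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def polarity(text: str) -> str:
    """Return 'positive', 'negative', or 'neutral' for a text."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("great camera but terrible battery and poor display"))  # negative
```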
Explain search log mining
- Analysis of search queries issued by large user communities
What are application areas for search log mining
1) Search term auto-completion (association analysis)
2) Query topic detection (classification)
What is information extraction?
- The task of automatically extracting structured information from unstructured or semi-structured documents
What are the subtasks of information extraction
1) Named entity recognition (The parliament in Berlin …)
- Which parliament, which Berlin? Two named entities; use the context of the text to disambiguate them
2) Relationship extraction
- PERSON works for ORGANIZATION
3) Fact Extraction
- CITY has population NUMBER
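The fact-extraction subtask can be sketched with a regular expression (the pattern and function name are illustrative assumptions; real information extraction systems use far more robust methods):

```python
import re

# Illustrative pattern for "CITY has population NUMBER" facts.
PATTERN = re.compile(r"(?P<city>[A-Z][a-z]+) has a population of (?P<number>[\d,]+)")

def extract_population_facts(text: str) -> list[tuple[str, int]]:
    """Return (city, population) pairs matched in the text."""
    return [(m.group("city"), int(m.group("number").replace(",", "")))
            for m in PATTERN.finditer(text)]

print(extract_population_facts("Berlin has a population of 3,769,495."))
# [('Berlin', 3769495)]
```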
What is the difference between Search/Query and Discovery?
Search/Query is Goal-oriented: You know what you want
- Structured data: Query processing
- Text: Information retrieval
Discovery is opportunistic: You don't know in advance which patterns you will identify in your data
- Structured data: Data Mining
- Text: Text Mining
Explain the text mining process
- Similar to the Data Mining Process
1 Text preprocessing (syntactic / semantic analysis)
2 Feature Generation (bag of words)
3 Feature Selection (Reduce large number)
4 Data Mining (clustering, classification, association analysis)
5 Interpretation / Evaluation
Which techniques are used for text preprocessing
- Tokenization
- Stopword Removal
- Stemming
- POS Tagging
Which syntactic and linguistic text preprocessing techniques exist?
- Simple Syntactic Processing
- Text cleanup (remove punctuation / HTML tags)
- Tokenization (break text into single words)
- Advanced Linguistic Processing
- Word Sense Disambiguation (determine the sense of a word / normalize synonyms / pronouns)
- Part of Speech (POS) Tagging (determine the function of each term: nouns, verbs, ...)
- (Depending on the task you might only be interested in nouns or verbs)
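The simple syntactic steps above can be sketched in Python (a minimal illustration; the regex-based tag removal is an assumption, not a full HTML parser):

```python
import re
import string

def clean_and_tokenize(text: str) -> list[str]:
    """Simple syntactic preprocessing: strip HTML tags and punctuation,
    lowercase, and break the text into single word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)  # text cleanup: remove HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return text.lower().split()  # tokenization: break text into single words

print(clean_and_tokenize("<p>Text Mining, explained!</p>"))
# ['text', 'mining', 'explained']
```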
Explain Stopword Removal
- Many of the most frequently used words in English are likely to be useless
- They are called Stopwords (the, and, to, is, that)
- Domain specific stopword list may be constructed
You should remove stopwords:
- To reduce data set size (they account for 20-30% of the total word count)
- To improve the effectiveness of text mining methods (they might confuse the mining algorithm)
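A minimal sketch of stopword removal (the stopword list here is a tiny illustrative subset; real lists are larger and may be domain specific):

```python
# Tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "and", "to", "is", "that", "a", "of", "in"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop stopwords to shrink the data set and avoid confusing the miner."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "parliament", "in", "berlin", "is", "old"]))
# ['parliament', 'berlin', 'old']
```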
What is Stemming?
- Techniques to find the stem of a word
words: user, users, used, using -> stem: use
words: engineering, engineered -> stem: engineer
Usefulness for Text Mining:
- Improve the effectiveness of text mining methods (matching of similar words)
- Reduce term vector size (stemming may reduce the term vector by as much as 40-50%)
What are the basic stemming rules?
- Remove endings
- if a word ends with an s preceded by a consonant other than s, then delete the s
- if a word ends in es, drop the s
- remove the ing from a word that ends with ing if the remaining word has more than one letter and is not th (thing is kept)
- if a word ends with ed preceded by a consonant, delete the ed unless this leaves a single letter
- Transform words
- if a word ends with ies but not eies or aies then
ies -> y
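The rules above can be sketched as a small Python function (an illustration of these basic rules only, not a full stemmer such as Porter's; the rule ordering is my assumption):

```python
VOWELS = set("aeiou")

def stem(word: str) -> str:
    """Apply the basic stemming rules from the card (a sketch, not Porter)."""
    # transform: ies -> y, unless the word ends in eies or aies
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # remove endings
    if word.endswith("es"):  # ends in es: drop the s
        return word[:-1]
    if word.endswith("s") and len(word) > 1 and word[-2] not in VOWELS and word[-2] != "s":
        return word[:-1]  # consonant (other than s) + s: drop the s
    if word.endswith("ing"):
        rest = word[:-3]
        if len(rest) > 1 and rest != "th":  # keep 'thing', keep 1-letter stems
            return rest
    if word.endswith("ed") and len(word) > 3 and word[-3] not in VOWELS:
        return word[:-2]  # consonant + ed: drop the ed (unless 1 letter remains)
    return word

print([stem(w) for w in ["engineering", "engineered", "users", "flies", "thing"]])
# ['engineer', 'engineer', 'user', 'fly', 'thing']
```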
What are feature generation methods?
1) Bag-of-Words
2) Word Embeddings
Explain the Bag-of-words feature generation
- Document is treated as bag of words (each word/term becomes a feature; order of words is ignored)
- Document is represented as a vector
Briefly explain the three different techniques for vector creation: binary term occurrence, term occurrence, and term frequency
1) Binary term occurrence: Boolean attribute describes whether or not a term appears in the document (no matter how often)
2) Term occurrence: number of occurrences of a term in the document (problematic with texts of different length)
3) Term frequency: frequency with which a term appears (number of occurrences / number of words in the document; works for documents of different length)
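The three techniques can be sketched in one function (the function name and toy vocabulary are illustrative):

```python
def vectorize(doc_tokens: list[str], vocab: list[str], mode: str) -> list[float]:
    """Create a document vector from a bag of words.
    mode: 'binary' (does the term occur?), 'count' (term occurrences),
          'freq' (occurrences / number of words, for varying lengths)."""
    counts = [doc_tokens.count(term) for term in vocab]
    if mode == "binary":
        return [1 if c > 0 else 0 for c in counts]
    if mode == "count":
        return counts
    return [c / len(doc_tokens) for c in counts]  # term frequency

doc = ["text", "mining", "mines", "text", "data"]
vocab = ["text", "mining", "data", "web"]
print(vectorize(doc, vocab, "binary"))  # [1, 1, 1, 0]
print(vectorize(doc, vocab, "count"))   # [2, 1, 1, 0]
print(vectorize(doc, vocab, "freq"))    # [0.4, 0.2, 0.2, 0.0]
```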
Explain the TF-IDF Term Weighting Schema for feature generation (Term frequency inverse document frequency)
- Extension of term frequency that evaluates how important a word is to a document within a corpus of documents
- Multiplication of TF and IDF
TF: Term Frequency
IDF: Inverse Document Frequency; total number of docs in the corpus divided by the number of docs in which the term appears (typically log-scaled)
How does the TF-IDF distribute weights to words?
- Gives more weight to rare words (a term that appears in a small fraction of documents might be discriminative)
- Gives less weight to common words (e.g. domain-specific stopwords)
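A minimal sketch, assuming the common log-scaled IDF (the function name and toy corpus are illustrative):

```python
import math

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF = term frequency * log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in corpus)  # docs in which the term appears
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
print(tf_idf("the", corpus[0], corpus))  # 0.0 -> a word in every doc gets no weight
print(tf_idf("cat", corpus[0], corpus))  # > 0 -> rarer word gets weight
```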
Explain the feature generation method Word Embeddings
- Each word is represented as a vector of real numbers (distributed representation)
- Semantically related words end up at similar location in the vector space (embeddings deal better with synonyms)
- Embeddings are calculated based on the assumption that similar words appear in similar contexts (distributional similarity)
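The distributional-similarity idea can be sketched with raw context-count vectors and cosine similarity (real embeddings such as word2vec learn dense vectors instead; this toy version only illustrates the assumption that similar words share contexts):

```python
from collections import Counter
import math

def context_vectors(tokens: list[str], window: int = 2) -> dict[str, Counter]:
    """Represent each word by the counts of words appearing around it."""
    vectors: dict[str, Counter] = {}
    for i, w in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors.setdefault(w, Counter()).update(ctx)
    return vectors

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

tokens = "the cat drinks milk the dog drinks water".split()
vecs = context_vectors(tokens)
# 'cat' and 'dog' share contexts ('the', 'drinks'), so they end up more similar
print(cosine(vecs["cat"], vecs["dog"]) > cosine(vecs["cat"], vecs["milk"]))  # True
```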
How can you conduct feature selection for text mining?
- High-dimensional data makes it difficult for some learners
- Pruning Document Vectors
- Filter Tokens by POS Tags
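Pruning document vectors can be sketched as filtering terms by document frequency (the thresholds min_df and max_df_ratio are illustrative assumptions):

```python
def prune_vocabulary(corpus: list[list[str]],
                     min_df: int = 2, max_df_ratio: float = 0.9) -> list[str]:
    """Keep only terms that appear in enough docs (>= min_df) but not in
    almost all docs (<= max_df_ratio), shrinking the document vectors."""
    vocab = {t for doc in corpus for t in doc}
    n = len(corpus)
    kept = []
    for term in sorted(vocab):
        df = sum(term in doc for doc in corpus)
        if df >= min_df and df / n <= max_df_ratio:
            kept.append(term)
    return kept

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
print(prune_vocabulary(corpus))  # ['cat', 'ran'] -- 'the' too common, rest too rare
```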