Text Mining Flashcards
What is the motivation for text mining?
- 90% of the world's data is in unstructured format
e.g. web pages, emails, corporate documents, scientific papers
Define Text Mining
The extraction of implicit, previously unknown and potentially useful information from large amounts of textual resources
What are some application areas for Text Mining?
- Classification of news stories
- SPAM detection
- Sentiment analysis
- Clustering of documents or web pages
Give an example of a mixture of document clustering and classification
Google News first clusters different news articles. Afterwards, it classifies the news articles.
What is the goal of sentiment analysis?
- To determine the polarity of a given text at the document, sentence, or feature/aspect level
- Polarity values (positive, neutral, negative)
In which areas can you apply sentiment analysis?
On document level: analysis of a whole document (e.g. tweets about a president)
On feature/aspect level: analysis of product reviews (polarity values for different features within a review)
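
A minimal sketch of document-level polarity, assuming a tiny hand-made word lexicon. The word lists and the function name `polarity` are illustrative assumptions and do not come from the cards; real systems rely on much larger lexicons or trained classifiers.

```python
import re

# Hypothetical toy lexicons, for illustration only.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def polarity(text: str) -> str:
    """Return 'positive', 'negative' or 'neutral' for a whole text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Score = number of positive words minus number of negative words.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("The camera is great, I love it"))  # positive
print(polarity("Terrible battery"))                # negative
print(polarity("It arrived on Monday"))            # neutral
```

For the feature/aspect level, the same scoring could be applied per sentence or per mentioned product feature instead of per document.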
Explain search log mining
- Analysis of search queries issued by large user communities
What are application areas for search log mining?
1) Search term auto-completion (association analysis)
2) Query topic detection (classification)
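
A rough sketch of how auto-completion suggestions could be derived from a query log. It uses plain co-occurrence counts as a simplified stand-in for full association analysis; the function name `suggest_completions` and the sample log are made up for illustration.

```python
from collections import Counter

def suggest_completions(term, query_log, k=3):
    """Suggest up to k terms that most often co-occur with `term` in logged queries."""
    co_occurrences = Counter()
    for query in query_log:
        words = query.lower().split()
        if term in words:
            # Count every other distinct word of the query once.
            co_occurrences.update(w for w in set(words) if w != term)
    return [w for w, _ in co_occurrences.most_common(k)]

# Invented example log.
log = ["cheap flights berlin", "flights berlin", "cheap hotels berlin", "flights rome"]
print(suggest_completions("flights", log, k=1))  # ['berlin']
```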
What is information extraction?
- The task of automatically extracting structured information from unstructured or semi-structured documents
What are the subtasks of information extraction?
1) Named entity recognition (The parliament in Berlin …)
- Which parliament, which Berlin? Two named entities; use the context of the text to disambiguate them
2) Relationship extraction
- PERSON works for ORGANIZATION
3) Fact Extraction
- CITY has population NUMBER
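
A minimal sketch of fact extraction for the pattern "CITY has population NUMBER", assuming the fact is phrased as "<City> has a population of <number>". The regular expression, the function name and the sample sentence are illustrative assumptions; real extractors handle far more phrasings and rely on named entity recognition to find the city.

```python
import re

# Hypothetical pattern: a capitalised word, then "has (a) population of", then a number.
FACT_PATTERN = re.compile(
    r"(?P<city>[A-Z][a-z]+) has (?:a )?population of (?P<number>\d[\d,]*)"
)

def extract_population_facts(text):
    """Return (city, population) tuples found in the text."""
    facts = []
    for match in FACT_PATTERN.finditer(text):
        population = int(match.group("number").replace(",", ""))
        facts.append((match.group("city"), population))
    return facts

# Invented example sentence with made-up numbers.
sample = "Springfield has a population of 12,345. Shelbyville has population of 6,789 people."
print(extract_population_facts(sample))
# [('Springfield', 12345), ('Shelbyville', 6789)]
```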
What is the difference between Search/Query and Discovery?
Search/Query is Goal-oriented: You know what you want
- Structured data: Query processing
- Text: Information retrieval
Discovery is opportunistic: You don't know in advance what patterns you will identify in your data
- Structured data: Data Mining
- Text: Text Mining
Explain the text mining process
- Similar to the Data Mining Process
1) Text preprocessing (syntactic / semantic analysis)
2) Feature generation (bag of words)
3) Feature selection (reduce the large number of features)
4) Data mining (clustering, classification, association analysis)
5) Interpretation / evaluation
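
The first three steps above can be sketched as follows. This is a simplified illustration: the function names, the `min_df` threshold and the sample documents are assumptions, and preprocessing is reduced to lowercasing and whitespace splitting.

```python
from collections import Counter

def preprocess(doc):
    """Step 1 (simplified): lowercase and split on whitespace."""
    return doc.lower().split()

def build_vocabulary(token_lists, min_df=2):
    """Step 3: keep only terms that occur in at least `min_df` documents."""
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return sorted(term for term, count in df.items() if count >= min_df)

def bag_of_words(tokens, vocabulary):
    """Step 2: represent a document as term counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

docs = ["text mining finds patterns", "data mining finds patterns in data", "text is data"]
token_lists = [preprocess(d) for d in docs]
vocabulary = build_vocabulary(token_lists)
vectors = [bag_of_words(t, vocabulary) for t in token_lists]

print(vocabulary)  # ['data', 'finds', 'mining', 'patterns', 'text']
print(vectors)     # [[0, 1, 1, 1, 1], [2, 1, 1, 1, 0], [1, 0, 0, 0, 1]]
```

The resulting vectors would then be passed to a clustering, classification or association analysis method in step 4.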
Which techniques are used for text preprocessing?
- Tokenization
- Stopword Removal
- Stemming
- POS Tagging
Which syntactic and linguistic text preprocessing techniques exist?
- Simple Syntactic Processing
  - Text cleanup (remove punctuation / HTML tags)
  - Tokenization (break text into single words)
- Advanced Linguistic Processing
  - Word Sense Disambiguation (determine the sense of a word / normalize synonyms / pronouns)
  - Part of Speech (POS) Tagging (determine the function of each term: nouns, verbs, ...)
  - (Depending on the task, you might only be interested in nouns or verbs)
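
A minimal sketch of the simple syntactic steps (text cleanup and tokenization) using only regular expressions. The function name `clean_and_tokenize` is an assumption, and regex-based tag stripping is only adequate for simple markup; word sense disambiguation and POS tagging need dedicated NLP tooling and are not shown.

```python
import re

def clean_and_tokenize(raw):
    """Remove HTML tags and punctuation, then split into lowercase words."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)        # text cleanup: strip HTML tags
    no_punct = re.sub(r"[^\w\s]", " ", no_tags)   # text cleanup: strip punctuation
    return no_punct.lower().split()               # tokenization on whitespace

print(clean_and_tokenize("<p>Text mining, in short: fun!</p>"))
# ['text', 'mining', 'in', 'short', 'fun']
```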
Explain Stopword Removal
- Many of the most frequently used words in English are likely to be useless for text mining
- They are called stopwords (the, and, to, is, that)
- Domain-specific stopword lists may be constructed
You should remove stopwords:
- To reduce data set size (they account for 20-30% of the total word count)
- To improve the effectiveness of text mining methods (they might confuse the mining algorithm)
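
A minimal sketch of stopword removal. The stopword set contains only the five examples named above, and the sample sentence is invented; real stopword lists are considerably longer.

```python
# Only the five example stopwords from the card; real lists are much longer.
STOPWORDS = {"the", "and", "to", "is", "that"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "the model is trained to detect spam and that is useful".split()
filtered = remove_stopwords(tokens)

print(filtered)                    # ['model', 'trained', 'detect', 'spam', 'useful']
print(len(tokens), len(filtered))  # 11 5
```

A domain-specific list can be passed through the `stopwords` parameter.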
What is Stemming?
- Techniques to find the stem of a word
words: user, users, used, using -> stem: use
words: engineering, engineered -> stem: engineer
Usefulness for Text Mining:
- Improve the effectiveness of text mining methods (similar words are matched)
- Reduce term vector size (may reduce the term vector size by as much as 40-50%)
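
A deliberately naive suffix-stripping sketch to illustrate the idea. The rule set and the function name `naive_stem` are assumptions and this is not an established stemming algorithm. It handles the "engineer" example, but mapping "used" and "using" to "use" would need additional rules, as found in real stemmers.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

words = ["engineering", "engineered", "engineers", "users"]
stems = [naive_stem(w) for w in words]

print(stems)            # ['engineer', 'engineer', 'engineer', 'user']
print(len(set(stems)))  # 2 distinct terms instead of 4
```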