Text Mining Flashcards
Definition of Text Mining
Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text.
Text mining tasks
Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
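As a rough illustration of one task from this list, sentiment analysis, the sketch below scores a text by counting words from a tiny hand-made lexicon. The word lists and the names `sentiment_score` and `classify` are assumptions invented for this example; practical systems rely on much larger lexicons or trained models.

```python
import re

# Hypothetical toy lexicons, invented for this example only.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def sentiment_score(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg


def classify(text: str) -> str:
    """Map the lexicon score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify("The battery is great and I love the screen"))
```

A word-counting approach like this ignores negation and context ("not good" still counts as positive), which is one reason real sentiment analysis goes beyond simple lexicon lookup.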
Definition of Text Analytics
The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.[1]
The term “text analytics” is now used more frequently in business settings, while “text mining” is used in some of the earliest application areas, dating to the 1980s,[4] notably life-sciences research and government intelligence.
Why text analytics
It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text.[5] These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.
The state of text analytics technology and practice
Prof. Marti A. Hearst, in the paper “Untangling Text Data Mining”:
“I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.”
What is natural language processing (NLP)?
Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
Basic concepts in NLP
lexical → syntactic → semantic → inference
Ambiguity is the killer.
Robust and general NLP tends to be shallow while deep understanding doesn’t scale up.
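To make the ambiguity point concrete, the sketch below performs only the lexical step (tokenization) and then looks each word up in a tiny hand-made lexicon of possible parts of speech. The lexicon entries and the function names are assumptions made for this illustration. A shallow lookup of this kind can list the candidate readings of a word, but it cannot choose among them without syntactic or semantic context.

```python
import re

# Hypothetical mini-lexicon: word -> set of possible parts of speech.
LEXICON = {
    "i": {"PRON"},
    "saw": {"VERB", "NOUN"},
    "her": {"PRON", "DET"},
    "duck": {"NOUN", "VERB"},
}


def tokenize(sentence: str) -> list[str]:
    """Lexical step: split a sentence into lowercase word tokens."""
    return re.findall(r"[A-Za-z]+", sentence.lower())


def ambiguous_words(sentence: str) -> list[str]:
    """Return the tokens that have more than one possible part of speech."""
    return [w for w in tokenize(sentence) if len(LEXICON.get(w, ())) > 1]


if __name__ == "__main__":
    # Three of the four words admit more than one reading.
    print(ambiguous_words("I saw her duck"))
```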
What can we do with NLP?
Part-of-speech tagging: >90% accuracy
Parsing: >90% accuracy
The rest: ???
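For a feel of what part-of-speech tagging involves, here is a minimal sketch of a most-frequent-tag baseline: each word receives the tag it carried most often in a tagged training set. The training sentences, the tag names, and the `NOUN` fallback for unknown words are all assumptions invented for this example, and the sketch says nothing about how the accuracy figures on this card were obtained.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training data, invented for this example.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("is", "VERB"), ("rough", "ADJ")],
    [("dogs", "NOUN"), ("bark", "VERB"), ("loudly", "ADV")],
]


def train_unigram_tagger(sentences):
    """Map each word to the tag it received most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NOUN"):
    """Tag each word with its most frequent tag; unknown words get `default`."""
    return [(w, model.get(w, default)) for w in words]


MODEL = train_unigram_tagger(TRAIN)

if __name__ == "__main__":
    print(tag(["the", "bark"], MODEL))
```

In this toy data "bark" appears twice as a verb and once as a noun, so the baseline tags it as a verb even in "the bark", where it is a noun. A tagger that ignores context cannot resolve that kind of ambiguity.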
Text representation and enabled analysis
Text Rep | Generality | Enabled Analysis | Examples of Application
String | | String processing | Compression
Words | | Word relation analysis; topic analysis; sentiment analysis | Thesaurus discovery; topic and opinion related applications
+ Syntactic structures | | Syntactic graph analysis | Stylistic analysis; structure based feature extraction
+ Entities & relations | | Knowledge graph analysis; information network analysis | Discovery of knowledge and opinions about specific entities
+ Logic predicates | | Integrative analysis of scattered knowledge; logic inference | Knowledge assistant for