Week 2 Flashcards
What is lexical analysis
figure out the basic units of meaning in a language and their corresponding meanings.
What is Syntactic analysis
figure out how the words in a sentence relate to one another, i.e. decode the structure of the sentence
What is Semantic analysis
figure out the meaning of sentences by meaning of words and syntactic structure.
What is Pragmatic analysis
find out the meaning in context -> speech acts in language, purpose of communication
What is discourse analysis
analyze a large chunk of text with many sentences, taking the connections between sentences and the context into account
What are NLP and TIS
NLP - natural language processing
TIS - text information system
- A TIS can often bypass advanced NLP and still perform well
What is an easy and hard task in TIS
Easy - text classification and retrieval
Hard - machine translation and question answering
How is Text represented
- String of characters
- Word sequence and POS tags
- Entity relation recognition
- Logic predicates
- Speech acts
Deeper NLP -> more human intervention -> less robust
What are Statistical Language models
- represent word sequence by a probability distribution
- is context dependent and a generative model
- different sequence = different probability
What is the Unigram LM
count of the word in the document / total number of words in the document
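A minimal sketch of the unigram estimate, assuming lowercase text and whitespace tokenization. The function name `unigram_mle` and the sample document are illustrative only.

```python
from collections import Counter

def unigram_mle(document: str) -> dict[str, float]:
    """Estimate p(w | d) = count(w, d) / |d| for every word in the document."""
    tokens = document.lower().split()   # assumption: whitespace tokenization
    counts = Counter(tokens)
    total = len(tokens)                 # |d|: total number of words in the document
    return {word: count / total for word, count in counts.items()}

doc = "text mining is fun and text retrieval is useful"
model = unigram_mle(doc)
print(model["text"])               # "text" occurs 2 times out of 9 tokens
print(model.get("network", 0.0))   # a word not in the document gets 0.0
```

The probabilities over the words in the document sum to 1, and any word that never appears in it gets probability zero.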
Challenges of Unigram
- Unseen words = zero probability
Smoothing method in Unigram
Add one to each word's count and add the vocabulary size to the total word count of the document. Or add K instead of one.
Filter out stop words
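A hedged sketch of add-K smoothing combined with stop-word filtering. The stop-word list, the vocabulary, and the name `add_k_unigram` are assumptions made for illustration. With K = 1 this is add-one smoothing.

```python
from collections import Counter

def add_k_unigram(tokens: list[str], vocabulary: set[str], k: float = 1.0) -> dict[str, float]:
    """Smoothed estimate p(w | d) = (count(w, d) + k) / (|d| + k * |V|)."""
    counts = Counter(tokens)
    denominator = len(tokens) + k * len(vocabulary)
    return {word: (counts[word] + k) / denominator for word in vocabulary}

stop_words = {"is", "and"}  # hypothetical stop-word list
raw = "text mining is fun and text retrieval is useful".split()
tokens = [t for t in raw if t not in stop_words]   # filter out stop words first

vocabulary = set(tokens) | {"network"}             # "network" is unseen in the document
smoothed = add_k_unigram(tokens, vocabulary, k=1.0)
print(smoothed["text"])      # seen word: (2 + 1) / (6 + 6)
print(smoothed["network"])   # unseen word: (0 + 1) / (6 + 6), no longer zero
```

The smoothed probabilities still sum to 1 over the vocabulary, because the K added to each of the |V| counts is matched by K * |V| in the denominator.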
Pull Vs Push
Pull (Search engine)
- User takes initiative
- Ad hoc information needs
- Query and browsing
Push (recommendation system)
- System takes initiative
- Stable information need / system knows the user's need
Query vs Browsing
Query
- User enters a set of terms
- System returns relevant documents
- Good with keyword
Browsing
- User navigates into relevant info guided by structure/org of docs
- Good without keyword
Issues of document selection
Classifier is unlikely to be accurate
- over-constrained query (few or no relevant docs returned)
- under-constrained query (too many docs returned)
Not all relevant docs are equally relevant
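A small sketch of why strict document selection is fragile, assuming a Boolean AND classifier over a toy collection. The documents and the `select` helper are made up for illustration.

```python
docs = {
    "d1": "text retrieval with search engines",
    "d2": "text mining and analytics",
    "d3": "recommendation systems push text to users",
}

def select(query_terms: list[str], docs: dict[str, str]) -> list[str]:
    """Return ids of documents that contain every query term (Boolean AND)."""
    wanted = set(query_terms)
    return sorted(doc_id for doc_id, text in docs.items() if wanted <= set(text.split()))

# Over-constrained: no single document contains all four terms.
print(select(["text", "retrieval", "mining", "analytics"], docs))
# Under-constrained: every document contains "text", and the result is unordered by relevance.
print(select(["text"], docs))
```

Both results are plain yes/no sets, so even when documents are returned there is no signal about which of them is more relevant.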