NLP Basics Flashcards
What is NLP?
Natural Language Processing
- Natural Language (as opposed to Artificial Language)
- Process with Computers
- Combination of Linguistics (study of languages) with Computer Science, information engineering, and AI.
- Text or spoken voice
- Branch of AI that helps computers understand, interpret, and manipulate human language (SAS). Understanding and responding back.
What do we mean by Natural Language?
Natural vs. Artificial Language
- Natural Language evolved gradually over time, largely unconsciously, and is used for daily communication between people. Complex syntax, ambiguous semantics. Contains humor, irony, metaphor, connotation, neologisms.
- Artificial Language is designed, crafted, invented consciously (e.g., programming languages, Elvish from Lord of the Rings, Klingon from Star Trek, Interlingua, and Esperanto, created so that people from Germany, Italy, etc. would have an easy-to-learn international language). Usually rule-based. Readily parsed, unambiguous, subject to regular, consistent rules of interpretation.
- Interesting fact: LISP and Prolog were created for AI, but are not used as much anymore.
- Morse code is not an artificial language. It’s just a code for alphabet and numbers.
- Flag semaphore is not an artificial language, it’s a code.
- Braille is the same thing.
- Sign language is not an artificial language.
- The Oxford English Dictionary has a few hundred thousand words.
- Natural Languages were not designed to be processed by machine
What are the two sides of NLP?
NLP is either NLU or NLG.
- NLU: Natural Language Understanding. Trying to understand language that came from the ordinary world. The goal is a useful, structured interpretation of an input, like speech or written text. E.g., producing topics. We don’t care whether the machine really understands; what matters is that the output is useful and something we can understand. Search is NLU.
- NLG: Natural Language Generation. Making natural language. We feed in something that is not sentences and we want the machine to output sentences. Chatbots are examples of NLG.
Who was the first person to think about whether AI can exist?
More than 2000 years ago: Aristotle, with toys and statues. Could someone create a creature that would fool people into thinking it was real? Aristotle said no. Leibniz considered the question as well, in the late 1600s.
What are the applications of NLU?
Automated Text Annotation
- Tagging (important words and phrases)
- Metadata extraction/generation (data about the document, e.g., author of a document, date)
- Classification (e.g., news article is sports, celebrities)
- Document summarization
Corpus analytics (corpus is document collection)
- Theme extraction
- Clustering
- Taxonomy mapping (mapping documents from one taxonomy to another taxonomy)
- Sentiment analysis (usually across a corpus, e.g., look at billions of tweets and find the ones that have negative emotion, or cluster them)
Search applications
- Query repair (e.g., for a typo, did you mean x?)
- Query refinement (e.g., semantically ambiguous interpretation)
- Result postprocessing (ranking, clustering, encapsulation)
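Query repair ("did you mean x?") can be sketched with Python's standard difflib module. This is a minimal illustration only: the vocabulary KNOWN_TERMS, the helper name suggest_repair, and the 0.8 similarity cutoff are all assumptions made for the example, not how any particular search engine works.

```python
import difflib

# Hypothetical vocabulary of terms the search engine already knows.
KNOWN_TERMS = ["machine learning", "natural language", "sentiment analysis", "translation"]

def suggest_repair(query, vocabulary=KNOWN_TERMS):
    """Return a 'did you mean' suggestion, or None if nothing is close enough."""
    if query in vocabulary:
        return None  # the query already matches a known term
    # cutoff=0.8 is an assumed threshold; lower values suggest more aggressively
    matches = difflib.get_close_matches(query, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(suggest_repair("machin lerning"))
```

A real system would typically also weigh how popular each candidate query is, which this sketch ignores.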
Advanced applications
- Machine translation
- Knowledge discovery
- Question handling (e.g., mapping questions to FAQs)
What are the applications of NLG?
Often a way of making NLU output more digestible to humans.
Text annotation (fancier version of NLU)
- Document summarization, e.g., make a new sentence out of an old sentence and make that the title of the article
Corpus analytics
- Adding labels on top of clusters (e.g., cheesy recipes, spicy recipes)
- Synopsizing corpus-wide topic and/or sentiment trends
Search applications
- Advanced capsule generation
- Advanced query refinement
Advanced applications
- Machine translation
- Knowledge discovery
- Question handling (e.g., asking "Where’s a good view of the Golden Gate Bridge?" and getting back a clarifying question: do you want a vista, hotel, restaurant, etc.?)
What is a token?
A token is the technical name for a sequence of characters — such as hairy, his, or :) — that we want to treat as a group.
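A rough sketch of splitting text into tokens with a regular expression. The pattern and the helper name tokenize are assumptions for illustration; real tokenizers (including NLTK's) handle many more cases.

```python
import re

# Hypothetical pattern, tried left to right:
#   1. simple emoticons such as :) ;-) =D
#   2. words, optionally with an internal apostrophe (don't)
#   3. any other single non-space character (punctuation)
TOKEN_PATTERN = re.compile(r"[:;=][-']?[()DPp]|\w+(?:'\w+)?|\S")

def tokenize(text):
    """Split text into tokens: emoticons, words, and punctuation marks."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("He petted his hairy dog :)"))
```

Here the emoticon :) comes out as one token instead of two punctuation marks, which is the sense of "treat as a group".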
What is a collocation?
A collocation is a sequence of words that occur together unusually often. Examples: red wine, machine learning, social media, flat screen. When you split them up, the parts mean something different. NLTK Text objects have a collocations() method.
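One common way to make "unusually often" concrete is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. The sketch below is plain Python, not NLTK's implementation; the helper name find_collocations, the min_count threshold, and the toy text are assumptions for illustration.

```python
import math
from collections import Counter

def find_collocations(words, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    word_counts = Counter(words)
    pair_counts = Counter(zip(words, words[1:]))
    total_words = len(words)
    total_pairs = max(len(words) - 1, 1)
    scored = []
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores
        p_pair = count / total_pairs
        p_first = word_counts[first] / total_words
        p_second = word_counts[second] / total_words
        # PMI > 0 means the pair occurs more often than chance would predict
        scored.append(((first, second), math.log2(p_pair / (p_first * p_second))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy text, invented for illustration only.
text = ("we drank red wine then the red wine ran out so the host "
        "bought red wine and the guests drank the beer")
for pair, score in find_collocations(text.split()):
    print(pair, round(score, 2))
```

On a corpus this small the scores mean little; the point is only the shape of the computation.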
What is a bigram?
A pair of adjacent words in a text. The bigrams of a text are the list of all such pairs. NLTK has a bigrams() function.
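To show what a bigram list looks like without depending on NLTK, here is a plain-Python sketch. The helper name word_bigrams is hypothetical and is not NLTK's own function.

```python
def word_bigrams(words):
    """Return the list of adjacent word pairs in a sequence of words."""
    # Pair each word with its successor; the last word has no successor.
    return list(zip(words, words[1:]))

words = ["more", "is", "said", "than", "done"]
print(word_bigrams(words))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```

A text of n words has n - 1 bigrams.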
What is a Turing test?
Can a dialogue system, responding to a user’s text input, perform so naturally that we cannot distinguish it from a human-generated response?
What is RoBERTa?
Robustly Optimized BERT Pretraining Approach. An optimized method for pretraining self-supervised NLP systems.
What is NER?
Named Entity Recognition. Locating entities in unstructured text.
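As a toy illustration of "locating entities", the sketch below looks up names from a small dictionary (a gazetteer) and reports where they occur. This is a deliberately simplified stand-in: real NER systems usually rely on trained statistical models and can find names they have never seen. The GAZETTEER contents and the helper name find_entities are assumptions for the example.

```python
import re

# Hypothetical gazetteer mapping known names to entity types.
GAZETTEER = {
    "Golden Gate Bridge": "LOCATION",
    "Amazon": "ORGANIZATION",
    "Alan Turing": "PERSON",
}

def find_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, name, type) for each gazetteer name found in text."""
    entities = []
    for name, entity_type in gazetteer.items():
        # \b keeps us from matching inside a longer word
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append((match.start(), match.end(), name, entity_type))
    return sorted(entities)  # order by position in the text

sentence = "Alan Turing never saw the Golden Gate Bridge."
for start, end, name, entity_type in find_entities(sentence):
    print(start, end, name, entity_type)
```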
What is Document Classification?
NLU application. An example is a spam filter. Amazon does massive-scale classification of products into categories.
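A spam filter can be sketched as a tiny multinomial naive Bayes classifier. This is one classic technique chosen for illustration, not necessarily what any production filter uses; the class name, the whitespace tokenization, and the four training messages are all assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Multinomial naive Bayes over lowercase whitespace tokens, add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            label_total = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word] + 1
                score += math.log(count / (label_total + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for illustration only.
classifier = NaiveBayesClassifier()
classifier.train("win a free prize now", "spam")
classifier.train("free money click now", "spam")
classifier.train("meeting agenda for monday", "ham")
classifier.train("lunch on monday with the team", "ham")

print(classifier.classify("free prize money"))
print(classifier.classify("monday team meeting"))
```

With this toy data the first message is labeled spam and the second ham; a usable filter needs far more training examples.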
What is Search: Query Understanding?
Inferring the intent and meaning of a search engine user’s queries.
- Query Segmentation: partition the query into semantic units
- Query Scoping (NER): map query segments to entity types
- Query Expansion: broaden the query by adding additional phrases/tokens (usually synonyms and abbreviations, e.g., ML or Machine Learning, developer or engineer)
- Query Relaxation: make the query less restrictive by removing tokens (e.g., black propane grill vs. propane grill)
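Query expansion and query relaxation can be sketched in a few lines. The synonym table, the set of optional tokens, and both helper names are hypothetical; a real system would learn or curate these rather than hard-code them.

```python
# Hypothetical synonym table for illustration.
SYNONYMS = {
    "ml": ["machine learning"],
    "developer": ["engineer"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return the query plus variants with one token swapped for a synonym."""
    tokens = query.lower().split()
    variants = [" ".join(tokens)]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [alternative] + tokens[i + 1:]))
    return variants

def relax_query(query, optional_tokens):
    """Drop tokens marked optional so the query becomes less restrictive."""
    kept = [t for t in query.lower().split() if t not in optional_tokens]
    return " ".join(kept)

print(expand_query("ML developer"))
# ['ml developer', 'machine learning developer', 'ml engineer']
print(relax_query("black propane grill", {"black"}))
# propane grill
```

Expansion widens the search by trying all variants; relaxation is typically a fallback when the original query returns too few results.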