Chapter 1 Flashcards
Contains concepts from chapter 1 of Manning's book
Natural Language Processing
Concerned with processing natural languages such as English and Mandarin. Involves translating natural language into data that a computer can use to learn about the world.
NLP system
Referred to as a pipeline because it involves several processing stages: natural language flows in one end and processed output flows out the other.
FST (Finite State Transducer)
An FSM (finite state machine) that outputs a sequence of new symbols as it runs.
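A minimal sketch of the idea, assuming a made-up two-state machine (the state names, input symbols and output symbols are hypothetical and chosen only for illustration). Each transition consumes one input symbol and emits one output symbol:

```python
# Hypothetical toy transducer.
# Maps (current state, input symbol) -> (next state, output symbol).
TRANSITIONS = {
    ("start", "a"): ("start", "x"),
    ("start", "b"): ("seen_b", "y"),
    ("seen_b", "a"): ("start", "x"),
    ("seen_b", "b"): ("seen_b", "z"),
}


def transduce(symbols, state="start"):
    """Run the transducer over `symbols` and return the emitted output string."""
    output = []
    for symbol in symbols:
        state, emitted = TRANSITIONS[(state, symbol)]
        output.append(emitted)
    return "".join(output)


print(transduce("abba"))  # xyzx
```

A plain FSM would only report which state it ended in; the transducer also produces the new symbol sequence along the way.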
Formal languages
Languages defined by precise rules (a formal grammar). A formal grammar can be used to generate many natural language statements.
Regular expressions
Patterns written in a special kind of formal language grammar called a regular grammar
Regular grammars
Have predictable, provable behavior, yet are flexible enough to power some sophisticated dialog engines and chatbots
DFA (Deterministic Finite Automaton)
A formal mathematical object that processes a regular language is called a Finite State Machine (FSM) or Deterministic Finite Automaton (DFA)
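As a rough illustration, a DFA can be written as a transition table plus a set of accepting states. The machine below is an assumed example, not one from the book: it accepts strings over the letters a and b that end in "ab".

```python
# Hypothetical DFA: accepts strings over {a, b} that end in "ab".
START = "q0"
ACCEPTING = {"q2"}
DELTA = {
    ("q0", "a"): "q1",  # saw an "a" that may start "ab"
    ("q0", "b"): "q0",
    ("q1", "a"): "q1",
    ("q1", "b"): "q2",  # just completed "ab"
    ("q2", "a"): "q1",
    ("q2", "b"): "q0",
}


def accepts(text):
    """Return True if the DFA ends in an accepting state after reading `text`."""
    state = START
    for ch in text:
        key = (state, ch)
        if key not in DELTA:  # symbol outside the alphabet: reject
            return False
        state = DELTA[key]
    return state in ACCEPTING


print(accepts("aab"))  # True
print(accepts("aba"))  # False
```

The same language is described by the regular expression (a|b)*ab, which is why regular expressions and this kind of machine are usually discussed together.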
Regular expression notation
| - OR
* - preceding character can occur 0 or more times
[] - used to specify a character class
.* - matches any number of consecutive characters
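The notation on this card can be tried out with Python's built-in re module. This is a small sketch; the helper name matches and the sample strings are assumptions made for illustration. Note that by default the dot does not match a newline.

```python
import re


def matches(pattern, text):
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"hi|hello", "hello"))       # | is OR -> True
print(matches(r"hel*o", "heo"))            # * allows zero "l" characters -> True
print(matches(r"hel*o", "hellllo"))        # * allows many "l" characters -> True
print(matches(r"[Hh]ello", "Hello"))       # [] is a character class -> True
print(matches(r"hello.*", "hello there"))  # .* matches any run of characters -> True
print(matches(r"[Hh]ello", "jello"))       # "j" is not in the class -> False
```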
Computational Theory of Mind
CTM assumes human-like NLP can be accomplished with a finite set of logical rules that are processed in series
Distance Metrics (Levenshtein, Jaccard and Euclidean distance)
Useful for applications like spelling correctors and proper-noun recognizers, where an algorithm calculates the distances between words to find spelling errors
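A minimal sketch of the three metrics in plain Python. The function names, the character-set form of Jaccard distance, and the tiny vocabulary are assumptions for illustration; real spelling correctors typically add word frequencies and faster lookups.

```python
import math


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard_distance(a, b):
    """1 minus the overlap of the two character sets (0.0 means identical sets)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


def euclidean(u, v):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def closest_word(word, vocabulary):
    """Toy spelling corrector: pick the vocabulary word with the smallest edit distance."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))


print(levenshtein("kitten", "sitting"))                    # 3
print(jaccard_distance("abc", "abd"))                      # 0.5
print(euclidean((0, 0), (3, 4)))                           # 5.0
print(closest_word("helo", ["hello", "world", "yellow"]))  # hello
```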
Document Representation
A document can be represented as a vector: a sequence of integers, one for each word or token in that document.
Vector space
Different ways that words could be combined to create vectors. Relationships between these vectors make up our model, which tries to predict combinations of words occurring in a collection of various words. These vectors can be represented using a Counter in Python.
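A short sketch of the Counter representation mentioned on this card. The example sentence and the whitespace tokenizer are assumptions; a real tokenizer would handle punctuation and casing more carefully.

```python
from collections import Counter

sentence = "the cat sat on the mat"  # made-up example text
tokens = sentence.lower().split()    # naive whitespace tokenizer

bag_of_words = Counter(tokens)
print(bag_of_words.most_common(1))   # [('the', 2)]

# Turn the counts into a fixed-order vector over the sorted vocabulary.
vocabulary = sorted(bag_of_words)
vector = [bag_of_words[word] for word in vocabulary]
print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(vector)      # [1, 1, 1, 1, 2]
```

A Counter returns 0 for words it has never seen, which is convenient when comparing documents with different vocabularies.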
Disadvantage with bag of words
Does not work well for interpreting the context of sentences (those for which word order is very important)
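One way to see the problem, using two made-up sentences: they mean different things, yet their bags of words are identical because word order is thrown away.

```python
from collections import Counter

first = Counter("dog bites man".split())
second = Counter("man bites dog".split())

# Same words, same counts: the two bags are indistinguishable.
print(first == second)  # True
```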
Disadvantage with one-hot vectors
High-dimensionality space
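A small sketch of why the dimensionality grows, assuming a made-up sentence and a hypothetical helper named one_hot_vectors. Every token becomes a row that is as wide as the whole vocabulary and contains a single 1.

```python
def one_hot_vectors(tokens):
    """Return the sorted vocabulary and one one-hot row per token."""
    vocabulary = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for token in tokens:
        row = [0] * len(vocabulary)  # one column per vocabulary word
        row[index[token]] = 1
        rows.append(row)
    return vocabulary, rows


vocabulary, rows = one_hot_vectors("the cat sat on the mat".split())
print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(rows[0])     # [0, 0, 0, 0, 1]
print(len(rows), "rows x", len(vocabulary), "columns")
```

With a realistic vocabulary of tens of thousands of words, each row would have tens of thousands of columns, nearly all of them zero.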
SyntaxNet and spaCy
Two libraries that provide natural language syntax tree parsers, making it possible to extract syntactic and logical relationships
Chatbot processing stages
1) Parse – extract features (structured numerical data) from natural language text (SOTA: tokenizers, regular expressions, tagging, NER, information extraction)
2) Analyze – generate and combine features by scoring text for sentiment, grammaticality and semantics (a database is typically used) (SOTA: LSTM)
3) Generate – Compose possible responses using templates, search or language models (Search Templates, MCMC, LSTM, FSM)
4) Execute – Plan statements based on conversation history and objectives, and select the next response
A feedback loop (between 1 and 3) is used on generated text responses so that responses can be processed using the same algorithms used to process user statements
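The four stages above can be sketched as four small functions chained together. Everything here is a toy assumption (the function names, the greeting pattern, the word lists and the response templates), and the feedback loop is left out to keep the sketch short.

```python
import re

GREETING = re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE)


def parse(text):
    """Stage 1: extract simple features from raw text."""
    return {"tokens": re.findall(r"\w+", text.lower()),
            "is_greeting": bool(GREETING.search(text))}


def analyze(features):
    """Stage 2: score the features (a toy word-list sentiment score)."""
    positive = {"good", "great", "thanks"}
    negative = {"bad", "awful"}
    tokens = features["tokens"]
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return {**features, "sentiment": score}


def generate(analysis):
    """Stage 3: compose candidate responses from templates."""
    candidates = ["Tell me more."]  # generic fallback comes first
    if analysis["is_greeting"]:
        candidates.append("Hello! How can I help?")
    if analysis["sentiment"] < 0:
        candidates.append("Sorry to hear that.")
    return candidates


def execute(candidates, history):
    """Stage 4: prefer the most specific candidate not already in the history."""
    for candidate in reversed(candidates):
        if candidate not in history:
            history.append(candidate)
            return candidate
    return candidates[0]


def respond(text, history):
    """Run one user statement through all four stages."""
    return execute(generate(analyze(parse(text))), history)


history = []
print(respond("Hello there", history))  # Hello! How can I help?
print(respond("hello again", history))  # Tell me more.
```

The second reply differs from the first because the execute stage consults the conversation history before selecting a response.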
Layers for feature extraction and analysis
Characters -> Tokens -> Tagged tokens -> Syntax tree (fed into POS tagger) -> Entity relationships -> Knowledge base (fed to logical compiler, info extractor)
Inferences
Logical extrapolations from a set of conditions detected in an environment
Fuzzy regular expressions
Find the closest grammar match among possible grammar rules instead of requiring an exact match. Effective for question answering systems and task-execution assistant bots.
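A rough sketch of the closest-match idea. It swaps in approximate string matching from Python's standard difflib module rather than a true fuzzy regular expression engine, and the command list and helper name are hypothetical.

```python
import difflib

# Hypothetical commands a task-execution bot might understand.
COMMANDS = ["set alarm", "play music", "check weather"]


def closest_command(text, cutoff=0.6):
    """Return the known command most similar to `text`, or None if nothing is close."""
    found = difflib.get_close_matches(text.lower(), COMMANDS, n=1, cutoff=cutoff)
    return found[0] if found else None


print(closest_command("play musik"))  # play music
print(closest_command("xyzzy"))       # None
```

The cutoff controls how much mismatch is tolerated: lowering it accepts sloppier input at the cost of more false matches.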