Chapter 1 Flashcards
Contains concepts from Chapter 1 of Manning's book
Natural Language Processing
Concerned with processing natural languages such as English and Mandarin. Involves translating natural language into data that a computer can use to learn about the world.
NLP system
Referred to as a pipeline because it involves several processing stages: natural language flows in one end and processed output flows out the other.
FST (Finite State Transducer)
An FSM (finite state machine) that outputs a sequence of new symbols as it runs.
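A minimal sketch of the idea, not taken from the book: the two states and the `transduce` helper below are hypothetical, chosen only to show a machine that emits one output symbol for every input symbol it reads.

```python
def transduce(text):
    """Hypothetical two-state transducer: capitalizes the first letter of each word."""
    state = "start_of_word"
    output = []
    for ch in text:
        if ch == " ":
            # A space is copied through and resets the machine.
            output.append(ch)
            state = "start_of_word"
        elif state == "start_of_word":
            # First character of a word: emit its uppercase form.
            output.append(ch.upper())
            state = "in_word"
        else:
            # Inside a word: copy the symbol unchanged.
            output.append(ch)
    return "".join(output)


result = transduce("natural language processing")
```

Here `result` holds the transformed string, which is what separates a transducer from a plain acceptor that only answers yes or no.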
Formal languages
Languages defined by a formal grammar (an explicit set of rules), unlike natural languages. A formal grammar can still be used to generate many natural language statements.
Regular expressions
Patterns written in a special kind of formal language grammar called a regular grammar.
Regular grammars
Have predictable, provable behavior, yet are flexible enough to power some sophisticated dialog engines and chatbots.
DFA (Deterministic Finite Automaton)
A formal mathematical object that processes a regular language. Also called a finite state machine (FSM).
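As an illustration only, here is a small DFA written as a transition table. The state names `q0` to `q2` and the toy language (one or more repetitions of "ha") are assumptions made for this card, not something from the chapter.

```python
# Hypothetical DFA accepting "ha", "haha", "hahaha", ...
TRANSITIONS = {
    ("q0", "h"): "q1",
    ("q1", "a"): "q2",
    ("q2", "h"): "q1",
}
ACCEPTING = {"q2"}


def accepts(text):
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = "q0"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            # No transition defined for this symbol: reject immediately.
            return False
    return state in ACCEPTING
```

Because each (state, symbol) pair maps to at most one next state, the machine's behavior is fully determined by its input, which is what "deterministic" refers to.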
Regular expression notation
| - OR
* - preceding character can occur 0 or more times
[] - used to specify a character class
.* - matches any number of consecutive characters
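A short sketch of this notation using Python's standard `re` module. The pattern and the `parse_greeting` helper are made up for illustration and are far simpler than a real chatbot pattern.

```python
import re

# | is OR, [] is a character class, * means "zero or more of the preceding item",
# and .* matches any run of characters (newlines excluded by default).
GREETING = re.compile(r"(hi|hello|hey)[ ,:.!]*(.*)", re.IGNORECASE)


def parse_greeting(text):
    """Return (greeting word, rest of the text), or None if text does not start with a greeting."""
    match = GREETING.match(text)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)


parsed = parse_greeting("Hello, Rosa")
```

Note that a pattern this loose also matches words that merely begin with "hi", such as "history", which is one reason hand-written patterns tend to grow quickly.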
Computational Theory of Mind
CTM assumes human-like NLP can be accomplished with a finite set of logical rules that are processed in series.
Distance Metrics (Levenshtein, Jaccard and Euclidean distance)
Useful for applications such as spelling correctors and proper-noun recognizers, where the algorithm calculates the distance between words to find likely spelling errors.
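To make two of these metrics concrete, the sketch below implements Levenshtein distance with the standard dynamic-programming recurrence and Jaccard distance over character sets. The `closest_word` helper and the tiny vocabulary are hypothetical, meant only to hint at how a spelling corrector could use them.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def jaccard_distance(a, b):
    """1 minus (shared characters / all characters), computed over character sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


def closest_word(word, vocabulary):
    """Hypothetical corrector step: pick the vocabulary entry with the smallest edit distance."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))


suggestion = closest_word("speling", ["spelling", "spending", "sailing"])
```

Levenshtein distance is sensitive to character order, while this Jaccard variant only looks at which characters appear, so the two can rank candidates differently.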
Document Representation
A document can be represented as a vector: a sequence of integers, one for each word or token in that document.
Vector space
The different ways that words can be combined to create vectors. The relationships between these vectors make up our model, which tries to predict combinations of words occurring in a collection of words. These vectors can be represented using a Counter in Python.
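A minimal sketch of a word-count vector built with `collections.Counter`. The whitespace tokenizer and the `bag_of_words` name are simplifying assumptions; a real tokenizer would also handle punctuation.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())


bag = bag_of_words("The faster Harry got to the store the faster he would get home")
```

Each distinct token acts as one dimension of the vector, and looking up a token that never occurred returns 0 rather than raising an error.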
Disadvantage with bag of words
Discards word order, so it does not work well for interpreting sentences whose meaning depends on that order.
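A small demonstration of the problem, using two example sentences invented for this card: they mean opposite things but produce identical bags.

```python
from collections import Counter


def same_bag(sentence_a, sentence_b):
    """True when both sentences contain exactly the same words with the same counts."""
    return Counter(sentence_a.split()) == Counter(sentence_b.split())


indistinguishable = same_bag("the dog bit the man", "the man bit the dog")
```

Since the two counts are equal, any model that sees only the bag cannot tell who bit whom.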
Disadvantage with one-hot vectors
High-dimensional space: each vector has one dimension per word in the vocabulary, and almost every entry is zero.
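The sketch below builds one-hot vectors for a six-word sentence in plain Python to show how the width of each vector tracks the vocabulary size. The `one_hot_vectors` helper is hypothetical and uses nested lists rather than an array library.

```python
def one_hot_vectors(tokens):
    """Return the sorted vocabulary and one 0/1 row per token."""
    vocabulary = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for token in tokens:
        row = [0] * len(vocabulary)  # one slot per vocabulary word
        row[index[token]] = 1        # only this token's slot is set
        vectors.append(row)
    return vocabulary, vectors


vocabulary, vectors = one_hot_vectors("the cat sat on the mat".split())
```

With five distinct words every row already has five entries, so a vocabulary of tens of thousands of words would give rows tens of thousands of entries wide, each containing a single 1.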
SyntaxNet and spaCy
Two libraries that provide natural language syntax tree parsers, making it possible to extract syntactic and logical relationships.