Language & Computation Flashcards
TW3V19002 UU
Formal automata
Models with step by step computation.
Formal grammars
Recipes to create patterns and sentences.
List the four language types from Chomsky’s hierarchy from simplest to most complicated:
- Regular
- Context-free
- Context-sensitive
- Recursively enumerable
List the four automata from Chomsky’s hierachy from simplest to most complicated:
- Finite state
- Pushdown
- Linearly bounded
- Turing machine
Describe the language and automaton of the type 3 grammar in Chomsky’s hiearchy:
Type 3 grammar has a regular language and a finite state automaton.
Describe the language and automaton of the type 2 grammar in Chomsky’s hiearchy:
Type 2 grammar has a context-free language and a pushdown automaton.
Describe the language and automaton of the type 1 grammar in Chomsky’s hiearchy:
Type 1 grammar has a context-sensitive language and a linearly bounded automaton.
Describe the language and automaton of the type 0 grammar in Chomsky’s hiearchy:
Type 0 grammar has a recursively enumerable language and a turing machine.
What is the trade-off found in Chomsky’s hierarchy concerned with?
Expressivity and complexity.
Greater expressivity requires…
greater complexity.
What does a(n English) syllable consist of?
The onset, the nucleus and the coda.
Concatenation
One symbol after another that form a sequence.
In what three ways can you make combinations in the formal grammar RegEx?
- Concatenation
- Choice
- Repetition
FSA
finite state automaton
What is the problem for finite automata on palindroms?
The computational power is not enough, it does not have any memory.
How is the stack of a pushdown automaton organized?
First In, Last Out
FILO
first in, last out
PDA
Pushdown automaton
Which grammar types have polynomial complexity?
Type 3 and type 2.
FA
finite automaton
The program (in FA/FSA)
The finite set of instructions for changing from state to state while reading.
Deterministic FA
FAs where there is only one instruction for each combination of state and read symbol.
State diagram
A representation for an FA in which each state is a circle and the instructions are arrows.
Give the formal definition of a deterministic FA:
A deterministic FA M is a quintuple [K, Sigma, delta, q0, F] with
K = finite set of states,
Sigma = finite set (the alphabet),
Q0 E K = the initial state,
F [subset] K = set of final states,
delta = f: K x Sigma -> K (the transition function).
Give the formal definition of a situation in a deterministic FA:
Given a dFA M = [K, Sigma, d, q0, F], a situation of M is a triple (x,q,y) where q E k and x,y E Sigma^x.
Lemmatization
The task of determining that two words have the same root
Stemming
A simple form of lemmatization where suffixes are stripped from the end of the word.
Sentence segmentation
Breaking up a text into individual sentences using cues like periods or exclamation points.
Concatenation
Putting characters in sequence.
What is the hierachy in regex?
- Parentheses ()
- Counters *, +, ? and {}
- Sequences and anchors ^, $
- Disjunction |
What is precision?
Minimise false positivies.
Recall
Minimize false negatives
Regex: find a pattern that contains either t or T.
/[tT]/
Regex: find a pattern that matches ‘tea’ or ‘Tea’.
/[tT]ea/
Regex: find any number in range 2, 3, 4, or 5.
/[2-5]/
Regex: find any letter in range c, d, e, f, g, h.
/[c-h]/
Regex: find any upper case letter.
/[A-Z]/
Regex: find any lowercase letter.
/[a-z]/
Regex: find any single digit.
/[0-9]/
Regex: find any single character, except f.
/[^f]/
Regex: find any single character, but neither a nor A.
/[^aA]/
Regex: find either f or ^.
/[f^]/
Regex: find the pattern c^f.
/c^f/
Regex: find ‘horse’ or ‘horses’.
/horses?/
Regex: what does ? mean?
The preceding character or nothing.
Regex; what does * mean?
Zero or unbounded amount of the previous character
/aa*/ (regex)
Any string of one a followed by zero or more a’s
What does . mean in regex?
It’s a wildcard, any single character.
How do you search for the start of a line with regex?
/^/
How do you search for the end of a line with regex?
$
How do you search for a word boundary in regex?
\b
How do you search for a non-word boundary in regex?
\B
How do you use the . as an actual dot in regex?
.
How do you show disjunction in regex?
|
How do you search for either ‘cat’ or ‘dog’ in regex?
/cat|dog/
When do you use parentheses in regex?
When you want the sequence to act as a single character to the operators around it.
How do you look for as little text as possible when using a Kleene star?
*?
How do you look for as little text as possible when using a Kleene plus in regex?
+?
How do you look for exactly x occurrences of the previous character?
/character{x}/
What is a shortcut in regex for any alphanumeric/underscore character?
\w
What is a shortcut in regex for any digit?
\d
Give the shortcut in regex for any non-digit:
\D
Give the shortcut in regex for any non-alphanumeric:
\W
Give the shortcut to look for a whitespace in regex:
\s
Give the shortcut in regex for a non-whitespace:
\S
How do you search for a - b occurrences of the previous char in regex?
/char{a-b}/
How do you look for at least n occurrences of the previous in regex?
/char{n,}/
How do you look for up to m occurrences of the previous in regex?
/char{,m}/
How do you search for the asterisk in regex?
*
How do you search for the question mark in regex?
\?
How do you find a newline in regex?
\n
How do you find a tab in regex?
\t
When do you use \1 in regex?
To refer to whatever string matched the first item in parentheses.
What do you use when you don’t want to capture the group between parentheses in regex?
(?:pattern)
Fragment
broken-off word
Lemma
Set of lexical forms having the same stem, pos and word sense.
Word form
The full inflected or derived form of the word.
Types (in corpora)
The number of distinct words in a corpus.
Tokens (in corpora)
Total number N of running words
Heardan’s law =
Heap’s law
Heap’s law
The relationship between the number of types |V| and number of tokens N.
|V| = k * N^beta
What 3 things are commonly applied when normalizing text?
- Tokenizing words
- Normalizing word formats
- Segmenting sentences
Named entity recognition
The task of detecting names, dates and organization.
Clitic
Part of a word that can’t stand on its own, but has be attached to another word.
What two parts do most tokenization schemes consist of?
- Token learner
- Token segmenter
What does a token learner do in tokenization?
It takes the raw training corpus and induces vocabulary.
What does a token segmenter do in tokenization?
It takes a raw test sentence and segments it into tokens in vocabulary (set of all individual characters). Chooses the 2 most frequent adjacent and adds the merged symbol to vocabulary, changes all occurrences in corpus to merged symbol. Continues until k new tokens have been created.
Coreference task
Task of deciding whether two strings refer to the same entity.
Alignment
Given two sequences, an alignment is a correspondence between substrings of the two sequences.
Levenshtein distance
The simplest weighting factor in which each of the 3 operations has a cost of 1.
By what 3 atomic shapes are regular expressions defined?
- a for any alphabet symbol a in Sigma (recognizes a)
- Epsilon (recognizes empty string)
- Empty set (recognizes no string)
By what 3 operations are regular expressions defined?
- Choice (U symbol)
- Concatenation
- Repetition
What does + mean in regex?
the previous character one or more times/
How do you search for words with an even number of symbols in regex?
(..)*
In what automaton is it allowed to move states without reading a symbol?
In non-deterministic FSAs
When is a string accepted by a non-deterministic FSA?
Iff there is some path through the state diagram which begins in the initial state, reads the entire input and ends in the final state.
NFA
non-deterministic finite automaton
What allows for non-transitivity in NFAs?
The fact that delta is a relation and not a function.
What is the relation between DFAs and NFAs?
- DFAs are a proper subclass of NFAs.
- NFAs accept exactly the same class of languages as DFAs.
Is the empty set a regular language?
Yes.
If A and B are regular languages, so are
The union of A and B
AB
What is the use of the pumping lemma?
Showing that certain languages are not fal’s. May not prove useful in other cases even though the language in question is not a fal.
Pumping lemma (theorem in their book) for FALs:
If L is an infinite fal over alphabet sigma, then there are strings x, y and z in sigma* such that y is not e and x*(y^n)z is in L for all n>=0.
What are the two types of pronominal subject doubling?
- Clitic doubling
- Topic doubling
Where does clitic doubling occur?
In subclauses and inverted main clauses.
Where does topic doubling occur?
In subject-initial main clauses.
Clitic doubled subject
A subject consisting of a clitic and a strong pronoun, taken as one merged DP.
Topic doubled subject
A subject that involves a non-clitic pronoun doubled by a strong pronoun.
Tripling
When a strong pronoun is both part of clitic doubling and topic doubling in one sentence.
How do you make a DFA out of a NFA?
- The DFA set of states is the power set of the NFA states.
- Initital states = readable from q0 without any input.
- Final state = any set of states that contains a final state of the NFA.
- Transition function; what set can be reached from this state?
What is the complexity of DFAs?
linear, does not depend on the size of the automaton.
What is the distinguishing factor between regular languages and context-free languages?
self-embedding
What are the two operations on a stack?
- Push
- Pop
Push operation on stack
Add something to the top of the stack
Pop operation on stack
Remove something from the top of the stack
The PDA transition function delta is a function such that…
The input is a 3-tuple (current state, input symbol, stack symbol to access/pop).
The output is a set of 2-tuples (next state, stack symbol/string to write/push).
Name the two main strategies of procesing context-free languages:
- Top-down
- Bottom-up
How does top-down processing of CFLs work?
CFG: build a tree from root to leaves
PDA: read in epsilon and push symbols unto the stack, then read input symbols and pop symbols from the stack.
How does bottom-up processing work for CFLs?
CFG: build a tree from leaves to root.
PDA: Read input symbols and push symbols onto the stack, then read in epsilon and pop symbols from stack.
What two examples were mentioned in lecture that are impossible to capture with CFGs/PDAs?
- a^n b^n c^n
- w^2
Can all regular languages be modelled with a PDA?
Yes
Are all context-free languages regular?
No
Can all context-free languages be modelled with an FSA?
No
When does a PDA accept a string?
- The entire input has been read
- The PDA is in a final state
- The stack is empty
When is a PDA non-deterministic?
If there is more than one move available to the automaton in certain situations.
Can deterministic PDAs have no move available for a situation?
Yes, it is only required that in no situation there is more than one move.
Do non-deterministic PDAs accept exactly the languages generated by context-free grammars?
Yes.
Are natural languages context-free?
No, because cross-serial dependency.
PaQu (abbreviation)
Parse and Query application
What is the Parse and Query application (PaQu)?
A web application for searching in Dutch treebanks and for analysing search results.
Treebank
A text corpus in which each sentence has been assigned a syntactic structure.
What part of natural language is a regular language mostly used for?
Phonology and most morphology.
What part of natural language is a finite dictionary mostly used for?
Lexicon
What part of natural language is a context-free grammar mostly used for?
Some syntax.
What part of natural language is a contxt-sensitive grammar mostly used for?
Syntax
What formal grammar can you use the best if you want to describe syntax?
Context-free or context-sensitive.
What formal grammar can you use best if you want to describe phonology or morphology?
Regular language
When is a context-free grammar G ambiguous?
If at least one word has multiple left-most derivations or multiple trees.
When is a language L ambiguous?
If all grammars that generate L are ambiguous.
If every grammar you can possibly create for the language creates two or more trees for some word,…
you have an ambiguous language.
Can regular languages be represented with CFGs?
Yes
A grammar is regular if and only if in every rule A -> u, …
u has no more than one nonterminal symbol,
AND
this nonterminal symbol is always at the beginning or always at the end of u.
A CFG is self-embedding if…
there is a nonterminal A that can be rewritten to itself with non-empty strings before and after:
A -> uAv with u and v are both non-empty
What distinguishes regular languages from context-free languages?
Self-embedding:
every self-embedding language is non-regular and every non-regular CFL is self-embedding.
What are the two main strategies of processing (parsing) CFLs?
- Top-down (root to leaves)
- Bottom-up (leaves to root)
Why is it impossible to get a^nb^nc^n pattern with CFG?
This pattern requires a memory beyond FILO.
What does P(w|h) denote?
The probability that word w will occur given word h.
Markov assumption
The assumption that the probability of a word depends only on the previous word.
MLE (abbreviation)
Maximum likelihood estimation
Relative frequency
The observed frequency of a particular sequence divided by the observed frequency of a prefix.
Extrinsic evaluation
Evaluation where the language model is embedded in an application and it is measured how the application improves.
Intrinsic evaluation
Evaluation where the quality of a language model is measured independently of any application, using a test and training set.
Perplexity
A probability-based metric for accuracy.
What is the usual division between training, development and test sets?
80% training, 10% devset and 10% testing.
PPL (abbreviation)
Perplexity
Minimizing perplexity is equivalent to…
maximizing the test set probability according to the language model.
What is the branching factor of a language?
The number of possible next words that can folow any word.
What is a requirement for comparing the perplexity of two language models?
The models need to be using identical vocabularies.
For what 2 reasons are zero probabilities a problem?
- Problem of sparsity: there are always some valid sequences missing from corpus.
- Can’t compute perplexity, then we would need to divide by 0.
OOV (abbreviation)
out of vocabulary
OOV rate
The percentage of OOV words that appear in the test set.
Laplace smoothing
A type of smoothing where all n-gram counts are raised by 1,
What smoothing methods require discounting to create a probability distribution?
Back off and interpolation.
How are regular languages defined?
Through FSAs and regular expressions.
All strictly local languages are…
regular.
Strictly local language
A regular language where all dependencies fit into a finite (max n ) sliding window of analysis.
How do you ensure that dependence of a symbol on other symbols is at a distance =< n in a strictly local FSA?
Have exactly one state for every sequence of up to n-1 symbols + a sink state. Every state is only reachable through one particular sequence of symbols.
Formal grammar (non-mathematical definition)
A deductive system of axioms and rules of inference, which generates the sentences of a language as its theorems.
What two types of alphabets do formal grammars use?
A terminal alphabet and a non-terminal alphabet.
What is the relation between the two alphabets used by a formal grammar?
They are disjoint (the terminal and the non-terminal alphabet).
What is a constraint on the left side of a rule in a formal grammar?
The string on the left side of a rule cannot consist entirely of terminal symbols.
What is the form of each rule in a type 1 formal grammar?
alpha A beta -> alpha psi beta
With alpha, psi, beta = arbitrary strings over the union of terminal and non-terminal symbols and A= non-terminal symbol.
What is the form of each rule in a type 2 formal grammar?
A -> psi
With alpha, psi, beta = arbitrary strings over the union of terminal and non-terminal symbols and A= non-terminal symbol.
What is the form of each rule in a type 3 formal grammar?
A -> x B or A -> x
With alpha, psi, beta = arbitrary strings over the union of terminal and non-terminal symbols and A= non-terminal symbol.
The type 3 languages are properly included in…
The type 2 languages.
Type 2 languages not containing the empty string are properly included in…
the type 1 languages
The type 1 languages are properly included in…
the type 0 languages.
How can we prove that English is not a natural language?
By using the pumping theorem and the fact that regular languages are closed under intersection.
FAL (abbreviation)
Finite automaton language
What do we do if we can’t model a natural language with a regular language?
We go up one level of complexity: to context-free languages (CFL).
Under what operations are regular languages closed?
- Boolean operations
- Union
- Intersection
- Complement
- Difference
What operations on strings in a regular language yield another regular language?
- Concatenation
- Repetition
- Choice
How do you model the expression ‘all X are Y’ using regex?
Given alphabet {0,1} with
1= X that is Y and
0= X that is not Y.
/1*/
there are only 1s, Xs that are Y
How do you model the expression ‘Some X is Y’ using regex over an alphabet {0,1} with 1= X that is Y and 0= X that is not Y?
/.1./
There is at least one 1 among all Xs.
How do you model the expression ‘No X is Y’ using regex over alphabet {0,1} with 1= X that is Y and 0= X that is not Y?
/0*/
There are only 0s, Xs that are not Y.
How do you model the expression ‘Not all X are Y’ using regex over the alphabet {0,1} with 1=X that is Y and 0= X that is not Y?
/.0./
There is at least one 0 ampng all Xs.
Why are we unable to model the expression ‘There are as many 1s as 0s’ with regex?
You need to remember how many 0s and 1s you’ve seen in order to check this, so it requires some form of memory.
Why can’t we model the pattern a^nb^n using regex?
It requires some type of memory: we need to have seen as many a’s as b’s at the end of reading the string.
What does this pattern consist of: (ww)^R
R = reverse, so if ww = ab then the pattern will be abba.
What four elements does a context-free grammar (CFG) consist of?
- Alphabet N of nonterminal symbols.
- Alphabet Sigma of terminal symbols.
- Set of rules P, pairs of a nonterminal symbol and string of terminal/nonterminal symbols.
- Start symbol S, element in N.
How do we define a language of a context-free grammar (CFG)?
The set of all strings of terminal symbols that can be derived from the start symbol, S, in zero or more steps.
When do we speak of leftmost derivation of S -> w?
If only the leftmost auciliary symbol is rewritten at every step.
Every leftmost derivation corresponds to exactly one…
tree diagram
When is a context-free grammar (CFG) self-embedding?
If there is a nonterminal A that can be rewritten to itself with non-empty strings before and after.
When is a language self-embedding?
If all grammars that describe this language are self-embedding.
What type of grammar allows a nonterminal X to be rewritten to aXb, so can match an ‘a’ to a ‘b’ at an arbitrary distance?
A context-free grammar (CFG)