Compiler Construction Flashcards
Left recursion
Top-down parsers cannot handle left-recursion in a grammar
-> transform the grammar to eliminate it (cf. sl. 40/41)
Context-free grammars
syntax is context-free
semantics is context-sensitive
intermediate representation
e.g., a parse tree
-> a notation that is not tied to any particular source language or target machine
Choosing the “right” IR is not an easy task: source code is too high level for performing many analyses, and machine code is too low level. Sometimes an AST is used (i.e., closer to source code), and sometimes a special kind of abstract instruction set (i.e., closer to machine code).
derivation
the sequence of production applications that rewrites the start symbol of a grammar into a given sentence; the process of discovering a derivation is called parsing
left-recursion
a grammar is left-recursive if some non-terminal can derive a sentential form that begins with itself, e.g. E → E + T; see the dedicated card below for why this breaks top-down parsing and how to rewrite it
Bytecode
The “machine-code” of the virtual machine
What is the difference between a compiler and an interpreter?
a compiler translates programs into another language (typically machine code or bytecode) before they run; an interpreter executes programs directly, without producing a translated artifact. In practice the two are often combined, e.g. a VM that interprets and JIT-compiles bytecode
What are important qualities of compilers?
- correctness: the generated code must faithfully implement the semantics of the source program
- speed of the generated code
- speed of compilation
- quality of error reporting and diagnostics
- maintainability of the compiler itself
Why are compilers commonly split into multiple passes?
—to break the compiler into manageable pieces
—to isolate the back end from the front end, so that different languages can share an IR and back end
—to enable machine-independent optimization as separate passes over the IR
What are the typical responsibilities of the different parts of a modern compiler?
- scanner: maps the character stream to tokens (lexical analysis)
- parser: recognizes structure, builds a parse tree / IR (syntax analysis)
- semantic analysis: checks context-sensitive rules (declarations, types)
- optimizer: transforms IR into better IR
- code generator: translates IR into target machine code
How are context-free grammars specified?
as a set of productions over terminal and non-terminal symbols, together with a designated start symbol; in practice usually written down in BNF or EBNF notation
What is “abstract” about an abstract syntax tree?
it abstracts away the concrete syntax: punctuation and the intermediate non-terminals that exist only to encode precedence and associativity are collapsed, leaving just the essential structure of the program
What is intermediate representation and what is it for?
a notation for the program that is not tied to any particular source language or target machine; it decouples the front end from the back end and is what analyses and optimizations operate on
Why is optimization a separate activity?
optimization is a semantics-preserving IR-to-IR transformation: keeping it separate makes it optional, lets it be organized as multiple independent passes, and keeps it machine- and language-independent
Is Java compiled or interpreted? What about Smalltalk? Ruby? PHP? Are you sure?
it depends on the implementation, not the language: Java and Smalltalk are typically compiled to bytecode that a virtual machine interprets and/or JIT-compiles, and common Ruby and PHP implementations likewise compile to an internal bytecode that is then interpreted. “Compiled vs. interpreted” is a property of an implementation, not of a language
What are the key differences between modern compilers and compilers written in the 1970s?
the overall front end / IR / back end structure is largely unchanged; modern compilers mainly differ in having many more passes, much more sophisticated IR-level analysis and optimization (e.g. SSA), and in relying on generators and reusable frameworks rather than being written entirely by hand
Why is it hard for compilers to generate good error messages?
by the time an error is detected, much of the context a human reader would use has been discarded; the parser notices a problem only where the input first fails to match any production, which may be far from the actual mistake, and recovering without a cascade of spurious follow-up errors is difficult
What is “context-free” about a context-free grammar?
the rules for non-terminals can be written without regard for the context in which they appear
Traditional two-pass compiler
A classical compiler consists of a front end that parses the source
code into an intermediate representation, and a back end that
generates executable code.
front end
parses the source code into an intermediate representation
- the front end should deal with all things to do with the language and source code
- consists of scanner (tokens) and parser (IR)
back end
generates executable code
- the back end should deal with all things to do with the
target architecture
- translate IR into target machine code
scanner
is responsible for converting the source code into a stream of tokens of the source language
parser
is responsible for recognizing structure in the stream of tokens produced by the scanner. Typically it builds a parse tree representing this structure, but it may also produce another kind of IR (intermediate representation) more suitable for the back end
IR
e.g. parse tree
context-free?
“context-free” because the rules for nonterminals can be written without regard for the context in which they appear
context-sensitive
context-sensitive: validating a construct requires knowing the context in which it appears (e.g., whether a variable has been declared)
context-free: a construct can be validated without regard to its surrounding context
unambiguous grammar
it always produces a unique parse for a given valid input.
the parse and the parse tree
The parse is the sequence of productions needed to parse the input.
The parse tree is a structure representing this derivation
LL(1), LR(1)
“sweet spots”: allow interesting languages to be specified, but can also be parsed efficiently
syntax analysis
An AST is usually the result of the syntax analysis phase of a compiler
lexical analysis -> syntax analysis
Leftmost vs rightmost derivation
replace the leftmost non-terminal at each step
vs
replace the rightmost non-terminal at each step
table-driven parser
- Our recursive descent parser encodes state information in its call stack.
- Using recursive procedure calls to implement a stack abstraction may not be particularly efficient.
- This suggests other implementation methods:
—explicit stack, hand-coded parser
—stack-based, table-driven parser
constant expression propagation and folding
evaluate and propagate constant expressions at compile time
Constant propagation entails recognizing that certain “variables” actually have constant values, and then propagating these constants to the expressions where they are used. Later we will see that SSA form is ideal for this analysis.
constant folding:
> Evaluate constant expressions at compile time
> Only possible when freedom from side effects is guaranteed
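A minimal sketch of constant folding over a toy expression AST; the class and method names are invented for illustration, not from the lecture material:

```java
// Toy AST for the sketch: constants, variables, binary addition.
sealed interface Expr permits Const, Var, Add {}
record Const(int value) implements Expr {}
record Var(String name) implements Expr {}
record Add(Expr left, Expr right) implements Expr {}

class Folder {
    // Fold bottom-up: if both operands of an addition turn out to be
    // constants, evaluate the addition at compile time. Integer addition
    // is side-effect free, so the transformation is always safe here.
    static Expr fold(Expr e) {
        if (e instanceof Add add) {
            Expr l = fold(add.left());
            Expr r = fold(add.right());
            if (l instanceof Const cl && r instanceof Const cr) {
                return new Const(cl.value() + cr.value());
            }
            return new Add(l, r);
        }
        return e; // constants and variables are already folded
    }

    public static void main(String[] args) {
        // (1 + 2) + x  folds to  3 + x; x stays symbolic.
        Expr e = new Add(new Add(new Const(1), new Const(2)), new Var("x"));
        System.out.println(fold(e)); // Add[left=Const[value=3], right=Var[name=x]]
    }
}
```

Note how x is left alone: propagation would first have to prove that x is bound to a constant before folding could continue.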
ambiguous grammar
If a grammar has more than one derivation for a single sentential form, it is ambiguous
Top-down parser (LL)
- starts at the root of derivation tree and fills in
- picks a production and tries to match the input
- may require backtracking
- some grammars are backtrack-free (predictive)
- uses leftmost derivation
- can be either hand-written or generated
- cannot handle left-recursion in a grammar
-> Since a parser does not just parse, but must also produce a suitable IR, we must decorate the grammar with additional actions that the generated parser will take
Bottom-up parser (LR)
- starts at the leaves and fills in
- starts in a state valid for legal first tokens
- as input is consumed, changes state to encode possibilities (recognize valid prefixes)
- uses a stack to store both state and sentential forms
- uses rightmost derivation
- are normally built by parser generators
Reading the sequence of reductions in reverse yields a rightmost derivation (in the slide example: first replacing S, then B, then A and A again).
parser generators
- build bottom-up parsers
dead code elimination
eliminate code that can never be executed
left-recursion
If the parser makes the wrong choices, expansion doesn’t terminate!
-> solve by rewriting left recursion as right recursion:
E → E + T
E → T
rewrite as:
E → T E'
E' → + T E'
E' → ε
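The right-recursive grammar above maps directly onto recursive-descent code. A sketch, assuming a hypothetical token list of "num", "+" and an end marker "$" (this scaffolding is not from the slides):

```java
import java.util.List;

// Recursive descent for:  E -> T E'    E' -> + T E' | ε    T -> num
class ExprParser {
    private final List<String> tokens; // e.g. ["num", "+", "num"]
    private int pos = 0;

    ExprParser(List<String> tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "$"; }

    private void eat(String t) {
        if (!peek().equals(t)) throw new IllegalStateException("expected " + t + ", got " + peek());
        pos++;
    }

    void parseE() { parseT(); parseEPrime(); }

    // The former left recursion is now a chain of calls that always
    // consumes a '+' before recursing, so expansion terminates.
    void parseEPrime() {
        if (peek().equals("+")) { eat("+"); parseT(); parseEPrime(); }
        // otherwise: E' -> ε, consume nothing
    }

    void parseT() { eat("num"); }

    public static void main(String[] args) {
        new ExprParser(List.of("num", "+", "num", "+", "num")).parseE();
        System.out.println("parsed OK");
    }
}
```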
look-ahead
top-down parsers may need to backtrack when they select the wrong production
- large subclasses of CFGs can be parsed with limited lookahead
- Among the interesting subclasses are:
LL(1): Left to right scan, Left-most derivation, 1-token look-ahead
LR(1): Left to right scan, Right-most derivation, 1-token look-ahead
recursive descent / predictive parsing
basic idea: For any two productions A → α | β, we would like a distinct way of choosing the
correct production to expand.
Whenever two productions A → α and A → β both appear in the grammar, we would like:
FIRST(α) ∩ FIRST(β) = ∅
This would allow the parser to make a correct choice with a look-ahead of only one symbol!
=> this type of parsing works only on grammars where the first terminal symbol of each subexpression provides enough information to choose which production to use (iff FIRST property can be established!).
What if a grammar does not have this property?
-> left factoring
left factoring
sometimes we can transform a grammar in order to have this property (FIRST(α) ∩ FIRST(β) = ∅):
- For each non-terminal A find the longest prefix α common to two or more of its alternatives
- if α ≠ ε then replace all of the A productions
A → αβ1 | αβ2 | … | αβn
with
A → α A´
A´ → β1 | β2 | … | βn
where A´ is fresh
- Repeat until no two alternatives for a single non-terminal have a common prefix.
-> we do this so that choosing a production requires only a single token of look-ahead, e.g. A → a b | a c left-factors to A → a A´ with A´ → b | c
lexical analyzer
The lexical analyzer breaks the source text into a series of tokens, discarding whitespace and comments along the way. If it encounters text that cannot form a legal token, it reports an error. It works closely with the syntax analyzer: it reads the character stream, recognizes legal tokens, and hands them to the syntax analyzer on demand.
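A hand-written scanner sketch for a toy language with identifiers, integer literals, '+' and "//" line comments; the token spellings are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written scanner: characters in, tokens out.
class ToyScanner {
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }            // skip white space
            if (c == '/' && i + 1 < src.length() && src.charAt(i + 1) == '/') {
                while (i < src.length() && src.charAt(i) != '\n') i++;   // skip comment
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add("ID(" + src.substring(start, i) + ")");       // keep the lexeme
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add("NUM(" + src.substring(start, i) + ")");
            } else if (c == '+') {
                tokens.add("PLUS"); i++;
            } else {
                throw new IllegalStateException("illegal character: " + c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("x1 + 42 // trailing comment"));
        // [ID(x1), PLUS, NUM(42)]
    }
}
```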
FIRST(α)
FIRST(α) is the set of all terminal symbols that can begin any string derived from α
-> if two different productions have the same left-hand-side symbol X and their right-hand sides have overlapping FIRST sets, then the grammar cannot be parsed using predictive parsing.
construct a predictive / recursive-descent parser
p. 50/51
- create a predictive parsing table with one production per entry (if an entry contains more than one production, predictive parsing will not work)
- > an ambiguous grammar will always lead to duplicate entries in a predictive parsing table
- > need to find an unambiguous grammar
- > grammars whose predictive parsing tables contain no duplicate entries are called LL(1) = Left-to-right parse, Leftmost-derivation, 1-symbol lookahead
- > a recursive-descent parser does its job just by looking at the next token of the input, never looking more than one token ahead
- > this relies on the FIRST property
LL(k)
generalize the notion of FIRST sets to describe the first k tokens of a string, and make an LL(k) parsing table whose rows are the nonterminals and columns are every sequence of k terminals (rarely done -> tables get large!)
NFA -> DFA
Any NFA can be converted into a DFA, by simulating sets of simultaneous states.
Since a DFA must always be in a unique state, the states of the DFA must be all possible subsets of the NFA states.
NFA to DFA using the subset construction
- the DFA start state is the set containing the NFA start state (closed under ε-transitions)
- the transition of a DFA state (a set of NFA states) on a symbol c is the set of all NFA states reachable from any of its members on c, again closed under ε
- repeat until no new sets appear; a DFA state is accepting if it contains an accepting NFA state
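A compact sketch of the subset construction for an NFA without ε-transitions (adding ε-closures is a straightforward extension); the example automaton and its encoding are invented:

```java
import java.util.*;

// Subset construction: each DFA state is a set of NFA states.
class SubsetConstruction {
    public static void main(String[] args) {
        // Invented NFA over {a, b}: states 0..2, start 0, accept 2
        // (it recognizes strings ending in "ab").
        Map<Integer, Map<Character, Set<Integer>>> trans = Map.of(
            0, Map.of('a', Set.of(0, 1), 'b', Set.of(0)),
            1, Map.of('b', Set.of(2))
        );
        Set<Integer> start = Set.of(0);
        char[] alphabet = {'a', 'b'};

        // Worklist: every newly discovered set of NFA states becomes
        // one DFA state and is processed exactly once.
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>(List.of(start));
        while (!work.isEmpty()) {
            Set<Integer> state = work.pop();
            if (dfa.containsKey(state)) continue;
            Map<Character, Set<Integer>> row = new LinkedHashMap<>();
            for (char c : alphabet) {
                Set<Integer> next = new TreeSet<>();
                for (int s : state)
                    next.addAll(trans.getOrDefault(s, Map.of()).getOrDefault(c, Set.of()));
                if (!next.isEmpty()) { row.put(c, next); work.push(next); }
            }
            dfa.put(state, row);
        }
        // A DFA state is accepting iff it contains NFA accept state 2.
        dfa.forEach((s, row) ->
            System.out.println(s + (s.contains(2) ? " (accept)" : "") + " -> " + row));
    }
}
```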
Non-recursive predictive parsing
- uses an explicit stack and a parse table instead of recursive procedure calls
- the stack holds the grammar symbols still to be matched (terminals) or expanded (non-terminals); see the sketch below
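A sketch of such a table-driven parser for the grammar E → T E', E' → + T E' | ε, T → num; the table contents and token encoding are hypothetical:

```java
import java.util.*;

// Table-driven LL(1) parsing for:  E -> T E'   E' -> + T E' | ε   T -> num
// The explicit stack replaces the call stack of a recursive-descent parser.
class TableDrivenLL1 {
    public static void main(String[] args) {
        // table.get(nonTerminal).get(lookahead) = right-hand side to push
        Map<String, Map<String, List<String>>> table = Map.of(
            "E",  Map.of("num", List.of("T", "E'")),
            "E'", Map.of("+", List.of("+", "T", "E'"), "$", List.<String>of()), // ε
            "T",  Map.of("num", List.of("num"))
        );
        List<String> input = new ArrayList<>(List.of("num", "+", "num", "$"));

        Deque<String> stack = new ArrayDeque<>(List.of("E", "$"));
        while (!stack.peek().equals("$")) {
            String top = stack.pop(), look = input.get(0);
            if (table.containsKey(top)) {              // non-terminal: expand via table
                List<String> rhs = table.get(top).get(look);
                if (rhs == null) throw new IllegalStateException("no rule for " + top + " on " + look);
                for (int i = rhs.size() - 1; i >= 0; i--) stack.push(rhs.get(i));
            } else if (top.equals(look)) {             // terminal: match and advance
                input.remove(0);
            } else {
                throw new IllegalStateException("expected " + top + ", got " + look);
            }
        }
        System.out.println(input.get(0).equals("$") ? "accept" : "reject");
    }
}
```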
DFA Minimization algorithm
partition refinement: start with two groups, the accepting and the non-accepting states; repeatedly split a group whenever two of its states go to different groups on some input symbol; when nothing can be split any more, each group becomes one state of the minimal DFA
LL(1) parse table
rows are the non-terminals, columns are the possible look-ahead terminals (including $); entry [A, t] holds the production to apply when expanding A with t as the next input token, computed from FIRST sets (and FOLLOW sets for ε-productions); a duplicate entry means the grammar is not LL(1)
Properties of LL(1) grammars
- No left-recursive grammar is LL(1)
- No ambiguous grammar is LL(1)
- Some languages have no LL(1) grammar
- An ε–free grammar where each alternative expansion for A begins with a distinct terminal is a simple LL(1) grammar
What are the key responsibilities of a scanner?
- map sequences of characters to tokens
- eliminate white space (tabs, blanks, comments etc.)
- record the lexeme (string value) of each token and pass the tokens to the parser on demand
Bottom-up parsing
Goal:
- Given an input string w and a grammar G, construct a parse tree by starting at the leaves and working to the root.
- We parse bottom-up, replacing terms by non-terminals!
the question is always whether to shift or reduce!
What is a regular language?
a language that can be described by a regular expression, or equivalently recognized by a finite automaton (DFA or NFA)
handle
A handle of a right-sentential form γ is a production A → β and a position in γ where β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ
—Suppose: S ⇒* αAw ⇒ αβw
—Then A → β in the position following α is a handle of αβw
-> NB: Because γ is a right-sentential form, the substring to the right of a handle contains only terminal symbols.
How can you generate a DFA recognizer from a regular expression?
in practice, with a scanner generator; conceptually: translate the regular expression into an NFA (Thompson’s construction), convert the NFA into a DFA using the subset construction, and optionally minimize the DFA
Why aren’t regular languages expressive enough for parsing?
they cannot describe arbitrarily deep nested, recursive structure such as balanced parentheses: a finite automaton has no stack with which to count nesting, so parsing requires context-free grammars
handle-pruning, bottom-up parser
shift-reduce parser
Shift-reduce parsers use a stack and an input buffer
- initialize stack with $
- Repeat until the top of the stack is the goal symbol and the input token is $:
a) Find the handle.
If we don’t have a handle on top of the stack, shift (push) an input symbol onto the stack
b) Prune the handle.
If we have a handle A → β on the stack, reduce
I. Pop |β| symbols off the stack
II. Push A onto the stack
A shift-reduce parser has just four canonical actions:
shift, reduce, accept, error
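A toy shift-reduce sketch for the grammar S → S + n | n with hand-coded handle recognition; real LR parsers find handles with a table-driven automaton, and the token strings here are made up:

```java
import java.util.*;

// Shift-reduce parsing of:  S -> S + n | n
// Handles: "n" (first reduction) or "S + n" on top of the stack.
class ShiftReduce {
    public static void main(String[] args) {
        List<String> input = new ArrayList<>(List.of("n", "+", "n", "+", "n", "$"));
        Deque<String> stack = new ArrayDeque<>(List.of("$"));
        while (!(stack.peek().equals("S") && stack.size() == 2 && input.get(0).equals("$"))) {
            if (top3(stack).equals("S + n")) {         // handle S + n on top: reduce
                stack.pop(); stack.pop(); stack.pop(); stack.push("S");
                System.out.println("reduce S -> S + n");
            } else if (stack.peek().equals("n")) {     // handle n on top: reduce
                stack.pop(); stack.push("S");
                System.out.println("reduce S -> n");
            } else if (!input.get(0).equals("$")) {    // no handle on top: shift
                stack.push(input.remove(0));
                System.out.println("shift " + stack.peek());
            } else {
                throw new IllegalStateException("error: no handle and nothing to shift");
            }
        }
        System.out.println("accept");                  // the four actions: shift/reduce/accept/error
    }

    // Top three stack symbols, oldest first ("S + n" when that handle is on top).
    static String top3(Deque<String> stack) {
        List<String> top = new ArrayList<>(stack).subList(0, Math.min(3, stack.size()));
        Collections.reverse(top);
        return String.join(" ", top);
    }
}
```

Read the printed reductions in reverse to recover the rightmost derivation.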
left-to-right-scan?
right-to-left-scan?
both LL and LR parsers scan their input Left to right (the first “L”); no practical parser scans right to left. The second letter names the derivation: Leftmost (top-down) vs. Rightmost in reverse (bottom-up)
Scanning vs. parsing
- scanning deals with the flat, regular structure of a language (tokens)
- parsing deals with nested, recursive structure, which regular expressions cannot express
CFGs are used to impose structure!!!
lexical analysis
- Maps sequences of characters to tokens
- Eliminates white space (tabs, blanks, comments etc.)
- > happens within the scanner
- > The string value of a token is a lexeme
CFG / parsing
CFGs are used to impose structure!!!
- For practical purposes it is important that a grammar
be unambiguous, i.e., that it always produces a unique parse for a given valid input
- Although parsers read their input Left to Right (the first “L” in most of these categories), they may work either top-down — producing a leftmost derivation — or bottom-up — producing a rightmost derivation.
- The process of discovering a derivation is called parsing.
left recursion vs right recursion
pros and cons?
Rule of thumb:
—right recursion for top-down parsers
—left recursion for bottom-up parsers
Leftmost derivation
replace the leftmost non-terminal at each step
- corresponds to top-down parsing, and is especially well-suited to recursive descent parsers
Rightmost derivation
replace the rightmost non-terminal at each step
- is produced bottom-up, and is better-suited to table-driven parsing
review: LR(k):
must be able to recognize the occurrence of the right hand side of a production after having seen all that is derived from that right hand side with k symbols of look-ahead.
LR dilemma: pick A → b or B → b ? (how to reduce?)
Resolving ambiguity
eliminate ambiguity by rearranging the grammar (e.g., encoding precedence and associativity in additional non-terminals)
Visitor Pattern
Intent: Represent an operation to be performed on the elements of an object structure. Visitor lets you define a new operation without changing the classes of the elements on which it operates.
- > A visitor gathers related operations
- > Visitor makes adding new operations easy (write new visitor)
- > Visitor can break encapsulation
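A minimal Java sketch of the pattern as used for AST traversal; the node and visitor names are invented:

```java
// Element hierarchy: AST nodes accept visitors.
interface Node { <R> R accept(Visitor<R> v); }

record Num(int value) implements Node {
    public <R> R accept(Visitor<R> v) { return v.visit(this); }
}
record Plus(Node left, Node right) implements Node {
    public <R> R accept(Visitor<R> v) { return v.visit(this); }
}

// One visit method per node class: a new operation is a new visitor,
// and the node classes never change.
interface Visitor<R> {
    R visit(Num n);
    R visit(Plus p);
}

// An operation packaged as a visitor: evaluate the expression.
class Evaluator implements Visitor<Integer> {
    public Integer visit(Num n) { return n.value(); }
    public Integer visit(Plus p) { return p.left().accept(this) + p.right().accept(this); }
}

class VisitorDemo {
    public static void main(String[] args) {
        Node ast = new Plus(new Num(1), new Plus(new Num(2), new Num(3)));
        System.out.println(ast.accept(new Evaluator())); // 6
    }
}
```

A pretty-printer would be just another Visitor<String>: the related operation code is gathered in one class instead of being spread over the node hierarchy.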
Java Tree Builder (JTB)
- front-end for javacc
- supports the building of syntax trees that can be traversed using visitors.
- transforms a bare JavaCC grammar into three
components:
1) a JavaCC grammar with embedded Java code for building a syntax tree
2) one class for every form of syntax tree node
3) a default visitor which can do a depth-first traversal of a syntax tree
Why use intermediate representations?
—break compiler into manageable pieces
—isolates back end from front end
—different languages can share IR and back end
- Enables machine-independent optimization -> general techniques, multiple passes
IR types
- Abstract syntax trees (AST)
- Linear operator form of tree (e.g., postfix notation)
- Directed acyclic graphs (DAG)
- Control flow graphs (CFG)
- Program dependence graphs (PDG)
- Static single assignment form (SSA)
- 3-address code
- Hybrid combinations
lookahead
to decide which production rule to apply at any point without backtracking
Abstract syntax tree
An AST is a parse tree with nodes for most non-terminals removed.
-> Since the program is already parsed, non-terminals needed to establish precedence and associativity can be collapsed!
Remember: Concrete syntax trees, or parse trees, show all of the intermediate non-terminals needed to produce an unambiguous grammar. An abstract syntax tree collapses these to offer a much simpler version of the parse tree.
parsing
the process of discovering a derivation
3-address code
Statements take the form: x = y op z
Advantages:
—compact form
—names for intermediate values
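A sketch of an emitter that flattens a toy expression tree into three-address code with fresh temporaries; the names and structure are illustrative only:

```java
// Flatten a toy expression tree into three-address code.
class TAC {
    interface Expr {}
    record Leaf(String name) implements Expr {}
    record Bin(String op, Expr left, Expr right) implements Expr {}

    static int counter = 0;

    // Returns the name holding the expression's value, printing one
    // "x = y op z" statement per operator node.
    static String emit(Expr e) {
        if (e instanceof Leaf l) return l.name();
        Bin b = (Bin) e;
        String y = emit(b.left());
        String z = emit(b.right());
        String t = "t" + (++counter);        // fresh name for the intermediate value
        System.out.println(t + " = " + y + " " + b.op() + " " + z);
        return t;
    }

    public static void main(String[] args) {
        // a = b * c + d
        Expr rhs = new Bin("+", new Bin("*", new Leaf("b"), new Leaf("c")), new Leaf("d"));
        System.out.println("a = " + emit(rhs));
        // prints:  t1 = b * c
        //          t2 = t1 + d
        //          a = t2
    }
}
```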
Static Single Assignment Form (SSA)
Goal: simplify procedure-global optimizations
Program is in SSA form if every variable is only assigned once
- SSA is only used for static analysis and optimization; it disappears when we generate the final executable code
- SSA is normally used on control-flow graphs (CFG)
The Φ-Function
- Φ-functions are always at the beginning of a basic block
- Selects between values depending on control-flow
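A small illustration: ordinary Java, with comments giving the corresponding SSA names; the version numbering (x1, x2, x3) is the usual convention, and the Φ sits at the join of the two control-flow paths:

```java
class SsaDemo {
    // Ordinary Java below; the comments give the SSA view. Each
    // assignment creates a fresh version of x, and the join point
    // after the if needs a Φ to select the version that reaches it.
    static int f(boolean c) {
        int x = 0;          // x1 = 0
        if (c) {
            x = 1;          // x2 = 1
        }
                            // x3 = Φ(x2, x1)   <- start of the join block
        return x + 1;       // y1 = x3 + 1
    }

    public static void main(String[] args) {
        System.out.println(f(true) + " " + f(false)); // 2 1
    }
}
```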
Minimal SSA
Two steps:
—Place Φ-functions
—Rename Variables
We want the minimal number of needed Φ-functions
—Save memory
—Algorithms will work faster
Dominance Property of SSA
Dominance: node D dominates node N if every path from the
start node to N goes through D
(“strictly dominates”: D ≠ N)
- Dominance property of SSA: every definition dominates all of its uses
-> Dominance can be used to efficiently build SSA
Dominance Frontier
DF(D) = the set of all nodes N such that D dominates an immediate predecessor of N but does not strictly dominate N.
Application of DF: Φ-functions are placed in the basic blocks of the dominance frontier of every block containing a definition (iterated until no new blocks are added); see the sketch below.
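A compact sketch that computes dominator sets by fixed-point iteration and then derives the dominance frontier directly from the definition above; the diamond-shaped CFG is made up, and production compilers use faster algorithms (e.g. Lengauer-Tarjan):

```java
import java.util.*;

// Dominators + dominance frontier on a small CFG.
// Nodes: 0 = entry, 1 and 2 = branch arms, 3 = join.
class DomFrontier {
    public static void main(String[] args) {
        int n = 4;
        List<List<Integer>> pred = List.of(List.of(), List.of(0), List.of(0), List.of(1, 2));

        // dom[v] = set of nodes that dominate v.
        // Init: entry dominated only by itself, everything else by all nodes.
        Set<Integer> all = new HashSet<>();
        for (int v = 0; v < n; v++) all.add(v);
        List<Set<Integer>> dom = new ArrayList<>();
        for (int v = 0; v < n; v++) dom.add(v == 0 ? new HashSet<>(Set.of(0)) : new HashSet<>(all));

        // Fixed point: dom(v) = {v} ∪ intersection of dom(p) over all preds p.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int v = 1; v < n; v++) {
                Set<Integer> s = new HashSet<>(all);
                for (int p : pred.get(v)) s.retainAll(dom.get(p));
                s.add(v);
                if (!s.equals(dom.get(v))) { dom.set(v, s); changed = true; }
            }
        }

        // DF(d) = { v | d dominates some pred of v, but does not strictly dominate v }
        for (int d = 0; d < n; d++) {
            Set<Integer> df = new TreeSet<>();
            for (int v = 0; v < n; v++) {
                boolean domsAPred = false;
                for (int p : pred.get(v)) if (dom.get(p).contains(d)) domsAPred = true;
                boolean strictlyDoms = dom.get(v).contains(d) && d != v;
                if (domsAPred && !strictlyDoms) df.add(v);
            }
            System.out.println("DF(" + d + ") = " + df);
        }
        // Prints DF(1) = [3] and DF(2) = [3]: a definition in block 1 or 2
        // forces a Φ in the join block 3, exactly as the card says.
    }
}
```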
top-down parsing
A top-down parser starts with the root of the parse tree, labeled with the start or goal symbol of the grammar.
SSA and Register Allocation
- Idea: remove Φ as late as possible
- Variables in Φ-function never live at the same time!
- > Can be stored in the same register
- > Do register allocation on SSA!
So, don’t remove Φ functions before register allocation