1. Introduction
regular expression
To identify tokens, we need some method of describing the possible tokens that can appear in the input stream. For this purpose we have regular expressions: they are used to describe all the tokens of a programming language.
lexical analyzer
tasks:
1. It reads the input characters and produces a sequence of tokens (output) that the parser uses for syntax analysis.
- When it receives a get-next-token command from the parser, the LA starts reading input characters until it finds the next token (a sketch of this interaction follows below).
- The LA returns to the parser a representation of the token it has formed. The representation will be an integer code if it is a simple construct, such as a parenthesis, comma or colon.
2. It acts as the user interface, e.g.:
- stripping out comments and white space in the form of blank, tab and newline characters from the source code
- correlating error messages from the compiler with the source program
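A minimal sketch of this parser/LA interaction in C; the Token type, the token codes and the name get_next_token are illustrative assumptions, not a fixed API:

    #include <ctype.h>
    #include <stdio.h>

    enum { TOK_NUM, TOK_ID, TOK_EOF };            /* integer codes for tokens */

    typedef struct { int code; char lexeme[64]; } Token;

    /* The LA: skips white space, then reads characters until it has
       formed the next token, and returns it to the parser. */
    static Token get_next_token(void) {
        Token t = { TOK_EOF, "" };
        int c = getchar();
        while (c == ' ' || c == '\t' || c == '\n') c = getchar();
        if (c == EOF) return t;
        int i = 0;
        if (isdigit(c)) {
            t.code = TOK_NUM;
            while (isdigit(c) && i < 63) { t.lexeme[i++] = (char)c; c = getchar(); }
        } else {
            t.code = TOK_ID;
            while (c != EOF && !isspace(c) && i < 63) { t.lexeme[i++] = (char)c; c = getchar(); }
        }
        t.lexeme[i] = '\0';
        if (c != EOF) ungetc(c, stdin);           /* push back the look-ahead char */
        return t;
    }

    /* The "parser": issues a get-next-token command each time it needs one. */
    int main(void) {
        for (Token t = get_next_token(); t.code != TOK_EOF; t = get_next_token())
            printf("token %d: %s\n", t.code, t.lexeme);
        return 0;
    }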
i/p buffering
Lexical analysis would otherwise have to access secondary memory each time to identify tokens, which is time-consuming and costly. So the input string is stored in a buffer and then scanned by the lexical analyser.
The LA scans the input string from left to right, one character at a time, to identify tokens.
It uses Two pointers:
1. Begin pointer (bptr): points to the beginning of the string that is to be read
2. Look-ahead (forward) pointer: moves ahead to search for the end of the token
1) Both pointers start at the beginning of the string, which is stored in the buffer.
2) The look-ahead pointer moves forward through the buffer until the token is found.
3) Once the look-ahead pointer finds a blank space (or a delimiter such as a comma or colon), that position is regarded as the end of the lexeme.
4) After the token is processed, both pointers are set to the start of the next token, and the process continues until the program is finished.
There are two methods:
1. One buffer scheme
Here, only one buffer is used to store the input string. The problem: if a lexeme is very long, it crosses the buffer boundary, so the buffer has to be refilled to scan the rest of the lexeme, which overwrites the first part of the lexeme.
2. Two buffer scheme (sentinels)
Here, two buffers are used, which overcomes the problem of the one-buffer scheme.
The first and the second buffer are filled alternately: when the end of the current buffer is reached, the other buffer is filled.
Initially, both pointers point to the first character of the first buffer. Then the look-ahead pointer moves to the right in search of the end of the lexeme, until a blank character is recognised.
To identify the end of a buffer, a sentinel (a unique character that never occurs inside the source code) is placed at the end of each buffer half (a sketch follows below).
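A minimal sketch of the two-buffer scheme in C, showing only the look-ahead pointer mechanics (the begin pointer is omitted); the half size N, the choice of '\0' as sentinel and the function names are assumptions for illustration:

    #include <stdio.h>

    #define N 4096                     /* size of each buffer half */
    #define SENTINEL '\0'              /* unique char, never inside source code */

    static char buf[2 * (N + 1)];      /* two halves, each ends with a sentinel */
    static char *forward = buf;        /* the look-ahead pointer */
    static FILE *src;

    /* Fill one half with input and terminate it with the sentinel. */
    static void fill(char *half) {
        size_t n = fread(half, 1, N, src);
        half[n] = SENTINEL;
    }

    /* Advance the look-ahead pointer; at a sentinel, either reload the
       other half or report the real end of input. */
    static int next_char(void) {
        int c = (unsigned char)*forward++;
        if (c == SENTINEL) {
            if (forward == buf + N + 1) {              /* end of first half */
                fill(buf + N + 1);                     /* reload second half */
                c = (unsigned char)*forward++;
            } else if (forward == buf + 2 * (N + 1)) { /* end of second half */
                fill(buf);                             /* reload first half */
                forward = buf;
                c = (unsigned char)*forward++;
            } else {
                return EOF;            /* sentinel inside a half: end of input */
            }
            if (c == SENTINEL) return EOF;             /* refill was empty */
        }
        return c;
    }

    int main(void) {
        src = stdin;
        fill(buf);                     /* prime the first half */
        for (int c; (c = next_char()) != EOF; )
            putchar(c);
        return 0;
    }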
Recognition of tokens
Token recognition is the role of LA. It involves identifying and categorising sequences of characters from source code into meaningful units called tokens
steps:
1. Regular expressions
2. lexical rules
3. Scanner or lexer
4. Token categories (illustrative examples below)
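For example, typical lexical rules written as regular expressions (an illustrative sketch of common token categories, not a fixed standard):

    digit    [0-9]
    number   {digit}+
    letter   [a-zA-Z]
    id       {letter}({letter}|{digit})*
    keyword  if|else|while
    relop    <|<=|=|<>|>|>=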
Lex
Lex is a widely used lexical-analyzer generator tool.
It takes as input a set of regular expressions and actions given by the user, and generates C code for a lexical analyzer.
Functions of lex:
1. The user writes a program called the Lex specification file in the Lex language: lex.l. The Lex compiler runs on lex.l and produces a C program, lex.yy.c.
2. lex.yy.c is run through the C compiler and produces an object program, a.out (a.out is a lexical analyzer that transforms an input stream into a sequence of tokens).
lex source program (lex.l) --> Lex compiler --> lex.yy.c
lex.yy.c --> C compiler --> a.out
input stream --> a.out --> sequence of tokens
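In practice the pipeline above corresponds to commands like these (the library flag varies by system: classic lex uses -ll, flex uses -lfl):

    $ lex lex.l            # produces lex.yy.c
    $ cc lex.yy.c -ll      # produces a.out
    $ ./a.out < input.txt  # prints the sequence of tokens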
Structure of Lex
Declarations
%%
Translation rules
%%
Auxiliary functions
Declaration section:
1. Auxiliary declarations
Written in C and enclosed between %{ and %}
- Used to declare functions (auxiliary functions), header files, global variables and constants
2. regular definitions
Written in the form D r, where D is a name (symbol) standing for the regular expression r, e.g. digit [0-9]
Rules in a Lex program:
1. The pattern to be matched.
2. The corresponding action to be executed.
- Patterns are defined using regular expressions
- Actions are specified using C code
e.g.: R1 {Action1}
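A minimal sketch of a complete Lex specification following this structure; the patterns and printf actions are illustrative assumptions:

    %{
    /* Auxiliary declarations: C headers, globals */
    #include <stdio.h>
    %}

    digit   [0-9]
    letter  [a-zA-Z]
    id      {letter}({letter}|{digit})*

    %%
    {digit}+    { printf("NUMBER: %s\n", yytext); }      /* pattern {action} */
    {id}        { printf("IDENTIFIER: %s\n", yytext); }
    [ \t\n]+    { /* strip white space */ }
    .           { printf("UNKNOWN: %s\n", yytext); }
    %%

    /* Auxiliary functions */
    int yywrap(void) { return 1; }

    int main(void) {
        yylex();        /* run the generated lexical analyzer */
        return 0;
    }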
Regular expression
The language accepted by finite automata can be described by simple expressions called regular expressions.
(or)
A regular expression is a sequence of patterns that defines a string.
- Regular expressions are used to denote regular languages
Regular grammar
A grammar is said to be regular if it has rules only of the form A -> a, A -> aB, or A -> ε.
- Where ε is a special symbol denoting the null (empty) string
Regular language
A language is called regular if it can be expressed by a regular expression.
- The operators widely used in regular expressions are union (+), concatenation (.) and Kleene closure (*)
Rules of regular expression:
1. Every letter of Σ can be made into a regular expression
2. If r1 and r2 are regular expressions, then (r1), r1.r2 and r1+r2 are also regular expressions
3. ε, the null string, is itself a regular expression
Operations performed on regular expressions
Union
The union of two regular languages L1 and L2 is represented as L1 ∪ L2.
L1 = (1+0).(1+0) = {00, 10, 11, 01}
L2 = {ε, 100}
L1 ∪ L2 = {ε, 00, 10, 11, 01, 100}
Concatenation
The concatenation of two languages is L1.L2 = {xy | x in L1, y in L2}, e.g. {0, 1}.{00, 11} = {000, 011, 100, 111}
Kleene closure
If L1 is a regular expression, then the Kleene closure L1* of L1 is also a regular expression.
- e.g. (0+1)* is the set of all possible strings over the symbols 0 and 1, including the null string ε
Algebraic properties of regular expressions
Kleene closure (*): unary operator
Union (+), concatenation (.): binary operators
Closure
If r1 and r2 are regular expressions, then:
r1* is a regular expression
r1+r2 is a regular expression
r1.r2 is a regular expression
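Some standard algebraic identities that follow (a sketch; r, s, t stand for arbitrary regular expressions):

    r + s = s + r              (union is commutative)
    (r + s) + t = r + (s + t)  (union is associative)
    (r.s).t = r.(s.t)          (concatenation is associative)
    r.(s + t) = r.s + r.t      (concatenation distributes over union)
    ε.r = r.ε = r              (ε is the identity for concatenation)
    (r*)* = r*                 (closure is idempotent)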
Finite automata
A finite automaton, or finite state machine, is the simplest machine to recognise patterns.
- It has a set of states and rules for moving from one state to another, depending on the applied input symbol.
- It is an abstract machine defined by five elements (a 5-tuple)
machine M = {Q, Σ, q0, f, δ}
where,
Q is a finite set of states
Σ is a set of input symbols
q0 is the initial state
f is the set of final states
δ is the transition function
Finite automata is of two types:
1. Deterministic finite automata (DFA)
2. Nondeterministic finite automata (NFA)
Deterministic finite automata
A finite automaton is said to be deterministic if there is only a single resultant state, i.e. exactly one transition corresponding to each input symbol.
- A DFA is a 5-tuple, represented as
M={Q, Σ, q0, f, δ}
Where,
Q is a non-empty finite set of states
Σ is a non-empty finite set of input symbols
q0 is the initial state
f is the set of final states
δ : Q × Σ -> Q (each (state, input) pair yields exactly one next state)
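A minimal sketch of a DFA in C: the transition function δ becomes a table with exactly one next state per (state, input) pair. The machine itself is an illustrative assumption (it accepts binary strings ending in 01; q2 is the only final state):

    #include <stdio.h>

    /* delta[state][symbol]: exactly one resultant state per input. */
    static const int delta[3][2] = {
        /* q0 */ {1, 0},   /* on 0 -> q1, on 1 -> q0 */
        /* q1 */ {1, 2},   /* on 0 -> q1, on 1 -> q2 */
        /* q2 */ {1, 0},   /* on 0 -> q1, on 1 -> q0 */
    };

    static int accepts(const char *s) {
        int state = 0;                    /* q0 is the initial state */
        for (; *s; s++)
            state = delta[state][*s - '0'];
        return state == 2;                /* q2 is the final state */
    }

    int main(void) {
        printf("%d\n", accepts("1101"));  /* 1: ends in 01 */
        printf("%d\n", accepts("110"));   /* 0: does not   */
        return 0;
    }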
Nondeterministic finite automata
A finite automaton is said to be nondeterministic if there is a possibility of more than one transition from a state on the same input symbol.
M={Q, Σ, q0, f, δ}
Where,
Q is a non-empty finite set of states
Σ is a non-empty finite set of input symbols
q0 is the initial state
f is the set of final states
δ : Q × Σ -> 2^Q (a (state, input) pair can yield a set of possible next states)
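A minimal sketch of simulating δ : Q × Σ -> 2^Q in C, encoding each set of states as a bitmask. The machine is an illustrative NFA that also accepts binary strings ending in 01 (q0 loops and "guesses" where the final 01 starts):

    #include <stdio.h>

    /* delta[state][symbol] is a bitmask encoding a SET of next states. */
    static const unsigned delta[3][2] = {
        /* q0 */ {0x3, 0x1},   /* on 0 -> {q0,q1}, on 1 -> {q0} */
        /* q1 */ {0x0, 0x4},   /* on 1 -> {q2} */
        /* q2 */ {0x0, 0x0},   /* no moves out of q2 */
    };

    static int accepts(const char *s) {
        unsigned cur = 0x1;                     /* start in the set {q0} */
        for (; *s; s++) {
            unsigned next = 0;
            for (int q = 0; q < 3; q++)         /* union over all current states */
                if (cur & (1u << q))
                    next |= delta[q][*s - '0'];
            cur = next;
        }
        return (cur & 0x4) != 0;                /* accept if q2 is reachable */
    }

    int main(void) {
        printf("%d\n", accepts("1101"));        /* 1 */
        printf("%d\n", accepts("10"));          /* 0 */
        return 0;
    }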
Transition diagram/ State transition diagram:
A finite automaton can be represented using a transition graph, which is a directed graph.
rules:
- Each state is drawn as a circle (a node of the graph)
- Each transition δ(q, a) = p is drawn as a directed edge from q to p, labelled with the input symbol a
- The initial state is marked with an incoming arrow
- Final states are drawn as double circles
Regular expression to automata
A regular expression can be converted to an equivalent NFA (e.g. by Thompson's construction), and the NFA can then be converted to a DFA by subset construction.