Chapter 2 Flashcards
What is the role of the lexical analyzer?
Its role is to read a sequence of characters from the source program and produce tokens to be used by the parser.
The stream of tokens is sent to the parser for …
syntax analysis
The lexical analyzer also interacts with the
symbol table
t
The lexical analyzer can also perform … secondary tasks:
– stripping out blanks, tabs, new lines
– stripping out comments
– keeping track of line numbers
– expanding macros (in some lexical analyzers)
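The first three secondary tasks can be sketched in a few lines of Python. This is only an illustration: the function name strip_and_count and the '#' comment syntax are assumptions, not part of any particular language, and the sketch ignores '#' characters inside string literals.

```python
def strip_and_count(source):
    """Yield (line_number, text) for each line that still has code after
    removing '#' comments and surrounding blanks and tabs."""
    for line_number, line in enumerate(source.splitlines(), start=1):
        # Drop everything after '#' (assumed comment marker), then trim blanks/tabs.
        code = line.split("#", 1)[0].strip()
        if code:
            yield line_number, code

sample = "x = 1   # set x\n\n\ty = 2\n# only a comment\nz = x + y"
cleaned = list(strip_and_count(sample))
```

Each surviving piece of code keeps its original line number, which is what lets later phases report errors at the right place.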
… is a rule describing the set of lexemes that represent a token.
pattern
Patterns are usually specified using …
regular expressions
For example, the pattern [a-zA-Z]* matches any sequence of zero or more letters.
What is a lexeme?
A lexeme is a sequence of characters in the source code that matches the pattern for a specific token.
x, distance, count ➔ IDENT. Find the token and the lexemes.
- Token: IDENT (identifier)
- IDENT represents a category of tokens in the programming language.
- It identifies variables, functions, or other entities in the code.
- Lexemes:
- x: represents a variable or some other named entity in the code.
- distance
- count
begin: give lexemes for this token
begin, Begin, BEGIN, beGin
- begin written in any combination of lowercase and capital letters
List common programming tokens
– keywords
– operators
– identifiers
– constants
– literals
– punctuation symbols
What are attributes?
Attributes of tokens are additional pieces of information associated with lexemes that match a particular pattern and are classified as a specific token type.
… provide details about the specific lexeme and its context in the source code.
attributes
Break down the tokens and their attributes: x = y + 2
<id, pointer to symbol-table entry for x>
<assign_op, >
<id, pointer to symbol-table entry for y>
<plus_op, >
<num, integer value 2>
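The breakdown above can be reproduced with a small tokenizer. This is a minimal sketch, not a definitive lexer: the names TOKEN_SPEC and tokenize are hypothetical, the symbol table is a plain list, and a list index stands in for the "pointer to symbol-table entry".

```python
import re

# Token names follow the card above; the patterns are assumptions for this sketch.
TOKEN_SPEC = [
    ("num",       r"[0-9]+"),
    ("id",        r"[a-zA-Z][a-zA-Z0-9]*"),
    ("plus_op",   r"\+"),
    ("assign_op", r"="),
    ("skip",      r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source, symbol_table):
    """Return a list of (token, attribute) pairs for the source string."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"illegal character {source[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue  # blanks and tabs produce no token
        if kind == "id":
            # Attribute: index of the symbol-table entry (stand-in for a pointer).
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))  # attribute: the integer value
        else:
            tokens.append((kind, None))         # operators carry no attribute
    return tokens

table = []
result = tokenize("x = y + 2", table)
```

After the call, table holds the lexemes x and y, and each id token refers to its entry by position.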
… contains information about the token such as the lexeme, the line number in which it was first seen …
Symbol table entry
Break down the tokens and their attributes: E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
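In this example the lexer has to read ** as one exp_op rather than two mult_op tokens. A rough way to see this with Python's re module, where alternatives are tried left to right, is to list the longer operator first. The names OPERATOR and NAMES are assumptions made for this sketch.

```python
import re

# '**' is listed before '*' so the longer operator is tried first.
OPERATOR = re.compile(r"\*\*|\*")
NAMES = {"**": "exp_op", "*": "mult_op"}

operators = [NAMES[op] for op in OPERATOR.findall("E = M * C ** 2")]

# With the alternatives reversed, '**' is split into two '*' lexemes.
wrongly_split = re.findall(r"\*|\*\*", "C ** 2")
```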
If the programmer mistypes while as wihle, the lexical analyzer cannot detect the error (why?)
- wihle still matches the pattern for identifiers, so a valid token is produced, but with an unintended meaning
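A small sketch of why the misspelling slips through: a typical lexer checks a keyword set first and otherwise falls back to the identifier pattern. The keyword set and the function name classify are hypothetical.

```python
import re

KEYWORDS = {"while", "if", "else"}            # hypothetical keyword set
IDENT = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")   # identifier pattern

def classify(lexeme):
    """Return the token category the lexer would assign to the lexeme."""
    if lexeme in KEYWORDS:
        return "keyword"
    if IDENT.fullmatch(lexeme):
        return "id"      # a misspelled keyword lands here without any error
    return "error"

correct = classify("while")
misspelled = classify("wihle")
```

The mistake only surfaces later, typically in the parser, when an identifier appears where a statement keyword was expected.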
What are the types of lexical errors?
- Illegal Characters: Errors occur if the source code contains characters that do not belong to any specified token pattern, for example a ‘?’ symbol when no pattern is defined to recognize it as a valid token.
- Exceeding Length: Errors can occur if identifiers or numeric constants exceed the maximum allowable length defined by the language specification.
- Unterminated Strings or Comments: Errors are detected if strings or comments in the source code are not properly terminated.
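The three error types can be illustrated with a simple scanner over one line of input. This is a sketch under stated assumptions: the legal alphabet, the length limit of 8, and the function name find_lexical_errors are all made up for illustration, and only double-quoted strings are handled.

```python
import re

IDENT = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
LEGAL_CHARS = set(" \t=+-*/();0123456789")   # hypothetical alphabet
MAX_IDENT_LENGTH = 8                          # hypothetical limit

def find_lexical_errors(line):
    """Return a list of (error_kind, position) pairs found in the line."""
    errors = []
    pos = 0
    while pos < len(line):
        ident = IDENT.match(line, pos)
        if line[pos] == '"':
            closing = line.find('"', pos + 1)
            if closing == -1:
                errors.append(("unterminated string", pos))
                break                      # the rest of the line is inside the string
            pos = closing + 1
        elif ident:
            if len(ident.group()) > MAX_IDENT_LENGTH:
                errors.append(("identifier too long", pos))
            pos = ident.end()
        elif line[pos] in LEGAL_CHARS:
            pos += 1
        else:
            errors.append(("illegal character", pos))
            pos += 1
    return errors

illegal = find_lexical_errors("x = y ? 2")
too_long = find_lexical_errors("averyverylongname = 1")
unterminated = find_lexical_errors('s = "abc')
```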
How can lexical errors be handled?
– panic mode recovery
– deleting extraneous characters
– inserting missing characters
– replacing an incorrect character by a correct character
– transposing two adjacent characters
Explain panic mode recovery
In panic mode recovery, the lexical analyzer attempts to recover by skipping (deleting) successive characters from the remaining input until it finds a well-formed token.
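Panic mode recovery can be sketched as a loop that drops one character at a time whenever no token pattern matches at the current position. The token pattern and the function name tokenize_with_panic_mode are assumptions for this illustration; a real lexer would also report each skipped character as an error.

```python
import re

# Assumed token patterns: identifiers, integers, and three operators.
TOKEN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*|[0-9]+|[=+*]")

def tokenize_with_panic_mode(source):
    """Return (lexemes, skipped): the tokens found and the characters deleted."""
    lexemes, skipped = [], []
    pos = 0
    while pos < len(source):
        if source[pos] in " \t":
            pos += 1                      # whitespace is not an error
            continue
        m = TOKEN.match(source, pos)
        if m:
            lexemes.append(m.group())
            pos = m.end()
        else:
            skipped.append(source[pos])   # panic mode: delete and try again
            pos += 1
    return lexemes, skipped

lexemes, skipped = tokenize_with_panic_mode("x = @#y + 2")
```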
… are commonly used to specify the patterns of tokens in programming languages.
Regular expressions
Define a regular expression for identifiers, where an identifier starts with a letter (uppercase or lowercase) followed by zero or more letters or digits.
letter → [a-zA-Z]
digit → [0-9]
identifier → letter (letter | digit)*
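The definition above translates almost directly into Python's re module. The helper name is_identifier is an assumption; fullmatch is used so that the whole string must fit the pattern.

```python
import re

LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"
# identifier -> letter (letter | digit)*
IDENTIFIER = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

def is_identifier(text):
    """True when the whole text is a valid identifier."""
    return IDENTIFIER.fullmatch(text) is not None

checks = {s: is_identifier(s) for s in ["x", "count2", "2count", "a_b", ""]}
```

Only x and count2 are accepted here; 2count starts with a digit, a_b contains a character outside the two classes, and the empty string has no leading letter.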
Regular expressions (RE) are defined using an alphabet of terminal symbols and three fundamental operations: …
alternation, concatenation, and repetition.
Explain alternation (|)
- allows us to specify alternatives between different patterns.
- denoted by the vertical bar | and represents a choice between two or more patterns.
- For example : (a|b) matches either the character ‘a’ or the character ‘b’.
Explain the concatenation operation
- combines two patterns sequentially.
- represents the sequential occurrence of one pattern followed by another.
- denoted by simply juxtaposing two regular expressions.
- Example : (ab) matches the sequence of characters ‘a’ followed by ‘b’.
Explain the repetition operation (*)
- The repetition operation specifies that a pattern can occur zero or more times consecutively.
- denoted by appending an asterisk * to the regular expression.
- Example: (a)* matches zero or more occurrences of the character ‘a’.
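The three fundamental operations can be tried out with Python's re module, whose syntax for |, juxtaposition, and * agrees with the notation used here. The helper matches is a hypothetical convenience that requires the whole string to match.

```python
import re

def matches(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

alternation = [matches(r"a|b", s) for s in ["a", "b", "c"]]
concatenation = [matches(r"ab", s) for s in ["ab", "ba", "a"]]
repetition = [matches(r"a*", s) for s in ["", "a", "aaa", "aab"]]
```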
Explain positive closure (+)
- It denotes one or more instances of the pattern.
- The positive closure r+ is equivalent to rr*, which means one occurrence of r followed by zero or more occurrences of r.
Explain zero or one instance (?)
- indicates “zero or one occurrence” of the pattern.
- r? is equivalent to r | λ, where λ represents the empty string.
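Both equivalences (r+ with rr*, and r? with r | λ) can be spot-checked on a few sample strings, taking r to be the single character a. This is a check on examples, not a proof, and the helper matches is hypothetical. In Python's re syntax the empty alternative in (a|) plays the role of λ.

```python
import re

def matches(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["", "a", "aa", "aaa", "b"]

# r+ behaves like rr* on every sample.
plus_same = all(matches(r"a+", s) == matches(r"aa*", s) for s in samples)

# r? behaves like (r | empty string) on every sample.
optional_same = all(matches(r"a?", s) == matches(r"(a|)", s) for s in samples)
```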
Explain character classes
- They allow you to specify a range or list of characters within square brackets [ ].
- [abc] represents the set of characters ‘a’, ‘b’, or ‘c’; it is shorthand for the alternation a|b|c.
- [a-z] represents any lowercase letter from ‘a’ to ‘z’; it is shorthand for the alternation a|b|…|z.
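A quick check that a character class matches exactly one character from its set, and that [abc] agrees with a|b|c on some samples. The helper matches is a hypothetical convenience around Python's re.fullmatch.

```python
import re

def matches(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["a", "b", "c", "d", "ab", ""]
class_same_as_alternation = all(
    matches(r"[abc]", s) == matches(r"a|b|c", s) for s in samples
)
lowercase_range = [matches(r"[a-z]", s) for s in ["q", "Q", "7"]]
```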
Explain any character: the dot . (period)
- matches any single character except a newline (\n).
- Example: in the regular expression a.b, the dot . matches any single character, so a.b would match strings such as:
- “aab”,
- “a1b”, or
- “a@b”,
- where the dot . stands for the single character between ‘a’ and ‘b’.
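The behavior of the dot, including the newline exception, can be confirmed in Python, whose re module follows the same convention by default (that is, without the re.DOTALL flag). The helper matches is hypothetical.

```python
import re

def matches(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

dot_results = {s: matches(r"a.b", s) for s in ["aab", "a1b", "a@b", "ab", "a\nb"]}
```

The first three strings match; ab fails because the dot needs exactly one character, and a\nb fails because the dot does not match a newline.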