PLaC-2.1  2. Symbol specifications and lexical analysis

The notation of tokens is specified by regular expressions.

Token classes:
* keywords (for, class)
* operators and delimiters (+, ==, ;, {)
* identifiers (getSize, maxint)
* literals (42, '\n')

Lexical analysis isolates tokens within a stream of characters and encodes them.
Example (figure: a source text and its encoding as a token stream):
   int count = 0; double sum = 0.0; while (count ...

[Slides PLaC-2.2 through PLaC-2.7 are missing from this extract; PLaC-2.7 presents a naive transformation of syntax diagrams into (possibly non-deterministic) finite state machines, connecting transitions q -> r for the same character a.]

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 207
Objectives: Understand the transformation method
In the lecture:
* Explain the naive idea with a small artificial example
Suggested reading: Kastens / Übersetzerbau, Section 3.2
Assignments: Apply the method, Exercise 6
Questions:
* Why may the naive method yield non-deterministic automata?
--------------------------------------------------------------------------------

PLaC-2.7a  Construction of deterministic finite state machines

Correspondence (figure): a set of nodes mq of the syntax diagram corresponds to a state q of the deterministic finite state machine; two sets of nodes mq and mr that are connected via nodes carrying the character a correspond to a transition q --a--> r.

Construction:
1. Enumerate the nodes; the exit of the diagram gets the number 0.
2. The initial set of nodes m1 contains all nodes that are reachable from the begin of the diagram; m1 represents the initial state 1.
3. Construct new sets of nodes (states) and transitions:
   - choose a state q with node set mq and a character a,
   - consider the set of nodes carrying the character a whose numbers k are in mq,
   - consider all nodes that are directly reachable from those nodes; let mr be the set of their numbers,
   - create a state r for mr and a transition from q to r under a.
4. Repeat step 3 until no new states or transitions can be created.
5. A state q is a final state iff 0 is in mq.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 207a
Objectives: Understand the transformation method
In the lecture:
* Explain the method using floating point numbers of Pascal (PLaC-2.8)
* Recall the method presented in the course "Modellierung"
Suggested reading: Kastens / Übersetzerbau, Section 3.2
Assignments: Apply the method, Exercise 6
Questions:
* Why does the method yield deterministic automata?
--------------------------------------------------------------------------------

PLaC-2.7b  Properties of the transformation

1. Syntax diagrams can express languages more compactly than regular expressions can:
   a regular expression for { a, ab, b }, such as (a ( | b)) | b, needs more than one
   occurrence of a or b; a syntax diagram does not.
2. The FSM resulting from the transformation of PLaC-2.7a may have more states than
   necessary (figure: two states x and y with the same transitions under a are equivalent).
3. There are transformations that minimize the number of states of any FSM.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 207b
Objectives: Understand the transformation method
In the lecture:
* Explain the properties.
* Recall the algorithm.
Suggested reading: Kastens / Übersetzerbau, Section 3.2
Assignments: Apply the method, Exercise 6
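The construction of PLaC-2.7a can be sketched compactly with bit sets. The following C program is an illustrative sketch only, not part of the lecture material; it applies the construction to the syntax diagram for { a, ab, b } of PLaC-2.7b with an assumed node numbering (0 = exit, 1 = the a, 2 = the b reachable after a, 3 = the directly reachable b). All identifiers are assumptions made for this example.

/* Illustrative sketch: the construction of PLaC-2.7a applied to the
 * syntax diagram of { a, ab, b } (PLaC-2.7b).  Node numbering and
 * all identifiers are assumptions made for this example.           */
#include <stdio.h>

#define NNODES 4                 /* nodes 0..3, node 0 = exit        */
#define MAXSTATES 16

typedef unsigned int NodeSet;    /* bit k represents node k          */

static const char label[NNODES]   = { 0, 'a', 'b', 'b' };
static const NodeSet succ[NNODES] = {          /* directly reachable nodes */
    0u,                                        /* exit: no successors      */
    (1u << 2) | (1u << 0),                     /* after the a: b or exit   */
    (1u << 0),                                 /* after that b: exit       */
    (1u << 0)                                  /* after the direct b: exit */
};
static const NodeSet initial = (1u << 1) | (1u << 3);  /* reachable from the begin */

static NodeSet states[MAXSTATES];
static int nstates;

static int state_of(NodeSet m)   /* find or create the state for node set m */
{
    for (int q = 0; q < nstates; q++)
        if (states[q] == m) return q;
    states[nstates] = m;
    return nstates++;
}

int main(void)
{
    state_of(initial);                         /* state 0 = initial state (slide: state 1) */

    /* steps 3 and 4: process states until nothing new appears */
    for (int q = 0; q < nstates; q++) {
        for (int c = 'a'; c <= 'b'; c++) {     /* alphabet of this example */
            NodeSet mr = 0;
            for (int k = 1; k < NNODES; k++)   /* nodes labeled c in mq    */
                if ((states[q] >> k & 1u) && label[k] == (char)c)
                    mr |= succ[k];             /* union of their successors */
            if (mr != 0)
                printf("state %d --%c--> state %d\n", q, c, state_of(mr));
        }
    }
    /* step 5: a state is final iff it contains the exit node 0 */
    for (int q = 0; q < nstates; q++)
        if (states[q] & 1u) printf("state %d is final\n", q);
    return 0;
}

The program prints the transitions 0 --a--> 1, 0 --b--> 2, 1 --b--> 2 and marks states 1 and 2 as final: the deterministic automaton for { a, ab, b }.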
--------------------------------------------------------------------------------
PLaC-2.8  Example: Floating point numbers in Pascal

Syntax diagram (figure), nodes enumerated, exit = 0:
   1: d   2: .   3: d   4: E   5: +   6: -   7: d

Resulting deterministic finite state machine (figure); its states are the node sets
   state 1 = {1}          state 2 = {1, 2, 4}   state 3 = {3}   state 4 = {3, 4, 0}
   state 5 = {5, 6, 7}    state 6 = {7}         state 7 = {7, 0}
with the transitions
   under d:  1->2, 2->2, 3->4, 4->4, 5->7, 6->7, 7->7
   under .:  2->3
   under E:  2->5, 4->5
   under +:  5->6
   under -:  5->6
States 4 and 7 contain the exit node 0 and hence are final.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 208
Objectives: Understand the construction method
In the lecture: The construction process of the previous slide is explained using this example.
--------------------------------------------------------------------------------

PLaC-2.9  Composition of token automata

Construct one finite state machine for each token. Compose them into a single FSM:
* Identify the initial states of the single automata and identical structures evolving from there (transitions with the same character and states).
* Keep the final states of the single automata distinct; they classify the tokens.
* Add automata for comments and irrelevant characters (white space).

Example: the tokens of Lax [Waite, Goos: Compiler Construction] (figure: a composed automaton with states 0 to 20). Character classes abbreviate the transitions:
   a  all characters but *
   c  all characters but * or )
   d  digits
   l  all letters but E
   s  + - * < > ; , ) [ ] ^
   b  blank, tab, newline

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 209
Objectives: Construct a multi-token automaton
In the lecture: Use the example to
* discuss the composition steps,
* introduce the abbreviation by character classes,
* see a non-trivial complete automaton.
Suggested reading: Kastens / Übersetzerbau, Section 3.2
Questions: Describe the notation of Lax tokens and comments in English.
--------------------------------------------------------------------------------

PLaC-2.10  Rule of the longest match

An automaton may contain transitions from final states: when does the automaton stop?

Rule of the longest match:
* The automaton continues as long as there is a transition under the next character.
* After it has stopped, it sets back to the most recently passed final state.
* If no final state has been passed, an error message is issued.

Consequence: Some kinds of tokens have to be separated explicitly.
Check the concrete grammar for tokens that may occur adjacent to each other!

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 210
Objectives: Understand the consequences of the rule
In the lecture:
* Discuss examples for the rule of the longest match.
* Discuss different cases of token separation.
Suggested reading: Kastens / Übersetzerbau, Section 3.2
Questions:
* Point out applications of the rule in the Lax automaton which arose from the composition of sub-automata.
* Which tokens have to be separated by white space?
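A minimal sketch of the rule of the longest match in C, assuming a table-driven automaton for two hypothetical tokens INT = d+ and REAL = d+ . d+ (the tables, names and token codes are assumptions made for this example): the scanner remembers the position of the most recently passed final state and sets back to it when no further transition is possible.

/* Illustrative sketch of the rule of the longest match (PLaC-2.10).
 * The tables describe a hypothetical two-token automaton:
 * INT = d+  and  REAL = d+ . d+ ; all names are assumptions.        */
#include <stdio.h>
#include <ctype.h>

enum { NSTATES = 4, INT = 1, REAL = 2 };

/* delta[state][class]: class 0 = digit, class 1 = '.', -1 = no transition */
static const int delta[NSTATES][2] = {
    /* 0: start            */ {  1, -1 },
    /* 1: d+        (INT)  */ {  1,  2 },
    /* 2: d+ .             */ {  3, -1 },
    /* 3: d+ . d+   (REAL) */ {  3, -1 },
};
static const int token_of[NSTATES] = { 0, INT, 0, REAL };  /* 0 = not final */

static int char_class(char c)
{
    return isdigit((unsigned char)c) ? 0 : (c == '.') ? 1 : -1;
}

/* Scan one token starting at buf[pos].  Returns the position after the
 * token and sets *token, or returns -1 if no final state was passed.  */
static int scan(const char *buf, int pos, int *token)
{
    int state = 0, last_final = -1;
    while (buf[pos] != '\0') {
        int cc = char_class(buf[pos]);
        int next = (cc < 0) ? -1 : delta[state][cc];
        if (next < 0) break;                 /* no transition: stop         */
        state = next;
        pos++;
        if (token_of[state] != 0) {          /* most recently passed final  */
            last_final = pos;
            *token = token_of[state];
        }
    }
    return last_final;                       /* set back to that position   */
}

int main(void)
{
    const char *input = "12.34+12.";
    int token;
    for (int pos = 0; input[pos] != '\0'; ) {
        int end = scan(input, pos, &token);
        if (end < 0) { pos++; continue; }    /* no token here: skip, e.g. '+' */
        printf("token %d: \"%.*s\"\n", token, end - pos, input + pos);
        pos = end;
    }
    return 0;
}

On the input "12.34+12." the sketch delivers "12.34" as one REAL, skips the '+', and after reading "12." sets back to deliver the INT "12"; the trailing '.' yields no token. This is the behaviour that forces explicit separation of some adjacent tokens.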
--------------------------------------------------------------------------------
PLaC-2.11  Scanner: Aspects of implementation

* Runtime is proportional to the number of characters in the program.
* Operations per character must be fast - otherwise the scanner dominates compilation time.
* Table-driven automata are too slow: a loop interprets the table, 2-dimensional array accesses, branches.
* Directly programmed automata are faster; transform the transitions into control flow: sequence, repeat loop, branch (switch).
* Use fast loops for sequences of irrelevant blanks.
* Implementation of character classes: bit patterns or indexing - avoid slow operations on sets of characters.
* Do not copy characters from the input buffer - maintain a pointer into the buffer instead.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 211
Objectives: Runtime efficiency is important
In the lecture:
* Advantages of directly programmed automata; compare to table-driven automata.
Suggested reading: Kastens / Übersetzerbau, Section 3.3
Assignments: Generate directly programmed automata, Exercise 7
Questions:
* Are there advantages for table-driven automata? Check your arguments carefully!
--------------------------------------------------------------------------------

PLaC-2.11b  Characteristics of Input Data

(Figure: profile of the significant numbers of characters contributed by the different token classes.)

W. M. Waite: The Cost of Lexical Analysis. Software: Practice and Experience, 16(5):473-488, May 1986.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 211b
Objectives: Profile how characters contribute to tokens
In the lecture:
* Measurements on occurrences of symbols: single spaces, identifiers, keywords, and sequences of spaces are most frequent. Comments contribute the most characters.
Suggested reading: Kastens / Übersetzerbau, Section 3.3
--------------------------------------------------------------------------------

PLaC-2.12  Identifier module and literal modules

* Uniform interface for all scanner support modules:
  input parameters: pointer to the token text and its length;
  output parameters: syntax code, attribute.
* The identifier module encodes identifier occurrences bijectively (1:1) and recognizes keywords.
  Implementation: hash vector, extensible table, collision lists.
* Literal modules for floating point numbers, integral numbers, strings.
  Variants for the representation in memory: token text; value converted into compiler data; value converted into target data.
  Caution: avoid overflow on conversion! In a cross compiler the compiler representation may differ from the target representation.
* Character string memory: stores strings without limits on their lengths; used by the identifier module and the literal modules.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 212
Objectives: Safe and efficient standard implementations are available
In the lecture:
* Give reasons for the implementation techniques.
* Show different representations of floating point numbers.
* Escape characters in strings need conversion.
Suggested reading: Kastens / Übersetzerbau, Section 3.3
Questions:
* Give examples why the analysis phase needs to know the values of integral literals.
* Give examples for the representation of literals and their conversion.
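As an illustration of the identifier module of PLaC-2.12, the following C sketch uses a hash vector with collision lists; all names and the keyword handling are assumptions made for this example, not the interface of an existing module. Each distinct identifier text is mapped bijectively to a number, and keywords are pre-entered so that the same lookup classifies them.

/* Illustrative sketch of an identifier module (PLaC-2.12): a hash
 * vector with collision lists maps each distinct identifier text
 * bijectively to a number; keywords are pre-entered.  All names are
 * assumptions made for this example.                                */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 1024
enum { TOK_IDENT = 1, TOK_KEYWORD = 2 };      /* assumed syntax codes */

typedef struct Entry {
    char         *text;      /* identifier text, stored once          */
    int           code;      /* bijective encoding: 1, 2, 3, ...      */
    int           syntax;    /* syntax code: identifier or keyword    */
    struct Entry *next;      /* collision list                        */
} Entry;

static Entry *hashvec[HASH_SIZE];
static int    next_code = 1;

static unsigned hash(const char *s, int len)
{
    unsigned h = 0;
    for (int i = 0; i < len; i++) h = h * 31u + (unsigned char)s[i];
    return h % HASH_SIZE;
}

/* Input: pointer to the token text and its length (the scanner does not
 * copy the text).  Output: syntax code and attribute (the encoding).   */
static void identify(const char *text, int len, int syntax_if_new,
                     int *syntax, int *attr)
{
    unsigned h = hash(text, len);
    Entry *e;
    for (e = hashvec[h]; e != NULL; e = e->next)
        if ((int)strlen(e->text) == len && memcmp(e->text, text, (size_t)len) == 0) {
            *syntax = e->syntax; *attr = e->code; return;   /* already known */
        }
    e = malloc(sizeof *e);                    /* enter a new identifier      */
    e->text = malloc((size_t)len + 1);
    memcpy(e->text, text, (size_t)len);
    e->text[len] = '\0';
    e->code   = next_code++;                  /* bijective: the next number  */
    e->syntax = syntax_if_new;
    e->next   = hashvec[h];
    hashvec[h] = e;
    *syntax = e->syntax; *attr = e->code;
}

int main(void)
{
    int syn, attr;
    const char *keywords[] = { "while", "for", "class" };   /* pre-enter keywords */
    for (int i = 0; i < 3; i++)
        identify(keywords[i], (int)strlen(keywords[i]), TOK_KEYWORD, &syn, &attr);

    const char *occurrences[] = { "getSize", "while", "getSize", "maxint" };
    for (int i = 0; i < 4; i++) {
        identify(occurrences[i], (int)strlen(occurrences[i]), TOK_IDENT, &syn, &attr);
        printf("%-8s syntax=%d attr=%d\n", occurrences[i], syn, attr);
    }
    return 0;
}

Both occurrences of getSize receive the same attribute, while the pre-entered while is returned with the keyword syntax code.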
--------------------------------------------------------------------------------
PLaC-2.13  Scanner generators

Scanner generators generate the central function of lexical analysis.

GLA    University of Colorado, Boulder; component of the Eli system
Lex    Unix standard tool
Flex   successor of Lex
Rex    GMD Karlsruhe

Token specification: regular expressions
   GLA              library of precoined specifications; recognizers for some tokens may be programmed
   Lex, Flex, Rex   transitions may be made conditional

Interface:
   GLA              as described in this chapter; cooperates with other Eli components
   Lex, Flex, Rex   actions may be associated with tokens (statement sequences); interface to the parser generator Yacc

Implementation:
   GLA              directly programmed automaton in C
   Lex, Flex, Rex   table-driven automaton in C; Rex alternatively in Modula-2
   Flex, Rex        faster, smaller implementations than those generated by Lex

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 213
Objectives: Know about the most common generators
In the lecture: Explain specific properties mentioned here.
Suggested reading: Kastens / Übersetzerbau, Section 3.4
Assignments: Use GLA and Lex, Exercise 7
--------------------------------------------------------------------------------
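To make the comparison above concrete, here is a minimal sketch of a Flex specification for a few token classes of PLaC-2.1 (keywords, identifiers, literals, delimiters, white space). The rule set and the printed output are illustrative assumptions, not the notation required by GLA or the exercises; in a real front end the actions would return token codes to the parser (the Yacc interface) instead of printing.

%{
/* Illustrative sketch of a Flex specification for a few token classes
 * of PLaC-2.1; token classification is only printed here.            */
#include <stdio.h>
%}
%option noyywrap

DIGIT   [0-9]
LETTER  [A-Za-z]

%%
[ \t\n]+                      { /* white space: irrelevant, no token */ }
"while"|"for"|"class"         { printf("keyword    %s\n", yytext); }
{LETTER}({LETTER}|{DIGIT})*   { printf("identifier %s\n", yytext); }
{DIGIT}+"."{DIGIT}+           { printf("real       %s\n", yytext); }
{DIGIT}+                      { printf("integer    %s\n", yytext); }
"=="|"="|"+"|";"|"{"|"}"      { printf("delimiter  %s\n", yytext); }
.                             { printf("error: unexpected character '%s'\n", yytext); }
%%
int main(void) { yylex(); return 0; }

Flex applies the rule of the longest match and, for matches of equal length, prefers the earlier rule; that is why the keyword rule precedes the identifier rule.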