PLaC-2.1  2. Symbol specifications and lexical analysis

The notation of tokens is specified by regular expressions.

Token classes:
* keywords (for, class)
* operators and delimiters (+, ==, ;, {)
* identifiers (getSize, maxint)
* literals (42, '\n')

Lexical analysis isolates tokens within a stream of characters and encodes them.
Example (figure: a source text and its encoding as a token stream):
   int count = 0; double sum = 0.0; while (count ...

[Slides PLaC-2.2 through PLaC-2.7 are missing from this extract; PLaC-2.7 presents a naive transformation of syntax diagrams into (possibly non-deterministic) finite state machines, connecting transitions q -> r for the same character a.]

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 207
Objectives: Understand the transformation method
In the lecture:
* Explain the naive idea with a small artificial example
Suggested reading: Kastens / Übersetzerbau, Section 3.2
Assignments: Apply the method, Exercise 6
Questions:
* Why may the naive method yield non-deterministic automata?
--------------------------------------------------------------------------------

PLaC-2.7a  Construction of deterministic finite state machines

Correspondence (figure): a set of nodes mq of the syntax diagram corresponds to a state q of the deterministic finite state machine; two sets of nodes mq and mr that are connected via nodes carrying the character a correspond to a transition q --a--> r.

Construction:
1. Enumerate the nodes; the exit of the diagram gets the number 0.
2. The initial set of nodes m1 contains all nodes that are reachable from the begin of the diagram; m1 represents the initial state 1.
3. Construct new sets of nodes (states) and transitions:
   - choose a state q with node set mq and a character a,
   - consider the set of nodes carrying the character a whose numbers k are in mq,
   - consider all nodes that are directly reachable from those nodes; let mr be the set of their numbers,
   - create a state r for mr and a transition from q to r under a.
4. Repeat step 3 until no new states or transitions can be created.
5. A state q is a final state iff 0 is in mq.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 207a
Objectives: Understand the transformation method
In the lecture:
* Explain the method using floating point numbers of Pascal (PLaC-2.8)
* Recall the method presented in the course "Modellierung"
Suggested reading: Kastens / Übersetzerbau, Section 3.2
Assignments: Apply the method, Exercise 6
Questions:
* Why does the method yield deterministic automata?
--------------------------------------------------------------------------------

PLaC-2.7b  Properties of the transformation

1. Syntax diagrams can express languages more compactly than regular expressions can:
   a regular expression for { a, ab, b }, such as (a ( | b)) | b, needs more than one
   occurrence of a or b; a syntax diagram does not.
2. The FSM resulting from the transformation of PLaC-2.7a may have more states than
   necessary (figure: two states x and y with the same transitions under a are equivalent).
3. There are transformations that minimize the number of states of any FSM.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 207b
Objectives: Understand the transformation method
In the lecture:
* Explain the properties.
* Recall the algorithm.
Suggested reading: Kastens / Übersetzerbau, Section 3.2
Assignments: Apply the method, Exercise 6
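The construction of PLaC-2.7a can be sketched compactly with bit sets. The following C program is an illustrative sketch only, not part of the lecture material; it applies the construction to the syntax diagram for { a, ab, b } of PLaC-2.7b with an assumed node numbering (0 = exit, 1 = the a, 2 = the b reachable after a, 3 = the directly reachable b). All identifiers are assumptions made for this example.

/* Illustrative sketch: the construction of PLaC-2.7a applied to the
 * syntax diagram of { a, ab, b } (PLaC-2.7b).  Node numbering and
 * all identifiers are assumptions made for this example.           */
#include <stdio.h>

#define NNODES 4                 /* nodes 0..3, node 0 = exit        */
#define MAXSTATES 16

typedef unsigned int NodeSet;    /* bit k represents node k          */

static const char label[NNODES]   = { 0, 'a', 'b', 'b' };
static const NodeSet succ[NNODES] = {          /* directly reachable nodes */
    0u,                                        /* exit: no successors      */
    (1u << 2) | (1u << 0),                     /* after the a: b or exit   */
    (1u << 0),                                 /* after that b: exit       */
    (1u << 0)                                  /* after the direct b: exit */
};
static const NodeSet initial = (1u << 1) | (1u << 3);  /* reachable from the begin */

static NodeSet states[MAXSTATES];
static int nstates;

static int state_of(NodeSet m)   /* find or create the state for node set m */
{
    for (int q = 0; q < nstates; q++)
        if (states[q] == m) return q;
    states[nstates] = m;
    return nstates++;
}

int main(void)
{
    state_of(initial);                         /* state 0 = initial state (slide: state 1) */

    /* steps 3 and 4: process states until nothing new appears */
    for (int q = 0; q < nstates; q++) {
        for (int c = 'a'; c <= 'b'; c++) {     /* alphabet of this example */
            NodeSet mr = 0;
            for (int k = 1; k < NNODES; k++)   /* nodes labeled c in mq    */
                if ((states[q] >> k & 1u) && label[k] == (char)c)
                    mr |= succ[k];             /* union of their successors */
            if (mr != 0)
                printf("state %d --%c--> state %d\n", q, c, state_of(mr));
        }
    }
    /* step 5: a state is final iff it contains the exit node 0 */
    for (int q = 0; q < nstates; q++)
        if (states[q] & 1u) printf("state %d is final\n", q);
    return 0;
}

The program prints the transitions 0 --a--> 1, 0 --b--> 2, 1 --b--> 2 and marks states 1 and 2 as final: the deterministic automaton for { a, ab, b }.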
--------------------------------------------------------------------------------
PLaC-2.8  Example: Floating point numbers in Pascal

Syntax diagram (figure), nodes enumerated, exit = 0:
   1: d   2: .   3: d   4: E   5: +   6: -   7: d

Resulting deterministic finite state machine (figure); its states are the node sets
   state 1 = {1}          state 2 = {1, 2, 4}   state 3 = {3}   state 4 = {3, 4, 0}
   state 5 = {5, 6, 7}    state 6 = {7}         state 7 = {7, 0}
with the transitions
   under d:  1->2, 2->2, 3->4, 4->4, 5->7, 6->7, 7->7
   under .:  2->3
   under E:  2->5, 4->5
   under +:  5->6
   under -:  5->6
States 4 and 7 contain the exit node 0 and hence are final.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 208
Objectives: Understand the construction method
In the lecture: The construction process of the previous slide is explained using this example.
--------------------------------------------------------------------------------

PLaC-2.9  Composition of token automata

Construct one finite state machine for each token. Compose them into a single FSM:
* Identify the initial states of the single automata and identical structures evolving from there (transitions with the same character and states).
* Keep the final states of the single automata distinct; they classify the tokens.
* Add automata for comments and irrelevant characters (white space).

Example: the tokens of Lax [Waite, Goos: Compiler Construction] (figure: a composed automaton with states 0 to 20). Character classes abbreviate the transitions:
   a  all characters but *
   c  all characters but * or )
   d  digits
   l  all letters but E
   s  + - * < > ; , ) [ ] ^
   b  blank, tab, newline

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 209
Objectives: Construct a multi-token automaton
In the lecture: Use the example to
* discuss the composition steps,
* introduce the abbreviation by character classes,
* see a non-trivial complete automaton.
Suggested reading: Kastens / Übersetzerbau, Section 3.2
Questions: Describe the notation of Lax tokens and comments in English.
--------------------------------------------------------------------------------

PLaC-2.10  Rule of the longest match

An automaton may contain transitions from final states: when does the automaton stop?

Rule of the longest match:
* The automaton continues as long as there is a transition under the next character.
* After it has stopped, it sets back to the most recently passed final state.
* If no final state has been passed, an error message is issued.

Consequence: Some kinds of tokens have to be separated explicitly.
Check the concrete grammar for tokens that may occur adjacent to each other!

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 210
Objectives: Understand the consequences of the rule
In the lecture:
* Discuss examples for the rule of the longest match.
* Discuss different cases of token separation.
Suggested reading: Kastens / Übersetzerbau, Section 3.2
Questions:
* Point out applications of the rule in the Lax automaton which arose from the composition of sub-automata.
* Which tokens have to be separated by white space?
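A minimal sketch of the rule of the longest match in C, assuming a table-driven automaton for two hypothetical tokens INT = d+ and REAL = d+ . d+ (the tables, names and token codes are assumptions made for this example): the scanner remembers the position of the most recently passed final state and sets back to it when no further transition is possible.

/* Illustrative sketch of the rule of the longest match (PLaC-2.10).
 * The tables describe a hypothetical two-token automaton:
 * INT = d+  and  REAL = d+ . d+ ; all names are assumptions.        */
#include <stdio.h>
#include <ctype.h>

enum { NSTATES = 4, INT = 1, REAL = 2 };

/* delta[state][class]: class 0 = digit, class 1 = '.', -1 = no transition */
static const int delta[NSTATES][2] = {
    /* 0: start            */ {  1, -1 },
    /* 1: d+        (INT)  */ {  1,  2 },
    /* 2: d+ .             */ {  3, -1 },
    /* 3: d+ . d+   (REAL) */ {  3, -1 },
};
static const int token_of[NSTATES] = { 0, INT, 0, REAL };  /* 0 = not final */

static int char_class(char c)
{
    return isdigit((unsigned char)c) ? 0 : (c == '.') ? 1 : -1;
}

/* Scan one token starting at buf[pos].  Returns the position after the
 * token and sets *token, or returns -1 if no final state was passed.  */
static int scan(const char *buf, int pos, int *token)
{
    int state = 0, last_final = -1;
    while (buf[pos] != '\0') {
        int cc = char_class(buf[pos]);
        int next = (cc < 0) ? -1 : delta[state][cc];
        if (next < 0) break;                 /* no transition: stop         */
        state = next;
        pos++;
        if (token_of[state] != 0) {          /* most recently passed final  */
            last_final = pos;
            *token = token_of[state];
        }
    }
    return last_final;                       /* set back to that position   */
}

int main(void)
{
    const char *input = "12.34+12.";
    int token;
    for (int pos = 0; input[pos] != '\0'; ) {
        int end = scan(input, pos, &token);
        if (end < 0) { pos++; continue; }    /* no token here: skip, e.g. '+' */
        printf("token %d: \"%.*s\"\n", token, end - pos, input + pos);
        pos = end;
    }
    return 0;
}

On the input "12.34+12." the sketch delivers "12.34" as one REAL, skips the '+', and after reading "12." sets back to deliver the INT "12"; the trailing '.' yields no token. This is the behaviour that forces explicit separation of some adjacent tokens.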
--------------------------------------------------------------------------------
PLaC-2.11  Scanner: Aspects of implementation

* Runtime is proportional to the number of characters in the program.
* Operations per character must be fast - otherwise the scanner dominates compilation time.
* Table-driven automata are too slow: a loop interprets the table, 2-dimensional array accesses, branches.
* Directly programmed automata are faster; transform the transitions into control flow: sequence, repeat loop, branch (switch).
* Use fast loops for sequences of irrelevant blanks.
* Implementation of character classes: bit patterns or indexing - avoid slow operations on sets of characters.
* Do not copy characters from the input buffer - maintain a pointer into the buffer instead.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 211
Objectives: Runtime efficiency is important
In the lecture:
* Advantages of directly programmed automata; compare to table-driven automata.
Suggested reading: Kastens / Übersetzerbau, Section 3.3
Assignments: Generate directly programmed automata, Exercise 7
Questions:
* Are there advantages for table-driven automata? Check your arguments carefully!
--------------------------------------------------------------------------------

PLaC-2.11b  Characteristics of Input Data

(Figure: profile of the significant numbers of characters contributed by the different token classes.)

W. M. Waite: The Cost of Lexical Analysis. Software: Practice and Experience, 16(5):473-488, May 1986.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 211b
Objectives: Profile how characters contribute to tokens
In the lecture:
* Measurements on occurrences of symbols: single spaces, identifiers, keywords, and sequences of spaces are most frequent. Comments contribute the most characters.
Suggested reading: Kastens / Übersetzerbau, Section 3.3
--------------------------------------------------------------------------------

PLaC-2.12  Identifier module and literal modules

* Uniform interface for all scanner support modules:
  input parameters: pointer to the token text and its length;
  output parameters: syntax code, attribute.
* The identifier module encodes identifier occurrences bijectively (1:1) and recognizes keywords.
  Implementation: hash vector, extensible table, collision lists.
* Literal modules for floating point numbers, integral numbers, strings.
  Variants for the representation in memory: token text; value converted into compiler data; value converted into target data.
  Caution: avoid overflow on conversion! In a cross compiler the compiler representation may differ from the target representation.
* Character string memory: stores strings without limits on their lengths; used by the identifier module and the literal modules.

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 212
Objectives: Safe and efficient standard implementations are available
In the lecture:
* Give reasons for the implementation techniques.
* Show different representations of floating point numbers.
* Escape characters in strings need conversion.
Suggested reading: Kastens / Übersetzerbau, Section 3.3
Questions:
* Give examples why the analysis phase needs to know the values of integral literals.
* Give examples for the representation of literals and their conversion.
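As an illustration of the identifier module of PLaC-2.12, the following C sketch uses a hash vector with collision lists; all names and the keyword handling are assumptions made for this example, not the interface of an existing module. Each distinct identifier text is mapped bijectively to a number, and keywords are pre-entered so that the same lookup classifies them.

/* Illustrative sketch of an identifier module (PLaC-2.12): a hash
 * vector with collision lists maps each distinct identifier text
 * bijectively to a number; keywords are pre-entered.  All names are
 * assumptions made for this example.                                */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 1024
enum { TOK_IDENT = 1, TOK_KEYWORD = 2 };      /* assumed syntax codes */

typedef struct Entry {
    char         *text;      /* identifier text, stored once          */
    int           code;      /* bijective encoding: 1, 2, 3, ...      */
    int           syntax;    /* syntax code: identifier or keyword    */
    struct Entry *next;      /* collision list                        */
} Entry;

static Entry *hashvec[HASH_SIZE];
static int    next_code = 1;

static unsigned hash(const char *s, int len)
{
    unsigned h = 0;
    for (int i = 0; i < len; i++) h = h * 31u + (unsigned char)s[i];
    return h % HASH_SIZE;
}

/* Input: pointer to the token text and its length (the scanner does not
 * copy the text).  Output: syntax code and attribute (the encoding).   */
static void identify(const char *text, int len, int syntax_if_new,
                     int *syntax, int *attr)
{
    unsigned h = hash(text, len);
    Entry *e;
    for (e = hashvec[h]; e != NULL; e = e->next)
        if ((int)strlen(e->text) == len && memcmp(e->text, text, (size_t)len) == 0) {
            *syntax = e->syntax; *attr = e->code; return;   /* already known */
        }
    e = malloc(sizeof *e);                    /* enter a new identifier      */
    e->text = malloc((size_t)len + 1);
    memcpy(e->text, text, (size_t)len);
    e->text[len] = '\0';
    e->code   = next_code++;                  /* bijective: the next number  */
    e->syntax = syntax_if_new;
    e->next   = hashvec[h];
    hashvec[h] = e;
    *syntax = e->syntax; *attr = e->code;
}

int main(void)
{
    int syn, attr;
    const char *keywords[] = { "while", "for", "class" };   /* pre-enter keywords */
    for (int i = 0; i < 3; i++)
        identify(keywords[i], (int)strlen(keywords[i]), TOK_KEYWORD, &syn, &attr);

    const char *occurrences[] = { "getSize", "while", "getSize", "maxint" };
    for (int i = 0; i < 4; i++) {
        identify(occurrences[i], (int)strlen(occurrences[i]), TOK_IDENT, &syn, &attr);
        printf("%-8s syntax=%d attr=%d\n", occurrences[i], syn, attr);
    }
    return 0;
}

Both occurrences of getSize receive the same attribute, while the pre-entered while is returned with the keyword syntax code.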
--------------------------------------------------------------------------------
PLaC-2.13  Scanner generators

Scanner generators generate the central function of lexical analysis.

GLA    University of Colorado, Boulder; component of the Eli system
Lex    Unix standard tool
Flex   successor of Lex
Rex    GMD Karlsruhe

Token specification: regular expressions
   GLA              library of precoined specifications; recognizers for some tokens may be programmed
   Lex, Flex, Rex   transitions may be made conditional

Interface:
   GLA              as described in this chapter; cooperates with other Eli components
   Lex, Flex, Rex   actions may be associated with tokens (statement sequences); interface to the parser generator Yacc

Implementation:
   GLA              directly programmed automaton in C
   Lex, Flex, Rex   table-driven automaton in C; Rex alternatively in Modula-2
   Flex, Rex        faster, smaller implementations than those generated by Lex

--------------------------------------------------------------------------------
Lecture Programming Languages and Compilers WS 2011/12 / Slide 213
Objectives: Know about the most common generators
In the lecture: Explain specific properties mentioned here.
Suggested reading: Kastens / Übersetzerbau, Section 3.4
Assignments: Use GLA and Lex, Exercise 7
--------------------------------------------------------------------------------
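To make the comparison above concrete, here is a minimal sketch of a Flex specification for a few token classes of PLaC-2.1 (keywords, identifiers, literals, delimiters, white space). The rule set and the printed output are illustrative assumptions, not the notation required by GLA or the exercises; in a real front end the actions would return token codes to the parser (the Yacc interface) instead of printing.

%{
/* Illustrative sketch of a Flex specification for a few token classes
 * of PLaC-2.1; token classification is only printed here.            */
#include <stdio.h>
%}
%option noyywrap

DIGIT   [0-9]
LETTER  [A-Za-z]

%%
[ \t\n]+                      { /* white space: irrelevant, no token */ }
"while"|"for"|"class"         { printf("keyword    %s\n", yytext); }
{LETTER}({LETTER}|{DIGIT})*   { printf("identifier %s\n", yytext); }
{DIGIT}+"."{DIGIT}+           { printf("real       %s\n", yytext); }
{DIGIT}+                      { printf("integer    %s\n", yytext); }
"=="|"="|"+"|";"|"{"|"}"      { printf("delimiter  %s\n", yytext); }
.                             { printf("error: unexpected character '%s'\n", yytext); }
%%
int main(void) { yylex(); return 0; }

Flex applies the rule of the longest match and, for matches of equal length, prefers the earlier rule; that is why the keyword rule precedes the identifier rule.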