Lexical Analysis
Compiler
Baojian Hua
[email protected]

Compiler
source program -> compiler -> target program

Front and Back Ends
source program -> front end -> IR -> back end -> target program

Front End
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR

Lexical Analyzer
The lexical analyzer translates the source program into a stream of lexical tokens.
Source program: a stream of characters; varies from language to language (ASCII or Unicode, or ...).
Lexical token: a compiler-internal data structure that represents the occurrence of a terminal symbol; varies from compiler to compiler.
Conceptually: character sequence -> lexical analyzer -> token sequence

Example
Recall the min-ML language in code3:
  prog -> decs
  decs -> dec; decs
        |
  dec  -> val id = exp
        | val _ = printInt exp
  exp  -> id
        | num
        | exp + exp
        | true
        | false
        | if (exp) then exp else exp
        | (exp)

Example
  val x = 3;
  val y = 4;
  val z = if (2) then (x) else y;
  val _ = printInt z;
After lexical analysis:
  VAL IDENT(x) ASSIGN INT(3) SEMICOLON
  VAL IDENT(y) ASSIGN INT(4) SEMICOLON
  VAL IDENT(z) ASSIGN IF LPAREN INT(2) RPAREN THEN LPAREN IDENT(x) RPAREN ELSE IDENT(y) SEMICOLON
  VAL UNDERSCORE ASSIGN PRINTINT IDENT(z) SEMICOLON
  EOF
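Concretely, a token stream like this can be represented with a plain SML datatype. The sketch below is only illustrative: the constructor names mirror the token names printed above and are not the course's actual code3 definitions.

datatype token
  = VAL | ASSIGN | SEMICOLON | UNDERSCORE | PRINTINT | EOF
  | IF | THEN | ELSE | LPAREN | RPAREN
  | TRUE | FALSE
  | IDENT of string            (* e.g. IDENT "x" *)
  | INT of int                 (* e.g. INT 3     *)

(* the first declaration, "val x = 3;", becomes: *)
val dec1 = [VAL, IDENT "x", ASSIGN, INT 3, SEMICOLON]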

Lexer Implementation
Options:
  Write a lexer by hand from scratch: boring, error-prone, and too much work (see Dragon Book, section 3.4).
  Use an automatic lexer generator: quick and easy.

Lexer Implementation
declarative specification -> lexer generator -> lexical analyzer

Regular Expressions
How do we specify a lexer? Develop another language: regular expressions.
What's a lexer generator? Another compiler!

Basic Definitions
Alphabet: the character set (say, ASCII or Unicode).
String: a finite sequence of characters drawn from the alphabet.
Language: a set of strings, finite or infinite; say, the C language.

Regular Expression (RE)
Construction by induction:
  each c in the alphabet: the RE c denotes {c}, e.g. a denotes {a}
  the empty string: the RE ε denotes {ε}
  concatenation: for REs M and N, MN is an RE, e.g. (a|b)(c|d) = {ac, ad, bc, bd}
  alternation: for REs M and N, M|N is an RE, e.g. (a|b) = {a, b}
  Kleene closure: for an RE M, M* is an RE, e.g. (a|b)* = {ε, a, aa, b, ab, abb, baa, ...}

Regular Expression
Or more formally:
  e -> ε
     | c
     | e | e
     | e e
     | e*
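The formal grammar maps directly onto an SML datatype. A minimal sketch; the constructor names are mine, not part of sml-lex:

datatype re
  = Eps                  (* the empty string ε     *)
  | Char of char         (* a single character c   *)
  | Cat of re * re       (* concatenation  e e     *)
  | Alt of re * re       (* alternation    e | e   *)
  | Star of re           (* Kleene closure e*      *)

(* (a|b)(c|d), which denotes {ac, ad, bc, bd} *)
val ex = Cat (Alt (Char #"a", Char #"b"), Alt (Char #"c", Char #"d"))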

Example
C's identifier: starts with a letter (_ counts as a letter), followed by zero or more letters or digits:
  (_|a|b|...|z|A|B|...|Z)(_|a|b|...|z|A|B|...|Z|0|...|9)*
Writing this out is really error-prone and tedious.

Syntax Sugar
More syntax sugar:
  [a-z]   == a|b|...|z
  e+      == one or more of e
  e?      == zero or one of e
  "a*"    == the literal two-character string a*
  e{i,j}  == at least i and at most j occurrences of e
  .       == any character except \n
All of these can be translated into the core REs above.
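Each sugared form can be expanded into the core constructors. A hedged sketch over the same datatype (repeated here so the fragment stands alone); the helper names are my own:

datatype re = Eps | Char of char | Cat of re * re | Alt of re * re | Star of re

fun plus e = Cat (e, Star e)                 (* e+ == e e*          *)
fun opt  e = Alt (Eps, e)                    (* e? == zero or one e *)

(* [lo-hi] == lo | ... | hi, assuming lo <= hi *)
fun range (lo, hi) =
  let fun go c =
        if c = Char.ord hi then Char (Char.chr c)
        else Alt (Char (Char.chr c), go (c + 1))
  in go (Char.ord lo) end

(* e{i,j}: at least i and at most j occurrences of e (0 <= i <= j) *)
fun repeat (e, 0, 0) = Eps
  | repeat (e, 0, j) = Cat (opt e, repeat (e, 0, j - 1))
  | repeat (e, i, j) = Cat (e, repeat (e, i - 1, j - 1))

(* "." would similarly expand to the alternation of every alphabet
   character except \n. A simplified identifier RE, [a-z][a-z0-9]*: *)
val ident = Cat (range (#"a", #"z"),
                 Star (Alt (range (#"a", #"z"), range (#"0", #"9"))))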

Example Revisited
C's identifier: starts with a letter (_ counts as a letter), followed by zero or more letters or digits. With the sugar:
  [_a-zA-Z][_a-zA-Z0-9]*
What about the keyword if?

Ambiguous Rule
A single RE is not ambiguous.
But in a language there may be many REs:
  [_a-zA-Z][_a-zA-Z0-9]*
  if
So, for a given string, which RE should match?

Ambiguous Rule
Two conventions:
  Longest match: the regular expression that matches the longest string takes precedence.
  Rule priority: the regular expressions identifying tokens are written down in sequence; if two regular expressions match the same (longest) string, the first one in the sequence takes precedence.
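Both conventions fit in a few lines of code. A hedged sketch: given, for each rule in order, the length of its longest match at the current position (NONE if it does not match), pick the winning rule index.

fun pick (matches : int option list) : int option =
  let
    fun go (_, [], best) = best
      | go (i, NONE :: rest, best) = go (i + 1, rest, best)
      | go (i, SOME len :: rest, best) =
          (case best of
               NONE => go (i + 1, rest, SOME (i, len))
             | SOME (_, bestLen) =>
                 (* longest match wins; on a tie the earlier rule keeps
                    priority, so we only replace the best on a strict ">" *)
                 if len > bestLen then go (i + 1, rest, SOME (i, len))
                 else go (i + 1, rest, best))
  in
    case go (0, matches, NONE) of NONE => NONE | SOME (i, _) => SOME i
  end

(* rules: 0 = "if", 1 = identifier
   on input "iffy": "if" matches 2 chars, the identifier matches 4,
     so pick [SOME 2, SOME 4] = SOME 1   (longest match: Id)
   on input "if": both match 2 chars,
     so pick [SOME 2, SOME 2] = SOME 0   (rule priority: IF) *)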

Lexer Generator History
Lexical analysis was once a performance bottleneck; certainly not true today!
As a result, early research investigated methods for efficient lexical analysis.
While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use.

History: A Long-standing Goal
In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (aka compiler-compiler):
  declarative compiler specification -> compiler

History: Unix and C
In the late 1960s at Bell Labs, Thompson, Ritchie, and others were developing Unix.
A key part of this project was the development of C and a compiler for it.
Thompson, in 1968, proposed the use of finite-state machines for lexical analysis [CACM 11(12), 1968]; read the accompanying paper on the course page.
Lex realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers.

The Lex Tool
The original Lex generated lexers written in C (C in C).
Today every major language has its own lex tool(s):

  sml-lex, ocamllex, JLex, C#lex, ...
Our topic next: sml-lex; the concepts and techniques apply to the other tools.

SML-Lex Specification
A lexical specification consists of 3 parts (yet another programming language):
  User Declarations (plain SML types, values, functions)
  %%
  SML-Lex Definitions (RE abbreviations, special stuff)
  %%
  Rules (association of REs with tokens; each token will be represented in plain SML)

User Declarations
The user can define various values that are available to the action fragments.
Two values must be defined in this section:
  type lexresult: the type of the value returned by each rule action.
  fun eof (): called by the lexer when the end of the input stream is reached (EOF).

SML-Lex Definitions
The user can define regular expression abbreviations:
  digits = [0-9]+;
  letter = [a-zA-Z];
The user can also define multiple lexers (start states) to work together; each is given a unique name:
  %s lex1 lex2 lex3;

Rules
  regularExp => (action);
A rule consists of a pattern and an action:
  the pattern is a regular expression;
  the action is a fragment of ordinary SML code.
Longest match and rule priority are used for disambiguation.
Rules may be prefixed with the list of lexers (start states) that are allowed to use the rule.

Rules
Rule actions can use any value defined in the User Declarations section, including:
  type lexresult: the type of the value returned by each rule action
  val eof : unit -> lexresult: called by the lexer when the end of the input stream is reached
Special variables:
  yytext: the input substring matched by the regular expression
  yypos: the file position of the beginning of the matched string
  continue (): doesn't return a token; recursively calls the lexer

Example #1
(* A language called Toy *)
  prog   -> word prog
         ->
  word   -> symbol
         -> number
  symbol -> [_a-zA-Z][_0-9a-zA-Z]*
  number -> [0-9]+

Example #1
(* Lexer Toy, see the accompanying code for details *)
datatype token
  = Symbol of string * int
  | Number of string * int
exception End
type lexresult = unit
fun eof () = raise End
fun output x = ...
%%
letter = [_a-zA-Z];
digit  = [0-9];
ld     = {letter}|{digit};
symbol = {letter} {ld}*;
number = {digit}+;
%%
{symbol} => (output (Symbol (yytext, yypos)));
{number} => (output (Number (yytext, yypos)));
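To run a lexer like this, sml-lex compiles the specification into an SML structure. A hedged usage sketch, assuming the default structure name Mlex and that the End exception declared above is in scope:

(* makeLexer takes a read function (character count -> string) and
   returns a function producing one lexresult per call; since
   lexresult = unit here and eof raises End, we keep calling the
   lexer until End is raised. *)
val lexer = Mlex.makeLexer (fn n => TextIO.inputN (TextIO.stdIn, n))
fun drive () = (lexer (); drive ()) handle End => ()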

Example #2
(* Expression language with C-style comments, i.e. /* */ *)
  prog -> stms
  stms -> stm; stms
       ->
  stm  -> id = e
       -> print e
  e    -> id
       -> num
       -> e bop e
       -> (e)
  bop  -> + | - | * | /

Sample program:
  x = 4;
  y = 5;
  z = x+y*3;
  print z;

Example #2 in Lex
(* Expression language, see the accompanying code
 * for details.
 * Part 1: user code *)
datatype token
  = Id of string * int
  | Number of string * int
  | Print of string * int
  | Plus of string * int
  | (* all the other tokens *)
exception End
type lexresult = unit
fun eof () = raise End
fun output x = ...

Example #2 in Lex, cont
(* Part 2: lex definitions *)
%%
letter = [_a-zA-Z];
digit  = [0-9];
ld     = {letter}|{digit};
sym    = {letter} {ld}*;
num    = {digit}+;
ws     = [\ \t];
nl     = [\n];
%s COMMENT;

Example #2 in Lex, cont
(* Part 3: rules *)
%%
{ws}  => (continue ());
{nl}  => (continue ());
"+"   => (output (Plus (yytext, yypos)));
"-"   => (output (Minus (yytext, yypos)));
"*"   => (output (Times (yytext, yypos)));
"/"   => (output (Divide (yytext, yypos)));
"("   => (output (Lparen (yytext, yypos)));
")"   => (output (Rparen (yytext, yypos)));
"="   => (output (Assign (yytext, yypos)));
";"   => (output (Semi (yytext, yypos)));

Example #2 in Lex, cont
(* Part 3: rules, cont *)
"print" => (output (Print (yytext, yypos)));
{sym}   => (output (Id (yytext, yypos)));
{num}   => (output (Number (yytext, yypos)));
<INITIAL> "/*" => (YYBEGIN COMMENT; continue ());
<COMMENT> "*/" => (YYBEGIN INITIAL; continue ());
<COMMENT> {nl} => (continue ());
<COMMENT> .    => (continue ());
.       => (error ());

Lex Implementation
Lex accepts regular expressions (along with other directives).
So sml-lex is a compiler: from REs to a lexer.
Internally: RE -> NFA -> DFA -> table-driven algorithm.

Finite-state Automata (FA)
  input string -> M -> {Yes, No}
M = (Σ, S, q0, F, δ):
  Σ:  the input alphabet
  S:  the state set
  q0: the initial state
  F:  the final states
  δ:  the transition function

Transition Functions
  DFA: δ : S × Σ -> S
  NFA: δ : S × (Σ ∪ {ε}) -> ℘(S)   (a set of states)

DFA Example
Which strings of a's and b's are accepted?
(a three-state DFA over {a, b} with start state q0; the diagram is not reproduced here)
Transition function:
  { (q0,a) -> q1, (q0,b) -> q0,
    (q1,a) -> q2, (q1,b) -> q1,
    (q2,a) -> q2, (q2,b) -> q2 }
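The transition-function view translates directly into code. A minimal sketch of running this DFA; since the accepting states are not recoverable here, it assumes q2 is the only final state:

(* the transition function listed above; states are plain ints *)
fun delta (0, #"a") = 1
  | delta (0, #"b") = 0
  | delta (1, #"a") = 2
  | delta (1, #"b") = 1
  | delta (2, #"a") = 2
  | delta (2, #"b") = 2
  | delta _         = raise Fail "character not in the alphabet"

(* run the DFA from q0 and check whether it stops in q2 *)
fun accepts s =
  List.foldl (fn (c, q) => delta (q, c)) 0 (String.explode s) = 2

(* accepts "abab" = true, accepts "bbba" = false *)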

NFA Example
(a two-state NFA over {a, b} with start state q0; the diagram is not reproduced here)
Transition function:
  { (q0,a) -> {q0,q1}, (q0,b) -> {q1},
    (q1,a) -> {},      (q1,b) -> {q0,q1} }

RE -> NFA: Thompson Algorithm
Break the RE down into atoms:
  construct small NFAs directly for the atoms;
  inductively construct larger NFAs from the smaller ones.
Easy to implement: a small recursive algorithm.
There is one NFA fragment (with a single start state and a single accept state) for each form of the RE grammar: ε, c, e1 e2, e1 | e2, and e1* (the fragment diagrams are not reproduced here).
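A hedged sketch of the construction (not sml-lex's actual implementation): states are integers, the NFA is a list of labelled edges, and NONE stands for an ε-edge. The regexp datatype from the earlier sketch is repeated so the fragment stands alone.

datatype re = Eps | Char of char | Cat of re * re | Alt of re * re | Star of re

type state = int
type edge  = state * char option * state     (* NONE labels an ε-edge *)

(* thompson (e, next) returns (edges, start, accept, nextUnusedState) *)
fun thompson (e : re, next : state) : edge list * state * state * state =
  case e of
      Eps    => ([(next, NONE, next + 1)], next, next + 1, next + 2)
    | Char c => ([(next, SOME c, next + 1)], next, next + 1, next + 2)
    | Cat (e1, e2) =>
        let val (d1, s1, a1, n1) = thompson (e1, next)
            val (d2, s2, a2, n2) = thompson (e2, n1)
        in (d1 @ d2 @ [(a1, NONE, s2)], s1, a2, n2) end
    | Alt (e1, e2) =>
        let val (d1, s1, a1, n1) = thompson (e1, next)
            val (d2, s2, a2, n2) = thompson (e2, n1)
            val s = n2 and a = n2 + 1
        in (d1 @ d2 @ [(s, NONE, s1), (s, NONE, s2), (a1, NONE, a), (a2, NONE, a)],
            s, a, n2 + 2)
        end
    | Star e1 =>
        let val (d1, s1, a1, n1) = thompson (e1, next)
            val s = n1 and a = n1 + 1
        in (d1 @ [(s, NONE, s1), (s, NONE, a), (a1, NONE, s1), (a1, NONE, a)],
            s, a, n1 + 2)
        end

(* the NFA for (a|b)* *)
val (edges, start, accept, _) = thompson (Star (Alt (Char #"a", Char #"b")), 0)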

Example
%%
letter = [_a-zA-Z];
digit  = [0-9];
id     = {letter} ({letter}|{digit})*;
%%
if   => (IF (yytext, yypos));
{id} => (Id (yytext, yypos));
(* Equivalent to the single RE:
 *   if | {id}
 *)

NFA -> DFA: Subset Construction Algorithm
(* subset construction: workList algorithm *)
q0 <- ε-closure (n0)
Q <- {q0}
workList <- {q0}
while (workList != ∅)
  remove q from workList
  foreach (character c)
    t <- ε-closure (move (q, c))
    D[q, c] <- t
    if (t ∉ Q)
      add t to Q and to workList

NFA -> DFA: ε-closure
(* ε-closure: fixpoint algorithm *)
(* Dragon Book Fig 3.33 gives a DFS-like algorithm.
 * Here we give a recursive version. (Simpler) *)
X <- ∅
fun eps (t) =
  X <- X ∪ {t}
  foreach (s ∈ one-eps(t))
    if (s ∉ X)
    then eps (s)

NFA -> DFA: ε-closure
(* ε-closure: fixpoint algorithm *)
(* Dragon Book Fig 3.33 gives a DFS-like algorithm.
 * Here we lift the recursive version above to a whole set T. *)
fun e-closure (T) =
  X <- T
  foreach (t ∈ T)
    X <- X ∪ eps (t)

NFA -> DFA: ε-closure
(* ε-closure: fixpoint algorithm *)
(* A BFS-like algorithm. *)
fun e-closure (T) =
  Q <- T
  X <- T
  while (Q not empty)
    q <- deQueue (Q)
    foreach (s ∈ one-eps(q))
      if (s ∉ X)
        enQueue (Q, s)
        X <- X ∪ {s}
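The same pseudocode, rendered as ordinary SML. A minimal sketch, assuming the NFA is given by two functions: oneEps, the targets of a state's ε-edges, and move, the states reachable from a set on a character (both names are mine):

type state = int

(* set operations over plain lists; fine for a sketch *)
fun member (x, xs) = List.exists (fn y => y = x) xs
fun add (x, xs) = if member (x, xs) then xs else x :: xs
fun union (xs, ys) = List.foldl add ys xs

(* BFS-like ε-closure, following the last version above *)
fun epsClosure (oneEps : state -> state list) (ts : state list) : state list =
  let
    fun loop ([], seen) = seen
      | loop (q :: queue, seen) =
          let val new = List.filter (fn s => not (member (s, seen))) (oneEps q)
          in loop (queue @ new, union (new, seen)) end
  in loop (ts, ts) end

(* workList subset construction: returns the DFA start state (a set of
   NFA states) and the transition entries D[(q, c)] = t; the empty set
   plays the role of the error state *)
fun subset (oneEps : state -> state list,
            move   : state list * char -> state list,
            alphabet : char list,
            n0 : state) =
  let
    val q0 = epsClosure oneEps [n0]
    fun sameSet (a, b) =
      List.all (fn x => member (x, b)) a andalso List.all (fn x => member (x, a)) b
    fun seen (t, qs) = List.exists (fn q => sameSet (q, t)) qs
    fun loop ([], _, d) = (q0, List.rev d)
      | loop (q :: work, qs, d) =
          let
            fun step (c, (work, qs, d)) =
              let val t = epsClosure oneEps (move (q, c))
                  val d = ((q, c), t) :: d
              in if seen (t, qs) then (work, qs, d)
                 else (t :: work, t :: qs, d)
              end
            val (work, qs, d) = List.foldl step (work, qs, d) alphabet
          in loop (work, qs, d) end
  in loop ([q0], [q0], []) end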

Example
  if   => (IF (yytext, yypos));
  {id} => (Id (yytext, yypos));
(NFA: start state 0 with ε-edges into an "if" branch, states 1-4 with edges labelled i and f, and into an identifier branch, states 5-8 with edges labelled [_a-zA-Z] and [_a-zA-Z0-9]; the diagram itself is not reproduced here)
Running the subset construction:
  q0 = ε-closure({0}) = {0, 1, 5}
  q1 = D[q0, i] = {2, 3, 6, 7, 8}
  q2 = D[q0, _] = {6, 7, 8}          (on _ or any letter other than i)
  q3 = D[q1, f] = {4, 7, 8}
  q4 = D[q1, _] = D[q2, _] = D[q3, _] = D[q4, _] = {7, 8}
The resulting DFA:
  q0 --i--> q1        q0 --letter-i--> q2
  q1 --f--> q3        q1 --ld-f--> q4
  q2 --ld--> q4       q3 --ld--> q4
  q4 --ld--> q4

Table-driven Algorithm
Conceptually, an FA is a directed graph.
Pragmatically, there are many different strategies to encode an FA:
  matrix (adjacency matrix)
  array of lists (adjacency list)
  hash table
  jump table (switch statements)
Tools such as sml-lex and flex each pick a concrete encoding; the choice is a balance between time and space.

Example
  if   => (IF (yytext, yypos));
  {id} => (Id (yytext, yypos));
Transition table:
  state \ char |  i     f     letter-i-f   other
  q0           |  q1    q2    q2           error
  q1           |  q4    q3    q4           error
  q2           |  q4    q4    q4           error
  q3           |  q4    q4    q4           error
  q4           |  q4    q4    q4           error
Action table:
  state  | q0  q1  q2  q3  q4
  action | -   Id  Id  IF  Id
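A sketch of the table-driven driver over exactly this table, using the matrix encoding mentioned above. The character-class mapping below is illustrative (sml-lex's generated tables are more compact); the driver remembers the last accepting state it passed through, which implements longest match.

(* columns of the table: i, f, any other identifier character, other
   (the table does not distinguish letters from digits) *)
datatype class = I | F | LD | OTHER

fun classify c =
  if c = #"i" then I
  else if c = #"f" then F
  else if c = #"_" orelse Char.isAlpha c orelse Char.isDigit c then LD
  else OTHER

(* transition matrix; ~1 encodes the error state *)
val table = Vector.fromList
  [ (* q0 *) [1, 2, 2, ~1],
    (* q1 *) [4, 3, 4, ~1],
    (* q2 *) [4, 4, 4, ~1],
    (* q3 *) [4, 4, 4, ~1],
    (* q4 *) [4, 4, 4, ~1] ]

fun next (q, c) =
  let val col = case classify c of I => 0 | F => 1 | LD => 2 | OTHER => 3
  in List.nth (Vector.sub (table, q), col) end

(* action attached to each accepting state, as in the action table *)
fun action 1 = SOME "Id"
  | action 2 = SOME "Id"
  | action 3 = SOME "IF"
  | action 4 = SOME "Id"
  | action _ = NONE

(* longest match: scan from position i, remember the last accept seen *)
fun scan (s, i) =
  let
    fun go (q, j, last) =
      if j >= String.size s then last
      else
        let val q' = next (q, String.sub (s, j))
        in if q' = ~1 then last
           else go (q', j + 1,
                    case action q' of NONE => last | SOME a => SOME (a, j + 1))
        end
  in go (0, i, NONE) end

(* scan ("if", 0)   = SOME ("IF", 2)
   scan ("iffy", 0) = SOME ("Id", 4)   -- longest match gives Id *)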

DFA Minimization: Hopcroft's Algorithm
Before minimization, the DFA above has states q0 .. q4 with actions q1 -> Id, q2 -> Id, q3 -> IF, q4 -> Id.
q2 and q4 have the same action and the same outgoing transitions, so they are merged into a single state:
  q0 --i--> q1            q0 --letter-i--> {q2, q4}
  q1 --f--> q3            q1 --ld-f--> {q2, q4}
  q3 --ld--> {q2, q4}     {q2, q4} --ld--> {q2, q4}
Minimized action table:
  state  | q0  q1  {q2, q4}  q3
  action | -   Id  Id        IF
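Hopcroft's algorithm is a refined worklist version of this merging idea. As a simpler illustration, here is a hedged sketch of naive partition refinement (Moore-style, not Hopcroft's), assuming the transition function is total over the listed states (so an explicit error state should be included) and that each state carries an optional action:

(* returns a function mapping each state to its block (minimized state) id *)
fun minimize (states : int list, chars : char list,
              trans : int * char -> int, action : int -> string option) =
  let
    (* position of x in xs (assumed to occur) *)
    fun index (x, y :: ys) = if x = y then 0 else 1 + index (x, ys)
      | index (_, [])      = raise Fail "index"
    fun distinct xs =
      List.foldl (fn (x, acc) =>
        if List.exists (fn y => y = x) acc then acc else acc @ [x]) [] xs
    (* initial partition: group states by their action *)
    val acts = distinct (map action states)
    fun initBlock s = index (action s, acts)
    (* one refinement step: a state's signature is its current block
       plus the blocks of all its successors *)
    fun refine blockOf =
      let
        fun sign s = blockOf s :: map (fn c => blockOf (trans (s, c))) chars
        val signs = distinct (map sign states)
      in (fn s => index (sign s, signs), length signs) end
    fun loop (blockOf, n) =
      let val (blockOf', n') = refine blockOf
      in if n' = n then blockOf else loop (blockOf', n') end
  in loop (initBlock, length acts) end

(* For the DFA above, q2 and q4 get the same block: both have action Id
   and identical rows in the transition table, so they merge into the
   single state {q2, q4} of the minimized automaton. *)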

Summary
A lexer:
  input: a stream of characters
  output: a stream of tokens
Writing lexers by hand is boring, so we use a lexer generator: ml-lex.
  RE -> NFA -> DFA -> table-driven algorithm
Moral: don't underestimate your theory classes!
  This is a great application of cool theory developed in mathematics.
  We'll see more cool applications as the course progresses.
