Jeremy Siek, 2 years ago
Parent commit: b8dc7fa64c
3 changed files with 734 additions and 1022 deletions
  1. book.bak  (+573 / -375)
  2. book.bib  (+22 / -1)
  3. book.tex  (+139 / -646)

The diff is not shown because the file is too large.
+ 573 - 375
book.bak


+ 22 - 1
book.bib

@@ -1,4 +1,25 @@
-
+@techreport{Lesk:1975uq,
+	author = {M. E. Lesk and E. Schmidt},
+	date-added = {2007-08-27 13:37:27 -0600},
+	date-modified = {2009-08-25 22:28:17 -0600},
+	institution = {Bell Laboratories},
+	month = {July},
+	title = {Lex - A Lexical Analyzer Generator},
+	year = {1975},
+	Bdsk-File-1 = {YnBsaXN0MDDRAQJccmVsYXRpdmVQYXRoV2xleC5wZGYICxgAAAAAAAABAQAAAAAAAAADAAAAAAAAAAAAAAAAAAAAIA==}}
+
+@incollection{Johnson:1979qy,
+	author = {Stephen C. Johnson},
+	booktitle = {{UNIX} Programmer's Manual},
+	date-added = {2007-08-27 13:19:51 -0600},
+	date-modified = {2007-08-27 13:23:00 -0600},
+	organization = {AT\&T},
+	pages = {353--387},
+	publisher = {Holt, Rinehart, and Winston},
+	title = {YACC: Yet another compiler-compiler},
+	volume = {2},
+	year = {1979}}
+	
 @book{Pierce:2004fk,
 	editor = {Benjamin C. Pierce},
 	publisher = {MIT Press},

+ 139 - 646
book.tex

@@ -499,13 +499,14 @@ perform.\index{subject}{concrete syntax}\index{subject}{abstract
   syntax}\index{subject}{abstract syntax
   tree}\index{subject}{AST}\index{subject}{program}\index{subject}{parse}
 The process of translating from concrete syntax to abstract syntax is
-called \emph{parsing}~\citep{Aho:2006wb}. This book does not cover the
-theory and implementation of parsing.
+called \emph{parsing}~\citep{Aho:2006wb}\python{ and is studied in
+  chapter~\ref{ch:parsing-Lvar}}.
+\racket{This book does not cover the theory and implementation of parsing.}%
 %
 \racket{A parser is provided in the support code for translating from
-  concrete to abstract syntax.}
+  concrete to abstract syntax.}%
 %
-\python{We use Python's \code{ast} module to translate from concrete
+\python{For now we use Python's \code{ast} module to translate from concrete
   to abstract syntax.}
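+%
+\python{For example, a small sketch of this translation (shown only for
+  illustration) is the following, which prints the AST of a one-line
+  program.}
+\if\edition\pythonEd
+\begin{lstlisting}
+import ast
+tree = ast.parse('print(1 + 3)')
+print(ast.dump(tree))
+\end{lstlisting}
+\fi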
 
 ASTs can be represented inside the compiler in many different ways,
@@ -4079,683 +4080,175 @@ all, fast code is useless if it produces incorrect results!
 
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\if\edition\pythonEd
 \chapter{Parsing}
-\label{ch:parsing}
+\label{ch:parsing-Lvar}
 \setcounter{footnote}{0}
 
-The main ideas covered in this \Part{} are
-\begin{description}
-\item[lexical analysis] the identification of tokens (i.e., words)
-  within sequences of characters.
-\item[parsing] the identification of sentence structure within
-  sequences of tokens.
-\end{description}
-
-In general, the syntax of the source code for a language is called its
-\emph{concrete syntax}. The concrete syntax of $P_0$ specifies which
-programs, expressed as sequences of characters, are $P_0$ programs.
-The process of transforming a program written in the concrete syntax
-(a sequence of characters) into an abstract syntax tree is
-traditionally subdivided into two parts: \emph{lexical analysis}
-(often called scanning) and \emph{parsing}. The lexical analysis phase
-translates the sequence of characters into a sequence of
-\emph{tokens}, where each token consists of several characters. The
-parsing phase organizes the tokens into a \emph{parse tree} as
-directed by the grammar of the language and then translates the parse
-tree into an abstract syntax tree.
-
-It is feasible to implement a compiler without doing lexical analysis,
-instead just parsing.  However, scannerless parsers tend to be slower,
-which mattered back when computers were slow, and sometimes still
-matters for very large files.
-
-
-
-%(If you need a refresher on how a context-free grammar specifies a
-%language, read Section 3.1 of~\cite{Appel:2003fk}.)
-
-
-The Python Lex-Yacc tool, abbreviated PLY~\cite{Beazley:fk}, is an
-easy-to-use Python imitation of the original \texttt{lex} and
-\texttt{yacc} C programs. Lex was written by Eric Schmidt and Mike
-Lesk~\cite{Lesk:1975uq} at Bell Labs, and is the standard lexical
-analyzer generator on many Unix systems. 
-%
-%The input to \texttt{lex} is
-%a specification consisting of a list of the kinds of tokens and a
-%regular expression for each.  The output of \texttt{lex} is a program
-%that analyzes a text file, turning it into a sequence of tokens. 
-%
-YACC stands from Yet Another Compiler Compiler and was originally
-written by Stephen C. Johnson at AT\&T~\cite{Johnson:1979qy}.
-%
-%The input to
-%\texttt{yacc} is a context-free grammar together with an action (a
-%chunk of code) for each production. The output of \texttt{yacc} is a
-%program that parses a text file and fires the appropriate actions when
-%a production is applied. 
-%
-The PLY tool combines the functionality of both \texttt{lex} and
-\texttt{yacc}. In this \Part{} we will use the PLY tool to generate
-a lexer and parser for the $P_0$ subset of Python.
-
+\index{subject}{parsing}
+
+
+In this chapter we learn how to use the Lark parser generator to
+translate the concrete syntax of \LangVar{} (a sequence of characters)
+into an abstract syntax tree.  A parser generator takes in a
+specification of the concrete syntax and produces a parser. Even
+though a parser generator does most of the work for us, using one
+properly requires considerable knowledge about parsing algorithms.  In
+particular, we must learn about the specification languages used by
+parser generators and how to deal with ambiguity in our language
+specifications.
+
+The process of parsing is traditionally subdivided into two phases:
+\emph{lexical analysis} (also called scanning) and
+\emph{parsing}. The lexical analysis phase translates the sequence of
+characters into a sequence of \emph{tokens}, that is, words consisting
+of several characters. The parsing phase organizes the tokens into a
+\emph{parse tree} that captures how the tokens were matched by rules
+in the grammar of the language. The reason for the subdivision into
+two phases is to enable the use of a faster but less powerful
+algorithm for lexical analysis and the use of a slower but more
+powerful algorithm for parsing.
+%
+Likewise, parser generators typically come in pairs, with separate
+generators for the lexical analyzer (or lexer for short) and for the
+parser.  A particularly influential pair of generators was
+\texttt{lex} and \texttt{yacc}. The \texttt{lex} generator was written
+by \citet{Lesk:1975uq} at Bell Labs. The \texttt{yacc} generator was
+written by \citet{Johnson:1979qy} at AT\&T and stands for Yet Another
+Compiler Compiler.
+
+The Lark parser generator that we use in this chapter includes both a
+lexical analyzer and a parser. The next section discusses lexical
+analysis and the remainder of the chapter discusses parsing.
 
 
 \section{Lexical analysis}
 \label{sec:lex}
 
-The lexical analyzer turns a sequence of characters (a string) into a
-sequence of tokens. For example, the string
+The lexical analyzers produced by Lark turn a sequence of characters
+(a string) into a sequence of token objects. For example, the string
 \begin{lstlisting}
-'print 1 + 3'
+'print(1 + 3)'
 \end{lstlisting}
-\noindent will be converted into the list of tokens
+\noindent is converted into the following sequence of token objects:
 \begin{lstlisting}
-['print','1','+','3']
+Token('PRINT', 'print')
+Token('LPAR', '(')
+Token('INT', '1')
+Token('PLUS', '+')
+Token('INT', '3')
+Token('RPAR', ')')
+Token('NEWLINE', '\n')
 \end{lstlisting}
-Actually, to be more accurate, each token will contain the token
-\texttt{type} and the token's \texttt{value}, which is the string from
-the input that matched the token.
+Each token includes a field for its \code{type}, such as \code{'INT'},
+and a field for its \code{value}, such as \code{'1'}.
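+
+For illustration, a token object can be constructed and inspected
+directly (a sketch, assuming the \code{lark} package is installed):
+\begin{lstlisting}
+from lark import Token
+
+tok = Token('INT', '1')
+print(tok.type)   # prints INT
+print(tok.value)  # prints 1
+\end{lstlisting}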
 
-With the PLY tool, the types of the tokens must be specified by
-initializing the \texttt{tokens} variable. For example,
-
-\begin{lstlisting}
-tokens = ('PRINT','INT','PLUS')
-\end{lstlisting}
-
-Next we must specify which sequences of characters will map to each
-type of token. We do this using regular expression. The term
-``regular'' comes from ``regular languages'', which are the
-(particularly simple) set of languages that can be recognized by a
+Following in the tradition of \code{lex}, the specification language
+for Lark's lexical analysis generator is one regular expression for
+each type of token. The term \emph{regular} comes from \emph{regular
+languages}, which are the languages that can be recognized by a
 finite automaton. A \emph{regular expression} is a pattern formed of
-the following core elements:
+the following core elements:\index{subject}{regular expression}
 
-\begin{enumerate}
-\item a character, e.g. \texttt{a}. The only string that matches this
-  regular expression is \texttt{a}.
-\item two regular expressions, one followed by the other
-  (concatenation), e.g. \texttt{bc}.  The only string that matches
-  this regular expression is \texttt{bc}.
-\item one regular expression or another (alternation), e.g.
-  \texttt{a|bc}.  Both the string \texttt{'a'} and \texttt{'bc'} would
+\begin{itemize}
+\item A single character, e.g. \code{"a"}. The only string that matches this
+  regular expression is \code{'a'}.
+\item Two regular expressions, one followed by the other
+  (concatenation), e.g. \code{"bc"}.  The only string that matches
+  this regular expression is \code{'bc'}.
+\item One regular expression or another (alternation), e.g.
+  \code{"a|bc"}.  Both the string \code{'a'} and \code{'bc'} would
   be matched by this pattern.
-\item a regular expression repeated zero or more times (Kleene
-  closure), e.g. \texttt{(a|bc)*}.  The string \texttt{'bcabcbc'}
-  would match this pattern, but not \texttt{'bccba'}.
-\item the empty sequence (epsilon)
-\end{enumerate}
-
-The Python support for regular expressions goes beyond the core
-elements and include many other convenient short-hands, for example
-\texttt{+} is for repetition one or more times. If you want to refer
-to the actual character \texttt{+}, use a backslash to escape it.
-Section \href{http://docs.python.org/lib/re-syntax.html}{4.2.1 Regular
-  Expression Syntax} of the Python Library Reference gives an in-depth
-description of the extended regular expressions supported by Python.
-
-Normal Python strings give a special interpretation to backslashes,
-which can interfere with their interpretation as regular expressions.
-To avoid this problem, use Python's raw strings instead of normal
-strings by prefixing the string with an \texttt{r}.  For example, the
-following specifies the regular expression for the \texttt{'PLUS'}
-token.
-
-\begin{lstlisting}
-t_PLUS =   r'\+'
-\end{lstlisting}
-
-\noindent The \lstinline{t_} is a naming convention that PLY uses to know when
-you are defining the regular expression for a token. 
-
-Sometimes you need to do some extra processing for certain kinds of
-tokens.  For example, for the \texttt{INT} token it is nice to convert
-the matched input string into a Python integer. With PLY you can do
-this by defining a function for the token. The function must have the
-regular expression as its documentation string and the body of the
-function should overwrite in the \texttt{value} field of the token.  Here's
-how it would look for the \texttt{INT} token. The \lstinline{\d} regular
-expression stands for any decimal numeral (0-9).
-
-\begin{lstlisting}
-def t_INT(t):
-    r'\d+'
-    try:
-      t.value = int(t.value)
-    except ValueError:
-      print "integer value too large", t.value
-      t.value = 0
-    return t
-\end{lstlisting}
-
-In addition to defining regular expressions for each of the tokens,
-you'll often want to perform special handling of newlines and
-whitespace. The following is the code for counting newlines and for
-telling the lexer to ignore whitespace. (Python has complex rules
-for dealing with whitespace that we'll ignore for now.)
-% (We'll need to reconsider this later to handle Python indentation rules.)
-
-\begin{lstlisting}
-def t_newline(t):
-    r'\n+'
-    t.lexer.lineno += len(t.value)
-
-t_ignore  = ' \t'  
-\end{lstlisting}
-
-If a portion of the input string is not matched by any of the tokens,
-then the lexer calls the error function that you provide. The following
-is an example error function.
-
-\begin{lstlisting}
-def t_error(t):
-    print "Illegal character '%s'" % t.value[0]
-    t.lexer.skip(1)  
-\end{lstlisting}
-
-\noindent Last but not least, you'll need to instruct PLY to generate
-the lexer from your specification with the following code.
-
-\begin{lstlisting}
-import ply.lex as lex
-lex.lex()
-\end{lstlisting}
-
-\noindent Figure~\ref{fig:lex} shows the complete code for an example
-lexer.
-
-\begin{figure}[htbp]
-  \centering
-  \begin{tabular}{|cl}
-&
-\begin{lstlisting}
-tokens = ('PRINT','INT','PLUS')
-
-t_PRINT = r'print'
-
-t_PLUS =   r'\+'
-
-def t_INT(t):
-    r'\d+'
-    try:
-      t.value = int(t.value)
-    except ValueError:
-      print "integer value too large", t.value
-      t.value = 0
-    return t
-
-t_ignore  = ' \t'  
-
-def t_newline(t):
-  r'\n+'
-  t.lexer.lineno += t.value.count("\n")
-
-def t_error(t):
-  print "Illegal character '%s'" % t.value[0]
-  t.lexer.skip(1)
-
-import ply.lex as lex
-lex.lex()
-\end{lstlisting}
-\end{tabular}
-  \caption{Example lexer implemented using the PLY lexer generator.}
-  \label{fig:lex}
-\end{figure}
-
-\begin{exercise}
-  Write a PLY lexer specification for $P_0$ and test it on a few input
-  programs, looking at the output list of tokens to see if they make
-  sense.
-\end{exercise}
-
-%\section{Parsing}
-%\label{sec:parsing}
-
-%Explain LR (shift-reduce parsing).
-%Show an example PLY parser.
-%Explain actions and AST construction.
-%Start symbols.
-%Specifying precedence.
-%Looking at the parser.out file.
-%Debugging shift/reduce and reduce/reduce errors.
-
-%We start with some background on context-free grammars
-%(Section~\ref{sec:cfg}), then discuss how to use PLY to do parsing
-%(Section~\ref{sec:ply-parsing}). 
-
-%, so we
-%discuss the algorithm it uses in Sections \ref{sec:lalr} and
-%\ref{sec:table}. This section concludes with a discussion of using
-%precedence levels to resolve parsing conflicts.
-
-
-\section{Background on CFGs and the $P_0$ grammar. }
-\label{sec:cfg}
-
-A \emph{context-free grammar} (CFG) consists of a set of \emph{rules} (also
-called productions) that describes how to categorize strings of
-various forms. There are two kinds of categories, \emph{terminals} and
-\emph{non-terminals}.  The terminals correspond to the tokens from the
-lexical analysis. Non-terminals are used to categorize different parts
-of a language, such as the distinction between statements and
-expressions in Python and C. The term \emph{symbol} refers to both
-terminals and non-terminals.  A grammar rule has two parts, the
-left-hand side is a non-terminal and the right-hand side is a sequence
-of zero or more symbols. The notation \lstinline{::=} is used to
-separate the left-hand side from the right-hand side. The following is
-a rule that could be used to specify the syntax for an addition
-operator.
-%
-\begin{lstlisting}
-$(1)$ expression ::= expression PLUS expression
-\end{lstlisting}
-%
-This rule says that if a string can be divided into three parts, where
-the first part can be categorized as an expression, the second part is
-the \texttt{PLUS} non-terminal (token), and the third part can be
-categorized as an expression, then the entire string can be
-categorized as an expression.  The next example rule has the
-non-terminal \texttt{INT} on the right-hand side and says that a
-string that is categorized as an integer (by the lexer) can also be
-categorized as an expression.  As is apparent here, a string can be
-categorized by more than one non-terminal.
-\begin{lstlisting}
-$(2)$ expression ::= INT
-\end{lstlisting}
-
-To \emph{parse} a string is to determine how the string can be
-categorized according to a given grammar.  Suppose we have the string
-``\lstinline{1 + 3}''.  Both the \texttt{1} and the \texttt{3} can be
-categorized as expressions using rule $2$.  We can then use rule 1 to
-categorize the entire string as an expression.  A \emph{parse tree} is
-a good way to visualize the parsing process. (You will be tempted to
-confuse parse trees and abstract syntax tress, but the excellent
-students will carefully study the difference to avoid this confusion.)
-A parse tree for ``\lstinline{1 + 3}'' is shown in
-Figure~\ref{fig:parse-tree}. The best way to start drawing a parse
-tree is to first list the tokenized string at the bottom of the page.
-These tokens correspond to terminals and will form the leaves of the
-parse tree. You can then start to categorize non-terminals, or
-sequences of non-terminals, using the parsing rules.  For example, we
-can categorize the integer ``\texttt{1}'' as an expression using rule
-$(2)$, so we create a new node above ``\texttt{1}'', label the node
-with the left-hand side terminal, in this case \texttt{expression},
-and draw a line down from the new node down to ``\texttt{1}''. As an
-optional step, we can record which rule we used in parenthesis after
-the name of the terminal.  We then repeat this process until all of
-the leaves have been connected into a single tree, or until no more
-rules apply.
-
-\begin{figure}[htbp]
-  \centering
-\includegraphics[width=2.5in]{simple-parse-tree}  
-  \caption{The parse tree for ``\texttt{1 + 3}''.}
-  \label{fig:parse-tree}
-\end{figure}
-
-
-There can be more than one parse tree for the same string if the
-grammar is ambiguous. For example, the string ``\texttt{1 + 2 + 3}''
-can be parsed two different ways using rules 1 and 2, as shown in
-Figure~\ref{fig:ambig}. In Section~\ref{sec:precedence} we'll discuss
-ways to avoid ambiguity through the use of precedence levels and
-associativity.
-
-\begin{figure}[htbp]
-  \centering
-\includegraphics[width=5in]{ambig-parse-tree}  
-  \caption{Two parse trees for ``\texttt{1 + 2 + 3}''.}
-  \label{fig:ambig}
-\end{figure}
-
-The process describe above for creating a parse-tree was
-``bottom-up''. We started at the leaves of the tree and then worked
-back up to the root. An alternative way to build parse-trees is the
-``top-down'' \emph{derivation} approach. This approach is not a
-practical way to parse a particular string but it is helpful for
-thinking about all possible strings that are in the language described
-by the grammar.  To perform a derivation, start by drawing a single
-node labeled with the starting non-terminal for the grammar. This is
-often the \texttt{program} non-terminal, but in our case we simply
-have \texttt{expression}. We then select at random any grammar rule
-that has \texttt{expression} on the left-hand side and add new edges
-and nodes to the tree according to the right-hand side of the rule.
-The derivation process then repeats by selecting another non-terminal
-that does not yet have children. Figure~\ref{fig:derivation} shows the
-process of building a parse tree by derivation. A \emph{left-most
-  derivation} is one in which the left-most non-terminal is always
-chosen as the next non-terminal to expand. A \texttt{right-most
-  derivation} is one in which the right-most non-terminal is always
-chosen as the next non-terminal to expand. The derivation in
-Figure~\ref{fig:derivation} is a right-most derivation.
-
-\begin{figure}[htbp]
-  \centering
-\includegraphics[width=5in]{derivation}  
-  \caption{Building a parse-tree by derivation.}
-  \label{fig:derivation}
-\end{figure}
+\item A regular expression repeated zero or more times (Kleene
+  closure), e.g. \code{"(a|bc)*"}.  The string \code{'bcabcbc'}
+  would match this pattern, but not \code{'bccba'}.
+\item The empty sequence.
+\end{itemize}
+Parentheses can be used to control the grouping within a regular
+expression.
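+
+The following sketch uses Python's \code{re} module (not part of Lark;
+shown only to illustrate the core elements):
+\begin{lstlisting}
+import re
+
+# Kleene closure over the alternation of 'a' and 'bc'
+print(re.fullmatch(r'(a|bc)*', 'bcabcbc') is not None)  # True
+print(re.fullmatch(r'(a|bc)*', 'bccba') is not None)    # False
+\end{lstlisting}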
 
+For our convenience, Lark also accepts an extended set of regular
+expressions that are automatically translated into the core regular
+expressions.
 
-For each subset of Python in this course, we will specify which
-language features are in a given subset of Python using context-free
-grammars.  The notation we'll use for grammars is
-\href{http://en.wikipedia.org/wiki/Extended_Backus\%E2\%80\%93Naur_form}{Extended
-  Backus-Naur Form (EBNF)}.  The grammar for $P_0$ is shown in
-Figure~\ref{fig:concrete-P0}.  This notation does not correspond
-exactly to the notation for grammars used by PLY, but it should not be
-too difficult for the reader to figure out the PLY grammar given the
-EBNF grammar.
+\begin{itemize}
+\item Match one of a set of characters, for example, \code{[abc]}
+  is equivalent to \code{a|b|c}.
+\item Match one of a range of characters, for example, \code{[a-z]}
+  matches any lowercase letter in the alphabet.
+\item Repetition one or more times, for example, \code{[a-z]+}
+  will match any sequence of one or more lowercase letters,
+  such as \code{'b'} and \code{'bzca'}.
+\item Zero or one occurrences, for example, \code{a?b}  matches
+  both \code{'ab'} and \code{'b'}.
+\item A string, such as \code{"hello"}, which matches itself,
+    that is, \code{'hello'}.
+\end{itemize}
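+
+Again using Python's \code{re} module for illustration:
+\begin{lstlisting}
+import re
+
+print(re.fullmatch(r'[a-z]+', 'bzca') is not None)  # True
+print(re.fullmatch(r'a?b', 'ab') is not None)       # True
+print(re.fullmatch(r'a?b', 'b') is not None)        # True
+\end{lstlisting}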
 
+In a Lark grammar file, each type of token is specified by a name
+followed by a colon and then a regular expression surrounded by
+\code{/} characters. For example, the \code{DIGIT}, \code{INT},
+\code{NEWLINE}, and \code{PRINT} types of tokens are specified in the
+following way.
 
-\begin{figure}[htbp]
-  \centering
-  \begin{tabular}{|cl}
-&
 \begin{lstlisting}
-program ::= module
-module ::= simple_statement+
-simple_statement ::= "print" expression
-                   | name "=" expression
-                   | expression
-expression ::= name
-             | decimalinteger
-             | "-" expression 
-             | expression "+" expression
-             | "(" expression ")"
-             | "input" "(" ")"
+DIGIT: /[0-9]/
+INT: DIGIT+
+NEWLINE: (/\r/? /\n/)+
+PRINT: "print"
 \end{lstlisting}
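+
+\noindent As a sketch of how such token definitions might be used, the
+following loads them into Lark together with an illustrative
+\code{start} rule (not part of the book's grammar) and lexes a small
+string. The \code{lexer='basic'} option enables the \code{lex} method;
+in older versions of Lark this option is spelled \code{'standard'}.
+\begin{lstlisting}
+from lark import Lark
+
+grammar = r"""
+start: PRINT "(" INT ")" NEWLINE
+DIGIT: /[0-9]/
+INT: DIGIT+
+NEWLINE: (/\r/? /\n/)+
+PRINT: "print"
+"""
+parser = Lark(grammar, lexer='basic')
+for tok in parser.lex('print(42)\n'):
+    print(tok.type, repr(tok.value))
+\end{lstlisting}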
-  \end{tabular}
-  \caption{Context-free grammar for the $P_0$ subset of Python.}
-  \label{fig:concrete-P0}
-\end{figure}
-
-
-\section{Generating parser with PLY}
-\label{sec:ply-parsing}
-
-Figure~\ref{fig:parser1} shows an example use of PLY to generate a
-parser. The code specifies a grammar and it specifies actions for each
-rule. For each grammar rule there is a function whose name must begin
-with \lstinline{p_}.  The document string of the function contains the
-specification of the grammar rule.  PLY uses just a colon
-\lstinline{:} instead of the usual \lstinline{::=} to separate the
-left and right-hand sides of a grammar production. The left-hand side
-symbol for the first function (as it appears in the Python file) is
-considered the start symbol.  The body of these functions contains
-code that carries out the action for the production.
-
-Typically, what you want to do in the actions is build an abstract
-syntax tree, as we do here. The parameter \lstinline{t} of the
-function contains the results from the actions that were carried out
-to parse the right-hand side of the production. You can index into
-\lstinline{t} to access these results, starting with \lstinline{t[1]}
-for the first symbol of the right-hand side. To specify the result of
-the current action, assign the result into \lstinline{t[0]}. So, for
-example, in the production \lstinline{expression : INT}, we build a
-\lstinline{Const} node containing an integer that we obtain from
-\lstinline{t[1]}, and we assign the \lstinline{Const} node to
-\lstinline{t[0]}.
 
+\noindent (In Lark, the regular expression operators can be used both
+inside a regular expression, that is, between the \code{/} characters,
+and outside the \code{/} characters, to combine regular expressions.)
 
-\begin{figure}[htbp]
-  \centering
-  \centering
-  \begin{tabular}{|cl}
-&
-\begin{lstlisting}
-from compiler.ast import Printnl, Add, Const
-
-def p_print_statement(t):
-  'statement : PRINT expression'
-  t[0] = Printnl([t[2]], None)
-  
-def p_plus_expression(t):
-  'expression : expression PLUS expression'
-  t[0] = Add((t[1], t[3]))
-
-def p_int_expression(t):
-  'expression : INT'
-  t[0] = Const(t[1])
+\section{Grammars and Parse Trees}
+\label{sec:CFG}
 
-def p_error(t):
-  print "Syntax error at '%s'" % t.value
+In section~\ref{sec:grammar} we learned how to use grammar rules to
+specify the abstract syntax of a language. We now use grammar rules to
+specify the concrete syntax. Recall that each rule has a left-hand
+side and a right-hand side. However, this time each right-hand side
+expresses a pattern to match against a string, instead of matching
+against an abstract syntax tree. In particular, each right-hand side
+is a sequence of \emph{symbols}\index{subject}{symbol}, where a symbol
+is either a terminal or nonterminal. A
+\emph{terminal}\index{subject}{terminal} is either a string or the
+name of a type of token. The nonterminals play the same role as
+before, defining categories of syntax.
 
-import ply.yacc as yacc
-yacc.yacc()
-\end{lstlisting}
-\end{tabular}
-  \caption{First attempt at writing a parser using PLY.}
-  \label{fig:parser1}
-\end{figure}
-
-The PLY parser generator takes your grammar and generates a parser
-that uses the LALR(1) shift-reduce algorithm, which is the most common
-parsing algorithm in use today. LALR(1) stands for Look Ahead
-Left-to-right with Rightmost-derivation and 1 token of lookahead.
-Unfortunately, the LALR(1) algorithm cannot handle all context-free
-grammars, so sometimes you will get error messages from PLY. To understand
-these errors and know how to avoid them, you have to know a little bit
-about the parsing algorithm.
-
-\section{The LALR(1) algorithm}
-\label{sec:lalr}
-
-To understand the error messages of PLY, one needs to understand the
-underlying parsing algorithm.
-%
-The LALR(1) algorithm uses a stack and a finite automata.  Each
-element of the stack is a pair: a state number and a symbol. The
-symbol characterizes the input that has been parsed so-far and the
-state number is used to remember how to proceed once the next
-symbol-worth of input has been parsed.  Each state in the finite
-automata represents where the parser stands in the parsing process
-with respect to certain grammar rules.  Figure~\ref{fig:shift-reduce}
-shows an example LALR(1) parse table generated by PLY for the grammar
-specified in Figure~\ref{fig:parser1}. When PLY generates a parse
-table, it also outputs a textual representation of the parse table to
-the file \texttt{parser.out} which is useful for debugging purposes.
-
-Consider state 1 in Figure~\ref{fig:shift-reduce}. The parser has just
-read in a \lstinline{PRINT} token, so the top of the stack is
-\lstinline{(1,PRINT)}. The parser is part of the way through parsing
-the input according to grammar rule 1, which is signified by showing
-rule 1 with a dot after the PRINT token and before the expression
-non-terminal.  A rule with a dot in it is called an \emph{item}. There
-are several rules that could apply next, both rule 2 and 3, so state 1
-also shows those rules with a dot at the beginning of their right-hand
-sides. The edges between states indicate which transitions the
-automata should make depending on the next input token. So, for
-example, if the next input token is INT then the parser will push INT
-and the target state 4 on the stack and transition to state 4.
-Suppose we are now at the end of the input. In state 4 it says we
-should reduce by rule 3, so we pop from the stack the same number of
-items as the number of symbols in the right-hand side of the rule, in
-this case just one.  We then momentarily jump to the state at the top
-of the stack (state 1) and then follow the goto edge that corresponds
-to the left-hand side of the rule we just reduced by, in this case
-\lstinline{expression}, so we arrive at state 3.  (A slightly longer
-example parse is shown in Figure~\ref{fig:shift-reduce}.)
-
-
-\begin{figure}[htbp]
-  \centering
-\includegraphics[width=5.0in]{shift-reduce-conflict}  
-  \caption{An LALR(1) parse table and a trace of an example run.}
-  \label{fig:shift-reduce}
-\end{figure}
-
-In general, the shift-reduce algorithm works as follows. Look at the
-next input token.
-\begin{itemize}
-\item If there there is a shift edge for the input token, push the
-  edge's target state and the input token on the stack and proceed to
-  the edge's target state.
-\item If there is a reduce action for the input token, pop $k$
-  elements from the stack, where $k$ is the number of symbols in the
-  right-hand side of the rule being reduced. Jump to the state at the
-  top of the stack and then follow the goto edge for the non-terminal
-  that matches the left-hand side of the rule we're reducing by. Push
-  the edge's target state and the non-terminal on the stack.
-\end{itemize}
-
-Notice that in state 6 of Figure~\ref{fig:shift-reduce} there is both
-a shift and a reduce action for the token \lstinline{PLUS}, so the
-algorithm does not know which action to take in this case. When a
-state has both a shift and a reduce action for the same token, we say
-there is a \emph{shift/reduce conflict}.  In this case, the conflict
-will arise, for example, when trying to parse the input
-\lstinline{print 1 + 2 + 3}.  After having consumed
-\lstinline{print 1 + 2} the parser will be in state 6, and it will not 
-know whether to reduce to form an expression of \lstinline{1 + 2}, 
-or whether it should proceed by shifting the next \lstinline{+} from 
-the input.
-
-A similar kind of problem, known as a \emph{reduce/reduce} conflict,
-arises when there are two reduce actions in a state for the same
-token. To understand which grammars gives rise to shift/reduce and
-reduce/reduce conflicts, it helps to know how the parse table is
-generated from the grammar, which we discuss next.
-
-\subsection{Parse table generation}
-\label{sec:table}
-
-The parse table is generated one state at a time. State 0 represents
-the start of the parser. We add the production for the start symbol to
-this state with a dot at the beginning of the right-hand side.  If the
-dot appears immediately before another non-terminal, we add all the
-productions with that non-terminal on the left-hand side. Again, we
-place a dot at the beginning of the right-hand side of each the new
-productions. This process called \emph{state closure} is continued
-until there are no more productions to add. We then examine each item
-in the current state $I$. Suppose an item has the form $A ::=
-\alpha.X\beta$, where $A$ and $X$ are symbols and $\alpha$ and $\beta$
-are sequences of symbols. We create a new state, call it $J$.  If $X$
-is a terminal, we create a shift edge from $I$ to $J$, whereas if $X$
-is a non-terminal, we create a goto edge from $I$ to $J$.  We then
-need to add some items to state $J$. We start by adding all items from
-state $I$ that have the form $B ::= \gamma.X\kappa$ (where $B$ is any
-symbol and $\gamma$ and $\kappa$ are arbitrary sequences of symbols),
-but with the dot moved past the $X$. We then perform state closure on
-$J$.  This process repeats until there are no more states or edges to
-add.
-
-We then mark states as accepting states if they have an item that is
-the start production with a dot at the end.  Also, to add in the
-reduce actions, we look for any state containing an item with a dot at
-the end. Let $n$ be the rule number for this item. We then put a
-reduce $n$ action into that state for every token $Y$. For example, in
-Figure~\ref{fig:shift-reduce} state 4 has an item with a dot at the
-end. We therefore put a reduce by rule 3 action into state 4 for every
-token. (Figure~\ref{fig:shift-reduce} does not show a reduce rule for
-INT in state 4 because this grammar does not allow
-two consecutive INT tokens in the input. We will not go into how this
-can be figured out, but in any event it does no harm to have a reduce
-rule for INT in state 4; it just means the input will be rejected at a
-later point in the parsing process.)
-
-\begin{exercise}
-On a piece of paper, walk through the parse table generation 
-process for the grammar in Figure~\ref{fig:parser1} and check
-your results against Figure~\ref{fig:shift-reduce}. 
-\end{exercise}
-
-
-\subsection{Resolving conflicts with precedence declarations}
-\label{sec:precedence}
+As an example, let us recall the \LangInt{} language, which included
+the following rules for its abstract syntax.
+\begin{align*}
+  \Exp &::= \INT{\Int}\\
+  \Exp &::= \ADD{\Exp}{\Exp}
+\end{align*}
+The corresponding rules for its concrete syntax are as follows. 
+\begin{align}
+  \Exp &::= \code{INT} \label{eq:parse-int}\\
+  \Exp &::= \Exp\; \code{"+"} \; \Exp \label{eq:parse-plus}
+\end{align}
+The rule \eqref{eq:parse-int} says that any string that matches the
+regular expression for \code{INT} can also be categorized, that is, parsed
+as an expression. The rule \eqref{eq:parse-plus} says that any string that
+parses as an expression, followed by the \code{+} character, followed
+by another expression, can itself be parsed as an expression.
+For example, the string \code{'1+3'} is an \Exp{} because
+\code{'1'} and \code{'3'} are both \Exp{} by rule \eqref{eq:parse-int},
+and then rule \eqref{eq:parse-plus} applies to categorize
+\code{'1+3'} as an \Exp{}. We can visualize the application of grammar
+rules to categorize a string using a
+\emph{parse tree}\index{subject}{parse tree}. Each internal node in the tree
+is an application of a grammar rule and the leaf nodes are substrings of the
+input program.
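+
+To make this concrete, the following sketch (using an illustrative
+grammar, not taken from the book) asks Lark to parse the string
+\code{'1+3'} and print its parse tree.
+\begin{lstlisting}
+from lark import Lark
+
+grammar = r"""
+exp: INT
+   | exp "+" exp
+INT: /[0-9]+/
+"""
+parser = Lark(grammar, start='exp')
+print(parser.parse('1+3').pretty())
+\end{lstlisting}
+Each \code{exp} node in the printed tree corresponds to the application
+of one of the two grammar rules.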
 
-To solve the shift/reduce conflict in state 6, we can add the
-following precedence rules, which says addition associates to the left
-and takes precedence over printing. This will cause state 6 to choose
-reduce over shift.
 
-\begin{lstlisting}
-precedence = (
-    ('nonassoc','PRINT'),
-    ('left','PLUS')
-    )
-\end{lstlisting}
 
-In general, the precedence variable should be assigned a tuple of
-tuples. The first element of each inner tuple should be an
-associativity (nonassoc, left, or right) and the rest of the elements
-should be tokens.  The tokens that appear in the same inner tuple have
-the same precedence, whereas tokens that appear in later tuples have a
-higher precedence.  Thus, for the typical precedence for arithmetic
-operations, we would specify the following:
 
-\begin{lstlisting}
-precedence = (
-    ('left','PLUS','MINUS'),
-    ('left','TIMES','DIVIDE')
-    )
-\end{lstlisting}
 
-Figure~\ref{fig:parser-resolved} shows the Python code for generating
-a lexer and parser using PLY.
 
-\begin{figure}[htbp]
-  \centering
-\begin{lstlisting}[basicstyle=\footnotesize\ttfamily]
-# Lexer
-tokens = ('PRINT','INT','PLUS')
-t_PRINT = r'print'
-t_PLUS = r'\+'
-def t_INT(t):
-    r'\d+'
-    try:
-        t.value = int(t.value)
-    except ValueError:
-        print "integer value too large", t.value
-        t.value = 0
-    return t
-t_ignore  = ' \t'  
-def t_newline(t):
-    r'\n+'
-    t.lexer.lineno += t.value.count("\n")
-def t_error(t):
-    print "Illegal character '%s'" % t.value[0]
-    t.lexer.skip(1)
-import ply.lex as lex
-lex.lex()
-# Parser
-from compiler.ast import Printnl, Add, Const
-precedence = (
-    ('nonassoc','PRINT'),
-    ('left','PLUS')
-    )
-def p_print_statement(t):
-    'statement : PRINT expression'
-    t[0] = Printnl([t[2]], None)
-def p_plus_expression(t):
-    'expression : expression PLUS expression'
-    t[0] = Add((t[1], t[3]))
-def p_int_expression(t):
-    'expression : INT'
-    t[0] = Const(t[1])
-def p_error(t):
-    print "Syntax error at '%s'" % t.value
-import ply.yacc as yacc
-yacc.yacc()
-\end{lstlisting}
-  \caption{Example parser with precedence declarations to resolve conflicts.}
-  \label{fig:parser-resolved}
-\end{figure}
 
-\begin{exercise}
-  Write a PLY grammar specification for $P_0$ and update your compiler
-  so that it uses the generated lexer and parser instead of using the
-  parser in the \lstinline{compiler} module. In addition to handling
-  the grammar in Figure~\ref{fig:concrete-P0}, you also need to handle
-  Python-style comments, everything following a \texttt{\#} symbol up
-  to the newline should be ignored.  Perform regression testing on
-  your compiler to make sure that it still passes all of the tests
-  that you created for $P_0$.
-\end{exercise}
 
 
-\fi
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \chapter{Register Allocation}

Some files were not shown because too many files changed in this diff.