Jeremy Siek, 2 years ago
Parent commit: b8dc7fa64c
3 changed files with 734 additions and 1022 deletions
  1. book.bak  (+573 / -375)
  2. book.bib  (+22 / -1)
  3. book.tex  (+139 / -646)

The diff is not shown because the file is too large.
+ 573 - 375
book.bak


+ 22 - 1
book.bib

@@ -1,4 +1,25 @@
-
+@techreport{Lesk:1975uq,
+	author = {M. E. Lesk and E. Schmidt},
+	date-added = {2007-08-27 13:37:27 -0600},
+	date-modified = {2009-08-25 22:28:17 -0600},
+	institution = {Bell Laboratories},
+	month = {July},
+	title = {Lex - A Lexical Analyzer Generator},
+	year = {1975},
+	Bdsk-File-1 = {YnBsaXN0MDDRAQJccmVsYXRpdmVQYXRoV2xleC5wZGYICxgAAAAAAAABAQAAAAAAAAADAAAAAAAAAAAAAAAAAAAAIA==}}
+
+@incollection{Johnson:1979qy,
+	author = {Stephen C. Johnson},
+	booktitle = {{UNIX} Programmer's Manual},
+	date-added = {2007-08-27 13:19:51 -0600},
+	date-modified = {2007-08-27 13:23:00 -0600},
+	organization = {AT\&T},
+	pages = {353--387},
+	publisher = {Holt, Rinehart, and Winston},
+	title = {YACC: Yet another compiler-compiler},
+	volume = {2},
+	year = {1979}}
+	
 @book{Pierce:2004fk,
 	editor = {Benjamin C. Pierce},
 	publisher = {MIT Press},

+ 139 - 646
book.tex

@@ -499,13 +499,14 @@ perform.\index{subject}{concrete syntax}\index{subject}{abstract
   syntax}\index{subject}{abstract syntax
   tree}\index{subject}{AST}\index{subject}{program}\index{subject}{parse}
 The process of translating from concrete syntax to abstract syntax is
-called \emph{parsing}~\citep{Aho:2006wb}. This book does not cover the
-theory and implementation of parsing.
+called \emph{parsing}~\citep{Aho:2006wb}\python{ and is studied in
+  chapter~\ref{ch:parsing-Lvar}}.
+\racket{This book does not cover the theory and implementation of parsing.}%
 %
 \racket{A parser is provided in the support code for translating from
-  concrete to abstract syntax.}
+  concrete to abstract syntax.}%
 %
-\python{We use Python's \code{ast} module to translate from concrete
+\python{For now we use Python's \code{ast} module to translate from concrete
   to abstract syntax.}
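+%
+\python{For example, a small sketch of this translation (shown only for
+  illustration) is the following, which prints the AST of a one-line
+  program.}
+\if\edition\pythonEd
+\begin{lstlisting}
+import ast
+tree = ast.parse('print(1 + 3)')
+print(ast.dump(tree))
+\end{lstlisting}
+\fi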
 
 ASTs can be represented inside the compiler in many different ways,
@@ -4079,683 +4080,175 @@ all, fast code is useless if it produces incorrect results!
 
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\if\edition\pythonEd
 \chapter{Parsing}
-\label{ch:parsing}
+\label{ch:parsing-Lvar}
 \setcounter{footnote}{0}
 
-The main ideas covered in this \Part{} are
-\begin{description}
-\item[lexical analysis] the identification of tokens (i.e., words)
-  within sequences of characters.
-\item[parsing] the identification of sentence structure within
-  sequences of tokens.
-\end{description}
-
-In general, the syntax of the source code for a language is called its
-\emph{concrete syntax}. The concrete syntax of $P_0$ specifies which
-programs, expressed as sequences of characters, are $P_0$ programs.
-The process of transforming a program written in the concrete syntax
-(a sequence of characters) into an abstract syntax tree is
-traditionally subdivided into two parts: \emph{lexical analysis}
-(often called scanning) and \emph{parsing}. The lexical analysis phase
-translates the sequence of characters into a sequence of
-\emph{tokens}, where each token consists of several characters. The
-parsing phase organizes the tokens into a \emph{parse tree} as
-directed by the grammar of the language and then translates the parse
-tree into an abstract syntax tree.
-
-It is feasible to implement a compiler without doing lexical analysis,
-instead just parsing.  However, scannerless parsers tend to be slower,
-which mattered back when computers were slow, and sometimes still
-matters for very large files.
-
-
-
-%(If you need a refresher on how a context-free grammar specifies a
-%language, read Section 3.1 of~\cite{Appel:2003fk}.)
-
-
-The Python Lex-Yacc tool, abbreviated PLY~\cite{Beazley:fk}, is an
-easy-to-use Python imitation of the original \texttt{lex} and
-\texttt{yacc} C programs. Lex was written by Eric Schmidt and Mike
-Lesk~\cite{Lesk:1975uq} at Bell Labs, and is the standard lexical
-analyzer generator on many Unix systems. 
-%
-%The input to \texttt{lex} is
-%a specification consisting of a list of the kinds of tokens and a
-%regular expression for each.  The output of \texttt{lex} is a program
-%that analyzes a text file, turning it into a sequence of tokens. 
-%
-YACC stands from Yet Another Compiler Compiler and was originally
-written by Stephen C. Johnson at AT\&T~\cite{Johnson:1979qy}.
-%
-%The input to
-%\texttt{yacc} is a context-free grammar together with an action (a
-%chunk of code) for each production. The output of \texttt{yacc} is a
-%program that parses a text file and fires the appropriate actions when
-%a production is applied. 
-%
-The PLY tool combines the functionality of both \texttt{lex} and
-\texttt{yacc}. In this \Part{} we will use the PLY tool to generate
-a lexer and parser for the $P_0$ subset of Python.
-
+\index{subject}{parsing}
+
+
+In this chapter we learn how to use the Lark parser generator to
+translate the concrete syntax of \LangVar{} (a sequence of characters)
+into an abstract syntax tree.  A parser generator takes in a
+specification of the concrete syntax and produces a parser. Even
+though a parser generator does most of the work for us, using one
+properly requires considerable knowledge about parsing algorithms.  In
+particular, we must learn about the specification languages used by
+parser generators and how to deal with ambiguity in our language
+specifications.
+
+The process of parsing is traditionally subdivided into two phases:
+\emph{lexical analysis} (also called scanning) and
+\emph{parsing}. The lexical analysis phase translates the sequence of
+characters into a sequence of \emph{tokens}, that is, words consisting
+of several characters. The parsing phase organizes the tokens into a
+\emph{parse tree} that captures how the tokens were matched by rules
+in the grammar of the language. The reason for the subdivision into
+two phases is to enable the use of a faster but less powerful
+algorithm for lexical analysis and the use of a slower but more
+powerful algorithm for parsing.
+%
+Likewise, parser generators typically come in pairs, with separate
+generators for the lexical analyzer (or lexer for short) and for the
+parser.  A particularly influential pair of generators was
+\texttt{lex} and \texttt{yacc}. The \texttt{lex} generator was written
+by \citet{Lesk:1975uq} at Bell Labs. The \texttt{yacc} generator was
+written by \citet{Johnson:1979qy} at AT\&T and stands for Yet Another
+Compiler Compiler.
+
+The Lark parser generator that we use in this chapter includes both a
+lexical analyzer and a parser. The next section discusses lexical
+analysis and the remainder of the chapter discusses parsing.
 
 
 \section{Lexical analysis}
 \label{sec:lex}
 
-The lexical analyzer turns a sequence of characters (a string) into a
-sequence of tokens. For example, the string
+The lexical analyzers produced by Lark turn a sequence of characters
+(a string) into a sequence of token objects. For example, the string
 \begin{lstlisting}
-'print 1 + 3'
+'print(1 + 3)'
 \end{lstlisting}
-\noindent will be converted into the list of tokens
+\noindent is converted into the following sequence of token objects:
 \begin{lstlisting}
-['print','1','+','3']
+Token('PRINT', 'print')
+Token('LPAR', '(')
+Token('INT', '1')
+Token('PLUS', '+')
+Token('INT', '3')
+Token('RPAR', ')')
+Token('NEWLINE', '\n')
 \end{lstlisting}
-Actually, to be more accurate, each token will contain the token
-\texttt{type} and the token's \texttt{value}, which is the string from
-the input that matched the token.
+Each token includes a field for its \code{type}, such as \code{'INT'},
+and a field for its \code{value}, such as \code{'1'}.
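+
+For illustration, a token object can be constructed and inspected
+directly (a sketch, assuming the \code{lark} package is installed):
+\begin{lstlisting}
+from lark import Token
+
+tok = Token('INT', '1')
+print(tok.type)   # prints INT
+print(tok.value)  # prints 1
+\end{lstlisting}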
 
-With the PLY tool, the types of the tokens must be specified by
-initializing the \texttt{tokens} variable. For example,
-
-\begin{lstlisting}
-tokens = ('PRINT','INT','PLUS')
-\end{lstlisting}
-
-Next we must specify which sequences of characters will map to each
-type of token. We do this using regular expression. The term
-``regular'' comes from ``regular languages'', which are the
-(particularly simple) set of languages that can be recognized by a
+Following in the tradition of \code{lex}, the specification language
+for Lark's lexical analysis generator is one regular expression for
+each type of token. The term \emph{regular} comes from \emph{regular
+languages}, which are the languages that can be recognized by a
 finite automaton. A \emph{regular expression} is a pattern formed of
-the following core elements:
+the following core elements:\index{subject}{regular expression}
 
-\begin{enumerate}
-\item a character, e.g. \texttt{a}. The only string that matches this
-  regular expression is \texttt{a}.
-\item two regular expressions, one followed by the other
-  (concatenation), e.g. \texttt{bc}.  The only string that matches
-  this regular expression is \texttt{bc}.
-\item one regular expression or another (alternation), e.g.
-  \texttt{a|bc}.  Both the string \texttt{'a'} and \texttt{'bc'} would
+\begin{itemize}
+\item A single character, e.g. \code{"a"}. The only string that matches this
+  regular expression is \code{'a'}.
+\item Two regular expressions, one followed by the other
+  (concatenation), e.g. \code{"bc"}.  The only string that matches
+  this regular expression is \code{'bc'}.
+\item One regular expression or another (alternation), e.g.
+  \code{"a|bc"}.  Both the string \code{'a'} and \code{'bc'} would
   be matched by this pattern.
-\item a regular expression repeated zero or more times (Kleene
-  closure), e.g. \texttt{(a|bc)*}.  The string \texttt{'bcabcbc'}
-  would match this pattern, but not \texttt{'bccba'}.
-\item the empty sequence (epsilon)
-\end{enumerate}
-
-The Python support for regular expressions goes beyond the core
-elements and include many other convenient short-hands, for example
-\texttt{+} is for repetition one or more times. If you want to refer
-to the actual character \texttt{+}, use a backslash to escape it.
-Section \href{http://docs.python.org/lib/re-syntax.html}{4.2.1 Regular
-  Expression Syntax} of the Python Library Reference gives an in-depth
-description of the extended regular expressions supported by Python.
-
-Normal Python strings give a special interpretation to backslashes,
-which can interfere with their interpretation as regular expressions.
-To avoid this problem, use Python's raw strings instead of normal
-strings by prefixing the string with an \texttt{r}.  For example, the
-following specifies the regular expression for the \texttt{'PLUS'}
-token.
-
-\begin{lstlisting}
-t_PLUS =   r'\+'
-\end{lstlisting}
-
-\noindent The \lstinline{t_} is a naming convention that PLY uses to know when
-you are defining the regular expression for a token. 
-
-Sometimes you need to do some extra processing for certain kinds of
-tokens.  For example, for the \texttt{INT} token it is nice to convert
-the matched input string into a Python integer. With PLY you can do
-this by defining a function for the token. The function must have the
-regular expression as its documentation string and the body of the
-function should overwrite in the \texttt{value} field of the token.  Here's
-how it would look for the \texttt{INT} token. The \lstinline{\d} regular
-expression stands for any decimal numeral (0-9).
-
-\begin{lstlisting}
-def t_INT(t):
-    r'\d+'
-    try:
-      t.value = int(t.value)
-    except ValueError:
-      print "integer value too large", t.value
-      t.value = 0
-    return t
-\end{lstlisting}
-
-In addition to defining regular expressions for each of the tokens,
-you'll often want to perform special handling of newlines and
-whitespace. The following is the code for counting newlines and for
-telling the lexer to ignore whitespace. (Python has complex rules
-for dealing with whitespace that we'll ignore for now.)
-% (We'll need to reconsider this later to handle Python indentation rules.)
-
-\begin{lstlisting}
-def t_newline(t):
-    r'\n+'
-    t.lexer.lineno += len(t.value)
-
-t_ignore  = ' \t'  
-\end{lstlisting}
-
-If a portion of the input string is not matched by any of the tokens,
-then the lexer calls the error function that you provide. The following
-is an example error function.
-
-\begin{lstlisting}
-def t_error(t):
-    print "Illegal character '%s'" % t.value[0]
-    t.lexer.skip(1)  
-\end{lstlisting}
-
-\noindent Last but not least, you'll need to instruct PLY to generate
-the lexer from your specification with the following code.
-
-\begin{lstlisting}
-import ply.lex as lex
-lex.lex()
-\end{lstlisting}
-
-\noindent Figure~\ref{fig:lex} shows the complete code for an example
-lexer.
-
-\begin{figure}[htbp]
-  \centering
-  \begin{tabular}{|cl}
-&
-\begin{lstlisting}
-tokens = ('PRINT','INT','PLUS')
-
-t_PRINT = r'print'
-
-t_PLUS =   r'\+'
-
-def t_INT(t):
-    r'\d+'
-    try:
-      t.value = int(t.value)
-    except ValueError:
-      print "integer value too large", t.value
-      t.value = 0
-    return t
-
-t_ignore  = ' \t'  
-
-def t_newline(t):
-  r'\n+'
-  t.lexer.lineno += t.value.count("\n")
-
-def t_error(t):
-  print "Illegal character '%s'" % t.value[0]
-  t.lexer.skip(1)
-
-import ply.lex as lex
-lex.lex()
-\end{lstlisting}
-\end{tabular}
-  \caption{Example lexer implemented using the PLY lexer generator.}
-  \label{fig:lex}
-\end{figure}
-
-\begin{exercise}
-  Write a PLY lexer specification for $P_0$ and test it on a few input
-  programs, looking at the output list of tokens to see if they make
-  sense.
-\end{exercise}
-
-%\section{Parsing}
-%\label{sec:parsing}
-
-%Explain LR (shift-reduce parsing).
-%Show an example PLY parser.
-%Explain actions and AST construction.
-%Start symbols.
-%Specifying precedence.
-%Looking at the parser.out file.
-%Debugging shift/reduce and reduce/reduce errors.
-
-%We start with some background on context-free grammars
-%(Section~\ref{sec:cfg}), then discuss how to use PLY to do parsing
-%(Section~\ref{sec:ply-parsing}). 
-
-%, so we
-%discuss the algorithm it uses in Sections \ref{sec:lalr} and
-%\ref{sec:table}. This section concludes with a discussion of using
-%precedence levels to resolve parsing conflicts.
-
-
-\section{Background on CFGs and the $P_0$ grammar. }
-\label{sec:cfg}
-
-A \emph{context-free grammar} (CFG) consists of a set of \emph{rules} (also
-called productions) that describes how to categorize strings of
-various forms. There are two kinds of categories, \emph{terminals} and
-\emph{non-terminals}.  The terminals correspond to the tokens from the
-lexical analysis. Non-terminals are used to categorize different parts
-of a language, such as the distinction between statements and
-expressions in Python and C. The term \emph{symbol} refers to both
-terminals and non-terminals.  A grammar rule has two parts, the
-left-hand side is a non-terminal and the right-hand side is a sequence
-of zero or more symbols. The notation \lstinline{::=} is used to
-separate the left-hand side from the right-hand side. The following is
-a rule that could be used to specify the syntax for an addition
-operator.
-%
-\begin{lstlisting}
-$(1)$ expression ::= expression PLUS expression
-\end{lstlisting}
-%
-This rule says that if a string can be divided into three parts, where
-the first part can be categorized as an expression, the second part is
-the \texttt{PLUS} non-terminal (token), and the third part can be
-categorized as an expression, then the entire string can be
-categorized as an expression.  The next example rule has the
-non-terminal \texttt{INT} on the right-hand side and says that a
-string that is categorized as an integer (by the lexer) can also be
-categorized as an expression.  As is apparent here, a string can be
-categorized by more than one non-terminal.
-\begin{lstlisting}
-$(2)$ expression ::= INT
-\end{lstlisting}
-
-To \emph{parse} a string is to determine how the string can be
-categorized according to a given grammar.  Suppose we have the string
-``\lstinline{1 + 3}''.  Both the \texttt{1} and the \texttt{3} can be
-categorized as expressions using rule $2$.  We can then use rule 1 to
-categorize the entire string as an expression.  A \emph{parse tree} is
-a good way to visualize the parsing process. (You will be tempted to
-confuse parse trees and abstract syntax tress, but the excellent
-students will carefully study the difference to avoid this confusion.)
-A parse tree for ``\lstinline{1 + 3}'' is shown in
-Figure~\ref{fig:parse-tree}. The best way to start drawing a parse
-tree is to first list the tokenized string at the bottom of the page.
-These tokens correspond to terminals and will form the leaves of the
-parse tree. You can then start to categorize non-terminals, or
-sequences of non-terminals, using the parsing rules.  For example, we
-can categorize the integer ``\texttt{1}'' as an expression using rule
-$(2)$, so we create a new node above ``\texttt{1}'', label the node
-with the left-hand side terminal, in this case \texttt{expression},
-and draw a line down from the new node down to ``\texttt{1}''. As an
-optional step, we can record which rule we used in parenthesis after
-the name of the terminal.  We then repeat this process until all of
-the leaves have been connected into a single tree, or until no more
-rules apply.
-
-\begin{figure}[htbp]
-  \centering
-\includegraphics[width=2.5in]{simple-parse-tree}  
-  \caption{The parse tree for ``\texttt{1 + 3}''.}
-  \label{fig:parse-tree}
-\end{figure}
-
-
-There can be more than one parse tree for the same string if the
-grammar is ambiguous. For example, the string ``\texttt{1 + 2 + 3}''
-can be parsed two different ways using rules 1 and 2, as shown in
-Figure~\ref{fig:ambig}. In Section~\ref{sec:precedence} we'll discuss
-ways to avoid ambiguity through the use of precedence levels and
-associativity.
-
-\begin{figure}[htbp]
-  \centering
-\includegraphics[width=5in]{ambig-parse-tree}  
-  \caption{Two parse trees for ``\texttt{1 + 2 + 3}''.}
-  \label{fig:ambig}
-\end{figure}
-
-The process describe above for creating a parse-tree was
-``bottom-up''. We started at the leaves of the tree and then worked
-back up to the root. An alternative way to build parse-trees is the
-``top-down'' \emph{derivation} approach. This approach is not a
-practical way to parse a particular string but it is helpful for
-thinking about all possible strings that are in the language described
-by the grammar.  To perform a derivation, start by drawing a single
-node labeled with the starting non-terminal for the grammar. This is
-often the \texttt{program} non-terminal, but in our case we simply
-have \texttt{expression}. We then select at random any grammar rule
-that has \texttt{expression} on the left-hand side and add new edges
-and nodes to the tree according to the right-hand side of the rule.
-The derivation process then repeats by selecting another non-terminal
-that does not yet have children. Figure~\ref{fig:derivation} shows the
-process of building a parse tree by derivation. A \emph{left-most
-  derivation} is one in which the left-most non-terminal is always
-chosen as the next non-terminal to expand. A \texttt{right-most
-  derivation} is one in which the right-most non-terminal is always
-chosen as the next non-terminal to expand. The derivation in
-Figure~\ref{fig:derivation} is a right-most derivation.
-
-\begin{figure}[htbp]
-  \centering
-\includegraphics[width=5in]{derivation}  
-  \caption{Building a parse-tree by derivation.}
-  \label{fig:derivation}
-\end{figure}
+\item A regular expression repeated zero or more times (Kleene
+  closure), e.g. \code{"(a|bc)*"}.  The string \code{'bcabcbc'}
+  would match this pattern, but not \code{'bccba'}.
+\item The empty sequence.
+\end{itemize}
+Parentheses can be used to control the grouping within a regular
+expression.
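+
+The following sketch uses Python's \code{re} module (not part of Lark;
+shown only to illustrate the core elements):
+\begin{lstlisting}
+import re
+
+# Kleene closure over the alternation of 'a' and 'bc'
+print(re.fullmatch(r'(a|bc)*', 'bcabcbc') is not None)  # True
+print(re.fullmatch(r'(a|bc)*', 'bccba') is not None)    # False
+\end{lstlisting}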
 
+For our convenience, Lark also accepts an extended set of regular
+expressions that are automatically translated into the core regular
+expressions.
 
-For each subset of Python in this course, we will specify which
-language features are in a given subset of Python using context-free
-grammars.  The notation we'll use for grammars is
-\href{http://en.wikipedia.org/wiki/Extended_Backus\%E2\%80\%93Naur_form}{Extended
-  Backus-Naur Form (EBNF)}.  The grammar for $P_0$ is shown in
-Figure~\ref{fig:concrete-P0}.  This notation does not correspond
-exactly to the notation for grammars used by PLY, but it should not be
-too difficult for the reader to figure out the PLY grammar given the
-EBNF grammar.
+\begin{itemize}
+\item Match one of a set of characters, for example, \code{[abc]}
+  is equivalent to \code{a|b|c}.
+\item Match one of a range of characters, for example, \code{[a-z]}
+  matches any lowercase letter in the alphabet.
+\item Repetition one or more times, for example, \code{[a-z]+}
+  will match any sequence of one or more lowercase letters,
+  such as \code{'b'} and \code{'bzca'}.
+\item Zero or one occurrences, for example, \code{a?b}  matches
+  both \code{'ab'} and \code{'b'}.
+\item A string, such as \code{"hello"}, which matches itself,
+    that is, \code{'hello'}.
+\end{itemize}
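+
+Again using Python's \code{re} module for illustration:
+\begin{lstlisting}
+import re
+
+print(re.fullmatch(r'[a-z]+', 'bzca') is not None)  # True
+print(re.fullmatch(r'a?b', 'ab') is not None)       # True
+print(re.fullmatch(r'a?b', 'b') is not None)        # True
+\end{lstlisting}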
 
+In a Lark grammar file, each type of token is specified by a name
+followed by a colon and then a regular expression surrounded by
+\code{/} characters. For example, the \code{DIGIT}, \code{INT},
+\code{NEWLINE}, and \code{PRINT} types of tokens are specified in the
+following way.
 
-\begin{figure}[htbp]
-  \centering
-  \begin{tabular}{|cl}
-&
 \begin{lstlisting}
-program ::= module
-module ::= simple_statement+
-simple_statement ::= "print" expression
-                   | name "=" expression
-                   | expression
-expression ::= name
-             | decimalinteger
-             | "-" expression 
-             | expression "+" expression
-             | "(" expression ")"
-             | "input" "(" ")"
+DIGIT: /[0-9]/
+INT: DIGIT+
+NEWLINE: (/\r/? /\n/)+
+PRINT: "print"
 \end{lstlisting}
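+
+\noindent As a sketch of how such token definitions might be used, the
+following loads them into Lark together with an illustrative
+\code{start} rule (not part of the book's grammar) and lexes a small
+string. The \code{lexer='basic'} option enables the \code{lex} method;
+in older versions of Lark this option is spelled \code{'standard'}.
+\begin{lstlisting}
+from lark import Lark
+
+grammar = r"""
+start: PRINT "(" INT ")" NEWLINE
+DIGIT: /[0-9]/
+INT: DIGIT+
+NEWLINE: (/\r/? /\n/)+
+PRINT: "print"
+"""
+parser = Lark(grammar, lexer='basic')
+for tok in parser.lex('print(42)\n'):
+    print(tok.type, repr(tok.value))
+\end{lstlisting}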
-  \end{tabular}
-  \caption{Context-free grammar for the $P_0$ subset of Python.}
-  \label{fig:concrete-P0}
-\end{figure}
-
-
-\section{Generating parser with PLY}
-\label{sec:ply-parsing}
-
-Figure~\ref{fig:parser1} shows an example use of PLY to generate a
-parser. The code specifies a grammar and it specifies actions for each
-rule. For each grammar rule there is a function whose name must begin
-with \lstinline{p_}.  The document string of the function contains the
-specification of the grammar rule.  PLY uses just a colon
-\lstinline{:} instead of the usual \lstinline{::=} to separate the
-left and right-hand sides of a grammar production. The left-hand side
-symbol for the first function (as it appears in the Python file) is
-considered the start symbol.  The body of these functions contains
-code that carries out the action for the production.
-
-Typically, what you want to do in the actions is build an abstract
-syntax tree, as we do here. The parameter \lstinline{t} of the
-function contains the results from the actions that were carried out
-to parse the right-hand side of the production. You can index into
-\lstinline{t} to access these results, starting with \lstinline{t[1]}
-for the first symbol of the right-hand side. To specify the result of
-the current action, assign the result into \lstinline{t[0]}. So, for
-example, in the production \lstinline{expression : INT}, we build a
-\lstinline{Const} node containing an integer that we obtain from
-\lstinline{t[1]}, and we assign the \lstinline{Const} node to
-\lstinline{t[0]}.
 
+\noindent (In Lark, the regular expression operators can be used both
+inside a regular expression, that is, between the \code{/} characters,
+and outside the \code{/} characters, to combine regular expressions.)
 
-\begin{figure}[htbp]
-  \centering
-  \centering
-  \begin{tabular}{|cl}
-&
-\begin{lstlisting}
-from compiler.ast import Printnl, Add, Const
-
-def p_print_statement(t):
-  'statement : PRINT expression'
-  t[0] = Printnl([t[2]], None)
-  
-def p_plus_expression(t):
-  'expression : expression PLUS expression'
-  t[0] = Add((t[1], t[3]))
-
-def p_int_expression(t):
-  'expression : INT'
-  t[0] = Const(t[1])
+\section{Grammars and Parse Trees}
+\label{sec:CFG}
 
-def p_error(t):
-  print "Syntax error at '%s'" % t.value
+In section~\ref{sec:grammar} we learned how to use grammar rules to
+specify the abstract syntax of a language. We now use grammar rules to
+specify the concrete syntax. Recall that each rule has a left-hand
+side and a right-hand side. However, this time each right-hand side
+expresses a pattern to match against a string, instead of matching
+against an abstract syntax tree. In particular, each right-hand side
+is a sequence of \emph{symbols}\index{subject}{symbol}, where a symbol
+is either a terminal or nonterminal. A
+\emph{terminal}\index{subject}{terminal} is either a string or the
+name of a type of token. The nonterminals play the same role as
+before, defining categories of syntax.
 
-import ply.yacc as yacc
-yacc.yacc()
-\end{lstlisting}
-\end{tabular}
-  \caption{First attempt at writing a parser using PLY.}
-  \label{fig:parser1}
-\end{figure}
-
-The PLY parser generator takes your grammar and generates a parser
-that uses the LALR(1) shift-reduce algorithm, which is the most common
-parsing algorithm in use today. LALR(1) stands for Look Ahead
-Left-to-right with Rightmost-derivation and 1 token of lookahead.
-Unfortunately, the LALR(1) algorithm cannot handle all context-free
-grammars, so sometimes you will get error messages from PLY. To understand
-these errors and know how to avoid them, you have to know a little bit
-about the parsing algorithm.
-
-\section{The LALR(1) algorithm}
-\label{sec:lalr}
-
-To understand the error messages of PLY, one needs to understand the
-underlying parsing algorithm.
-%
-The LALR(1) algorithm uses a stack and a finite automata.  Each
-element of the stack is a pair: a state number and a symbol. The
-symbol characterizes the input that has been parsed so-far and the
-state number is used to remember how to proceed once the next
-symbol-worth of input has been parsed.  Each state in the finite
-automata represents where the parser stands in the parsing process
-with respect to certain grammar rules.  Figure~\ref{fig:shift-reduce}
-shows an example LALR(1) parse table generated by PLY for the grammar
-specified in Figure~\ref{fig:parser1}. When PLY generates a parse
-table, it also outputs a textual representation of the parse table to
-the file \texttt{parser.out} which is useful for debugging purposes.
-
-Consider state 1 in Figure~\ref{fig:shift-reduce}. The parser has just
-read in a \lstinline{PRINT} token, so the top of the stack is
-\lstinline{(1,PRINT)}. The parser is part of the way through parsing
-the input according to grammar rule 1, which is signified by showing
-rule 1 with a dot after the PRINT token and before the expression
-non-terminal.  A rule with a dot in it is called an \emph{item}. There
-are several rules that could apply next, both rule 2 and 3, so state 1
-also shows those rules with a dot at the beginning of their right-hand
-sides. The edges between states indicate which transitions the
-automata should make depending on the next input token. So, for
-example, if the next input token is INT then the parser will push INT
-and the target state 4 on the stack and transition to state 4.
-Suppose we are now at the end of the input. In state 4 it says we
-should reduce by rule 3, so we pop from the stack the same number of
-items as the number of symbols in the right-hand side of the rule, in
-this case just one.  We then momentarily jump to the state at the top
-of the stack (state 1) and then follow the goto edge that corresponds
-to the left-hand side of the rule we just reduced by, in this case
-\lstinline{expression}, so we arrive at state 3.  (A slightly longer
-example parse is shown in Figure~\ref{fig:shift-reduce}.)
-
-
-\begin{figure}[htbp]
-  \centering
-\includegraphics[width=5.0in]{shift-reduce-conflict}  
-  \caption{An LALR(1) parse table and a trace of an example run.}
-  \label{fig:shift-reduce}
-\end{figure}
-
-In general, the shift-reduce algorithm works as follows. Look at the
-next input token.
-\begin{itemize}
-\item If there there is a shift edge for the input token, push the
-  edge's target state and the input token on the stack and proceed to
-  the edge's target state.
-\item If there is a reduce action for the input token, pop $k$
-  elements from the stack, where $k$ is the number of symbols in the
-  right-hand side of the rule being reduced. Jump to the state at the
-  top of the stack and then follow the goto edge for the non-terminal
-  that matches the left-hand side of the rule we're reducing by. Push
-  the edge's target state and the non-terminal on the stack.
-\end{itemize}
-
-Notice that in state 6 of Figure~\ref{fig:shift-reduce} there is both
-a shift and a reduce action for the token \lstinline{PLUS}, so the
-algorithm does not know which action to take in this case. When a
-state has both a shift and a reduce action for the same token, we say
-there is a \emph{shift/reduce conflict}.  In this case, the conflict
-will arise, for example, when trying to parse the input
-\lstinline{print 1 + 2 + 3}.  After having consumed
-\lstinline{print 1 + 2} the parser will be in state 6, and it will not 
-know whether to reduce to form an expression of \lstinline{1 + 2}, 
-or whether it should proceed by shifting the next \lstinline{+} from 
-the input.
-
-A similar kind of problem, known as a \emph{reduce/reduce} conflict,
-arises when there are two reduce actions in a state for the same
-token. To understand which grammars gives rise to shift/reduce and
-reduce/reduce conflicts, it helps to know how the parse table is
-generated from the grammar, which we discuss next.
-
-\subsection{Parse table generation}
-\label{sec:table}
-
-The parse table is generated one state at a time. State 0 represents
-the start of the parser. We add the production for the start symbol to
-this state with a dot at the beginning of the right-hand side.  If the
-dot appears immediately before another non-terminal, we add all the
-productions with that non-terminal on the left-hand side. Again, we
-place a dot at the beginning of the right-hand side of each the new
-productions. This process called \emph{state closure} is continued
-until there are no more productions to add. We then examine each item
-in the current state $I$. Suppose an item has the form $A ::=
-\alpha.X\beta$, where $A$ and $X$ are symbols and $\alpha$ and $\beta$
-are sequences of symbols. We create a new state, call it $J$.  If $X$
-is a terminal, we create a shift edge from $I$ to $J$, whereas if $X$
-is a non-terminal, we create a goto edge from $I$ to $J$.  We then
-need to add some items to state $J$. We start by adding all items from
-state $I$ that have the form $B ::= \gamma.X\kappa$ (where $B$ is any
-symbol and $\gamma$ and $\kappa$ are arbitrary sequences of symbols),
-but with the dot moved past the $X$. We then perform state closure on
-$J$.  This process repeats until there are no more states or edges to
-add.
-
-We then mark states as accepting states if they have an item that is
-the start production with a dot at the end.  Also, to add in the
-reduce actions, we look for any state containing an item with a dot at
-the end. Let $n$ be the rule number for this item. We then put a
-reduce $n$ action into that state for every token $Y$. For example, in
-Figure~\ref{fig:shift-reduce} state 4 has an item with a dot at the
-end. We therefore put a reduce by rule 3 action into state 4 for every
-token. (Figure~\ref{fig:shift-reduce} does not show a reduce rule for
-INT in state 4 because this grammar does not allow
-two consecutive INT tokens in the input. We will not go into how this
-can be figured out, but in any event it does no harm to have a reduce
-rule for INT in state 4; it just means the input will be rejected at a
-later point in the parsing process.)
-
-\begin{exercise}
-On a piece of paper, walk through the parse table generation 
-process for the grammar in Figure~\ref{fig:parser1} and check
-your results against Figure~\ref{fig:shift-reduce}. 
-\end{exercise}
-
-
-\subsection{Resolving conflicts with precedence declarations}
-\label{sec:precedence}
+As an example, let us recall the \LangInt{} language, which included
+the following rules for its abstract syntax.
+\begin{align*}
+  \Exp &::= \INT{\Int}\\
+  \Exp &::= \ADD{\Exp}{\Exp}
+\end{align*}
+The corresponding rules for its concrete syntax are as follows. 
+\begin{align}
+  \Exp &::= \code{INT} \label{eq:parse-int}\\
+  \Exp &::= \Exp\; \code{"+"} \; \Exp \label{eq:parse-plus}
+\end{align}
+The rule \eqref{eq:parse-int} says that any string that matches the
+regular expression for \code{INT} can also be categorized, that is, parsed
+as an expression. The rule \eqref{eq:parse-plus} says that any string that
+parses as an expression, followed by the \code{+} character, followed
+by another expression, can itself be parsed as an expression.
+For example, the string \code{'1+3'} is an \Exp{} because
+\code{'1'} and \code{'3'} are both \Exp{} by rule \eqref{eq:parse-int},
+and then rule \eqref{eq:parse-plus} applies to categorize
+\code{'1+3'} as an \Exp{}. We can visualize the application of grammar
+rules to categorize a string using a
+\emph{parse tree}\index{subject}{parse tree}. Each internal node in the tree
+is an application of a grammar rule and the leaf nodes are substrings of the
+input program.
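+
+To make this concrete, the following sketch (using an illustrative
+grammar, not taken from the book) asks Lark to parse the string
+\code{'1+3'} and print its parse tree.
+\begin{lstlisting}
+from lark import Lark
+
+grammar = r"""
+exp: INT
+   | exp "+" exp
+INT: /[0-9]+/
+"""
+parser = Lark(grammar, start='exp')
+print(parser.parse('1+3').pretty())
+\end{lstlisting}
+Each \code{exp} node in the printed tree corresponds to the application
+of one of the two grammar rules.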
 
-To solve the shift/reduce conflict in state 6, we can add the
-following precedence rules, which says addition associates to the left
-and takes precedence over printing. This will cause state 6 to choose
-reduce over shift.
 
-\begin{lstlisting}
-precedence = (
-    ('nonassoc','PRINT'),
-    ('left','PLUS')
-    )
-\end{lstlisting}
 
-In general, the precedence variable should be assigned a tuple of
-tuples. The first element of each inner tuple should be an
-associativity (nonassoc, left, or right) and the rest of the elements
-should be tokens.  The tokens that appear in the same inner tuple have
-the same precedence, whereas tokens that appear in later tuples have a
-higher precedence.  Thus, for the typical precedence for arithmetic
-operations, we would specify the following:
 
-\begin{lstlisting}
-precedence = (
-    ('left','PLUS','MINUS'),
-    ('left','TIMES','DIVIDE')
-    )
-\end{lstlisting}
 
-Figure~\ref{fig:parser-resolved} shows the Python code for generating
-a lexer and parser using PLY.
 
-\begin{figure}[htbp]
-  \centering
-\begin{lstlisting}[basicstyle=\footnotesize\ttfamily]
-# Lexer
-tokens = ('PRINT','INT','PLUS')
-t_PRINT = r'print'
-t_PLUS = r'\+'
-def t_INT(t):
-    r'\d+'
-    try:
-        t.value = int(t.value)
-    except ValueError:
-        print "integer value too large", t.value
-        t.value = 0
-    return t
-t_ignore  = ' \t'  
-def t_newline(t):
-    r'\n+'
-    t.lexer.lineno += t.value.count("\n")
-def t_error(t):
-    print "Illegal character '%s'" % t.value[0]
-    t.lexer.skip(1)
-import ply.lex as lex
-lex.lex()
-# Parser
-from compiler.ast import Printnl, Add, Const
-precedence = (
-    ('nonassoc','PRINT'),
-    ('left','PLUS')
-    )
-def p_print_statement(t):
-    'statement : PRINT expression'
-    t[0] = Printnl([t[2]], None)
-def p_plus_expression(t):
-    'expression : expression PLUS expression'
-    t[0] = Add((t[1], t[3]))
-def p_int_expression(t):
-    'expression : INT'
-    t[0] = Const(t[1])
-def p_error(t):
-    print "Syntax error at '%s'" % t.value
-import ply.yacc as yacc
-yacc.yacc()
-\end{lstlisting}
-  \caption{Example parser with precedence declarations to resolve conflicts.}
-  \label{fig:parser-resolved}
-\end{figure}
 
-\begin{exercise}
-  Write a PLY grammar specification for $P_0$ and update your compiler
-  so that it uses the generated lexer and parser instead of using the
-  parser in the \lstinline{compiler} module. In addition to handling
-  the grammar in Figure~\ref{fig:concrete-P0}, you also need to handle
-  Python-style comments, everything following a \texttt{\#} symbol up
-  to the newline should be ignored.  Perform regression testing on
-  your compiler to make sure that it still passes all of the tests
-  that you created for $P_0$.
-\end{exercise}
 
 
-\fi
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \chapter{Register Allocation}

Some files were not shown because too many files changed in this diff.