
sections for Early and LALR

Jeremy Siek 3 years ago
parent
commit
f48751a56c
2 changed files with 180 additions and 44 deletions
  1. 6 0
      book.bib
  2. 174 44
      book.tex

+ 6 - 0
book.bib

@@ -8,6 +8,12 @@
 	year = {1975},
 	Bdsk-File-1 = {YnBsaXN0MDDRAQJccmVsYXRpdmVQYXRoV2xleC5wZGYICxgAAAAAAAABAQAAAAAAAAADAAAAAAAAAAAAAAAAAAAAIA==}}
 
+@Misc{shinan20:_lark_docs,
+  author = 	 {Erez Shinan},
+  title = 	 {Lark Documentation},
+  url = {https://lark-parser.readthedocs.io/en/latest/index.html},
+  year = 	 2020}
+
 @incollection{Johnson:1979qy,
 	author = {Stephen C. Johnson},
 	booktitle = {{UNIX} Programmer's Manual},

+ 174 - 44
book.tex

@@ -246,6 +246,11 @@ concepts and algorithms used in compilers.
   Chapters~\ref{ch:trees-recur} and \ref{ch:Lvar}, where we introduce
   the fundamental tools of compiler construction: \emph{abstract
     syntax trees} and \emph{recursive functions}. 
+{\if\edition\pythonEd
+\item In Chapter~\ref{ch:parsing-Lvar} we learn how to use a parser
+  generator to create a parser for the language of integer arithmetic
+  and local variables.
+\fi}  
 \item In Chapter~\ref{ch:register-allocation-Lvar} we apply
   \emph{graph coloring} to assign variables to machine registers.
 \item Chapter~\ref{ch:Lif} adds conditional expressions, which
@@ -930,15 +935,15 @@ combine several right-hand sides into a single rule.
 
 The concrete syntax for \LangInt{} is shown in
 figure~\ref{fig:r0-concrete-syntax} and the abstract syntax for
-\LangInt{} is shown in figure~\ref{fig:r0-syntax}.
-
+\LangInt{} is shown in figure~\ref{fig:r0-syntax}.%
+%
 \racket{The \code{read-program} function provided in
   \code{utilities.rkt} of the support code reads a program from a file
   (the sequence of characters in the concrete syntax of Racket) and
   parses it into an abstract syntax tree. Refer to the description of
   \code{read-program} in appendix~\ref{appendix:utilities} for more
   details.}
-
+%
 \python{The \code{parse} function in Python's \code{ast} module
   converts the concrete syntax (represented as a string) into an
   abstract syntax tree.}
@@ -4080,31 +4085,33 @@ all, fast code is useless if it produces incorrect results!
 
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\chapter{Parsing}
+{\if\edition\pythonEd
+\chapter{Parser Generation}
 \label{ch:parsing-Lvar}
 \setcounter{footnote}{0}
 \index{subject}{parsing}
 
-In this chapter we learn how to use the Lark parser generator to
-translate the concrete syntax of \LangInt{} and \LangVar{} (a sequence
-of characters) into an abstract syntax tree.  A parser generator takes
-in a specification of the concrete syntax and produces a parser. Even
-though a parser generator does most of the work for us, using one
-properly requires considerable knowledge about parsing algorithms.  In
+In this chapter we learn how to use the Lark parser
+generator~\citep{shinan20:_lark_docs} to translate the concrete syntax
+of \LangInt{} (a sequence of characters) into an abstract syntax tree.
+You will then be asked to use Lark to create a parser for \LangVar{}.
+A parser generator takes in a specification of the concrete syntax and
+produces a parser. Even though a parser generator does most of the
+work for us, using one properly requires some knowledge.  In
 particular, we must learn about the specification languages used by
 parser generators and we must learn how to deal with ambiguity in our
 language specifications.
 
 The process of parsing is traditionally subdivided into two phases:
-\emph{lexical analysis} (also called scanning) and
-\emph{parsing}. The lexical analysis phase translates the sequence of
-characters into a sequence of \emph{tokens}, that is, words consisting
-of several characters. The parsing phase organizes the tokens into a
-\emph{parse tree} that captures how the tokens were matched by rules
-in the grammar of the language. The reason for the subdivision into
-two phases is to enable the use of a faster but less powerful
-algorithm for lexical analysis and the use of a slower but more
-powerful algorithm for parsing.
+\emph{lexical analysis} (also called scanning) and \emph{syntax
+  analysis} (also called parsing). The lexical analysis phase
+translates the sequence of characters into a sequence of
+\emph{tokens}, that is, words consisting of several characters. The
+parsing phase organizes the tokens into a \emph{parse tree} that
+captures how the tokens were matched by rules in the grammar of the
+language. The reason for the subdivision into two phases is to enable
+the use of a faster but less powerful algorithm for lexical analysis
+and the use of a slower but more powerful algorithm for parsing.
 %
 Likewise, parser generators typically come in pairs, with separate
 generators for the lexical analyzer (or lexer for short) and for the
@@ -4119,7 +4126,7 @@ lexical analyzer and a parser. The next section discusses lexical
 analysis and the remainder of the chapter discusses parsing.
 
 
-\section{Lexical analysis}
+\section{Lexical Analysis and Regular Expressions}
 \label{sec:lex}
 
 The lexical analyzers produced by Lark turn a sequence of characters
@@ -4220,10 +4227,10 @@ NEWLINE: (/\r/? /\n/)+
 \end{minipage}
 \end{center}
 
-\noindent (In Lark, the regular expression operators can be used both
+\noindent In Lark, the regular expression operators can be used both
 inside a regular expression, that is, between the \code{/} characters,
 and they can be used to combine regular expressions, outside the
-\code{/} characters.)
+\code{/} characters.
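As a rough illustration outside of Lark, token definitions of this kind can be mimicked with Python's standard \code{re} module. The sketch below is not Lark's lexer: the \code{TOKEN\_SPEC} table, the \code{PLUS} token, and the \code{tokenize} helper are hypothetical names, and the \code{INT} pattern assumes that \code{DIGIT} stands for \code{/[0-9]/}.

```python
import re

# Hypothetical transcription of token definitions into Python regexes.
# Assumes DIGIT stands for /[0-9]/; PLUS is added only for the demo.
TOKEN_SPEC = [
    ("INT", r"-?[0-9]+"),        # "-"? DIGIT+
    ("NEWLINE", r"(?:\r?\n)+"),  # (/\r/? /\n/)+
    ("PLUS", r"\+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (token type, matched text) pairs, failing on unknown characters."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("1+3\n"))
```

The operators \code{?} and \code{+} appear inside each pattern here, which corresponds to using them between the \code{/} characters in Lark.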
 
 \section{Grammars and Parse Trees}
 \label{sec:CFG}
@@ -4231,15 +4238,16 @@ and they can be used to combine regular expressions, outside the
 In section~\ref{sec:grammar} we learned how to use grammar rules to
 specify the abstract syntax of a language. We now take a closer look
 at using grammar rules to specify the concrete syntax. Recall that
-each rule has a left-hand side and a right-hand side. However, this
-time each right-hand side expresses a pattern to match against a
-string, instead of matching against an abstract syntax tree. In
-particular, each right-hand side is a sequence of
+each rule has a left-hand side and a right-hand side. However, for
+concrete syntax, each right-hand side expresses a pattern to match
+against a string, instead of matching against an abstract syntax
+tree. In particular, each right-hand side is a sequence of
 \emph{symbols}\index{subject}{symbol}, where a symbol is either a
 terminal or nonterminal. A \emph{terminal}\index{subject}{terminal} is
-either a string or the name of a type of token such as \Int{}. The
-nonterminals play the same role as in the abstract syntax, defining
-categories of syntax.
+a string. The nonterminals play the same role as in the abstract
+syntax, defining categories of syntax. The nonterminals of a grammar
+include the tokens defined in the lexer and all the nonterminals
+defined in the grammar rules.
 
 As an example, let us take a closer look at the concrete syntax of the
 \LangInt{} language, repeated here.
@@ -4254,8 +4262,8 @@ As an example, let us take a closer look at the concrete syntax of the
 The Lark syntax for grammar rules differs slightly from the variant of
 BNF that we use in this book. In particular, the notation $::=$ is
 replaced by a single colon and the use of typewriter font for string
-literals is replaced by quotation marks. The following serves as a
-first draft of a Lark grammar for \LangInt{}.
+literals is replaced by quotation marks. The following grammar serves
+as a first draft of a Lark grammar for \LangInt{}.
 \begin{center}
 \begin{minipage}{0.95\textwidth}
 \begin{lstlisting}[escapechar=$]
@@ -4278,9 +4286,9 @@ Let us begin by discussing the rule \code{exp: INT}.  In
 Section~\ref{sec:grammar} we defined the corresponding \Int{}
 nonterminal with an English sentence. Here we specify \code{INT} more
 formally using a type of token \code{INT} and its regular expression
-\code{"-"? DIGIT+}. Thus, the rule \code{exp: INT} says that
-if the lexer matches a string to \code{INT}, then the parser
-also categorizes the string as an \code{exp}.
+\code{"-"? DIGIT+}. Thus, the rule \code{exp: INT} says that if the
+lexer matches a string to \code{INT}, then the parser also categorizes
+the string as an \code{exp}.
 
 The rule \code{exp: exp "+" exp} says that any string that matches
 \code{exp}, followed by the \code{+} character, followed by another
@@ -4290,15 +4298,15 @@ the string \code{'1+3'} is an \code{exp} because \code{'1'} and
 rule for addition applies to categorize \code{'1+3'} as an \Exp{}. We
 can visualize the application of grammar rules to categorize a string
 using a \emph{parse tree}\index{subject}{parse tree}. Each internal
-node in the tree is an application of a grammar rule and labeled with
-the nonterminal of its left-hand side. Each leaf node is a terminal
-symbol that is a substring of the input program.  The parse tree for
-\code{'1+3'} is shown in figure~\ref{fig:simple-parse-tree}.
+node in the tree is an application of a grammar rule and is labeled
+with the nonterminal of its left-hand side. Each leaf node is a
+substring of the input program.  The parse tree for \code{'1+3'} is
+shown in figure~\ref{fig:simple-parse-tree}.
 
 \begin{figure}[tbp]
 \begin{tcolorbox}[colback=white]
 \centering
-\includegraphics[width=2.0in]{figs/simple-parse-tree}
+\includegraphics[width=1.9in]{figs/simple-parse-tree}
 \end{tcolorbox}
 \caption{The parse tree for \code{'1+3'}.}
 \label{fig:simple-parse-tree}
@@ -4354,7 +4362,7 @@ Tree('module',
    Token('NEWLINE', '\n')])
 \end{lstlisting}
 
-\subsection{Ambiguous Grammars}
+\section{Ambiguous Grammars}
 
 A grammar is \emph{ambiguous}\index{subject}{ambiguous} when a string
 can be parsed in more than one way. For example, consider the string
@@ -4373,14 +4381,136 @@ the correct answer is \code{2}.
 \label{fig:ambig-parse-tree}
 \end{figure}
 
-To deal with this problem we can change the grammar to become
-unambiguous by categorizing the syntax in a more fine grained
-fashion. In this case we want to disallow the application of the rule
-\code{exp: exp "-" exp} when the child on the right is an addition.
+To deal with this problem we can change the grammar by categorizing
+the syntax in a more fine-grained fashion. In this case we want to
+disallow the application of the rule \code{exp: exp "-" exp} when the
+child on the right is an addition.  To do this we can replace the
+\code{exp} after \code{"-"} with a nonterminal that categorizes all
+the expressions except for addition, as in the following.
+\begin{center}
+\begin{minipage}{0.95\textwidth}
+\begin{lstlisting}[escapechar=$]
+exp: exp "-" exp_no_add     -> sub
+   | exp "+" exp            -> add
+   | exp_no_add
+
+exp_no_add: INT             -> int
+   | "input_int" "(" ")"    -> input_int
+   | "-" exp                -> usub
+   | exp "-" exp_no_add     -> sub
+   | "(" exp ")"            -> paren
+\end{lstlisting}
+\end{minipage}
+\end{center}
+
+However, there remains some ambiguity in the grammar. For example, the
+string \code{'1-2-3'} can still be parsed in two different ways, as
+\code{'(1-2)-3'} (correct) or \code{'1-(2-3)'} (incorrect). The first
+parse is the correct one because subtraction is left associative.
+Likewise, addition in Python is left associative. We also need to
+consider the interaction of unary
+subtraction with both addition and subtraction. How should we parse
+\code{'-1+2'}? Unary subtraction has higher
+\emph{precedence}\index{subject}{precedence} than addition and
+subtraction, so \code{'-1+2'} should parse as \code{'(-1)+2'} and not
+\code{'-(1+2)'}. The grammar in figure~\ref{fig:Lint-lark-grammar}
+handles the associativity of addition and subtraction by using the
+nonterminal \code{exp\_hi} for all the other expressions, and uses
+\code{exp\_hi} for the second child in the rules for addition and
+subtraction. Furthermore, unary subtraction uses \code{exp\_hi} for
+its child.
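One way to check these expectations is to ask Python's own parser, through the \code{parse} function of the \code{ast} module, how it groups the two example strings. The \code{shape} helper below is a hypothetical name introduced for this sketch, and it only handles integer constants, \code{+}, binary \code{-}, and unary \code{-}.

```python
import ast

def shape(source):
    """Parse an expression and return its grouping in fully parenthesized form.

    Only integer constants, +, binary -, and unary - are handled.
    """
    def show(node):
        if isinstance(node, ast.BinOp):
            op = "+" if isinstance(node.op, ast.Add) else "-"
            return f"({show(node.left)}{op}{show(node.right)})"
        if isinstance(node, ast.UnaryOp):
            return f"(-{show(node.operand)})"
        return str(node.value)  # an ast.Constant node
    return show(ast.parse(source, mode="eval").body)

print(shape("1-2-3"))  # ((1-2)-3)
print(shape("-1+2"))   # ((-1)+2)
```

The first result shows the left-associative grouping of subtraction, and the second shows unary subtraction binding more tightly than addition.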
+
+For languages with more operators and more precedence levels, one
+would need to refine the \code{exp} nonterminal into several
+nonterminals, one for each precedence level.
+
+\begin{figure}[tbp]
+\begin{tcolorbox}[colback=white]
+\centering
+\begin{lstlisting}[escapechar=$]
+exp: exp "+" exp_hi         -> add
+   | exp "-" exp_hi         -> sub
+   | exp_hi
+
+exp_hi: INT                 -> int
+      | "input_int" "(" ")" -> input_int
+      | "-" exp_hi          -> usub
+      | "(" exp ")"         -> paren
+
+stmt: "print" "(" exp ")"  -> print
+    | exp                  -> expr
+
+lang_int: (stmt NEWLINE)*  -> module
+\end{lstlisting}
+\end{tcolorbox}
+\caption{An unambiguous Lark grammar for \LangInt{}.}
+\label{fig:Lint-lark-grammar}
+\end{figure}
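The layering of \code{exp} over \code{exp\_hi} can also be read as a recipe for a small hand-written parser. The sketch below does not show how Lark works; it uses a different technique, recursive descent, in which the left-recursive rules for \code{exp} are replaced by a loop. All function names are hypothetical, the output is nested tuples labeled with the rule aliases, and the tokenizer assumes that \code{-} is always an operator rather than part of an \code{INT}.

```python
import re

def expect(tokens, i, tok):
    """Check that tokens[i] is the expected token."""
    if i >= len(tokens) or tokens[i] != tok:
        raise SyntaxError(f"expected {tok!r} at token {i}")

def parse_exp(tokens, i):
    # exp: exp "+" exp_hi | exp "-" exp_hi | exp_hi
    # The left recursion becomes a loop that grows the tree to the left.
    left, i = parse_exp_hi(tokens, i)
    while i < len(tokens) and tokens[i] in ("+", "-"):
        op = "add" if tokens[i] == "+" else "sub"
        right, i = parse_exp_hi(tokens, i + 1)
        left = (op, left, right)
    return left, i

def parse_exp_hi(tokens, i):
    # exp_hi: INT | "input_int" "(" ")" | "-" exp_hi | "(" exp ")"
    if i >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[i]
    if tok.isdigit():
        return ("int", int(tok)), i + 1
    if tok == "input_int":
        expect(tokens, i + 1, "(")
        expect(tokens, i + 2, ")")
        return ("input_int",), i + 3
    if tok == "-":
        operand, i = parse_exp_hi(tokens, i + 1)
        return ("usub", operand), i
    if tok == "(":
        inner, i = parse_exp(tokens, i + 1)
        expect(tokens, i, ")")
        return ("paren", inner), i + 1
    raise SyntaxError(f"unexpected token {tok!r}")

def parse(text):
    # Simplistic tokenizer: characters matching no alternative are skipped.
    tokens = re.findall(r"[0-9]+|input_int|[-+()]", text)
    tree, i = parse_exp(tokens, 0)
    if i != len(tokens):
        raise SyntaxError("trailing input")
    return tree

print(parse("1-2-3"))
print(parse("-1+2"))
```

Because the right operand of \code{+} and \code{-} is always parsed with \code{parse\_exp\_hi}, an addition or subtraction can appear on the right only inside parentheses, which mirrors the role of \code{exp\_hi} in the grammar.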
+
+\section{From Parse Trees to Abstract Syntax Trees}
 
+As we have seen, the output of a Lark parser is a parse tree, that is,
+a tree consisting of \code{Tree} and \code{Token} nodes. So the next
+step is to convert the parse tree to an abstract syntax tree. This can
+be accomplished with a recursive function that inspects the
+\code{data} field of each node and then constructs the corresponding
+AST node, using recursion to handle its children. The following is an
+excerpt of the \code{parse\_tree\_to\_ast} function for \LangInt{}.
 
+\begin{center}
+\begin{minipage}{0.95\textwidth}
+\begin{lstlisting}
+def parse_tree_to_ast(e):
+    if e.data == 'int':
+        return Constant(int(e.children[0].value))
+    elif e.data == 'input_int':
+        return Call(Name('input_int'), [])
+    elif e.data == 'add':
+        e1, e2 = e.children
+        return BinOp(parse_tree_to_ast(e1), Add(), parse_tree_to_ast(e2))
+    ...
+    else:
+        raise Exception('unhandled parse tree', e)
+\end{lstlisting}
+\end{minipage}
+\end{center}
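To see the excerpt run without installing Lark, one can feed it a hand-built parse tree. In the sketch below, the \code{Tree} and \code{Token} classes are hypothetical stand-ins that mimic only the \code{data}, \code{children}, and \code{value} fields used by the excerpt, and the AST classes are assumed to be the ones from Python's \code{ast} module. Only the three cases shown in the excerpt are included.

```python
from ast import Add, BinOp, Call, Constant, Name, dump

class Token:
    """Hypothetical stand-in for Lark's Token: a token type plus matched text."""
    def __init__(self, type, value):
        self.type = type
        self.value = value

class Tree:
    """Hypothetical stand-in for Lark's Tree: a rule alias plus child nodes."""
    def __init__(self, data, children):
        self.data = data
        self.children = children

def parse_tree_to_ast(e):
    # Only the cases shown in the excerpt are covered here.
    if e.data == 'int':
        return Constant(int(e.children[0].value))
    elif e.data == 'input_int':
        return Call(Name('input_int'), [])
    elif e.data == 'add':
        e1, e2 = e.children
        return BinOp(parse_tree_to_ast(e1), Add(), parse_tree_to_ast(e2))
    else:
        raise Exception('unhandled parse tree', e)

# Hand-built parse tree for '1+3', shaped like the output of the 'add' rule.
tree = Tree('add', [Tree('int', [Token('INT', '1')]),
                    Tree('int', [Token('INT', '3')])])
result = parse_tree_to_ast(tree)
print(dump(result))
```

Any node whose \code{data} field is not one of the handled aliases falls through to the final branch and raises an exception, which makes missing cases easy to spot while the function is being completed.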
 
+\begin{exercise}
+  \normalfont\normalsize
+
+  Use Lark to create a lexer and parser for \LangVar{}.  We recommend
+  using Lark's default parsing algorithm (Earley) with the
+  \code{ambiguity} option set to \code{'explicit'} so that if your
+  grammar is ambiguous, the output will include multiple parse trees
+  which will indicate to you that there is a problem with your
+  grammar. Your parser should ignore white space so we
+  recommend using Lark's \code{\%ignore} directive as follows.
+\begin{lstlisting}
+WS: /[ \t\f\r\n]/+
+%ignore WS
+\end{lstlisting}
+Change your compiler from chapter~\ref{ch:Lvar} to use your
+Lark-generated parser instead of using the \code{parse} function from
+the \code{ast} module. Test your compiler on all of the \LangVar{}
+programs that you have created and create four additional programs
+that would reveal ambiguities in your grammar.
 
+\end{exercise}
+
+
+\section{The Earley Algorithm}
+
+
+\section{The LALR Algorithm}
+
+
+\section{Further Reading}
+
+UNDER CONSTRUCTION
+
+
+
+
+\fi}
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \chapter{Register Allocation}