edits to parsing chapter

Jeremy Siek, 2 years ago
parent commit d6c8fe5f73
1 changed file with 104 additions and 105 deletions
book.tex

@@ -4167,60 +4167,59 @@ Each token includes a field for its \code{type}, such as \code{'INT'},
 and a field for its \code{value}, such as \code{'1'}.
 
 Following in the tradition of \code{lex}~\citep{Lesk:1975uq}, the
-specification language for Lark's lexical analysis generator is one
-regular expression for each type of token. The term \emph{regular}
-comes from the term \emph{regular languages}, which are the languages
-that can be recognized by a finite automata. A \emph{regular
-  expression} is a pattern formed of the following core
-elements:\index{subject}{regular expression}\footnote{Regular
-  expressions traditionally include the empty regular expression that
-  matches any zero-length part of a string, but Lark does not support
-  the empty regular expression.}
+specification language for Lark's lexer is one regular expression for
+each type of token. The term \emph{regular} comes from the term
+\emph{regular languages}, which are the languages that can be
+recognized by a finite state machine. A \emph{regular expression} is a
+pattern formed of the following core elements:\index{subject}{regular
+  expression}\footnote{Regular expressions traditionally include the
+  empty regular expression that matches any zero-length part of a
+  string, but Lark does not support the empty regular expression.}
 \begin{itemize}
 \item A single character $c$ is a regular expression and it only
   matches itself. For example, the regular expression \code{a} only
   matches with the string \code{'a'}.
   
-\item Two regular expressions separated by a vertical bar $R_1 \mid
+\item Two regular expressions separated by a vertical bar $R_1 \ttm{|}
   R_2$ form a regular expression that matches any string that matches
   $R_1$ or $R_2$. For example, the regular expression \code{a|c}
   matches the string \code{'a'} and the string \code{'c'}.
 
 \item Two regular expressions in sequence $R_1 R_2$ form a regular
   expression that matches any string that can be formed by
-  concatenating two strings, where the first matches $R_1$
-  and the second matches $R_2$. For example, the regular expression
+  concatenating two strings, where the first string matches $R_1$ and
+  the second string matches $R_2$. For example, the regular expression
   \code{(a|c)b} matches the strings \code{'ab'} and \code{'cb'}.
   (Parentheses can be used to control the grouping of operators within
   a regular expression.)
 
-\item A regular expression followed by an asterisks $R*$ (called
+\item A regular expression followed by an asterisk $R\ttm{*}$ (called
   Kleene closure) is a regular expression that matches any string that
   can be formed by concatenating zero or more strings that each match
   the regular expression $R$.  For example, the regular expression
-  \code{"((a|c)b)*"} matches the strings \code{'abcbab'} and
-  \code{''}, but not \code{'abc'}.
+  \code{"((a|c)b)*"} matches the string \code{'abcbab'} but not
+  \code{'abc'}.
 \end{itemize}
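These core operators are also supported by Python's \code{re} module, which can be used to experiment with the examples above (shown here only for illustration; Lark compiles its own patterns):

```python
import re

# The examples from the list above, checked with re.fullmatch, which
# succeeds only when the pattern matches the entire string.
assert re.fullmatch(r"a", "a")                # single character
assert re.fullmatch(r"a|c", "c")              # alternation
assert re.fullmatch(r"(a|c)b", "cb")          # concatenation, grouped
assert re.fullmatch(r"((a|c)b)*", "abcbab")   # Kleene closure
assert not re.fullmatch(r"((a|c)b)*", "abc")  # 'c' is left unmatched
```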
 
-For our convenience, Lark also accepts an extended set of regular
-expressions that are automatically translated into the core regular
-expressions.
+For our convenience, Lark also accepts the following extended set of
+regular expressions that are automatically translated into the core
+regular expressions.
 
 \begin{itemize}
 \item A set of characters enclosed in square brackets $[c_1 c_2 \ldots
   c_n]$ is a regular expression that matches any one of the
   characters. So $[c_1 c_2 \ldots c_n]$  is equivalent to
   the regular expression $c_1\mid c_2\mid \ldots \mid c_n$.
-\item A range of characters enclosed in square brackets $[c_1-c_2]$ is
+\item A range of characters enclosed in square brackets $[c_1\ttm{-}c_2]$ is
   a regular expression that matches any character between $c_1$ and
   $c_2$, inclusive. For example, \code{[a-z]} matches any lowercase
   letter in the alphabet.
-\item A regular expression followed by the plus symbol $R+$
+\item A regular expression followed by the plus symbol $R\ttm{+}$
   is a regular expression that matches any string that can
   be formed by concatenating one or more strings that each match $R$.
   So $R+$ is equivalent to $R(R*)$. For example, \code{[a-z]+}
   matches \code{'b'} and \code{'bzca'}.
-\item A regular expression followed by a question mark $R?$
+\item A regular expression followed by a question mark $R\ttm{?}$
   is a regular expression that matches any string that either
   matches $R$ or that is the empty string.
   For example, \code{a?b}  matches both \code{'ab'} and \code{'b'}.
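The extended forms can likewise be tried out with Python's \code{re} module (again only as an illustration of the operators, not of Lark's lexer):

```python
import re

# Character sets, ranges, one-or-more, and optional elements.
assert re.fullmatch(r"[abc]", "b")      # set: matches any listed character
assert re.fullmatch(r"[a-z]+", "bzca")  # range plus +: one or more letters
assert not re.fullmatch(r"[a-z]+", "")  # + requires at least one match
assert re.fullmatch(r"a?b", "ab")       # optional 'a' present
assert re.fullmatch(r"a?b", "b")        # optional 'a' absent
```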
@@ -4253,9 +4252,11 @@ and they can be used to combine regular expressions, outside the
 In section~\ref{sec:grammar} we learned how to use grammar rules to
 specify the abstract syntax of a language. We now take a closer look
 at using grammar rules to specify the concrete syntax. Recall that
-each rule has a left-hand side and a right-hand side. However, for
-concrete syntax, each right-hand side expresses a pattern for a
-string, instead of a patter for an abstract syntax tree. In
+each rule has a left-hand side and a right-hand side where the
+left-hand side is a nonterminal and the right-hand side is a pattern
+that defines what can be parsed as that nonterminal.
+For concrete syntax, each right-hand side expresses a pattern for a
+string, instead of a pattern for an abstract syntax tree. In
 particular, each right-hand side is a sequence of
 \emph{symbols}\index{subject}{symbol}, where a symbol is either a
 terminal or nonterminal. A \emph{terminal}\index{subject}{terminal} is
@@ -4297,13 +4298,13 @@ lang_int: stmt_list
 \end{minipage}
 \end{center}
 
-Let us begin by discussing the rule \code{exp: INT}.  In
+Let us begin by discussing the rule \code{exp: INT}, which says that if
+the lexer matches a string to \code{INT}, then the parser also
+categorizes the string as an \code{exp}.  Recall that in
 Section~\ref{sec:grammar} we defined the corresponding \Int{}
 nonterminal with an English sentence. Here we specify \code{INT} more
 formally using a type of token \code{INT} and its regular expression
-\code{"-"? DIGIT+}. Thus, the rule \code{exp: INT} says that if the
-lexer matches a string to \code{INT}, then the parser also categorizes
-the string as an \code{exp}.
+\code{"-"? DIGIT+}.
 
 The rule \code{exp: exp "+" exp} says that any string that matches
 \code{exp}, followed by the \code{+} character, followed by another
@@ -4311,8 +4312,8 @@ string that matches \code{exp}, is itself an \code{exp}.  For example,
 the string \code{'1+3'} is an \code{exp} because \code{'1'} and
 \code{'3'} are both \code{exp} by the rule \code{exp: INT}, and then
 the rule for addition applies to categorize \code{'1+3'} as an
-\Exp{}. We can visualize the application of grammar rules to parse a
-string using a \emph{parse tree}\index{subject}{parse tree}. Each
+\code{exp}. We can visualize the application of grammar rules to parse
+a string using a \emph{parse tree}\index{subject}{parse tree}. Each
 internal node in the tree is an application of a grammar rule and is
 labeled with its left-hand side nonterminal. Each leaf node is a
 substring of the input program.  The parse tree for \code{'1+3'} is
@@ -4363,12 +4364,12 @@ exp: INT                    -> int
    | "(" exp ")"            -> paren
 
 stmt: "print" "(" exp ")"   -> print
-    | exp                   -> expr
+    | exp                    -> expr
 
-stmt_list:                   -> empty_stmt
+stmt_list:                      -> empty_stmt
     | stmt NEWLINE stmt_list -> add_stmt
 
-lang_int: stmt_list          -> module
+lang_int: stmt_list             -> module
 \end{lstlisting}
 \end{minipage}
 \end{center}
@@ -4510,10 +4511,10 @@ WS: /[ \t\f\r\n]/+
 %ignore WS
 \end{lstlisting}
 Change your compiler from chapter~\ref{ch:Lvar} to use your
-Lark-generated parser instead of using the \code{parse} function from
+Lark parser instead of using the \code{parse} function from
 the \code{ast} module. Test your compiler on all of the \LangVar{}
 programs that you have created and create four additional programs
-that would reveal ambiguities in your grammar.
+that test for ambiguities in your grammar.
 \end{exercise}
 
 
@@ -4521,14 +4522,14 @@ that would reveal ambiguities in your grammar.
 \label{sec:earley}
 
 In this section we discuss the parsing algorithm of
-\citet{Earley:1970ly}, which is the default algorithm used by Lark.
-The algorithm is powerful in that it can handle any context-free
-grammar, which makes it easy to use. However, it is not the most
-efficient parsing algorithm: it is $O(n^3)$ for ambiguous grammars and
-$O(n^2)$ for unambiguous grammars, where $n$ is the number of tokens
-in the input string~\citep{Hopcroft06:_automata}.  In
-section~\ref{sec:lalr} we learn about the LALR(1) algorithm, which is
-more efficient but cannot handle all context-free grammars.
+\citet{Earley:1970ly}, the default algorithm used by Lark.  The
+algorithm is powerful in that it can handle any context-free grammar,
+which makes it easy to use. However, it is not the most efficient
+parsing algorithm: it is $O(n^3)$ for ambiguous grammars and $O(n^2)$
+for unambiguous grammars, where $n$ is the number of tokens in the
+input string~\citep{Hopcroft06:_automata}.  In section~\ref{sec:lalr}
+we learn about the LALR(1) algorithm, which is more efficient but
+cannot handle all context-free grammars.
 
 The Earley algorithm can be viewed as an interpreter; it treats the
 grammar as the program being interpreted and it treats the concrete
@@ -4564,7 +4565,7 @@ grammar in figure~\ref{fig:Lint-lark-grammar}, we place
 \begin{lstlisting}
   lang_int: . stmt_list         (0)
 \end{lstlisting}
-in slot $0$ of the chart. The algorithm then proceeds to with
+in slot $0$ of the chart. The algorithm then proceeds with
 \emph{prediction} actions in which it adds more dotted rules to the
 chart based on which nonterminals come immediately after a period. In
 the above, the nonterminal \code{stmt\_list} appears after a period,
@@ -4582,7 +4583,7 @@ stmt:  .  "print" "("  exp ")"   (0)
 stmt:  .  exp                    (0)
 \end{lstlisting}
 This reveals yet more opportunities for prediction, so we add the grammar
-rules for \code{exp} and \code{exp\_hi}.
+rules for \code{exp} and \code{exp\_hi} to slot $0$.
 \begin{lstlisting}[escapechar=$]
 exp: . exp "+" exp_hi         (0)
 exp: . exp "-" exp_hi         (0)
@@ -4596,14 +4597,14 @@ exp_hi: . "(" exp ")"         (0)
 We have exhausted the opportunities for prediction, so the algorithm
 proceeds to \emph{scanning}, in which we inspect the next input token
 and look for a dotted rule at the current position that has a matching
-terminal following the period. In our running example, the first input
-token is \code{"print"} so we identify the rule in slot $0$ of
-the chart whose dot comes before \code{"print"}:
+terminal immediately following the period. In our running example, the
+first input token is \code{"print"} so we identify the rule in slot
+$0$ of the chart where \code{"print"} follows the period:
 \begin{lstlisting}
 stmt:  .  "print" "("  exp ")"       (0)
 \end{lstlisting}
-and add the following rule to slot $1$ of the chart, with the period
-moved forward past \code{"print"}.
+We advance the period past \code{"print"} and add the resulting rule
+to slot $1$ of the chart:
 \begin{lstlisting}
 stmt:  "print" . "("  exp ")"        (0)
 \end{lstlisting}
@@ -4629,9 +4630,9 @@ exp_hi: . "input_int" "(" ")" (2)
 exp_hi: . "-" exp_hi          (2)
 exp_hi: . "(" exp ")"         (2)
 \end{lstlisting}
-With that prediction complete, we return to scanning, noting that the
+With this prediction complete, we return to scanning, noting that the
 next input token is \code{"1"} which the lexer parses as an
-\code{INT}. There is a matching rule is slot $2$:
+\code{INT}. There is a matching rule in slot $2$:
 \begin{lstlisting}
 exp_hi: . INT             (2)
 \end{lstlisting}
@@ -4644,7 +4645,7 @@ the end of a dotted rule, we recognize that the substring
 has matched the nonterminal on the left-hand side of the rule, in this case
 \code{exp\_hi}. We therefore need to advance the periods in any dotted
 rules in slot $2$ (the starting position for the finished rule) if
-period is immediately followed by \code{exp\_hi}. So we identify
+the period is immediately followed by \code{exp\_hi}. So we identify
 \begin{lstlisting}
 exp: . exp_hi                 (2)
 \end{lstlisting}
@@ -4738,17 +4739,16 @@ algorithm.
 \item The algorithm repeatedly applies the following three kinds of
   actions for as long as there are opportunities to do so.
   \begin{itemize}
-  \item Prediction: if there is a dotted rule in slot $k$ whose period
-    comes before a nonterminal, add all the rules for that nonterminal
-    into slot $k$, placing a period at the beginning of their
-    right-hand sides, and recording their starting position as
-    $k$.
+  \item Prediction: if there is a rule in slot $k$ whose period comes
+    before a nonterminal, add the rules for that nonterminal into slot
+    $k$, placing a period at the beginning of their right-hand sides
+    and recording their starting position as $k$.
   \item Scanning: If the token at position $k$ of the input string
     matches the symbol after the period in a dotted rule in slot $k$
-    of the chart, advance the prior in the dotted rule, adding
+    of the chart, advance the period in the dotted rule, adding
     the result to slot $k+1$.
   \item Completion: If a dotted rule in slot $k$ has a period at the
-    end, consider the rules in the slot corresponding to the starting
+    end, inspect the rules in the slot corresponding to the starting
     position of the completed rule. If any of those rules have a
     nonterminal following their period that matches the left-hand side
     of the completed rule, then advance their period, placing the new
@@ -4766,23 +4766,28 @@ shared packed parse forest~\citep{Tomita:1985qr}.  The simple idea is
 to attach a partial parse tree to every dotted rule in the chart.
 Initially, the tree node associated with a dotted rule has no
 children. As the period moves to the right, the nodes from the
-subparses are added as children to this tree node.
+subparses are added as children to the tree node.
 
 As mentioned at the beginning of this section, the Earley algorithm is
 $O(n^2)$ for unambiguous grammars, which means that it can parse input
 files that contain thousands of tokens in a reasonable amount of time,
-but not millions. In the next section we discuss the LALR(1) parsing
-algorithm, which has time complexity $O(n)$, making it practical to
-use with even the largest of input files.
+but not millions.
+%
+In the next section we discuss the LALR(1) parsing algorithm, which is
+efficient enough to use with even the largest of input files.
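The chart and its three actions can be sketched in Python. This is a minimal illustration of the procedure described in this section, not Lark's actual implementation; the grammar encoding (a dict mapping each nonterminal to a list of right-hand sides, given as tuples of symbols) is an assumption made for the sketch.

```python
# A dotted rule is (lhs, rhs, dot, origin): the period sits before
# rhs[dot], and origin is the slot where the rule started.
def earley(grammar, start, tokens):
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for k in range(n + 1):
        changed = True
        while changed:                        # prediction/completion fixpoint
            changed = False
            for (lhs, rhs, dot, origin) in list(chart[k]):
                if dot < len(rhs) and rhs[dot] in grammar:
                    # Prediction: add rules for the nonterminal after the period.
                    for r in grammar[rhs[dot]]:
                        if (rhs[dot], r, 0, k) not in chart[k]:
                            chart[k].add((rhs[dot], r, 0, k))
                            changed = True
                elif dot == len(rhs):
                    # Completion: advance rules in the origin slot that
                    # were waiting on this nonterminal.
                    for (l2, r2, d2, o2) in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            if (l2, r2, d2 + 1, o2) not in chart[k]:
                                chart[k].add((l2, r2, d2 + 1, o2))
                                changed = True
        if k < n:
            # Scanning: advance rules whose period sits before a terminal
            # that matches the next input token.
            for (lhs, rhs, dot, origin) in chart[k]:
                if dot < len(rhs) and rhs[dot] not in grammar \
                        and rhs[dot] == tokens[k]:
                    chart[k + 1].add((lhs, rhs, dot + 1, origin))
    # Accept if a rule for the start nonterminal spans the whole input.
    return any(lhs == start and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in chart[n])

# A fragment of the running example's grammar; tokens stand for themselves.
grammar = {
    "exp":    [("exp", "+", "exp_hi"), ("exp_hi",)],
    "exp_hi": [("INT",)],
}
```

Running `earley(grammar, "exp", ("INT", "+", "INT"))` accepts, while dropping the final `INT` leaves no completed `exp` rule spanning the input, so the parse is rejected.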
+
 
 \section{The LALR(1) Algorithm}
 \label{sec:lalr}
 
 The LALR(1) algorithm~\citep{DeRemer69,Anderson73} can be viewed as a
 two phase approach in which it first compiles the grammar into a state
-machine and then runs the state machine to parse an input string.
+machine and then runs the state machine to parse an input string.  The
+second phase has time complexity $O(n)$ where $n$ is the number of
+tokens in the input, so LALR(1) is the best one could hope for with
+respect to efficiency.
 %
-A particularly influential implementation of LALR(1) was the
+A particularly influential implementation of LALR(1) is the
 \texttt{yacc} parser generator by \citet{Johnson:1979qy}, which stands
 for Yet Another Compiler Compiler.
 %
@@ -4806,25 +4811,24 @@ stmt: "print" exp
 start: stmt
 \end{lstlisting}
 Consider state 1 in Figure~\ref{fig:shift-reduce}. The parser has just
-read in a \lstinline{PRINT} token, so the top of the stack is
-\lstinline{(1,PRINT)}. The parser is part of the way through parsing
+read in a \lstinline{"print"} token, so the top of the stack is
+\lstinline{(1,"print")}. The parser is part of the way through parsing
 the input according to grammar rule 1, which is signified by showing
-rule 1 with a period after the \code{PRINT} token and before the
-\code{exp} nonterminal.  A rule with a period in it is called an
-\emph{item}. There are several rules that could apply next, both rule
-2 and 3, so state 1 also shows those rules with a period at the
-beginning of their right-hand sides. The edges between states indicate
-which transitions the machine should make depending on the next input
-token. So, for example, if the next input token is \code{INT} then the
-parser will push \code{INT} and the target state 4 on the stack and
-transition to state 4.  Suppose we are now at the end of the input. In
-state 4 it says we should reduce by rule 3, so we pop from the stack
-the same number of items as the number of symbols in the right-hand
-side of the rule, in this case just one.  We then momentarily jump to
-the state at the top of the stack (state 1) and then follow the goto
-edge that corresponds to the left-hand side of the rule we just
-reduced by, in this case \code{exp}, so we arrive at state 3.  (A
-slightly longer example parse is shown in
+rule 1 with a period after the \code{"print"} token and before the
+\code{exp} nonterminal. There are several rules that could apply next,
+both rules 2 and 3, so state 1 also shows those rules with a period at
+the beginning of their right-hand sides. The edges between states
+indicate which transitions the machine should make depending on the
+next input token. So, for example, if the next input token is
+\code{INT} then the parser will push \code{INT} and the target state 4
+on the stack and transition to state 4.  Suppose we are now at the end
+of the input. In state 4 it says we should reduce by rule 3, so we pop
+from the stack the same number of items as the number of symbols in
+the right-hand side of the rule, in this case just one.  We then
+momentarily jump to the state at the top of the stack (state 1) and
+then follow the goto edge that corresponds to the left-hand side of
+the rule we just reduced by, in this case \code{exp}, so we arrive at
+state 3.  (A slightly longer example parse is shown in
 Figure~\ref{fig:shift-reduce}.)
 
 \begin{figure}[htbp]
@@ -4834,18 +4838,19 @@ Figure~\ref{fig:shift-reduce}.)
   \label{fig:shift-reduce}
 \end{figure}
 
-In general, the algorithm works as follows. Look at the next input
-token.
+In general, the algorithm works as follows. Set the current state to
+state $0$. Then repeat the following, looking at the next input token.
 \begin{itemize}
-\item If there there is a shift edge for the input token, push the
-  edge's target state and the input token on the stack and proceed to
-  the edge's target state.
-\item If there is a reduce action for the input token, pop $k$
-  elements from the stack, where $k$ is the number of symbols in the
-  right-hand side of the rule being reduced. Jump to the state at the
-  top of the stack and then follow the goto edge for the nonterminal
-  that matches the left-hand side of the rule that we reducing
-  by. Push the edge's target state and the nonterminal on the stack.
+\item If there is a shift edge for the input token in the
+  current state, push the edge's target state and the input token on
+  the stack and proceed to the edge's target state.
+\item If there is a reduce action for the input token in the current
+  state, pop $k$ elements from the stack, where $k$ is the number of
+  symbols in the right-hand side of the rule being reduced. Jump to
+  the state at the top of the stack and then follow the goto edge for
+  the nonterminal that matches the left-hand side of the rule that we
+  are reducing by. Push the edge's target state and the nonterminal on
+  the stack.
 \end{itemize}
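The driver loop above can be sketched with a hand-constructed action/goto table. This is an illustration only: the state numbers are made up for the sketch and need not match the figure, and the grammar uses the conflict-free rule `exp ::= exp "+" INT` so that the table is deterministic.

```python
# Toy grammar:  1: stmt ::= "print" exp
#               2: exp  ::= exp "+" INT
#               3: exp  ::= INT
# "$" marks the end of the input.
RULES = {1: ("stmt", 2), 2: ("exp", 3), 3: ("exp", 1)}  # lhs, rhs length

ACTION = {
    (0, "print"): ("shift", 1),
    (1, "INT"):   ("shift", 4),
    (3, "+"):     ("shift", 5),
    (3, "$"):     ("reduce", 1),
    (4, "+"):     ("reduce", 3),
    (4, "$"):     ("reduce", 3),
    (5, "INT"):   ("shift", 6),
    (6, "+"):     ("reduce", 2),
    (6, "$"):     ("reduce", 2),
    (2, "$"):     ("accept",),
}
GOTO = {(0, "stmt"): 2, (1, "exp"): 3}

def parse(tokens):
    tokens = tokens + ["$"]
    stack = [0]          # alternating states and symbols; a state on top
    i = 0
    while True:
        action = ACTION.get((stack[-1], tokens[i]))
        if action is None:
            return False                     # parse error
        if action[0] == "accept":
            return True
        if action[0] == "shift":
            stack += [tokens[i], action[1]]  # push token and target state
            i += 1
        else:                                # reduce by rule action[1]
            lhs, k = RULES[action[1]]
            del stack[len(stack) - 2 * k:]   # pop k (symbol, state) pairs
            stack += [lhs, GOTO[(stack[-1], lhs)]]
```

For example, `parse(["print", "INT", "+", "INT"])` shifts into state 4, reduces by rule 3, shifts the `+` and second `INT`, reduces by rules 2 and 1, and accepts.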
 
 Notice that in state 6 of Figure~\ref{fig:shift-reduce} there is both
@@ -4856,7 +4861,7 @@ there is a \emph{shift/reduce conflict}.  In this case, the conflict
 will arise, for example, when trying to parse the input
 \lstinline{print 1 + 2 + 3}. After having consumed \lstinline{print 1 + 2}
 the parser will be in state 6, and it will not know whether to
-reduce to form an \emph{exp} of \lstinline{1 + 2}, or whether it
+reduce to form an \code{exp} of \lstinline{1 + 2}, or whether it
 should proceed by shifting the next \lstinline{+} from the input.
 
 A similar kind of problem, known as a \emph{reduce/reduce} conflict,
@@ -4872,7 +4877,7 @@ similar to the initialization phase of the Earley parser.  If the
 period appears immediately before another nonterminal, we add all the
 rules with that nonterminal on the left-hand side. Again, we place a
 period at the beginning of the right-hand side of each of the new
-rules. This process called \emph{state closure} is continued
+rules. This process, called \emph{state closure}, is continued
 until there are no more rules to add (similar to the prediction
 actions of an Earley parser). We then examine each dotted rule in the
 current state $I$. Suppose a dotted rule has the form $A ::=
 $Y$. For example, in Figure~\ref{fig:shift-reduce} state 4 has a
 dotted rule with a period at the end. We therefore put a reduce by
 rule 3 action into state 4 for every
 token.
-%% (Figure~\ref{fig:shift-reduce} does not show a reduce rule for
-%% \code{INT} in state 4 because this grammar does not allow two
-%% consecutive \code{INT} tokens in the input. We will not go into how
-%% this can be figured out, but in any event it does no harm to have a
-%% reduce rule for \code{INT} in state 4; it just means the input will be
-%% rejected at a later point in the parsing process.)
 
 When inserting reduce actions, take care to spot any shift/reduce or
 reduce/reduce conflicts. If there are any, abort the construction of
@@ -5177,8 +5176,8 @@ During liveness analysis we know which variables are call-live because
 we compute which variables are in use at every instruction
 (section~\ref{sec:liveness-analysis-Lvar}). When we build the
 interference graph (section~\ref{sec:build-interference}), we can
-place an edge between each call-live variable and the caller-saved
-registers in the interference graph. This will prevent the graph
+place an edge in the interference graph between each call-live
+variable and the caller-saved registers. This will prevent the graph
 coloring algorithm from assigning call-live variables to caller-saved
 registers.
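This edge-adding step can be sketched as follows; the adjacency-set graph representation and the helper's name are assumptions made for illustration, not the book's actual data structures.

```python
# Caller-saved registers of the x86-64 calling convention used in the book.
CALLER_SAVED = ["rax", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", "r11"]

def add_call_live_edges(graph, call_live_vars):
    """Add an interference edge between every call-live variable and
    every caller-saved register, so the graph coloring algorithm cannot
    assign those variables to those registers."""
    for v in call_live_vars:
        for r in CALLER_SAVED:
            graph.setdefault(v, set()).add(r)
            graph.setdefault(r, set()).add(v)
    return graph
```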