|
@@ -4103,14 +4103,17 @@ framework~\citep{shinan20:_lark_docs} to translate the concrete syntax
|
|
|
of \LangInt{} (a sequence of characters) into an abstract syntax tree.
|
|
|
You will then be asked to use Lark to create a parser for \LangVar{}.
|
|
|
We also describe the parsing algorithms used inside Lark, studying the
|
|
|
-\citet{Earley:1970ly} and LALR(1) algorithms.
|
|
|
+\citet{Earley:1970ly} and LALR(1) algorithms~\citep{DeRemer69,Anderson73}.
|
|
|
|
|
|
A parser framework such as Lark takes in a specification of the
|
|
|
-concrete syntax and the input program and produces a parse tree. Even
|
|
|
+concrete syntax and an input program and produces a parse tree. Even
|
|
|
though a parser framework does most of the work for us, using one
|
|
|
properly requires some knowledge. In particular, we must learn about
|
|
|
its specification languages and we must learn how to deal with
|
|
|
-ambiguity in our language specifications.
|
|
|
+ambiguity in our language specifications. Also, some algorithms, such
|
|
|
+as LALR(1), place restrictions on the grammars they can handle, in
|
|
|
+which case it helps to know the algorithm when trying to decipher the
|
|
|
+error messages.
|
|
|
|
|
|
The process of parsing is traditionally subdivided into two phases:
|
|
|
\emph{lexical analysis} (also called scanning) and \emph{syntax
|
|
@@ -4125,13 +4128,13 @@ and the use of a slower but more powerful algorithm for parsing.
|
|
|
%
|
|
|
%% Likewise, parser generators typical come in pairs, with separate
|
|
|
%% generators for the lexical analyzer (or lexer for short) and for the
|
|
|
-%% parser. A paricularly influential pair of generators were
|
|
|
+%% parser. A particularly influential pair of generators were
|
|
|
%% \texttt{lex} and \texttt{yacc}. The \texttt{lex} generator was written
|
|
|
%% by \citet{Lesk:1975uq} at Bell Labs. The \texttt{yacc} generator was
|
|
|
%% written by \citet{Johnson:1979qy} at AT\&T and stands for Yet Another
|
|
|
%% Compiler Compiler.
|
|
|
%
|
|
|
-The Lark parse framwork that we use in this chapter includes both
|
|
|
+The Lark parser framework that we use in this chapter includes both
|
|
|
lexical analyzers and parsers. The next section discusses lexical
|
|
|
analysis and the remainder of the chapter discusses parsing.
|
|
|
|
|
@@ -4162,15 +4165,16 @@ Token('NEWLINE', '\n')
|
|
|
Each token includes a field for its \code{type}, such as \code{'INT'},
|
|
|
and a field for its \code{value}, such as \code{'1'}.
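The one-regular-expression-per-token idea can be sketched with Python's \code{re} module. This is an illustrative stand-in, not Lark's implementation; the token names and patterns below are assumptions chosen to mirror the example tokens above.

```python
import re

# Hypothetical token specification: one regular expression per token
# type, in the spirit of Lark's lexer generator (names are illustrative).
TOKEN_SPEC = [
    ("INT",     r"[0-9]+"),
    ("PLUS",    r"\+"),
    ("NEWLINE", r"\n"),
    ("SKIP",    r"[ \t]+"),   # whitespace is matched but discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})"
                             for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (type, value) pairs, analogous to Lark's Token objects."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("1 + 3\n"))
# → [('INT', '1'), ('PLUS', '+'), ('INT', '3'), ('NEWLINE', '\n')]
```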
|
|
|
|
|
|
-Following in the tradition of \code{lex}, the specification language
|
|
|
-for Lark's lexical analysis generator is one regular expression for
|
|
|
-each type of token. The term \emph{regular} comes from the term
|
|
|
-\emph{regular languages}, which are the languages that can be
|
|
|
-recognized by a finite automata. A \emph{regular expression} is a
|
|
|
-pattern formed of the following core elements:\index{subject}{regular
|
|
|
- expression}\footnote{Regular expressions traditionally include the
|
|
|
- empty regular expression that matches any zero-length part of a
|
|
|
- string, but Lark does not support the empty regular expression.}
|
|
|
+Following in the tradition of \code{lex}~\citep{Lesk:1975uq}, the
|
|
|
+specification language for Lark's lexical analysis generator is one
|
|
|
+regular expression for each type of token. The term \emph{regular}
|
|
|
+comes from the term \emph{regular languages}, which are the languages
|
|
|
+that can be recognized by a finite automaton. A \emph{regular
|
|
|
+ expression} is a pattern formed of the following core
|
|
|
+elements:\index{subject}{regular expression}\footnote{Regular
|
|
|
+ expressions traditionally include the empty regular expression that
|
|
|
+ matches any zero-length part of a string, but Lark does not support
|
|
|
+ the empty regular expression.}
|
|
|
\begin{itemize}
|
|
|
\item A single character $c$ is a regular expression and it only
|
|
|
matches itself. For example, the regular expression \code{a} only
|
|
@@ -4249,15 +4253,15 @@ In section~\ref{sec:grammar} we learned how to use grammar rules to
|
|
|
specify the abstract syntax of a language. We now take a closer look
|
|
|
at using grammar rules to specify the concrete syntax. Recall that
|
|
|
each rule has a left-hand side and a right-hand side. However, for
|
|
|
-concrete syntax, each right-hand side expresses a pattern to match
|
|
|
-against a string, instead of matching against an abstract syntax
|
|
|
-tree. In particular, each right-hand side is a sequence of
|
|
|
+concrete syntax, each right-hand side expresses a pattern for a
|
|
|
+string, instead of a pattern for an abstract syntax tree. In
|
|
|
+particular, each right-hand side is a sequence of
|
|
|
\emph{symbols}\index{subject}{symbol}, where a symbol is either a
|
|
|
terminal or nonterminal. A \emph{terminal}\index{subject}{terminal} is
|
|
|
a string. The nonterminals play the same role as in the abstract
|
|
|
syntax, defining categories of syntax. The nonterminals of a grammar
|
|
|
include the tokens defined in the lexer and all the nonterminals
|
|
|
-defined in the grammar rules.
|
|
|
+defined by the grammar rules.
|
|
|
|
|
|
As an example, let us take a closer look at the concrete syntax of the
|
|
|
\LangInt{} language, repeated here.
|
|
@@ -4304,12 +4308,12 @@ The rule \code{exp: exp "+" exp} says that any string that matches
|
|
|
\code{exp}, followed by the \code{+} character, followed by another
|
|
|
string that matches \code{exp}, is itself an \code{exp}. For example,
|
|
|
the string \code{'1+3'} is an \code{exp} because \code{'1'} and
|
|
|
-\code{'3'} are both \code{exp} by the rule \code{exp: INT}, and then the
|
|
|
-rule for addition applies to categorize \code{'1+3'} as an \Exp{}. We
|
|
|
-can visualize the application of grammar rules to categorize a string
|
|
|
-using a \emph{parse tree}\index{subject}{parse tree}. Each internal
|
|
|
-node in the tree is an application of a grammar rule and is labeled
|
|
|
-with the nonterminal of its left-hand side. Each leaf node is a
|
|
|
+\code{'3'} are both \code{exp} by the rule \code{exp: INT}, and then
|
|
|
+the rule for addition applies to categorize \code{'1+3'} as an
|
|
|
+\Exp{}. We can visualize the application of grammar rules to parse a
|
|
|
+string using a \emph{parse tree}\index{subject}{parse tree}. Each
|
|
|
+internal node in the tree is an application of a grammar rule and is
|
|
|
+labeled with its left-hand side nonterminal. Each leaf node is a
|
|
|
substring of the input program. The parse tree for \code{'1+3'} is
|
|
|
shown in figure~\ref{fig:simple-parse-tree}.
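To make the tree construction concrete, here is a small hand-written sketch (not Lark's machinery) that parses a string against the two rules \code{exp: exp "+" exp} and \code{exp: INT}, building each internal node as a tuple labeled with the rule's left-hand side and each leaf as a substring of the input:

```python
import re

def parse_exp(s):
    """Hand-written sketch: build a parse tree for the rules
    exp: exp "+" exp | INT, folding '+' left associatively.
    Assumes a well-formed input of INTs separated by '+'."""
    toks = re.findall(r"[0-9]+|\+", s)
    tree = ("exp", toks[0])            # exp: INT
    for i in range(1, len(toks), 2):   # each '+' wraps the tree so far
        tree = ("exp", tree, toks[i], ("exp", toks[i + 1]))
    return tree

print(parse_exp("1+3"))
# → ('exp', ('exp', '1'), '+', ('exp', '3'))
```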
|
|
|
|
|
@@ -4424,17 +4428,17 @@ is left associative. We also need to consider the interaction of unary
|
|
|
subtraction with both addition and subtraction. How should we parse
|
|
|
\code{'-1+2'}? Unary subtraction has higher
|
|
|
\emph{precedence}\index{subject}{precedence} than addition and
|
|
|
-subtraction, so \code{'-1+2'} should parse as \code{'(-1)+2'} and not
|
|
|
-\code{'-(1+2)'}. The grammar in figure~\ref{fig:Lint-lark-grammar}
|
|
|
-handles the associativity of addition and subtraction by using the
|
|
|
-nonterminal \code{exp\_hi} for all the other expressions, and uses
|
|
|
-\code{exp\_hi} for the second child in the rules for addition and
|
|
|
-subtraction. Furthermore, unary subtraction uses \code{exp\_hi} for
|
|
|
-its child.
|
|
|
-
|
|
|
-For languages with more operators and more precedence levels, one
|
|
|
-would need to refine the \code{exp} nonterminal into several
|
|
|
-nonterminals, one for each precedence level.
|
|
|
+subtraction, so \code{'-1+2'} should parse the same as \code{'(-1)+2'}
|
|
|
+and not \code{'-(1+2)'}. The grammar in
|
|
|
+figure~\ref{fig:Lint-lark-grammar} handles the associativity of
|
|
|
+addition and subtraction by using the nonterminal \code{exp\_hi} for
|
|
|
+all the other expressions, and uses \code{exp\_hi} for the second
|
|
|
+child in the rules for addition and subtraction. Furthermore, unary
|
|
|
+subtraction uses \code{exp\_hi} for its child.
|
|
|
+
|
|
|
+For languages with more operators and more precedence levels, one must
|
|
|
+refine the \code{exp} nonterminal into several nonterminals, one for
|
|
|
+each precedence level.
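As a sketch of that refinement, a hypothetical Lark-style grammar fragment with three precedence levels might look as follows; the nonterminal names \code{term} and \code{factor} and the multiplication rule are illustrative additions, not taken from the book's grammar:

```
?exp: exp "+" term        // lowest precedence, left associative
    | exp "-" term
    | term
?term: term "*" factor    // hypothetical middle precedence level
    | factor
?factor: "-" factor       // highest precedence: unary minus
    | INT
    | "(" exp ")"
```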
|
|
|
|
|
|
\begin{figure}[tbp]
|
|
|
\begin{tcolorbox}[colback=white]
|
|
@@ -4520,11 +4524,10 @@ In this section we discuss the parsing algorithm of
|
|
|
The algorithm is powerful in that it can handle any context-free
|
|
|
grammar, which makes it easy to use. However, it is not the most
|
|
|
efficient parsing algorithm: it is $O(n^3)$ for ambiguous grammars and
|
|
|
-$O(n^2)$ for unambiguous grammars~\citep{Hopcroft06:_automata}, where
|
|
|
-$n$ is the number of tokens in the input string. In
|
|
|
-section~\ref{sec:lalr} we learn about the LALR algorithm, which is
|
|
|
-more efficient but can only handle a subset of the context-free
|
|
|
-grammars.
|
|
|
+$O(n^2)$ for unambiguous grammars, where $n$ is the number of tokens
|
|
|
+in the input string~\citep{Hopcroft06:_automata}. In
|
|
|
+section~\ref{sec:lalr} we learn about the LALR(1) algorithm, which is
|
|
|
+more efficient but cannot handle all context-free grammars.
|
|
|
|
|
|
The Earley algorithm can be viewed as an interpreter; it treats the
|
|
|
grammar as the program being interpreted and it treats the concrete
|
|
@@ -4546,26 +4549,26 @@ represents a partial parse that has matched an \code{exp} followed by
|
|
|
\code{+}.
|
|
|
%
|
|
|
The Earley algorithm starts with an initialization phase, and then
|
|
|
-repeats three actions (prediction, scanning, and completion) for as
|
|
|
-long as opportunities arise for those actions. We demonstrate the
|
|
|
-Earley algorithm on a running example, parsing the following program:
|
|
|
+repeats three actions---prediction, scanning, and completion---for as
|
|
|
+long as opportunities arise. We demonstrate the Earley algorithm on a
|
|
|
+running example, parsing the following program:
|
|
|
\begin{lstlisting}
|
|
|
print(1 + 3)
|
|
|
\end{lstlisting}
|
|
|
The algorithm's initialization phase creates dotted rules for all the
|
|
|
grammar rules whose left-hand side is the start symbol and places them
|
|
|
-in slot $0$ of the chart. It also records the starting position of the
|
|
|
-dotted rule, in parentheses on the right. For example, given the
|
|
|
+in slot $0$ of the chart. We also record the starting position of the
|
|
|
+dotted rule in parentheses on the right. For example, given the
|
|
|
grammar in figure~\ref{fig:Lint-lark-grammar}, we place
|
|
|
\begin{lstlisting}
|
|
|
lang_int: . stmt_list (0)
|
|
|
\end{lstlisting}
|
|
|
in slot $0$ of the chart. The algorithm then proceeds with
|
|
|
\emph{prediction} actions in which it adds more dotted rules to the
|
|
|
-chart based on which nonterminal come after a period. In the above,
|
|
|
-the nonterminal \code{stmt\_list} appears after a period, so we add all
|
|
|
-the rules for \code{stmt\_list} to slot $0$, with a period at the
|
|
|
-beginning of their right-hand sides, as follows:
|
|
|
+chart based on which nonterminals come immediately after a period. In
|
|
|
+the above, the nonterminal \code{stmt\_list} appears after a period,
|
|
|
+so we add all the rules for \code{stmt\_list} to slot $0$, with a
|
|
|
+period at the beginning of their right-hand sides, as follows:
|
|
|
\begin{lstlisting}
|
|
|
stmt_list: . (0)
|
|
|
stmt_list: . stmt NEWLINE stmt_list (0)
|
|
@@ -4577,7 +4580,7 @@ period, so we add all the rules for \code{stmt}.
|
|
|
stmt: . "print" "(" exp ")" (0)
|
|
|
stmt: . exp (0)
|
|
|
\end{lstlisting}
|
|
|
-This reveals more opportunities for prediction, so we add the grammar
|
|
|
+This reveals yet more opportunities for prediction, so we add the grammar
|
|
|
rules for \code{exp} and \code{exp\_hi}.
|
|
|
\begin{lstlisting}[escapechar=$]
|
|
|
exp: . exp "+" exp_hi (0)
|
|
@@ -4593,8 +4596,8 @@ We have exhausted the opportunities for prediction, so the algorithm
|
|
|
proceeds to \emph{scanning}, in which we inspect the next input token
|
|
|
and look for a dotted rule at the current position that has a matching
|
|
|
terminal following the period. In our running example, the first input
|
|
|
-token is \code{"print"} so we identify the dotted rule in slot $0$ of
|
|
|
-the chart:
|
|
|
+token is \code{"print"} so we identify the rule in slot $0$ of
|
|
|
+the chart whose period comes before \code{"print"}:
|
|
|
\begin{lstlisting}
|
|
|
stmt: . "print" "(" exp ")" (0)
|
|
|
\end{lstlisting}
|
|
@@ -4626,7 +4629,7 @@ exp_hi: . "-" exp_hi (2)
|
|
|
exp_hi: . "(" exp ")" (2)
|
|
|
\end{lstlisting}
|
|
|
With that prediction complete, we return to scanning, noting that the
|
|
|
-next input token is \code{"1"} which the lexer categorized as an
|
|
|
+next input token is \code{"1"}, which the lexer categorizes as an
|
|
|
\code{INT}. There is a matching rule in slot $2$:
|
|
|
\begin{lstlisting}
|
|
|
exp_hi: . INT (2)
|
|
@@ -4636,10 +4639,10 @@ so we advance the period and put the following rule is slot $3$.
|
|
|
exp_hi: INT . (2)
|
|
|
\end{lstlisting}
|
|
|
This brings us to \emph{completion} actions. When the period reaches
|
|
|
-the end of a dotted rule, we have finished parsing a substring
|
|
|
-according to the left-hand side of the rule, in this case
|
|
|
+the end of a dotted rule, we recognize that a substring
|
|
|
+has matched the nonterminal on the left-hand side of the rule, in this case
|
|
|
\code{exp\_hi}. We therefore need to advance the periods in any dotted
|
|
|
-rules in slot $2$ (the starting position for the finished rule) the
|
|
|
+rules in slot $2$ (the starting position for the finished rule) whose
|
|
|
period is immediately followed by \code{exp\_hi}. So we identify
|
|
|
\begin{lstlisting}
|
|
|
exp: . exp_hi (2)
|
|
@@ -4756,11 +4759,11 @@ algorithm.
|
|
|
|
|
|
We have described how the Earley algorithm recognizes that an input
|
|
|
string matches a grammar, but we have not described how it builds a
|
|
|
-parse tree. The basic idea is simple, but it turns out that building
|
|
|
-parse trees in an efficient way is more complex, requiring a data
|
|
|
-structure called a shared packed parse forest~\citep{Tomita:1985qr}.
|
|
|
-The simple idea is to attach a partial parse tree to every dotted
|
|
|
-rule. Initially, the tree node associated with a dotted rule has no
|
|
|
+parse tree. The basic idea is simple, but building parse trees in an
|
|
|
+efficient way is more complex, requiring a data structure called a
|
|
|
+shared packed parse forest~\citep{Tomita:1985qr}. The simple idea is
|
|
|
+to attach a partial parse tree to every dotted rule in the chart.
|
|
|
+Initially, the tree node associated with a dotted rule has no
|
|
|
children. As the period moves to the right, the nodes from the
|
|
|
subparses are added as children to this tree node.
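The prediction, scanning, and completion actions described above can be sketched as a small recognizer. This is a simplified reconstruction, not Lark's implementation: it answers only yes or no (no parse-tree or forest construction) and ignores the corner case of rules completing over an empty substring.

```python
def earley_recognize(grammar, start, tokens):
    """Earley recognizer sketch. grammar maps each nonterminal to a
    list of right-hand sides (tuples of symbols); a symbol is a
    nonterminal iff it is a key of grammar. A chart item is a dotted
    rule (lhs, rhs, dot, origin)."""
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:                       # initialization
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs) and rhs[dot] in grammar:   # prediction
                for r in grammar[rhs[dot]]:
                    item = (rhs[dot], r, 0, i)
                    if item not in chart[i]:
                        chart[i].add(item)
                        agenda.append(item)
            elif dot < len(rhs):                         # scanning
                if i < len(tokens) and tokens[i] == rhs[dot]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                        # completion
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
    return any((start, rhs, len(rhs), 0) in chart[len(tokens)]
               for rhs in grammar[start])

# A pared-down version of the running example's grammar:
g = {"stmt":   [("print", "(", "exp", ")")],
     "exp":    [("exp", "+", "exp_hi"), ("exp_hi",)],
     "exp_hi": [("INT",)]}
print(earley_recognize(g, "stmt", ["print", "(", "INT", "+", "INT", ")"]))
# → True
```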
|
|
|
|
|
@@ -4774,29 +4777,33 @@ use with even the largest of input files.
|
|
|
\section{The LALR(1) Algorithm}
|
|
|
\label{sec:lalr}
|
|
|
|
|
|
-The LALR(1) algorithm can be viewed as a two phase approach in which
|
|
|
-it first compiles the grammar into a state machine and then runs the
|
|
|
-state machine to parse the input string. The state machine also uses
|
|
|
-a stack to record its progress in parsing the input string. Each
|
|
|
-element of the stack is a pair: a state number and a grammar symbol (a
|
|
|
-terminal or nonterminal). The symbol characterizes the input that has
|
|
|
-been parsed so-far and the state number is used to remember how to
|
|
|
-proceed once the next symbol-worth of input has been parsed. Each
|
|
|
-state in the machine represents where the parser stands in the parsing
|
|
|
-process with respect to certain grammar rules. In particular, each
|
|
|
-state is associated with a set of dotted rules.
|
|
|
-
|
|
|
-Figure~\ref{fig:shift-reduce} shows an example LALR(1) parse table
|
|
|
-generated by Lark for the following simple but amiguous grammar:
|
|
|
+The LALR(1) algorithm~\citep{DeRemer69,Anderson73} can be viewed as a
|
|
|
+two-phase approach in which it first compiles the grammar into a state
|
|
|
+machine and then runs the state machine to parse an input string.
|
|
|
+%
|
|
|
+A particularly influential implementation of LALR(1) was the
|
|
|
+\texttt{yacc} parser generator by \citet{Johnson:1979qy}, which stands
|
|
|
+for Yet Another Compiler Compiler.
|
|
|
+%
|
|
|
+The LALR(1) state machine uses a stack to record its progress in
|
|
|
+parsing the input string. Each element of the stack is a pair: a
|
|
|
+state number and a grammar symbol (a terminal or nonterminal). The
|
|
|
+symbol characterizes the input that has been parsed so far and the
|
|
|
+state number is used to remember how to proceed once the next
|
|
|
+symbol-worth of input has been parsed. Each state in the machine
|
|
|
+represents where the parser stands in the parsing process with respect
|
|
|
+to certain grammar rules. In particular, each state is associated with
|
|
|
+a set of dotted rules.
|
|
|
+
|
|
|
+Figure~\ref{fig:shift-reduce} shows an example LALR(1) state machine
|
|
|
+(also called parse table) for the following simple but ambiguous
|
|
|
+grammar:
|
|
|
\begin{lstlisting}[escapechar=$]
|
|
|
exp: INT
|
|
|
| exp "+" exp
|
|
|
stmt: "print" exp
|
|
|
start: stmt
|
|
|
\end{lstlisting}
|
|
|
-%% When PLY generates a parse table, it also
|
|
|
-%% outputs a textual representation of the parse table to the file
|
|
|
-%% \texttt{parser.out} which is useful for debugging purposes.
|
|
|
Consider state 1 in Figure~\ref{fig:shift-reduce}. The parser has just
|
|
|
read in a \lstinline{PRINT} token, so the top of the stack is
|
|
|
\lstinline{(1,PRINT)}. The parser is part of the way through parsing
|
|
@@ -4819,7 +4826,6 @@ reduced by, in this case \code{exp}, so we arrive at state 3. (A
|
|
|
slightly longer example parse is shown in
|
|
|
Figure~\ref{fig:shift-reduce}.)
|
|
|
|
|
|
-
|
|
|
\begin{figure}[htbp]
|
|
|
\centering
|
|
|
\includegraphics[width=5.0in]{figs/shift-reduce-conflict}
|
|
@@ -4827,8 +4833,8 @@ Figure~\ref{fig:shift-reduce}.)
|
|
|
\label{fig:shift-reduce}
|
|
|
\end{figure}
|
|
|
|
|
|
-In general, the shift-reduce algorithm works as follows. Look at the
|
|
|
-next input token.
|
|
|
+In general, the algorithm works as follows. Look at the next input
|
|
|
+token.
|
|
|
\begin{itemize}
|
|
|
\item If there is a shift edge for the input token, push the
|
|
|
edge's target state and the input token on the stack and proceed to
|
|
@@ -4837,8 +4843,8 @@ next input token.
|
|
|
elements from the stack, where $k$ is the number of symbols in the
|
|
|
right-hand side of the rule being reduced. Jump to the state at the
|
|
|
top of the stack and then follow the goto edge for the nonterminal
|
|
|
- that matches the left-hand side of the rule we're reducing by. Push
|
|
|
- the edge's target state and the nonterminal on the stack.
|
|
|
+ that matches the left-hand side of the rule that we are reducing
|
|
|
+ by. Push the edge's target state and the nonterminal on the stack.
|
|
|
\end{itemize}
|
|
|
|
|
|
Notice that in state 6 of Figure~\ref{fig:shift-reduce} there is both
|
|
@@ -4847,7 +4853,7 @@ algorithm does not know which action to take in this case. When a
|
|
|
state has both a shift and a reduce action for the same token, we say
|
|
|
there is a \emph{shift/reduce conflict}. In this case, the conflict
|
|
|
will arise, for example, when trying to parse the input
|
|
|
-\lstinline{print 1 + 2 + 3}. After having consumed \lstinline{print 1 + 2}
|
|
|
+\lstinline{print 1 + 2 + 3}. After having consumed \lstinline{print 1 + 2}
|
|
|
the parser will be in state 6, and it will not know whether to
|
|
|
reduce to form an \emph{exp} of \lstinline{1 + 2}, or whether it
|
|
|
should proceed by shifting the next \lstinline{+} from the input.
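The shift-reduce loop above can be sketched as a table-driven parser for this section's small grammar. The ACTION and GOTO tables below are reconstructed by hand, so the entries may not match the figure's table exactly; the state 6 conflict is resolved here in favor of reduce, which makes addition left associative.

```python
# Hand-reconstructed LALR(1)-style tables for the grammar
#   start: stmt;  stmt: "print" exp;  exp: INT | exp "+" exp
ACTION = {
    (0, "print"): ("shift", 1),
    (1, "INT"):   ("shift", 2),
    (2, "+"):     ("reduce", "exp", 1),   # exp: INT
    (2, "$"):     ("reduce", "exp", 1),
    (3, "+"):     ("shift", 4),
    (3, "$"):     ("reduce", "stmt", 2),  # stmt: "print" exp
    (4, "INT"):   ("shift", 2),
    (5, "$"):     ("accept",),
    (6, "+"):     ("reduce", "exp", 3),   # conflict resolved as reduce
    (6, "$"):     ("reduce", "exp", 3),   # exp: exp "+" exp
}
GOTO = {(0, "stmt"): 5, (1, "exp"): 3, (4, "exp"): 6}

def parse(tokens):
    """Run the stack machine; each stack element is (state, symbol)."""
    stack = [(0, None)]
    toks = tokens + ["$"]              # end-of-input marker
    i = 0
    while True:
        act = ACTION.get((stack[-1][0], toks[i]))
        if act is None:                # no entry: parse error
            return False
        if act[0] == "accept":
            return True
        if act[0] == "shift":          # push target state and token
            stack.append((act[1], toks[i]))
            i += 1
        else:                          # reduce: pop k pairs, follow goto
            _, lhs, k = act
            del stack[-k:]
            stack.append((GOTO[(stack[-1][0], lhs)], lhs))

print(parse(["print", "INT", "+", "INT", "+", "INT"]))  # → True
```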
|
|
@@ -4926,17 +4932,17 @@ your results against parse table in figure~\ref{fig:shift-reduce}.
|
|
|
\section{Further Reading}
|
|
|
|
|
|
In this chapter we have just scratched the surface of the field of
|
|
|
-parsing, with the study of a very general put less efficient algorithm
|
|
|
+parsing, with the study of a very general but less efficient algorithm
|
|
|
(Earley) and with a more limited but highly efficient algorithm
|
|
|
(LALR). There are many more algorithms, and classes of grammars, that
|
|
|
-fall between these two. We recommend the reader to \citet{Aho:2006wb}
|
|
|
-for a thorough treatment of parsing.
|
|
|
+fall between these two ends of the spectrum. We refer the reader
|
|
|
+to \citet{Aho:2006wb} for a thorough treatment of parsing.
|
|
|
|
|
|
Regarding lexical analysis, we described the specification language,
|
|
|
the regular expressions, but not the algorithms for recognizing them.
|
|
|
In short, regular expressions can be translated to nondeterministic
|
|
|
finite automata, which in turn are translated to deterministic finite automata. We
|
|
|
-refer the reader again to \citet{Aho:2006wb} for all the details of
|
|
|
+refer the reader again to \citet{Aho:2006wb} for all the details on
|
|
|
lexical analysis.
|
|
|
|
|
|
\fi}
|
|
@@ -23556,8 +23562,10 @@ registers.
|
|
|
% LocalWords: TupleProxy RawTuple InjectTuple InjectTupleProxy vecof
|
|
|
% LocalWords: InjectList InjectListProxy unannotated Lgradualr poly
|
|
|
% LocalWords: GenericVar AllType Inst builtin ap pps aps pp deepcopy
|
|
|
-% LocalWords: liskov clu Liskov dk Napier um inst popl jg seq ith
|
|
|
-% LocalWords: racketEd subparts subpart nonterminal nonterminals
|
|
|
-% LocalWords: pseudocode underapproximation underapproximations
|
|
|
-% LocalWords: semilattices overapproximate incrementing
|
|
|
-% LocalWords: multilanguage
|
|
|
+% LocalWords: liskov clu Liskov dk Napier um inst popl jg seq ith qy
|
|
|
+% LocalWords: racketEd subparts subpart nonterminal nonterminals Dyn
|
|
|
+% LocalWords: pseudocode underapproximation underapproximations LALR
|
|
|
+% LocalWords: semilattices overapproximate incrementing Earley docs
|
|
|
+% LocalWords: multilanguage Prelim shinan DeRemer lexer Lesk LPAR cb
|
|
|
+% LocalWords: RPAR abcbab abc bzca usub paren expr lang WS Tomita qr
|
|
|
+% LocalWords: subparses
|