Jeremy Siek 2 years ago
parent
revision
ac5f48a492
2 changed files with 356 additions and 29 deletions
  1. book.bib: +10 -0
  2. book.tex: +346 -29

+ 10 - 0
book.bib

@@ -1,3 +1,13 @@
+@book{Tomita:1985qr,
+	address = {Norwell, MA, USA},
+	author = {Masaru Tomita},
+	date-added = {2008-12-02 14:16:33 -0700},
+	date-modified = {2008-12-02 14:16:39 -0700},
+	isbn = {0898382025},
+	publisher = {Kluwer Academic Publishers},
+	title = {Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems},
+	year = {1985}}
+
 @article{Earley:1970ly,
 	acmid = {362035},
 	address = {New York, NY, USA},

+ 346 - 29
book.tex

@@ -4298,7 +4298,7 @@ The rule \code{exp: exp "+" exp} says that any string that matches
 \code{exp}, followed by the \code{+} character, followed by another
 string that matches \code{exp}, is itself an \code{exp}.  For example,
 the string \code{'1+3'} is an \code{exp} because \code{'1'} and
-\code{'3'} are both \code{exp} by rule \code{exp: INT}, and then the
+\code{'3'} are both \code{exp} by the rule \code{exp: INT}, and then the
 rule for addition applies to categorize \code{'1+3'} as an \Exp{}. We
 can visualize the application of grammar rules to categorize a string
 using a \emph{parse tree}\index{subject}{parse tree}. Each internal
@@ -4426,9 +4426,9 @@ nonterminal \code{exp\_hi} for all the other expressions, and uses
 subtraction. Furthermore, unary subtraction uses \code{exp\_hi} for
 its child.
 
-For languages with more operators with more precedence levels, one
+For languages with more operators and more precedence levels, one
 would need to refine the \code{exp} nonterminal into several
-nonterminals, 
+nonterminals, one for each precedence level.
 
 \begin{figure}[tbp]
 \begin{tcolorbox}[colback=white]
@@ -4514,7 +4514,8 @@ In this section we discuss the parsing algorithm of
 The algorithm is powerful in that it can handle any context-free
 grammar, which makes it easy to use. However, it is not the most
 efficient parsing algorithm: it is $O(n^3)$ for ambiguous grammars and
-$O(n^2)$ for unambiguous grammars~\citep{Hopcroft06:_automata}.  In
+$O(n^2)$ for unambiguous grammars~\citep{Hopcroft06:_automata}, where
+$n$ is the number of tokens in the input string.  In
 section~\ref{sec:lalr} we learn about the LALR algorithm, which is
 more efficient but can only handle a subset of the context-free
 grammars.
@@ -4531,16 +4532,24 @@ been parsed. For example, the dotted rule
 \begin{lstlisting}
 exp: exp "+" . exp_hi
 \end{lstlisting}
-represents a partial parse that has matched an expression followed by
-\code{+}, but has not yet parsed an expression to the right of
+represents a partial parse that has matched an \code{exp} followed by
+\code{+}, but has not yet parsed an \code{exp} to the right of
 \code{+}.
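Concretely, a dotted rule can be represented as a small record pairing a grammar rule with the position of the period and the chart slot where the partial parse began. The following Python sketch is our own illustrative representation (not Lark's internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DottedRule:
    lhs: str      # e.g. 'exp'
    rhs: tuple    # e.g. ('exp', '"+"', 'exp_hi')
    dot: int      # how many symbols of rhs have been parsed so far
    start: int    # chart slot where this partial parse began

    def next_symbol(self):
        # the symbol after the period, or None if the period is at the end
        return self.rhs[self.dot] if self.dot < len(self.rhs) else None

# the dotted rule  exp: exp "+" . exp_hi  with starting position 2
r = DottedRule('exp', ('exp', '"+"', 'exp_hi'), 2, 2)
```

Here `next_symbol` returns `'exp_hi'` for `r`, signaling that the parser still expects an `exp_hi` to the right of the `+`.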
-
-The algorithm begins by creating dotted rules for all the grammar
-rules whose left-hand side is the start symbol and placing then in
-slot $0$ of the chart.  For example, given the grammar in
-figure~\ref{fig:Lint-lark-grammar}, we would place
+%
+The Earley algorithm starts with an initialization phase, and then
+repeats three actions (prediction, scanning, and completion) for as
+long as opportunities arise for those actions. We demonstrate the
+Earley algorithm on a running example, parsing the following program:
 \begin{lstlisting}
-  lang_int: . stmt_list
+  print(1 + 3)
+\end{lstlisting}
+The algorithm's initialization phase creates dotted rules for all the
+grammar rules whose left-hand side is the start symbol and places them
+in slot $0$ of the chart. It also records the starting position of the
+dotted rule, in parentheses on the right. For example, given the
+grammar in figure~\ref{fig:Lint-lark-grammar}, we place
+\begin{lstlisting}
+  lang_int: . stmt_list         (0)
 \end{lstlisting}
 in slot $0$ of the chart. The algorithm then proceeds to its
 \emph{prediction} phase in which it adds more dotted rules to the
@@ -4549,32 +4558,340 @@ the nonterminal \code{stmt\_list} appears after a period, so we add all
 the rules for \code{stmt\_list} to slot $0$, with a period at the
 beginning of their right-hand sides, as follows:
 \begin{lstlisting}
-stmt_list:  . 
-stmt_list:  .  stmt  NEWLINE  stmt_list
+stmt_list:  .                             (0)
+stmt_list:  .  stmt  NEWLINE  stmt_list   (0)
+\end{lstlisting}
+We continue to perform prediction actions as more opportunities
+arise. For example, the \code{stmt} nonterminal now appears after a
+period, so we add all the rules for \code{stmt}.
+\begin{lstlisting}
+stmt:  .  "print" "("  exp ")"   (0)
+stmt:  .  exp                    (0)
+\end{lstlisting}
+This reveals more opportunities for prediction, so we add the grammar
+rules for \code{exp} and \code{exp\_hi}.
+\begin{lstlisting}[escapechar=$]
+exp: . exp "+" exp_hi         (0)
+exp: . exp "-" exp_hi         (0)
+exp: . exp_hi                 (0)
+exp_hi: . INT                 (0)
+exp_hi: . "input_int" "(" ")" (0)
+exp_hi: . "-" exp_hi          (0)
+exp_hi: . "(" exp ")"         (0)
+\end{lstlisting}
+
+We have exhausted the opportunities for prediction, so the algorithm
+proceeds to \emph{scanning}, in which we inspect the next input token
+and look for a dotted rule at the current position that has a matching
+terminal following the period. In our running example, the first input
+token is \code{"print"} so we identify the dotted rule in slot $0$ of
+the chart:
+\begin{lstlisting}
+stmt:  .  "print" "("  exp ")"       (0)
+\end{lstlisting}
+and add the following rule to slot $1$ of the chart, with the period
+moved forward past \code{"print"}.
+\begin{lstlisting}
+stmt:  "print" . "("  exp ")"        (0)
+\end{lstlisting}
+If the new dotted rule had a nonterminal after the period, we would
+need to carry out a prediction action, adding more dotted rules into
+slot $1$. That is not the case, so we continue scanning. The next
+input token is \code{"("}, so we add the following to slot $2$ of the
+chart.
+\begin{lstlisting}
+stmt:  "print" "(" . exp ")"         (0)
+\end{lstlisting}
+
+Now we have a nonterminal after the period, so we carry out several
+prediction actions, adding dotted rules for \code{exp} and
+\code{exp\_hi} to slot $2$ with a period at the beginning and with
+starting position $2$.
+\begin{lstlisting}[escapechar=$]
+exp: . exp "+" exp_hi         (2)
+exp: . exp "-" exp_hi         (2)
+exp: . exp_hi                 (2)
+exp_hi: . INT                 (2)
+exp_hi: . "input_int" "(" ")" (2)
+exp_hi: . "-" exp_hi          (2)
+exp_hi: . "(" exp ")"         (2)
+\end{lstlisting}
+With that prediction complete, we return to scanning, noting that the
+next input token is \code{"1"}, which the lexer categorized as an
+\code{INT}. There is a matching rule in slot $2$:
+\begin{lstlisting}
+exp_hi: . INT             (2)
+\end{lstlisting}
+so we advance the period and put the following rule in slot $3$.
+\begin{lstlisting}
+exp_hi: INT .             (2)
 \end{lstlisting}
-The prediction phase continues to add dotted rules as more
-opportunities arise. For example, the \code{stmt} nonterminal now
-appears after a period, so we add all the rules for \code{stmt}.
+This brings us to \emph{completion} actions.  When the period reaches
+the end of a dotted rule, we have finished parsing a substring
+according to the left-hand side of the rule, in this case
+\code{exp\_hi}. We therefore need to advance the period in any dotted
+rule in slot $2$ (the starting position of the finished rule) in which
+the period is immediately followed by \code{exp\_hi}. So we identify
 \begin{lstlisting}
-stmt:  .  "print" "("  exp ")"
-stmt:  .  exp
+exp: . exp_hi                 (2)
+\end{lstlisting}
+and add the following dotted rule to slot $3$.
+\begin{lstlisting}
+exp: exp_hi .                 (2)
+\end{lstlisting}
+This triggers another completion step for the nonterminal \code{exp},
+adding two more dotted rules to slot $3$.
+\begin{lstlisting}[escapechar=$]
+exp: exp . "+" exp_hi         (2)
+exp: exp . "-" exp_hi         (2)
+\end{lstlisting}
+
+Returning to scanning, the next input token is \code{"+"}, so
+we add the following to slot $4$.
+\begin{lstlisting}[escapechar=$]
+exp: exp "+" . exp_hi         (2)
+\end{lstlisting}
+The period precedes the nonterminal \code{exp\_hi}, so prediction adds
+the following dotted rules to slot $4$ of the chart.
+\begin{lstlisting}[escapechar=$]
+exp_hi: . INT                 (4)
+exp_hi: . "input_int" "(" ")" (4)
+exp_hi: . "-" exp_hi          (4)
+exp_hi: . "(" exp ")"         (4)
+\end{lstlisting}
+The next input token is \code{"3"}, which the lexer categorized as an
+\code{INT}, so we advance the period past \code{INT} for the rules in
+slot $4$, of which there is just one, and put the following in slot $5$.
+\begin{lstlisting}[escapechar=$]
+exp_hi: INT .                 (4)
+\end{lstlisting}
+
+The period at the end of the rule triggers a completion action for the
+rules in slot $4$, one of which has a period before \code{exp\_hi}.
+So we advance the period and put the following in slot $5$.
+\begin{lstlisting}[escapechar=$]
+exp: exp "+" exp_hi .         (2)
 \end{lstlisting}
-To finish the preduction phase, we add the grammar rules for
-\code{exp} and \code{exp\_hi}.
+This triggers another completion action for the rules in slot $2$ that
+have a period before \code{exp}.
 \begin{lstlisting}[escapechar=$]
-exp: . exp "+" exp_hi
-exp: . exp "-" exp_hi
-exp: . exp_hi
-exp_hi: . INT
-exp_hi: . "input_int" "(" ")"
-exp_hi: . "-" exp_hi
-exp_hi: . "(" exp ")"
+stmt:  "print" "(" exp . ")"  (0)
+exp: exp . "+" exp_hi         (2)
+exp: exp . "-" exp_hi         (2)
 \end{lstlisting}
 
+We scan the next input token \code{")"}, placing the following dotted
+rule in slot $6$.
+\begin{lstlisting}[escapechar=$]
+stmt:  "print" "(" exp ")" .  (0)
+\end{lstlisting}
+This triggers the completion of \code{stmt}: we advance the period in
+the matching rule from slot $0$ and place the result in slot $6$.
+\begin{lstlisting}
+stmt_list:  stmt . NEWLINE  stmt_list   (0)
+\end{lstlisting}
+The last input token is a \code{NEWLINE}, so we advance the period
+and place the new dotted rule in slot $7$.
+\begin{lstlisting}
+stmt_list:  stmt NEWLINE .  stmt_list  (0)
+\end{lstlisting}
+We are close to the end of parsing the input!
+The period is before the \code{stmt\_list} nonterminal, so we
+apply prediction for \code{stmt\_list} and then \code{stmt}.
+\begin{lstlisting}
+stmt_list:  .                             (7)
+stmt_list:  .  stmt  NEWLINE  stmt_list   (7)
+stmt:  .  "print" "("  exp ")"            (7)
+stmt:  .  exp                             (7)
+\end{lstlisting}
+There is immediately an opportunity for completion of \code{stmt\_list},
+so we add the following to slot $7$.
+\begin{lstlisting}
+stmt_list:  stmt NEWLINE stmt_list .  (0)
+\end{lstlisting}
+This triggers another completion action for \code{stmt\_list} in slot $0$
+\begin{lstlisting}
+lang_int: stmt_list .               (0)
+\end{lstlisting}
+which in turn completes \code{lang\_int}, the start symbol of the
+grammar, so the parsing of the input is complete.
+
+For reference, we now give a general description of the Earley
+algorithm.
+\begin{enumerate}
+\item The algorithm begins by initializing slot $0$ of the chart with the
+  grammar rule for the start symbol, placing a period at the beginning
+  of the right-hand side, and recording its starting position as $0$.
+  
+\item The algorithm repeatedly applies the following three kinds of
+  actions for as long as there are opportunities to do so.
+  \begin{itemize}
+  \item Prediction: If there is a dotted rule in slot $k$ whose period
+    comes before a nonterminal, add all the rules for that nonterminal
+    into slot $k$, placing a period at the beginning of their
+    right-hand sides, and recording their starting position as
+    $k$.
+  \item Scanning: If the token at position $k$ of the input string
+    matches the symbol after the period in a dotted rule in slot $k$
+    of the chart, advance the period in the dotted rule, adding
+    the result to slot $k+1$.
+  \item Completion: If a dotted rule in slot $k$ has a period at the
+    end, consider the rules in the slot corresponding to the starting
+    position of the completed rule. If any of those rules have a
+    nonterminal following their period that matches the left-hand side
+    of the completed rule, then advance their period, placing the new
+    dotted rule in slot $k$.
+  \end{itemize}
+  While repeating these three actions, take care to never add
+  duplicate dotted rules to the chart.
+\end{enumerate}
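The three actions above can be sketched compactly in Python. This is a simplified recognizer of our own (not Lark's implementation); it tracks dotted rules as tuples \code{(lhs, rhs, dot, start)} and, for brevity, can miss completions for rules with empty right-hand sides:

```python
# A minimal Earley recognizer: grammar maps each nonterminal to a list
# of right-hand sides (tuples of symbols); tokens is the token list.
def earley_recognize(grammar, start_symbol, tokens):
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    # initialization: rules for the start symbol, period at the front
    for rhs in grammar[start_symbol]:
        chart[0].add((start_symbol, rhs, 0, 0))
    for k in range(n + 1):
        agenda = list(chart[k])
        while agenda:
            lhs, rhs, dot, start = agenda.pop()
            if dot < len(rhs) and rhs[dot] in grammar:
                # prediction: add rules for the nonterminal after the period
                for rhs2 in grammar[rhs[dot]]:
                    new = (rhs[dot], rhs2, 0, k)
                    if new not in chart[k]:
                        chart[k].add(new)
                        agenda.append(new)
            elif dot < len(rhs):
                # scanning: the next input token matches the terminal
                if k < n and tokens[k] == rhs[dot]:
                    chart[k + 1].add((lhs, rhs, dot + 1, start))
            else:
                # completion: advance rules in the starting slot whose
                # period is immediately followed by this left-hand side
                for lhs2, rhs2, dot2, start2 in list(chart[start]):
                    if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                        new = (lhs2, rhs2, dot2 + 1, start2)
                        if new not in chart[k]:
                            chart[k].add(new)
                            agenda.append(new)
    # accept when a start rule is complete over the whole input
    return any(lhs == start_symbol and dot == len(rhs) and start == 0
               for lhs, rhs, dot, start in chart[n])
```

The membership checks before each addition implement the rule that no duplicate dotted rules are ever added to the chart.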
 
-\section{The LALR Algorithm}
+We have described how the Earley algorithm recognizes that an input
+string matches a grammar, but we have not described how it builds a
+parse tree. The basic idea is simple, but it turns out that building
+parse trees in an efficient way is more complex, requiring a data
+structure called a shared packed parse forest~\citep{Tomita:1985qr}.
+The simple idea is to attach a partial parse tree to every dotted
+rule.  Initially, the tree node associated with a dotted rule has no
+children. As the period moves to the right, the nodes from the
+subparses are added as children to this tree node.
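The simple idea can be sketched as follows (a hypothetical representation of our own, not the shared packed parse forest): extend each dotted rule with a tuple of the children collected so far, so that scanning appends a token leaf, completion appends a finished subtree, and a rule with the period at the end yields a tree node.

```python
# Each dotted rule carries the children parsed so far:
# (lhs, rhs, dot, start, children).

def scan(item, token):
    # scanning: append the matched token as a leaf
    lhs, rhs, dot, start, children = item
    return (lhs, rhs, dot + 1, start, children + (token,))

def complete(item, finished):
    # completion: append the finished subtree (lhs2, children2)
    lhs, rhs, dot, start, children = item
    return (lhs, rhs, dot + 1, start, children + (finished,))

def node(item):
    # a rule with the period at the end yields a parse-tree node
    lhs, rhs, dot, start, children = item
    assert dot == len(rhs)
    return (lhs, children)
```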
+
+As mentioned at the beginning of this section, the Earley algorithm is
+$O(n^2)$ for unambiguous grammars, which means that it can parse input
+files that contain thousands of tokens in a reasonable amount of time,
+but not millions. In the next section we discuss the LALR(1) parsing
+algorithm, which has time complexity $O(n)$, making it practical to
+use with even the largest of input files.
+
+\section{The LALR(1) Algorithm}
 \label{sec:lalr}
 
+The LALR(1) algorithm consists of a finite automaton and a stack that
+records its progress in parsing the input string.  Each element of the
+stack is a pair: a state number and a grammar symbol (a terminal or
+nonterminal). The symbol characterizes the input that has been parsed
+so far, and the state number is used to remember how to proceed once
+the next symbol's worth of input has been parsed.  Each state in the
+finite automaton represents where the parser stands in the parsing
+process with respect to certain grammar rules. In particular, each
+state is associated with a set of dotted rules.
+
+Figure~\ref{fig:shift-reduce} shows an example LALR(1) parse table
+generated by Lark for the following simple but ambiguous grammar:
+\begin{lstlisting}[escapechar=$]
+exp: INT
+   | exp "+" exp
+stmt: "print" exp
+start: stmt
+\end{lstlisting}
+%% When PLY generates a parse table, it also
+%% outputs a textual representation of the parse table to the file
+%% \texttt{parser.out} which is useful for debugging purposes.
+Consider state 1 in Figure~\ref{fig:shift-reduce}. The parser has just
+read in a \lstinline{PRINT} token, so the top of the stack is
+\lstinline{(1,PRINT)}. The parser is part of the way through parsing
+the input according to grammar rule 1, which is signified by showing
+rule 1 with a period after the \code{PRINT} token and before the
+\code{exp} nonterminal.  A rule with a period in it is called an
+\emph{item}. There are several rules that could apply next, namely
+rules 2 and 3, so state 1 also shows those rules with a period at the
+beginning of their right-hand sides. The edges between states indicate
+which transitions the automaton should make depending on the next input
+token. So, for example, if the next input token is \code{INT} then the
+parser will push \code{INT} and the target state 4 on the stack and
+transition to state 4.  Suppose we are now at the end of the input. In
+state 4 it says we should reduce by rule 3, so we pop from the stack
+the same number of items as the number of symbols in the right-hand
+side of the rule, in this case just one.  We then momentarily jump to
+the state at the top of the stack (state 1) and then follow the goto
+edge that corresponds to the left-hand side of the rule we just
+reduced by, in this case \code{exp}, so we arrive at state 3.  (A
+slightly longer example parse is shown in
+Figure~\ref{fig:shift-reduce}.)
+
+
+\begin{figure}[htbp]
+  \centering
+\includegraphics[width=5.0in]{figs/shift-reduce-conflict}  
+  \caption{An LALR(1) parse table and a trace of an example run.}
+  \label{fig:shift-reduce}
+\end{figure}
+
+In general, the shift-reduce algorithm works as follows. Look at the
+next input token.
+\begin{itemize}
+\item If there is a shift edge for the input token, push the
+  edge's target state and the input token on the stack and proceed to
+  the edge's target state.
+\item If there is a reduce action for the input token, pop $k$
+  elements from the stack, where $k$ is the number of symbols in the
+  right-hand side of the rule being reduced. Jump to the state at the
+  top of the stack and then follow the goto edge for the nonterminal
+  that matches the left-hand side of the rule we're reducing by. Push
+  the edge's target state and the nonterminal on the stack.
+\end{itemize}
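The shift-reduce loop above can be sketched in Python. The table encoding is hypothetical (our own, not Lark's): \code{shift[state][token]} and \code{goto[state][nonterminal]} give target states, \code{reductions[state][token]} gives a rule number, and \code{rules[n]} records each rule's left-hand side and the length of its right-hand side.

```python
def shift_reduce(shift, reductions, goto, rules, accept_states, tokens):
    stack = [(0, None)]            # pairs of (state, symbol)
    pos = 0
    while True:
        state = stack[-1][0]
        token = tokens[pos] if pos < len(tokens) else '$END'
        if token in shift.get(state, {}):
            # shift: push the edge's target state and the token
            stack.append((shift[state][token], token))
            pos += 1
        elif token in reductions.get(state, {}):
            # reduce: pop one pair per symbol in the right-hand side,
            # then follow the goto edge for the rule's left-hand side
            lhs, length = rules[reductions[state][token]]
            for _ in range(length):
                stack.pop()
            stack.append((goto[stack[-1][0]][lhs], lhs))
        elif state in accept_states and token == '$END':
            return True            # input accepted
        else:
            return False           # no action applies: reject

# Hand-built tables for the unambiguous grammar
#   stmt: "print" exp ;  exp: exp "+" INT | INT
rules = {1: ('stmt', 2), 2: ('exp', 3), 3: ('exp', 1)}
shift = {0: {'"print"': 1}, 1: {'INT': 4}, 3: {'"+"': 6}, 6: {'INT': 7}}
reductions = {3: {'$END': 1},
              4: {'"+"': 3, '$END': 3},
              7: {'"+"': 2, '$END': 2}}
goto = {0: {'stmt': 5}, 1: {'exp': 3}}
```

With these tables, parsing \code{print 1 + 2} shifts \code{"print"} and \code{INT}, reduces by rule 3 to an \code{exp}, shifts \code{"+"} and \code{INT}, reduces by rule 2, and finally reduces by rule 1 to reach the accepting state.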
+
+Notice that in state 6 of Figure~\ref{fig:shift-reduce} there is both
+a shift and a reduce action for the token \lstinline{PLUS}, so the
+algorithm does not know which action to take in this case. When a
+state has both a shift and a reduce action for the same token, we say
+there is a \emph{shift/reduce conflict}.  In this case, the conflict
+will arise, for example, when trying to parse the input
+\lstinline{print 1 + 2 + 3}.  After having consumed \lstinline{print 1 + 2}
+the parser will be in state 6, and it will not know whether to
+reduce to form an \emph{exp} of \lstinline{1 + 2}, or whether it
+should proceed by shifting the next \lstinline{+} from the input.
+
+A similar kind of problem, known as a \emph{reduce/reduce} conflict,
+arises when there are two reduce actions in a state for the same
+token. To understand which grammars give rise to shift/reduce and
+reduce/reduce conflicts, it helps to know how the parse table is
+generated from the grammar, which we discuss next.
+
+The parse table is generated one state at a time. State 0 represents
+the start of the parser. We add the grammar rule for the start symbol
+to this state with a period at the beginning of the right-hand side,
+similar to the initialization phase of the Earley parser.  If the
+period appears immediately before another nonterminal, we add all the
+rules with that nonterminal on the left-hand side. Again, we place a
+period at the beginning of the right-hand side of each of the new
+rules. This process, called \emph{state closure}, continues
+until there are no more rules to add (similar to the prediction
+actions of an Earley parser). We then examine each dotted rule in the
+current state $I$. Suppose a dotted rule has the form $A ::=
+\alpha.X\beta$, where $A$ is a nonterminal, $X$ is a symbol, and
+$\alpha$ and $\beta$ are sequences of symbols. We create a new state,
+call it $J$.  If $X$
+is a terminal, we create a shift edge from $I$ to $J$ (analogous to
+scanning in Earley), whereas if $X$ is a nonterminal, we create a
+goto edge from $I$ to $J$.  We then need to add some dotted rules to
+state $J$. We start by adding all dotted rules from state $I$ that
+have the form $B ::= \gamma.X\kappa$ (where $B$ is any nonterminal and
+$\gamma$ and $\kappa$ are arbitrary sequences of symbols), but with
+the period moved past the $X$.  (This is analogous to completion in
+the Earley algorithm.)  We then perform state closure on $J$.  This
+process repeats until there are no more states or edges to add.
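The state closure and edge construction can be sketched as follows. This simplification builds LR(0) items of the form \code{(lhs, rhs, dot)}, ignoring the lookahead computation that distinguishes LALR(1); the function names are our own.

```python
# grammar maps each nonterminal to a list of right-hand sides.

def closure(items, grammar):
    # repeatedly add rules for any nonterminal that follows a period
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:
                for rhs2 in grammar[rhs[dot]]:
                    if (rhs[dot], rhs2, 0) not in items:
                        items.add((rhs[dot], rhs2, 0))
                        changed = True
    return frozenset(items)

def successor(state, X, grammar):
    # follow the shift/goto edge for symbol X: advance the period past
    # X in every item that has X after the period, then take the closure
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in state
             if dot < len(rhs) and rhs[dot] == X}
    return closure(moved, grammar)
```

Starting from the closure of the start rule and repeatedly taking successors for every symbol yields all the states and edges of the automaton.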
+
+We then mark states as accepting states if they have a dotted rule
+that is the start rule with a period at the end.  Also, to add
+in the reduce actions, we look for any state containing a dotted rule
+with a period at the end. Let $n$ be the rule number for this dotted
+rule. We then put a reduce $n$ action into that state for every token
+$Y$. For example, in Figure~\ref{fig:shift-reduce} state 4 has a
+dotted rule with a period at the end. We therefore put a reduce by
+rule 3 action into state 4 for every
+token. (Figure~\ref{fig:shift-reduce} does not show a reduce rule for
+\code{INT} in state 4 because this grammar does not allow two
+consecutive \code{INT} tokens in the input. We will not go into how
+this can be figured out, but in any event it does no harm to have a
+reduce rule for \code{INT} in state 4; it just means the input will be
+rejected at a later point in the parsing process.)
+
+\begin{exercise}
+On a piece of paper, walk through the parse table generation 
+process for the grammar in Figure~\ref{fig:parser1} and check
+your results against Figure~\ref{fig:shift-reduce}. 
+\end{exercise}
+
+
 \section{Further Reading}
 
 UNDER CONSTRUCTION