@@ -4298,7 +4298,7 @@ The rule \code{exp: exp "+" exp} says that any string that matches
 \code{exp}, followed by the \code{+} character, followed by another
 string that matches \code{exp}, is itself an \code{exp}. For example,
 the string \code{'1+3'} is an \code{exp} because \code{'1'} and
-\code{'3'} are both \code{exp} by rule \code{exp: INT}, and then the
+\code{'3'} are both \code{exp} by the rule \code{exp: INT}, and then the
 rule for addition applies to categorize \code{'1+3'} as an \Exp{}. We
 can visualize the application of grammar rules to categorize a string
 using a \emph{parse tree}\index{subject}{parse tree}. Each internal
@@ -4426,9 +4426,9 @@ nonterminal \code{exp\_hi} for all the other expressions, and uses
 subtraction. Furthermore, unary subtraction uses \code{exp\_hi} for
 its child.
 
-For languages with more operators with more precedence levels, one
+For languages with more operators and more precedence levels, one
 would need to refine the \code{exp} nonterminal into several
-nonterminals,
+nonterminals, one for each precedence level.
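To illustrate the idea, here is a sketch in the same Lark-style notation (a hypothetical extension, not a grammar from this chapter) that refines \code{exp} into one nonterminal per precedence level, with comparison, additive, and multiplicative operators:

```text
exp: exp "==" exp_add | exp_add          // comparisons (lowest precedence)
exp_add: exp_add "+" exp_mul | exp_mul   // additive operators
exp_mul: exp_mul "*" exp_hi | exp_hi     // multiplicative operators
exp_hi: INT | "(" exp ")"                // atoms (highest precedence)
```

Each level uses the next-higher level for its operands, so lower-precedence operators bind more loosely.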
 
 \begin{figure}[tbp]
 \begin{tcolorbox}[colback=white]
@@ -4514,7 +4514,8 @@ In this section we discuss the parsing algorithm of
 The algorithm is powerful in that it can handle any context-free
 grammar, which makes it easy to use. However, it is not the most
 efficient parsing algorithm: it is $O(n^3)$ for ambiguous grammars and
-$O(n^2)$ for unambiguous grammars~\citep{Hopcroft06:_automata}. In
+$O(n^2)$ for unambiguous grammars~\citep{Hopcroft06:_automata}, where
+$n$ is the number of tokens in the input string. In
 section~\ref{sec:lalr} we learn about the LALR algorithm, which is
 more efficient but can only handle a subset of the context-free
 grammars.
@@ -4531,16 +4532,24 @@ been parsed. For example, the dotted rule
 \begin{lstlisting}
 exp: exp "+" . exp_hi
 \end{lstlisting}
-represents a partial parse that has matched an expression followed by
-\code{+}, but has not yet parsed an expression to the right of
+represents a partial parse that has matched an \code{exp} followed by
+\code{+}, but has not yet parsed an \code{exp} to the right of
 \code{+}.
-
-The algorithm begins by creating dotted rules for all the grammar
-rules whose left-hand side is the start symbol and placing then in
-slot $0$ of the chart. For example, given the grammar in
-figure~\ref{fig:Lint-lark-grammar}, we would place
+%
+The Earley algorithm starts with an initialization phase, and then
+repeats three actions (prediction, scanning, and completion) for as
+long as opportunities arise for those actions. We demonstrate the
+Earley algorithm on a running example, parsing the following program:
 \begin{lstlisting}
- lang_int: . stmt_list
+ print(1 + 3)
+\end{lstlisting}
+The algorithm's initialization phase creates dotted rules for all the
+grammar rules whose left-hand side is the start symbol and places them
+in slot $0$ of the chart. It also records the starting position of each
+dotted rule, shown in parentheses on the right. For example, given the
+grammar in figure~\ref{fig:Lint-lark-grammar}, we place
+\begin{lstlisting}
+ lang_int: . stmt_list (0)
 \end{lstlisting}
 in slot $0$ of the chart. The algorithm then proceeds to its
 \emph{prediction} phase in which it adds more dotted rules to the
@@ -4549,32 +4558,340 @@ the nonterminal \code{stmt\_list} appears after a period, so we add all
 the rules for \code{stmt\_list} to slot $0$, with a period at the
 beginning of their right-hand sides, as follows:
 \begin{lstlisting}
-stmt_list: .
-stmt_list: . stmt NEWLINE stmt_list
+stmt_list: . (0)
+stmt_list: . stmt NEWLINE stmt_list (0)
+\end{lstlisting}
+We continue to perform prediction actions as more opportunities
+arise. For example, the \code{stmt} nonterminal now appears after a
+period, so we add all the rules for \code{stmt}.
+\begin{lstlisting}
+stmt: . "print" "(" exp ")" (0)
+stmt: . exp (0)
+\end{lstlisting}
+This reveals more opportunities for prediction, so we add the grammar
+rules for \code{exp} and \code{exp\_hi}.
+\begin{lstlisting}[escapechar=$]
+exp: . exp "+" exp_hi (0)
+exp: . exp "-" exp_hi (0)
+exp: . exp_hi (0)
+exp_hi: . INT (0)
+exp_hi: . "input_int" "(" ")" (0)
+exp_hi: . "-" exp_hi (0)
+exp_hi: . "(" exp ")" (0)
+\end{lstlisting}
+
+We have exhausted the opportunities for prediction, so the algorithm
+proceeds to \emph{scanning}, in which we inspect the next input token
+and look for a dotted rule at the current position that has a matching
+terminal following the period. In our running example, the first input
+token is \code{"print"}, so we identify the dotted rule in slot $0$ of
+the chart:
+\begin{lstlisting}
+stmt: . "print" "(" exp ")" (0)
+\end{lstlisting}
+and add the following rule to slot $1$ of the chart, with the period
+moved forward past \code{"print"}.
+\begin{lstlisting}
+stmt: "print" . "(" exp ")" (0)
+\end{lstlisting}
+If the new dotted rule had a nonterminal after the period, we would
+need to carry out a prediction action, adding more dotted rules into
+slot $1$. That is not the case, so we continue scanning. The next
+input token is \code{"("}, so we add the following to slot $2$ of the
+chart.
+\begin{lstlisting}
+stmt: "print" "(" . exp ")" (0)
+\end{lstlisting}
+
+Now we have a nonterminal after the period, so we carry out several
+prediction actions, adding dotted rules for \code{exp} and
+\code{exp\_hi} to slot $2$ with a period at the beginning and with
+starting position $2$.
+\begin{lstlisting}[escapechar=$]
+exp: . exp "+" exp_hi (2)
+exp: . exp "-" exp_hi (2)
+exp: . exp_hi (2)
+exp_hi: . INT (2)
+exp_hi: . "input_int" "(" ")" (2)
+exp_hi: . "-" exp_hi (2)
+exp_hi: . "(" exp ")" (2)
+\end{lstlisting}
+With that prediction complete, we return to scanning, noting that the
+next input token is \code{"1"}, which the lexer categorized as an
+\code{INT}. There is a matching rule in slot $2$:
+\begin{lstlisting}
+exp_hi: . INT (2)
+\end{lstlisting}
+so we advance the period and put the following rule in slot $3$.
+\begin{lstlisting}
+exp_hi: INT . (2)
 \end{lstlisting}
-The prediction phase continues to add dotted rules as more
-opportunities arise. For example, the \code{stmt} nonterminal now
-appears after a period, so we add all the rules for \code{stmt}.
+This brings us to \emph{completion} actions. When the period reaches
+the end of a dotted rule, we have finished parsing a substring
+according to the left-hand side of the rule, in this case
+\code{exp\_hi}. We therefore need to advance the period in any dotted
+rules in slot $2$ (the starting position of the finished rule) in
+which the period is immediately followed by \code{exp\_hi}. So we identify
 \begin{lstlisting}
-stmt: . "print" "(" exp ")"
-stmt: . exp
+exp: . exp_hi (2)
+\end{lstlisting}
+and add the following dotted rule to slot $3$:
+\begin{lstlisting}
+exp: exp_hi . (2)
+\end{lstlisting}
+This triggers another completion step for the nonterminal \code{exp},
+adding two more dotted rules to slot $3$.
+\begin{lstlisting}[escapechar=$]
+exp: exp . "+" exp_hi (2)
+exp: exp . "-" exp_hi (2)
+\end{lstlisting}
+
+Returning to scanning, the next input token is \code{"+"}, so
+we add the following to slot $4$.
+\begin{lstlisting}[escapechar=$]
+exp: exp "+" . exp_hi (2)
+\end{lstlisting}
+The period precedes the nonterminal \code{exp\_hi}, so prediction adds
+the following dotted rules to slot $4$ of the chart.
+\begin{lstlisting}[escapechar=$]
+exp_hi: . INT (4)
+exp_hi: . "input_int" "(" ")" (4)
+exp_hi: . "-" exp_hi (4)
+exp_hi: . "(" exp ")" (4)
+\end{lstlisting}
+The next input token is \code{"3"}, which the lexer categorized as an
+\code{INT}, so we advance the period past \code{INT} for the rules in
+slot $4$, of which there is just one, and put the following in slot $5$.
+\begin{lstlisting}[escapechar=$]
+exp_hi: INT . (4)
+\end{lstlisting}
+
+The period at the end of the rule triggers a completion action for the
+rules in slot $4$, one of which has a period before \code{exp\_hi}.
+So we advance the period and put the following in slot $5$.
+\begin{lstlisting}[escapechar=$]
+exp: exp "+" exp_hi . (2)
 \end{lstlisting}
-To finish the preduction phase, we add the grammar rules for
-\code{exp} and \code{exp\_hi}.
+This triggers another completion action for the rules in slot $2$ that
+have a period before \code{exp}.
 \begin{lstlisting}[escapechar=$]
-exp: . exp "+" exp_hi
-exp: . exp "-" exp_hi
-exp: . exp_hi
-exp_hi: . INT
-exp_hi: . "input_int" "(" ")"
-exp_hi: . "-" exp_hi
-exp_hi: . "(" exp ")"
+stmt: "print" "(" exp . ")" (0)
+exp: exp . "+" exp_hi (2)
+exp: exp . "-" exp_hi (2)
 \end{lstlisting}
 
+We scan the next input token \code{")"}, placing the following dotted
+rule in slot $6$.
+\begin{lstlisting}[escapechar=$]
+stmt: "print" "(" exp ")" . (0)
+\end{lstlisting}
+This triggers the completion of \code{stmt} in slot $0$:
+\begin{lstlisting}
+stmt_list: stmt . NEWLINE stmt_list (0)
+\end{lstlisting}
+The last input token is a \code{NEWLINE}, so we advance the period
+and place the new dotted rule in slot $7$.
+\begin{lstlisting}
+stmt_list: stmt NEWLINE . stmt_list (0)
+\end{lstlisting}
+We are close to the end of parsing the input!
+The period is before the \code{stmt\_list} nonterminal, so we
+apply prediction for \code{stmt\_list} and then \code{stmt}.
+\begin{lstlisting}
+stmt_list: . (7)
+stmt_list: . stmt NEWLINE stmt_list (7)
+stmt: . "print" "(" exp ")" (7)
+stmt: . exp (7)
+\end{lstlisting}
+There is immediately an opportunity for completion of \code{stmt\_list},
+so we add the following to slot $7$.
+\begin{lstlisting}
+stmt_list: stmt NEWLINE stmt_list . (0)
+\end{lstlisting}
+This triggers another completion action for \code{stmt\_list} in slot $0$:
+\begin{lstlisting}
+lang_int: stmt_list . (0)
+\end{lstlisting}
+which in turn completes \code{lang\_int}, the start symbol of the
+grammar, so the parsing of the input is complete.
+
+For reference, we now give a general description of the Earley
+algorithm.
+\begin{enumerate}
+\item The algorithm begins by initializing slot $0$ of the chart with the
+  grammar rule for the start symbol, placing a period at the beginning
+  of the right-hand side, and recording its starting position as $0$.
+
+\item The algorithm repeatedly applies the following three kinds of
+  actions for as long as there are opportunities to do so.
+  \begin{itemize}
+  \item Prediction: If there is a dotted rule in slot $k$ whose period
+    comes before a nonterminal, add all the rules for that nonterminal
+    into slot $k$, placing a period at the beginning of their
+    right-hand sides, and recording their starting position as
+    $k$.
+  \item Scanning: If the token at position $k$ of the input string
+    matches the symbol after the period in a dotted rule in slot $k$
+    of the chart, advance the period in the dotted rule, adding
+    the result to slot $k+1$.
+  \item Completion: If a dotted rule in slot $k$ has a period at the
+    end, consider the rules in the slot corresponding to the starting
+    position of the completed rule. If any of those rules have a
+    nonterminal following their period that matches the left-hand side
+    of the completed rule, then advance their period, placing the new
+    dotted rule in slot $k$.
+  \end{itemize}
+  While repeating these three actions, take care never to add
+  duplicate dotted rules to the chart.
+\end{enumerate}
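The three actions above can be sketched in Python. This is a minimal recognizer of our own devising, not Lark's implementation: the grammar below is a simplified stand-in without empty rules (empty rules require extra care during completion), the input is assumed to be already lexed into a list of tokens, and each item is a tuple \code{(lhs, rhs, dot, start)} mirroring the dotted-rule notation.

```python
# Simplified stand-in grammar: nonterminal -> list of alternatives.
# Terminals are token names like "INT" or literal tokens like "+".
GRAMMAR = {
    "exp": [["exp", "+", "exp_hi"], ["exp_hi"]],
    "exp_hi": [["INT"], ["(", "exp", ")"]],
}
START = "exp"

def is_nonterminal(sym):
    return sym in GRAMMAR

def earley_recognize(tokens):
    # chart[k] holds the items (dotted rules) for input position k.
    # An item (lhs, rhs, dot, start) reads "lhs: rhs[:dot] . rhs[dot:] (start)".
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in GRAMMAR[START]:                      # initialization
        chart[0].add((START, tuple(rhs), 0, 0))
    for k in range(len(tokens) + 1):
        worklist = list(chart[k])
        while worklist:
            lhs, rhs, dot, start = worklist.pop()
            if dot < len(rhs) and is_nonterminal(rhs[dot]):
                # Prediction: add rules for the nonterminal after the period.
                for alt in GRAMMAR[rhs[dot]]:
                    new = (rhs[dot], tuple(alt), 0, k)
                    if new not in chart[k]:
                        chart[k].add(new)
                        worklist.append(new)
            elif dot < len(rhs):
                # Scanning: advance past a matching terminal into slot k+1.
                if k < len(tokens) and tokens[k] == rhs[dot]:
                    chart[k + 1].add((lhs, rhs, dot + 1, start))
            else:
                # Completion: advance items in chart[start] waiting on lhs.
                for l2, r2, d2, s2 in list(chart[start]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, s2)
                        if new not in chart[k]:
                            chart[k].add(new)
                            worklist.append(new)
    # Accept if a start rule spans the whole input.
    return any((START, tuple(rhs), len(rhs), 0) in chart[len(tokens)]
               for rhs in GRAMMAR[START])
```

For example, \code{earley\_recognize(["INT", "+", "INT"])} accepts, while \code{earley\_recognize(["INT", "+"])} rejects because no start item with a period at the end spans the input.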
 
-\section{The LALR Algorithm}
+We have described how the Earley algorithm recognizes that an input
+string matches a grammar, but we have not described how it builds a
+parse tree. The basic idea is simple, but it turns out that building
+parse trees in an efficient way is more complex, requiring a data
+structure called a shared packed parse forest~\citep{Tomita:1985qr}.
+The simple idea is to attach a partial parse tree to every dotted
+rule. Initially, the tree node associated with a dotted rule has no
+children. As the period moves to the right, the nodes from the
+subparses are added as children to this tree node.
+
+As mentioned at the beginning of this section, the Earley algorithm is
+$O(n^2)$ for unambiguous grammars, which means that it can parse input
+files that contain thousands of tokens in a reasonable amount of time,
+but not millions. In the next section we discuss the LALR(1) parsing
+algorithm, which has time complexity $O(n)$, making it practical to
+use with even the largest of input files.
+
+\section{The LALR(1) Algorithm}
 \label{sec:lalr}
 
+The LALR(1) algorithm consists of a finite automaton and a stack that
+records its progress in parsing the input string. Each element of the
+stack is a pair: a state number and a grammar symbol (a terminal or
+nonterminal). The symbol characterizes the input that has been parsed
+so far, and the state number is used to remember how to proceed once
+the next symbol's worth of input has been parsed. Each state in the
+finite automaton represents where the parser stands in the parsing
+process with respect to certain grammar rules. In particular, each
+state is associated with a set of dotted rules.
+
+Figure~\ref{fig:shift-reduce} shows an example LALR(1) parse table
+generated by Lark for the following simple but ambiguous grammar:
+\begin{lstlisting}[escapechar=$]
+exp: INT
+   | exp "+" exp
+stmt: "print" exp
+start: stmt
+\end{lstlisting}
+%% When PLY generates a parse table, it also
+%% outputs a textual representation of the parse table to the file
+%% \texttt{parser.out} which is useful for debugging purposes.
+Consider state 1 in Figure~\ref{fig:shift-reduce}. The parser has just
+read in a \lstinline{PRINT} token, so the top of the stack is
+\lstinline{(1,PRINT)}. The parser is part of the way through parsing
+the input according to grammar rule 1, which is signified by showing
+rule 1 with a period after the \code{PRINT} token and before the
+\code{exp} nonterminal. A rule with a period in it is called an
+\emph{item}. There are several rules that could apply next, namely
+rules 2 and 3, so state 1 also shows those rules with a period at the
+beginning of their right-hand sides. The edges between states indicate
+which transitions the automaton should make depending on the next input
+token. So, for example, if the next input token is \code{INT} then the
+parser will push \code{INT} and the target state 4 on the stack and
+transition to state 4. Suppose we are now at the end of the input. In
+state 4 it says we should reduce by rule 3, so we pop from the stack
+the same number of items as the number of symbols in the right-hand
+side of the rule, in this case just one. We then momentarily jump to
+the state at the top of the stack (state 1) and then follow the goto
+edge that corresponds to the left-hand side of the rule we just
+reduced by, in this case \code{exp}, so we arrive at state 3. (A
+slightly longer example parse is shown in
+Figure~\ref{fig:shift-reduce}.)
+
+\begin{figure}[htbp]
+  \centering
+\includegraphics[width=5.0in]{figs/shift-reduce-conflict}
+  \caption{An LALR(1) parse table and a trace of an example run.}
+  \label{fig:shift-reduce}
+\end{figure}
+
+In general, the shift-reduce algorithm works as follows. Look at the
+next input token.
+\begin{itemize}
+\item If there is a shift edge for the input token, push the
+  edge's target state and the input token on the stack and proceed to
+  the edge's target state.
+\item If there is a reduce action for the input token, pop $k$
+  elements from the stack, where $k$ is the number of symbols in the
+  right-hand side of the rule being reduced. Jump to the state at the
+  top of the stack and then follow the goto edge for the nonterminal
+  that matches the left-hand side of the rule we're reducing by. Push
+  the edge's target state and the nonterminal on the stack.
+\end{itemize}
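The shift-reduce loop above can be sketched in Python. The tables below are hand-built for the running example grammar, not taken from Lark's output: the state numbers are chosen to echo the figure but are our own, and the state 6 conflict is resolved here as a reduce, which makes \code{+} left associative.

```python
# Grammar: 0: start -> stmt, 1: stmt -> PRINT exp,
#          2: exp -> exp PLUS exp, 3: exp -> INT
RULES = [("start", ["stmt"]), ("stmt", ["PRINT", "exp"]),
         ("exp", ["exp", "PLUS", "exp"]), ("exp", ["INT"])]

# Hand-built illustration tables; "$" marks the end of the input.
ACTION = {
    0: {"PRINT": ("shift", 1)},
    1: {"INT": ("shift", 4)},
    2: {"$": ("accept", None)},
    3: {"PLUS": ("shift", 5), "$": ("reduce", 1)},
    4: {"PLUS": ("reduce", 3), "$": ("reduce", 3)},
    5: {"INT": ("shift", 4)},
    6: {"PLUS": ("reduce", 2), "$": ("reduce", 2)},  # conflict resolved as reduce
}
GOTO = {0: {"stmt": 2}, 1: {"exp": 3}, 5: {"exp": 6}}

def parse(tokens):
    tokens = tokens + ["$"]
    stack = [(0, None)]              # pairs of (state, grammar symbol)
    pos = 0
    while True:
        state = stack[-1][0]
        action = ACTION.get(state, {}).get(tokens[pos])
        if action is None:
            return False             # no shift or reduce applies: reject
        kind, arg = action
        if kind == "accept":
            return True
        if kind == "shift":          # push the token and the target state
            stack.append((arg, tokens[pos]))
            pos += 1
        else:                        # reduce by rule number arg
            lhs, rhs = RULES[arg]
            for _ in rhs:            # pop one pair per right-hand-side symbol
                stack.pop()
            stack.append((GOTO[stack[-1][0]][lhs], lhs))
```

Running \code{parse(["PRINT", "INT", "PLUS", "INT"])} follows the trace described above: shift to states 1 and 4, reduce by rule 3, shift the \code{PLUS}, and so on until the accept action in state 2.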
+
+Notice that in state 6 of Figure~\ref{fig:shift-reduce} there is both
+a shift and a reduce action for the token \lstinline{PLUS}, so the
+algorithm does not know which action to take in this case. When a
+state has both a shift and a reduce action for the same token, we say
+there is a \emph{shift/reduce conflict}. Here the conflict arises,
+for example, when trying to parse the input
+\lstinline{print 1 + 2 + 3}. After having consumed \lstinline{print 1 + 2},
+the parser will be in state 6, and it will not know whether to
+reduce to form an \emph{exp} of \lstinline{1 + 2}, or whether it
+should proceed by shifting the next \lstinline{+} from the input.
+
+A similar kind of problem, known as a \emph{reduce/reduce} conflict,
+arises when there are two reduce actions in a state for the same
+token. To understand which grammars give rise to shift/reduce and
+reduce/reduce conflicts, it helps to know how the parse table is
+generated from the grammar, which we discuss next.
+
+The parse table is generated one state at a time. State 0 represents
+the start of the parser. We add the grammar rule for the start symbol
+to this state with a period at the beginning of the right-hand side,
+similar to the initialization phase of the Earley parser. If the
+period appears immediately before a nonterminal, we add all the
+rules with that nonterminal on the left-hand side. Again, we place a
+period at the beginning of the right-hand side of each of the new
+rules. This process, called \emph{state closure}, continues
+until there are no more rules to add (similar to the prediction
+actions of an Earley parser). We then examine each dotted rule in the
+current state $I$. Suppose a dotted rule has the form $A ::=
+\alpha.X\beta$, where $A$ is a nonterminal, $X$ is a grammar symbol,
+and $\alpha$ and $\beta$
+are sequences of symbols. We create a new state, call it $J$. If $X$
+is a terminal, we create a shift edge from $I$ to $J$ (analogous to
+scanning in Earley), whereas if $X$ is a nonterminal, we create a
+goto edge from $I$ to $J$. We then need to add some dotted rules to
+state $J$. We start by adding all dotted rules from state $I$ that
+have the form $B ::= \gamma.X\kappa$ (where $B$ is any nonterminal and
+$\gamma$ and $\kappa$ are arbitrary sequences of symbols), but with
+the period moved past the $X$. (This is analogous to completion in
+the Earley algorithm.) We then perform state closure on $J$. This
+process repeats until there are no more states or edges to add.
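The closure-and-goto construction just described can be sketched in Python (our own illustration, using the running example grammar; items are \code{(lhs, rhs, dot)} triples):

```python
# Running example grammar: nonterminal -> list of alternatives.
GRAMMAR = {
    "start": [["stmt"]],
    "stmt": [["PRINT", "exp"]],
    "exp": [["exp", "PLUS", "exp"], ["INT"]],
}

def closure(items):
    # State closure: repeatedly add rules for any nonterminal that
    # appears right after a period, until no more items can be added.
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for alt in GRAMMAR[rhs[dot]]:
                    new = (rhs[dot], tuple(alt), 0)
                    if new not in items:
                        items.add(new)
                        changed = True
    return items

def goto(items, symbol):
    # Move the period past `symbol` in every item that expects it,
    # then close the resulting set to form the target state.
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)
```

Starting from the closure of the start item and repeatedly applying \code{goto} over every grammar symbol enumerates the states and the shift/goto edges of the automaton.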
+
+We then mark states as accepting states if they have a dotted rule
+that is the start rule with a period at the end. Also, to add
+in the reduce actions, we look for any state containing a dotted rule
+with a period at the end. Let $n$ be the rule number for this dotted
+rule. We then put a reduce $n$ action into that state for every token
+$Y$. For example, in Figure~\ref{fig:shift-reduce} state 4 has a
+dotted rule with a period at the end. We therefore put a reduce by
+rule 3 action into state 4 for every
+token. (Figure~\ref{fig:shift-reduce} does not show a reduce rule for
+\code{INT} in state 4 because this grammar does not allow two
+consecutive \code{INT} tokens in the input. We will not go into how
+this can be figured out, but in any event it does no harm to have a
+reduce rule for \code{INT} in state 4; it just means the input will be
+rejected at a later point in the parsing process.)
+
+\begin{exercise}
+On a piece of paper, walk through the parse table generation
+process for the grammar in Figure~\ref{fig:parser1} and check
+your results against Figure~\ref{fig:shift-reduce}.
+\end{exercise}
+
 \section{Further Reading}
 
 UNDER CONSTRUCTION