|
@@ -4103,14 +4103,17 @@ framework~\citep{shinan20:_lark_docs} to translate the concrete syntax
|
|
|
of \LangInt{} (a sequence of characters) into an abstract syntax tree.
|
|
|
You will then be asked to use Lark to create a parser for \LangVar{}.
|
|
|
We also describe the parsing algorithms used inside Lark, studying the
|
|
|
-\citet{Earley:1970ly} and LALR(1) algorithms.
|
|
|
+\citet{Earley:1970ly} and LALR(1) algorithms~\citep{DeRemer69,Anderson73}.
|
|
|
|
|
|
A parser framework such as Lark takes in a specification of the
|
|
|
-concrete syntax and the input program and produces a parse tree. Even
|
|
|
+concrete syntax and an input program and produces a parse tree. Even
|
|
|
though a parser framework does most of the work for us, using one
|
|
|
properly requires some knowledge. In particular, we must learn about
|
|
|
its specification languages and we must learn how to deal with
|
|
|
-ambiguity in our language specifications.
|
|
|
+ambiguity in our language specifications. Also, some algorithms, such
|
|
|
+as LALR(1), place restrictions on the grammars they can handle, in
|
|
|
+which case it helps to know the algorithm when trying to decipher the
|
|
|
+error messages.
|
|
|
|
|
|
The process of parsing is traditionally subdivided into two phases:
|
|
|
\emph{lexical analysis} (also called scanning) and \emph{syntax
|
|
@@ -4125,13 +4128,13 @@ and the use of a slower but more powerful algorithm for parsing.
|
|
|
%
|
|
|
%% Likewise, parser generators typical come in pairs, with separate
|
|
|
%% generators for the lexical analyzer (or lexer for short) and for the
|
|
|
-%% parser. A paricularly influential pair of generators were
|
|
|
+%% parser. A particularly influential pair of generators were
|
|
|
%% \texttt{lex} and \texttt{yacc}. The \texttt{lex} generator was written
|
|
|
%% by \citet{Lesk:1975uq} at Bell Labs. The \texttt{yacc} generator was
|
|
|
%% written by \citet{Johnson:1979qy} at AT\&T and stands for Yet Another
|
|
|
%% Compiler Compiler.
|
|
|
%
|
|
|
-The Lark parse framwork that we use in this chapter includes both
|
|
|
+The Lark parser framework that we use in this chapter includes both
|
|
|
lexical analyzers and parsers. The next section discusses lexical
|
|
|
analysis and the remainder of the chapter discusses parsing.
|
|
|
|
|
@@ -4162,15 +4165,16 @@ Token('NEWLINE', '\n')
|
|
|
Each token includes a field for its \code{type}, such as \code{'INT'},
|
|
|
and a field for its \code{value}, such as \code{'1'}.
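The one-regular-expression-per-token idea can be sketched with Python's \code{re} module. This is an illustrative stand-in, not Lark's implementation; the token names and patterns below are assumptions chosen to mirror the example tokens above.

```python
import re

# Hypothetical token specification: one regular expression per token
# type, in the spirit of Lark's lexer generator (names are illustrative).
TOKEN_SPEC = [
    ("INT",     r"[0-9]+"),
    ("PLUS",    r"\+"),
    ("NEWLINE", r"\n"),
    ("SKIP",    r"[ \t]+"),   # whitespace is matched but discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})"
                             for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (type, value) pairs, analogous to Lark's Token objects."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("1 + 3\n"))
# → [('INT', '1'), ('PLUS', '+'), ('INT', '3'), ('NEWLINE', '\n')]
```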
|
|
|
|
|
|
-Following in the tradition of \code{lex}, the specification language
|
|
|
-for Lark's lexical analysis generator is one regular expression for
|
|
|
-each type of token. The term \emph{regular} comes from the term
|
|
|
-\emph{regular languages}, which are the languages that can be
|
|
|
-recognized by a finite automata. A \emph{regular expression} is a
|
|
|
-pattern formed of the following core elements:\index{subject}{regular
|
|
|
- expression}\footnote{Regular expressions traditionally include the
|
|
|
- empty regular expression that matches any zero-length part of a
|
|
|
- string, but Lark does not support the empty regular expression.}
|
|
|
+Following in the tradition of \code{lex}~\citep{Lesk:1975uq}, the
|
|
|
+specification language for Lark's lexical analysis generator is one
|
|
|
+regular expression for each type of token. The term \emph{regular}
|
|
|
+comes from the term \emph{regular languages}, which are the languages
|
|
|
+that can be recognized by a finite automaton. A \emph{regular
|
|
|
+ expression} is a pattern formed of the following core
|
|
|
+elements:\index{subject}{regular expression}\footnote{Regular
|
|
|
+ expressions traditionally include the empty regular expression that
|
|
|
+ matches any zero-length part of a string, but Lark does not support
|
|
|
+ the empty regular expression.}
|
|
|
\begin{itemize}
|
|
|
\item A single character $c$ is a regular expression and it only
|
|
|
matches itself. For example, the regular expression \code{a} only
|
|
@@ -4249,15 +4253,15 @@ In section~\ref{sec:grammar} we learned how to use grammar rules to
|
|
|
specify the abstract syntax of a language. We now take a closer look
|
|
|
at using grammar rules to specify the concrete syntax. Recall that
|
|
|
each rule has a left-hand side and a right-hand side. However, for
|
|
|
-concrete syntax, each right-hand side expresses a pattern to match
|
|
|
-against a string, instead of matching against an abstract syntax
|
|
|
-tree. In particular, each right-hand side is a sequence of
|
|
|
+concrete syntax, each right-hand side expresses a pattern for a
|
|
|
+string, instead of a pattern for an abstract syntax tree. In
|
|
|
+particular, each right-hand side is a sequence of
|
|
|
\emph{symbols}\index{subject}{symbol}, where a symbol is either a
|
|
|
terminal or nonterminal. A \emph{terminal}\index{subject}{terminal} is
|
|
|
a string. The nonterminals play the same role as in the abstract
|
|
|
syntax, defining categories of syntax. The nonterminals of a grammar
|
|
|
include the tokens defined in the lexer and all the nonterminals
|
|
|
-defined in the grammar rules.
|
|
|
+defined by the grammar rules.
|
|
|
|
|
|
As an example, let us take a closer look at the concrete syntax of the
|
|
|
\LangInt{} language, repeated here.
|
|
@@ -4304,12 +4308,12 @@ The rule \code{exp: exp "+" exp} says that any string that matches
|
|
|
\code{exp}, followed by the \code{+} character, followed by another
|
|
|
string that matches \code{exp}, is itself an \code{exp}. For example,
|
|
|
the string \code{'1+3'} is an \code{exp} because \code{'1'} and
|
|
|
-\code{'3'} are both \code{exp} by the rule \code{exp: INT}, and then the
|
|
|
-rule for addition applies to categorize \code{'1+3'} as an \Exp{}. We
|
|
|
-can visualize the application of grammar rules to categorize a string
|
|
|
-using a \emph{parse tree}\index{subject}{parse tree}. Each internal
|
|
|
-node in the tree is an application of a grammar rule and is labeled
|
|
|
-with the nonterminal of its left-hand side. Each leaf node is a
|
|
|
+\code{'3'} are both \code{exp} by the rule \code{exp: INT}, and then
|
|
|
+the rule for addition applies to categorize \code{'1+3'} as an
|
|
|
+\Exp{}. We can visualize the application of grammar rules to parse a
|
|
|
+string using a \emph{parse tree}\index{subject}{parse tree}. Each
|
|
|
+internal node in the tree is an application of a grammar rule and is
|
|
|
+labeled with its left-hand side nonterminal. Each leaf node is a
|
|
|
substring of the input program. The parse tree for \code{'1+3'} is
|
|
|
shown in figure~\ref{fig:simple-parse-tree}.
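To make the tree construction concrete, here is a small hand-written sketch (not Lark's machinery) that parses a string against the two rules \code{exp: exp "+" exp} and \code{exp: INT}, building each internal node as a tuple labeled with the rule's left-hand side and each leaf as a substring of the input:

```python
import re

def parse_exp(s):
    """Hand-written sketch: build a parse tree for the rules
    exp: exp "+" exp | INT, folding '+' left associatively.
    Assumes a well-formed input of INTs separated by '+'."""
    toks = re.findall(r"[0-9]+|\+", s)
    tree = ("exp", toks[0])            # exp: INT
    for i in range(1, len(toks), 2):   # each '+' wraps the tree so far
        tree = ("exp", tree, toks[i], ("exp", toks[i + 1]))
    return tree

print(parse_exp("1+3"))
# → ('exp', ('exp', '1'), '+', ('exp', '3'))
```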
|
|
|
|
|
@@ -4424,17 +4428,17 @@ is left associative. We also need to consider the interaction of unary
|
|
|
subtraction with both addition and subtraction. How should we parse
|
|
|
\code{'-1+2'}? Unary subtraction has higher
|
|
|
\emph{precedence}\index{subject}{precedence} than addition and
|
|
|
-subtraction, so \code{'-1+2'} should parse as \code{'(-1)+2'} and not
|
|
|
-\code{'-(1+2)'}. The grammar in figure~\ref{fig:Lint-lark-grammar}
|
|
|
-handles the associativity of addition and subtraction by using the
|
|
|
-nonterminal \code{exp\_hi} for all the other expressions, and uses
|
|
|
-\code{exp\_hi} for the second child in the rules for addition and
|
|
|
-subtraction. Furthermore, unary subtraction uses \code{exp\_hi} for
|
|
|
-its child.
|
|
|
-
|
|
|
-For languages with more operators and more precedence levels, one
|
|
|
-would need to refine the \code{exp} nonterminal into several
|
|
|
-nonterminals, one for each precedence level.
|
|
|
+subtraction, so \code{'-1+2'} should parse the same as \code{'(-1)+2'}
|
|
|
+and not \code{'-(1+2)'}. The grammar in
|
|
|
+figure~\ref{fig:Lint-lark-grammar} handles the associativity of
|
|
|
+addition and subtraction by using the nonterminal \code{exp\_hi} for
|
|
|
+all the other expressions, and uses \code{exp\_hi} for the second
|
|
|
+child in the rules for addition and subtraction. Furthermore, unary
|
|
|
+subtraction uses \code{exp\_hi} for its child.
|
|
|
+
|
|
|
+For languages with more operators and more precedence levels, one must
|
|
|
+refine the \code{exp} nonterminal into several nonterminals, one for
|
|
|
+each precedence level.
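As a sketch of that refinement, a hypothetical Lark-style grammar fragment with three precedence levels might look as follows; the nonterminal names \code{term} and \code{factor} and the multiplication rule are illustrative additions, not taken from the book's grammar:

```
?exp: exp "+" term        // lowest precedence, left associative
    | exp "-" term
    | term
?term: term "*" factor    // hypothetical middle precedence level
    | factor
?factor: "-" factor       // highest precedence: unary minus
    | INT
    | "(" exp ")"
```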
|
|
|
|
|
|
\begin{figure}[tbp]
|
|
|
\begin{tcolorbox}[colback=white]
|
|
@@ -4520,11 +4524,10 @@ In this section we discuss the parsing algorithm of
|
|
|
The algorithm is powerful in that it can handle any context-free
|
|
|
grammar, which makes it easy to use. However, it is not the most
|
|
|
efficient parsing algorithm: it is $O(n^3)$ for ambiguous grammars and
|
|
|
-$O(n^2)$ for unambiguous grammars~\citep{Hopcroft06:_automata}, where
|
|
|
-$n$ is the number of tokens in the input string. In
|
|
|
-section~\ref{sec:lalr} we learn about the LALR algorithm, which is
|
|
|
-more efficient but can only handle a subset of the context-free
|
|
|
-grammars.
|
|
|
+$O(n^2)$ for unambiguous grammars, where $n$ is the number of tokens
|
|
|
+in the input string~\citep{Hopcroft06:_automata}. In
|
|
|
+section~\ref{sec:lalr} we learn about the LALR(1) algorithm, which is
|
|
|
+more efficient but cannot handle all context-free grammars.
|
|
|
|
|
|
The Earley algorithm can be viewed as an interpreter; it treats the
|
|
|
grammar as the program being interpreted and it treats the concrete
|
|
@@ -4546,26 +4549,26 @@ represents a partial parse that has matched an \code{exp} followed by
|
|
|
\code{+}.
|
|
|
%
|
|
|
The Earley algorithm starts with an initialization phase, and then
|
|
|
-repeats three actions (prediction, scanning, and completion) for as
|
|
|
-long as opportunities arise for those actions. We demonstrate the
|
|
|
-Earley algorithm on a running example, parsing the following program:
|
|
|
+repeats three actions---prediction, scanning, and completion---for as
|
|
|
+long as opportunities arise. We demonstrate the Earley algorithm on a
|
|
|
+running example, parsing the following program:
|
|
|
\begin{lstlisting}
|
|
|
print(1 + 3)
|
|
|
\end{lstlisting}
|
|
|
The algorithm's initialization phase creates dotted rules for all the
|
|
|
grammar rules whose left-hand side is the start symbol and places them
|
|
|
-in slot $0$ of the chart. It also records the starting position of the
|
|
|
-dotted rule, in parentheses on the right. For example, given the
|
|
|
+in slot $0$ of the chart. We also record the starting position of the
|
|
|
+dotted rule in parentheses on the right. For example, given the
|
|
|
grammar in figure~\ref{fig:Lint-lark-grammar}, we place
|
|
|
\begin{lstlisting}
|
|
|
lang_int: . stmt_list (0)
|
|
|
\end{lstlisting}
|
|
|
in slot $0$ of the chart. The algorithm then proceeds with
|
|
|
\emph{prediction} actions in which it adds more dotted rules to the
|
|
|
-chart based on which nonterminal come after a period. In the above,
|
|
|
-the nonterminal \code{stmt\_list} appears after a period, so we add all
|
|
|
-the rules for \code{stmt\_list} to slot $0$, with a period at the
|
|
|
-beginning of their right-hand sides, as follows:
|
|
|
+chart based on which nonterminals come immediately after a period. In
|
|
|
+the above, the nonterminal \code{stmt\_list} appears after a period,
|
|
|
+so we add all the rules for \code{stmt\_list} to slot $0$, with a
|
|
|
+period at the beginning of their right-hand sides, as follows:
|
|
|
\begin{lstlisting}
|
|
|
stmt_list: . (0)
|
|
|
stmt_list: . stmt NEWLINE stmt_list (0)
|
|
@@ -4577,7 +4580,7 @@ period, so we add all the rules for \code{stmt}.
|
|
|
stmt: . "print" "(" exp ")" (0)
|
|
|
stmt: . exp (0)
|
|
|
\end{lstlisting}
|
|
|
-This reveals more opportunities for prediction, so we add the grammar
|
|
|
+This reveals yet more opportunities for prediction, so we add the grammar
|
|
|
rules for \code{exp} and \code{exp\_hi}.
|
|
|
\begin{lstlisting}[escapechar=$]
|
|
|
exp: . exp "+" exp_hi (0)
|
|
@@ -4593,8 +4596,8 @@ We have exhausted the opportunities for prediction, so the algorithm
|
|
|
proceeds to \emph{scanning}, in which we inspect the next input token
|
|
|
and look for a dotted rule at the current position that has a matching
|
|
|
terminal following the period. In our running example, the first input
|
|
|
-token is \code{"print"} so we identify the dotted rule in slot $0$ of
|
|
|
-the chart:
|
|
|
+token is \code{"print"} so we identify the rule in slot $0$ of
|
|
|
+the chart whose period comes before \code{"print"}:
|
|
|
\begin{lstlisting}
|
|
|
stmt: . "print" "(" exp ")" (0)
|
|
|
\end{lstlisting}
|
|
@@ -4626,7 +4629,7 @@ exp_hi: . "-" exp_hi (2)
|
|
|
exp_hi: . "(" exp ")" (2)
|
|
|
\end{lstlisting}
|
|
|
With that prediction complete, we return to scanning, noting that the
|
|
|
-next input token is \code{"1"} which the lexer categorized as an
|
|
|
+next input token is \code{"1"}, which the lexer categorizes as an
|
|
|
\code{INT}. There is a matching rule in slot $2$:
|
|
|
\begin{lstlisting}
|
|
|
exp_hi: . INT (2)
|
|
@@ -4636,10 +4639,10 @@ so we advance the period and put the following rule is slot $3$.
|
|
|
exp_hi: INT . (2)
|
|
|
\end{lstlisting}
|
|
|
This brings us to \emph{completion} actions. When the period reaches
|
|
|
-the end of a dotted rule, we have finished parsing a substring
|
|
|
-according to the left-hand side of the rule, in this case
|
|
|
+the end of a dotted rule, we recognize that a substring
|
|
|
+has matched the nonterminal on the left-hand side of the rule, in this case
|
|
|
\code{exp\_hi}. We therefore need to advance the periods in any dotted
|
|
|
-rules in slot $2$ (the starting position for the finished rule) the
|
|
|
+rules in slot $2$ (the starting position for the finished rule) whose
|
|
|
period is immediately followed by \code{exp\_hi}. So we identify
|
|
|
\begin{lstlisting}
|
|
|
exp: . exp_hi (2)
|
|
@@ -4756,11 +4759,11 @@ algorithm.
|
|
|
|
|
|
We have described how the Earley algorithm recognizes that an input
|
|
|
string matches a grammar, but we have not described how it builds a
|
|
|
-parse tree. The basic idea is simple, but it turns out that building
|
|
|
-parse trees in an efficient way is more complex, requiring a data
|
|
|
-structure called a shared packed parse forest~\citep{Tomita:1985qr}.
|
|
|
-The simple idea is to attach a partial parse tree to every dotted
|
|
|
-rule. Initially, the tree node associated with a dotted rule has no
|
|
|
+parse tree. The basic idea is simple, but building parse trees in an
|
|
|
+efficient way is more complex, requiring a data structure called a
|
|
|
+shared packed parse forest~\citep{Tomita:1985qr}. The simple idea is
|
|
|
+to attach a partial parse tree to every dotted rule in the chart.
|
|
|
+Initially, the tree node associated with a dotted rule has no
|
|
|
children. As the period moves to the right, the nodes from the
|
|
|
subparses are added as children to this tree node.
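The prediction, scanning, and completion actions described above can be sketched as a small recognizer. This is a simplified reconstruction, not Lark's implementation: it answers only yes or no (no parse-tree or forest construction) and ignores the corner case of rules completing over an empty substring.

```python
def earley_recognize(grammar, start, tokens):
    """Earley recognizer sketch. grammar maps each nonterminal to a
    list of right-hand sides (tuples of symbols); a symbol is a
    nonterminal iff it is a key of grammar. A chart item is a dotted
    rule (lhs, rhs, dot, origin)."""
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:                       # initialization
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs) and rhs[dot] in grammar:   # prediction
                for r in grammar[rhs[dot]]:
                    item = (rhs[dot], r, 0, i)
                    if item not in chart[i]:
                        chart[i].add(item)
                        agenda.append(item)
            elif dot < len(rhs):                         # scanning
                if i < len(tokens) and tokens[i] == rhs[dot]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                        # completion
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
    return any((start, rhs, len(rhs), 0) in chart[len(tokens)]
               for rhs in grammar[start])

# A pared-down version of the running example's grammar:
g = {"stmt":   [("print", "(", "exp", ")")],
     "exp":    [("exp", "+", "exp_hi"), ("exp_hi",)],
     "exp_hi": [("INT",)]}
print(earley_recognize(g, "stmt", ["print", "(", "INT", "+", "INT", ")"]))
# → True
```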
|
|
|
|
|
@@ -4774,29 +4777,33 @@ use with even the largest of input files.
|
|
|
\section{The LALR(1) Algorithm}
|
|
|
\label{sec:lalr}
|
|
|
|
|
|
-The LALR(1) algorithm can be viewed as a two phase approach in which
|
|
|
-it first compiles the grammar into a state machine and then runs the
|
|
|
-state machine to parse the input string. The state machine also uses
|
|
|
-a stack to record its progress in parsing the input string. Each
|
|
|
-element of the stack is a pair: a state number and a grammar symbol (a
|
|
|
-terminal or nonterminal). The symbol characterizes the input that has
|
|
|
-been parsed so-far and the state number is used to remember how to
|
|
|
-proceed once the next symbol-worth of input has been parsed. Each
|
|
|
-state in the machine represents where the parser stands in the parsing
|
|
|
-process with respect to certain grammar rules. In particular, each
|
|
|
-state is associated with a set of dotted rules.
|
|
|
-
|
|
|
-Figure~\ref{fig:shift-reduce} shows an example LALR(1) parse table
|
|
|
-generated by Lark for the following simple but amiguous grammar:
|
|
|
+The LALR(1) algorithm~\citep{DeRemer69,Anderson73} can be viewed as a
|
|
|
+two-phase approach in which it first compiles the grammar into a state
|
|
|
+machine and then runs the state machine to parse an input string.
|
|
|
+%
|
|
|
+A particularly influential implementation of LALR(1) was the
|
|
|
+\texttt{yacc} parser generator by \citet{Johnson:1979qy}, which stands
|
|
|
+for Yet Another Compiler Compiler.
|
|
|
+%
|
|
|
+The LALR(1) state machine uses a stack to record its progress in
|
|
|
+parsing the input string. Each element of the stack is a pair: a
|
|
|
+state number and a grammar symbol (a terminal or nonterminal). The
|
|
|
+symbol characterizes the input that has been parsed so far and the
|
|
|
+state number is used to remember how to proceed once the next
|
|
|
+symbol-worth of input has been parsed. Each state in the machine
|
|
|
+represents where the parser stands in the parsing process with respect
|
|
|
+to certain grammar rules. In particular, each state is associated with
|
|
|
+a set of dotted rules.
|
|
|
+
|
|
|
+Figure~\ref{fig:shift-reduce} shows an example LALR(1) state machine
|
|
|
+(also called parse table) for the following simple but ambiguous
|
|
|
+grammar:
|
|
|
\begin{lstlisting}[escapechar=$]
|
|
|
exp: INT
|
|
|
| exp "+" exp
|
|
|
stmt: "print" exp
|
|
|
start: stmt
|
|
|
\end{lstlisting}
|
|
|
-%% When PLY generates a parse table, it also
|
|
|
-%% outputs a textual representation of the parse table to the file
|
|
|
-%% \texttt{parser.out} which is useful for debugging purposes.
|
|
|
Consider state 1 in Figure~\ref{fig:shift-reduce}. The parser has just
|
|
|
read in a \lstinline{PRINT} token, so the top of the stack is
|
|
|
\lstinline{(1,PRINT)}. The parser is part of the way through parsing
|
|
@@ -4819,7 +4826,6 @@ reduced by, in this case \code{exp}, so we arrive at state 3. (A
|
|
|
slightly longer example parse is shown in
|
|
|
Figure~\ref{fig:shift-reduce}.)
|
|
|
|
|
|
-
|
|
|
\begin{figure}[htbp]
|
|
|
\centering
|
|
|
\includegraphics[width=5.0in]{figs/shift-reduce-conflict}
|
|
@@ -4827,8 +4833,8 @@ Figure~\ref{fig:shift-reduce}.)
|
|
|
\label{fig:shift-reduce}
|
|
|
\end{figure}
|
|
|
|
|
|
-In general, the shift-reduce algorithm works as follows. Look at the
|
|
|
-next input token.
|
|
|
+In general, the algorithm works as follows. Look at the next input
|
|
|
+token.
|
|
|
\begin{itemize}
|
|
|
\item If there is a shift edge for the input token, push the
|
|
|
edge's target state and the input token on the stack and proceed to
|
|
@@ -4837,8 +4843,8 @@ next input token.
|
|
|
elements from the stack, where $k$ is the number of symbols in the
|
|
|
right-hand side of the rule being reduced. Jump to the state at the
|
|
|
top of the stack and then follow the goto edge for the nonterminal
|
|
|
- that matches the left-hand side of the rule we're reducing by. Push
|
|
|
- the edge's target state and the nonterminal on the stack.
|
|
|
+ that matches the left-hand side of the rule that we are reducing
|
|
|
+ by. Push the edge's target state and the nonterminal on the stack.
|
|
|
\end{itemize}
|
|
|
|
|
|
Notice that in state 6 of Figure~\ref{fig:shift-reduce} there is both
|
|
@@ -4847,7 +4853,7 @@ algorithm does not know which action to take in this case. When a
|
|
|
state has both a shift and a reduce action for the same token, we say
|
|
|
there is a \emph{shift/reduce conflict}. In this case, the conflict
|
|
|
will arise, for example, when trying to parse the input
|
|
|
-\lstinline{print 1 + 2 + 3}. After having consumed \lstinline{print 1 + 2}
|
|
|
+\lstinline{print 1 + 2 + 3}. After having consumed \lstinline{print 1 + 2}
|
|
|
the parser will be in state 6, and it will not know whether to
|
|
|
reduce to form an \emph{exp} of \lstinline{1 + 2}, or whether it
|
|
|
should proceed by shifting the next \lstinline{+} from the input.
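The shift-reduce loop above can be sketched as a table-driven parser for this section's small grammar. The ACTION and GOTO tables below are reconstructed by hand, so the entries may not match the figure's table exactly; the state 6 conflict is resolved here in favor of reduce, which makes addition left associative.

```python
# Hand-reconstructed LALR(1)-style tables for the grammar
#   start: stmt;  stmt: "print" exp;  exp: INT | exp "+" exp
ACTION = {
    (0, "print"): ("shift", 1),
    (1, "INT"):   ("shift", 2),
    (2, "+"):     ("reduce", "exp", 1),   # exp: INT
    (2, "$"):     ("reduce", "exp", 1),
    (3, "+"):     ("shift", 4),
    (3, "$"):     ("reduce", "stmt", 2),  # stmt: "print" exp
    (4, "INT"):   ("shift", 2),
    (5, "$"):     ("accept",),
    (6, "+"):     ("reduce", "exp", 3),   # conflict resolved as reduce
    (6, "$"):     ("reduce", "exp", 3),   # exp: exp "+" exp
}
GOTO = {(0, "stmt"): 5, (1, "exp"): 3, (4, "exp"): 6}

def parse(tokens):
    """Run the stack machine; each stack element is (state, symbol)."""
    stack = [(0, None)]
    toks = tokens + ["$"]              # end-of-input marker
    i = 0
    while True:
        act = ACTION.get((stack[-1][0], toks[i]))
        if act is None:                # no entry: parse error
            return False
        if act[0] == "accept":
            return True
        if act[0] == "shift":          # push target state and token
            stack.append((act[1], toks[i]))
            i += 1
        else:                          # reduce: pop k pairs, follow goto
            _, lhs, k = act
            del stack[-k:]
            stack.append((GOTO[(stack[-1][0], lhs)], lhs))

print(parse(["print", "INT", "+", "INT", "+", "INT"]))  # → True
```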
|
|
@@ -4926,17 +4932,17 @@ your results against parse table in figure~\ref{fig:shift-reduce}.
|
|
|
\section{Further Reading}
|
|
|
|
|
|
In this chapter we have just scratched the surface of the field of
|
|
|
-parsing, with the study of a very general put less efficient algorithm
|
|
|
+parsing, with the study of a very general but less efficient algorithm
|
|
|
(Earley) and with a more limited but highly efficient algorithm
|
|
|
(LALR). There are many more algorithms, and classes of grammars, that
|
|
|
-fall between these two. We recommend the reader to \citet{Aho:2006wb}
|
|
|
-for a thorough treatment of parsing.
|
|
|
+fall between these two ends of the spectrum. We refer the reader
|
|
|
+to \citet{Aho:2006wb} for a thorough treatment of parsing.
|
|
|
|
|
|
Regarding lexical analysis, we described the specification language,
|
|
|
the regular expressions, but not the algorithms for recognizing them.
|
|
|
In short, regular expressions can be translated to nondeterministic
|
|
|
finite automata, which in turn are translated to deterministic finite automata. We
|
|
|
-refer the reader again to \citet{Aho:2006wb} for all the details of
|
|
|
+refer the reader again to \citet{Aho:2006wb} for all the details on
|
|
|
lexical analysis.
|
|
|
|
|
|
\fi}
|
|
@@ -23556,8 +23562,10 @@ registers.
|
|
|
% LocalWords: TupleProxy RawTuple InjectTuple InjectTupleProxy vecof
|
|
|
% LocalWords: InjectList InjectListProxy unannotated Lgradualr poly
|
|
|
% LocalWords: GenericVar AllType Inst builtin ap pps aps pp deepcopy
|
|
|
-% LocalWords: liskov clu Liskov dk Napier um inst popl jg seq ith
|
|
|
-% LocalWords: racketEd subparts subpart nonterminal nonterminals
|
|
|
-% LocalWords: pseudocode underapproximation underapproximations
|
|
|
-% LocalWords: semilattices overapproximate incrementing
|
|
|
-% LocalWords: multilanguage
|
|
|
+% LocalWords: liskov clu Liskov dk Napier um inst popl jg seq ith qy
|
|
|
+% LocalWords: racketEd subparts subpart nonterminal nonterminals Dyn
|
|
|
+% LocalWords: pseudocode underapproximation underapproximations LALR
|
|
|
+% LocalWords: semilattices overapproximate incrementing Earley docs
|
|
|
+% LocalWords: multilanguage Prelim shinan DeRemer lexer Lesk LPAR cb
|
|
|
+% LocalWords: RPAR abcbab abc bzca usub paren expr lang WS Tomita qr
|
|
|
+% LocalWords: subparses
|