Jeremy Siek 3 years ago
commit 915e6828b2
2 changed files with 121 additions and 18 deletions
  1. book.bib: 28 additions, 0 deletions
  2. book.tex: 93 additions, 18 deletions

+ 28 - 0
book.bib

@@ -1,3 +1,31 @@
+@article{Earley:1970ly,
+	acmid = {362035},
+	address = {New York, NY, USA},
+	author = {Earley, Jay},
+	date-added = {2011-05-28 11:31:46 -0600},
+	date-modified = {2011-05-28 11:31:48 -0600},
+	doi = {http://doi.acm.org/10.1145/362007.362035},
+	issn = {0001-0782},
+	issue = {2},
+	journal = {Commun. ACM},
+	keywords = {compilers, computational complexity, context-free grammar, parsing, syntax analysis},
+	month = {February},
+	numpages = {9},
+	pages = {94--102},
+	publisher = {ACM},
+	title = {An efficient context-free parsing algorithm},
+	url = {http://doi.acm.org/10.1145/362007.362035},
+	volume = {13},
+	year = {1970},
+	Bdsk-Url-1 = {http://doi.acm.org/10.1145/362007.362035}}
+
+@Book{Hopcroft06:_automata,
+  author = 	 {John Hopcroft and Rajeev Motwani and Jeffrey Ullman},
+  title = 	 {Introduction to Automata Theory, Languages, and Computation},
+  publisher = 	 {Pearson},
+  year = 	 2006}
+
 @techreport{Lesk:1975uq,
 	author = {M. E. Lesk and E. Schmidt},
 	date-added = {2007-08-27 13:37:27 -0600},

+ 93 - 18
book.tex

@@ -782,13 +782,14 @@ A programming language can be thought of as a \emph{set} of programs.
 The set is infinite (that is, one can always create larger programs),
 so one cannot simply describe a language by listing all the
 programs in the language. Instead we write down a set of rules, a
-\emph{grammar}, for building programs. Grammars are often used to
+\emph{context-free grammar}, for building programs. Grammars are often used to
 define the concrete syntax of a language, but they can also be used to
 describe the abstract syntax. We write our rules in a variant of
 Backus-Naur form (BNF)~\citep{Backus:1960aa,Knuth:1964aa}.
 \index{subject}{Backus-Naur form}\index{subject}{BNF} As an example,
 we describe a small language, named \LangInt{}, that consists of
-integers and arithmetic operations.  \index{subject}{grammar}
+integers and arithmetic operations.\index{subject}{grammar}
+\index{subject}{context-free grammar}

 The first grammar rule for the abstract syntax of \LangInt{} says that an
 instance of the \racket{\code{Int} structure}\python{\code{Constant} class} is an expression:
@@ -4086,7 +4087,7 @@ all, fast code is useless if it produces incorrect results!

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 {\if\edition\pythonEd
-\chapter{Parser Generation}
+\chapter{Parsing}
 \label{ch:parsing-Lvar}
 \setcounter{footnote}{0}
 \index{subject}{parsing}
@@ -4095,6 +4096,9 @@ In this chapter we learn how to use the Lark parser
 generator~\citep{shinan20:_lark_docs} to translate the concrete syntax
 of \LangInt{} (a sequence of characters) into an abstract syntax tree.
 You will then be asked to use Lark to create a parser for \LangVar{}.
+We then learn about the parsing algorithms used inside Lark, studying
+the \citet{Earley:1970ly} and LALR algorithms.
+
 A parser generator takes in a specification of the concrete syntax and
 produces a parser. Even though a parser generator does most of the
 work for us, using one properly requires some knowledge.  In
@@ -4274,10 +4278,10 @@ exp: INT
   | exp "-" exp
   | "(" exp ")"

-stmt: "print" "(" exp ")"
-    | exp
+stmt_list:
+    | stmt NEWLINE stmt_list
 
 
-lang_int: (stmt NEWLINE)*
+lang_int: stmt_list
 \end{lstlisting}
 \end{minipage}
 \end{center}
@@ -4350,7 +4354,10 @@ exp: INT                    -> int
 stmt: "print" "(" exp ")"   -> print
     | exp                   -> expr

-lang_int: (stmt NEWLINE)*   -> module
+stmt_list:                   -> empty_stmt
+    | stmt NEWLINE stmt_list -> add_stmt
+
+lang_int: stmt_list          -> module
 \end{lstlisting}
 \end{minipage}
 \end{center}
@@ -4439,7 +4446,10 @@ exp_hi: INT                 -> int
 stmt: "print" "(" exp ")"  -> print
     | exp                  -> expr

-lang_int: (stmt NEWLINE)*  -> module
+stmt_list:                   -> empty_stmt
+    | stmt NEWLINE stmt_list -> add_stmt
+
+lang_int: stmt_list          -> module
 \end{lstlisting}
 \end{tcolorbox}
 \caption{An unambiguous Lark grammar for \LangInt{}.}
@@ -4476,14 +4486,14 @@ def parse_tree_to_ast(e):

 \begin{exercise}
   \normalfont\normalsize
-
-  Use Lark to create a lexer and parser for \LangVar{}.  We recommend
-  using Lark's default parsing algorithm (Earley) with the
-  \code{ambiguity} option set to \code{'explicit'} so that if your
-  grammar is ambiguous, the output will include multiple parse trees
-  which will indicate to you that there is a problem with your
-  grammar. Your parser should ignore white space so we
-  recommend using Lark's \code{\%ignore} directive as follows.
+%
+  Use Lark to create a lexer and parser for \LangVar{}.  Use Lark's
+  default parsing algorithm (Earley) with the \code{ambiguity} option
+  set to \code{'explicit'} so that if your grammar is ambiguous, the
+  output will include multiple parse trees, which will indicate to you
+  that there is a problem with your grammar. Your parser should ignore
+  white space, so we recommend using Lark's \code{\%ignore} directive
+  as follows.
 \begin{lstlisting}
 WS: /[ \t\f\r\n]/+
 %ignore WS
@@ -4493,20 +4503,85 @@ Lark-generated parser instead of using the \code{parse} function from
 the \code{ast} module. Test your compiler on all of the \LangVar{}
 programs that you have created and create four additional programs
 that would reveal ambiguities in your grammar.
-
 \end{exercise}
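+
+To give a concrete picture of what the preceding exercise involves,
+the following is a minimal sketch of constructing and invoking a Lark
+parser with the Earley algorithm and explicit ambiguity. The grammar
+file name \code{var\_grammar.lark}, the start symbol \code{lang\_var},
+and the example program are hypothetical; substitute the names used
+in your own grammar.
+\begin{lstlisting}
+from lark import Lark
+
+# Hypothetical grammar file for LVar; use your own file name.
+grammar = open('var_grammar.lark').read()
+
+# Lark's default parsing algorithm is Earley. Setting
+# ambiguity='explicit' makes Lark return every parse tree (wrapped
+# in _ambig nodes) instead of silently choosing one, which exposes
+# ambiguities in the grammar.
+parser = Lark(grammar, start='lang_var', ambiguity='explicit')
+
+# Parse the text of a program and inspect the resulting parse tree.
+source = 'x = 40\nprint(x + 2)\n'
+print(parser.parse(source).pretty())
+\end{lstlisting}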
 
 
 
 
 \section{The Earley Algorithm}
+\label{sec:earley}
+
+In this section we discuss the parsing algorithm of
+\citet{Earley:1970ly}, which is the default algorithm used by Lark.
+The algorithm is powerful in that it can handle any context-free
+grammar, which makes it easy to use. However, it is not the most
+efficient parsing algorithm: it is $O(n^3)$ for ambiguous grammars and
+$O(n^2)$ for unambiguous grammars~\citep{Hopcroft06:_automata}.  In
+section~\ref{sec:lalr} we learn about the LALR algorithm, which is
+more efficient but can only handle a subset of the context-free
+grammars.
+
+The Earley algorithm uses a data structure called a
+\emph{chart}\index{subject}{chart} to keep track of its progress.  The
+chart is an array with one slot for each position in the input string,
+where position $0$ is before the first character and position $n$ is
+immediately after the last character. So the array has length $n+1$
+for an input string of length $n$. Each slot in the chart contains a
+set of \emph{dotted rules}. A dotted rule is simply a grammar rule
+with a period indicating how much of its right-hand side has already
+been parsed. For example, the dotted rule
+\begin{lstlisting}
+exp: exp "+" . exp_hi
+\end{lstlisting}
+represents a partial parse that has matched an expression followed by
+\code{+}, but has not yet parsed an expression to the right of
+\code{+}.
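+
+To make the chart and dotted rules concrete, here is one way (a
+simplified sketch, not Lark's actual implementation) to represent
+them in Python: a dotted rule records a grammar rule together with
+the number of right-hand-side symbols to the left of the period, and
+the chart is a list of sets, one per input position.
+\begin{lstlisting}
+from dataclasses import dataclass
+
+@dataclass(frozen=True)
+class DottedRule:
+    lhs: str    # left-hand side, e.g. 'exp'
+    rhs: tuple  # right-hand side symbols, e.g. ('exp', '"+"', 'exp_hi')
+    dot: int    # number of right-hand side symbols already parsed
+
+    def next_symbol(self):
+        # The symbol just after the period, or None if the period
+        # is at the end of the right-hand side.
+        return self.rhs[self.dot] if self.dot < len(self.rhs) else None
+
+def empty_chart(n):
+    # One set of dotted rules per position 0, 1, ..., n.
+    return [set() for _ in range(n + 1)]
+\end{lstlisting}
+For example, the dotted rule displayed above corresponds to
+\code{DottedRule('exp', ('exp', '"+"', 'exp\_hi'), 2)}: two of the
+three symbols on the right-hand side have been parsed.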
+
+The algorithm begins by creating dotted rules for all the grammar
+rules whose left-hand side is the start symbol and placing then in
+slot $0$ of the chart.  For example, given the grammar in
+figure~\ref{fig:Lint-lark-grammar}, we would place
+\begin{lstlisting}
+  lang_int: . stmt_list
+\end{lstlisting}
+in slot $0$ of the chart. The algorithm then proceeds to its
+\emph{prediction} phase in which it adds more dotted rules to the
+chart based on which nonterminals come after a period. In the rule above,
+the nonterminal \code{stmt\_list} appears after a period, so we add all
+the rules for \code{stmt\_list} to slot $0$, with a period at the
+beginning of their right-hand sides, as follows:
+\begin{lstlisting}
+stmt_list:  . 
+stmt_list:  .  stmt  NEWLINE  stmt_list
+\end{lstlisting}
+The prediction phase continues to add dotted rules as more
+opportunities arise. For example, the \code{stmt} nonterminal now
+appears after a period, so we add all the rules for \code{stmt}.
+\begin{lstlisting}
+stmt:  .  "print" "("  exp ")"
+stmt:  .  exp
+\end{lstlisting}
+To finish the prediction phase, we add the grammar rules for
+\code{exp} and \code{exp\_hi}.
+\begin{lstlisting}[escapechar=$]
+exp: . exp "+" exp_hi
+exp: . exp "-" exp_hi
+exp: . exp_hi
+exp_hi: . INT
+exp_hi: . "input_int" "(" ")"
+exp_hi: . "-" exp_hi
+exp_hi: . "(" exp ")"
+\end{lstlisting}
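+
+The prediction phase amounts to a straightforward fixed-point
+computation over a chart slot. The following sketch (again, not
+Lark's implementation) uses the \code{DottedRule} representation
+sketched earlier and assumes the grammar is given as a dictionary
+that maps each nonterminal to a list of its right-hand sides.
+\begin{lstlisting}
+def predict(slot, grammar):
+    # slot: a set of DottedRule; grammar: dict mapping a nonterminal
+    # to a list of right-hand sides (tuples of symbols).
+    changed = True
+    while changed:
+        changed = False
+        for rule in list(slot):
+            sym = rule.next_symbol()
+            if sym in grammar:  # the period is in front of a nonterminal
+                for rhs in grammar[sym]:
+                    new_rule = DottedRule(sym, rhs, 0)
+                    if new_rule not in slot:
+                        slot.add(new_rule)
+                        changed = True
+\end{lstlisting}
+Starting from the dotted rule for \code{lang\_int} in slot $0$, this
+loop would add the dotted rules for \code{stmt\_list}, \code{stmt},
+\code{exp}, and \code{exp\_hi} shown above.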
 
 
 
 
 \section{The LALR Algorithm}
-
+\label{sec:lalr}
 
 
 \section{Further Reading}
 
 
 UNDER CONSTRUCTION
 
 
+finite automata
+
+