Browse source code

start of parsing chapter

Jeremy Siek, 2 years ago
parent
commit
4daf39fd85
1 file changed, with 99 additions and 4 deletions

book.tex (+99 −4)

@@ -497,13 +497,14 @@ perform.\index{subject}{concrete syntax}\index{subject}{abstract
   syntax}\index{subject}{abstract syntax
   tree}\index{subject}{AST}\index{subject}{program}\index{subject}{parse}
 The process of translating from concrete syntax to abstract syntax is
-called \emph{parsing}~\citep{Aho:2006wb}. This book does not cover the
-theory and implementation of parsing.
+called \emph{parsing}~\citep{Aho:2006wb}\python{ and is studied in
+  chapter~\ref{ch:parsing-Lvar}}.
+\racket{This book does not cover the theory and implementation of parsing.}%
 %
 \racket{A parser is provided in the support code for translating from
-  concrete to abstract syntax.}
+  concrete to abstract syntax.}%
 %
-\python{We use Python's \code{ast} module to translate from concrete
+\python{For now we use Python's \code{ast} module to translate from concrete
   to abstract syntax.}
 
 ASTs can be represented inside the compiler in many different ways,
@@ -4074,6 +4075,100 @@ make sure that your compiler still passes all the tests.  After
 all, fast code is useless if it produces incorrect results!
 \end{exercise}
 
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\chapter{Parsing}
+\label{ch:parsing-Lvar}
+\setcounter{footnote}{0}
+
+\index{subject}{parsing}
+
+
+In this chapter we learn how to use the Lark parser generator to
+translate the concrete syntax of \LangVar{} (a sequence of characters)
+into an abstract syntax tree.  A parser generator takes in a
+specification of the concrete syntax and produces a parser. Even
+though a parser generator does most of the work for us, using one
+properly requires considerable knowledge about parsing algorithms.  In
+particular, we must learn about the specification languages used by
+parser generators and we must learn how to deal with ambiguity in our
+language specifications.
+
+The process of parsing is traditionally subdivided into two phases:
+\emph{lexical analysis} (often called scanning) and
+\emph{parsing}. The lexical analysis phase translates the sequence of
+characters into a sequence of \emph{tokens}, that is, words consisting
+of several characters. The parsing phase organizes the tokens into a
+\emph{parse tree} that captures how the tokens were matched by rules
+in the grammar of the language. The reason for the subdivision into
+two phases is to enable the use of a faster but less powerful
+algorithm for lexical analysis and the use of a slower but more
+powerful algorithm for parsing.
+%
+Likewise, parser generators typically come in pairs, with separate
+generators for the lexical analyzer and for the parser.  A particularly
+influential pair of generators were \texttt{lex} and
+\texttt{yacc}. The \texttt{lex} generator was written by Eric Schmidt
+and Mike Lesk~\cite{Lesk:1975uq} at Bell Labs. The \texttt{yacc}
+generator was written by Stephen C. Johnson at
+AT\&T~\cite{Johnson:1979qy} and stands for Yet Another Compiler
+Compiler.
+
+The Lark parser generator that we use in this chapter includes both a
+lexical analyzer and a parser. The next section discusses lexical
+analysis and the remainder of the chapter discusses parsing.
+
+
+\section{Lexical analysis}
+\label{sec:lex}
+
+The lexical analyzers produced by Lark turn a sequence of characters
+(a string) into a sequence of token objects. For example, the string
+\begin{lstlisting}
+'print(1 + 3)'
+\end{lstlisting}
+\noindent could be converted into the following sequence of token objects
+\begin{lstlisting}
+Token('PRINT', 'print')
+Token('LPAR', '(')
+Token('INT', '1')
+Token('PLUS', '+')
+Token('INT', '3')
+Token('RPAR', ')')
+Token('NEWLINE', '\n')
+\end{lstlisting}
+where each token includes a field for its \code{type}, such as \code{'INT'},
+and for its \code{value}, such as \code{'1'}.
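The book relies on Lark's generated lexer to produce these tokens, but the same idea can be sketched directly with Python's built-in \code{re} module. The following is an illustrative sketch, not Lark's API: the \code{TOKEN\_SPEC} table and the \code{lex} function are hypothetical names, and the token pairs stand in for Lark's \code{Token} objects.

```python
import re

# Hypothetical token specification: (type, regex) pairs, tried in order.
# A lexer generator derives something similar from the grammar's
# terminal definitions.
TOKEN_SPEC = [
    ('PRINT',   r'print'),
    ('LPAR',    r'\('),
    ('RPAR',    r'\)'),
    ('PLUS',    r'\+'),
    ('INT',     r'[0-9]+'),
    ('NEWLINE', r'\n'),
    ('SKIP',    r'[ \t]+'),   # whitespace: matched but discarded
]
# Combine the patterns into one regex with a named group per token type.
MASTER = re.compile('|'.join(f'(?P<{t}>{p})' for t, p in TOKEN_SPEC))

def lex(text):
    """Turn a string into a list of (type, value) pairs."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != 'SKIP':
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(lex('print(1 + 3)\n'))
```

Running \code{lex} on \code{'print(1 + 3)\textbackslash{}n'} yields the same sequence of token types and values as the Lark output shown above.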
+
+Following in the tradition of \code{lex}, the Lark generator requires
+a specification of which words should be categorized as which types of
+tokens using \emph{regular expressions}.  The term ``regular''
+comes from ``regular languages'', which are the (particularly simple)
+set of languages that can be recognized by a finite automaton. A
+\emph{regular expression} is a pattern formed of the following core
+elements:\index{subject}{regular expression}
+
+\begin{enumerate}
+\item a single character, e.g. \texttt{a}. The only string that matches this
+  regular expression is \texttt{a}.
+\item two regular expressions, one followed by the other
+  (concatenation), e.g. \texttt{bc}.  The only string that matches
+  this regular expression is \texttt{bc}.
+\item one regular expression or another (alternation), e.g.
+  \texttt{a|bc}.  Both the strings \texttt{'a'} and \texttt{'bc'}
+  match this pattern.
+\item a regular expression repeated zero or more times (Kleene
+  closure), e.g. \texttt{(a|bc)*}.  The string \texttt{'bcabcbc'}
+  would match this pattern, but not \texttt{'bccba'}.
+\item the empty sequence, which matches only the empty string.
+\end{enumerate}
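The examples in the list above can be checked with Python's \code{re} module, whose pattern language is a superset of these core elements. Here \code{re.fullmatch} requires the pattern to match the entire string:

```python
import re

# Check each example from the list of core regular-expression elements.
assert re.fullmatch(r'a', 'a')                # single character
assert re.fullmatch(r'bc', 'bc')              # concatenation
assert re.fullmatch(r'a|bc', 'a')             # alternation matches 'a'
assert re.fullmatch(r'a|bc', 'bc')            # ... and 'bc'
assert re.fullmatch(r'(a|bc)*', 'bcabcbc')    # Kleene closure
assert re.fullmatch(r'(a|bc)*', '')           # ... including the empty string
assert not re.fullmatch(r'(a|bc)*', 'bccba')  # 'bccba' does not match
print('all patterns behave as described')
```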
+
+
+
+
+
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \chapter{Register Allocation}
 \label{ch:register-allocation-Lvar}