
revised 1.1, trees, grammars, and s-expr

Jeremy Siek, 9 years ago
parent
commit
5472fddffe
2 files changed with 921 additions and 161 deletions
  1. 706 65
      all.bib
  2. 215 96
      book.tex

File diff suppressed because it is too large
+ 706 - 65
all.bib


+ 215 - 96
book.tex

@@ -14,6 +14,9 @@
 \usepackage{xypic}
 \usepackage{semantic}
 
 
+% Computer Modern is already the default. -Jeremy
+%\renewcommand{\ttdefault}{cmtt}
+
 \lstset{%
 language=Lisp,
 basicstyle=\ttfamily\small,
@@ -147,90 +150,200 @@ Need to give thanks to
 \label{ch:trees-recur}
 
 In this chapter, we review the basic tools that are needed for
-implementing a compiler. We use abstract syntax trees (ASTs) to
-represent programs (Section~\ref{sec:ast}) and pattern matching to
-inspect an AST node (Section~\ref{sec:pattern-matching}).  We use
-recursion to construct and deconstruct entire ASTs
-(Section~\ref{sec:recursion}).
+implementing a compiler. We use abstract syntax trees (ASTs) in the
+form of S-expressions to represent programs (Section~\ref{sec:ast})
+and pattern matching to inspect an AST node
+(Section~\ref{sec:pattern-matching}).  We use recursion to construct
+and deconstruct entire ASTs (Section~\ref{sec:recursion}).
 
 
-\section{Abstract Syntax Trees and Grammars}
+\section{Trees, Grammars, and S-Expressions}
 \label{sec:ast}
 
 
-In programming language theory (PLT), abstract syntax trees (AST) are
-used to structurally model the syntax of a program. As an example, we
-first provide the Backus-Naur Form (BNF), or grammar, of a simple
-arithmetic language, {\tt Arith}.
-
-\begin{figure}[htbp]
-\centering
-\fbox{
-\begin{minipage}{0.85\textwidth}
-\[
-\begin{array}{lcl}
-  \Op    &::=& \key{+} \mid \key{-} \mid \key{*} \\
-  \itm{Arith} &::=& \itm{Integer} \mid (\Op \; \itm{Arith} \; \itm{Arith}) \mid (\Op \; \itm{Arith}) 
-\end{array}
-\]
+The primary data structure that is commonly used for representing
+programs is the \emph{abstract syntax tree} (AST). When considering
+some part of a program, a compiler needs to ask what kind of part it
+is and what sub-parts it has. For example, the program on the left is
+represented by the AST on the right.
+\begin{center}
+\begin{minipage}{0.4\textwidth}
+\begin{lstlisting}
+(+ 50 (- 8))
+\end{lstlisting}
+\end{minipage}
+\begin{minipage}{0.4\textwidth}
+\begin{equation}
+\xymatrix@=15pt{
+    & *+[F]{+} \ar[dl]\ar[dr]& \\
+*+[F]{\tt 50}  &   & *+[F]{-} \ar[d] \\
+    &   & *+[F]{\tt 8} 
+} \label{eq:arith-prog}
+\end{equation}
+\end{minipage}
+\end{center}
+When deciding how to compile this program, we need to know that the
+top-most part is an addition and that it has two sub-parts, the
+integer \texttt{50} and the negation of \texttt{8}. The abstract
+syntax tree data structure directly supports these queries and hence
+is a good choice. In this book, we will often write down the textual
+representation of a program even when we really have in mind the AST,
+simply because the textual representation is easier to typeset.  We
+recommend that, in your mind, you should always interpret programs as
+abstract syntax trees.
+
+A programming language can be thought of as a \emph{set} of programs.
+The set is typically infinite (one can always create larger and larger
+programs), so one cannot simply describe a language by listing all of
+the programs in the language. Instead we write down a set of rules, a
+\emph{grammar}, for building programs. We shall write our rules in a
+variant of Backus-Naur Form (BNF)~\citep{Backus:1960aa,Knuth:1964aa}.
+As an example, we describe a small language, named $\itm{arith}$, of
+integers and arithmetic operations. The first rule says that any
+integer is in the language:
+\begin{equation}
+\itm{arith} ::= \Int  \label{eq:arith-int}
+\end{equation}
+Each rule has a left-hand-side and a right-hand-side. The way to read
+a rule is that if you have all the program parts on the
+right-hand-side, then you can create an AST node and categorize it
+according to the left-hand-side. (We do not define $\Int$ because the
+reader already knows what an integer is.)
+
+The second rule says that, given an $\itm{arith}$, you can build
+another $\itm{arith}$ by negating it.
+\begin{equation}
+  \itm{arith} ::= (\key{-} \; \itm{arith})  \label{eq:arith-neg}
+\end{equation}
+By rule \eqref{eq:arith-int}, \texttt{8} is an $\itm{arith}$, so by
+rule \eqref{eq:arith-neg}, the following AST is also an $\itm{arith}$.
+\begin{center}
+\begin{minipage}{0.25\textwidth}
+\begin{lstlisting}
+(- 8)
+\end{lstlisting}
 \end{minipage}
+\begin{minipage}{0.25\textwidth}
+\begin{equation}
+\xymatrix@=15pt{
+ *+[F]{-} \ar[d] \\
+ *+[F]{\tt 8} 
 }
-\caption{The syntax of the {\tt Arith} language.}
-\label{fig:arith-syntax}
-\end{figure}
-
-From this grammar, we have defined {\tt Arith} by constraining its syntax.
-Effectively, we have defined {\tt Arith} by first defining what a legal 
-expression (or program) within the language is. To clarify further, we can 
-think of {\tt Arith} as a \textit{set} of expressions, where, under syntax
-constraints, \mbox{{\tt (+ 1 1)}} and {\tt -1} are inhabitants and {\tt (+ 3.2 3)}
-and {\tt (++ 2 2)} are not (see ~Figure\ref{fig:ast}).
-
-The relationship between a grammar and an AST is then similar to that of a set
-and an inhabitant. From this, every syntaxically valid expression, under the 
-constraints of a grammar, can be represented by an abstract syntax tree. This
-is because {\tt Arith} is essentially a specification of a Tree-like 
-data-structure. In this case, tree nodes are the arithmetic operators {\tt +} and
-{\tt -}, and the leaves are  integer constants. From this, we can represent any
-expression of {\tt Arith} using a \textit{syntax expression} (s-exp).
-
-\begin{figure}[htbp]
-\centering
-\fbox{
-\begin{minipage}{0.85\textwidth}
+\label{eq:arith-neg8}
+\end{equation}
+\end{minipage}
+\end{center}
+
+The third and last rule for the $\itm{arith}$ language is for addition:
+\begin{equation}
+  \itm{arith} ::= (\key{+} \; \itm{arith} \; \itm{arith}) \label{eq:arith-add}
+\end{equation}
+Now we can see that the AST \eqref{eq:arith-prog} is in $\itm{arith}$.
+We know that \lstinline{50} is in $\itm{arith}$ by rule
+\eqref{eq:arith-int} and we have shown that \texttt{(- 8)} is in
+$\itm{arith}$, so we can apply rule \eqref{eq:arith-add} to show that
+\texttt{(+ 50 (- 8))} is in the $\itm{arith}$ language.
+
+If you have an AST for which the above three rules do not apply, then
+the AST is not in $\itm{arith}$. For example, the AST \texttt{(- 50
+  (+ 8))} is not in $\itm{arith}$ because there are no rules for $+$
+with only one argument, nor for $-$ with two arguments.  Whenever we
+define a language through a grammar, we implicitly mean for the
+language to be the smallest set of programs that are justified by the
+rules. That is, the language only includes those programs that the
+rules allow.
+
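The ``smallest set'' reading can also be written in symbols. The
following is a sketch in notation introduced here for illustration
(the set name $A$ and the metavariables $e$, $e_1$, $e_2$ are
assumptions, not taken from the text): $\itm{arith}$ is the smallest
set $A$ that satisfies all three closure conditions.
\[
\begin{array}{l}
  \Int \subseteq A \\
  e \in A \;\Rightarrow\; (\key{-} \; e) \in A \\
  e_1 \in A \mbox{ and } e_2 \in A \;\Rightarrow\; (\key{+} \; e_1 \; e_2) \in A
\end{array}
\]
Each condition corresponds to one of the rules \eqref{eq:arith-int},
\eqref{eq:arith-neg}, and \eqref{eq:arith-add}. None of them produces
\texttt{(- 50 (+ 8))}, so it lies outside the smallest such set.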
+It is common to have many rules with the same left-hand side, so the
+following vertical bar notation is used to gather several rules on one
+line.  We refer to each clause separated by a vertical bar as an
+``alternative''.
 \[
-\begin{array}{lcl}
-  exp  &::=& sexp \mid (sexp*) \mid (unquote \; sexp)  \\
-  sexp &::=& Val \mid Var \mid (quote \; exp) \mid (quasiquote \; exp)
-\end{array}
+\itm{arith} ::= \Int \mid (\key{-} \; \itm{arith}) \mid
+   (\key{+} \; \itm{arith} \; \itm{arith})
 \]
-\end{minipage}
-}
-\caption{\textit{s-exp} syntax: $Val$ and $Var$ are shorthand for Value and Variable.}
-\label{fig:sexp-syntax}
-\end{figure}
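Each alternative of the grammar corresponds to one case of a recursive
membership test. The following is a minimal sketch, assuming Racket's
\texttt{match} form; the function name \texttt{arith?} is hypothetical
and does not come from the text.
\begin{lstlisting}
#lang racket
;; arith? is a hypothetical name. It returns #t exactly when e
;; can be built using the three arith rules.
(define (arith? e)
  (match e
    [(? exact-integer?) #t]                       ; arith ::= Int
    [`(- ,e1) (arith? e1)]                        ; arith ::= (- arith)
    [`(+ ,e1 ,e2) (and (arith? e1) (arith? e2))]  ; arith ::= (+ arith arith)
    [_ #f]))                                      ; no rule applies

(arith? '(+ 50 (- 8)))  ; => #t
(arith? '(- 50 (+ 8)))  ; => #f
\end{lstlisting}
The final clause plays the role of the ``smallest set'' convention:
anything that no rule justifies is rejected.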
 
 
-For our purposes, we will treat s-exps equivalent to \textit{possibly
-deeply-nested lists}. For the sake of brevity, the symbols $single$ $quote$ ('),
-$backquote$ (`), and $comma$ (,) are reader sugar for {\tt quote}, 
-{\tt quasiquote}, and {\tt unquote}. We provide several examples of s-exps and
-functions that return s-exps below. We use the {\tt >} symbol to represent 
-interaction with a Racket REPL.
-\begin{verbatim}
-(define 1plus1 `(1 + 1))
-(define (1plusX x) `(1 + ,x))
-(define (XplusY x y) `(,x + ,y))
-
-> 1plus1
-'(1 + 1)
-> (1plusX 1)
-'(1 + 1)
-> (XplusY 1 1)
-'(1 + 1)
-> `,1plus1
-'(1 + 1)
-\end{verbatim}
-In any expression wrapped with {\tt quasiquote} ({\tt `}), sub-expressions
-wrapped with an {\tt unquote} expression are evaluated before the entire 
-expression is returned wrapped in a {\tt quote} expression.
+Racket, as a descendant of Lisp~\citep{McCarthy:1960dz}, has
+particularly convenient support for creating and manipulating abstract
+syntax trees with its \emph{symbolic expression} feature, or
+S-expression for short. We can create an S-expression simply by
+writing a backquote followed by the textual representation of the
+AST. For example, an S-expression to represent the AST
+\eqref{eq:arith-prog} is created by the following Racket expression:
+\begin{center}
+\texttt{`(+ 50 (- 8))}
+\end{center}
+
+To build larger S-expressions one often needs to splice together
+several smaller S-expressions. Racket provides the comma operator to
+splice an S-expression into a larger one. For example, instead of
+creating the S-expression for AST \eqref{eq:arith-prog} all at once,
+we could have first created an S-expression for AST
+\eqref{eq:arith-neg8} and then spliced that into the addition
+S-expression.
+\begin{lstlisting}
+(define ast1.4 `(- 8))
+(define ast1.1 `(+ 50 ,ast1.4))
+\end{lstlisting}
+In general, the Racket expression that follows the comma (splice)
+can be any expression that computes an S-expression.
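As a small illustration of that last point, the expression after the
comma may be a function call rather than a variable. This is only a
sketch; the helper \texttt{make-neg} and the variable names are
hypothetical and do not appear in the text.
\begin{lstlisting}
#lang racket
;; make-neg is a hypothetical helper that wraps an S-expression
;; in a negation node.
(define (make-neg e) `(- ,e))

;; The call after the comma is evaluated and its result is placed
;; into the enclosing S-expression.
(define sum-ast `(+ 50 ,(make-neg 8)))

sum-ast  ; => '(+ 50 (- 8))
\end{lstlisting}
Without the comma, \texttt{`(+ 50 (make-neg 8))} would contain the
unevaluated list \texttt{(make-neg 8)} instead of \texttt{(- 8)}.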
+
+
+
+
+%% From this grammar, we have defined {\tt arith} by constraining its
+%% syntax.  Effectively, we have defined {\tt arith} by first defining
+%% what a legal expression (or program) within the language is. To
+%% clarify further, we can think of {\tt arith} as a \textit{set} of
+%% expressions, where, under syntax constraints, \mbox{{\tt (+ 1 1)}} and
+%% {\tt -1} are inhabitants and {\tt (+ 3.2 3)} and {\tt (++ 2 2)} are
+%% not (see ~Figure\ref{fig:ast}).
+
+%% The relationship between a grammar and an AST is then similar to that
+%% of a set and an inhabitant. From this, every syntaxically valid
+%% expression, under the constraints of a grammar, can be represented by
+%% an abstract syntax tree. This is because {\tt arith} is essentially a
+%% specification of a Tree-like data-structure. In this case, tree nodes
+%% are the arithmetic operators {\tt +} and {\tt -}, and the leaves are
+%% integer constants. From this, we can represent any expression of {\tt
+%%   arith} using a \textit{syntax expression} (s-exp).
+
+%% \begin{figure}[htbp]
+%% \centering
+%% \fbox{
+%% \begin{minipage}{0.85\textwidth}
+%% \[
+%% \begin{array}{lcl}
+%%   exp  &::=& sexp \mid (sexp*) \mid (unquote \; sexp)  \\
+%%   sexp &::=& Val \mid Var \mid (quote \; exp) \mid (quasiquote \; exp)
+%% \end{array}
+%% \]
+%% \end{minipage}
+%% }
+%% \caption{\textit{s-exp} syntax: $Val$ and $Var$ are shorthand for Value and Variable.}
+%% \label{fig:sexp-syntax}
+%% \end{figure}
+
+%% For our purposes, we will treat s-exps equivalent to \textit{possibly
+%%   deeply-nested lists}. For the sake of brevity, the symbols $single$
+%% $quote$ ('), $backquote$ (`), and $comma$ (,) are reader sugar for
+%% {\tt quote}, {\tt quasiquote}, and {\tt unquote}. We provide several
+%% examples of s-exps and functions that return s-exps below. We use the
+%% {\tt >} symbol to represent interaction with a Racket REPL.
+%% \begin{verbatim}
+%% (define 1plus1 `(1 + 1))
+%% (define (1plusX x) `(1 + ,x))
+%% (define (XplusY x y) `(,x + ,y))
+
+%% > 1plus1
+%% '(1 + 1)
+%% > (1plusX 1)
+%% '(1 + 1)
+%% > (XplusY 1 1)
+%% '(1 + 1)
+%% > `,1plus1
+%% '(1 + 1)
+%% \end{verbatim}
+%% In any expression wrapped with {\tt quasiquote} ({\tt `}), sub-expressions
+%% wrapped with an {\tt unquote} expression are evaluated before the entire 
+%% expression is returned wrapped in a {\tt quote} expression.
 
 
 % \marginpar{\scriptsize Introduce s-expressions, quote, and quasi-quote, and comma in
 %   this section. Make sure to include examples of ASTs. The description
@@ -254,13 +367,14 @@ expression is returned wrapped in a {\tt quote} expression.
 % \end{enumerate}
 
 For our purposes, our compiler will take a Scheme-like expression and
-transform it to X86\_64 Assembly. Along the way, we transform each input
-expression into a handful of  \textit{intermediary languages} (IL). 
-A key tool for transforming one language into another is \textit{pattern matching}. 
-
-Racket provides a built-in pattern-matcher, {\tt match}, that we can use
-to perform operations on s-exps. As a preliminary example, we include a 
-familiar definition of factorial, first without using match.
+transform it to X86\_64 Assembly. Along the way, we transform each
+input expression into a handful of \textit{intermediary languages}
+(IL).  A key tool for transforming one language into another is
+\textit{pattern matching}.
+
+Racket provides a built-in pattern-matcher, {\tt match}, that we can
+use to perform operations on s-exps. As a preliminary example, we
+include a familiar definition of factorial, first without using match.
 \begin{verbatim}
 (define (! n)
   (if (zero? n) 1
@@ -287,7 +401,7 @@ comprised of \textit{left-hand side} (LHS) and \textit{right-hand side} (RHS)
 sub-expressions. LHS sub-expressions can be thought of as an expression
 of the grammar in Figure~\ref{fig:sexp-syntax}. To provide an example, we
 include a function that takes an arbitrary expression, {\tt exp} and
-determines whether or not {\tt exp} \(\in\) {\tt Arith}.
+determines whether or not {\tt exp} \(\in\) {\tt arith}.
 \begin{verbatim}
 (define (arith-foo exp)
   (match exp
@@ -295,12 +409,12 @@ determines whether or not {\tt exp} \(\in\) {\tt Arith}.
     (`(,e1 ,op ,e2) #:when (memv op '(+ -)) 
      (and (arith-foo e1) (arith-foo e2)))
     (`(,op ,e) #:when (memv op '(+ -)) (arith-foo e))
-    (else (error "not an Arith expression: " arith-exp))))
+    (else (error "not an arith expression: " exp))))
 \end{verbatim}
 Here, {\tt \#:when} puts constraints on the value of matched expressions.
 In this case, we make sure that every sub-expression in \textit{op} position
 is either {\tt +} or {\tt -}. Otherwise, we return an error, signaling a
-non-{\tt Arith} expression. As we mentioned earlier, every expression 
+non-{\tt arith} expression. As we mentioned earlier, every expression 
 wrapped in an {\tt unquote} is evaluated first. When used in a LHS {\tt match}
 sub-expression, these expressions evaluate to the actual value of the matched
 expression (i.e., {\tt arith-exp}). Thus, {\tt `(,e1 ,op ,e2)} and 
@@ -340,7 +454,7 @@ ignore the {\tt read} operator.
 \caption{The syntax of the $S_0$ language. The abbreviation \Op{} is
   short for operator, \Exp{} is short for expression, \Int{} for integer,
   and \Var{} for variable.}
-\label{fig:s0-syntax}
+%\label{fig:s0-syntax}
 \end{figure}
 \begin{verbatim}
 
 
@@ -368,7 +482,7 @@ reader a feeling for the scale of this first compiler, the instructor
 solution for the $S_0$ compiler consists of 6 recursive functions and
 a few small helper functions that together span 256 lines of code.
 
 
-\begin{figure}[htbp]
+\begin{figure}[btp]
 \centering
 \fbox{
 \begin{minipage}{0.85\textwidth}
@@ -633,6 +747,7 @@ into the text representation for x86 (Figure~\ref{fig:x86-a}).
 \begin{figure}[tbp]
 \fbox{
 \begin{minipage}{0.96\textwidth}
+\vspace{-10pt}
 \[
 \begin{array}{lcl}
 \Arg &::=&  \INT{\Int} \mid \REG{\itm{register}}
@@ -681,7 +796,7 @@ differences.
 
 
 We ease the challenge of compiling from $S_0$ to x86 by breaking down
 the problem into several steps, dealing with the above differences one
-at a time. The main question then becomes: in what order to we tackle
+at a time. The main question then becomes: in what order do we tackle
 these differences? This is often one of the most challenging questions
 that a compiler writer must answer because some orderings may be much
 more difficult to implement than others. It is difficult to know ahead
@@ -698,12 +813,12 @@ locations. Thus, it makes sense to deal with \#2 before \#3 so that
 consider where \#1 should fit in. Because it has to do with the format
 of x86 instructions, it makes more sense after we have flattened the
 nested expressions (\#2). Finally, when should we deal with \#4
-(variable overshadowing)?  We shall be solving this problem by
-renaming variables to make sure they have unique names. Recall that
-our plan for \#2 involves moving nested expressions, which could be
-problematic if it changes the shadowing of variables. However, if we
-deal with \#4 first, then it will not be an issue.  Thus, we arrive at
-the following ordering.
+(variable overshadowing)?  We shall solve this problem by renaming
+variables to make sure they have unique names. Recall that our plan
+for \#2 involves moving nested expressions, which could be problematic
+if it changes the shadowing of variables. However, if we deal with \#4
+first, then it will not be an issue.  Thus, we arrive at the following
+ordering.
 \[
 \xymatrix{
 4 \ar[r] & 2 \ar[r] & 1 \ar[r] & 3
@@ -733,7 +848,9 @@ and there is a \key{return} construct to specify the return value of
 the program. A program consists of a sequence of statements that
 include at least one \key{return} statement.
 
 
-\begin{figure}[htbp]
+\begin{figure}[tbp]
+\fbox{
+\begin{minipage}{0.96\textwidth}
 \[
 \begin{array}{lcl}
 \Arg &::=& \Int \mid \Var \\
@@ -742,6 +859,8 @@ include at least one \key{return} statement.
 \Prog & ::= & (\key{program}\;\itm{info}\;\Stmt^{+})
 \end{array}
 \]
+\end{minipage}
+}
 \caption{The $C_0$ intermediate language.}
 \label{fig:c0-syntax}
 \end{figure}

Some files were not shown because too many files changed in this diff