9 years ago · 5472fddffe
--- a/all.bib
+++ b/all.bib
--- a/book.tex
+++ b/book.tex
@@ -14,6 +14,9 @@
 
															 \usepackage{xypic}
														
 
															 \usepackage{semantic}
														
 
															+% Computer Modern is already the default. -Jeremy
														
 
															+%\renewcommand{\ttdefault}{cmtt}
														
 
															+
														
 
															 \lstset{%
														
 
															 language=Lisp,
														
 
															 basicstyle=\ttfamily\small,
														
@@ -147,90 +150,200 @@ Need to give thanks to
 
															 \label{ch:trees-recur}
														
 
															 In this chapter, we review the basic tools that are needed for
														
 
															-implementing a compiler. We use abstract syntax trees (ASTs) to
														
 
															-represent programs (Section~\ref{sec:ast}) and pattern matching to
														
 
															-inspect an AST node (Section~\ref{sec:pattern-matching}).  We use
														
 
															-recursion to construct and deconstruct entire ASTs
														
 
															-(Section~\ref{sec:recursion}).
														
 
															+implementing a compiler. We use abstract syntax trees (ASTs) in the
														
 
															+form of S-expressions to represent programs (Section~\ref{sec:ast})
														
 
															+and pattern matching to inspect an AST node
														
 
															+(Section~\ref{sec:pattern-matching}).  We use recursion to construct
														
 
															+and deconstruct entire ASTs (Section~\ref{sec:recursion}).
														
 
															-\section{Abstract Syntax Trees and Grammars}
														
 
															+\section{Trees, Grammars, and S-Expressions}
														
 
															 \label{sec:ast}
														
 
															-In programming language theory (PLT), abstract syntax trees (AST) are
														
 
															-used to structurally model the syntax of a program. As an example, we
														
 
															-first provide the Backus-Naur Form (BNF), or grammar, of a simple
														
 
															-arithmetic language, {\tt Arith}.
														
 
															-
														
 
															-\begin{figure}[htbp]
														
 
															-\centering
														
 
															-\fbox{
														
 
															-\begin{minipage}{0.85\textwidth}
														
 
															-\[
														
 
															-\begin{array}{lcl}
														
 
															-  \Op    &::=& \key{+} \mid \key{-} \mid \key{*} \\
														
 
															-  \itm{Arith} &::=& \itm{Integer} \mid (\Op \; \itm{Arith} \; \itm{Arith}) \mid (\Op \; \itm{Arith}) 
														
 
															-\end{array}
														
 
															-\]
														
 
															+The primary data structure that is commonly used for representing
														
 
															+programs is the \emph{abstract syntax tree} (AST). When considering
														
 
															+some part of a program, a compiler needs to ask what kind of part it
														
 
															+is and what sub-parts it has. For example, the program on the left is
														
 
															+represented by the AST on the right.
														
 
															+\begin{center}
														
 
															+\begin{minipage}{0.4\textwidth}
														
 
															+\begin{lstlisting}
														
 
															+(+ 50 (- 8))
														
 
															+\end{lstlisting}
														
 
															+\end{minipage}
														
 
															+\begin{minipage}{0.4\textwidth}
														
 
															+\begin{equation}
														
 
															+\xymatrix@=15pt{
														
 
															+    & *+[F]{+} \ar[dl]\ar[dr]& \\
														
 
															+*+[F]{\tt 50}  &   & *+[F]{-} \ar[d] \\
														
 
															+    &   & *+[F]{\tt 8} 
														
 
															+} \label{eq:arith-prog}
														
 
															+\end{equation}
														
 
															+\end{minipage}
														
 
															+\end{center}
														
 
															+When deciding how to compile this program, we need to know that the
														
 
															+top-most part is an addition and that it has two sub-parts, the
														
 
															+integer \texttt{50} and the negation of \texttt{8}. The abstract
														
 
															+syntax tree data structure directly supports these queries and hence
														
 
															+is a good choice. In this book, we will often write down the textual
														
 
															+representation of a program even when we really have in mind the AST,
														
 
															+simply because the textual representation is easier to typeset.  We
														
 
															+recommend that, in your mind, you should always interpret programs as
														
 
															+abstract syntax trees.
														
 
															+
														
 
															+A programming language can be thought of as a \emph{set} of programs.
														
 
															+The set is typically infinite (one can always create larger and larger
														
 
															+programs), so one cannot simply describe a language by listing all of
														
 
															+the programs in the language. Instead we write down a set of rules, a
														
 
															+\emph{grammar}, for building programs. We shall write our rules in a
														
 
															+variant of Backus-Naur Form (BNF)~\citep{Backus:1960aa,Knuth:1964aa}.
														
 
															+As an example, we describe a small language, named $\itm{arith}$, of
														
 
															+integers and arithmetic operations. The first rule says that any
														
 
															+integer is in the language:
														
 
															+\begin{equation}
														
 
															+\itm{arith} ::= \Int  \label{eq:arith-int}
														
 
															+\end{equation}
														
 
															+Each rule has a left-hand-side and a right-hand-side. The way to read
														
 
															+a rule is that if you have all the program parts on the
														
 
															+right-hand-side, then you can create an AST node and categorize it
														
 
															+according to the left-hand-side. (We do not define $\Int$ because the
														
 
															+reader already knows what an integer is.)
														
 
															+
														
 
															+The second rule says that, given an $\itm{arith}$, you can build
														
 
															+another $\itm{arith}$ by negating it.
														
 
															+\begin{equation}
														
 
															+  \itm{arith} ::= (\key{-} \; \itm{arith})  \label{eq:arith-neg}
														
 
															+\end{equation}
														
 
															+By rule \eqref{eq:arith-int}, \texttt{8} is an $\itm{arith}$, so by
														
 
															+rule \eqref{eq:arith-neg}, the following AST is an $\itm{arith}$.
														
 
															+\begin{center}
														
 
															+\begin{minipage}{0.25\textwidth}
														
 
															+\begin{lstlisting}
														
 
															+(- 8)
														
 
															+\end{lstlisting}
														
 
															 \end{minipage}
														
 
															+\begin{minipage}{0.25\textwidth}
														
 
															+\begin{equation}
														
 
															+\xymatrix@=15pt{
														
 
															+ *+[F]{-} \ar[d] \\
														
 
															+ *+[F]{\tt 8} 
														
 
															 }
														
 
															-\caption{The syntax of the {\tt Arith} language.}
														
 
															-\label{fig:arith-syntax}
														
 
															-\end{figure}
														
 
															-
														
 
															-From this grammar, we have defined {\tt Arith} by constraining its syntax.
														
 
															-Effectively, we have defined {\tt Arith} by first defining what a legal 
														
 
															-expression (or program) within the language is. To clarify further, we can 
														
 
															-think of {\tt Arith} as a \textit{set} of expressions, where, under syntax
														
 
															-constraints, \mbox{{\tt (+ 1 1)}} and {\tt -1} are inhabitants and {\tt (+ 3.2 3)}
														
 
															-and {\tt (++ 2 2)} are not (see ~Figure\ref{fig:ast}).
														
 
															-
														
 
															-The relationship between a grammar and an AST is then similar to that of a set
														
 
															-and an inhabitant. From this, every syntaxically valid expression, under the 
														
 
															-constraints of a grammar, can be represented by an abstract syntax tree. This
														
 
															-is because {\tt Arith} is essentially a specification of a Tree-like 
														
 
															-data-structure. In this case, tree nodes are the arithmetic operators {\tt +} and
														
 
															-{\tt -}, and the leaves are  integer constants. From this, we can represent any
														
 
															-expression of {\tt Arith} using a \textit{syntax expression} (s-exp).
														
 
															-
														
 
															-\begin{figure}[htbp]
														
 
															-\centering
														
 
															-\fbox{
														
 
															-\begin{minipage}{0.85\textwidth}
														
 
															+\label{eq:arith-neg8}
														
 
															+\end{equation}
														
 
															+\end{minipage}
														
 
															+\end{center}
														
 
															+
														
 
															+The third and last rule for the $\itm{arith}$ language is for addition:
														
 
															+\begin{equation}
														
 
															+  \itm{arith} ::= (\key{+} \; \itm{arith} \; \itm{arith}) \label{eq:arith-add}
														
 
															+\end{equation}
														
 
															+Now we can see that the AST \eqref{eq:arith-prog} is in $\itm{arith}$.
														
 
															+We know that \lstinline{50} is in $\itm{arith}$ by rule
														
 
															+\eqref{eq:arith-int} and we have shown that \texttt{(- 8)} is in
														
 
															+$\itm{arith}$, so we can apply rule \eqref{eq:arith-add} to show that
														
 
															+\texttt{(+ 50 (- 8))} is in the $\itm{arith}$ language.
														
 
															+
														
 
															+If you have an AST for which the above three rules do not apply, then
														
 
															+the AST is not in $\itm{arith}$. For example, the AST \texttt{(- 50
														
 
															+  (+ 8))} is not in $\itm{arith}$ because there are no rules for $+$
														
 
															+with only one argument, nor for $-$ with two arguments.  Whenever we
														
 
															+define a language through a grammar, we implicitly mean for the
														
 
															+language to be the smallest set of programs that are justified by the
														
 
															+rules. That is, the language only includes those programs that the
														
 
															+rules allow.
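															+
															+The membership argument above can be mechanized. The following is a
															+minimal sketch rather than part of the grammar itself: the name
															+\texttt{arith?} is hypothetical, and we assume that $\Int$ corresponds
															+to Racket's \texttt{exact-integer?} predicate. It uses Racket's
															+\texttt{match} form with one clause per rule.
															+\begin{lstlisting}
															+#lang racket
															+;; arith? : any -> boolean   (hypothetical helper)
															+;; Returns #t exactly when e is justified by the three arith rules.
															+(define (arith? e)
															+  (match e
															+    [(? exact-integer?) #t]                       ; integer rule
															+    [`(- ,e1) (arith? e1)]                        ; negation rule
															+    [`(+ ,e1 ,e2) (and (arith? e1) (arith? e2))]  ; addition rule
															+    [_ #f]))                                      ; no rule applies
															+
															+(arith? '(+ 50 (- 8)))  ; #t
															+(arith? '(- 50 (+ 8)))  ; #f
															+\end{lstlisting}
															+The final clause reflects the ``smallest set'' reading: anything that
															+no rule justifies is rejected.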
														
 
															+
														
 
															+It is common to have many rules with the same left-hand side, so the
														
 
															+following vertical bar notation is used to gather several rules on one
														
 
															+line.  We refer to each clause separated by vertical bars as an
														
 
															+``alternative''.
														
 
															 \[
														
 
															-\begin{array}{lcl}
														
 
															-  exp  &::=& sexp \mid (sexp*) \mid (unquote \; sexp)  \\
														
 
															-  sexp &::=& Val \mid Var \mid (quote \; exp) \mid (quasiquote \; exp)
														
 
															-\end{array}
														
 
															+\itm{arith} ::= \Int \mid (\key{-} \; \itm{arith}) \mid
														
 
															+   (\key{+} \; \itm{arith} \; \itm{arith})
														
 
															 \]
														
 
															-\end{minipage}
														
 
															-}
														
 
															-\caption{\textit{s-exp} syntax: $Val$ and $Var$ are shorthand for Value and Variable.}
														
 
															-\label{fig:sexp-syntax}
														
 
															-\end{figure}
														
 
															-For our purposes, we will treat s-exps equivalent to \textit{possibly
														
 
															-deeply-nested lists}. For the sake of brevity, the symbols $single$ $quote$ ('),
														
 
															-$backquote$ (`), and $comma$ (,) are reader sugar for {\tt quote}, 
														
 
															-{\tt quasiquote}, and {\tt unquote}. We provide several examples of s-exps and
														
 
															-functions that return s-exps below. We use the {\tt >} symbol to represent 
														
 
															-interaction with a Racket REPL.
														
 
															-\begin{verbatim}
														
 
															-(define 1plus1 `(1 + 1))
														
 
															-(define (1plusX x) `(1 + ,x))
														
 
															-(define (XplusY x y) `(,x + ,y))
														
 
															-
														
 
															-> 1plus1
														
 
															-'(1 + 1)
														
 
															-> (1plusX 1)
														
 
															-'(1 + 1)
														
 
															-> (XplusY 1 1)
														
 
															-'(1 + 1)
														
 
															-> `,1plus1
														
 
															-'(1 + 1)
														
 
															-\end{verbatim}
														
 
															-In any expression wrapped with {\tt quasiquote} ({\tt `}), sub-expressions
														
 
															-wrapped with an {\tt unquote} expression are evaluated before the entire 
														
 
															-expression is returned wrapped in a {\tt quote} expression.
														
 
															+Racket, as a descendant of Lisp~\citep{McCarthy:1960dz}, has
														
 
															+particularly convenient support for creating and manipulating abstract
														
 
															+syntax trees with its \emph{symbolic expression} feature, or
														
 
															+S-expression for short. We can create an S-expression simply by
														
 
															+writing a backquote followed by the textual representation of the
														
 
															+AST. For example, an S-expression to represent the AST
														
 
															+\eqref{eq:arith-prog} is created by the following Racket expression:
														
 
															+\begin{center}
														
 
															+\texttt{`(+ 50 (- 8))}
														
 
															+\end{center}
														
 
															+
														
 
															+To build larger S-expressions one often needs to splice together
														
 
															+several smaller S-expressions. Racket provides the comma operator to
														
 
															+splice an S-expression into a larger one. For example, instead of
														
 
															+creating the S-expression for AST \eqref{eq:arith-prog} all at once,
														
 
															+we could have first created an S-expression for AST
														
 
															+\eqref{eq:arith-neg8} and then spliced that into the addition
														
 
															+S-expression.
														
 
															+\begin{lstlisting}
														
 
															+(define ast1.4 `(- 8))
														
 
															+(define ast1.1 `(+ 50 ,ast1.4))
														
 
															+\end{lstlisting}
														
 
															+In general, the Racket expression that follows the comma (splice)
														
 
															+can be any expression that computes an S-expression.
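															+For instance, the expression after the comma may be a function call.
															+The following is a minimal sketch in which \texttt{make-neg} and
															+\texttt{ast} are hypothetical names introduced only for illustration.
															+(In the Racket documentation the comma is called \texttt{unquote}.)
															+\begin{lstlisting}
															+#lang racket
															+;; make-neg : S-expression -> S-expression   (hypothetical helper)
															+;; Wraps its argument in a negation node.
															+(define (make-neg e) `(- ,e))
															+
															+;; The call (make-neg 8) is evaluated and its result is placed
															+;; in the third position of the addition node.
															+(define ast `(+ 50 ,(make-neg 8)))
															+
															+(equal? ast '(+ 50 (- 8)))  ; #t
															+\end{lstlisting}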
														
 
															+
														
 
															+
														
 
															+
														
 
															+
														
 
															+%% From this grammar, we have defined {\tt arith} by constraining its
														
 
															+%% syntax.  Effectively, we have defined {\tt arith} by first defining
														
 
															+%% what a legal expression (or program) within the language is. To
														
 
															+%% clarify further, we can think of {\tt arith} as a \textit{set} of
														
 
															+%% expressions, where, under syntax constraints, \mbox{{\tt (+ 1 1)}} and
														
 
															+%% {\tt -1} are inhabitants and {\tt (+ 3.2 3)} and {\tt (++ 2 2)} are
														
 
															+%% not (see ~Figure\ref{fig:ast}).
														
 
															+
														
 
															+%% The relationship between a grammar and an AST is then similar to that
														
 
															+%% of a set and an inhabitant. From this, every syntaxically valid
														
 
															+%% expression, under the constraints of a grammar, can be represented by
														
 
															+%% an abstract syntax tree. This is because {\tt arith} is essentially a
														
 
															+%% specification of a Tree-like data-structure. In this case, tree nodes
														
 
															+%% are the arithmetic operators {\tt +} and {\tt -}, and the leaves are
														
 
															+%% integer constants. From this, we can represent any expression of {\tt
														
 
															+%%   arith} using a \textit{syntax expression} (s-exp).
														
 
															+
														
 
															+%% \begin{figure}[htbp]
														
 
															+%% \centering
														
 
															+%% \fbox{
														
 
															+%% \begin{minipage}{0.85\textwidth}
														
 
															+%% \[
														
 
															+%% \begin{array}{lcl}
														
 
															+%%   exp  &::=& sexp \mid (sexp*) \mid (unquote \; sexp)  \\
														
 
															+%%   sexp &::=& Val \mid Var \mid (quote \; exp) \mid (quasiquote \; exp)
														
 
															+%% \end{array}
														
 
															+%% \]
														
 
															+%% \end{minipage}
														
 
															+%% }
														
 
															+%% \caption{\textit{s-exp} syntax: $Val$ and $Var$ are shorthand for Value and Variable.}
														
 
															+%% \label{fig:sexp-syntax}
														
 
															+%% \end{figure}
														
 
															+
														
 
															+%% For our purposes, we will treat s-exps equivalent to \textit{possibly
														
 
															+%%   deeply-nested lists}. For the sake of brevity, the symbols $single$
														
 
															+%% $quote$ ('), $backquote$ (`), and $comma$ (,) are reader sugar for
														
 
															+%% {\tt quote}, {\tt quasiquote}, and {\tt unquote}. We provide several
														
 
															+%% examples of s-exps and functions that return s-exps below. We use the
														
 
															+%% {\tt >} symbol to represent interaction with a Racket REPL.
														
 
															+%% \begin{verbatim}
														
 
															+%% (define 1plus1 `(1 + 1))
														
 
															+%% (define (1plusX x) `(1 + ,x))
														
 
															+%% (define (XplusY x y) `(,x + ,y))
														
 
															+
														
 
															+%% > 1plus1
														
 
															+%% '(1 + 1)
														
 
															+%% > (1plusX 1)
														
 
															+%% '(1 + 1)
														
 
															+%% > (XplusY 1 1)
														
 
															+%% '(1 + 1)
														
 
															+%% > `,1plus1
														
 
															+%% '(1 + 1)
														
 
															+%% \end{verbatim}
														
 
															+%% In any expression wrapped with {\tt quasiquote} ({\tt `}), sub-expressions
														
 
															+%% wrapped with an {\tt unquote} expression are evaluated before the entire 
														
 
															+%% expression is returned wrapped in a {\tt quote} expression.
														
 
															 % \marginpar{\scriptsize Introduce s-expressions, quote, and quasi-quote, and comma in
														
 
															 %   this section. Make sure to include examples of ASTs. The description
														
@@ -254,13 +367,14 @@ expression is returned wrapped in a {\tt quote} expression.
 
															 % \end{enumerate}
														
 
															 For our purposes, our compiler will take a Scheme-like expression and
														
 
															-transform it to X86\_64 Assembly. Along the way, we transform each input
														
 
															-expression into a handful of  \textit{intermediary languages} (IL). 
														
 
															-A key tool for transforming one language into another is \textit{pattern matching}. 
														
 
															-
														
 
															-Racket provides a built-in pattern-matcher, {\tt match}, that we can use
														
 
															-to perform operations on s-exps. As a preliminary example, we include a 
														
 
															-familiar definition of factorial, first without using match.
														
 
															+transform it to X86\_64 Assembly. Along the way, we transform each
														
 
															+input expression into a handful of \textit{intermediary languages}
														
 
															+(IL).  A key tool for transforming one language into another is
														
 
															+\textit{pattern matching}.
														
 
															+
														
 
															+Racket provides a built-in pattern-matcher, {\tt match}, that we can
														
 
															+use to perform operations on s-exps. As a preliminary example, we
														
 
															+include a familiar definition of factorial, first without using match.
														
 
															 \begin{verbatim}
														
 
															 (define (! n)
														
 
															   (if (zero? n) 1
														
@@ -287,7 +401,7 @@ comprised of \textit{left-hand side} (LHS) and \textit{right-hand side} (RHS)
 
															 sub-expressions. LHS sub-expressions can be thought of as an expression
														
 
															 of the grammar in Figure~\ref{fig:sexp-syntax}. To provide an example, we
														
 
															 include a function that takes an arbitrary expression, {\tt exp} and
														
 
															-determines whether or not {\tt exp} \(\in\) {\tt Arith}.
														
 
															+determines whether or not {\tt exp} \(\in\) {\tt arith}.
														
 
															 \begin{verbatim}
														
 
															 (define (arith-foo exp)
														
 
															   (match exp
														
@@ -295,12 +409,12 @@ determines whether or not {\tt exp} \(\in\) {\tt Arith}.
 
															     (`(,e1 ,op ,e2) #:when (memv op '(+ -)) 
														
 
															      (and (arith-foo e1) (arith-foo e2)))
														
 
															     (`(,op ,e) #:when (memv op '(+ -)) (arith-foo e))
														
 
															-    (else (error "not an Arith expression: " arith-exp))))
														
 
															+    (else (error "not an arith expression: " exp))))
														
 
															 \end{verbatim}
														
 
															 Here, {\tt \#:when} puts constraints on the value of matched expressions.
														
 
															 In this case, we make sure that every sub-expression in \textit{op} position
														
 
															 is either {\tt +} or {\tt -}. Otherwise, we return an error, signaling a
														
 
															-non-{\tt Arith} expression. As we mentioned earlier, every expression 
														
 
															+non-{\tt arith} expression. As we mentioned earlier, every expression 
														
 
															 wrapped in an {\tt unquote} is evaluated first. When used in a LHS {\tt match}
														
 
															 sub-expression, these expressions evaluate to the actual value of the matched
														
 
															 expression (i.e., {\tt arith-exp}). Thus, {\tt `(,e1 ,op ,e2)} and 
														
@@ -340,7 +454,7 @@ ignore the {\tt read} operator.
 
															 \caption{The syntax of the $S_0$ language. The abbreviation \Op{} is
														
 
															   short for operator, \Exp{} is short for expression, \Int{} for integer,
														
 
															   and \Var{} for variable.}
														
 
															-\label{fig:s0-syntax}
														
 
															+%\label{fig:s0-syntax}
														
 
															 \end{figure}
														
 
															 \begin{verbatim}
														
@@ -368,7 +482,7 @@ reader a feeling for the scale of this first compiler, the instructor
 
															 solution for the $S_0$ compiler consists of 6 recursive functions and
														
 
															 a few small helper functions that together span 256 lines of code.
														
 
															-\begin{figure}[htbp]
														
 
															+\begin{figure}[btp]
														
 
															 \centering
														
 
															 \fbox{
														
 
															 \begin{minipage}{0.85\textwidth}
														
@@ -633,6 +747,7 @@ into the text representation for x86 (Figure~\ref{fig:x86-a}).
 
															 \begin{figure}[tbp]
														
 
															 \fbox{
														
 
															 \begin{minipage}{0.96\textwidth}
														
 
															+\vspace{-10pt}
														
 
															 \[
														
 
															 \begin{array}{lcl}
														
 
															 \Arg &::=&  \INT{\Int} \mid \REG{\itm{register}}
														
@@ -681,7 +796,7 @@ differences.
 
															 We ease the challenge of compiling from $S_0$ to x86 by breaking down
														
 
															 the problem into several steps, dealing with the above differences one
														
 
															-at a time. The main question then becomes: in what order to we tackle
														
 
															+at a time. The main question then becomes: in what order do we tackle
														
 
															 these differences? This is often one of the most challenging questions
														
 
															 that a compiler writer must answer because some orderings may be much
														
 
															 more difficult to implement than others. It is difficult to know ahead
														
@@ -698,12 +813,12 @@ locations. Thus, it makes sense to deal with \#2 before \#3 so that
 
															 consider where \#1 should fit in. Because it has to do with the format
														
 
															 of x86 instructions, it makes more sense after we have flattened the
														
 
															 nested expressions (\#2). Finally, when should we deal with \#4
														
 
															-(variable overshadowing)?  We shall be solving this problem by
														
 
															-renaming variables to make sure they have unique names. Recall that
														
 
															-our plan for \#2 involves moving nested expressions, which could be
														
 
															-problematic if it changes the shadowing of variables. However, if we
														
 
															-deal with \#4 first, then it will not be an issue.  Thus, we arrive at
														
 
															-the following ordering.
														
 
															+(variable overshadowing)?  We shall solve this problem by renaming
														
 
															+variables to make sure they have unique names. Recall that our plan
														
 
															+for \#2 involves moving nested expressions, which could be problematic
														
 
															+if it changes the shadowing of variables. However, if we deal with \#4
														
 
															+first, then it will not be an issue.  Thus, we arrive at the following
														
 
															+ordering.
														
 
															 \[
														
 
															 \xymatrix{
														
 
															 4 \ar[r] & 2 \ar[r] & 1 \ar[r] & 3
														
@@ -733,7 +848,9 @@ and there is a \key{return} construct to specify the return value of
 
															 the program. A program consists of a sequence of statements that
														
 
															 include at least one \key{return} statement.
														
 
															-\begin{figure}[htbp]
														
 
															+\begin{figure}[tbp]
														
 
															+\fbox{
														
 
															+\begin{minipage}{0.96\textwidth}
														
 
															 \[
														
 
															 \begin{array}{lcl}
														
 
															 \Arg &::=& \Int \mid \Var \\
														
@@ -742,6 +859,8 @@ include at least one \key{return} statement.
 
															 \Prog & ::= & (\key{program}\;\itm{info}\;\Stmt^{+})
														
 
															 \end{array}
														
 
															 \]
														
 
															+\end{minipage}
														
 
															+}
														
 
															 \caption{The $C_0$ intermediate language.}
														
 
															 \label{fig:c0-syntax}
														
 
															 \end{figure}