Saturday, April 21, 2012

Parsing in Brief


Part I. Introduction to Parsing
A. Introduction
This is the first in a series of articles on language translation. Translation is commonly called “parsing”, and I will no doubt lapse into that usage from time to time, but it is better to reserve the word for one phase of translation.
The techniques I will discuss are used in compilers and interpreters, of course, but also in many more-common circumstances. For example, when you add a formula to a spreadsheet, the application must translate the formula into a form it can use. Many database programs let you specify formats for input fields -- these must translate your format, and then use the result to translate the input. Consider the expression search in MPW. I guarantee you there’s a translator operating there. And, of course, my own program Idealiner uses a translator when you specify topic numbers.
All the code examples in this series of articles will be MPW tools, to minimize interface considerations. I doubt if you’ll see a single Macintosh trap before the tenth article. I will focus instead on the internals of language translation.
Language translation can be divided into three stages: first, the LEXICAL ANALYZER (also called the “scanner”) divides the input text into individual symbols, pieces like identifiers, operators and constants; then the PARSER interprets the sequence of symbols according to the language’s grammar; and lastly the CODE GENERATOR responds in an appropriate fashion to the statements the parser has processed. These three phases operate simultaneously, the parser calling the lexical analyzer for input, and then calling the code generator for output.
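Before diving in, here is a toy sketch of my own (it is not part of this series’ code, and all the names are hypothetical) showing how the three stages interlock on a trivial language: sums of single digits, like “1+2+3”.

#include <stdio.h>

static char *source = "1+2+3";   /* hypothetical input string */

/* Lexical analysis: hand the parser one symbol at a time. */
static int scan(void)
{
    return *source ? *source++ : 0;
}

/* Parsing plus "code generation": recognize digit ('+' digit)*
   and respond to each completed piece by updating the sum. */
static void parse(void)
{
    int sum, c;

    c = scan();                  /* first digit */
    sum = c - '0';
    while ((c = scan()) == '+')
    {
        c = scan();              /* the digit following the '+' */
        sum += c - '0';
    }
    printf("%d\n", sum);         /* the "generated" output */
}

int main(void)
{
    parse();                     /* the parser drives the scanner */
    return 0;
}

Notice who calls whom: the parser asks the scanner for each symbol, and produces output as it recognizes the input, just as described above.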
A(1). Parsing
The first three parts in the series will focus on parsing.
The first part of the article (this very one) will introduce parsing and the parser-generator YACC (Yet Another Compiler Compiler). YACC converts a set of GRAMMAR RULES into C source for a parser that will accept that grammar. You still have to write the lexical analyzer and the code generator (although YACC provides a framework for code generation) but YACC takes care of the most tedious part of the job.
The second part will go beneath the surface and explore the inner workings of YACC. This will be somewhat technical, and may seem unnecessary, but it will provide an essential foundation for later topics. The article will cover deriving parse tables from YACC output, parsing expressions by hand (using the parse tables), and YACC’s debug output.
The next article will discuss ambiguities, associativity and error detection. An understanding of the first two topics is essential to writing correct and efficient grammars. The third covers detecting and reporting errors by type and location.
A(2). Lexical Analysis
With the fourth article, I will return to the first stage of language translation, lexical analysis. The article will begin with a discussion of state machines, then present a simple but effective table-driven analyzer.
The fifth article will be an excursion into a seemingly unrelated field: hash tables and binary trees. The idea is to develop some tools to increase the power of the lexical analyzer in the following article.
The sixth article will extend the analyzer by adding a symbol table, built with the routines from the fifth. The symbol table will give us a way to save symbols; the example program will be an improved version of the MPW Canon tool.
A(3). And a Useful Example
Now, I’m pretty sure of the preceding, because I’ve already written the articles! What follows is a forecast of what I’ll do next.
The plan is to build an MPW tool that will preprocess either Pascal or C source files and convert inline assembly code into Pascal inline routines or C direct functions, as appropriate. That is, you will be able to write assembly language instructions, and the preprocessor will convert your assembly into machine language. Your assembler routines will be declared with high-level headers, a la Lightspeed C, and you will be able to refer to routine arguments and local variables by name, rather than indexing off the stack, a real convenience. I’m going to write this tool because, damn it, I want to use it!
This will be a major project: I’ll need to parse Pascal and C well enough to find routines, and I’ll need to parse assembler completely. I’ll then need to write a more-or-less complete assembler. It could take six or eight columns.
B. Grammar Descriptions
Here’s a reference for this topic: Compilers: Principles, Techniques and Tools, by Aho, Sethi and Ullman (Addison-Wesley 1986). You may have heard of some of these guys. I will refer to it as “ASU”.
A computer language is composed of a set of symbols (the “words” of the language), and a set of grammar rules that determine how these words are formed into statements (the “sentences” of a computer language). As I said earlier, the lexical analyzer is in charge of extracting the “words” from the input; it is the job of the parser to make meaningful “sentences” from these “words”.
A bit of terminology: classes of symbols (such as “integer” or “identifier”) are called TOKENS, and specific instances of tokens (such as “252” or “sysbeep”) are called LEXEMES. The lexical analyzer will typically determine both the type and value of a symbol that it reads; the former is the token classification and the latter is the lexical value. The parser cares about tokens; the code generator cares about values.
I guarantee that I will use the word “token” to mean both token and lexeme. The meaning should be clear from the context.
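To see the two side by side in code, here is a hypothetical fragment of my own (NUM and yylval follow the YACC conventions we will meet shortly): reading the input “252”, the scanner reports the token NUM and leaves the lexeme’s value in a global.

#include <stdio.h>

#define NUM 257     /* a token class (YACC numbers its tokens from 257) */

int yylval;         /* the lexical value: the code generator's concern */

int report_number(char *lexeme)     /* e.g. report_number("252") */
{
    sscanf(lexeme, "%d", &yylval);  /* the value: 252 */
    return NUM;                     /* the type: what the parser sees */
}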
B(1). Grammar Rules
A grammar rule is called a PRODUCTION. It is a substitution rule; the single symbol of the left-hand side (often, but not by me, abbreviated LHS) produces the expansion on the right-hand side (RHS). Here’s an example:
stmt -> ID '=' expr ';'
A ‘stmt’ can be expanded to (that’s what the ‘->’ means) a series of symbols consisting of an ‘ID’, an ‘=’ sign, an ‘expr’, and a semicolon. Symbols within quotation marks are explicit, and must appear as written; other symbols are token classes.
Recursive definition is permitted. Here’s an example:
expr -> expr '+' NUM
Applying this rule to the first, we discover that:
stmt -> ID '=' expr '+' NUM ';'
stmt -> ID '=' expr '+' NUM '+' NUM ';'
And so on. It is possible to define a complete language such as Pascal or C with relatively few such productions, even though there are infinitely many legal statements in the language.
Now, everyone can make up his own conventions, of course, but I will distinguish two kinds of non-explicit symbols by showing one in all-caps and one in all-lower-case. All-caps symbols are defined by the lexical analyzer, not by the parser. Thus, they will not appear on the left-hand side of any production. These symbols are called TERMINALS, because they terminate expansion. When you expand to a terminal, you can expand no further. The lower-case symbols are, surprise!, NON-TERMINALS. They are defined by the parser rather than the lexical analyzer, appear somewhere as the left-hand side of one or more productions, and do not terminate expansion. These distinctions are important!
As the above examples suggest, several productions can share the same left-hand side:
expr -> expr '+' NUM
expr -> NUM
This pair of productions expands to arbitrary sums. Just start with the first production and substitute the first production into itself to add another term, or substitute the second to terminate the expansion.
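For example, a three-term sum unfolds like this:

expr
-> expr '+' NUM            (first production)
-> expr '+' NUM '+' NUM    (first production)
-> NUM '+' NUM '+' NUM     (second production)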
B(2). An Example Grammar
This will be fun. If you don’t want to mess with abstract symbols, just skip this whole section; the result is all you need. My development of the grammar won’t be very theoretical, though.
A parser reads the input symbols from left to right, until it has read the right-hand side of a production. Then it REDUCES the input, running the production backwards and replacing the right-hand side with the left-hand side. This is the point at which code is generated, and expressions evaluated: when a reduction occurs.
So to see if a grammar does what we want, we can start with a test statement that should be accepted by the grammar, and see what happens when we play parser.
The grammar we are shooting for will accept simple algebraic expressions, involving addition, subtraction, multiplication and division of numbers.
B(2)(a). First Try
Earlier, we saw productions that can expand to arbitrary sums:
expr -> expr '+' NUM
expr -> NUM
We can add a few more to get the other three usual operators:
/* 1 */

(1) expr -> expr '+' NUM
(2) expr -> expr '-' NUM
(3) expr -> expr '*' NUM
(4) expr -> expr '/' NUM
(5) expr -> NUM
Let’s try a simple test:
NUM '+' NUM '*' NUM
-> expr '+' NUM '*' NUM    (rule 5; remember that we read from the left)
-> expr '*' NUM            (rule 1)
-> expr                    (rule 3)
The addition was the first thing to go; therefore, it was performed first. This is contrary to our expectations, I hope (my past as a math teacher shows itself!). The grammar won’t work. We need to make sure that multiplication and division are performed before addition and subtraction.
B(2)(b). Second Try
So let’s introduce another non-terminal. We want to cluster products together, to ensure that they are evaluated first, so let’s make them a separate non-terminal:
/* 2 */

(1) expr -> expr '+' term
(2) expr -> expr '-' term
(3) expr -> term
(4) term -> term '*' NUM
(5) term -> term '/' NUM
(6) term -> NUM
Now try the test string again:
NUM '+' NUM '*' NUM
-> term '+' NUM '*' NUM    (rule 6)
-> expr '+' NUM '*' NUM    (rule 3)
-> expr '+' term '*' NUM   (rule 6)
Oops, it looks like we have a choice here: rule 1 or rule 4. We really don’t, though; reducing by rule 1 would leave us with “expr ‘*’ NUM”, and we don’t have any rule with “expr ‘*’” in it.
-> expr '+' term           (rule 4)
-> expr                    (rule 1)
So the multiplication happened first, just like we wanted.
B(2)(c). Third Try
Suppose, however, that we wanted the addition to occur first. Then we need to add some parentheses, and one more non-terminal to carry them:
/* 3 */

(1) expr -> expr '+' term
(2) expr -> expr '-' term
(3) expr -> term
(4) term -> term '*' factor
(5) term -> term '/' factor
(6) term -> factor
(7) factor -> '(' expr ')'
(8) factor -> NUM
Now a number or a parenthesized expression is a ‘factor’ (rules 7 and 8), and products are built from factors (rules 4 and 5). An addition or subtraction that occurs within parentheses is reduced to a ‘factor’ before it takes part in any product, so the sum or difference is evaluated first. And that should do it.
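To check, play parser as before on a product with a parenthesized sum as its right operand:

NUM '*' '(' NUM '+' NUM ')'
-> factor '*' '(' NUM '+' NUM ')'    (rule 8)
-> term '*' '(' NUM '+' NUM ')'      (rule 6)
-> term '*' '(' factor '+' NUM ')'   (rule 8)
-> term '*' '(' term '+' NUM ')'     (rule 6)
-> term '*' '(' expr '+' NUM ')'     (rule 3)
-> term '*' '(' expr '+' factor ')'  (rule 8)
-> term '*' '(' expr '+' term ')'    (rule 6)
-> term '*' '(' expr ')'             (rule 1)
-> term '*' factor                   (rule 7)
-> term                              (rule 4)
-> expr                              (rule 3)

The parenthesized sum was reduced before the multiplication, which is exactly what the parentheses are for.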
C. YACC: Yet Another Compiler Compiler
YACC is a UNIX tool that builds parsers from grammars. We can take the grammar just developed, supplement it with a minimal bit of C code, and YACC will write the complete parser for us. Or, if the parser is only a small part of a big program, YACC will write that small part, and we can then link it to the main program. In either case, YACC saves us from the unbearable tedium of computing parse tables by hand.
A YACC input file is divided into three parts: the declaration part, the grammar part, and the program part. (Remind you of anything? My past as a COBOL programmer coming out!) The three sections are separated by lines consisting of two percent signs:
%%

YACC writes a C source file as its output; it will also write a file of parser information if you desire it. We’ll look at that next time.
C(1). The Declaration Section
The declaration section contains a set of C-style declarations that YACC uses to build the parser. For now, there are only two sorts of declarations that concern us.
The first is the “%token” declaration. YACC will automatically recognize one-character tokens like ‘+’ and ‘-’. All other terminal tokens must be declared with the %token statement, e.g.,
%token NUM
Non-terminals do not need to be declared; they are implicitly declared in the grammar rules.
The second declaration type lets us pass regular C declarations through YACC to the C compiler. These look like this:
/* 4 */

%{
#define blip 12
%}
Anything between the %{ and the %} is passed through unchanged, and written by YACC to the C source file.
It is customary to include a #define for YYSTYPE. What is YYSTYPE? It is the type of the parser’s stack. For now, there’s no need to worry about it. It is “int” by default, and int will work fine for what we’re doing this month. Later, after I’ve discussed how the parser operates, and we know what sort of things go on the stack, we’ll come back to it.
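For the record, spelling out the default explicitly would look like this (a minimal sketch; this month’s example simply omits it and takes the default int):

%{
#define YYSTYPE int    /* the type of the parser's stack */
%}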
C(2). The Grammar Section
The grammar section includes all the grammar productions, like those we discussed earlier. They are written in a somewhat non-standard format (taking ASU’s notation as the standard, which I think is reasonable). They also provide a framework for code generation.
C(2)(a). Production Rule Format
The arrow in productions is replaced with a colon, and productions with the same left-hand side are combined into one, with the right-hand sides separated with vertical bars, |. The last right-hand side is terminated with a semicolon. The usual practice is to format the productions like this:
/* 5 */

expr : expr '+' term
     | expr '-' term
     | term
     ;

term : term '*' factor
     | term '/' factor
     | factor
     ;

factor : '(' expr ')'
       | NUM
       ;
in place of:
expr -> expr '+' term
expr -> expr '-' term
expr -> term
term -> term '*' factor
term -> term '/' factor
term -> factor
factor -> '(' expr ')'
factor -> NUM
C(2)(b). Code Generation
You can follow each right-hand side with some pseudo-C code that the parser will execute when that production is applied (i.e., if you read that stuff about reductions, when the input is reduced by that production).
Here’s an example. Suppose you have the production rules:
expr : expr '+' term
     | expr '-' term
     | term
     ;
The question is, what is the value of the “expr”? In the first case, it’s the value of the first token of the right-hand side plus the value of the third; in the second, it’s the first minus the third; and in the third, it’s just the value of the first token. So we write:
/* 6 */

expr : expr '+' term
        {
        $$ = $1 + $3;
        }
     | expr '-' term
        {
        $$ = $1 - $3;
        }
     | term
        {
        $$ = $1;
        }
     ;
This isn’t hard to figure out; $$ is the value of the left-hand side, $1 is the value of the first token of the right-hand side, $3 the value of the third token. Using those symbols, you just write straight C code. You are not limited to a single line, and you can call routines that you have written elsewhere. Don’t forget the braces or the semicolons.
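For instance, here is a hypothetical fragment showing a call to a routine written elsewhere (emit_add() is a name of my own invention, not something YACC provides):

expr : expr '+' term
        {
        $$ = emit_add($1, $3);   /* our own routine, defined elsewhere */
        }
     ;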
C(3). The Program Section
The program section is made up of straight C code, and is copied unaltered into the C source file. While YACC requires that you supply some routines to it, you can put them in another file, and this section can be completely empty. In simpler cases, however, the program section allows you to write your entire program in the YACC source file.
Here’s what YACC requires: a lexical analyzer, called yylex(), and an error routine, yyerror(). Common sense requires a main() routine as well.
The prototype for yylex() is:
/* 7 */

int yylex();
The parser routine, called yyparse(), will call yylex() whenever it needs input. yylex() reads the next token, sets the global variable “yylval” to the value of the token (optional; this is the “$” value used in the production code), and returns the token type. You might wonder where it reads the token from, since it hasn’t any arguments. The answer is, it uses global variables of some sort, such as a global string or a global file reference, that are set up by the main() routine.
The error routine is:
/* 8 */

void yyerror(char *message);
The message is generated by the parser when it detects an error; yyerror()’s job is to notify the user that something has gone wrong. Of course, you don’t have to live with YACC’s default error messages, which are rather unilluminating; you can call yyerror() yourself. More on this in a future article.
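For instance, here is a hypothetical fragment (not part of this month’s tool) that borrows a production from the grammar below and reports a division by zero in plain language:

term : term '/' factor
        {
        if ($3 == 0)
            yyerror("division by zero");   /* our own message */
        else
            $$ = $1 / $3;
        }
     ;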
D. An MPW Hex Calculator
Now for some real code. Our example program is an MPW tool, a hex calculator. It will evaluate expressions using +, -, * and /, and will also properly evaluate expressions with parentheses in them. The name of the tool is “Hex”; you can invoke it with an expression, e.g.:
Hex '2 - ( 3 - 4 )'
in which case it will evaluate, print the result, and exit. Notice that in this case, the expression must be in quotation marks, or MPW will treat each token separately. Note also that the tokens must be separated by spaces. This is to simplify the lexical analyzer; we will relax this requirement in a future version.
The tool may also be invoked without an expression to evaluate. It will then go into a loop, reading in expressions and evaluating them, e.g.:
Hex
? 2 - ( 3 - 4 )
= 3
? 64 * 8
= 320
?
The loop ends on a blank line. (Remember that everything is in hex: 64 * 8 = 320 because hex 64 is decimal 100, and 100 * 8 = 800 decimal is hex 320.)
Here are the input globals and the code for the main() routine:
/* 9 */

#include <stdio.h>
#include <string.h>

char *input;    /* the expression to evaluate */
char *token;    /* the current token, for debugging and errors */

void main(int argc, char *argv[])
{
    char thestring[256];

    if (argc < 1)
        printf("\tImpossible error!\n");
    else if (argc > 2)
        printf("\tHey! One at a time!\n");
    else if (argc == 2)
    {
        input = argv[1];
        yyparse();
    }
    else
    {
        printf("? ");
        while (strlen(gets(thestring)) > 2)
        {
            input = &thestring[2];
            yyparse();
            printf("? ");
        }
    }
}
“input” is the input buffer, and holds the expression to evaluate. “token” is filled by yylex() with the current token; it’s useful for debugging and error reporting. Finally, in the read loop, notice that gets() reads back the “? ” prompt as well as the expression -- this being MPW, where the tool’s input line includes the echoed prompt -- which is why “input” points to thestring[2].
D(1). The Lexical Analyzer and Error Routines
This is about as simple as a lexical analyzer can be. The strtok() routine returns the tokens in the input string, so long as they are separated by spaces; when the tokens run out, yylex() returns a newline to mark the end of input. If sscanf() can read a hex number, that’s what the token must be; if it isn’t a number, it must be an operator or a parenthesis, so yylex() returns its first character.
This routine is so simple it is vulnerable to pathological input -- please don’t TRY to break it! Where’s the challenge? This is just a stopgap, good enough to serve until we get a real lexical analyzer.
/* 10 */

int yylex()
{
    if (input == 0)
        token = strtok(0, " ");
    else
    {
        token = strtok(input, " ");
        input = 0;
    }
    if (token == 0)
        return('\n');
    else if (sscanf(token, "%x", &yylval) == 1)
        return(NUM);
    else
        return(token[0]);
}
The error routine is even simpler. It just prints out the parser’s default error message, which is the blindingly helpful “syntax error”. We do add the current token, which may be helpful.
/* 11 */

#define yyerror(x)                      \
{                                       \
    printf("\t%s [%s]\n", x, token);    \
    return(0);                          \
}
(The backslashes let the macro span several lines, to fit Mac Toot’s narrow columns; that’s not the way we’d write it in C, of course!) Note the return statement. yyerror is a macro, so its body is expanded within the parser routine yyparse(), and that’s what we’re returning from. The effect is to abort the translation.
D(2). The Grammar
The grammar we will use is the same as that developed earlier. There’s one additional production: we have to give YACC a START SYMBOL, which is the left-hand side of the first production. In this case, “prob” is short for “problem”.
prob -> expr '\n'
The newline is a single-character token returned by yylex() to signal the end of input. So if the entire input string is an expression, we’ve got a complete problem. Any other occurrence of a newline is an error.
/* 12 */

prob -> expr '\n'
expr -> expr '+' term
expr -> expr '-' term
expr -> term
term -> term '*' factor
term -> term '/' factor
term -> factor
factor -> '(' expr ')'
factor -> NUM
D(3). The Value Calculation and Output
Here’s the actual YACC input file, declaration and grammar sections. The declaration section consists of a single %token declaration, making NUM a terminal symbol. The grammar section includes all the productions listed above, each with some associated code.
/* 13 */

%token NUM

%%

prob : expr '\n'
        {
        printf("\t= %X\n", $1);
        return(0);
        }
     ;

expr : expr '+' term
        {
        $$ = $1 + $3;
        }
     | expr '-' term
        {
        $$ = $1 - $3;
        }
     | term
        {
        $$ = $1;
        }
     ;

term : term '*' factor
        {
        $$ = $1 * $3;
        }
     | term '/' factor
        {
        $$ = $1 / $3;
        }
     | factor
        {
        $$ = $1;
        }
     ;

factor : '(' expr ')'
        {
        $$ = $2;
        }
       | NUM
        {
        $$ = $1;
        }
       ;
D(4). The Make File
YACC goes right into your make file, just like any other MPW tool. Hmm. I thought I said that YACC was a UNIX tool... The truth is that there is an MPW version of YACC available, called MACYACC (I have renamed the tool on my system to make typing easier).
# 14

Hex.c ƒ Hex.make Hex.y
    Yacc -VHex.out Hex.y
Hex.c.o ƒ Hex.make Hex.c
    C -r Hex.c
Hex ƒƒ Hex.make Hex.c.o
    Link -w -t MPST -c 'MPS ' ∂
        Hex.c.o ∂
        "{Libraries}"Interface.o ∂
        "{CLibraries}"CRuntime.o ∂
        "{CLibraries}"StdCLib.o ∂
        "{CLibraries}"CSANELib.o ∂
        "{CLibraries}"Math.o ∂
        "{CLibraries}"CInterface.o ∂
        "{Libraries}"ToolLibs.o ∂
        -o Hex
