Ceed A Tiny Compiler with ELF & PE Target

Introduction

I have always been interested in compilers but never got a chance to build one. I recently got some time and decided to learn compiler development. I started to look for an example compiler that meets following criteria:

Minimalistic – Quick walk-through and reading
Documented – Easier to understand
Complete – Produces executables without external tools
Uses Lex & Yacc – Many books available on building parsers

I didn’t find a good example that meets these requirements and can used as a starting point to bootstrap knowledge of compiler construction. I decided to develop a compiler, primarily for learning and later on, to create an example, that serves as a resource for others who are interested in learning compiler development.

Ceed (pronounced as seed) is an open source (BSD style license) compiler that compiles a high level language to x86 machine code and packages it in an executable binary. It supports 32-bit Linux and Windows and can generate output file in either ELF or PE format. Complete code for Ceed is around 1400 lines of Flex/Bison/C code excluding comments.

In this post, I would describe the high level language used by Ceed, followed by design and implementation of Ceed.

Ceed Language

Primary purpose of Ceed is to be a compiler that can be used in learning compiler development. For this reason, Ceed language supports a limited set of features. It supports basic arithmetic and relational operators, string constants, global and local variables, function calls without parameters, stdin/stdout I/O, and if/else conditions. It does not have any looping constructs as that is left as an exercise. To give an idea of Ceed language, a few examples are shown below.

Obligatory hello world in Ceed:

_a {
    ws("Hello world!");
    nl();
}

A more elaborate example, that does arithmetic based on user input:

/*
 * A helper function to show the concept of functions and global variables.
 * It prints value of global variable R, followed by a new line.
 */
_b {
    wi(R); nl();
}

/*
 * This is the entry point, equivalent to c main function.
 * It does simple calculation based on user input.
 */
_a {
    nl();
    ws("-------------------------"); nl();
    ws("| Choice   Operation    |"); nl();
    ws("-------------------------"); nl();
    ws("|   1      Addition     |"); nl();
    ws("|   2      Substraction |"); nl();
    ws("|   0      Exit         |"); nl();
    ws("-------------------------"); nl();
    ws("Please enter your choice: ");
    x = ri();
    if (x == 1) {
        ws("Enter first number: ");
        a = ri();
        ws("Enter second number: ");
        b = ri();
        /*
         * Store result in global variable R and print value
         * of R via a function call.
         */
        R = a + b;
        ws("Result: ");
        _b();
    } el {
        if (x == 2) {
            ws("Enter first number: ");
            a = ri();
            ws("Enter second number: ");
            b = ri();
            ws("Result: ");
            /*
             * Print result directly via a built-in function call.
             */
            wi(a - b); nl();
        }
    }
}

Complete set of supported language constructs in Ceed are:

26 global variables – Named from A to Z
26 local variables – Named from a to z
26 functions – Named from _a to _z
- This is a simplification of using only a-z or A-Z is to avoid maintaining a symbol lookup table
C-style NULL terminated string constants – These can contain any character and unlike C, there is no concept of escape sequence
Built in functions for stdin/stdout I/O
- ws and wi to write string and integer to stdout respectively
- ri to read integer from stdin
- nl to write new line to stdout
Arithmetic Operators – Only + and – are supported
Relational Operators – Only == and > are supported
Conditional clause using if and el are supported
C style comments are supported
Entry point of a Ceed program is function with the name _a
- If this function is not defined, compilation results in an error

As mentioned earlier, there are no looping constructs in Ceed language. However, there is support for recursion and that can be used to simulate loops as shown below.

_b {
    if (A == 0) {
        ws("Result: ");
        wi(R);
        nl();
    } el {
        R = R + A;
        A = A - 1;
        _b();
    }
}

_a {
    A = 5;
    _b();
}

Each function has 26-local variables allocated on the stack, which translates to around 104 bytes of stack + few other bytes to save registers. This means you may hit a stack overflow exception after certain number of iteration even if you don’t use any local variables. On Windows 10, I got stack overflow after ~9K iterations and on Linux after ~78K iterations.

Ceed Design & Implementation

There are three main components in Ceed that are described in following sections.

1. Parser

Parser is the component that takes an Ceed language input file and creates a symbol table from it. It uses flex and bison to do tokenization of input file and apply context free grammar rules to parse it. It passes the symbol table to code generator at the end of parsing. This is implemented in ceed.l and ceed.y

Code for Ceed parser is based on a lex and yacc calculator tutorial from epaperpress. It is extended to support function calls, strings and few other aspects. I won’t spend much time on the parser design as code is pretty self-explanatory, as well as one can read through original tutorial and many other Lex and Yacc tutorials. Upon successful parsing, a symbol table is generated. Key data structure for the symbol table are shown below:

typedef enum _sym_typ {
    sym_typ_const,
    sym_typ_ident,
    sym_typ_stmt
} sym_typ;

// Constants
typedef struct _sym_const {
    int value;
} sym_const;

// Identifiers
typedef struct _sym_ident {
    int index;
} sym_ident;

// Expressions
typedef struct _sym_stmt {
    int type;
    int sym_count;
    struct _sym *sym_list[1];
} sym_stmt;

typedef struct _sym {
    sym_typ type;
    union {
        sym_const con;
        sym_ident ident;
        sym_stmt stmt;
    };
} sym;

There are three possible types of symbols: constant, identifier and statement. The statement type is further categorized in many sub types. Once all the code is parsed, code generator is invoked from ceed.y file (emit_code), as shown below:

program:
    code { emit_code($1); free_sym($1); }
    ;

Three types of identifiers: local variable, global variable and function names are supported. As a simplification, inside parser, these are represented by an index and not by their name. Local variables in Ceed code can be from a-z and inside parser, these are represented from 0-25. Global variables can be from A-Z and are represented by index 26-51. Function names can be from _a-_z and are represented from 52-77. In the code generator section, you would see how these indexes are used to identify, whether a local variable is used or global variable or to identify which function is called, as well as generate code for these.

2. Code Generator

It is the component that takes the symbol table generated by parser and compiles it to machine code. Ceed does not use any intermediate language. It directly produces machine code for x86. This is implemented in ceedcmpl.c.

There are many design choices used to simplify overall code generation. In this section, we describe some of the key design decisions and assumptions to make it easier to understand Ceed implementation.

Warning: Code generated by Ceed is sub optimal and might have bugs. Ceed is intended to serve as a tutorial for learning compiler development. It is not intended to be used for any production use cases. There are instances where code generation is duplicated to keep it simple. A modular code generation is possible and an improvement that can be done in future. However, for now, the implementation is good enough for the purpose of this compiler.

Virtual address space

On Linux and Windows, each executable needs a virtual address space to execute. This address space is decided based on information contained in the executable file. In ELF file, it is program headers and in PE files, these are called sections. Overall layout of an executable generated by Ceed is shown below.

+---------------------------------+  0x900000
|                                 |
|                                 |
|              .rdata             |
|                                 |
|                                 |
+---------------------------------+  0x800000
|                                 |
|                                 |
|              .text              |
|                                 |
|                                 |
+---------------------------------+  0x700000
|                                 |
|                                 |
|              .data             |
|                                 |
|                                 |
+----------------+----------------+  0x600000
|                |                |
|                |                |
|                |    .idata      |
|                |                |
|                |                |
|  Not Loaded    +----------------+  0x500000
|                |                |
|                |                |
|                |   PE Header    |
|                |                |
|                |                |
+----------------+----------------+  0x400000
      ELF32             PE32

We make a simplifying assumption that each section is at most 16MB and that each section starts at a fixed address. This simplification allows a one pass code generation without need for any fix ups in a second pass. Without this assumption, .text section would be of variable size and .rdata section start address would change based on actual size of .text section. This would require creation of a fix up table during code generation. Once virtual address of .rdata is calculated (based upon .text size), then code would need to be fixed up by enumerating the fix up table. This simplification allows a one pass code generation. The limitation is that you cannot write code in a single executable that exceeds 16MB.

There is still need for some fix ups e.g. in an if/else condition (checkout emit_if_else) but it is localized and does not complicate the code. This fix up logic can be used as reference to build a full blown fix up tables (if needed).

Global and Local Variable Support

Another simplification we have done is that, Ceed only supports fixed number of global or local variables i.e. 26 variables. This allows it to know the offset of each variable in their section and simplify code generation. As an example, the code generator to access a variable is shown below.

u32
emit_var(sym *p)
{
   if (p->ident.index >= 0 && p->ident.index <= 25) {
       // Local variable.

       // mov ecx, ebp
       emit8(0x89);
       emit8(0xe9);

       // sub ecx, x
       emit8(0x83);
       emit8(0xe9);
       emit8((p->ident.index + 1) * 4);
   } else if (p->ident.index >= 26 && p->ident.index <= 51) {
       // Global variable.

       // mov ecx, data_section_va + offset
       emit8(0xb9);
       emit32(get_data_va(ei) + ((p->ident.index - 26) * 4));
   }
   return CEED_TYPE_ADDR;
}

From the parser, an index for variables, 0-25 for local variables and 26-51 for global variables is returned. To generate code that uses a global or local variable, this index into .data section or stack is used, respectively, without any lookup tables.

String Support

Only constant strings are supported in Ceed. You cannot use a variable to refer to a string either. All strings are stored in .rdata section which is a read only section. From lexer i.e. ceed.l, if a string is identified in the input file, an add_str function is called, that copies this string to .rdata section and returns the offset into .rdata section where this string is copied. Lexer then returns this offset to represent the string and code generator emits code to access the string from this offset. As strings are only allowed in the built in function ws, code generation for ws is quite simple, as it prints specific number of bytes from the offset into .rdata section.

Standard Input/Output Handling

I/O handling is platform specific and differs on Linux and Windows. In Linux, system calls using int 0x80 are used to do I/O from console, whereas in Windows, you need to call into kernel32.dll functions, such as ReadConsole, WriteConsole to do console I/O. The code to do standard I/O is implemented in ceed_elf.c and ceed_pe.c for Linux and Windows respectively. In windows, a special section that has import table is generated to allow linking to kernel32. Please look at pe_gen_import_section for implementation of it. In windows, stdin and stdout handle needs to be obtained by calling kernel32:GetStdHandle() function as well. As an optimization, Ceed uses available space in .data section to store these handles and uses them for all future calls to read or write from console. These handles are obtained by executing a small piece of code before _a of the input source file executes (look at: pe_emit_main_init). On linux, there is no such requirement so input/output code invokes the relevant system calls by using int 0x80 directly and does not depend upon any other libraries.

Integer I/O Handling

System APIs on Linux and Windows supports I/O in terms of characters. To allow I/O in terms of integers, Ceed needed support for atoi and itoa to convert integer to strings and strings to integer. wi function takes an integer as input, converts it to a string using itoa and outputs it to stdout. ri function reads input from stdin, converts it to integer using atoi and returns the read value as an integer. As Ceed doesn’t link to any external libraries, except for kernel32 for I/O on Windows, we needed to somehow invoke code that can implement atoi and itoa in the context of output executable generated by Ceed. There were two choices, either to generate machine code that can implement these functions or somehow inject implementation of these functions in Ceed output executable. We did this by writing atoi and itoa in x86 asm (atoi_itoa.asm) and using nasm to generate machine code for these APIs. We copy this generated code in Ceed (look at ceedrtl.c) as a byte array and copy this machine code directly in the .text section at offset 0x0 and 0x80 (look at cmplr_init in ceedcmpl.c).

This approach is non-scalable and error prone. In a proper compiler, you would either allow linking to a standard library that has such functions implemented, or allow creation of such functions by supporting strings without limitations that are present in Ceed. This was outside the scope of Ceed, hence a slightly tedious but easier approach was taken.

3. Executable Generator

It is the component that takes generated machine code and wraps up in the target executable, which can be either in ELF format or PE format. This is implemented in ceedelf.c and ceedpe.c.

This component is quite straight forward. It creates an output file in binary format and writes executable header, ELF and Program header for ELF, DOS header, PE header and optional header for PE. After that it writes each section one by one in the file. Each section has two properties, file offset and memory offset (or virtual offset). File offset is where the section is located in the file on disk and memory/virtual offset is where the section should be loaded in memory.

You can learn more about these files by using readelf on Linux and CFF explorer or other PE format tools on Windows.

Source Code

Source code for Ceed is provided under a BSD style license, available at: https://github.com/intellectualheaven/ceed/