[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1. Teyjus Internals

The process of building the system is quite simple. However, this simplicity is achieved through a complex combination of many systems and programming languages. If you are unfamiliar with autoconf and Makefiles, read the `autoconf' and `make' documentation thoroughly. A familiarity with `sed' and `awk' as well as basic shell scripting (`sh') will also be useful.

The following sections describe the components of the build process, and then delve into descriptions of the many modules comprising the Teyjus system.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.1 Configure

The file `configure.in' configures the Teyjus system based on the characteristics of the machine on which it is running. The following is brief description of the major components of this file.

After initializing autoconf and testing the compiler, we check for the presence of GNU make. We prefer GNU make because it will bring any files referenced with the include directive up to date before reading them. We use this feature to ensure the incremental rebuilding of the dependencies as the project is changed. See Dependencies.

The next few sections contain standard autoconf code to search for utility programs and header files. Note that many of these programs are not necessary to build the core functionality of Teyjus. For instance, texi2html is only used to create an HTML version of this documentation. Of the header files, only `signal.h' and either `string.h' or `strings.h' are necessary.

Configure next checks for the presence of several library functions. The function mmap is used in the memory system (see Heap), in conjunction with getpagesize or sysconf. The signal functions (sigaction, signal, and siglongjmp) are used to implement backtracking and to handle terminal interrupts. The Paths module makes use of the stat function in searching for files. See Paths.

The next section makes the determination of where and how to allocate the heap for the abstract machine. See Allocating the Heap. Configure compiles and runs two programs, `build/mtloop' and `build/memtest', to locate the best place for the memory system. These programs will attempt to find the largest possible block in the best location to place the heap for the abstract machine.

Configure supports the inclusion of debugging symbols in the executables with the --enable-debug option to configure. It also supports the use of a library called dmalloc which helps to detect memory errors. This library publically available and is not distributed with the Teyjus system.

We also support two modes of status output from the makefiles: plain and verbose. For more information, see Makefiles.

The last part of the configuration file contains code which is inserted into the file `config.status'. This code builds several utility files:

`build/mkdep.awk'

An awk script used to extract the pathnames of files #include'd from a source file.

`build/dependencies'

A file which is included from the primary makefile and which further includes every dependency file in the system.

`build/*.build'

Files containing lists of object files used in a particular build. For example, `build/tjdis.build' lists the object files which should be linked to create `tjdis'.

`examples/modules'

A file listing all of the bytecode files for the examples. This file is used in building the examples.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.2 Makefiles

There are four Makefiles in the project. Only the top-level Makefile should be used directly. It contains rules to build the four executables (make tjcc tjsim teyjus tjdis), to generate this documentation (make doc), to bytecode-compile the lambdaProlog examples (make examples), and to run the regression tests (make test). The usual rules for installation and cleaning are also present, as well as a set of rules to rebuild the Makefiles in case of a change to the `configure' files.

When invoking `configure', the user may specify verbose or plain mode. Under plain mode, @PLAIN@ is replaced by @echo, causing make to echo the following text to the terminal, and @VERBOSE@ is replaced by @, preventing make from echoing the following command as it is executed. Under verbose mode, @PLAIN@ is replaced by @#, causing the line to be ignored by make, and @VERBOSE@ is replaced by nothing, so make will echo each command as it is executed.

The first section of the Makefile contains variables substituted by `config.status', mostly specifying pathnames of utility programs. The QUIETER_MAKE variable deserves special attention. If we are using GNU make, we include in the variable a flag to prevent the printing of messages such as "Entering directory xyz". If we are not using GNU make, however, then this variable is merely the pathname of the make executable.

The default rule builds all of the executables as well as the documentation, and also creates a shell script called `build/mkdist'. This script is used on-site to build releases of the system, and is not covered in this manual.

The Makefile then sets several make variables which are used later.

The next section contains rules to rebuild the `build/*.build' files, which is effective only under GNU make. Under other versions of make, the user must invoke `config.status' to rebuild these files.

The next section contains top-level build rules. For the four executables, these are invocations of the linker. For the documentation, make recurses into the `documentation/' directory.

The next section implements the rules to build individual files. For the `.c.cd' and `.h.hd' rules, see Dependencies. The other rules are standard rules to build `.o' files from `.c' files, and to compile files with flex and yacc.

The next rule substitutes the current version of the system (stored in `build/version') into the source file `front/version.c', from which it can be printed in response to the -version command-line option.

The next section contains rules to clean and "really clean" the source directories.

The next section contains rules to install and uninstall the system. These rules install into subdirectories of the directory specified to `configure' with the --prefix= option. The four executables are placed in the `bin/' subdirectory, the documentation is placed in the `info/' subdirectory, and the system's examples are placed in the `examples/' subdirectory. Rules for uninstallation are also provided.

The next section contains rules to rebuild the various files used in the configuration process when they are changed.

Finally, we include the file `build/dependencies', which is described in Dependencies.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.3 Dependencies

The Teyjus system implements source-file dependencies through a custom-built system based on `sed' and `awk'.

Dependencies are generated on a file-by-file basis, and stored in files with the suffix `.cd' for `.c' files and `.hd' for `.h' files. These dependency files are incorporated into the top-level Makefile via include. Specifically, the Makefile includes a file called `build/dependencies', which is built by `config.status' and includes each of the dependency files in the system. Each of these files contains a make rule specifying dependencies for its namesake. Thus, if `foo.c' contained

 
#include "foo.h"

foo()
{
}

and `foo.h' contained

 
#include "bar.h"

foo();

then `foo.cd' would contain

 
foo.o: foo.cd foo.hd

and `foo.hd' would contain

 
foo.h: bar.hd

The Makefile provides suffix rules to make `.hd' files from `.h' files, `.cd' files from `.c' files, and `.o' files from `.c' files. The final dependency tree is then:

 
foo.o --> foo.cd --> foo.c
            \--> foo.hd --> foo.h
                  \--> bar.hd --> bar.h ... etc.

where --> denotes "depends on".

GNU make is capable of bringing files which are referenced by a include directive up to date before reading the file. However, as implemented, the system does not depend on this feature. When the system is first built, `configure' examines make, and if GNU make is not found, it generates all `.cd' and `.hd' files immediately. If GNU make is found, however, it allows make to generate the files. Once the files exist, they will be recompiled as sources are changed because they are a part of the `.o' dependency tree as described above.

The `.hd' and `.cd' files are built by an Awk script called `build/mkdep.awk', which is itself built by an Awk script called `build/mkmkdep.awk'. The latter script builds in knowledge of the directory structure of the system, allowing `build/mkdep.awk' to translate references such as #include "paths.h" to qualified pathnames (`utils/paths.h').

The `build/' directory contains files (`*.build.in') which list the object files used in a particular build. These files should remain organized as they are -- there is no provision for comments or blank lines. These files are converted automatically by the Makefile (or `configure' if GNU make is not present) into a format suitable for make.

When you add a source file to the system, add a reference to its object (`.o') file in the relevant `build/*.build.in' files. Then invoke `config.status' in the main directory, to rebuild the proper files, before invoking make.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.4 Allocating the Heap

The heap used for the Teyjus abstract machine has some complex requirements, and its efficient implementation is crucial to the speed of the simulator. To this end, we use many system-dependent features when they are available, but provide alternate implementations for less capable systems.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.4.1 The Data Areas

The Teyjus abstract machine, based on the Warren Abstract Machine used in Prolog, makes use of six regions of memory. One of these regions is the code area, which contains the abstract machine bytecode implementing the predicates available for execution. The remaining five areas are used to store data relevant to the current simulator context. From highest to lowest in memory, these regions are:

PDL

The push-down lists are used in interpreted Simpl. One of the two lists grows from the top of this memory region, and the other grows from the bottom.

SL Stack

The structures list stack is used in higher-order unification. It grows from the bottom of its memory region toward the top top in a stacklike fashion.

Trail

The trail provides storage for information to allow the reversal of destructive changes made during machine execution. This is information is used on backtracking to restore the state of the machine at a particular choice point. The trail grows from the bottom of its memory toward the top top in a stacklike fashion.

Stack

The global stack stores branch points and import points. The stack grows from the bottom of its memory toward the top top in a stacklike fashion.

Heap

The heap is a space that is managed in much the same way as in the WAM: the portion of it that is used grows as procedures are invoked and space is reclaimed only upon failure of the procedure. The heap grows from the bottom of its memory toward the top in a stacklike fashion.

These five areas are stored, back-to-back, in one contiguous chunk and will be referred to as the simulator heap.

Do not confuse the heap, which is a single region, with the simulator heap, which contains the above five regions. In this document we shall use the term heap to refer to both the simulator heap and the heap region; the meaning should be clear from context.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.4.2 Heap Issues

The following issues must be addressed in an implementation:

  1. Regions must be contained in a contiguous 256 Mb block.

    Data types and properties are distinguished within the heap and trail by the use of six tag bits in each 32-bit integer. Pointers must thus be stored in the remaining 26 bits. Since addresses are multiples of four, the two least-significant bits are available for tagging. In the Teyjus implementation of the simulator, the most-significant four bits are also used for tagging. For efficiency in untagging addresses, we must assume that all addresses in the simulator heap have identical most-significant four bits. Thus all five regions of the heap must be contained in a contiguous block of at most 256 Mb (the 28th power of 2).

  2. Regions are immovable after creation.

    The simulator makes use of address comparisons in its execution, and thus depends on the particular ordering of regions given above. Also, the data stored in these regions is highly interrelated by pointers, so runtime movement of the regions is impractical.

  3. The use of malloc() is not an effective solution.

    To maximize the memory available to lambdaProlog programs, then, the simulator must distribute the heap regions in some economical fashion over the 256 Mb. However, this requires the allocation of 256 Mb at simulator startup, and the block allocated must contain addresses with identical most-significant four bits as described above. Clearly, malloc() is not the most appropriate tool for such allocation.

    However, if malloc() is used to allocate a much smaller block, it is likely that that block will contain addresses with identical most-significant four bits, and that those four bits will be identical between invocations of the simulator executable. However, with malloc, neither of these conditions can be guaranteed.

  4. The simulator must detect region overflow.

    As a robust system, Teyjus must detect and recover from situations where the space in a heap region is exhausted. Since the regions grow as stacks, it is a simple matter to check for overflow at each increment. However, such checks exact a considerable performance penalty, and should be avoided if possible.

  5. Block allocation from the simulator heap must be supported.

    In addition to the stacklike growth of the simulator heap's regions, the loading of bytecode modules requires the allocation of fixed-size blocks of memory within the simulator heap to contain static module data such as strings and type skeletons.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.4.3 Virtual Memory

This section assumes some familiarity with modern paging virtual memory systems. For more detail on the implementation, read your system's documentation on the syscall mmap().

The first three of the above issues can be resolved by using 256 Mb of memory, without requiring the allocation of such a quantity at startup. Essentially, we need to allocate address space without storage space. That is, we would like to reserve a large range of addresses for the heap. However, we may not use all of the addresses, so we do not require storage (either in RAM or on backing store) for the addressed data. Further, we would like to automate the allocation of storage for these pages to avoid costly runtime boundary checks.

The syscall mmap() does all of this and more. Our technique is to map a 256 Mb block of addresses to `/dev/zero', using the MAP_PRIVATE and MAP_NORESERVE flags. The MAP_PRIVATE flag instructs the memory system to create a private copy of `/dev/zero' only on a write to a page. The MAP_NORESERVE causes the system to delay the reservation of each frame in the system's swap space until this write is made.

Issue (4) can also be resolved with mmap(). Instead of checking for overflow at each increment, we place a guard pages at maximum extent of each region. A guard page is a page with no access permissions, such that an accessing the page in any way will result in a SIGSEGV or SIGBUS signal being sent to the process. For the heap, stack, and trail, this page is placed at the upper end of the region. For the PDL and the SL stack, the guard page is placed in the center of the region.

The simulator must then catch SIGSEGV and SIGBUS, and handle them appropriately.

Issue (5) complicates matters slightly. These blocks of strings and type skeletons must be allocated somewhere in the simulator heap, but should not interfere with the stacklike allocation of space in any region. We choose to allocate these blocks at the upper ends of the trail, stack, and heap. In doing so, we must move the guard pages down below the blocks to prevent overflow from the trail, stack, and heap into the blocks.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.4.4 Where To Put The Heap

In a 32-bit address space, there are only 16 blocks which can contain the 256 Mb heap block. It is likely that some of these blocks are partially occupied by program text, data, stack, or libraries, and are thus unusable. However, it is very likely that at least one block will be available at all times. Thus, we need only select this page.

For efficiency reasons, this selection needs to take place at configuration time. See Configure. Several C precompiler macros are generated, and these are used in several simulator source files to efficiently tag and untag pointers at runtime. See Tag Values.

This configuration time selection is made by two programs, `build/mtloop' and `build/memtest'. The configure script compiles both of these programs, then runs `build/mtloop'. `build/mtloop' is simply a program to call `build/memtest', which attempts to place a block of a given size at a given location. By default, `build/mtloop' will try to place the 256Mb heap block with mmap() by calling `build/memtest'. Smaller heap sizes or the use of malloc may be requested.

`build/mtloop' will first attempt to place the desired block at a system chosen location. Should this be unavailable, `build/mtloop' will try all possible locations to allocate a block of this size. If there is no possible location for a block of this size, `build/mtloop' will try again with a block of half the original size. This will continue until a location is found in which to place the block, or the block size becomes less than 16Mb. Should it become absolutely necessary, `build/mtloop' will attempt to use malloc().


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.4.5 Tag Values

Of the several data types used in the simulator heap, the "reference" tag is distinguished. References are destructively evaluated. That is, if A is a reference to B and B is a reference to C, then when A is evaluated it becomes a reference to C. References are common in the heap, and must be evaluated quickly. We therefore use the most-significant four bits for the simulator heap as the tag bits for references. Thus, we may treat a reference data element as a bare pointer, requiring no pointer arithmetic before evaluation.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.4.6 The Heap Implementation

As described above, much of the information about the simulator heap is determined at build time by the configuration system. This information is incorporated into the header file config.h. Given this information, the tagging system is realized in the Data Formats module (see Data Formats). The heap is allocated and managed by the Heap module (see Heap).


[ << ] [ >> ]           [Top] [Contents] [Index] [ ? ]

This document was generated by Gopalan Nadathur on October, 20 2007 using texi2html 1.76.