1455 lines
		
	
	
		
			64 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			1455 lines
		
	
	
		
			64 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
# JOE: Hacker's Guide
 | 
						|
 | 
						|
[TOC]
 | 
						|
 | 
						|
## Character classes
 | 
						|
 | 
						|
The data structures for representing character classes such as \[a-z0-9\] or
 | 
						|
\[\\w\\d\] are in [cclass.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/cclass.c) and [cclass.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/cclass.h).  These are used by the syntax
 | 
						|
highligher, the regular expression matcher, the keyboard mapping table and
 | 
						|
for implementing JOE's ctype routines (such as joe_iswupper).
 | 
						|
 | 
						|
Structures are provided for sets (is a character in a particular set?), mappings
 | 
						|
to integers (which integer is associated with this character?) and mappings to
 | 
						|
pointers (which pointer is associated with this character?).
 | 
						|
 | 
						|
Two different sets of algorithms are provided: one set based on intervals
 | 
						|
and another based on radix trees.
 | 
						|
 | 
						|
### Interval arrays
 | 
						|
 | 
						|
An interval is a simple range of character codes in the following structure:
 | 
						|
 | 
						|
	struct interval {
 | 
						|
		int first, last;
 | 
						|
	};
 | 
						|
 | 
						|
If A-Z is be represented, then first is 65 and last is 90.  If just the
 | 
						|
number zero is to be represented, then first and last are both 48.
 | 
						|
 | 
						|
An array of these represents a set.  If the array is sorted, you can use a
 | 
						|
binary search to determine if a character belongs to a set.  Two functions
 | 
						|
supporting this are provided:
 | 
						|
 | 
						|
	/* Sort an struct interval array */
 | 
						|
	void interval_sort(struct interval *array, ptrdiff_t size);
 | 
						|
 | 
						|
	/* Test if character ch is in a sorted interval array using binary search.
 | 
						|
	   Returns -1 if not found, or index to matching struct interval */
 | 
						|
	ptrdiff_t interval_test(struct interval *array, ptrdiff_t size, int ch);
 | 
						|
 | 
						|
These functions are no longer used in the editor since they have been
 | 
						|
superseded by radix trees.  However, facts about Unicode use arrays of
 | 
						|
intervals, see [unicat.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicat.c).  Unicat.c is generated by [util/uniproc.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/util/uniproc.c) from the
 | 
						|
raw data provided on the [unicode.org](ftp://ftp.unicode.org/Public/UNIDATA/) website.  Arrays of intervals provide
 | 
						|
the source data for the radix trees.  The struct Cclass includes an
 | 
						|
interval array as the source data before it is optimized into a struct Rset.
 | 
						|
 | 
						|
### Interval lists
 | 
						|
 | 
						|
The next data structure in cclass.c is a linked list of intervals:
 | 
						|
 | 
						|
	struct interval_list {
 | 
						|
	        struct interval_list *next; /* Next item in list */
 | 
						|
	        struct interval interval; /* Range of characters */
 | 
						|
	        void *map;
 | 
						|
	};
 | 
						|
 | 
						|
This provides a way to create a mapping between characters and pointers
 | 
						|
(void *map).  For example, a key sequence mapping table (a KMAP) can be
 | 
						|
implemented this way.  Map could point to either a MACRO bound to a key, or
 | 
						|
to a KMAP for a sub-map (for example a top-level map would have ^K bound to
 | 
						|
a submap which has H bound to the help command).
 | 
						|
 | 
						|
The following functions are provided to handle interval lists:
 | 
						|
 | 
						|
	/* Construct a struct interval_list */
 | 
						|
	struct interval_list *mkinterval(
 | 
						|
		struct interval_list *next,
 | 
						|
		int first, int last,
 | 
						|
		void *map);
 | 
						|
 | 
						|
	/* Free an interval list (frees all items in the list) */
 | 
						|
	void rminterval(struct interval_list *item);
 | 
						|
 | 
						|
	/* Add a single interval to an interval list */
 | 
						|
	struct interval_list *interval_add(
 | 
						|
		struct interval_list *interval_list,
 | 
						|
		int first, int last,
 | 
						|
		void *map);
 | 
						|
 | 
						|
	/* Add an array of intervals (a character class) to an interval list */
 | 
						|
	struct interval_list *interval_set(
 | 
						|
		struct interval_list *list,
 | 
						|
		struct interval *array, int size,
 | 
						|
		void *map);
 | 
						|
 | 
						|
	/* Look up single character in an interval list, return what it's mapped to */
 | 
						|
	void *interval_lookup(
 | 
						|
		struct interval_list *list,
 | 
						|
		void *dflt,
 | 
						|
		int ch);
 | 
						|
 | 
						|
	/* Print interval list to the log buffer for debugging */
 | 
						|
	void interval_show(struct interval_list *list);
 | 
						|
 | 
						|
Interval_add is a complex funtion which adds a mapping to an interval list. 
 | 
						|
You pass it the address of the first item in the list (or NULL) and it
 | 
						|
returns an edited list.  Interval_add will override any mappings which
 | 
						|
already exist in the given range.  The resulting list is always sorted and
 | 
						|
the items in it are non-overlapping.  It might add, delete or modify items
 | 
						|
to the list.
 | 
						|
 | 
						|
Interval_set is just like interval_add, but takes an array of intervals (as
 | 
						|
previously discussed) instead of a single interval.  For example, to create
 | 
						|
a mapping from all Unicode numbers to a particular MACRO, you could pass the
 | 
						|
Nd_table from unicat.c to it.
 | 
						|
 | 
						|
Interval_lookup returns the pointer mapped to the character, or if none is
 | 
						|
mapped, it returns dflt.
 | 
						|
 | 
						|
### Radix trees
 | 
						|
 | 
						|
These sets of radix tree structures and associated methods are provided:
 | 
						|
 | 
						|
* Rset: for sets
 | 
						|
* Rtree: for maps from characters to pointers
 | 
						|
* Rmap: for maps from characters to integers
 | 
						|
 | 
						|
The best way to understand a radix tree is to study one of the lookup
 | 
						|
functions (slightly simplified):
 | 
						|
 | 
						|
	void *rtree_lookup_unopt(struct Rtree *r, int ch)
 | 
						|
	{
 | 
						|
	        int a, b, c, d;
 | 
						|
	        int idx;
 | 
						|
 | 
						|
	        a = (ch >> TOPSHIFT);
 | 
						|
	        b = (SECONDMASK & (ch >> SECONDSHIFT));
 | 
						|
	        c = (THIRDMASK & (ch >> THIRDSHIFT));
 | 
						|
	        d = (LEAFMASK & (ch >> LEAFSHIFT));
 | 
						|
 | 
						|
	        idx = r->top.entry[a];
 | 
						|
	        if (idx != -1) {
 | 
						|
	                idx = r->second.table.b[idx].entry[b];
 | 
						|
	                if (idx != -1) {
 | 
						|
	                        idx = r->third.table.c[idx].entry[c];
 | 
						|
	                        if (idx != -1)
 | 
						|
	                                return r->leaf.table.d[idx].entry[d];
 | 
						|
	                }
 | 
						|
	        }
 | 
						|
 | 
						|
	        return NULL;
 | 
						|
	}
 | 
						|
 | 
						|
The idea is to break the 21 bit character code into four parts: a (7 bits),
 | 
						|
b (5 bits), c (5 bits) and d (4 bits) and use these to traverse a four-level
 | 
						|
tree.  The top-level table is embedded in the struct Rtree itself.  Tables
 | 
						|
from the other levels (the two mid levels and the leaf level) are allocated
 | 
						|
out of struct Levels.  The top and mid-level tables are arrays of 16-bit
 | 
						|
shorts.
 | 
						|
 | 
						|
The number of levels and the size of each level was chosen as a compromise
 | 
						|
between speed and size.  We certainly want radix tree lookups to be faster
 | 
						|
than interval array binary searches (otherwise why bother?), so we can't
 | 
						|
have too many levels.  On the other hand, radix trees should not use too
 | 
						|
much memory.  So for example a single level of 2^21 entries is out.
 | 
						|
 | 
						|
#### De-duplication
 | 
						|
 | 
						|
The leaf tables often contain the same data, so we de-duplicate them such
 | 
						|
that many of the second to last level table entries index to the same leaf
 | 
						|
table.  This is such an effective optimization that the resulting radix tree
 | 
						|
ends up being around the same size as an equivalent interval array (an
 | 
						|
interval array with an array of the same number of entries containing
 | 
						|
associated data (ints or pointers)).
 | 
						|
 | 
						|
The second to last level also has duplicate tables, but not many so it's not
 | 
						|
de-duplicated.
 | 
						|
 | 
						|
De-deplication happens progressively during the build, and as a second pass
 | 
						|
optimization step.  Why do it progressively if there is a whole second pass? 
 | 
						|
There is an important reason for this: shorts are used for the second to
 | 
						|
last level table indexes.  If you don't deduplicate progressively, you can
 | 
						|
have more than 64K leaf tables before optimization and overflow the short. 
 | 
						|
There is a second reason: lower memory usage tends to equate to faster
 | 
						|
performance.  For this reason it's preferable to keep the memory usage low
 | 
						|
during the build.
 | 
						|
 | 
						|
The progressive deduplication algorithm is simple: if the most recently
 | 
						|
completed leaf table matches the previously completed leaf table, we use the
 | 
						|
previous one in its place.
 | 
						|
 | 
						|
There is one big complication: there is no strong concept of a completed
 | 
						|
leaf table.  A leaf table is assumed to be complete when the last entry has
 | 
						|
been written to.  But there is nothing preventing previous data from being
 | 
						|
overwritten, or for tables to be made in reverse order (this can happen in a
 | 
						|
regex character class like [zyxw...]).  It means that a de-duplicated table
 | 
						|
might have to re-duplicated.  There is a reference count on each leaf table
 | 
						|
which is incremented every time that table is used during de-duplication.
 | 
						|
If the reference count is not equal to one during a table write, it is first
 | 
						|
copied to a new table (it's a copy on write scheme).
 | 
						|
 | 
						|
The second-pass deduplication works like this: a hash table of all leaf
 | 
						|
tables is built.  If a duplicate table is detected during the hash table
 | 
						|
build, it is deduplicated (not added to hash table).
 | 
						|
 | 
						|
#### Anglocentrism
 | 
						|
 | 
						|
There is one further optimization: we provide faster access to the first 512
 | 
						|
character codes (so that ASCII is fast).  To do this, the first of the
 | 
						|
second to last level tables is embedded in the struct Rtree.  If the code is
 | 
						|
below 512, then the upper 5 bits index this table to find the leaf table,
 | 
						|
and the lower 4 bits index to leaf table.  Thus, there is one local access
 | 
						|
and one indirect access for codes below 512.
 | 
						|
 | 
						|
### Rset
 | 
						|
 | 
						|
struct Rset radix trees are used for simple sets.  In this case there are no
 | 
						|
leaf tables.  Instead, the second to last level table contains bit maps.  If
 | 
						|
a bit is a one, then the corresponding character is in the set.  Each bit
 | 
						|
map is 16 bits: the shorts that make up the second to last level table are
 | 
						|
used for these bitmaps instead of as leaf-table indexes.
 | 
						|
 | 
						|
Here are the provided methods:
 | 
						|
 | 
						|
	/* Initialize a struct Rset */
 | 
						|
	void rset_init(struct Rset *r);
 | 
						|
 | 
						|
	/* Clear a struct Rset: anything malloced for it is freed, but the
 | 
						|
	 * struct Rset itself is not freed (it could be embedded in another
 | 
						|
	 * structure). */
 | 
						|
	void rset_clr(struct Rset *r);
 | 
						|
 | 
						|
	/* Test if a character is in a set: this one can be called before
 | 
						|
	 * before rset_opt is called. */
 | 
						|
	int rset_lookup_unopt(struct Rset *r, int ch);
 | 
						|
 | 
						|
	/* Add a range of characters to a set: ch to che */
 | 
						|
	void rset_add(struct Rset *r, int ch, int che);
 | 
						|
 | 
						|
	/* Add an array of ranges to a set */
 | 
						|
	void rset_set(struct Rset *r, struct interval *array, ptrdiff_t size);
 | 
						|
 | 
						|
	/* De-duplicate leaf tables and set up embedded table for quicker
 | 
						|
	 * access for characters below 512 */
 | 
						|
	void rset_opt(struct Rset *r);
 | 
						|
 | 
						|
	/* Test if a character is in a set: only use this after rset_opt has
 | 
						|
	 * been called */
 | 
						|
	int rset_lookup(struct Rset *r, int ch);
 | 
						|
 | 
						|
	/* Print Rset in human readable form to startup log for debugging */
 | 
						|
	void rset_show(struct Rset *r);
 | 
						|
 | 
						|
Rset_add originally accepted just a single character to add, not a range. 
 | 
						|
This turned out to be too inefficient: the editor was using a lot of CPU
 | 
						|
time during start-up just to build the standard character class tables. 
 | 
						|
Therefore, it has been modified to loop over the range for a fast inner loop
 | 
						|
and to lift function call overhead out of the original loop in rset_set.
 | 
						|
 | 
						|
### Rtree
 | 
						|
 | 
						|
struct Rtree provides a mapping from characters to pointers.  For example,
 | 
						|
this is used in the regular expression code to traverse states.
 | 
						|
 | 
						|
Here are the provided methods:
 | 
						|
 | 
						|
	/* Initialize a struct Rset */
 | 
						|
	void rtree_init(struct Rtree *r);
 | 
						|
 | 
						|
	/* Clear a struct Rtree: anything malloced for it is freed, but the
 | 
						|
	 * struct Rtree itself is not freed (it could be embedded in another
 | 
						|
	 * structure). */
 | 
						|
	void rtree_clr(struct Rtree *r);
 | 
						|
 | 
						|
	/* Test if a character is in a tree: return pointer bound to it
 | 
						|
	 * or NULL.  This can be called before rtree_opt. */
 | 
						|
	void *rtree_lookup_unopt(struct Rtree *r, int ch);
 | 
						|
 | 
						|
	/* Add a binding to from a range of characters to a pointer */
 | 
						|
	void rtree_add(struct Rtree *r, int ch, int che, void *map);
 | 
						|
 | 
						|
	/* Add a binding to from each range of characters in the given array to a pointer */
 | 
						|
	void rtree_set(struct Rtree *r, struct interval *array, ptrdiff_t len, void *map);
 | 
						|
 | 
						|
	/* Add a binding to from each range of characters in the given
 | 
						|
	 * interval_list list to its pointer */
 | 
						|
	void rtree_build(struct Rtree *r, struct interval_list *l);
 | 
						|
 | 
						|
	/* De-duplicate leaf tables and set up embedded table for quicker
 | 
						|
	 * access for characters below 512 */
 | 
						|
	void rtree_opt(struct Rtree *r);
 | 
						|
 | 
						|
	/* Test if a character is in a tree: return pointer bound to it
 | 
						|
	 * or NULL.  Only call this after rtree_opt has been called. */
 | 
						|
	void *rtree_lookup(struct Rtree *r, int ch);
 | 
						|
 | 
						|
	/* Print Rtree in human readable form to startup log for debugging */
 | 
						|
	void rtree_show(struct Rtree *r);
 | 
						|
 | 
						|
They are the same as the rset methods, except that a pointer is provided. 
 | 
						|
Also, a method is provided to convert an interval_list into an Rtree.
 | 
						|
 | 
						|
### Rmap
 | 
						|
 | 
						|
struct Rmap provides a mapping from characters to integers.  For example,
 | 
						|
this is used for Unicode simple case conversion.  The integer provides an
 | 
						|
offset to be added to the character to convert it from one case to another. 
 | 
						|
It's important to use an offset, not a direct replacement character for this
 | 
						|
so that leaf nodes have the same values and can be de-duplicated.
 | 
						|
 | 
						|
Here are the provided methods:
 | 
						|
 | 
						|
	void rmap_init(struct Rtree *r);
 | 
						|
 | 
						|
	void rmap_clr(struct Rtree *r);
 | 
						|
 | 
						|
	int rmap_lookup_unopt(
 | 
						|
		struct Rtree *r,
 | 
						|
		int ch,
 | 
						|
		int dflt
 | 
						|
	);
 | 
						|
 | 
						|
	void rmap_add(
 | 
						|
		struct Rtree *r,
 | 
						|
		int ch, int che,
 | 
						|
		int map,
 | 
						|
		int dflt
 | 
						|
	);
 | 
						|
 | 
						|
	void rmap_set(
 | 
						|
		struct Rtree *r,
 | 
						|
		struct interval *array,
 | 
						|
		ptrdiff_t len,
 | 
						|
		int map,
 | 
						|
		int dflt
 | 
						|
	);
 | 
						|
 | 
						|
	void rmap_opt(struct Rtree *r);
 | 
						|
 | 
						|
	int rmap_lookup(struct Rtree *r, int ch, int dflt);
 | 
						|
 | 
						|
	void rmap_show(struct Rtree *r);
 | 
						|
 | 
						|
They are the same methods as for struct Rtree, except that you provide
 | 
						|
'dflt', the value to return if there is no binding for the character.
 | 
						|
 | 
						|
### struct Cclass
 | 
						|
 | 
						|
Finally we have struct Cclass, which is made up of an interval array and a
 | 
						|
struct Rset and which provides set operations for character classes.  The
 | 
						|
interval array is used for the set operations, then the array is optimized
 | 
						|
into a struct Rset for quick character lookups.
 | 
						|
 | 
						|
The Unicode database (made from the data in unicat.c and methods in
 | 
						|
unicode.c) uses this data structure.  For example, you can get a character
 | 
						|
class containing all letters with:
 | 
						|
 | 
						|
	struct Cclass *letters = unicode("L");
 | 
						|
 | 
						|
The set operations allow you to construct complex classes, for example the
 | 
						|
character class for \\i, the initial character of identifiers is contructed
 | 
						|
like this:
 | 
						|
 | 
						|
        cclass_init(cclass_alnum_);
 | 
						|
        cclass_union(cclass_alnum_, unicode("L"));
 | 
						|
        cclass_union(cclass_alnum_, unicode("Pc"));
 | 
						|
	/* cclass_union(cclass_alpha_, unicode("Sc")); */
 | 
						|
        cclass_union(cclass_alpha_, unicode("Nl"));
 | 
						|
        cclass_union(cclass_alnum_, unicode("Mn"));
 | 
						|
        cclass_union(cclass_alnum_, unicode("Mc"));
 | 
						|
        cclass_union(cclass_alnum_, unicode("Nd"));
 | 
						|
        cclass_add(cclass_alnum_, 0x200c, 0x200d);
 | 
						|
        cclass_opt(cclass_alnum_);
 | 
						|
 | 
						|
The following methods are provided:
 | 
						|
 | 
						|
	/* Initialize a character class */
 | 
						|
	void cclass_init(struct Cclass *cclass);
 | 
						|
 | 
						|
	/* Free memory used by a Cclass (does not free cclass itself) */
 | 
						|
	void cclass_clr(struct Cclass *cclass);
 | 
						|
 | 
						|
	/* Add a range to a character class */
 | 
						|
	void cclass_add(struct Cclass *cclass, int first, int last);
 | 
						|
 | 
						|
	/* Remove a range from a character class */
 | 
						|
	void cclass_sub(struct Cclass *cclass, int first, int last);
 | 
						|
 | 
						|
	/* Merge one character class into another */
 | 
						|
	void cclass_union(struct Cclass *cclass, struct Cclass *n);
 | 
						|
 | 
						|
	/* Merge an interval array into a character class */
 | 
						|
	void cclass_merge(struct Cclass *cclass, struct interval *intervals, int len);
 | 
						|
 | 
						|
	/* Subtract one character class from another */
 | 
						|
	void cclass_diff(struct Cclass *cclass, struct Cclass *n);
 | 
						|
 | 
						|
	/* Compute unicode inverse of a character class, for [^a-z] */
 | 
						|
	void cclass_inv(struct Cclass *cclass);
 | 
						|
 | 
						|
	/* Lookup a character in a character class using binary search.
 | 
						|
	   Return true if character is in the class. */
 | 
						|
	int cclass_lookup_unopt(struct Cclass *m, int ch);
 | 
						|
 | 
						|
	/* Optimize a character class for fast lookup with cclass_lookup.
 | 
						|
	 * Generates a radix tree version of the character class. */
 | 
						|
	void cclass_opt(struct Cclass *cclass);
 | 
						|
 | 
						|
	/* Return true if character is in the class */
 | 
						|
	int cclass_lookup(struct Cclass *cclass, int ch);
 | 
						|
 | 
						|
	/* Print character class to startup log */
 | 
						|
	void cclass_show(struct Cclass *cclass);
 | 
						|
 | 
						|
	/* Remap a character class from Unicode to a specific locale */
 | 
						|
	struct Cclass *cclass_remap(struct Cclass *m, struct charmap *map);
 | 
						|
 | 
						|
Cclass_remap is used to convert a character class from Unicode to some other
 | 
						|
character set (represented by struct charmap *).  For example, to get
 | 
						|
letters in the KOI8-R 8-bit character set, use:
 | 
						|
 | 
						|
	struct Cclass *r = cclass_remap(unicode("L"), find_charmap("KOI8-R"));
 | 
						|
 | 
						|
Cclass_remap remembers the conversion to save work, so the input struct
 | 
						|
Cclass is expected to remain around forever.  Basically this is to be used
 | 
						|
only for built-in character classes.
 | 
						|
 | 
						|
## Regular expressions
 | 
						|
 | 
						|
As of verison 4.1, JOE uses an enhanced version of [Thompson's NFA matching algorithm](https://swtch.com/~rsc/regexp/regexp1.html).
 | 
						|
 | 
						|
The code is in [regex.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/regex.c) and [regex.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/regex.h).
 | 
						|
 | 
						|
The regular expression matcher is a subroutine of the the larger text search
 | 
						|
algorithm in JOE.  For example, text search will use [Boyer-Moore](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm)
 | 
						|
to find a leading prefix of the regular expression before running the
 | 
						|
regular expression matcher.  A leading prefix is leading text present in
 | 
						|
all of the different possible text which can match the regular expression.
 | 
						|
For example, in 'hello(foo|bar)', the leading text is 'hello', but it's empty
 | 
						|
in 'hello|there'.  One of the jobs of the regular expression code is to
 | 
						|
identify this leading text.
 | 
						|
 | 
						|
The API for the regular expression library is simple:
 | 
						|
 | 
						|
	/* Compile an expression s/len into a struct regcomp
 | 
						|
	 * Check regcomp->err for compile errors.
 | 
						|
	 */
 | 
						|
	struct regcomp *joe_regcomp(
 | 
						|
		struct charmap *charmap, /* The character set of the expression */
 | 
						|
		const char *s, ptrdiff_t len, /* The regular expression */
 | 
						|
		int icase, /* Set to ignore case */
 | 
						|
		int stdfmt, /* Set for standard syntax expressions, vs. JOE syntax */
 | 
						|
		int debug /* Set to have debug information printed to startup log */
 | 
						|
	);
 | 
						|
 | 
						|
	/* Free a compiled regular expression */
 | 
						|
	void joe_regfree(struct regcomp *r);
 | 
						|
 | 
						|
	struct regmatch {
 | 
						|
	        off_t rm_so, rm_eo; /* Starting and ending byte offset of a match, or -1 if no match */
 | 
						|
	}
 | 
						|
 | 
						|
	/* Returns 0 if buffer at p matches compiled expression r
 | 
						|
	 * 'matches' is filled in with submatch addresses
 | 
						|
	 * 'nmatch' has size of matches array
 | 
						|
	*/
 | 
						|
	int joe_regexec(
 | 
						|
		struct regcomp *r, /* Compiled regular expression */
 | 
						|
		P *p, /* Pointer to edit buffer */
 | 
						|
		int nmatch, /* Size of matches array */
 | 
						|
		struct regmatch *matches, /* Locations of sub-matches */
 | 
						|
		int fold /* Set to ignore case */
 | 
						|
	);
 | 
						|
 | 
						|
 | 
						|
Joe_regcomp converts the regular expression into an [AST](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
 | 
						|
with an operator precedence parser and then generates a byte-code program
 | 
						|
representation of the regular expression's [NFA](https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton).
 | 
						|
Joe_regexec simulates the byte-code program with an arbitrary number of threads machine. 
 | 
						|
It feeds characters from the buffer into each active thread until one
 | 
						|
of the threads encounters the END instruction, there are no more threads, or
 | 
						|
the end of the buffer is reached.  A side effect is the buffer pointer __P__
 | 
						|
is advanced to the first character not part of the matching text.
 | 
						|
 | 
						|
If the 'v' flag is included in JOE's search and replace options prompt, the
 | 
						|
AST and program are sent to JOE's startup log so that they can be viewed
 | 
						|
(with ESC x showlog).  For example, here is the log when you try to search
 | 
						|
for 'hello\\(foo\\|bar\\)\\*there':
 | 
						|
 | 
						|
~~~~~
 | 
						|
Parse tree:
 | 
						|
,
 | 
						|
  ,
 | 
						|
    ,
 | 
						|
      ,
 | 
						|
        ,
 | 
						|
          ,
 | 
						|
            ,
 | 
						|
              ,
 | 
						|
                ,
 | 
						|
                  ,
 | 
						|
                    'h'
 | 
						|
                    'e'
 | 
						|
                  'l'
 | 
						|
                'l'
 | 
						|
              'o'
 | 
						|
            *
 | 
						|
              {
 | 
						|
                |
 | 
						|
                  ,
 | 
						|
                    ,
 | 
						|
                      'f'
 | 
						|
                      'o'
 | 
						|
                    'o'
 | 
						|
                  ,
 | 
						|
                    ,
 | 
						|
                      'b'
 | 
						|
                      'a'
 | 
						|
                    'r'
 | 
						|
          't'
 | 
						|
        'h'
 | 
						|
      'e'
 | 
						|
    'r'
 | 
						|
  'e'
 | 
						|
Leading prefix 'hello'
 | 
						|
NFA-program:
 | 
						|
PC      INSN
 | 
						|
--      ----
 | 
						|
0:      'h'
 | 
						|
4:      'e'
 | 
						|
8:      'l'
 | 
						|
12:     'l'
 | 
						|
16:     'o'
 | 
						|
20:     fork 92
 | 
						|
28:     bra 0
 | 
						|
36:     fork 64
 | 
						|
44:     'f'
 | 
						|
48:     'o'
 | 
						|
52:     'o'
 | 
						|
56:     jump 76
 | 
						|
64:     'b'
 | 
						|
68:     'a'
 | 
						|
72:     'r'
 | 
						|
76:     ket 0
 | 
						|
84:     fork 28
 | 
						|
92:     't'
 | 
						|
96:     'h'
 | 
						|
100:    'e'
 | 
						|
104:    'r'
 | 
						|
108:    'e'
 | 
						|
112:    end
 | 
						|
Total size = 116
 | 
						|
~~~~~
 | 
						|
 | 
						|
The "Parse tree" (AST) is printed in an indented format: A line begins with an
 | 
						|
operator, then all succeeding lines with deeper indentation are its operands. 
 | 
						|
The ',' operator means succession: match the first operand, then match the
 | 
						|
second operand.  The '{' operator means record the submatch of its
 | 
						|
argument.  The '*' operator means 0 or more matches of its argument.  The
 | 
						|
'|' operator means match one argument or the other.
 | 
						|
 | 
						|
The "Leading prefix" is the part of the regular expression which can use
 | 
						|
Boyer-Moore search.
 | 
						|
 | 
						|
The "NFA-program" shows the dis-assembled program.  A letter like 'h' is an
 | 
						|
instruction to match the letter itself.  'fork' is an instruction which
 | 
						|
creates a new thread beginning at the specified program counter value. 
 | 
						|
'jump' means the thread should goto the specified PC value.  'bra' and 'ket'
 | 
						|
indicate that the start and ending location of submatch should be recorded.
 | 
						|
 | 
						|
Here is the log from the regular expression '-\\i\\c-' to show you how
 | 
						|
character classes appear (\\i is the initial character of an identifier, and
 | 
						|
\\c is the following character of an identifier):
 | 
						|
 | 
						|
~~~~~
 | 
						|
Parse tree:
 | 
						|
,
 | 
						|
  ,
 | 
						|
    ,
 | 
						|
      '-'
 | 
						|
      [
 | 
						|
        [41..5a] [5f..5f] [61..7a]
 | 
						|
    [
 | 
						|
      [30..39] [41..5a] [5f..5f] [61..7a]
 | 
						|
  '-'
 | 
						|
Leading prefix '-'
 | 
						|
NFA-program:
 | 
						|
PC      INSN
 | 
						|
--      ----
 | 
						|
0:      '-'
 | 
						|
4:      class [41..5a] [5f..5f] [61..7a]
 | 
						|
12:     class [30..39] [41..5a] [5f..5f] [61..7a]
 | 
						|
20:     '-'
 | 
						|
24:     end
 | 
						|
Total size = 28
 | 
						|
~~~~~
 | 
						|
 | 
						|
Note that this is showing the ASCII character encoding so that the character
 | 
						|
classes are not too big.  With the UTF-8 character encoding, the character
 | 
						|
classes will include the full Unicode definitions of \\i and \\c.
 | 
						|
 | 
						|
### Functions
 | 
						|
 | 
						|
	/* Parse a character or escape sequence from ptr/len.
 | 
						|
	 * Returns with character or -256 and cat (if not NULL) filled in with a character class
 | 
						|
	 */
 | 
						|
	int escape(int utf8, const char **ptr, ptrdiff_t *len, struct Cclass **cat);
 | 
						|
 | 
						|
Escape reads one character from a string.  If the character is an escape
 | 
						|
sequence, it's parsed into either a single character or a character class.
 | 
						|
Escape is used in other parts of the editor: for example, it's used to
 | 
						|
parse character lists of syntax highlighting files.
 | 
						|
 | 
						|
	/* Conventional syntax regex */
 | 
						|
	static int do_parse_conventional(struct regcomp *g, int prec, int fold);
 | 
						|
 | 
						|
do_parse_conventional is the standard syntax regular expression parser.  It
 | 
						|
records the AST in regcomp->nodes.
 | 
						|
 | 
						|
	/* JOE syntax regex */
 | 
						|
	static int do_parse(struct regcomp *g, int prec, int fold);
 | 
						|
 | 
						|
do_parse is the JOE syntax regular expression parser.  The JOE syntax is
 | 
						|
just like the regular syntax except that regular expression special
 | 
						|
characters must be escaped.
 | 
						|
 | 
						|
	/* Determine leading prefix of search string */
 | 
						|
	/* We can use boyer-moore on the prefix if:
 | 
						|
		Character set is byte coded.
 | 
						|
		Character set is UTF-8 coded, but no folding is requested.
 | 
						|
		Character set is UTF-8 coded, folding is requested, but character is below 128 */
 | 
						|
 | 
						|
	static int extract(struct regcomp *g, int no, int fold);
 | 
						|
 | 
						|
Extract finds the leading prefix for Boyer-Moore from the AST.
 | 
						|
 | 
						|
	/* Convert parse-tree into code with infinite loops */
 | 
						|
	static void codegen(struct regcomp *g, int no, int *end);
 | 
						|
 | 
						|
Codegen converts the AST into byte-code.  Functions from [frag.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/frag.c) / [frag.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/frag.h)
 | 
						|
write instructions to the memory image and handle alignment.  They also provide a
 | 
						|
mechanism for linking the program (to handle forward jumps).  Basically a
 | 
						|
linked-list of all the jump intructions which target the same destination is
 | 
						|
created right in the memory image, using the jump instruction operands as
 | 
						|
the link pointers.  When the jump destination is finally known, the
 | 
						|
link-list is traversed and the link pointers are replaced with the
 | 
						|
destination address.
 | 
						|
 | 
						|
### Matcher
 | 
						|
 | 
						|
joe_regexec has the arbitrary number of threads simulator.  It has two banks
 | 
						|
of threads, A and B, and ping-pongs between them for each input character. 
 | 
						|
It feeds the next input character to all threads in bank A.  It executes as
 | 
						|
many instructions as possible for each thread, and stops when the thread
 | 
						|
finally accepts the character or rejects it.  If the thread accepts the
 | 
						|
character, it will be active for the next character, so it is moved to bank
 | 
						|
B.  Otherwise, the thread will die when we transition to bank B.  If a
 | 
						|
thread in bank A forks, the newly created thread is appended to bank A so
 | 
						|
that the current character is fed to it before switching.  Once all threads
 | 
						|
have been fed the character, we switch to bank B and feed in the next
 | 
						|
character.
 | 
						|
 | 
						|
This scheme is simple and fast, but there are complications.  First has to
 | 
						|
do with sub-match addressing.  Without modification, the matcher will find
 | 
						|
all possible solutions to expressions like 'x(a\*)(a\*)y'.  This means that
 | 
						|
at the end of the match, there will be a thread for each of these
 | 
						|
solutions.  If the input is xaaaay, it will find x()(aaaa)y, x(a)(aaa)y,
 | 
						|
x(aa)(aa)y, x(aaa)(a)y, and x(aaaa)()y.
 | 
						|
 | 
						|
Research topic: due to this property, I think the Thompson NFA matcher can
 | 
						|
be extended to support back-references.
 | 
						|
 | 
						|
But we don't want all solutions, we want just one solution, the one where
 | 
						|
the left-most submatch is longest: x(aaaa)()y.  To select the preferred
 | 
						|
solution, joe_regexec has a helper function:
 | 
						|
 | 
						|
	static int add_thread(
 | 
						|
		struct thread *pool, /* Thread pool */
 | 
						|
		int l, int le, /* Current start and end of pool */
 | 
						|
		unsigned char *pc, /* Current thread's pc */
 | 
						|
		Regmatch_t *pos, /* Submatch addresses */
 | 
						|
		int bra_no, /* Size of submatch addresses */
 | 
						|
		int *stack, int sp /* Current thread's stack */
 | 
						|
	);
 | 
						|
 | 
						|
This adds the thread to the end of the new pool B (which is at
 | 
						|
pool[le]).  Before it adds, it checks if there are other existing threads
 | 
						|
(between l and le) at the same program counter.  If there are, then we pick
 | 
						|
the one with the "best" submatch, the one with the longest to the left.
 | 
						|
 | 
						|
The second complication is JOE's extension \\! (which used to be \\c in
 | 
						|
versions before 4.1).  \\! is like '.': it matches one character.  However,
 | 
						|
if that one character is the opening bracket of an expression, it matches
 | 
						|
the entire expression.  \\! will match all of "(1+(2+3))".  To support this,
 | 
						|
each thread needs a stack to record its progress in parsing the balanced
 | 
						|
expression.
 | 
						|
 | 
						|
## Unicode database
 | 
						|
 | 
						|
[unicode.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicode.c), [unicode.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicode.h) and [unicat.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicat.c) provide 
 | 
						|
JOE's database of Unicode facts.  Unicat.c is generated by [util/uniproc.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/util/uniproc.c) from the
 | 
						|
raw data provided on the [unicode.org](ftp://ftp.unicode.org/Public/UNIDATA/) website.
 | 
						|
 | 
						|
There is a table in unicat.c which has the name of, pointer to and size of
 | 
						|
each other table.
 | 
						|
 | 
						|
	struct unicat unicat[] = {
 | 
						|
		{ "Co", 3, Co_table, 0 },
 | 
						|
		{ "Cs", 1, Cs_table, 0 },
 | 
						|
		{ "Zp", 1, Zp_table, 0 },
 | 
						|
		...
 | 
						|
		{ "Domino Tiles", 1, uniblocks + 241, 0 },
 | 
						|
		{ "Mahjong Tiles", 1, uniblocks + 240, 0 },
 | 
						|
		{ "Arabic Mathematical Alphabetic Symbols", 1, uniblocks + 239, 0 },
 | 
						|
		...
 | 
						|
 | 
						|
The function unicode looks up one of these tables and returns a struct
 | 
						|
Cclass version of it:
 | 
						|
 | 
						|
	struct Cclass *unicode(const char *name);
 | 
						|
 | 
						|
If a single letter name is passed to the above function, a Cclass containing
 | 
						|
all two-letter named tables with the same matching first letter is returned. 
 | 
						|
So for example, unicode("L") returns a Cclass of all letters, including
 | 
						|
"Lu", "Ll", "Lt", etc.
 | 
						|
 | 
						|
Joe_iswinit initializes Cclasses for JOE's isw ctype functions.  The
 | 
						|
following functions are provided:
 | 
						|
 | 
						|
	int joe_iswupper(struct charmap *,int c);
 | 
						|
	int joe_iswlower(struct charmap *,int c);
 | 
						|
 | 
						|
	int joe_iswalpha(struct charmap *,int c);
 | 
						|
	int joe_iswalpha_(struct charmap *,int c); /* used for \i */
 | 
						|
 | 
						|
	int joe_iswalnum(struct charmap *,int c);
 | 
						|
	int joe_iswalnum_(struct charmap *,int c); /* used for \c */
 | 
						|
 | 
						|
	int joe_iswdigit(struct charmap *,int c);
 | 
						|
	int joe_iswspace(struct charmap *,int c);
 | 
						|
	int joe_iswctrl(struct charmap *,int c);
 | 
						|
	int joe_iswpunct(struct charmap *,int c);
 | 
						|
	int joe_iswgraph(struct charmap *,int c);
 | 
						|
	int joe_iswprint(struct charmap *,int c);
 | 
						|
	int joe_iswxdigit(struct charmap *,int c);
 | 
						|
	int joe_iswblank(struct charmap *,int c);
 | 
						|
 | 
						|
Another table in unicat.c is the width_table.  This indicates which
 | 
						|
characters are double-wide on terminal emulators.  It is used to implement
 | 
						|
the joe_wcwidth function:
 | 
						|
 | 
						|
	int joe_wcwidth(int wide,int c);
 | 
						|
 | 
						|
'wide' should be true if the character set of c is Unicode.  Both the
 | 
						|
terminal and the file must be Unicode (UTF-8 or UTF-16) or double-wide
 | 
						|
characters to be printed.
 | 
						|
 | 
						|
Another set of tables in unicat.c includes fold_table and fold_repl for case
 | 
						|
folding for string comparisons.  Fold_table is an array of ranges.  If a
 | 
						|
character is in one of these ranges, then the corresponding entry in
 | 
						|
fold_repl is a string (of 1 to 3 characters) which should replace it for
 | 
						|
case folding.  The first character of the string is an offset which should
 | 
						|
be added to the original character to get the first character of the
 | 
						|
replacement.
 | 
						|
 | 
						|
rtree_fold is an struct Rtree representation of the fold table.  It returns
 | 
						|
either a single character replacement offset, or an index into fold_repl if the
 | 
						|
character is to be replaced by a string longer than one character.
 | 
						|
 | 
						|
Lowerize can be called to convert a UTF-32 string into its case folded
 | 
						|
version:
 | 
						|
 | 
						|
	int *lowerize(
 | 
						|
		int *d, ptrdiff_t len, /* Destination string and length */
 | 
						|
		const int *s /* Source string */
 | 
						|
	);
 | 
						|
 | 
						|
Unicat.c also provides simple case conversion tables: toupper_table,
 | 
						|
tolower_table and totitle_table along with the corresponding replacement
 | 
						|
tables: toupper_cvt, tolower_cvt and totitle_cvt.
 | 
						|
 | 
						|
These are used to implement joe_towupper and joe_towlower:
 | 
						|
 | 
						|
	int joe_towupper(struct charmap *,int c); /* Convert to uppercase */
 | 
						|
	int joe_towlower(struct charmap *,int c); /* Convert to lowercase */
 | 
						|
 | 
						|
Digval is provided to return the decimal value of any unicode digit:
 | 
						|
 | 
						|
	int digval(int ch)
 | 
						|
 | 
						|
This function uses Nd_table from unicat.c.  Nd_table has the property that
 | 
						|
each range represents a single set of decimal digits.  Normally adjacent
 | 
						|
ranges would be combined together, but in this case they are separated to
 | 
						|
support digval.
 | 
						|
 | 
						|
Finally unicode.c provides the following function:
 | 
						|
 | 
						|
	int unictrl(int c);
 | 
						|
 | 
						|
Which returns true if JOE should treat the character as a control character
 | 
						|
which should not be sent directly to the terminal.  Instead the character
 | 
						|
value is displayed in hexadecimal: <FFFFF\>.  Basically this function
 | 
						|
returns true if the character is not printable- if it's not in cclass_print.
 | 
						|
 | 
						|
## Character sets
 | 
						|
 | 
						|
[charmap.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/charmap.c) and [charmap.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/charmap.h) implement JOE's representation of character
 | 
						|
sets.
 | 
						|
 | 
						|
Each character set is represented by a "struct charmap *".  A charmap
 | 
						|
provides information about a specific character set.  It provides these
 | 
						|
methods:
 | 
						|
 | 
						|
	/* True if character is punctuation */
 | 
						|
	int joe_ispunct(struct charmap *map, int ch);
 | 
						|
 | 
						|
	/* True if character is printable */
 | 
						|
	int joe_isprint(struct charmap *map, int ch);
 | 
						|
 | 
						|
	/* True if character is a space */
 | 
						|
	int joe_isspace(struct charmap *map, int ch);
 | 
						|
 | 
						|
	/* True if character is first of an identifier */
 | 
						|
	int joe_isalpha_(struct charmap *map, int ch);
 | 
						|
 | 
						|
	/* True if character is rest of an identifier */
 | 
						|
	int joe_isalnum_(struct charmap *map, int ch);
 | 
						|
 | 
						|
	/* True if character is blank: space or tab */
 | 
						|
	int joe_isblank(struct charmap *map, int ch);
 | 
						|
 | 
						|
	/* True if character is blank tab CR LF FF or NUL */
 | 
						|
	int joe_isspace_eos(struct charmap *map, int ch);
 | 
						|
 | 
						|
	/* Convert to lowercase */
 | 
						|
	int joe_tolower(struct charmap *map, int ch);
 | 
						|
 | 
						|
	/* Convert to uppercase */
 | 
						|
	int joe_tolower(struct charmap *map, int ch);
 | 
						|
 | 
						|
	/* Convert from character set to Unicode */
 | 
						|
	int to_uni(struct charmap *map, int ch);
 | 
						|
 | 
						|
	/* Convert from Unicode to character set */
 | 
						|
	int from_uni(struct charmap *map, int ch);
 | 
						|
 | 
						|
There are two types of character sets: Unicode and 8-bit.  If the character
 | 
						|
set is Unicode, then charmap->type will be true.
 | 
						|
 | 
						|
Joe_locale intializes some global character maps:
 | 
						|
 | 
						|
	extern struct charmap *locale_map;      /* Character map of terminal */
 | 
						|
	extern struct charmap *utf8_map;        /* UTF-8 character map */
 | 
						|
	extern struct charmap *utf16_map;       /* UTF-16 character map */
 | 
						|
	extern struct charmap *utf16r_map;      /* UTF-16 reversed  character map */
 | 
						|
	extern struct charmap *ascii_map;       /* Plain ASCII map */
 | 
						|
 | 
						|
Find_charmap can be used to find a named character set:
 | 
						|
 | 
						|
	/* Find (load if necessary) a character set */
 | 
						|
	struct charmap *find_charmap(const char *name);
 | 
						|
 | 
						|
If the character set is not already built into JOE, find_charmap tries to
 | 
						|
load it from a file.  It looks in $HOME/.joe/charmaps and /usr/share/joe/charmaps.
 | 
						|
 | 
						|
The following character sets are built into JOE:
 | 
						|
 | 
						|
	ASCII, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4,
 | 
						|
	ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8,
 | 
						|
	ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13,
 | 
						|
	ISO-8859-14, ISO-8859-15, ISO-8859-16, ISO-8859-1,
 | 
						|
	KOI-8, KOI8-R, KOI8-T, KOI8-U, CP1251, CP1256,
 | 
						|
	CP1255, ARMSCII-8, TIS-620, GEORGIAN-PS
 | 
						|
 | 
						|
Guess_map tries to determine the character map based on the contents of a
 | 
						|
buffer:
 | 
						|
 | 
						|
	struct charmap *guess_map(const char *buf, ptrdiff_t len);
 | 
						|
 | 
						|
It selects the locale_map unless it detects valid UTF-8 sequences in the
 | 
						|
buffers.
 | 
						|
 | 
						|
Finally, my_iconv is provided to convert from one character set to another:
 | 
						|
 | 
						|
	void my_iconv(
 | 
						|
		char *dest, ptrdiff_t destsiz, struct charmap *dest_map,
 | 
						|
		const char *src,struct charmap *src_map
 | 
						|
	);
 | 
						|
 | 
						|
This is used in a few places in JOE, but most of JOE uses to_uni and
 | 
						|
from_uni to convert between character sets a character at a time.
 | 
						|
 | 
						|
## Coroutines in JOE
 | 
						|
 | 
						|
<p>JOE 3.6 now uses co-routines to help reduce the amount of event-driven
 | 
						|
code which has to be written.  All previous versions of JOE were entirely in
 | 
						|
the event driven coding style, which made user dialog code a chore to write. 
 | 
						|
Now, with the help of co-routines, dialog code can be written as old
 | 
						|
fashioned blocking code which is much cleaner, even though JOE can have
 | 
						|
multiple un-finished dialogs on the screen.</p>
 | 
						|
 | 
						|
<p>The coroutine library has only three (now four) functions:</p>
 | 
						|
 | 
						|
<table width="100%" cellspacing=20 border=0 cellpadding=0> <colgroup> <col width="300"> <tbody>
 | 
						|
 | 
						|
<tr valign="top"><td>int co_call(int (*func)(va_list args),...);</td><td>Call a function
 | 
						|
as a co-routine.</td></tr>
 | 
						|
 | 
						|
<tr valign="top"><td>int co_yield(Coroutine *t, int val);</td><td>Suspend current
 | 
						|
co-routine and return to caller with given return value.</td></tr>
 | 
						|
 | 
						|
<tr valign="top"><td>int co_resume(Coroutine *t, int val);</td><td>Suspend current
 | 
						|
co-routine and resume specified one with given return value.</td></tr>
 | 
						|
 | 
						|
<tr valign="top"><td>int co_suspend(Coroutine *u, int val);</td><td>Suspend
 | 
						|
current co-routine (remembering its suspended invoking coroutines) and
 | 
						|
resume the top-level.  The current coroutine is stored in u.  When u is
 | 
						|
resumed and returns, the suspended chain is resumed (the resumer is
 | 
						|
continued when the chain tries to continue the top level.</td></tr>
 | 
						|
 | 
						|
</tbody> </table>
 | 
						|
 | 
						|
<p>co_call() invokes 'func' as a co-routine- it will run on its own stack.
 | 
						|
'func' is called immediately upon the call to co_call. All of the arguments
 | 
						|
after func are passed as a va_list to the function.  co_call calls va_start
 | 
						|
and va_end, 'func' should call va_arg once for each passed argument (see
 | 
						|
stdarg.h).  The co-routine should not yield before it has finished reading
 | 
						|
all of the arguments, otherwise va_end will be called early.  When 'func'
 | 
						|
returns, the co-routine is destroyed and co_call() returns with the return
 | 
						|
value of 'func'.  If the co-routine yields by calling co_yield, co_call
 | 
						|
returns with the return value specified in the call to co_yield.  Note that
 | 
						|
co_call only ever returns once for a particular call.</p>
 | 
						|
 | 
						|
<p>co_yield() suspends the current co-routine and returns with the specified
 | 
						|
return value to the caller (either to co_call or co_resume).  co_yield()
 | 
						|
writes information about the co-routine to the 'Coroutine' structure whose
 | 
						|
address is passed to co_yield.  The Coroutine structure's address is used as
 | 
						|
a handle for the co-routine and can be passed to co_resume().</p>
 | 
						|
 | 
						|
<p>co_resume() suspends the current co-routine (henceforth called the
 | 
						|
original co-routine) and resumes the specified co-routine.  When the
 | 
						|
specified co-routine returns (when the function given in the co_call that
 | 
						|
created the co-routine returns), the original co-routine is continued and
 | 
						|
co_resume() returns with the function's return value. If the specified
 | 
						|
co-routine yields, the original co-routine is continued and co_resume()
 | 
						|
returns with the return value specified in the call to co_yield.</p>
 | 
						|
 | 
						|
<p>co_suspend() is like co_resume, except that the top-level task is resumed
 | 
						|
instead of the calling co-routine.  This has the effect of suspending the
 | 
						|
entire chain of co-routines which invoked the caller of co_suspend. 
 | 
						|
Co_suspend stores the suspended co_routine in another co_routine.  That
 | 
						|
co_routine will resume this one when it finally returns.</p>
 | 
						|
 | 
						|
<p>The co-routine library is structured this way (which is different from
 | 
						|
the usual way where you get a co-routine handle immediately upon creation,
 | 
						|
for example see Knuth) to encourage proper stack un-winding.  Basically, a
 | 
						|
co-routine can not be forcibly destroyed.  Instead, each created co-routine
 | 
						|
is only destroyed when the initial function returns.</p>
 | 
						|
 | 
						|
<p>The co-routine library is based on ucontext.h and the getcontext,
 | 
						|
swapcontext, makecontext and setcontext system calls.  If the system does
 | 
						|
not have these calls, it uses setjmp/longjmp tricks in a highly portable
 | 
						|
way: the stacks for each co-routine are allocated off the top of the main
 | 
						|
stack.  The main stack grows after each allocation.  The stacks are never
 | 
						|
freed, but they are reused.  This works for any machine with a single normal
 | 
						|
stack.  It will not work on IA64 (two stacks) or Cray (the stack is really a
 | 
						|
linked list).</p>
 | 
						|
 | 
						|
<p>I was originally going to use simple cooperative multi-threading, with
 | 
						|
only two basic calls: sleep_on() and wake_up() (inspired by old versions of
 | 
						|
the Linux kernel) and a top-level scheduler.  Sleep_on would suspend the
 | 
						|
current thread, and create a new thread which runs just the scheduler. 
 | 
						|
Wake_up would mark a thread as ready to run, and on the next return to the
 | 
						|
scheduler, it would be run.</p>
 | 
						|
 | 
						|
<p>This turns out to not be powerful enough to implement JOE's keyboard
 | 
						|
macro language, at least without much rewriting and inelegance. It could be
 | 
						|
done, but the keyboard macro player would have to feed events into the
 | 
						|
top-level scheduler or would have to be integrated with it, whereas in older
 | 
						|
versions of JOE (even without co-routines) the macro player just calls the
 | 
						|
edit functions (more or less) directly.  The problem is that sleep_on
 | 
						|
(called by a function asking for user input) would suspend the task all the
 | 
						|
way up to the scheduler including the macro player, thus preventing the
 | 
						|
macro player from simulating user input.</p>
 | 
						|
 | 
						|
<p>On the other hand with real co-routines, the macro player will call each
 | 
						|
edit function as a co-routine (which I like to think of as a task with a
 | 
						|
very clear return point).  If the edit function has to yield, it returns
 | 
						|
immediately to the macro player for the next macro step.  If the yield is
 | 
						|
due to a request for input, the macro player supplies it by calling the
 | 
						|
recorded edit functions.  If the macro wants to allow the user to provide
 | 
						|
input (using the "query" function, also known as an interactive macro), it
 | 
						|
can itself yield (since it is also running in a coroutine) back to the
 | 
						|
top-level edit loop (in actuality, the macro player calls a function which
 | 
						|
calls co_suspend() to do this).  The macro player is continued when the user
 | 
						|
finishes supplying the input (hits the return key at a prompt).  Several
 | 
						|
interactive macros can be running at once with this scheme.</p>
 | 
						|
 | 
						|
## Semi-Automatic Variables
 | 
						|
 | 
						|
<p>JOE 3.6 uses "semi-automatic" variables for strings.  There is a global
 | 
						|
"object-stack", accessed through these functions:</p>
 | 
						|
 | 
						|
<table width="100%" cellspacing=20 border=0 cellpadding=0>
 | 
						|
<colgroup>
 | 
						|
<col width="300">
 | 
						|
<tbody>
 | 
						|
<tr valign="top"><td>void *obj_malloc(int size);</td><td>Allocate a block of memory.</td></tr>
 | 
						|
<tr valign="top"><td>void *obj_realloc(void *ptr, int size);</td><td>Change allocation size.</td></tr>
 | 
						|
<tr valign="top"><td>void obj_free(void *ptr);</td><td>Free a specified block and all newer blocks.</td></tr>
 | 
						|
<tr valign="top"><td>void obj_perm(void *ptr);</td><td>Mark a block as permanent, which in effect moves the block from the stack to the heap.</td></tr>
 | 
						|
<tr valign="top"><td>int obj_len(void *ptr);</td><td>Access hidden "length" part of block (string length).</td></tr>
 | 
						|
<tr valign="top"><td>int obj_size(void *ptr);</td><td>Access hidden "size" part of block (allocation size).</td></tr>
 | 
						|
</tbody>
 | 
						|
</table>
 | 
						|
 | 
						|
<p>These functions remember the order of allocation.  When an allocated
 | 
						|
block is freed, it and all newer blocks are freed as well.  For example, in:</p>
 | 
						|
 | 
						|
<pre>
 | 
						|
unsigned char *a = obj_malloc(10);
 | 
						|
unsigned char *b = obj_malloc(10);
 | 
						|
unsigned char *c = obj_malloc(10);
 | 
						|
unsigned char *d = obj_malloc(10);
 | 
						|
 | 
						|
obj_free(b);
 | 
						|
</pre>
 | 
						|
 | 
						|
<p>b, c, and d are all freed, but a remains allocated.  The top-level edit
 | 
						|
loop is structured to call obj_free after every editor command:</p>
 | 
						|
 | 
						|
<pre>
 | 
						|
unsigned char *gc = obj_malloc(1);
 | 
						|
execmd(...); /* Execute edit commands */
 | 
						|
obj_free(gc); /* Collect garbage */
 | 
						|
</pre>
 | 
						|
 | 
						|
<p>Thus, you rarely have to call obj_free- the only time you might want to do
 | 
						|
it is if you are calling obj_malloc in a loop.  This could cause a large
 | 
						|
amount of memory to be allocated, so obj_free() should be called at the
 | 
						|
end of each loop iteration.</p>
 | 
						|
 | 
						|
<p>obj_perm() marks an object as a heap object instead of a stack object. 
 | 
						|
When a heap object is freed, only itself is freed- the stack is not
 | 
						|
modified.  Also when the stack is freed, heap objects are not touched.</p>
 | 
						|
 | 
						|
## Variable Strings
 | 
						|
 | 
						|
<p>JOE uses dynamic strings built on top of the global object stack
 | 
						|
functions. These strings automatically resize themselves to fit their
 | 
						|
contents.  They also know their own length, which allows you to have NULs in
 | 
						|
the middle of a string.  They also always have a terminating NUL to make
 | 
						|
them compatible with C-strings.  In fact they have (almost) the same type as
 | 
						|
C-strings: unsigned char *.  The length is hidden in the obj_malloc data
 | 
						|
structure.  These functions are provided:</p>
 | 
						|
 | 
						|
<table width="100%" cellspacing=20 border=0 cellpadding=0>
 | 
						|
<colgroup>
 | 
						|
<col width="400">
 | 
						|
<tbody>
 | 
						|
<tr valign="top"><td>unsigned char *vsmk(int prealloc);</td><td>Allocate a zero length string, but with enough space to grow to the specified length.</td></tr>
 | 
						|
<tr valign="top"><td>int vslen(unsigned char *s);</td><td>Return string length or 0 if s is NULL.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vsensure(unsigned char *s, int len);</td><td>Make sure there is enough space for a string of the specified length.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vstrunc(unsigned char *s, int len);</td><td>Set length of string: the string is truncated or extended with spaces to the specified length.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vsncpy(unsigned char *s,int offset,unsigned char *block, int block_size);</td><td>Copy a memory block to a string.  The string is extended with spaces if offset > string length.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vsdupz(unsigned char *s);</td><td>Duplicate a z-string to a variable string.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vsdup(unsigned char *s);</td><td>Duplicate a variable string.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vscpyz(unsigned char *s,unsigned char *z);</td><td>Copy a z-string to a variable string.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vscpy(unsigned char *s,unsigned char *block,int size);</td><td>Copy a memory block to a variable string.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vscatz(unsigned char *s,unsigned char *z);</td><td>Append a z-string to a variable string.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vscat(unsigned char *s,unsigned char *block,int size);</td><td>Append a memory block to a variable string.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vsadd(unsigned char *s,unsigned char c);</td><td>Append a single character to a string.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vsfmt(unsigned char *s,int offset,const char *format,...);</td><td>Sprintf() to a variable string.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vsgets(unsigned char **sp,FILE *f);</td><td>fgets() to a variable string (returns NULL on EOF, does not put '\n' in the string).</td></tr>
 | 
						|
</tbody>
 | 
						|
</table>
 | 
						|
 | 
						|
<p>Note that it is OK to pass NULL in place of 's' in any of the above
 | 
						|
functions.  In this case, a new string is allocated.</p>
 | 
						|
 | 
						|
<p>Most of these function return the address of the variable string, in case
 | 
						|
it was modified.  One exception is vsgets()- it takes an address of a
 | 
						|
variable holding the string.  It returns the string address, but it also
 | 
						|
writes it to the variable directly.</p>
 | 
						|
 | 
						|
<p>Many of the above functions take a memory block, which is specified by an
 | 
						|
address and a length.  These macros can be used to help construct these
 | 
						|
arguments:</p>
 | 
						|
 | 
						|
<table width="100%" cellspacing=20 border=0 cellpadding=0>
 | 
						|
<colgroup>
 | 
						|
<col width="350">
 | 
						|
<tbody>
 | 
						|
<tr valign="top"><td>#define sc(s) (s), (sizeof(s)-1)</td><td>For string constants.</td></tr>
 | 
						|
<tr valign="top"><td>#define sz(s) (s), zlen(s)</td><td>For C-strings (zlen is the same as strlen).</td></tr>
 | 
						|
<tr valign="top"><td>#define sv(s) (s), vslen(s)</td><td>For variable strings.</td></tr>
 | 
						|
</tbody>
 | 
						|
</table>
 | 
						|
 | 
						|
 | 
						|
<p>So for example, vscatz is defined as:</p>
 | 
						|
 | 
						|
<pre>
 | 
						|
unsigned char *vscatz(unsigned char *s, unsigned char *z);
 | 
						|
{
 | 
						|
	return vsncpy(sv(s), sz(z));
 | 
						|
}
 | 
						|
</pre>
 | 
						|
 | 
						|
## Variable arrays of variable strings
 | 
						|
 | 
						|
<p>JOE also semi-automatic variable arrays of variable strings.  These
 | 
						|
functions are provided:</p>
 | 
						|
 | 
						|
<table width="100%" cellspacing=20 border=0 cellpadding=0>
 | 
						|
<colgroup>
 | 
						|
<col width="400">
 | 
						|
<tbody>
 | 
						|
<tr valign="top"><td>unsigned char *vamk(int prealloc);</td><td>Allocate a zero length array, but with enough space to grow to the specified length.</td></tr>
 | 
						|
<tr valign="top"><td>void varm(unsigned char **a);</td><td>Free an array.</td></tr>
 | 
						|
<tr valign="top"><td>int valen(unsigned char *a);</td><td>Return array length or 0 if a is NULL.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vaensure(unsigned char *s, int len);</td><td>Make sure there is enough space for an array of the specified length.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vatrunc(unsigned char *s, int len);</td><td>Set length of array: the array is truncated or extended with NULLs to the specified length.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vaadd(unsigned char **a,unsigned char *s);</td><td>Append a single string to an array.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vasort(unsigned char **a,int len);</td><td>Sort an array.</td></tr>
 | 
						|
<tr valign="top"><td>unsigned char *vawords(unsigned char **a,unsigned char *s,int len, unsigned char *sep, int seplen);</td><td>Convert a word list in s (words separated by characters given in sep/seplen) into an array of strings.</td></tr>
 | 
						|
<tr valign="top"><td>void vaperm(unsigned char **a);</td><td>Mark an array as preferment (heap instead of stack).</td></tr>
 | 
						|
</tbody>
 | 
						|
</table>
 | 
						|
 | 
						|
<p>When an array is marked as permanent, any strings in the array also
 | 
						|
marked as permanent, and any string later added to the array are also marked
 | 
						|
as permanent.</p>
 | 
						|
 | 
						|
 | 
						|
## Edit Buffers
 | 
						|
<p>API:
 | 
						|
</p>
 | 
						|
<p>  Look at the comments in b.h for more information.
 | 
						|
</p>
 | 
						|
<p>  B *bfind(unsigned char *name);
 | 
						|
		Load disk file into memory buffer 'B'.
 | 
						|
</p>
 | 
						|
<p>  bsave(P *p,unsigned char *name,int size);
 | 
						|
		Write size bytes from buffer beginning at p to disk file
 | 
						|
</p>
 | 
						|
<p>  brm(b);	Free data structure
 | 
						|
</p>
 | 
						|
<p>Once you have a B you can access the characters in it via P pointers (which
 | 
						|
are like C++ STL iterators).
 | 
						|
</p>
 | 
						|
<p>  B *b = bfind("foo");	Load file into memory
 | 
						|
</p>
 | 
						|
<p>  P *p = pdup(b->bof);	Get pointer to beginning of file (duplicate
 | 
						|
			b->bof which is a P).
 | 
						|
</p>
 | 
						|
<p>  prm(p);		Free p when we're done with it.
 | 
						|
</p>
 | 
						|
<p>  int c=brch(p);	Get character at p.</p>
 | 
						|
<p>  int c=pgetc(p);	Get character at p and advance it.</p>
 | 
						|
<p>  int c=prgetc(p);	Retract p, then return character at it.</p>
 | 
						|
</p>
 | 
						|
<p>    - These return -1 (NO_MORE_DATA) if you try to read end of file or
 | 
						|
      before beginning of file.
 | 
						|
</p>
 | 
						|
<p>    - A pointer can point to any character of the file and right after the
 | 
						|
      end of the file.
 | 
						|
</p>
 | 
						|
<p>    - For UTF-8 files, character can be between 0 and 0x7FFFFFFF
 | 
						|
</p>
 | 
						|
<p>  Publicly readable members of P:</p>
 | 
						|
<p>	p->byte		The byte offset into the buffer</p>
 | 
						|
<p>	p->line		The line number</p>
 | 
						|
<p>	p->xcol		If P is the cursor, this is the column number
 | 
						|
			where the cursor will be displayed on the screen
 | 
						|
			(which does not have to match the column of the
 | 
						|
			character at P).</p>
 | 
						|
<p>  Some predicates:</p>
 | 
						|
<p>	pisbof(p);	True if pointer is at beginning of buffer</p>
 | 
						|
<p>	piseof(p);	True if pointer is at end of buffer</p>
 | 
						|
<p>	pisbol(p);	True if pointer is at beginning of line</p>
 | 
						|
<p>	piseol(p);	True if pointer is at end of line</p>
 | 
						|
<p>	pisbow(p);	True if pointer is at beginning of a word</p>
 | 
						|
<p>	piseow(p);	True if pointer is at end of a word</p>
 | 
						|
 | 
						|
<p>  More information about character at p:</p>
 | 
						|
<p>	piscol(p);	Get column number of character at p.
 | 
						|
</p>
 | 
						|
<p>  Some other ways of moving a P through a B:
 | 
						|
</p>
 | 
						|
<p>	pnextl(p);	Go to beginning of next line</p>
 | 
						|
<p>	pprevl(p);	Go to end of previous line</p>
 | 
						|
<p>	pfwrd(p,int n);	Move forward n bytes</p>
 | 
						|
<p>	pbkwd(p,int n);	Move backward n bytes</p>
 | 
						|
<p>	p_goto_bof(p);</p>
 | 
						|
<p>	p_goto_eof(p);</p>
 | 
						|
<p>	p_goto_bol(p);</p>
 | 
						|
<p>	p_goto_eol(p);</p>
 | 
						|
<p>	pset(p,q);	Move p to same position as q.
 | 
						|
</p>
 | 
						|
<p>	pline(p,n);	Goto to beginning of a specific line.</p>
 | 
						|
<p>	pgoto(p,n);	Goto a specific byte position.</p>
 | 
						|
<p>	pfind(P,unsigned char *s,int len);
 | 
						|
			Fast Boyer-Moore search forward.</p>
 | 
						|
<p>	prfind(P,unsigned char *s,int len);
 | 
						|
			Fast Boyer-Moore search backward.</p>
 | 
						|
<p>		These are very fast- they look at low level
 | 
						|
	data structure and don't go through pgetc().  Boyer-Moore
 | 
						|
	allows you to skip over many characters without reading
 | 
						|
	them, so you can get something like O(n/len).</p>
 | 
						|
<p>  Some facts:</p>
 | 
						|
<p>    Local operations are fast: pgetc(), prgetc().</p>
 | 
						|
<p>    Copy is fast: pset().</p>
 | 
						|
<p>    pline() and pgoto() are slower, but look for the closest existing
 | 
						|
    P to start from.</p>
 | 
						|
<p>    The column number is stored in P, but it is only updated if
 | 
						|
    it is easy to do so.  If it's hard (like you crossed a line
 | 
						|
    boundary backward) it's marked as invalid.  piscol() then has
 | 
						|
    to recalculate it.
 | 
						|
</p>
 | 
						|
<p>  Modifying a buffer:
 | 
						|
</p>
 | 
						|
<p>    binsc(p,int c);		Insert character at p.</p>
 | 
						|
<p>    bdel(P *from,P *to);	Delete character between two Ps.</p>
 | 
						|
<p>  Note that when you insert or delete, all of the Ps after the insertion/
 | 
						|
  deletion point are adjusted so that they continue to point to the same
 | 
						|
  character before the insert or delete.
 | 
						|
</p>
 | 
						|
<p>  Insert and Delete create undo records.
 | 
						|
</p>
 | 
						|
<p>  Insert and Delete set dirty flags on lines which are currently being
 | 
						|
  displayed on the screen, so that when you return to the edit loop, these
 | 
						|
  lines automatically get redrawn.
 | 
						|
</p>
 | 
						|
<p>Internal:
 | 
						|
</p>
 | 
						|
<p>  An edit buffer is made up of a doubly-linked list of fixed sized (4 KB)
 | 
						|
gap buffers.  A gap buffer has two parts: a ~16 byte header, which is always
 | 
						|
in memory, and the actual buffer, which can be paged out to a swap file (a
 | 
						|
vfile- see vfile.h).  A gap buffer consists of three regions: text before
 | 
						|
the gap, the gap and text after the gap (which always goes all the way to
 | 
						|
the end of buffer). (hole and ehole in the header indicate the gap
 | 
						|
position).  The size of the gap may be 0 (which is the case when a file is
 | 
						|
first loaded).  Gap buffers are fast for small inserts and deletes when the
 | 
						|
cursor is at the gap (for delete you just adjust a pointer, for insert you
 | 
						|
copy the data into gap).  When you reposition the cursor, you have to move
 | 
						|
the gap before any inserts or deletes occur.  If you had only one window and
 | 
						|
a single large gap buffer for the file, you could always keep the gap at the
 | 
						|
cursor- the right-arrow key copies one character across the gap.
 | 
						|
</p>
 | 
						|
<p>  Of course for edits which span gap buffers or which are larger than a gap
 | 
						|
buffer, you get a big mess of gap buffer splitting and merging plus
 | 
						|
doubly-linked list splicing.
 | 
						|
</p>
 | 
						|
<p>  Still, this buffer method is quite fast: you never have to do large memory
 | 
						|
moves since the gap buffers are limited in size.  To help search for line
 | 
						|
numbers, the number of newlines '\n's contained in each gap buffer is stored
 | 
						|
in the header.  Reads are fast as long as you have a P at the place you
 | 
						|
want to read from, which is almost always the case.
 | 
						|
</p>
 | 
						|
<p>  It should be possible to quickly load files by mapping them directly into
 | 
						|
memory (using mmap()) and treating each 4KB page as a gap buffer with 0 size
 | 
						|
gap.  When page is modified, use copy-on-write to move the page into the
 | 
						|
swap file (change pointer in header).  This is not done now.  Instead the
 | 
						|
file is copied when loaded.
 | 
						|
</p>
 | 
						|
## Windowing System
 | 
						|
<p>There is a tiny object-oriented windowing system built into JOE.  This is
 | 
						|
the class hierarchy:
 | 
						|
</p>
 | 
						|
<p>SCRN
 | 
						|
  A optimizing terminal screen driver (very similar to 'curses').
 | 
						|
    has a pointer to a CAP, which has the terminal capabilities read
 | 
						|
    from termcap or terminfo.
 | 
						|
</p>
 | 
						|
<p>    writes output to screen with calls to the macro ttputc(). (tty.c is the
 | 
						|
    actual interface to the tty device).
 | 
						|
</p>
 | 
						|
<p>    cpos()    - set cursor position</p>
 | 
						|
<p>    outatr()  - draw a character at a screen position with attributes</p>
 | 
						|
<p>    eraeol()  - erase from some position to the end of the line</p>
 | 
						|
<p>SCREEN
 | 
						|
  Contains list of windows on the screen (W *topwin).
 | 
						|
</p>
 | 
						|
<p>  Points to window with focus (W *curwin).
 | 
						|
</p>
 | 
						|
<p>  Contains pointer to a 'SCRN', the tty driver for the particular terminal
 | 
						|
  type.
 | 
						|
</p>
 | 
						|
<p>W
 | 
						|
  A window on a screen.
 | 
						|
</p>
 | 
						|
<p>  Has position and size of window.
 | 
						|
</p>
 | 
						|
<p>  Has:
 | 
						|
    void *object- pointer to a structure which inherits window (W should
 | 
						|
    really be a base class for these objects- since C doesn't have this
 | 
						|
    concept, a pointer to the derived class is used instead- the derived
 | 
						|
    class has a pointer to the base class: it's called 'parent').
 | 
						|
</p>
 | 
						|
<p>      Currently this is one of:</p>
 | 
						|
<p>        BW *    a text buffer window (screen update code is here.)</p>
 | 
						|
<p>        QW *    query window (single character yes/no questions)</p>
 | 
						|
<p>        MENU *  file selection menu</p>
 | 
						|
<p>      BW * is inherited by (in the same way that a BW inherits a W):</p>
 | 
						|
<p>        PW *    a single line prompt window (file name prompts)</p>
 | 
						|
<p>        TW *    a text buffer window (main editing window).</p>
 | 
						|
<p>    WATOM *watom- Gives type of this window.  The WATOM structure has
 | 
						|
    pointers to virtual member functions.</p>
 | 
						|
<p>    KBD *kbd- The keyboard handler for this window.  When window has
 | 
						|
    focus, keyboard events are sent to this object.  When key sequences
 | 
						|
    are recognized, macros bound to them are invoked.</p>
 | 
						|
<p>Some window are operators on others.  For example ^K E, load a file into a
 | 
						|
window prompt operates on a text window.  If you hit tab, a file selection
 | 
						|
menu which operates on the prompt window appears below this.  When a window
 | 
						|
is the target of operator windows is killed, the operators are killed also.</p>
 | 
						|
<p>Currently all windows are currently the width of the screen (should be fixed
 | 
						|
in the future).  The windows are in a big circular list (think of a big loop
 | 
						|
of paper).  The screen is small window onto this list.  So unlike emacs, you
 | 
						|
can have windows which exist, but which are not on the screen.</p>
 | 
						|
<p>^K N and ^K P move the cursor to next or previous window.  If the next
 | 
						|
window is off the screen it is moved onto the screen, along with any
 | 
						|
operator windows are target it.</p>
 | 
						|
## Macros
 | 
						|
<p>- add something here.
 | 
						|
</p>
 | 
						|
## Screen update
 | 
						|
<p>- add something here.
 | 
						|
</p>
 | 
						|
## Syntax Highlighter
 | 
						|
 | 
						|
<p>There are two parts to the syntax highlighter: the parser (in syntax.c)
 | 
						|
and the line attribute database (in lattr.c).  The parser is simple enough:
 | 
						|
it is a finite state machine interpreter.  You supply a line of text and the
 | 
						|
starting state, and it returns the starting state of the next line and an
 | 
						|
array containing the color to be used for each character.  Each state has a
 | 
						|
color.  The line is scanned left to right, character by character: the state
 | 
						|
and the character index a table, which gives an action, which includes the
 | 
						|
next state.</p>
 | 
						|
 | 
						|
<p>The action specifies the next state, and also indicates whether some
 | 
						|
previous N characters (including the current) should be "recolored" with the
 | 
						|
color of the next state, or whether the color of the current state should be
 | 
						|
used for the character.  The action also indicates whether to move on to the
 | 
						|
next character, or to stay put and re-evaluate the current character with
 | 
						|
the new state.  This leads to simpler hand-coding of the state machine, but
 | 
						|
can create an infinite loop.  The action also indicates if the character
 | 
						|
should be added to a string buffer.  The string buffer can be used (when
 | 
						|
specified by an action) as a key for a hash table look-up (typically filled
 | 
						|
with key-words).  If there is a match, the hash-table entry supplies the
 | 
						|
action instead of the state table.</p>
 | 
						|
 | 
						|
<p>This parser is fast and simple and powerful enough to lexically analyze
 | 
						|
more than 40 languages.  However a few enhancements were made to improve
 | 
						|
both the class of languages which can be highlighted and to improve the ease
 | 
						|
of writing the state machine.  To support "here" documents in perl and sh,
 | 
						|
as well as to simplify matching delimiters (like open with close
 | 
						|
parenthesis), a string (from the string buffer) can be stored along with the
 | 
						|
state.  This can be matched against the string buffer.  So first the
 | 
						|
terminating word from the here document is stored in the string, and then
 | 
						|
the state machine looks forward until it finds this string to determine the
 | 
						|
end of the here document.</p>
 | 
						|
 | 
						|
<p>The state is only a single integer, not an entire stack.  The parser can
 | 
						|
therefore not handle recursive grammars. Unfortunately, some languages have
 | 
						|
recursive elements in their lexical structure.  One way to deal with this is
 | 
						|
to support only a very limited recursion depth by tediously repeating a set
 | 
						|
of states in the hand written state machine N times, where N is your maximum
 | 
						|
recursion depth.  To ease writing this type of thing, and also to allow
 | 
						|
state machine to be reused in cases where one language is embedded within
 | 
						|
another (PHP in HTML, for example), a macro system was added to the state
 | 
						|
machine loader.  Recursion is allowed in the macro calls, but is limited to
 | 
						|
a depth of 5.</p>
 | 
						|
 | 
						|
<p>As of JOE 3.6, the macro instantiation was replaced with a real stack, so
 | 
						|
now there is no recursion depth limit.</p>
 | 
						|
 | 
						|
## Line attribute cache
 | 
						|
 | 
						|
<p>The second part of the syntax highlighter is the line attribute cache.
 | 
						|
In the original implementation, the state of the first line of each window
 | 
						|
is maintained and the parser is invoked during the screen update algorithm
 | 
						|
to highlight all of the lines on the screen.  This works well when the
 | 
						|
window remains in place or when it scrolls forward through the file. 
 | 
						|
However, if the window scrolls backward you may have to reparse the entire
 | 
						|
file or from the nearest previous window if there happens to be one.  This
 | 
						|
can be slow for large files, and certainly wastes a lot of CPU time when
 | 
						|
scrolling backward through a file.</p>
 | 
						|
 | 
						|
<p>One attempt to solve this was to instead of starting at the beginning of
 | 
						|
the file, to instead assume that the line 200 lines (a configurable option)
 | 
						|
back was at the initial state, parse forward from there and hope for the
 | 
						|
best.  Sometimes the state machine would find itself back in the proper
 | 
						|
state, but not always- to the point where most syntaxes had specified to
 | 
						|
just always reparse from the beginning of the file.</p>
 | 
						|
 | 
						|
<p>Finally, the line attribute cache was added.  It stores the parser state
 | 
						|
at the beginning of each line, from the beginning of the file to last line
 | 
						|
actually displayed on the screen.  This solves the scrolling problem, but
 | 
						|
introduces a new one: how to keep the line attribute cache up to date.  This
 | 
						|
turns out to not be so difficult.  First of all, the states are stored in a
 | 
						|
gap-buffer, which makes it easy to perform insertions and deletions. 
 | 
						|
Whenever lines are inserted or deleted in the file, the line attribute cache
 | 
						|
is expanded or contracted to keep in sync.  Now, when any change is made to
 | 
						|
the file, we reparse from just before the start of the change until the
 | 
						|
state of the beginning of a line again matches the corresponding entry in
 | 
						|
the cache.  If a match is found, the rest of the cache is assumed to be
 | 
						|
correct.  Some changes will require a large amount of reparsing, but many
 | 
						|
will not.  In any case, the the highlighting on the screen is always
 | 
						|
correct.</p>
 | 
						|
 | 
						|
## Files
 | 
						|
 | 
						|
<table width="100%" cellspacing=20 border=0 cellpadding=0>
 | 
						|
<colgroup>
 | 
						|
<col width="100">
 | 
						|
<tbody>
 | 
						|
<tr valign="top"><td>b.c</td><td>Text buffer management</td></tr>
 | 
						|
<tr valign="top"><td>blocks.c</td><td>Library: fast memcpy() functions (necessary on really old versions of UNIX).</td></tr>
 | 
						|
<tr valign="top"><td>bw.c</td><td>A class: text buffer window (screen update code is here)</td></tr>
 | 
						|
<tr valign="top"><td>charmap.c</td><td>UNICODE to 8-bit conversion functions</td></tr>
 | 
						|
<tr valign="top"><td>cmd.c</td><td>Table of user edit functions</td></tr>
 | 
						|
<tr valign="top"><td>coroutine.c</td><td>Coroutine library</td></tr>
 | 
						|
<tr valign="top"><td>dir.c</td><td>Directory reading functions (for old UNIXs).</td></tr>
 | 
						|
<tr valign="top"><td>hash.c</td><td>Library: automatically expanding hash table functions.</td></tr>
 | 
						|
<tr valign="top"><td>help.c</td><td>Implement the on-line help window</td></tr>
 | 
						|
<tr valign="top"><td>i18n.c</td><td>Unicode character type information database</td></tr>
 | 
						|
<tr valign="top"><td>kbd.c</td><td>Keymap data structure (keysequence to macro bindings).</td></tr>
 | 
						|
<tr valign="top"><td>lattr.c</td><td>Line attribute cache</td></tr>
 | 
						|
<tr valign="top"><td>macro.c</td><td>Keyboard and joerc file macros</td></tr>
 | 
						|
<tr valign="top"><td>main.c</td><td>Has main() and top level edit loop</td></tr>
 | 
						|
<tr valign="top"><td>menu.c</td><td>A class: menu windows</td></tr>
 | 
						|
<tr valign="top"><td>obj.c</td><td>Semi-automatic objects, plus dynamic string and array library</td></tr>
 | 
						|
<tr valign="top"><td>path.c</td><td>Library: file name and path manipulation functions</td></tr>
 | 
						|
<tr valign="top"><td>poshist.c</td><td>Cursor position history</td></tr>
 | 
						|
<tr valign="top"><td>pw.c</td><td>A class: prompt windows</td></tr>
 | 
						|
<tr valign="top"><td>queue.c</td><td>Library: doubly linked lists</td></tr>
 | 
						|
<tr valign="top"><td>qw.c</td><td>A class: single-key query windows</td></tr>
 | 
						|
<tr valign="top"><td>rc.c</td><td>Joerc file loader</td></tr>
 | 
						|
<tr valign="top"><td>regex.c</td><td>Regular expressions</td></tr>
 | 
						|
<tr valign="top"><td>scrn.c</td><td>Terminal update functions (like curses)</td></tr>
 | 
						|
<tr valign="top"><td>selinux.c</td><td>Secure linux (for RedHat) functions</td></tr>
 | 
						|
<tr valign="top"><td>syntax.c</td><td>Syntax highlighter</td></tr>
 | 
						|
<tr valign="top"><td>tab.c</td><td>Tab completion for file selection prompt</td></tr>
 | 
						|
<tr valign="top"><td>termcap.c</td><td>Load terminal capabilities from /etc/termcap file or terminfo database</td></tr>
 | 
						|
<tr valign="top"><td>tw.c</td><td>A class: main text editing window</td></tr>
 | 
						|
<tr valign="top"><td>ublock.c</td><td>User edit functions: block moves</td></tr>
 | 
						|
<tr valign="top"><td>uedit.c</td><td>User edit functions: basic edit functions</td></tr>
 | 
						|
<tr valign="top"><td>uerror.c</td><td>User edit functions: parse compiler error messages and goto next error, previous error</td></tr>
 | 
						|
<tr valign="top"><td>ufile.c</td><td>User edit functions: load and save file</td></tr>
 | 
						|
<tr valign="top"><td>uformat.c</td><td>User edit functions: paragraph formatting, centering</td></tr>
 | 
						|
<tr valign="top"><td>uisrch.c</td><td>User edit functions: incremental search</td></tr>
 | 
						|
<tr valign="top"><td>umath.c</td><td>User edit functions: calculator</td></tr>
 | 
						|
<tr valign="top"><td>undo.c</td><td>Undo system</td></tr>
 | 
						|
<tr valign="top"><td>usearch.c</td><td>User edit functions: search & replace</td></tr>
 | 
						|
<tr valign="top"><td>ushell.c</td><td>User edit functions: subshell</td></tr>
 | 
						|
<tr valign="top"><td>utag.c</td><td>User edit functions: tags file search</td></tr>
 | 
						|
<tr valign="top"><td>utf8.c</td><td>UTF-8 to unicode coding functions</td></tr>
 | 
						|
<tr valign="top"><td>utils.c</td><td>Misc. utilities</td></tr>
 | 
						|
<tr valign="top"><td>vfile.c</td><td>Library: virtual memory functions (allows you to edit files larger than memory)</td></tr>
 | 
						|
<tr valign="top"><td>w.c</td><td>A class: base class for all windows</td></tr>
 | 
						|
</tbody>
 | 
						|
</table>
 |