International Characters and GED2HTML

Starting with version 3.0, GED2HTML includes some explicit support for alternate character sets, including international characters. The GEDCOM 5.5 specification requires the use of the ANSEL character set, but in my experience many GEDCOMs are not actually encoded using this character set. More common are ASCII, ISO-Latin-1 (ISO-8859-1), and IBM-PC encodings based on various DOS code pages, the last of which are in fact explicitly disallowed by the GEDCOM 5.5 specification!

Internally, GED2HTML "supports" the ISO-Latin-1 character set. What this really means is that, aside from possibly converting ASCII characters with codes 127 or less from lower case to upper case, the processing performed by GED2HTML simply treats characters in the GEDCOM as 8-bit values to be passed through from input to output. According to my reading of the HTML specification, a compliant "user agent" (e.g. a browser) should accept HTML files encoded using this character set. Netscape, for example, will read and display such characters properly. If a browser "barfs" on characters with codes 128 and above, then it is not an HTML-compliant browser.

A number of people have complained that GED2HTML versions prior to version 3.0 "didn't support international characters". What this generally meant was that their GEDCOMs were encoded in some IBM-PC character set, and when the 8-bit codes from their GEDCOMs were passed through to the HTML output and interpreted by an HTML-compliant browser, the characters that were displayed by the browser did not appear the same as the ones they originally entered using their genealogy program.
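The effect described above can be reproduced in a few lines of Python. This is an illustration only, not GED2HTML code; the choice of DOS code page 437 as the "IBM-PC character set" is an assumption, and the helper name is hypothetical.

```python
def as_seen_by_browser(text: str, source_codec: str = "cp437") -> str:
    """Encode text the way a DOS genealogy program might (assumed: cp437),
    then decode the same bytes as ISO-Latin-1, as a browser would."""
    raw = text.encode(source_codec)   # bytes written into the GEDCOM
    return raw.decode("latin-1")      # bytes passed through and shown as Latin-1


print(as_seen_by_browser("Muñoz"))  # Mu¤oz
print(as_seen_by_browser("Weiß"))   # Weiá
```

The bytes never change on the way from GEDCOM to HTML; only their interpretation does, which is why the output looks wrong even though nothing was "lost".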

To help with this problem, GED2HTML version 3.0 and later applies the following procedure, which is intended to make more GEDCOMs produce output that matches the user's expectations:

1. When a GEDCOM is read, GED2HTML uses the information in the CHAR field of the HEAD record to attempt to determine the character set in which the GEDCOM is encoded. Currently, GED2HTML recognizes the tokens "ASCII", "ANSI", "ANSEL", "IBM WINDOWS", "MS-DOS", and "IBMPC". The default is ANSI (ISO-Latin-1), in case the character set cannot otherwise be determined.
2. As the GEDCOM is read in, GED2HTML attempts to translate it, from the character set in which it is encoded, to the "equivalent" encoding in ISO-Latin-1. It is not always possible to do an exact job of this, because, for example, there are ANSEL sequences for which there is no ISO-Latin-1 equivalent. Anyway, it does the best it can, and I am open to suggestions for improvement.
3. Once the data has been translated internally into ISO-Latin-1, no further change (other than possible lower/upper case conversions to characters with codes less than 128) is made to the data, before it is eventually emitted as HTML output.
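The procedure above can be sketched roughly as follows in Python. This is not GED2HTML's implementation. The token-to-codec table is an assumption (in particular, which code pages "IBM WINDOWS", "MS-DOS", and "IBMPC" stand for), all function and variable names are hypothetical, and ANSEL is left out because Python's standard library has no ANSEL codec, so ANSEL input simply falls through to the default here.

```python
import re
from typing import Optional

# Hypothetical mapping from CHAR tokens to Python codec names.
CODECS = {
    "ASCII": "ascii",
    "ANSI": "latin-1",
    "IBM WINDOWS": "cp1252",  # assumption
    "MS-DOS": "cp437",        # assumption
    "IBMPC": "cp437",         # assumption
}
DEFAULT_CODEC = "latin-1"     # the documented default, ANSI (ISO-Latin-1)

# Simplification: take the first level-1 CHAR line found anywhere in the file.
CHAR_LINE = re.compile(rb"^\s*1\s+CHAR\s+(.+?)\s*$", re.MULTILINE)


def declared_charset(raw: bytes) -> Optional[str]:
    """Return the CHAR token from the GEDCOM bytes, or None if absent."""
    match = CHAR_LINE.search(raw)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace").upper()


def to_latin1(raw: bytes) -> bytes:
    """Translate GEDCOM bytes to ISO-Latin-1 on a best-effort basis.
    Characters with no Latin-1 equivalent become '?'."""
    codec = CODECS.get(declared_charset(raw), DEFAULT_CODEC)
    text = raw.decode(codec, errors="replace")
    return text.encode("latin-1", errors="replace")


sample = b"0 HEAD\n1 CHAR IBMPC\n0 @I1@ INDI\n1 NAME J\x81rgen /M\x81ller/\n"
print(to_latin1(sample).decode("latin-1"))
```

After translation the name line reads "Jürgen /Müller/" when interpreted as ISO-Latin-1, which is what a compliant browser will display.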

If you find that GED2HTML is assuming the wrong character set for your GEDCOM, you should override what the GEDCOM says by setting the CHARACTER_SET output interpreter variable to the appropriate string. NOTE: Brother's Keeper is known to lie about the character set it has used to encode the GEDCOM, using

1 CHAR IBMPC

when it really means

1 CHAR ANSI

This is a typical circumstance in which you would need to override the choice of character set. This would be done, e.g., by putting -D CHARACTER_SET=ANSI in the "Additional Options" field of the dialog box under Windows, or on the command line under Unix.
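To see why the override matters, here is a small Python sketch of the Brother's Keeper case. It is an illustration under assumptions: "IBMPC" is taken to mean DOS code page 437, and the function name and precedence rule (override first, then the declared CHAR value, then the ANSI default) are a hypothetical reading of the behavior described above.

```python
from typing import Optional

CODECS = {"ANSI": "latin-1", "IBMPC": "cp437"}  # cp437 is an assumption


def effective_charset(declared: Optional[str], override: Optional[str] = None) -> str:
    """An explicit override wins, then the GEDCOM's CHAR value, then ANSI."""
    return override or declared or "ANSI"


# The file claims IBMPC, but the bytes are really ANSI (ISO-Latin-1).
raw_name = "Schröder".encode("latin-1")

wrong = raw_name.decode(CODECS[effective_charset("IBMPC")])
right = raw_name.decode(CODECS[effective_charset("IBMPC", override="ANSI")])

print(wrong)  # Schr÷der
print(right)  # Schröder
```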

Locales

One problem I have had with the above scheme is figuring out a reasonable way of doing lower/upper case conversions on characters with codes 128 and above. A worse problem is obtaining the proper collating sequence for sorting names into alphabetical order, because the proper ordering can depend on the particular (human) language being used. It now appears to me that the so-called "locale" support is maturing under many operating systems, so starting in GED2HTML version 3.5, I am relying on this support to perform the proper lower/upper case conversions and comparison operations. If you find that conversions are not being done properly, you might be able to modify the default behavior by explicitly setting the LOCALE output interpreter variable. See the documentation on variables and the output interpreter for more details. If all else fails, the best advice I can offer is to turn off lower/upper case conversion in surnames by setting the option variable UPPER_CASE_SURNAMES to 0.
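Both problems can be illustrated with Python's standard locale module. This is a sketch of the general technique, not of what GED2HTML does internally; the function names are hypothetical, and which locale names are available depends entirely on the operating system.

```python
import locale


def sort_names(names, locale_name=""):
    """Sort names with the collation rules of locale_name ("" means the
    user's default locale). Falls back to plain code-point order if the
    requested locale is not installed."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(names, key=locale.strxfrm)
    except locale.Error:
        return sorted(names)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


def ascii_upper(name_latin1: bytes) -> bytes:
    """Upper-case only the ASCII letters, leaving codes 128 and above alone."""
    return name_latin1.upper()


print(ascii_upper("Müller".encode("latin-1")).decode("latin-1"))  # MüLLER
print(sort_names(["Smith", "adams", "Jones"], "C"))
```

The first line of output shows why ASCII-only conversion is unsatisfying for surnames: the "ü" stays in lower case. In the "C" locale the sort is by raw character code, so all upper-case initials come before lower-case ones; a language-specific locale, where installed, would order the names differently.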

Some people seem to feel compelled to put HTML "entity codes" (e.g. &ouml;) in their GEDCOMs. My opinion on this is that these codes are HTML-specific, and have no business being in a GEDCOM. If I were trying to be really nice to non-compliant HTML browsers, I would translate ISO-Latin-1 characters in the range 128-255 to their HTML "entity codes" when creating the HTML files. I might still implement this, but it is not a high priority, because in my opinion (supported by the HTML spec cited above) a browser that cannot display these codes is broken.
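For readers who want such a translation today, it can be done as a post-processing step. The following Python sketch shows one possible way; it is not a GED2HTML feature, and the function name is hypothetical. It uses the entity table from Python's standard html.entities module and falls back to a numeric reference where no named entity exists.

```python
from html.entities import codepoint2name


def latin1_to_entities(text: str) -> str:
    """Replace characters with codes 128-255 by HTML entity references.
    Characters below 128 (including &, <, >) are left untouched."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if 128 <= code <= 255:
            name = codepoint2name.get(code)
            pieces.append(f"&{name};" if name else f"&#{code};")
        else:
            pieces.append(ch)
    return "".join(pieces)


print(latin1_to_entities("Müller"))  # M&uuml;ller
print(latin1_to_entities("Søren"))   # S&oslash;ren
```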
