Document conversion to W2ML

Character coding

W2ML documents are XML documents, so they should be encoded in Unicode (UTF-8 or UTF-16) for best compatibility. US-ASCII is a subset of UTF-8, so plain ASCII files need no conversion. XML processors are only required to support UTF-8 and UTF-16; support for any other encoding is not guaranteed.

Using GNU recode, the following command converts a Windows-1252 text file (infile) to UTF-8 (outfile):
recode -d Windows-1252..UTF-8 <infile >outfile
The same command also works for the ISO-8859-1 (aka ISO Latin 1) encoding, whose printable characters are a subset of Windows-1252.
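
Recoding a file that is already UTF-8 as if it were Windows-1252 would garble every non-ASCII character, so it can help to check the encoding first. The following is a minimal sketch, not part of the original scripts: it assumes that iconv is installed, and the function names is_utf8 and recode_if_needed are hypothetical.

```shell
#!/bin/bash
# is_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# Assumption: iconv is installed; it exits non-zero on an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# recode_if_needed FILE: recode FILE in place unless it is already valid UTF-8.
# Pure US-ASCII files pass the check and are left untouched.
recode_if_needed() {
    if is_utf8 "$1"; then
        echo "already UTF-8: $1"
    else
        recode Windows-1252..UTF-8 "$1"
    fi
}
```

For example, recode_if_needed infile leaves a US-ASCII or UTF-8 file alone and recodes anything else in place.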

XHTML™ well-formedness and validity

W2ML imposes no validity constraints on documents, but only well-formed XML documents can be parsed. Using HTML Tidy, the following command converts an HTML-like UTF-8 document into a well-formed XHTML™ document:
tidy -n -utf8 -asxhtml infile >outfile
In addition, the following configuration may be used in tidy.conf to avoid the addition of a Tidy meta element and to automatically add a DOCTYPE declaration:

tidy-mark: no
doctype: auto

Entity references

An XML document that contains undefined entity references and has no external DTD is not well-formed. To avoid this, the -n argument of Tidy converts HTML character entity references into numeric character references (&eacute; becomes &#233;). Every HTML character entity reference can be replaced by the corresponding Unicode character, except the five predefined XML entities: &lt;, &gt;, &amp;, &apos;, and &quot;. These need no replacement, because they can be used in any XML document.
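
To check that a converted file no longer relies on HTML entity names, the remaining named references can be listed. This is a minimal sketch using only grep and sort; the function name list_named_entities is hypothetical, and the pattern assumes entity names made of ASCII letters and digits, as in HTML.

```shell
#!/bin/bash
# list_named_entities FILE: print each distinct named entity reference in FILE,
# leaving out the five entities that XML predefines.
# Numeric references such as &#233; do not match the pattern.
# The exit status is non-zero when no such reference remains.
list_named_entities() {
    grep -oE '&[A-Za-z][A-Za-z0-9]*;' "$1" \
        | sort -u \
        | grep -vxE '&(lt|gt|amp|apos|quot);'
}
```

An empty result suggests that the file can be parsed without an external DTD, at least as far as entity references are concerned.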

System identifier in document type declaration

In XML, contrary to SGML, a public identifier may not appear without a system identifier in the document type declaration (see the doctypedecl and ExternalID productions in the XML 1.0 Recommendation). This means that the following declarations, often used for HTML 2.0 and HTML 3.2, are not well-formed:
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
But the HTML 4.01 declaration is well-formed:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
("-//W3C//DTD HTML 4.01//EN" is the public identifier, and "http://www.w3.org/TR/html4/strict.dtd" is the system identifier).
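
Files that still carry such a declaration can be spotted before parsing. This is a minimal sketch; the function name has_public_only_doctype is hypothetical, and it assumes that the declaration sits on a single line and quotes its public identifier with double quotes.

```shell
#!/bin/bash
# has_public_only_doctype FILE: succeed if FILE contains a DOCTYPE declaration
# with a public identifier but no system identifier.
# Assumptions: the declaration is on one line and uses double quotes.
# Matching is case-insensitive because HTML authors also write "doctype".
has_public_only_doctype() {
    grep -qiE '<!DOCTYPE[[:space:]]+[^[:space:]>]+[[:space:]]+PUBLIC[[:space:]]+"[^"]*"[[:space:]]*>' "$1"
}
```

For example, has_public_only_doctype index.html && echo "index.html needs a system identifier" reports a file whose declaration an XML parser would reject.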

Site conversion to W2ML

This scenario demonstrates some Unix scripts and commands that can be useful to convert a static HTML web site to W2ML.

In this scenario, we transform all (ASCII, ISO-8859-1 or Windows-1252) HTML documents into W2ML documents. Note that we keep the same file name extension, so links are not broken. Of course, the HTTP server must be configured to handle .html files with the W2ML handler.

The simplest way is to run the html2w2ml script on each file that has an HTML file name extension:
find example.com -type f \( -iname \*.htm -or -iname \*.html \) -exec html2w2ml '{}' \;
Here is the html2w2ml script. It first converts US-ASCII, ISO-8859-1 and Windows-1252 encoded files to UTF-8, then converts the HTML file in place to an XHTML™ file:
#!/bin/bash
# Usage: html2w2ml FILE
recode Windows-1252..UTF-8 "$1"
tidy -i -wrap 120 -m -q -n -utf8 -asxhtml "$1"
Or a quieter variant, which filters out the warnings about missing and proprietary attributes:
#!/bin/bash
# Usage: html2w2ml FILE
recode Windows-1252..UTF-8 "$1"
tidy -i -wrap 120 -m -q -n -utf8 -asxhtml "$1" 2>&1 | grep -Ev 'lacks|proprietary'
Or a fully silent variant:
#!/bin/bash
# Usage: html2w2ml FILE
recode Windows-1252..UTF-8 "$1"
tidy -i -wrap 120 -m -q -n -utf8 -asxhtml -f /dev/null "$1"
If the file name extension is not enough to find HTML files, then the tryhtml2w2ml script can recognize them; just call it for all files:
find site_dir -type f -exec tryhtml2w2ml '{}' \;
Here is the tryhtml2w2ml script. If the file appears to be HTML, it calls the ./html2w2ml script:
#!/bin/bash
# Convert the file only if file(1) identifies it as HTML.
if file "$1" | grep -q HTML; then
	./html2w2ml "$1"
fi
exit 0
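
Both recode and tidy -m overwrite the file they are given, so a failed conversion can leave a damaged page behind. The following is a minimal sketch of a safety net, not part of the original scripts: the helper name with_backup is hypothetical, and it assumes that the .orig suffix is free to use next to each file. Tidy exits with status 1 when it only has warnings and 2 when it found errors, so only a status of 2 or more is treated as a failure here.

```shell
#!/bin/bash
# with_backup FILE COMMAND [ARGS...]: run "COMMAND ARGS... FILE" while keeping
# a copy of FILE, and restore the copy when the command reports an error.
# Assumption: an exit status of 2 or more means failure (as with Tidy).
with_backup() {
    local file=$1 status=0
    shift
    cp -p -- "$file" "$file.orig" || return 1
    "$@" "$file" || status=$?
    if [ "$status" -ge 2 ]; then
        mv -f -- "$file.orig" "$file"   # put the original back
    else
        rm -f -- "$file.orig"           # conversion kept, drop the copy
    fi
    return "$status"
}
```

For example, with_backup example.com/index.html ./html2w2ml converts one page and rolls it back if the script ends with an error status.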

Useful conversion utilities are GNU Wget, GNU recode, HTML Tidy, and xsltproc from the XSLT C library for GNOME. An XSLT processor can be used to systematically add W2ML markup with a command like xsltproc --novalid --nonet sheet.xslt example.html (the contents of sheet.xslt will depend on the structure of your XHTML documents).