Problems converting HTML to other languages
My company wants to convert its HTML files to XML for two reasons: (1) so that we can present our documents in any format, and (2) so that we can extract parts of our documents to create new custom configurations. I can convert our HTML to XHTML and then to our own XML, for which I have a DTD (and schema), but I am basically just "wrapping" the XHTML in the elements I've defined. Converting back to HTML is not a problem, as I can use <xsl:copy-of select="."/> and get the old HTML back. However, converting to WML or anything else is not possible right now because the text is full of old HTML tags: <ol>, <ul>, <li> and even <table>, <tr> and <td>, which will not work in WML and probably not in other formats either. Any suggestions?
The crux of the problem is that you are wrapping the HTML elements in your new DTD. Instead of wrapping whatever markup happens to exist, add a subset of XHTML's tag set to your own DTD. Limit that subset to simple presentation elements that you know you can either use directly or map to an equivalent in other formats, e.g. p, b and perhaps ul. Note that table markup is particularly troublesome, as HTML often uses it to achieve layout effects that are difficult or impossible to reproduce in other formats.
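As a rough illustration of why a small, known subset helps, here is a minimal XSLT 1.0 sketch that maps such a subset to WML 1.1. The element names doc, section and title are hypothetical stand-ins for whatever your own DTD defines, and the sketch assumes the borrowed elements (p, b, ul, li) are in no namespace and that lists are not nested. Treat it as a starting point under those assumptions, not a complete conversion.

```xml
<?xml version="1.0"?>
<!-- Sketch only: "doc", "section" and "title" are hypothetical element names
     standing in for your own DTD; p, b, ul and li are the borrowed XHTML subset. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml"
              doctype-public="-//WAPFORUM//DTD WML 1.1//EN"
              doctype-system="http://www.wapforum.org/DTD/wml_1.1.xml"/>

  <!-- One WML deck per document, one card per section -->
  <xsl:template match="doc">
    <wml>
      <xsl:apply-templates select="section"/>
    </wml>
  </xsl:template>

  <xsl:template match="section">
    <card id="s{position()}" title="{title}">
      <!-- Only the known subset is pulled through; anything else is dropped -->
      <xsl:apply-templates select="p | ul"/>
    </card>
  </xsl:template>

  <!-- p and b also exist in WML, so they map one to one -->
  <xsl:template match="p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <xsl:template match="b">
    <b><xsl:apply-templates/></b>
  </xsl:template>

  <!-- WML has no list elements: render a list as one paragraph,
       one item per line (assumes lists are not nested) -->
  <xsl:template match="ul">
    <p><xsl:apply-templates select="li"/></p>
  </xsl:template>

  <xsl:template match="li">
    <xsl:text>- </xsl:text>
    <xsl:apply-templates/>
    <xsl:if test="position() != last()"><br/></xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Because every element that can appear in the content has an explicit template, each new output format only needs its own small set of mapping rules. With <xsl:copy-of select="."/>, by contrast, whatever HTML happens to be in the source passes straight through to the output.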
This was first published in May 2003