Is there a DTD for HTML? If so, can an XML parser parse HTML documents with
it? I'm looking for a simple way to parse HTML documents the same way I
parse XML documents with JAXP.
HTML is an SGML application, and although it has a DTD (http://www.w3.org/TR/html401/),
SGML DTDs are different from XML DTDs.
An SGML system can work with all XML DTDs (in theory at least) but XML systems
cannot work with all SGML DTDs.
The big problem is that HTML parsers need to infer the structure of the
document when tags (such as </p>) are missing. This is hard and not all tools give
you the same result.
There are two tools worth looking at for getting your HTML into a shape that can be parsed as XML: Dave Raggett's HTML Tidy and James Clark's SGML-to-XML converter (sx, distributed with his SP parser).
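To make the difference concrete, here is a minimal sketch of what JAXP does with the two kinds of input. It assumes the HTML has already been cleaned up into well-formed markup by one of the tools above; the class name HtmlAsXml, its helper methods and the sample strings are made up for illustration. The samples carry no DOCTYPE, so the parser does not try to fetch a DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of JAXP or of any tool named above.
public class HtmlAsXml {

    // Parse markup with the standard JAXP DOM parser.
    // Markup that is not well-formed XML is reported as IllegalArgumentException.
    public static Document parse(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler ignores warnings and recoverable errors,
            // but still throws on fatal (well-formedness) errors.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("Not well-formed XML: " + e.getMessage(), e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if an XML parser accepts the markup as it stands.
    public static boolean isWellFormed(String markup) {
        try {
            parse(markup);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Count <p> elements in markup that is already well-formed.
    public static int countParagraphs(String markup) {
        return parse(markup).getElementsByTagName("p").getLength();
    }

    public static void main(String[] args) {
        // Typical hand-written HTML: the </p> end tags are omitted.
        String raw = "<html><body><p>One<p>Two</body></html>";
        // The same content after a cleanup pass has closed every element.
        String tidied = "<html><body><p>One</p><p>Two</p></body></html>";

        System.out.println("raw well-formed?    " + isWellFormed(raw));
        System.out.println("tidied well-formed? " + isWellFormed(tidied));
        System.out.println("paragraphs in tidied: " + countParagraphs(tidied));
    }
}
```

The XML parser does not try to guess where the missing end tags belong; it simply rejects the raw input. Deciding where they go is the job of the cleanup tool, which is why different tools can give you different trees for the same page.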
This was first published in April 2002