Is there a DTD for HTML?
Is there a DTD for HTML? If so, can an XML parser parse HTML documents with it? I'm looking for a simple way to parse HTML documents the same as I can parse XML documents with JAXP.

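HTML 4.01 is an SGML application, and although it has a DTD (published with the specification at http://www.w3.org/TR/html401/), SGML DTDs are different from XML DTDs. The HTML DTD relies on SGML features, such as optional start and end tags, that XML does not allow, so an XML parser cannot read it.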

An SGML system can work with all XML DTDs (in theory at least) but XML systems cannot work with all SGML DTDs.

The big problem is that HTML parsers need to infer the structure of the document when tags (such as </p>) are missing. This is hard and not all tools give you the same result.

There are two tools worth looking at for getting your HTML into shape to be parseable as XML: Dave Raggett's HTML Tidy and James Clark's SGML-to-XML converter, SX.
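
To see why the clean-up step matters, here is a minimal sketch using only the standard JAXP DOM API. It feeds two strings to a `DocumentBuilder`: a fragment of typical HTML with omitted end tags, and a hand-written well-formed version standing in for what a clean-up tool might produce. The class name `HtmlAsXmlCheck`, the method `parsesAsXml`, and both sample strings are illustrative assumptions, not output from Tidy or SX.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HtmlAsXmlCheck {

    // Hypothetical helper: returns true if a JAXP DOM parser accepts the
    // markup as well-formed XML, false if it reports a parse error.
    public static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Ordinary HTML: </p> is omitted and <br> is never closed.
        String tagSoup = "<html><body><p>First<p>Second<br></body></html>";
        // Hand-written well-formed equivalent (assumed, not real tool output).
        String cleaned = "<html><body><p>First</p><p>Second<br/></p></body></html>";

        System.out.println("tag soup parses as XML: " + parsesAsXml(tagSoup));
        System.out.println("cleaned  parses as XML: " + parsesAsXml(cleaned));
    }
}
```

The first string is rejected because an XML parser does not infer missing end tags; the second is accepted because every element is explicitly closed. Once a document is in the second form, the usual JAXP DOM or SAX code works on it unchanged.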

This was first published in April 2002