Java parsers that follow the JAXP API provide fine-grained control over the type of validity checking performed. In addition to the standard library parsers, other parsers, such as those created by the Apache Xerces project, implement the JAXP API and can serve as direct replacements.
The most basic validity requirement for an XML document is that it must be "well-formed." A well-formed document follows the XML syntax requirements, such as having a single root element, having all tags properly terminated and using a legal character set. All XML parsers can detect failure to conform to the syntax rules.
When using the standard Java JAXP API, parsers are obtained from "factory" methods rather than by creating an instance of a parser class directly. For example, to get a DOM parser you create an instance of DocumentBuilderFactory and ask it for a DocumentBuilder. The class used for the creation of the factory instance can be set as a system property, although most users will simply accept the default.
With a factory instance in hand, you then call various methods to tell the factory the characteristics of the parser you want it to build, for example, whether the parser is to verify the XML according to a schema. If the factory can't create a parser with those characteristics, it throws an exception. These extra levels of indirection are necessary to meet the JAXP goal of an API that is independent of the underlying parser implementation. They provide flexibility at the cost of extra code.
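As a rough illustration of the factory steps just described, here is a minimal sketch. The class and method names (FactoryConfigSketch, newValidatingBuilder) are made up for this example; the JAXP calls themselves are standard.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical helper class, for illustration only.
class FactoryConfigSketch {

    /** Asks the JAXP factory for a namespace-aware, DTD-validating DOM parser. */
    static DocumentBuilder newValidatingBuilder() throws ParserConfigurationException {
        // newInstance() chooses the factory implementation; the choice can be
        // overridden with the javax.xml.parsers.DocumentBuilderFactory system property.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

        // Describe the parser we want before asking for it.
        factory.setNamespaceAware(true);
        factory.setValidating(true); // validate against a DTD, if the document has one

        // Throws ParserConfigurationException if no such parser can be built.
        return factory.newDocumentBuilder();
    }
}
```

The returned DocumentBuilder reports the requested characteristics through isValidating() and isNamespaceAware(), which is a quick way to confirm what the factory actually built.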
Generally speaking, schemas provide a way to specify rules for the content and structure of XML documents. The most basic format for rule specification, as found in XML 1.0, is the DTD or Document Type Definition. A DTD simply defines the allowed names of elements and attributes and the rules for nesting them in a complete document. Although DTDs fit the lax definition of a schema, by convention the term "XML schema" is used to refer to systems more complex than DTDs.
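To make the DTD idea concrete, the sketch below validates a document against a tiny DTD carried in the document's own DOCTYPE declaration. The note/to vocabulary and the class name DtdValidationSketch are invented for this example. The handler rethrows validity errors so that an invalid document stops the parse.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class DtdValidationSketch {

    // Made-up DTD: a <note> must contain exactly one <to>, which holds text.
    static final String DOCTYPE =
        "<!DOCTYPE note [ <!ELEMENT note (to)> <!ELEMENT to (#PCDATA)> ]>";

    /** Returns true if the given document body is valid against the DTD above. */
    static boolean isValid(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // turn on DTD validation

        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler already rethrows fatal errors; also rethrow validity errors.
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e;
            }
        });

        try {
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, or does not follow the DTD
        }
    }
}
```

With this DTD, a body of `<note><to>Ann</to></note>` passes, while `<note><from>Ann</from></note>` fails because the from element is never declared.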
Expanded capabilities now built into the Java 1.5 standard library in the javax.xml.validation package provide a Schema class that can represent more complex validation rules. You can tell a parser factory to create a schema-aware parser by supplying a Schema object. The W3C 2001 XML Schema recommendation, the product of two years of effort by XML experts, is the only schema language implemented in the standard Java 1.5 library. With the W3C schema definition language you can specify requirements such as a numeric value having to fall inside a given range.
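The following sketch shows one way to build a schema-aware DOM parser. The schema is a made-up example that restricts an age element to the range 0 to 150, and the class name SchemaAwareParserSketch is likewise hypothetical. Schema validation requires a namespace-aware parser, and the error handler here rethrows validity errors so they surface as exceptions.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, for illustration only.
class SchemaAwareParserSketch {

    // Made-up W3C schema: <age> must be an integer between 0 and 150.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='age'>"
      + "<xs:simpleType><xs:restriction base='xs:int'>"
      + "<xs:minInclusive value='0'/><xs:maxInclusive value='150'/>"
      + "</xs:restriction></xs:simpleType>"
      + "</xs:element></xs:schema>";

    /** Parses the XML with a schema-aware parser; false means it broke a rule. */
    static boolean isValid(String xml) throws Exception {
        SchemaFactory schemaFactory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = schemaFactory.newSchema(new StreamSource(new StringReader(XSD)));

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for W3C schema validation
        factory.setSchema(schema);       // ask for a schema-aware parser

        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });

        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

Under these assumptions, `<age>42</age>` is accepted and `<age>200</age>` is rejected because it exceeds the declared maximum.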
Next I want to discuss what happens when a parser detects an invalid document and contrast the DOM and SAX models. With DOM processing, after you have configured the parser, it takes over and tries to completely parse the document. The only control you have over the treatment of possible validation errors is by supplying the parser with an object implementing the ErrorHandler interface. If you do not supply a custom ErrorHandler, the outcome depends on the parser's default handler: a fatal error such as malformed syntax results in an exception being thrown, while validity errors may merely be reported to the console. When an exception is thrown, no document object is built and all of the information parsed out of the document up to the error is lost.
When writing code to handle parsing exceptions, you should not rely on the typical Java printStackTrace() method. That may give you a cause, but it will not tell you where the error occurs in the document. Your code should first try to catch a SAXParseException, which may be able to tell you the location of the XML text causing the problem in terms of line number and column number. See the Javadocs for the org.xml.sax package for details.
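Here is a minimal sketch of that catch ordering for a DOM parse. The class and method names (ParseErrorLocationSketch, describeFailure) are made up for this example. The more specific SAXParseException is caught before its parent SAXException so that the location information is not lost.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, for illustration only.
class ParseErrorLocationSketch {

    /** Returns null if the XML parses, otherwise a message saying what went wrong. */
    static String describeFailure(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // Most specific first: this exception knows where the problem is.
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                + ": " + e.getMessage();
        } catch (SAXException e) {
            // Other SAX problems carry no document location.
            return "parse error: " + e.getMessage();
        } catch (ParserConfigurationException e) {
            return "parser configuration error: " + e.getMessage();
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        }
    }
}
```

For a document such as `<a><b></a>`, the returned message starts with the line and column of the offending text, which is far more useful to the document's author than a stack trace.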
With SAX processing, your custom code will have to handle events that represent three kinds of parsing errors: warnings, plain errors such as failure to follow a DTD, and fatal errors such as errors in syntax. Your custom event handling methods will have received valid data up to the point of the parse error. Your code may be able to recover usable data from the events already processed and may be able to provide extra information in the error event reporting.
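The sketch below shows one possible shape for such a handler. It extends DefaultHandler, records element names as they arrive, and overrides the three error callbacks. The class name SaxRecoverySketch and its fields are invented for this example, and a real handler would collect whatever data its application needs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
class SaxRecoverySketch extends DefaultHandler {
    final List<String> elementsSeen = new ArrayList<String>();
    final List<String> problems = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elementsSeen.add(qName); // data received before any error stays available
    }

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        // Recoverable (for example, a validity error): record it and keep parsing.
        problems.add("error at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Not recoverable (for example, a syntax error): record it and stop.
        problems.add("fatal at line " + e.getLineNumber() + ": " + e.getMessage());
        throw e;
    }

    /** Parses the XML and returns the handler with whatever it managed to collect. */
    static SaxRecoverySketch parse(String xml) throws Exception {
        SaxRecoverySketch handler = new SaxRecoverySketch();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXParseException e) {
            // Already recorded in fatalError(); the earlier events are kept.
        }
        return handler;
    }
}
```

Parsing a broken document such as `<order><item/><item></order>` leaves the element names seen before the failure in elementsSeen and a single fatal entry in problems.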
Expanded capabilities now built into the Java 1.5 standard library in the javax.xml.validation package provide for more complex validation approaches outside the parser classes. A Validator object can work with files, stream sources or in-memory document objects. This capability means that you can parse a document into a DOM object with a less stringent parser and then check it against various schemas with a validator.
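As a sketch of that parse-first, validate-later approach, the code below parses a document with a non-validating parser and then checks the resulting DOM against whichever schema is passed in. The class name PostParseValidationSketch, its method names, and the price schemas used in the usage note are all assumptions made for this example. When no ErrorHandler is set, a Validator throws a SAXException on the first validity error, and the sketch relies on that default.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class PostParseValidationSketch {

    /** Compiles W3C XML Schema text into a reusable Schema object. */
    static Schema compile(String xsd) throws SAXException {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return factory.newSchema(new StreamSource(new StringReader(xsd)));
    }

    /** Parses with a non-validating parser: only well-formedness is checked. */
    static Document parseLeniently(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Checks an already-parsed document against one schema. */
    static boolean conformsTo(Document doc, Schema schema) throws IOException {
        Validator validator = schema.newValidator();
        try {
            validator.validate(new DOMSource(doc));
            return true;
        } catch (SAXException e) {
            return false; // the default behavior is to throw on the first error
        }
    }
}
```

For instance, a document `<price>-5</price>` could be parsed once and then checked against both a loose schema that declares price as xs:string and a strict one that restricts it to a non-negative xs:decimal; it would pass the first check and fail the second.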
This was first published in December 2005