XML things that bite you

In this tip, William Brogden covers some XML quirks that frequently mystify Java programmers new to working with XML.

SAXParseException Mysteries

Many SAXParseException reports are fairly easy to interpret, but what are we to make of this one, which you may see when parsing an XML document created or edited in a text editor?

  The processing instruction target matching "[xX][mM][lL]" is not allowed.

The message is especially mystifying when you know you have not tried to create a processing instruction. It turns out that this is what you get if the "<" character of the starting XML declaration

  <?xml version="1.0" encoding="UTF-8"?>

is not the first character in the file, for example because a space or a blank line precedes it. The cause is easy to overlook because the file looks perfectly OK when viewed or printed.
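As a rough illustration, the sketch below reproduces the error with a leading newline and then parses the same text after trimming whatever sits in front of the declaration. The class name DeclarationFix and the helper stripLeadingJunk are hypothetical names invented for this example, and the sketch assumes the XML text is already available as a String.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DeclarationFix {

    // Hypothetical helper: skips a leading byte order mark (U+FEFF) and any
    // whitespace so that "<" becomes the first character of the text.
    static String stripLeadingJunk(String xml) {
        int start = 0;
        while (start < xml.length()
                && (xml.charAt(start) == '\uFEFF'
                    || Character.isWhitespace(xml.charAt(start)))) {
            start++;
        }
        return xml.substring(start);
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // A newline in front of the declaration is enough to trigger the error.
        String bad = "\n<?xml version=\"1.0\" encoding=\"UTF-8\"?><user unum=\"101\"/>";
        try {
            builder.parse(new InputSource(new StringReader(bad)));
        } catch (SAXException e) {
            System.out.println("Rejected: " + e.getMessage());
        }

        // The same text parses once the leading characters are removed.
        Document doc = builder.parse(
            new InputSource(new StringReader(stripLeadingJunk(bad))));
        System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

Trimming is only a workaround for the reader side. Where you control the program or editor that writes the file, it is cleaner to make sure nothing is written ahead of the declaration.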

The Case of the Vanishing Node

For purposes of illustrating this problem, assume you have an XML document used to keep a list of users who sign up online, with this as a typical entry.

<user unum="101">
 <firstname>Bill</firstname>
 <lastname>Brogden</lastname>
</user>

When parsed into memory as an org.w3c.dom.Document, the firstname Element has a child node that is a TEXT_NODE type. Programmers may be tempted to use Java code like the following to get the firstname String:

 // where fnE is the firstname org.w3c.dom.Element reference
 String name = fnE.getFirstChild().getNodeValue();

With similar code to set a new value for firstname:

  fnE.getFirstChild().setNodeValue( newfirstname );

This code will compile and appear to work correctly until the fateful day when for some reason an empty string is used to set the value of the text node. While the Document is still in memory, the above code will continue to work. However, when the Document is serialized to a file with a Transformer, instead of the expected text:

  <firstname></firstname>

What you actually get is:

  <firstname/>

The Transformer recognizes that the firstname element is empty and considers this the preferred form. Now when the revised document is parsed into memory, that firstname element does not have a child Node and the statement --

  String f = fnE.getFirstChild().getNodeValue();

-- causes a NullPointerException, giving the programmer a nasty shock.
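The round trip can be reproduced in a few lines. The following is a minimal sketch rather than production code: the class and method names are made up for the demonstration, and it serializes to a String instead of a file to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class VanishingNodeDemo {

    // Builds <user><firstname>value</firstname></user> in memory, serializes it
    // with a Transformer, parses the result again, and reports whether the
    // reparsed firstname element still has a child node.
    static boolean textChildSurvivesRoundTrip(String value) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element user = doc.createElement("user");
            Element fnE = doc.createElement("firstname");
            fnE.appendChild(doc.createTextNode(value));
            user.appendChild(fnE);
            doc.appendChild(user);

            // Identity transform: DOM tree in, XML text out.
            StringWriter out = new StringWriter();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.transform(new DOMSource(doc), new StreamResult(out));

            Document reparsed = builder.parse(
                new InputSource(new StringReader(out.toString())));
            Element again =
                (Element) reparsed.getElementsByTagName("firstname").item(0);
            return again.getFirstChild() != null;
        } catch (Exception e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Bill survives: " + textChildSurvivesRoundTrip("Bill"));
        System.out.println("empty survives: " + textChildSurvivesRoundTrip(""));
    }
}
```

With a non-empty value the text node comes back after reparsing. With an empty string it does not, so any code that calls getFirstChild() without a null check fails at that point.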

The solution is of course to code defensively, checking for the presence of the child node and providing a default value if it is not there. For example, use this code to get the first name:

  // where fnE is the firstname org.w3c.dom.Element reference
  Node fnNode = fnE.getFirstChild();
  String name;
  if( fnNode == null ){
     name = "";   // no child node, so fall back to a default value
  } else {
     name = fnNode.getNodeValue();
  }

Setting a firstname value when the child node does not exist requires more complex code because we have to create the Node first.

  Node fnNode = fnE.getFirstChild();
  if( fnNode == null ){
     Document doc = fnE.getOwnerDocument();
     fnNode = doc.createTextNode( newfirstname) ;
     fnE.appendChild( fnNode );
  } else {
     fnNode.setNodeValue( newfirstname );
  } 
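If your parser supports DOM Level 3 (the library bundled with Java 1.5 and later does), the getTextContent and setTextContent methods of org.w3c.dom.Node offer another route, since they do not require a child node to exist. This is an alternative to the null checks shown here, not a replacement recommended by any specification; the wrapper class TextAccess and its method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class TextAccess {

    // Returns the element's text, or "" when it has no text at all.
    static String getText(Element e) {
        String s = e.getTextContent();
        return s == null ? "" : s;
    }

    // Replaces whatever children the element has with the given text.
    static void setText(Element e, String value) {
        e.setTextContent(value == null ? "" : value);
    }

    // Demo helper: creates a document whose root is an empty element.
    static Element emptyElement(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element e = doc.createElement(name);
            doc.appendChild(e);
            return e;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException("no DOM implementation available", ex);
        }
    }

    public static void main(String[] args) {
        Element fnE = emptyElement("firstname");
        System.out.println("[" + getText(fnE) + "]");   // element has no child yet
        setText(fnE, "Bill");
        System.out.println("[" + getText(fnE) + "]");
    }
}
```

Note that setTextContent discards all existing children, so it is only appropriate for elements that are meant to hold nothing but text.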

Why is my Document null?

Some programmers have been accustomed to writing code like the following in a method that parses XML into a Document object to assure themselves that the Document was created.

  // where builder is an instance of DocumentBuilder
  Document doc = builder.parse( f );
  System.out.println("Document is:" + doc );

With the Java 1.5 XML library this results in output that looks like:

  Document is:[#document: null]

To a new programmer this appears to be saying that the Document has no content. Actually all it is saying is that you do have a Document object. I always find the Javadocs table for the Node interface to be a big help in cases like this. It tells you what to expect from the getNodeName and getNodeValue methods for various DOM objects. Here is a link to Sun's online documentation for org.w3c.dom.Node: http://java.sun.com/j2se/1.5.0/docs/api/org/w3c/dom/Node.html.

From this table you can see that the getNodeValue() method always returns null from a Document type node. The toString() method for Document combines the "#document" name plus the value from getNodeValue().
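A more informative check is to look at the document element rather than at the Document itself. The sketch below is one way to do that; the class DocumentCheck and its methods are illustrative names, and it parses from a String only to stay self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DocumentCheck {

    // Parses XML text into a Document, wrapping checked exceptions for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse", e);
        }
    }

    // Returns the name of the root element, or null if there is none.
    static String rootName(String xml) {
        Element root = parse(xml).getDocumentElement();
        return root == null ? null : root.getNodeName();
    }

    public static void main(String[] args) {
        String xml = "<user unum=\"101\"><firstname>Bill</firstname></user>";
        Document doc = parse(xml);
        // For a Document node the name is fixed and the value is always null.
        System.out.println("node name:  " + doc.getNodeName());
        System.out.println("node value: " + doc.getNodeValue());
        // The root element says far more about what was actually parsed.
        System.out.println("root:       " + rootName(xml));
    }
}
```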

Unicode Errors

A very frustrating type of error you may encounter when dealing with XML documents is the invalid Unicode character error, resulting in an exception report that looks like this:

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1a) was found in the element content of the document.

Or perhaps the even more alarming:

java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.

Chasing down the source of invalid characters can be quite a detective job, especially since they may look perfectly normal on casual examination.

The 0x1a character turns out to be a control code used as an end of file mark in certain applications. In one case, this character ended up in a database field and was subsequently inserted in an XML document with unfortunate results.
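When text comes from a database field or another untrusted source, it can be checked before it goes into a document. XML 1.0 permits only tab, line feed, carriage return, and code points from 0x20 upward, excluding surrogates, 0xFFFE and 0xFFFF. The following sketch tests a String against that rule; XmlCharScanner and its method names are hypothetical.

```java
final class XmlCharScanner {

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first illegal character, or -1 if there is none.
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isLegalXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        String fromDatabase = "Brogden" + (char) 0x1A;   // trailing end-of-file mark
        int bad = firstIllegalIndex(fromDatabase);
        if (bad >= 0) {
            System.out.println("Illegal character 0x"
                + Integer.toHexString(fromDatabase.codePointAt(bad))
                + " at index " + bad);
        }
    }
}
```

What to do with an offending character (reject the input, drop it, or substitute a space) depends on the application, so the sketch only reports the position.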

A common source of an invalid character is a document created with a word processor that uses Microsoft's convention for "smart" punctuation. If you see different characters for open and close quotes, you have "smart" punctuation.

Unfortunately, Microsoft selected character codes for "smart" punctuation that lie in the range 0x82 through 0x95, which Unicode reserves for control codes. When such a byte ends up in a file that the parser reads as UTF-8, it does not form a valid UTF-8 sequence, which is what the UTFDataFormatException above is complaining about. Thus when Microsoft documents are used as a source for cut and paste operations with XML documents there is a danger of introducing characters that will prevent a document from parsing.
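One way to find out whether a file that claims to be UTF-8 contains such bytes is to decode it strictly before handing it to the parser. The sketch below uses the standard java.nio.charset decoder with error reporting turned on; the class name Utf8Check is made up, and the sample bytes stand in for text pasted from a word processor.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {

    // True only if every byte sequence in the array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the single-byte "smart" double quotes.
        byte[] pasted = { '<', 'a', '>', (byte) 0x93, 'h', 'i', (byte) 0x94,
                          '<', '/', 'a', '>' };
        byte[] clean = "<a>hi</a>".getBytes(StandardCharsets.UTF_8);
        System.out.println("pasted text valid UTF-8? " + isValidUtf8(pasted));
        System.out.println("clean text valid UTF-8?  " + isValidUtf8(clean));
    }
}
```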

If you have a SAXParseException, you can extract the location of the offending character from the exception with code like the following:

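 try {
    // where builder is an instance of DocumentBuilder
    Document doc = builder.parse( f );
 } catch( SAXParseException spe ){
    // the exception records where in the input the parser gave up
    String err = spe.toString() +
       "\n  Line number: " + spe.getLineNumber() +
       "\nColumn number: " + spe.getColumnNumber() +
       "\n    Public ID: " + spe.getPublicId() +
       "\n    System ID: " + spe.getSystemId();
    System.out.println( err );
 }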
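To try this out without a damaged file at hand, the fragment can be wrapped in a small self-contained program. The version below is a sketch: the class ParseErrorLocation and the method errorLine are invented for the example, and the input is a String containing a deliberate 0x1a character on its second line.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class ParseErrorLocation {

    // Returns the line number reported by the parser for the first fatal
    // error, or -1 if the text parses cleanly.
    static int errorLine(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return -1;
        } catch (SAXParseException spe) {
            return spe.getLineNumber();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("unexpected failure", e);
        }
    }

    public static void main(String[] args) {
        // The control character sits inside <b> on the second line.
        String damaged = "<a>\n<b>" + (char) 0x1A + "</b>\n</a>";
        System.out.println("Error reported on line " + errorLine(damaged));
    }
}
```

The default error handler may also print its own message to standard error. Installing an org.xml.sax.ErrorHandler on the builder lets you control that.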

It is also a big help if you have a programmer's editor which can switch between display of normal text and character codes in hex. Personally I am fond of UltraEdit-32 for this sort of detective work.

About the author
Bill Brogden is a computer consultant who enjoys exploring new technologies. He has written study guides for Java certifications and several books on using XML with Java. You can reach Bill at wbrogden@bga.com.


This was first published in February 2006
