As organizations look for data to make available through Web services, publishing, or data mining, they will inevitably find data in odd formats. Writing a new specialized program to convert each oddball format into something usable can consume a great deal of time. However, some existing toolkits built around the XML pipeline processing concept may save you much of that effort. I did some digging and located some interesting goodies.
The ServingXML Toolkit
The open source ServingXML toolkit, hosted at SourceForge, supports a wide variety of content transformations. The download, currently at version 0.5.4, bundles an extensive set of libraries for parsers and other tools, such as the Saxon XSLT processor and the Apache FOP toolkit, so you do not have to chase down other libraries to get started.
The architecture is designed to make it easy to add your own custom components. Furthermore, data sources and output formats are not limited to simple files. Here I summarize a few of the many examples provided with the toolkit, just to give you an idea of the flexibility and power available:
- Converting a single CSV (comma separated values) flat file to XML.
- Combining multiple text files, in which values start in specific columns, into XML.
- Converting an EDI (Electronic Data Interchange) file to XML with a complex hierarchy.
- Converting a Java properties file with multi-line properties and comments to XML.
- Performing an SQL query and writing the results as XML.
- Converting XML files to CSV and other styles of flat file.
- Output of any of the above examples to a remote FTP server.
Like the other toolkits discussed below, ServingXML uses XML documents to set up the pipeline of processes that the various components then execute. This approach opens up the possibility that a Web service could create a custom pipeline configuration and output with no programmer intervention.
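To make the idea of a pipeline driven by an XML document concrete, here is a minimal sketch in Python. It does not use ServingXML: the `<pipeline>` and `<step>` vocabulary, the step names, and the sample CSV data are all invented for illustration. The point is only that the sequence of processing steps lives in an XML document, so another program could generate that document without anyone writing new code.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical pipeline description. The element and attribute names are
# invented for this sketch and are NOT ServingXML's actual vocabulary.
PIPELINE_XML = """
<pipeline>
  <step type="read-csv"/>
  <step type="rename-field" from="qty" to="quantity"/>
  <step type="write-xml" root="orders" record="order"/>
</pipeline>
"""

def read_csv(data, attrs):
    # Turn CSV text into a list of dicts keyed by the header row.
    return list(csv.DictReader(io.StringIO(data)))

def rename_field(records, attrs):
    # Rename one field in every record.
    old, new = attrs["from"], attrs["to"]
    return [{(new if k == old else k): v for k, v in r.items()} for r in records]

def write_xml(records, attrs):
    # Serialize as one element per record with one child element per field.
    root = ET.Element(attrs["root"])
    for r in records:
        rec = ET.SubElement(root, attrs["record"])
        for k, v in r.items():
            ET.SubElement(rec, k).text = v
    return ET.tostring(root, encoding="unicode")

# Registry mapping the step types named in the XML to Python functions.
STEPS = {"read-csv": read_csv, "rename-field": rename_field, "write-xml": write_xml}

def run_pipeline(config_xml, data):
    # Execute each <step> in document order, feeding each output to the next.
    for step in ET.fromstring(config_xml).findall("step"):
        data = STEPS[step.get("type")](data, step.attrib)
    return data

CSV_TEXT = "id,item,qty\n1,widget,3\n2,gadget,5\n"  # invented sample data
result = run_pipeline(PIPELINE_XML, CSV_TEXT)
print(result)
```

Changing the output, for example dropping the rename step or choosing different element names, means editing only the XML description, not the step functions.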
SmallX XML Infoset & Pipelining Toolkit
SmallX is a toolkit emphasizing complete handling of the XML Infoset. The current release, 1.0, is more compact than the ServingXML download. Important aspects of this toolkit include the ability to mix XSLT and XPath operations with other pipeline component types, an emphasis on integration with J2EE, and integration with the NetBeans IDE.
To give you an idea of what pipeline programming with SmallX looks like, here are the pipeline steps in the SmallX sample project that extracts BART train schedule data. The project sends a query to the BART site, which returns an HTML page, and converts the result to a simpler format. These steps are, of course, defined in an XML script.
- Read an XML source that defines the stations and times to look up.
- Using XSLT, generate a URL that expresses a schedule query.
- Selectively delete HTML formatting elements from the response stream.
- Extract the contents of the HTML schedule table.
- Apply XSLT to create the desired output format.
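The middle steps of that list (dropping formatting elements, extracting the schedule table, and emitting a simpler format) can be sketched in plain Python. This is not SmallX code and uses no XSLT. The sample HTML, the `TableExtractor` class, and the `<schedule>`/`<trip>` output format are assumptions made up for this example, since the actual layout of the BART response is not described here.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Invented sample response; the real BART page layout is not shown in this article.
SAMPLE_HTML = """
<html><body><h1>Schedule</h1>
<table>
<tr><td><b>Depart</b></td><td><b>Arrive</b></td></tr>
<tr><td><font color="red">8:05</font></td><td>8:32</td></tr>
<tr><td>8:20</td><td><i>8:47</i></td></tr>
</table></body></html>
"""

class TableExtractor(HTMLParser):
    """Collects the text of table cells row by row, ignoring formatting tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self.in_cell = True
            self.rows[-1].append("")
        # Tags such as <b>, <i>, and <font> are not acted on, which drops
        # the formatting while keeping the text inside them.

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text outside table cells (headings, whitespace) is discarded.
        if self.in_cell:
            self.rows[-1][-1] += data.strip()

def schedule_to_xml(html):
    parser = TableExtractor()
    parser.feed(html)
    header, *trips = parser.rows  # first row holds the column names
    root = ET.Element("schedule")
    for trip in trips:
        el = ET.SubElement(root, "trip")
        for name, value in zip(header, trip):
            el.set(name.lower(), value)
    return ET.tostring(root, encoding="unicode")

schedule_xml = schedule_to_xml(SAMPLE_HTML)
print(schedule_xml)
```

In a pipeline toolkit each of these stages would be a separate, reusable component connected by the XML script, rather than one class and one function as in this sketch.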
The Cocoon Project
The Cocoon open source project is one of the pioneers in the use of the pipeline concept to separate the various concerns of content and presentation. It has inspired some specialized spin-off projects such as the "Lenya" content management system. Although "Web services" were not around when Cocoon started, the mechanisms it provides are very flexible and can easily be adapted to act as client or server.
A Cocoon pipeline is defined in a single XML configuration file called a sitemap. In a Cocoon-based Web service, all request handling is controlled by the sitemap. Available pipeline components are divided into the following classes:
- Generators: create SAX events to feed subsequent components.
- Serializers: the terminal component of a pipeline; turn SAX events into formatted output.
- Matchers: perform logical operations to select parts of the pipeline.
- Selectors: perform more complex logical operations than Matchers can.
- Transformers: receive SAX events and apply XSLT or other transformations, creating output SAX events.
- Actions: provide a mechanism for custom programming to manipulate input request data.
- Readers: handle Web requests that do not require XML processing.
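The division of labor among generators, transformers, serializers, and matchers can be illustrated with a small Python sketch. This is not Cocoon's API, and a real sitemap is an XML file rather than a Python dictionary. The event tuples, function names, and request paths below are invented for the example. It shows how a request is matched to a pipeline, and how events flow from a generator through optional transformers into a terminal serializer.

```python
from xml.sax.saxutils import escape

def generator(records):
    # Generator: produce a stream of SAX-like events for later components.
    yield ("start", "items")
    for r in records:
        yield ("start", "item")
        yield ("text", r)
        yield ("end", "item")
    yield ("end", "items")

def upper_transformer(events):
    # Transformer: consume events and emit (possibly modified) events.
    for kind, value in events:
        yield (kind, value.upper() if kind == "text" else value)

def xml_serializer(events):
    # Serializer: terminal component that turns events into XML text.
    out = []
    for kind, value in events:
        if kind == "start":
            out.append(f"<{value}>")
        elif kind == "end":
            out.append(f"</{value}>")
        else:
            out.append(escape(value))
    return "".join(out)

def text_serializer(events):
    # A second serializer: same events, plain-text presentation.
    return "\n".join(value for kind, value in events if kind == "text")

# Matcher-like table (hypothetical paths): request path -> pipeline components.
ROUTES = {
    "/items.xml": ([upper_transformer], xml_serializer),
    "/items.txt": ([], text_serializer),
}

def handle(path, records):
    # Select the pipeline for this request, then chain its components.
    transformers, serializer = ROUTES[path]
    events = generator(records)
    for transform in transformers:
        events = transform(events)
    return serializer(events)

print(handle("/items.xml", ["tea", "a&b"]))
print(handle("/items.txt", ["tea", "a&b"]))
```

Because the same generator feeds both routes, the content is produced once and only the presentation differs, which is the separation of concerns the pipeline design aims for.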
XML Pipeline Definition Language at the W3C
Authors at Sun Microsystems and nine other organizations active in the use of XML submitted a proposed definition of an XML Pipeline language to the W3C in 2002. This proposal consists mainly of a vocabulary and schema rather than a complete API. So far, the only commercial implementation I have been able to find is the one released by Oracle.
Oracle Implementation of XML Pipeline
The Oracle XML Developer's Toolkit 10g includes an XML pipeline processor Java implementation that uses the proposed definition language with some slight differences. The supplied processor classes wrap various utility classes in the developer's toolkit so that they can be controlled by a pipeline document.
XML processes that can be controlled by the Oracle pipeline include compression and expansion of XML data streams, application of XPath selection patterns, validation against a schema, and application of XSLT stylesheets.
For turning odd data sources into XML, the ServingXML toolkit appears to be the most highly developed and supported. If your data is already in an XML format, the SmallX toolkit is simpler and has the advantage of integrating with the NetBeans IDE. Cocoon offers a very powerful API, but may be harder to get started with. If you are already using the Oracle Java toolkits, the pipeline processor tools offer a convenient way to get started.