Text-based data transfer formats:
First, let's take a look at text-based data transfer formats.
XML
PROS:
- Readable and editable by developers
- Error checking by means of Schema and DTDs
- Can represent complex hierarchies of data
- Unicode gives flexibility for international operation
- Plenty of tools in all computer languages for both creation and parsing
CONS:
- Bulky text with low payload/formatting ratio (but can be compressed)
- Both creation and client side parsing are CPU intensive
- Some common word processing characters are illegal (MS Word "smart" punctuation, for example)
- Images and other binary data require extra encoding
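To make a few of these points concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` and `base64` modules. The element names, attribute names and sample values are invented for illustration. It builds a small hierarchy containing Unicode text, applies the extra encoding step that binary data needs, and parses the result back.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical record: every tag and attribute name here is illustrative only.
order = ET.Element("order", id="1001")
ET.SubElement(order, "customer").text = "Zoë Müller"  # Unicode text is allowed
items = ET.SubElement(order, "items")
ET.SubElement(items, "item", sku="A-7").text = "Widget"

# Raw bytes cannot be placed in an XML document directly,
# so they are Base64-encoded into plain ASCII text first.
raw_bytes = bytes([0, 255, 16, 32])
ET.SubElement(order, "thumbnail").text = base64.b64encode(raw_bytes).decode("ascii")

# Serialize to a string, as it would be sent over the wire.
xml_text = ET.tostring(order, encoding="unicode")

# The receiving side parses the text and reverses the Base64 step.
parsed = ET.fromstring(xml_text)
customer = parsed.findtext("customer")
decoded = base64.b64decode(parsed.findtext("thumbnail"))
```

Note how much of `xml_text` is tags rather than data, which is the low payload/formatting ratio mentioned in the list.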
JSON
The development and standardization of JavaScript has made the Web browser a powerful tool for dynamic presentation of data by manipulating the appearance and content of HTML elements. In recent years, it has become possible to assemble the JavaScript component of a webpage from multiple sources and to update it repeatedly with data objects. JavaScript Object Notation, or JSON, is the format used for those data objects.
JavaScript recognizes the usual set of variable types: strings, numbers, arrays and simple objects. The data structures that JSON excels at representing are collections of name/value pairs and ordered lists of data values. Since JSON is transmitted as plain text, it can be read by other languages, so its uses extend far beyond the Web browser. Thus, JSON is strong competition for data transmission in many areas. Recognizing this, RESTful Web service frameworks such as Jersey and Restlet put a lot of effort into supporting JSON.
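As a rough illustration of name/value pairs and ordered lists being read outside the browser, here is a small sketch using Python's standard `json` module. The record and its field names are made up for the example.

```python
import json

# Hypothetical record: a collection of name/value pairs (an object),
# containing an ordered list and a nested object.
record = {
    "name": "Widget",
    "price": 9.95,
    "in_stock": True,
    "tags": ["small", "blue"],
    "supplier": {"id": 17, "city": "Austin"},
}

# Serialize to plain text, as it would be sent to a browser or another service.
json_text = json.dumps(record)

# Any language with a JSON parser can rebuild the same structure.
restored = json.loads(json_text)
```

The text in `json_text` is also a valid JavaScript object literal, which is why browsers can consume it so easily.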
PROS:
- Readable and editable by developers
- Plenty of JavaScript developers
- Highly developed browser toolkits such as Dojo and jQuery
CONS:
- Bulky text with low payload/formatting ratio, but not as bad as XML
- Client CPU time required to parse
- Not as flexible as XML for some data structures and binary data
Plain text
It is rather easy to represent some sorts of data as lines of plain text in which one line corresponds to a single data item. Spreadsheet rows can be expressed this way using "comma-separated values" or CSV. Another common approach is a list of "properties" where each line contains a name/value pair.
PROS:
- Readable and editable by developers
- Fairly compact representation for simple types
CONS:
- Possible confusion introduced by punctuation in values
- Limited to very simple structures
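The punctuation problem is easy to demonstrate. The sketch below, with invented sample data, shows a naive comma split going wrong, then the same row handled by Python's standard `csv` module, which quotes troublesome values. A simple properties list is parsed at the end for comparison.

```python
import csv
import io

# Hypothetical spreadsheet rows; the second row has a comma and quotes in its values.
rows = [
    ["id", "name", "note"],
    ["1", "Smith, John", 'said "hello"'],
]

# Naive approach: join and split on commas. The embedded comma adds a bogus field.
naive_line = ",".join(rows[1])
naive_fields = naive_line.split(",")  # 4 fields instead of 3

# The csv module quotes values that contain commas or quote characters.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# Reading it back recovers the original three fields per row.
parsed_rows = list(csv.reader(io.StringIO(csv_text)))

# A "properties" list: one name/value pair per line, split on the first "=".
properties_text = "host=example.com\nport=8080\n"
properties = dict(
    line.split("=", 1) for line in properties_text.splitlines() if line
)
```

Every value in `properties` comes back as a string, which hints at the second item in the CONS list: anything beyond flat, simple data needs conventions of your own.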
So much for formats based on text. Next, let's look at some binary formats.
CORBA
The Common Object Request Broker Architecture or CORBA was the first serious effort to provide for communication of complex data objects between completely different systems. Much of CORBA is concerned with aspects of communication that we are not talking about here. The CORBA standard, now at version 3.1 (2008), is maintained by the Object Management Group.
PROS:
- Language and operating system independence
- Compact data representation
- Built-in mapping in Java covers almost all features
- Open-source versions are available
CONS:
- The complete standard is quite complex
- Interfacing to non-object-oriented languages is not easy
- Incomplete implementation on many systems
Packed binary
Back in the days when dinosaurs roamed the earth and 300-baud modems were as good as you could get for a remote system, programmers put a lot of ingenuity into packing maximum information into the minimum number of bits. If we only needed integers between 0 and 31, we only used 5 bits, which could share a byte with 3 true or false bits. I suspect that only programmers of deep space probes do much packed binary by hand these days.
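For readers who have never done this by hand, here is a minimal sketch of the idea in Python. A 5-bit field holds values from 0 through 31, leaving 3 bits of the byte for flags. The function names and the bit layout (value in the high bits, flags in the low bits) are arbitrary choices for illustration.

```python
def pack(value, flag_a, flag_b, flag_c):
    """Pack a 5-bit integer and three booleans into a single byte value."""
    if not 0 <= value <= 31:
        raise ValueError("value must fit in 5 bits (0..31)")
    # Layout (illustrative): vvvvv a b c
    return (value << 3) | (int(flag_a) << 2) | (int(flag_b) << 1) | int(flag_c)


def unpack(byte):
    """Reverse pack(): return (value, flag_a, flag_b, flag_c)."""
    value = byte >> 3
    return value, bool(byte & 0b100), bool(byte & 0b010), bool(byte & 0b001)


packed = pack(21, True, False, True)   # 0b10101101
fields = unpack(packed)                # (21, True, False, True)
```

Nothing in the byte says where one field ends and the next begins, so both sides must agree on the layout in advance, and a single flipped bit silently changes the meaning.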
PROS:
- Very compact, approaching theoretical maximum
CONS:
- Computation intensive
- Fragile, dropped or damaged data bits are hard to detect and correct
- Modern programmers not familiar with the idea
Protocol Buffers
Google has made the idea of packed binary more practical for real applications with the introduction of Protocol Buffers. This toolkit evolved as a replacement for hand-coded packed binary for exchanging requests and responses with Google index servers. The tools were released as open source in 2008. The intent of this API is to provide a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more.
This API produces an encoding of typical data values almost as compact as a hand-optimized packed binary. The documentation serves as an introduction to the concept and is suitable for modern programmers. Programmers must use a "proto" syntax to specify the data types to be transmitted, and the toolkit takes care of generating the support code for packing and unpacking.
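To give a feel for where the compactness comes from, here is a hand-written Python sketch of base-128 "varint" encoding, the scheme the Protocol Buffers wire format uses for integer values. This is not the Protocol Buffers library or its generated code. The function names are invented, and the sketch covers only unsigned integers, not field tags or complete messages.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint (illustrative helper)."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F          # take the lowest 7 bits
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data):
    """Decode a single varint from the start of a bytes object."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


small = encode_varint(1)     # one byte
medium = encode_varint(300)  # two bytes
```

Small numbers, which are the common case in most messages, take a single byte instead of a fixed 4 or 8, and no field names travel with the data at all.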
PROS:
- Very compact representation, approaching theoretical maximum
- Tools for many languages
- Not sensitive to version changes
- Open source license
CONS:
- Not readable or editable by developers
- Yet another data definition syntax (proto) to learn
This was first published in August 2010
