Can you provide some background on what is being called “big data”?
“Big data” is the latest buzzword to capture the attention of IT workers. Think of it as the terabytes (or even petabytes) of unique information streaming into our datacenters as more and more systems and sensors are added to our networks.
It’s not often that a cerebral concept from IT breaks out of the intangible realm of ideas and manifests itself in the real world, but big data has managed to do just that. Lincoln Center in New York City has recently been adorned with 40 seven-foot-high media panels and a 123-foot-long wall of LEDs. But this isn’t your typical art installation: it’s part of IBM’s Think exhibition. Big data is behind the big display.
The 123-foot wall is actually a real-time graphical visualization of data from the systems surrounding the exhibit. These systems measure everything from traffic on Broadway to solar energy and air quality readings.
The exhibit proposes that progress throughout human history follows a common pattern: Seeing (acquiring data), Mapping (organizing data), Understanding (figuring out the relationships and rules governing data), Believing (accepting what the data is telling you), and Acting (using the data to drive towards some specific goal).
Following this path isn’t always easy, especially when the amount of information you are trying to handle is enormous. Big data comes in two major categories, and the first challenge is recognizing which one you are dealing with.
Structured big data, as the name implies, follows a standard schema. Typical examples include stock tick data, user activity (imagine the usage logs maintained by Google or Facebook), or the auto traffic data collected by IBM in their exhibit. The two biggest challenges here are storing the information, and manipulating it for either reporting or data mining purposes.
Unstructured big data has the same challenges as structured data, compounded by the problem that there is no easy way to pigeonhole the information into a specific format. For example, by some estimates there may be almost 1 billion blogs worldwide. What secrets about political unrest, consumer sentiment, or pending health crises might lie within? Before you can even think about mining this resource, you’d have to harvest the data and store it somewhere you could manipulate it. First, you’d probably want to convert it all to the same language. And then things really get tough. Techniques such as sentiment analysis or opinion mining might help you extract some subjective information, but it’s difficult to know what information to keep and what to throw away.
Now bring correlation into the picture, and things get even more complicated. For at least some of the blogs in this example, we might be able to find out where a post was written thanks to GPS data attached to it. That means we can look up related geographical information. Maybe an author is in the middle of a revolution, seeking shelter from a hurricane, or 1,000 miles away from where they say they are. The complexity only increases, which makes it hard to imagine ever completing what IBM calls the Mapping stage, let alone progressing to Understanding.
This is the real challenge of big data: pulling meaning out of the noise. It is like trying to piece together a melody when you’re sitting in a room listening to the cacophony of a million different instruments.
As is typical, many vendors are already offering solutions in this space, and many IT departments are treating this problem as something new and unique. But there are other fields that have already faced similar challenges, and we can learn from them. For example, almost everything we know about prehistoric life on Earth is based on the relatively small sample of skeletons and fragments that have been discovered. Or consider the field of physics. In the grand scheme of things, its knowledge and principles have been defined using an extraordinarily small subset of the events that happen around us or are purposely staged. Yet this field’s conclusions have held up fairly well over time.
In other words, big data is not the “next big thing”. We’ve faced this challenge before. It’s not even that we’re creating more unique data. The new wrinkle is that we’re able to capture more data than ever. The temptation (and the marketing material) would have you believe it’s all valuable, but it’s not. My email inbox “captures” more messages than it did 10 years ago, but most of them are spam. My phone used to “capture” a lot more calls before the Do Not Call list went into effect.
I’m reminded of the time I took a magnifying glass to a Sunday comic strip when I was growing up. I was surprised to find that the pictures were composed of tiny colored dots. If the printer missed a few of them, the overall picture wasn’t noticeably affected at all. Keep that in mind before you become a “data hoarder”, afraid to part with a single bit that crosses your network. I’m not saying there isn’t a place for big data (NASA’s search for extrasolar planets is one prominent example), but not every situation warrants this level of scrutiny.
Details are important, but you can often learn a lot much more easily by applying proven statistical methods to an aggregated sample. The other lesson we can learn from the Sunday funnies and the IBM exhibit is that the right visualization can help weave small points into a useful picture.
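To make the sampling point concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything taken from the IBM exhibit: the "sensor readings" are synthetic random numbers, and the helper name estimate_mean is hypothetical. It compares the average of a full data set with an estimate drawn from a random sample of 5% of it.

```python
import random
import statistics


def estimate_mean(population, sample_size, seed=0):
    """Estimate the mean of `population` from a simple random sample.

    Returns the sample mean and its standard error.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # sampling without replacement
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / sample_size ** 0.5
    return mean, stderr


# Synthetic stand-in for a large stream of sensor readings (assumed data).
data_rng = random.Random(42)
readings = [data_rng.gauss(50.0, 10.0) for _ in range(200_000)]

full_mean = statistics.fmean(readings)
sample_mean, stderr = estimate_mean(readings, 10_000)

print(f"mean of all {len(readings)} readings: {full_mean:.3f}")
print(f"estimate from 10000 sampled readings: {sample_mean:.3f} (+/- {stderr:.3f})")
```

With data like this, the sample estimate should land within a few standard errors of the full average while touching a small fraction of the records. Real data can be skewed or clustered in ways that call for stratified or weighted sampling, so treat this as a starting point and not a recipe.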
This was first published in October 2011