An architect's guide: How to use big data
One of the unfortunate things about the concept of big data is that it covers a number of database applications that are alike in involving large quantities of information, but very different in how that information is changed and used. One particular problem is how to provide an audit trail in large databases, a problem usually called "data persistence."
Data persistence issues can be particularly problematic because data persistence is often directly related to application functionality. One can argue that data persistence shapes not only the data model, but also the software architecture.
Adding data structure
A fully persistent data structure preserves its current state and all prior states, and allows any of those states to be updated. This level of persistence is hard to implement efficiently in modern software architectures. In fact, many users find full persistence isn't necessary because they use data persistence only to support an audit trail. Where that's the goal, it's not only unnecessary to update past states, but undesirable to allow it. However, any level of data persistence has to be treated as added complexity in a big data architecture.
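The audit-trail case, where past states can be read but never changed, can be illustrated with a minimal Python sketch. The class and method names here (`AuditedValue`, `update`, `at`) are hypothetical and chosen for illustration; this is one possible shape under those assumptions, not a reference implementation.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class Version:
    number: int   # position in the history, starting at 0
    value: Any    # the state recorded at this version


class AuditedValue:
    """Hypothetical cell whose past states are readable but not writable."""

    def __init__(self, initial: Any) -> None:
        self._history: List[Version] = [Version(0, initial)]

    def update(self, value: Any) -> int:
        # New states are only ever appended; nothing already recorded is overwritten.
        version = Version(len(self._history), value)
        self._history.append(version)
        return version.number

    def current(self) -> Any:
        return self._history[-1].value

    def at(self, number: int) -> Any:
        # Read-only access to any prior state.
        if not 0 <= number < len(self._history):
            raise IndexError(f"no such version: {number}")
        return self._history[number].value

    def history(self) -> Tuple[Version, ...]:
        # Hand back an immutable copy so callers cannot rewrite the past.
        return tuple(self._history)
```

Because there is no method that edits an earlier version, the sketch gives the audit trail described above without the cost of full persistence.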
Where big data is unstructured, the challenge is linking states of information represented by what might appear to be separate data elements. One email accepting a proposition and a later one rejecting it is an example. Since little structure means there is no systematic description of the state of information, most people agree unstructured big data doesn't have a persistence problem.
Adding structure means data can be maintained, rather than simply recorded as a succession of states. Auditability can then be provided by recording transactions, and data persistence may not be an issue. However, where real-time systems can be acted on by multiple agents driving concurrent changes, particularly when not all of the real-time actions actually change state, a data persistence strategy may be indicated.
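Providing auditability by recording transactions can be sketched as a simple log that is replayed to recover the state at any point. The tuple layout and the `replay` function are assumptions made for this illustration, and the sketch deliberately ignores the concurrent, multi-agent case.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: (transaction id, item key, change in quantity)
Transaction = Tuple[str, str, int]


def replay(log: List[Transaction], upto: Optional[int] = None) -> Dict[str, int]:
    """Rebuild inventory levels from the first `upto` transactions (all if None)."""
    levels: Dict[str, int] = {}
    for _txn_id, key, delta in log[:upto]:
        levels[key] = levels.get(key, 0) + delta
    return levels


log: List[Transaction] = [
    ("txn-1", "widget", 100),
    ("txn-2", "widget", -30),
    ("txn-3", "gadget", 5),
]
```

Here `replay(log)` yields the current levels, while `replay(log, 1)` yields the state after the first transaction only, so past states are derived rather than stored.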
Implementing linked lists
Most current data persistence approaches build on the notion of a linked list, in which successive values of a data element are connected "vertically," separately from the normal linkage of data elements through key fields and record structures. Thus, five different past values for an inventory level might be linked to the current value. Each value shows how it was derived, at least to the point of identifying the transaction or source.
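As a rough illustration of that linkage, the Python sketch below chains each inventory level to the value it replaced and tags it with the transaction that produced it. The names (`StateNode`, `InventoryItem`, the transaction identifiers) are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class StateNode:
    value: int                        # the inventory level in this state
    source: str                       # hypothetical transaction identifier
    previous: Optional["StateNode"]   # the state this one replaced, if any


class InventoryItem:
    def __init__(self, sku: str, level: int, source: str) -> None:
        self.sku = sku
        self.head = StateNode(level, source, None)

    def apply(self, delta: int, source: str) -> None:
        # The new state becomes the head and points back at the old one.
        self.head = StateNode(self.head.value + delta, source, self.head)

    def trail(self) -> Iterator[StateNode]:
        # Walk from the current value back to the first recorded value.
        node: Optional[StateNode] = self.head
        while node is not None:
            yield node
            node = node.previous
```

Reading `item.head.value` gives the current level, while iterating over `item.trail()` gives the audit trail, newest first.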
Implementing a linked list can be done using pure linked-list architecture. It's also possible to use a polyglot approach that combines SQL and non-SQL structures to record information in a form easily queried (SQL) but to then link past-state information in some way (NoSQL). IBM, Microsoft and Oracle all support forms of this approach in some applications.
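A minimal sketch of the polyglot idea, assuming an in-memory SQLite table for the current state and a plain Python dictionary standing in for the non-SQL store of past states. The table, function and field names are illustrative assumptions and do not represent any vendor's implementation.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List, Optional

# Current state lives in SQL, where it is easy to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, level INTEGER NOT NULL)")

# Stand-in for a document or key-value store holding linked past states.
past_states: Dict[str, List[dict]] = defaultdict(list)


def current_level(sku: str) -> Optional[int]:
    row = conn.execute("SELECT level FROM inventory WHERE sku = ?", (sku,)).fetchone()
    return None if row is None else row[0]


def set_level(sku: str, level: int, source: str) -> None:
    old = current_level(sku)
    if old is not None:
        # Push the value being replaced onto the item's history before overwriting it.
        past_states[sku].append({"level": old, "replaced_by": source})
    conn.execute("INSERT OR REPLACE INTO inventory (sku, level) VALUES (?, ?)", (sku, level))
    conn.commit()
```

Ordinary queries touch only the `inventory` table, and an audit walks `past_states` for the item in question. A production design would also need the two writes to succeed or fail together, which this sketch does not attempt.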
Data persistence can be applied as a layer to database technology, but the strategy is most likely to be used to integrate database persistence and process state persistence. So-called back-end state control systems can be used to provide state information for load balancing and failover, particularly in cloud applications. These persistence layers can also record enough information on database state to provide for auditing and state recovery at the database level. This approach can be applied to big data; popular big data architectures for the cloud like Hadoop don't mandate persistence, but can be made to support a persistence layer.
Persistence layers at the database level can be created using what was once a popular database architecture in its own right -- the semantic database model. The semantic model records data relationships as well as horizontal relationships more traditionally defined by key-value associations (RDBMS).
Using data persistence layers
The state of the art in persistence is arguably the use of persistence layers or design patterns to manage the state not only of databases, but also of processes. This is because it's difficult to build applications that operate on persistent-state data without making them aware of persistence issues, which in turn leaves them vulnerable to the design and compliance problems that arise when the maintenance of past states is separated from the creation of new states.
What does an inventory application do with yesterday's inventory levels? Nothing valid, in most cases. However, creating applications that are persistence-aware at the process and database level will generally mean a complete redesign, and users report both a lack of suitable tools and inconsistencies in best practices.
A recent innovation in persistence is the use of a flat data structure that defines databases by a semantic overlay layer. Since there aren't specific data structures, a database can be visualized in either a current-state form or as a collection of prior and current states. With these more flexible approaches, users can define information and process structures based on common semantics and ensure applications and data models are synchronized in state and auditable in a compliance sense.
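One way to picture a flat structure with an overlay is a single list of records, with the current-state and all-states views computed on demand. The sketch below is an assumption-laden illustration of that idea; the `Fact` layout and both view functions are hypothetical and are not drawn from any specific product.

```python
from typing import Any, Dict, List, NamedTuple


class Fact(NamedTuple):
    seq: int         # order in which the fact was recorded
    entity: str      # what the fact is about, e.g. an item identifier
    attribute: str   # which property of the entity it describes
    value: Any


def current_view(facts: List[Fact]) -> Dict[str, Dict[str, Any]]:
    """Overlay 1: the latest value of every attribute of every entity."""
    view: Dict[str, Dict[str, Any]] = {}
    for fact in sorted(facts, key=lambda f: f.seq):
        # Later facts overwrite earlier ones in the view, not in the store.
        view.setdefault(fact.entity, {})[fact.attribute] = fact.value
    return view


def history_view(facts: List[Fact], entity: str, attribute: str) -> List[Any]:
    """Overlay 2: every recorded value of one attribute, oldest first."""
    ordered = sorted(facts, key=lambda f: f.seq)
    return [f.value for f in ordered if f.entity == entity and f.attribute == attribute]
```

Both views read the same underlying records, so the current state and the audit trail cannot drift apart.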
The key to making this new approach work is a highly elastic and extensible architecture. These new data persistence strategies should be evaluated and pilot-tested at scale to ensure they'll deliver the performance and reliability demanded.
What the best persistence approach will be for big data and the cloud is impossible to predict, but it seems certain it will be a dualistic process/data persistence approach. Big data and cloud application designers should review the state of the art in this area before committing to a design. Otherwise, the persistence strategy may be obsolete before the project is over.
About the author:
Tom Nolle is president of CIMI Corporation, a strategic consulting firm specializing in telecommunications and data communications since 1982.