If you ask Gartner, big data is best understood by the 3 V’s: volume, velocity and variety. The MIKE2.0 open source project seemingly tries to confuse everyone by stating that “Big data can be very small and not all large datasets are big.” When I last glanced at Wikipedia, Big Data was simply defined as any dataset that is “beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.” Like so many buzzwords introduced into the technology dialogue, Big Data is a marketing term. If you have any doubt, just look at what Oracle is saying. According to Oracle, Big Data is just unstructured metadata lying around your business that you can analyze, organize and utilize to create business benefit.
Selling Big Data to business is going to be a marketing trend that continues for a while. Reduced to their common commercial messages, most Big Data solutions are really solutions for dealing with large amounts of unstructured data.
I propose that a more useful construct than the 3 V’s exists to help define the characteristics of Big Data: structured vs. unstructured, defined relationships vs. inferred relationships, static vs. dynamic, and stable vs. growing. In each case, the second term in the pair is more characteristic of Big Data, although Big Data can still encompass data at both ends of each spectrum.
If you look at a dataset such as the combined real estate multiple listing databases in the United States, for example, you have a truly complex set of Big Data. At any given time, there are about 2 million real estate listings in the US. Something like 85% of these listings are being updated every fifteen minutes (that’s 6.8 million updates per hour). And 85% of all those listings are served up by an application running on the Magic xpa Application Platform. That’s roughly 5.8 million updates per hour, or more than 1,500 updates per second. Why so many updates? Because of the complex interrelationships of data in a real estate multiple listing. It’s not that the seller is changing the price every fifteen minutes or the agent is changing the advertising description of the property. Rather, each record is tied to all sorts of related data and metadata, as well as community data and averages, that are constantly changing.
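The update rates above are back-of-the-envelope figures, and the arithmetic can be checked with a short sketch. The constant and function names below are illustrative assumptions, and the inputs are simply the rough estimates quoted in the text, not measured values. Taken at face value, the product works out to roughly 5.8 million updates per hour on the platform, or about 1,600 per second, which is consistent with "more than 1500 updates per second."

```python
# Back-of-the-envelope check of the update rates quoted in the text.
# All inputs are the article's rough estimates, not measured values;
# the names are hypothetical and chosen only for this illustration.

TOTAL_LISTINGS = 2_000_000      # approximate active US listings
SHARE_UPDATED = 0.85            # fraction refreshed in each 15-minute window
WINDOWS_PER_HOUR = 60 // 15     # four 15-minute windows per hour
SHARE_ON_PLATFORM = 0.85        # fraction served by the Magic xpa application
SECONDS_PER_HOUR = 3600


def updates_per_hour(listings: int, share_updated: float, windows: int) -> float:
    """Listings touched per window, multiplied by windows per hour."""
    return listings * share_updated * windows


all_updates = updates_per_hour(TOTAL_LISTINGS, SHARE_UPDATED, WINDOWS_PER_HOUR)
platform_updates = all_updates * SHARE_ON_PLATFORM
platform_per_second = platform_updates / SECONDS_PER_HOUR

print(f"All listings: {all_updates:,.0f} updates/hour")
print(f"On platform:  {platform_updates:,.0f} updates/hour")
print(f"On platform:  {platform_per_second:,.0f} updates/second")
```

Changing any one of the constants shows how sensitive the per-second rate is to the underlying estimates.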
In-memory computing and data grid computing provide an interesting means for dealing with Big Data. The paradigm shift of using memory as storage allows you to access data randomly with near-zero latency, whereas disk-based storage depends on sequential access to keep latency down. As more and more enterprise and cloud architectures incorporate in-memory computing, the currently accepted definition of Big Data becomes problematic. If Big Data only includes datasets that are “beyond the ability of commonly used software tools”, then the bar for the Big Data definition is raised as in-memory approaches proliferate.
I like the fact that the Big Data buzzword focuses attention on the problems associated with the proliferation and increasing complexity of data. I am not convinced, however, that the commonly accepted definitions are useful. I would define it this way: Big Data is any complex dataset that includes large amounts of dynamic, growing, unstructured data where relationships between data are frequently inferred rather than declared.