Big Data As Defined By Constraints: "Camel Meets Eye Of Needle", Or, "Surviving Your First Week On The Job"

If you are an executive charged with a "big data project", here's a prediction: in your first week on the job you'll likely be surprised. And the surprise concerns what you'll actually spend your time doing.

What might that surprise be? Your biggest responsibility won't concern what you thought you signed up for, which is likely something like "generating analytical insights". Rather, your biggest responsibility is almost the opposite: focusing on everything but generating analytical insights.

And if you've been dreaming of fishing in a sea of data, with a big net to scoop up any of the myriad insights just swimming by your boat, your surprise might even be one of disappointment. Because this blue ocean metaphor, lovely to imagine, is seriously misleading about the nature of the world of big data. It's more likely that your ocean will be full of fish, all of them three-eyed and inedible! You'll have lots of data but nothing to take home that you can call a great insight.

Here's the problem, which defines the basis of your surprise, your job responsibility and the sort of metaphor you should use to describe the world of big data. The problem lies in the very definition of big data, which is that it is "big". And although "big" is subjective, there's a very practical and non-trivial definition of "big": a data set qualifies as big data if it is too big for any of your machines to process in one chunk in a reasonable amount of time. In other words, the world of big data is defined by system capacity constraints.

Now you can see that there are really two related meanings of big data:

1. Big-Data-As-Foundation-Of-Actionable-Insight is a definition concerning what big data can do for you. This definition is the "benefit" which confers an "advantage" on the organization that masters big data. It's the "sales pitch" for investing in big data.

2. Big-Data-As-Problem-Bigger-Than-Affordable-Technology-Platforms is a definition concerning how you work within system capacity constraints. And this definition is the "feature" which is fundamental to achieving Big-Data-As-Foundation-Of-Actionable-Insight. Big Data wouldn't be a challenge if you had either an unlimited budget for systems or no problem waiting a week for an analysis to complete. But in the real world you have system capacity constraints, budget constraints and time constraints, all of which are interrelated.

The second definition is sometimes the source of surprise for business executives in their first week on the job. While imagining the kudos attached to delivering on Big Data definition No. 1, it dawns on the executive that much of their time will be spent on Big Data definition No. 2.

What does it mean, on a practical day-to-day basis, that your big data set is bigger than what can be processed by your affordable technology platform? In practical terms, it means that your big data is useless. This technology limitation means that, without corrective action, you can't do anything with your data. And recall from our definition above, this is the very definition of big data. And with M2M ("machine-to-machine") and IoT ("Internet of Things") data coming on stream, and other sources of new data, the problem will only get worse.

This is where your new responsibilities take over! It's your job to meet this challenge. And fortunately, there are several easily understood strategies for dealing with data sets that are too big for your affordable technology platform.

Four Ways Of Overcoming Big Data Constraints

Here are four ways of overcoming big data constraints:

1. CHUNK -- If you can't do the job all at once, you break it up into chunks, but still process the entire data set (a minimal sketch of the first three strategies follows this list).

2. TRIM -- Or you can trim out some of your data, i.e. just ignore or discard it, based on some selection criterion. Trimming is distinct from chunking: whereas chunking breaks up a data set into smaller data sets along some dimension, trimming discards some data outright. (From a table perspective, you can think of chunking as segmenting or selecting by row, whereas trimming is selecting by column. If one wants to be really strict, from a set theory point of view, the two strategies are the same. Nevertheless, from a management perspective it is still helpful to think of them as distinct.) Chunking and trimming strategies can be used together.

3. SAMPLE -- Or you can employ statistical sampling strategies, which can generate insights from just a tiny fraction of your overall universe.

4. MORE CAPACITY -- You can also add more capacity.
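To make the first three strategies concrete, here is a minimal sketch in Python using the pandas library. The file name ("events.csv"), the column names and the chunk and sample sizes are hypothetical assumptions for illustration, not a prescription:

```python
import pandas as pd

CSV_PATH = "events.csv"            # hypothetical file, too large to load at once
KEEP_COLS = ["region", "amount"]   # TRIM: keep only the columns this analysis needs

totals = {}
sampled_chunks = []

# CHUNK: stream the file in memory-sized pieces instead of loading it whole.
for chunk in pd.read_csv(CSV_PATH, usecols=KEEP_COLS, chunksize=1_000_000):
    # Aggregate each chunk, then combine the partial results.
    for region, amount in chunk.groupby("region")["amount"].sum().items():
        totals[region] = totals.get(region, 0) + amount
    # SAMPLE: keep roughly 1% of each chunk for cheaper exploratory analysis.
    sampled_chunks.append(chunk.sample(frac=0.01, random_state=42))

sample = pd.concat(sampled_chunks, ignore_index=True)
print(totals)       # exact totals, computed without holding all rows in memory
print(len(sample))  # a small random subset retained for further exploration
```

Note that even this tiny sketch embodies management choices: which columns to keep, what fraction to sample, and which dimension to aggregate across.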

These seemingly simple strategies become much more interesting, or challenging, when you start to look at what they mean in practice. Whether you employ one or all of the above strategies, they will come at a cost. And the costs you incur include management time for designing the strategy and costs associated with a loss of information.

Employing these strategies is not something that can be just handed off to the IT department either, because there are business issues and business semantic implications here which are properly the purview of management.

Simple Data Management Constraint Example

Here's a simple but useful personal example from your host's experience.

He once had to write a data set sorting program for a data management class (a common assignment, known as a "polyphase merge sort").

The assignment was interesting because the data was stored on multiple tapes, which could be mounted on multiple available tape drives. As you can see, this school assignment could serve as our canonical example of a capacity constraint: the data set was larger than the capacity of one component of the available system.

Individual data set segments on each tape were merge-sorted with segments from another tape. In this way, the volume of sorted data would grow until it encompassed the entire set. Successful completion of the assignment depended on generating the correct tape-mounting schedule, which would result in a fully sorted data set, now distributed across all the tapes.

Note that there were two "algorithms" used in this assignment: the trivial in-place "bubble sort" algorithm, and the "polyphase merge sort" algorithm, the best version of which is apparently based on a Fibonacci series. (You can read about this algorithm here: http://en.wikipedia.org/wiki/Polyphase_merge_sort.)
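Tapes are long gone, but the same idea survives as an "external" merge sort over disk files. Here is a minimal, illustrative sketch in Python (the file paths and run size are assumptions, and records are assumed to be newline-terminated lines): sort memory-sized runs, write each run to a temporary file standing in for a tape, then merge the runs:

```python
import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, lines_per_run=100_000):
    """Sort a text file too large for memory: sorted runs on disk, then one merge pass."""
    run_paths = []
    with open(input_path) as src:
        while True:
            # Read one memory-sized chunk of lines (one "tape's worth" of data).
            run = list(itertools.islice(src, lines_per_run))
            if not run:
                break
            run.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as out:
                out.writelines(run)
            run_paths.append(path)

    # Merge the sorted runs; heapq.merge streams them, so memory use stays small.
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```

The true polyphase variant is cleverer about how it distributes runs across a fixed number of drives, but the constraint being worked around is the same.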

Here are some relevant things to note about this assignment:

1. COMPARE TO BIG DATA ANALYSIS -- A simple "bubble sort" utility was provided for the assignment. A simple sort is comparable to performing any basic operation on your data set. It's the capability required to support Big Data definition No. 1.

2. COMPARE TO BIG DATA CONSTRAINT -- The full data set was larger than a given constraint of the in-place system, in this case both individual tape capacity and the number of tape drives. (By the way, the assignment specified "six" tape drives. On the basis that this was an unrealistic assumption (one drive might be out of service, or alternatively the CIO could buy more tape drives), your host delivered a solution for "n" tape drives.) This system-derived constraint is the key to the definition of big data; if you have big data, you have a constraint.

3. WHERE THE EFFORT WENT -- What was desired was a "sort". But all the effort went into working around the constraint that one tape could not hold all the data and that there were a limited number of tape drives. This is the effort identified at the beginning of this post, and it may be a source of surprise.

4. A COST CONSTRAINT -- The constraint is technical, but the technical constraint of "n" tape drives is really a cost constraint.

5. A CONSTRAINT MODEL -- Although the example here concerns tape drives, the principle is about constraints and the fact that, by definition, big data is bigger than your networking, storage, CPU or memory capacity in some way.
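To see why such a constraint bites, a back-of-the-envelope calculation is enough. The figures below are illustrative assumptions, not measurements:

```python
# Rough feasibility check: how long does one full scan of the data take on one node?
data_size_tb = 10        # assumed total data volume
scan_rate_mb_s = 200     # assumed sustained read throughput of a single machine
seconds = (data_size_tb * 1_000_000) / scan_rate_mb_s
print(f"One full pass: about {seconds / 3600:.1f} hours")  # roughly 14 hours
```

If a single pass takes most of a working day, any analysis that needs several passes is already outside a "reasonable amount of time", and you are, by the definition above, in big data territory.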

The big data effort highlighted in Note No. 3 above ("Where The Effort Went") is rightly a focus of management attention. While the example here had a precise, simple and highly technical solution (coding a polyphase merge sort using a Fibonacci series-based algorithm), in the real world of big data analytics the solutions are not so simple. And the solutions that are chosen often have management policy implications.

Management Policy Implications Of Big Data Constraint Choices

What are some of the management policy implications of building around the constraints inherent in big data? We can return to the four strategies for dealing with big data, listed above:

1) CHUNK -- The merge sort story above is an example of "chunking". Management decisions are required to balance processing time and timeliness of insight. You could conduct an analysis "by territory", for example delivering actionable results by state, which may cut analysis time by a factor of 50, or more if analysis time grows faster than linearly with data volume. Chunking might also be done across a dimension of time, for instance a rolling month. There are implications, however, in that any behaviour which spans more than a month will be lost. You can see that "chunking" now becomes both a big data management strategy and an analytic category, highlighting the dependency of one on the other.

2) TRIM -- In your data analysis you'll want to "go hunting" for new business opportunities which may be revealed by your data, and which may be behavioural correlations or causations. And to go hunting, you'll need data points in one or many dimensions. But the more data points, the more costly the analysis, in time and other computing resources. So you may need to drop some of your data so that your data set is manageable. What data you drop may change from analysis to analysis. But notice the management implications. Again you are making management choices about the meaning of the data, in support of keeping the data manageable. And these choices may not only be for the analysis you are going to run this afternoon, but possibly for the specification of a data feed which may be in place for much longer than a day. And the choices you make automatically define the data you work with, reducing the information available and the conclusions possible.

3) SAMPLE -- The first two strategies above both concern all members of a set which meet certain criteria, a universe or "census" if you will. However, taking statistical or random samples of a data set can be an excellent way to cut an enormous set of data down to a very manageable size (a minimal sampling sketch follows this list). Interestingly, in the world of market research a well-conducted random sample can often contain better information than a "census", which will contain all kinds of bias and data quality issues. Implementing sampling can be tricky, however, and may require a well-trained data analyst to get it right. Today's analytic tools are incredibly powerful and the resulting charts and visualizations can be very compelling, but note that charts and visualizations can be compelling whether they are correct or not! Sampling is especially useful if you are going to do some upgrading of the data, for example associated with data quality. In summary, sampling is a key strategy for managing the challenge of big data, but again there are aspects of the sampling strategy that require management input and involvement.

4) MORE CAPACITY -- Given that capacity typically grows "linearly", whereas data acquisition, storage, management, analysis and presentation may be susceptible to non-linear (for example "exponential") growth, it's quite likely that more capacity won't be effective on its own, aside from any cost issues. Again we have an indication for management input and involvement.
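As a concrete illustration of the SAMPLE strategy, here is a minimal sketch of "reservoir sampling" (Algorithm R), which keeps a uniform random sample of fixed size from a stream of unknown length, so the full data set never has to fit in memory. The function name and the example stream are hypothetical, for illustration only:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from an arbitrarily long stream."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # replace an existing item with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Example: sample 1,000 records from a stream of ten million, without storing them all.
records = (f"record-{n}" for n in range(10_000_000))
subset = reservoir_sample(records, k=1_000, seed=42)
print(len(subset))
```

Even here the management questions remain: what the sampling frame is, what sample size supports the decision at hand, and how the sample's limitations are communicated alongside the charts it produces.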

These policy-driven requirements for management involvement only concern the question of big data constraints. There are closely related issues that also require management attention. For example, the issues of data quality, data governance, as well as data provenance, liability, privacy and more are all fundamental corporate and management policy issues. For these reasons, any organization embarking on a big data program, and that should likely be all organizations that wish to survive, will require senior executives to become involved in understanding and working with big data.

Your New Job Description Includes Data Analysis

As your host has written elsewhere concerning business process management, it is incumbent on managers to "step up" and take responsibility for the operational realities of their organization. Taking a "hands-off" approach, a species of "magical thinking" wherein big data is a "black box", is not a recipe for success. Much has been written about the "need for data scientists" or the "shortage of qualified data analysts". While there is certainly a role for specialists in the world of big data, the suggestion should not be an excuse for managers to ignore the opportunity to become directly involved. Would anyone ever suggest that businesses or governments "need more management scientists"? The idea is a non-starter. Managers are expected to know management. In the same way, managers will be expected to know data analysis, including the special requirements for managing big data under conditions of constraint.

With your foundation of good data flowing and in place, and with a good understanding of the options and tradeoffs you have for defining management data, you'll be ready for analysis. And the whole exercise of understanding data constraints will be the ideal start for all your analysis. As you build reports, uncover new insights and define fresh KPIs, you'll have the confidence that your analysis is well founded.

In Biblical terms, it is said that it is difficult for a camel to go through the eye of a needle. Now whether the "eye of a needle" is literally that, or a mistranslation of a very narrow gate, the idea is the same: a situation where there is a constraint. Truly, it's easier for a camel to go through the eye of a needle than to enjoy the heaven of big data without any consideration of system constraints!

The world of big data has lots of opportunities, even a need, for metaphor. You'll find the metaphors that help you communicate the challenges and joys of working with big data.
