ISACA Now Blog

Big data defined

Posted at 1:54 PM by ISACA News | Category: Security | Comments (2)

There are a number of definitions of big data presently in use. The term originates from a 2001 paper by Doug Laney of META Group, in which Laney defines big data as data sets where the three Vs (volume, velocity and variety) present specific management challenges.

Velocity refers to the speed with which data is created, and that speed has been increasing dramatically. The infographic below shows some staggering examples of data velocity: each minute, 48 hours of video are uploaded to YouTube, Twitter users send 100,000 tweets and Instagram users share 3,600 photos.

Figure 1 (Source: Domo.com)

Velocity is also quickly becoming the key aspect of big data that warrants management. Visitors to LinkedIn, for example, are not prepared to wait more than a few seconds for the “People You May Know” screen to display, which means LinkedIn must process terabytes of data, and do it quickly.

Variety arises because all of this activity is not limited to a few types of data. We, as connected citizens of the world, now create and consume video, audio and photos in various formats. We tweet. We blog.

At the same time, organisations are collecting and storing more and more of the data produced by their corporate systems in order to gain better insight into their businesses and to enable easier interaction with partners and customers. In addition, a number of “intelligent” devices, such as water and electricity meters, generate types of data specific to their industry, application or design.

This variety of data requires new ways of storing and accessing it. Traditional databases are no longer adequate for a number of these tasks, which is why new tools and frameworks, such as Hadoop, Cassandra and MongoDB, are coming onto the market.
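To make the point about variety a little more concrete, here is a minimal Python sketch of the document-store approach taken by tools such as MongoDB. It assumes the pymongo driver and a MongoDB instance running locally on the default port; the database, collection and field names are hypothetical.

    # Minimal sketch: storing records of different shapes in one collection.
    # Assumes pymongo is installed and a local MongoDB instance is running;
    # the database, collection and field names below are hypothetical.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    events = client["demo"]["events"]

    # Three very different records: a tweet, a photo upload, a meter reading.
    # A single relational table would need a column for every possible field.
    events.insert_many([
        {"type": "tweet", "user": "alice", "text": "Big data!", "hashtags": ["#bigdata"]},
        {"type": "photo", "user": "bob", "format": "jpeg", "size_mb": 2.4},
        {"type": "meter_reading", "meter_id": "WTR-042", "litres": 310, "interval": "hourly"},
    ])

    # Queries still work across the collection despite the differing schemas.
    for doc in events.find({"user": "alice"}):
        print(doc["type"], doc.get("text", ""))

The same records could be forced into relational tables, but every new record type would then require a schema change; the document model simply absorbs the new shape.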

Volume is the natural consequence of velocity and variety and is somewhat of a moving target. Often we expect to be able to nominate a specific boundary, e.g. 5TB, 1PB*, etc., beyond which we can start talking about big data. But with ever-increasing “creator” activity and, consequently, data volumes, it is difficult to say, “We have accumulated 1.5PB of data, so we now need to start thinking about big data.” If we do this, we will have to change definitions every few months and it will be almost impossible to write business cases and get them approved.

In my view, big data is quite often in the eye of the beholder. If an organisation is starting to encounter limitations in its current data-processing infrastructure, then it is time to get involved with big data, especially when these challenges cannot be addressed through “brute force” such as buying more storage or RAM.

Here is a personal example: a number of years ago, I was responsible for the implementation of a high-volume transaction-monitoring system for a lottery. As part of the project, we wanted to store the sales information for each product line, product, day and location separately, enabling us to speed up the production of various reports and to allow drill-downs along those dimensions.

Initially, I decided to put everything into a four-dimensional array and use it to produce the necessary reports. One small problem: the compiler could not process the source code. We had simply hit hardware limitations, and “brute force” in the form of more RAM would not have sufficed. The situation was very similar to today’s big data challenges, although the volumes were nowhere near as high.
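To give a feel for why the dense approach hit a wall, here is a minimal Python sketch with entirely made-up dimension sizes: a fully allocated four-dimensional array reserves a cell for every possible combination, whereas a sparse, tuple-keyed structure stores only the combinations that actually occurred.

    # Rough sketch of the storage problem; all dimension sizes are made up.
    product_lines, products, days, locations = 20, 500, 365, 2000

    # A dense 4-D array reserves a cell for every combination, sold or not.
    dense_cells = product_lines * products * days * locations
    print(f"Dense cells: {dense_cells:,}")                       # 7,300,000,000
    print(f"At 8 bytes each: ~{dense_cells * 8 / 1e9:.0f} GB")   # ~58 GB of RAM

    # A sparse structure stores only combinations that actually had sales.
    from collections import defaultdict

    sales = defaultdict(float)  # keyed by (product_line, product, day, location)
    sales[(3, 117, 42, 1501)] += 19.95
    sales[(3, 117, 42, 1501)] += 4.50
    sales[(7, 260, 42, 88)] += 12.00

    # Drill-down: product line 3 on day 42, across all locations.
    total = sum(v for (pl, _, d, _), v in sales.items() if pl == 3 and d == 42)
    print(f"Product line 3, day 42: {total:.2f}")

The specific numbers do not matter; the point is that the dense structure grows with the product of the dimensions, while the sparse one grows only with the number of real transactions.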

What is big data? Simply put, it is data sets that, due to their size (volume), the speed at which they are created (velocity) and the type of information they contain (variety), push the existing infrastructure to its limits.

In subsequent ISACA Now Blog posts, I will address the rise of big data and explore how it is changing our lives in some unusual ways.

Mario Bojilov
Meta Business Systems founder
President, Board of ISACA-Brisbane

*1PB = 1,024TB = 1,048,576GB

Note: For more information on big data, download ISACA’s free white paper or visit the Big Data topic in ISACA’s Knowledge Center.

Comments

The 4th V can be a problem

Veracity. How can we rely on the data produced by non-trusted systems? GIGO at a bigger, faster level. Be it relational or associative, data has to be clean and accurate.
Patrick Lynch at 5/18/2013 10:09 PM

Re: The 4th V can be a problem

Clean and accurate data seems to have been a problem for most systems, trusted and non-trusted, relational and non-relational. We still have trusted, relational purchasing systems where duplicate payments are between 0.1% and 0.5% of the total, according to the IIA.

Not sure if this issue will ever be solved completely. In a sense, I think, it is better to acknowledge it and deal with it by assigning a "reliability score" to our data and data sources. This will allow for better and much more reliable decisions. For example, data coming from enterprise systems and machine-generated sources can be assigned a higher score than human-generated data from photos, videos and social media.
MarioGB at 5/24/2013 1:58 AM
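As a rough illustration of the "reliability score" idea in the reply above, here is a minimal Python sketch; the sources, scores and weighting scheme are entirely hypothetical.

    # Hypothetical reliability scores per data source (0.0 to 1.0).
    RELIABILITY = {
        "erp_system": 0.95,     # trusted enterprise system
        "smart_meter": 0.90,    # machine-generated readings
        "social_media": 0.40,   # human-generated, unverified
    }

    def weighted_estimate(observations):
        """Combine (source, value) pairs, weighting each by its source's score."""
        total_weight = sum(RELIABILITY[src] for src, _ in observations)
        return sum(RELIABILITY[src] * val for src, val in observations) / total_weight

    # The same quantity as reported by three sources of varying reliability.
    obs = [("erp_system", 102.0), ("smart_meter", 99.5), ("social_media", 140.0)]
    print(round(weighted_estimate(obs), 1))  # ~107.8, vs. 113.8 for a plain average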