ISACA Journal
Volume 6, 2014


Bridging the Gap Between Access and Security in Big Data 

Ulf T. Mattsson 

Organizations are failing to truly secure their sensitive data in big data environments. Data analysts require access to the data to efficiently perform meaningful analysis and gain a return on investment (ROI), and traditional data security has served to limit that access. The result is skyrocketing data breaches and diminishing privacy, accompanied by huge fines and disintegrating public trust. It is critical to ensure individuals’ privacy and proper security while retaining data usability and enabling organizations to responsibly utilize sensitive information for gain.

(Big) Data Access

The Hadoop platform for big data is used here to illustrate common security issues and solutions. Hadoop is the dominant big data platform, used by a global community, and it lacks needed data security. It provides a massively parallel processing framework1 designed for access to extremely large amounts of data and for experimentation to find new insights by analyzing and comparing more information than was previously practical or possible.

Data flow in faster, in greater variety and volume, and with varying levels of veracity, and can be processed efficiently by simultaneously accessing data split across hundreds or thousands of data nodes in a cluster. Data are also kept for much longer periods of time than they would be in databases or relational database management systems (RDBMS), as the storage is more cost-effective and historical context is part of the draw.

A False Sense of Security

If the primary goal of Hadoop is data access, data security is traditionally viewed as its antithesis. There has always been a tug of war between the two based on risk, balancing operational performance and privacy, but the issue is magnified exponentially in Hadoop (figure 1).

For example, millions of personal records may be used for analysis and data insights, but the privacy of all of those people can be severely compromised from one data breach. The risk involved is far too high to afford weak security, but obstructing performance or hindering data insights will bring the platform to its knees.

Despite the perception of security as an obstacle to data access, sensitive data in big data platforms still require protection under various regulations and laws,2 much the same as on any other data platform. Therefore, data security in Hadoop is most often approached from the perspective of regulatory compliance.

One may assume that this helps to ensure maximum security of data and minimal risk, and, indeed, it does bind organizations to secure their data to some extent. However, as security is viewed as obstructive to data access and, therefore, operational performance, the regulations actually serve as a guide to the least-possible amount of security necessary to comply. Compliance does not guarantee security.

Obviously, organizations do want to protect their data and the privacy of their customers, but access, insights and performance are paramount. To achieve maximum data access and security, the gap between them must be bridged. So how can this balance best be achieved?

Data Security Tools

Hadoop, as of this writing, has no native data security, although many Hadoop and data security vendors provide add-on solutions.3 These solutions are typically based on access control and/or authentication, as they provide a baseline level of security with relatively high levels of access.

Access Control and Authentication
The most common implementation of authentication in Hadoop is Kerberos.4 With access control and authentication alone, sensitive data are displayed in the clear during job functions, both in transit and at rest. In addition, neither access control nor authentication provides much protection from privileged users, such as developers or system administrators, who can easily bypass them to abuse the data. For these reasons, many regulations, such as the Payment Card Industry Data Security Standard (PCI DSS)5 and the US Health Insurance Portability and Accountability Act (HIPAA),6 require security beyond these controls for compliance.

Coarse-grained Encryption
Starting from a base of access controls and/or authentication, coarse-grained volume or disk encryption is typically the first choice for actual data security in Hadoop. This method is the easiest to implement while still offering regulatory compliance. Data are secure at rest (for archive or disposal), and encryption is typically transparent to authorized users and processes. The result is still a relatively high level of access, but data in transit, in use or in analysis are always in the clear, and privileged users can still access sensitive data. This method protects only against physical theft.

Fine-grained Encryption
Adding strong encryption for columns or fields provides further security, protecting data at rest, in transit and from privileged users, but it requires data to be revealed in the clear (decrypted) to perform job functions, including analysis, as encrypted data are unreadable to users and processes.

Format-preserving encryption preserves the ability of users and applications to read the protected data, but it is one of the slowest-performing encryption processes.

Implementing either of these methods can significantly impact performance, even with the fastest encryption/decryption processes available, such that it negates many of the advantages of the Hadoop platform. As access is paramount, these methods tip the balance too far in the direction of security to be viable.

Some vendors offer a virtual file system above the Hadoop Distributed File System (HDFS), with role-based dynamic data encryption. While this provides some data security in use, it does nothing to protect data in analysis or from privileged users, who can access the operating system (OS) and layers under the virtual layer and get at the data in the clear.

Data Masking
Masking preserves the type and length of structured data, replacing the sensitive values with inert, worthless ones. Because the masked data look and act like the original, they can be read by users and processes.

Static data masking (SDM) permanently replaces sensitive values with inert data. SDM can preserve enough of the original data’s characteristics to support job functions while de-identifying the data. It protects data at rest, in use, in transit, in analysis and from privileged users. However, should the cleartext data ever be needed again (e.g., to carry out marketing operations or in health care scenarios), they are irretrievable. Therefore, SDM is best utilized in test/development environments, in which data that look and act like real data are needed for testing, but sensitive data must not be exposed to developers or system administrators. It is not typically used for data access in a production Hadoop environment. Depending on the masking algorithms used and which data are replaced, SDM data may still be subject to inference and re-identification when combined with other data sources.
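The type- and length-preserving replacement that SDM performs can be sketched as follows. This is a minimal illustration, not a production masking algorithm; the function name and character-class rules are assumptions for the example.

```python
import random
import string

def static_mask(value: str) -> str:
    """Permanently replace each character with an inert value of the same
    character class, preserving the type, length and format of the original."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            out.append(random.choice(string.ascii_uppercase if ch.isupper()
                                     else string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators so the format survives
    return "".join(out)

# The masked card number looks and acts like the original but is worthless
# to a thief, and the original value is irretrievable from it.
masked = static_mask("4532-9812-3456-7890")
print(masked)  # e.g., a different random 16-digit value in the same format
```

Because the replacement is random and no mapping is kept, the process is one-way, which is why SDM suits test/development environments but not production flows that may later need the cleartext.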

Dynamic data masking (DDM) performs masking “on the fly.” As sensitive data are requested, a policy is referenced, and masked values are returned for any data the requesting user or process is not authorized to see in the clear, based on that user’s or process’s role. Much like dynamic data encryption and access control, DDM provides no security for data at rest or in transit and little protection from privileged users. Dynamically masked values can also be problematic to work with in production analytic scenarios, depending on the algorithm/method used.7
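The on-the-fly policy lookup that DDM performs can be sketched as follows. The policy table, field names and roles here are hypothetical; the point is that the stored data remain untouched (and therefore unprotected at rest), with masking applied only at retrieval time.

```python
# Hypothetical role-based policy: which roles may see each field in the clear.
POLICY = {
    "ssn": {"fraud_analyst"},
    "name": {"fraud_analyst", "marketing"},
}

def dynamic_mask(field: str, value: str, role: str) -> str:
    """Return cleartext only if the role is authorized for the field;
    otherwise mask on the fly, leaving the stored value unchanged."""
    if role in POLICY.get(field, set()):
        return value
    return "*" * len(value)  # masked view for unauthorized roles

print(dynamic_mask("ssn", "123-45-6789", "marketing"))      # fully masked
print(dynamic_mask("ssn", "123-45-6789", "fraud_analyst"))  # 123-45-6789
```

Note that a privileged user who reads the underlying store directly bypasses this layer entirely, which is the weakness described above.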

Tokenization also replaces cleartext with a random, inert value of the same data type and length, but the process can be reversible. This is accomplished through the use of token tables, rather than a cryptographic algorithm. In vaultless tokenization, small blocks of the original data are replaced with paired random values from the token tables, with the blocks overlapping one another. Once the entire value has been tokenized, the process is run through again to remove any pattern in the transformation.

However, because the output value still depends deterministically on the input value, a one-to-one relationship with the original data is maintained and, therefore, the tokenized data can be used in analytics as a replacement for the cleartext. Additionally, parts of the cleartext data can be preserved or “bled through” to the token, which is especially useful in cases where only part of the original data is required to perform a job.

Tokenization also allows for flexibility in the levels of data security privileges, as authority can be granted on a field-by-field or partial field basis. Data are secured in all states: at rest, in use, in transit and in analytics.
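A greatly simplified sketch of block-wise, table-based tokenization follows. Real vaultless tokenization uses large, securely generated secret token tables and overlapping blocks; this toy version uses non-overlapping two-digit blocks, two passes and a fixed seed purely for reproducibility, so it only illustrates the reversible, one-to-one mapping.

```python
import random

random.seed(42)  # illustration only; real token tables are secret

def make_table():
    """A token table: a one-to-one mapping over all two-digit blocks."""
    blocks = [f"{i:02d}" for i in range(100)]
    return dict(zip(blocks, random.sample(blocks, len(blocks))))

TABLES = [make_table() for _ in range(2)]  # one table per pass

def _pass(value, table, reverse=False):
    t = {v: k for k, v in table.items()} if reverse else table
    return "".join(t[value[i:i + 2]] for i in range(0, len(value), 2))

def tokenize(pan: str) -> str:
    out = pan
    for table in TABLES:  # the second pass removes block-level patterns
        out = _pass(out, table)
    return out

def detokenize(token: str) -> str:
    out = token
    for table in reversed(TABLES):
        out = _pass(out, table, reverse=True)
    return out

token = tokenize("4532981234567890")
assert detokenize(token) == "4532981234567890"  # reversible via the tables
```

Because the same input always yields the same token, joins and frequency analysis still work on the tokenized values, which is what makes them usable in analytics without detokenization.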

Bridging the Gap

In comparing the methods of fine-grained data security (figure 2), it becomes apparent that tokenization offers the greatest levels of accessibility and security. The randomized token values are worthless to a potential thief, as only those with authorization to access the token table and process can ever expect to return the data to their original value. The ability to use tokenized values in analysis presents added security and efficiency, as the data remain secure and do not require additional processing to unprotect or detokenize them.

Figure 2

This ability to securely extract value from de-identified sensitive data is the key to bridging the gap between privacy and access. Protected data remain useable to most users and processes, and only those with privileges granted through the data security policy can access the sensitive data in the clear.

Data Security Methodology

Data security technology on its own is not enough to ensure an optimized balance of access and security. After all, any system is only as strong as its weakest link and, in data security, that link is often a human one. As such, a clear, concise methodology can be utilized to help optimize data security processes and minimize impact on business operations (figure 3).

The first consideration of data security implementation should be a clear classification of which data are considered sensitive, according to outside regulations and/or internal security mandates. This can include anything from personal information to internal operations analysis results.

Determining where sensitive data are located, where they come from and where they are used is the next step in a basic data security methodology. A specific data type may also need different levels of protection in different parts of the system. Understanding the data flow is vital to protecting the data.

Also, Hadoop should not be considered a silo outside of the enterprise. The analytical processing in Hadoop is typically only part of the overall process—from data sources to Hadoop, up to databases, and on to finer analysis platforms. Implementing enterprisewide data security can more consistently secure data across platforms, minimizing gaps and leakage points.

Next, selecting the security method(s) that best fit the risk, data type and use case of each classification of sensitive data, or data element, ensures that the most effective solution across all sensitive data is employed. For example, while vaultless tokenization offers unparalleled access and security for structured data, such as credit card numbers or names, encryption may be employed for unstructured, nonanalytical data, such as images or other media files.

It is also important to secure data as early as possible, both in Hadoop implementation and in data acquisition/creation. This helps limit the possible exposure of sensitive data in the clear.

Design a data security policy based on the principle of least privilege (i.e., revealing the least possible amount of sensitive data in the clear needed to perform job functions). This may be achieved by creating policy roles that specify either who has access or who does not, whichever group has fewer members. A modern approach to access control can allow different users to see different views of a particular data field, exposing more or less of its sensitive content.
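Such tiered views of a single field might look like the following sketch. The roles and exposure rules are hypothetical examples of least privilege: each role sees only as much of the field as its job requires.

```python
def card_view(pan: str, role: str) -> str:
    """Return a role-appropriate view of a card number under least privilege."""
    if role == "fraud_analyst":
        return pan                      # full cleartext for investigation
    if role == "support_agent":
        return "*" * 12 + pan[-4:]      # last four digits to verify a caller
    return "*" * len(pan)               # everyone else: fully masked

pan = "4532981234567890"
print(card_view(pan, "support_agent"))  # ************7890
```

In practice these views would be enforced centrally by the data security policy, not coded per application, so that the security team controls exposure in one place.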

Assigning the responsibility of data security policy administration and enforcement to the security team is very important. The blurring of lines between security and data management in many organizations leads to potentially severe abuses of sensitive data by privileged users. This separation of duties prevents most abuses by creating strong automated control and accountability for access to data in the clear.

As with any data security solution, extensive sensitive data monitoring should be employed in Hadoop. Even with proper data security in place, intelligent monitoring can add a context-based data access control layer to ensure that data are not abused by authorized users.

What separates an authorized user and a privileged user? Privileged users are typically members of IT who have privileged access to the data platform. These users may include system administrators or analysts who have relatively unfettered access to systems for the purposes of maintenance and development. Authorized users are those who have been granted access to view sensitive data by the security team.

Highly granular monitoring of sensitive data is vital to ensure that both external and internal threats are caught early.


Following these best practices would enable organizations to securely extract sensitive data value and confidently adopt big data platforms with much lower risk of data breach. In addition, protecting and respecting the privacy of customers and individuals helps to protect the organization’s brand and reputation.

The goal of deep data insights, together with true data security, is achievable. With time and knowledge, more and more organizations will reach it.


1 The Apache Software Foundation, The Apache Hadoop software library is a framework that allows for distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
2 Commonly applicable regulations include US Health Insurance Portability and Accountability Act (HIPAA), the Payment Card Industry Data Security Standard (PCI DSS), US Sarbanes-Oxley, and state or national privacy laws.
3 These solution providers include Cloudera, Gazzang, IBM, Intel (open source), MIT (open source), Protegrity and Zettaset, each of which provide one or more of the following solutions: access control, authentication, volume encryption, field/column encryption, masking, tokenization and/or monitoring.
4 Massachusetts Institute of Technology (MIT), USA, Kerberos, originally developed for MIT’s Project Athena, is a widely adopted network authentication protocol. It is designed to provide strong authentication for client-server applications by using secret-key cryptography.
5 PCI Security Standards Council, PCI DSS provides guidance and regulates the protection of payment card data, including the primary account number (PAN), names, personal identification number (PIN) and other components involved in payment card processing.
6 US Department of Health and Human Services, www.hhs.gov/ocr/privacy. The HIPAA Security Rule specifies a series of administrative, physical and technical safeguards for covered entities and their business associates to use to ensure the confidentiality, integrity and availability of electronic protected health information.
7 Dynamically masked values are often independently shuffled, which can dramatically decrease the utility of the data in relationship analytics, as the reference fields no longer line up. In addition, values may end up cross-matching or false matching, if they are truncated or partially replaced with nonrandom data (such as hashes). The issue lies in the fact that masked values are not usually generated dynamically, but referenced dynamically, as a separate masked subset of the original data.

Ulf T. Mattsson is the chief technology officer (CTO) of Protegrity. He created the initial architecture of Protegrity’s database security technology, for which the company owns several key patents. His extensive IT and security industry experience includes 20 years with IBM as a manager of software development and a consulting resource to IBM’s research and development organization in the areas of IT architecture and IT security.


