The number of data breaches annually totals thousands in the United States alone.1 More importantly, the total number of records exposed has reached the billions.2 The average cost of a data breach to a single enterprise is now averaging US $3.6 million per incident.3
While the frequency of incidents has been on the rise, the impact to people is, unfortunately, increasing in severity. The most recent incident was the Equifax breach, in which 143 million Americans’ Social Security numbers, background information, birth dates, and even driver’s license numbers and credit card numbers were accessed. While debates ensue regarding the use of Social Security numbers as identification, the issue that can be immediately addressed today is the protection of databases by encrypting data.
While human error is the cause of many data breaches, another leading cause is the lack of strong cryptographic mechanisms. Security professionals must rely on security by layers. While the recent Equifax hack is widely believed to have occurred due to unapplied security patches, if the database was encrypted, the attackers would have had a more difficult time accessing the data.
Thus, rather than storing and protecting data by building a fortress of software security applications, a promising approach of encrypting the database itself is emerging. Naturally, the question that arises is how to encrypt the database yet allow the data to be queried and accessed. This article surveys an array of cryptographic approaches to protect against data breaches while allowing computation and queries to be executed by applications.
In addition, when encryption is not enough, enterprises can leverage collecting and storing privatized data and then computing over the privatized information. For example, in the self-driving car scenario, suppose manufacturers are unwilling to share original data with other manufacturers, but the machine learning engineers need data to understand how to model the interactions with other vehicles. Rather than releasing the data in its entirety, privatized data can be shared. Sharing privatized data avoids the issue of incurring privacy violations that occur when the original data have been shared.
In the past, there have been numerous privacy violations that have led to fines4 in the millions of US dollars, such as Google being fined US $22.5 million,5 or even the shutdown of projects, such as the Netflix competition.6 As such, this article also examines privacy protection methods such as differential privacy, which has become the gold standard in the privacy research community, and how these methods can benefit enterprises.
Another example that has affected government and high-profile individuals is Uber sharing passengers’ ride data with China’s Baidu.7 This is a huge risk, as high-profile US government employees’ personal information and mobility data were shared. Thus, it is very important to pay attention to the privatization of data when publicly posting data and when evaluating business partner data-sharing agreements.
This article also examines privacy protection methods and how they can benefit enterprises when applied judiciously. Privatizing data provides a complement to simply confidentially storing data.
Cryptographic Data Breach Protection
There are several state-of-the-art cryptographic techniques to protect against unintentional data breaches.
Fully homomorphic encryption (FHE) was first described in 2009.8 FHE enabled, for the first time, the ability to compute both addition and multiplication gates on encrypted data. This breakthrough means that any arbitrary computation can be computed. For example, data can be encrypted by the data owner and then the encrypted data can be uploaded to the cloud for computation. Then the encrypted result is downloaded by the data owner and must be decrypted.
However, there are drawbacks to FHE. One is that the output result is encrypted and, thus, requires that decryption be appropriately handled and secured. The only guarantee is that the cloud will not learn either the data or the result. Another drawback is that FHE is still not practical because arbitrary computations are still roughly six to seven orders of magnitude slower than computing an unencrypted program.9
Functional encryption was introduced in 2005.10 The beauty of functional encryption is that the data owner encrypts the data only once, yet an authorized party is able to compute over the encrypted data and learn the result. This means a database could be encrypted and authorized applications or analysts can compute and learn only approved functions. This could greatly limit unintentional data breaches. One drawback is that functional encryption for arbitrary functions is not yet practical. However, for simpler functionalities such as role-based access controls, attribute-based encryption is practical and runs on smartphones.11, 12, 13 For example, the US Defense Advanced Research Projects Agency (DARPA) Content-Based Mobile Edge Networking (CBMEN) Edge Networking with Content-Oriented Declarative Enhanced Routing and Storage (ENCODERS)14 project15 utilized multiauthority, attribute-based encryption16 on Android smartphones to enable devices to securely communicate and share data with each other without relying on a centralized service.
Thus, functional encryption holds promise as architectures transition from centralized strongholds of data to more decentralized data stores, greatly improving security. Over time, it can be expected that cryptography performance will improve, enabling more applications to be protected with functional encryption.
A more practical approach has been taken by mechanisms that seek to protect against database thefts. That is, if an adversary obtains a “snapshot” of the database, the adversary should not be able to recover any data from the database.
There are several cryptographic-specific techniques to enable a wide array of use cases including order-revealing encryption, deterministic encryption and searchable encryption.
Order-revealing encryption enables the ability to perform comparisons over encrypted data such as range queries, sorting and threshold filters.17 Deterministic encryption generates the same ciphertext given a plaintext and key combination.18 Deterministic encryption is useful for searching over encrypted data and quickly finding a matching row. Searchable encryption is a symmetric primitive that enables the ability to search the rows of a database and return rows that match a specific keyword.19 Typically, this has been implemented as conjunctive Boolean queries.20
There are a couple of deployable and practical encrypted solutions such as CryptDB21 and Microsoft’s Always Encrypted Structured Query Language (SQL) Server.22 Both of these solutions strive to execute SQL queries over encrypted data with low overhead and an additional throughput cost of no more than 30 percent.23
However, there are some hidden drawbacks. These cryptographic techniques “leak” data such that an adversary can reconstruct many of the original database rows. For example, researchers were able to recover the majority of sensitive encrypted data by performing frequency analysis combined with auxiliary analysis of readily available public records.24 In another example, researchers were able to exploit the query times and cache analysis to reconstruct rows of the database.25, 26
The authors of CryptDB have claimed improper and invalid use of the database. Though it is important to note that when designing and using secure end-to-end systems, hackers will try to exploit exactly these weak points. Thus, research continues in this area.
Learning Over Privatized Data
The final approach to protecting against unintentional breaches is learning over privatized data. In contrast to the previously described approaches where computation is over encrypted data, computation is instead performed over data that have been privatized. That is, the data are stored as plaintext and the result is immediately known.
Why would someone do this? The data stored are not the original ground truth data. Rather, the data are perturbed (e.g., randomized) such that if they are taken alone, they will not provide an attacker with enough information to recover the original data. For example, rather than an individual revealing their exact birthday for census purposes, they could reveal the year of their birth plus or minus a few years so that the census bureau is still able to learn aggregate statistics regarding the population.
Thus, to execute meaningful queries over the privatized data, there must be a sufficient number of privatized database rows so that the accuracy of the final output can be achieved. There exists a trade-off between the privacy strength and accuracy. Stronger privacy typically results in more error. In the birth date example, strong privacy that adjusts the year by a couple decades would not give meaningful, accurate information to the census bureau. The exact privacy strength varies for each application, individual preferences and adversary settings.
Why go through all this trouble to privatize data? The benefit is that even if a database breach occurred, any particular data owner’s data are unable to be recovered, as they are perturbed and randomized. At the same time, privacy is ensured, as the original data are never stored.
Today, differential privacy has become the gold standard by which to measure privacy.27, 28 Roughly speaking, differential privacy says that a query against two databases that differ by at most one row is ε indistinguishable (where ε is the acceptable privacy leakage). Thus, a data owner is able to safely participate in sharing his/her personal data since there is minimal risk that personal information will be leaked.
Interestingly, differential privacy was proposed as a direct response to the Netflix Prize fiasco. Today, it serves as the gold standard by which all other privacy definitions are measured.
Privacy-Preserving Data Collection
Centralized and distributed are two architecture models used to privately collect personal data. This article does not examine centralized, as centralized represents a single point of failure for attacks and data breaches.
The distributed model is better suited to defend against data breaches, as any data that are collected and stored have already been sanitized and privatized. Each data owner individually and independently privatizes their data and then uploads those data to the cloud storage. This is still an ongoing area of research whereby scalable techniques to preserve accuracy and protect privacy are being developed.
However, there are several practical deployments by companies such as Google and Apple.29 In addition, universities are deploying differential privacy systems to enable the “smart campus,” such as CrowdZen at the University of California, Los Angeles (USA).
Cryptographic approaches and privacy techniques are critical to preventing data breaches. State-of-the-art cryptographic techniques and methods to protect data owners’ personal data have pros and cons and are continually evolving.
1 Identity Theft Resource Center, “Data Breaches Increase 40 Percent in 2016, Finds New Report From Identity Theft Resource Center and Cyberscout,” 19 January 2017
2 Breach Level Index, “Data Breach Statistics,” http://breachlevelindex.com
3 IBM Security, “Examining the Cost of a Data Breach,” https://databreachcalculator.mybluemix.net/
4 International Association of Privacy Professionals, “Top 20 Government-Imposed Data Privacy Fines Worldwide, 1999-2014,” https://iapp.org/resources/article/top-20-government-imposed-data-privacy-fines-worldwide-1999-2014/
5 Federal Trade Commission, “Google Will Pay $22.5 Million to Settle FTC Charges It Misrepresented Privacy Assurances to Users of Apple’s Safari Internet Browser,” USA, 9 August 2012, https://www.ftc.gov/news-events/press-releases/2012/08/google-will-pay-225-million-settle-ftc-charges-it-misrepresented
6 Singel, R.; “Netflix Cancels Recommendation Contest After Privacy Lawsuit,” Wired, 12 March 2010, https://www.wired.com/2010/03/netflix-cancels-contest/
7 Rodriguez, S.; “Someone in China Might Know Where You Rode in Ubers Last Year,” Inc., 9 March 2017, https://www.inc.com/salvador-rodriguez/uber-baidu-security.html
8 Gentry, C.; “Fully Homomorphic Encryption Using Ideal Lattices,” Proceedings of the 41st Annual ACM Symposium on Theory of Computing, 2009, p. 169–178
9 Brown, B.; “How to Make Fully Homomorphic Encryption ‘Practical and Usable’,” Network World, 15 May 2017, www.networkworld.com/article/3196121/security/how-to-make-fully-homomorphic-encryption-practical-and-usable.html
10 Sahai, A.; B. Waters; “Fuzzy Identity-Based Encryption,” Advances in Cryptology—EUROCRYPT 2005, Aarhus, Denmark, 22–26 May 2005, p. 457–473
11 Goyal, V.; A. Jain; O. Pandey; A. Sahai; “Bounded Ciphertext Policy Attribute Based Encryption,” Automata, Languages and Programming, 2008, p. 579–591
12 Lee, U.; J. Joy; Y. Noh; “Secure Personal Content Networking Over Untrusted Devices,” Wireless Personal Communications, vol. 80, iss. 4, p. 1449–1473
13 Bethencourt, J.; A. Sahai; B. Waters; “Ciphertext-Policy Attribute-Based Encryption,” Institute of Electrical and Electronics Engineers (IEEE) Symposium on Security and Privacy, 2007
14 ENCODERS, Github Repository, https://github.com/SRI-CSL/ENCODERS
15 SRI International, “ENCODERS,” http://encoders.csl.sri.com/
16 Chase, M.; Multi Authority Attribute Based Encryption, TCC 2007, p. 515–534
17 Boneh, D.; K. Lewi; M. Raykova; A. Sahai; M. Zhandry; J. Zimmerman; “Semantically Secure Order-Revealing Encryption: Multi-Input Functional Encryption Without Obfuscation,” Advances in Cryptology—EUROCRYPT, 2015, p. 563–594
18 Bellare, M.; A. Boldyreva; A. O’Neill; “Deterministic and Efficiently Searchable Encryption,” Advances in Cryptology—CRYPTO 2007, 2007, p. 535–552
19 Curtmola, R.; J. Garay; S. Kamara; R. Ostrovsky; “Searchable Symmetric Encryption: Improved Definitions and Efficient Constructions,” Proceedings of the 13th ACM Conference on Computer and Communications Security, 2006, p. 79–88
20 Cash, D.; S. Jarecki; C. Jutla; H. Krawczyk; M. Rosu, M. Steiner; “Highly-Scalable Searchable Symmetric Encryption With Support for Boolean Queries,” Advances in Cryptology—CRYPTO 2013, 2013, p. 353–373
21 Popa, R. A.; C. M. S. Redfield; N. Zeldovich; H. Balakrishnan; “CryptDB: Protecting Confidentiality With Encrypted Query Processing,” Proceedings of the 23rd ACM Symposium on Operating Systems Principles, 2011, p.85–100
22 Microsoft, “Always Encrypted (Database Engine),” 24 April 2017, https://docs.microsoft.com/en-us/sql/relational-databases/security/encryption/always-encrypted-database-engine
24 Naveed, M.; S. Kamara; C. V. Wright; “Inference Attacks on Property-Preserving Encrypted Databases,” Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015, p. 644–655
25 Grubbs, P.; K. Sekniqi; V. Bindschaedler; M. Naveed; T. Ristenpart; “Leakage-Abuse Attacks Against Order-Revealing Encryption,” 2017 IEEE Symposium on Security and Privacy, 2017, p. 655–72
26 Grubbs, P.; R. McPherson; M. Naveed; T. Ristenpart; V. Shmatikov; “Breaking Web Applications Built on Top of Encrypted Data,” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, p. 1353–1364
27 Dwork, C.; “Differential Privacy,” Automata, Languages and Programming, 2006, p. 1–12
28 Dwork, C.; F. McSherry; K. Nissim; A. Smith; “Calibrating Noise to Sensitivity in Private Data Analysis,” Theory of Cryptography, 2006, p. 265–284
29 Erlingsson, U.; V. Pihur; A. Korolova; “RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response,” Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 2014, p. 1054–1067
Is a Ph.D. candidate in computer science. He has developed scalable privacy mechanisms and is interested in data collection techniques enabling open data platforms. Joy’s recent projects include the highly successful CrowdZen at the University of California, Los Angeles (USA), which is known as “Waze for the college campus.”