Machine Learning (ML) and, in particular, Deep Learning are drastically changing the way we conduct business: data can now be used to guide business strategies and create new value, analyze customers and predict their behavior, and even support medical diagnosis and care. We may think data is at risk only when these algorithms recommend and direct our purchases on social media or monitor our doorways, elderly, and youngsters, but that is just the tip of the iceberg. Data is used to make banking decisions, detect fraudulent transactions, and set insurance rates. In all these cases, the data is intertwined with sensitive information about enterprises and even individuals, and the benefits are entangled with data risks.
One of the most critical challenges companies face today is understanding how to handle and protect their own data while using it to improve their businesses through ML solutions. This data includes customers’ personal information as well as business data, such as a company’s own sales figures. Clearly, it is essential for a company to handle and protect such data correctly, since its exposure would be a massive vulnerability.
In the sections that follow, we detail how existing methods are complementary to Protopia AI’s Stained Glass Transform™ solution and where other solutions fall short.
Federated Learning: To protect training data, Google introduced Federated Learning [1], a distributed learning framework in which the devices that store data locally collaboratively learn a shared ML model without exposing the training data to a centralized training platform. The idea is to send only the ML model’s parameters to the cloud, thus protecting the sensitive training data. However, several works in the literature have demonstrated that an attacker can use observations of an ML model’s parameters to infer private information contained in the training data, such as class representatives, membership, and properties of subsets of the training data [2]. Moreover, Federated Learning ignores the inference stage of the ML lifecycle, so running inference still exposes the data to the ML model, whether it runs in the cloud or on an edge device.
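To make the federated averaging idea concrete, here is a minimal Python sketch with a toy least-squares model and a hypothetical two-client setup (an illustration of the FedAvg pattern, not Google's implementation): each client trains locally on data that never leaves it, and the server only ever sees parameter vectors.

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1, epochs=5):
    """One client's local training: plain gradient descent on a
    least-squares loss. The raw data (X, y) never leaves the client."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(global_w, clients):
    """Server step: aggregate only model parameters, weighted by each
    client's sample count (FedAvg-style). The server never sees data."""
    total = sum(len(y) for _, y in clients)
    return sum(len(y) / total * local_update(global_w, X, y)
               for X, y in clients)

# Hypothetical setup: two clients whose private data follows y = 2x.
rng = np.random.default_rng(0)
clients = []
for _ in range(2):
    X = rng.normal(size=(50, 1))
    clients.append((X, X @ np.array([2.0])))

w = np.zeros(1)
for _ in range(20):  # communication rounds
    w = federated_average(w, clients)
# w converges toward the true coefficient 2.0
```

Note that while the raw records stay on-device, the parameter updates themselves are exactly the signal the inference attacks cited above exploit.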
Differential Privacy: Differential Privacy has received significant attention. This method bounds how much any single record in the training dataset can contribute to the resulting machine learning model: it guarantees that removing or replacing any one record changes the model’s output distribution by no more than a fixed factor. While very important, training in a differentially private manner still requires access to plaintext data. More importantly, differential privacy does not address the inference stage in any way.
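The bounded-contribution guarantee is easiest to see on a single query rather than on full model training. The sketch below applies the classic Laplace mechanism to a clipped mean; this is a simpler stand-in for differentially private training itself, and the clipping range and epsilon here are arbitrary illustrative choices.

```python
import numpy as np

def private_mean(values, lower, upper, epsilon, rng):
    """Laplace mechanism: after clipping each record to [lower, upper],
    replacing any single record changes the mean by at most
    (upper - lower) / n (the sensitivity), so adding Laplace noise with
    scale sensitivity / epsilon gives an epsilon-DP answer."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return clipped.mean() + noise

rng = np.random.default_rng(1)
values = rng.uniform(size=10_000)  # 10,000 records in [0, 1]
result = private_mean(values, 0.0, 1.0, epsilon=1.0, rng=rng)
# With many records the noisy answer stays close to the true mean,
# while any single record's influence on it is provably bounded.
```

The key observation for this article: `values` is handled in plaintext throughout, which is exactly the limitation the paragraph above describes.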
Synthetic Data: Another way to protect sensitive training data is to train the ML model on Synthetic Data. However, the generated synthetic data may fail to cover real-world data subspaces that are essential for training a model that remains reliable at inference time. This can cause accuracy losses significant enough to make the model unusable after deployment. Moreover, the trained model still needs real data to perform inference and prediction, and there is no escaping the challenges of this stage, where synthetic data cannot be used.
Secure Multi-Party Computation and Homomorphic Encryption: Two cryptographic techniques for privacy-preserving computation are Secure Multi-Party Computation (SMC) and Homomorphic Encryption (HE). In SMC, the computation is distributed over multiple secure platforms, which incurs significant computation and communication costs that can be prohibitive in many cases [3]. Homomorphic encryption is even more costly, as it operates directly on encrypted data and, even with custom hardware, is orders of magnitude slower than plaintext computation [4]. Moreover, deep neural networks, today the most widely used ML solution in many domains, require modifications to be usable in a framework that relies on HE [5].
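As an illustration of what “computing on encrypted data” means, here is a toy additively homomorphic scheme in the style of Paillier, where multiplying two ciphertexts decrypts to the sum of the plaintexts. The primes are deliberately tiny and the scheme as written is completely insecure; it is for exposition only (requires Python 3.9+ for `math.lcm`).

```python
import math
import random

# Toy Paillier-style scheme: E(a) * E(b) mod n^2 decrypts to a + b.
# Tiny primes for illustration only -- completely insecure.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)          # Carmichael function of n

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)   # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:        # blinding factor must be a unit mod n
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

a, b = 17, 25
total = decrypt((encrypt(a) * encrypt(b)) % n2)  # addition done on ciphertexts
```

Even in this toy form, every homomorphic operation involves modular exponentiation with numbers far larger than the plaintexts, which hints at why HE inference over deep networks is orders of magnitude slower than plaintext computation.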
Confidential Computing: Confidential computing focuses on protecting data during use. Many large companies, including Google, Intel, Meta, and Microsoft, have already joined the Confidential Computing Consortium, established in 2019 to promote hardware-based Trusted Execution Environments (TEEs). This solution aims at protecting data while it is being used by isolating computations inside these hardware-based TEEs. The main drawback of Confidential Computing is that it forces companies to take on the cost of migrating their ML-based services to platforms that provide such specialized hardware infrastructures. At the same time, this solution cannot be considered risk-free. Indeed, in May 2021, a group of researchers introduced SmashEx [6], an attack that allows collecting and corrupting data from TEEs that rely on the Intel Software Guard Extensions (SGX) technology. Protopia AI’s Stained Glass Transform™ technology can transform data before it enters the trusted execution environment; as such, it is complementary and minimizes the attack surface along an orthogonal axis. With Protopia AI’s solution, even if the TEE is breached, the plaintext data is no longer there.
In conclusion, enterprises have struggled to understand how to protect sensitive information when using their data during the training and inference stages of the ML lifecycle. Questions of data ownership, and of which parties, platforms, and algorithms sensitive data gets exposed to during ML processes, are a central challenge to enabling ML solutions and unlocking their value in today’s enterprise. Protopia AI’s Stained Glass Transform™ solution privatizes and protects ML data during both training and inference, for any ML application and data type. These lightweight transformations decouple ownership of the plain/raw sensitive information in real data from the ML process without imposing significant overhead in the critical path or requiring specialized hardware.
Note: Thanks to Protopia AI for the thought-leadership/educational article above. Protopia AI has supported and sponsored this content. For more information, products, sales, and marketing, please contact the Protopia AI team at email@example.com
[1] McMahan, Brendan, et al. “Communication-Efficient Learning of Deep Networks from Decentralized Data.” Artificial Intelligence and Statistics. PMLR, 2017.
[2] Lyu, Lingjuan, et al. “Privacy and Robustness in Federated Learning: Attacks and Defenses.” arXiv preprint arXiv:2012.06337 (2020).
[3] Mohassel, Payman, and Yupeng Zhang. “SecureML: A System for Scalable Privacy-Preserving Machine Learning.” 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017.
[4] Xie, Pengtao, et al. “Crypto-Nets: Neural Networks over Encrypted Data.” arXiv preprint arXiv:1412.6181 (2014).
[5] Chabanne, Hervé, et al. “Privacy-Preserving Classification on Deep Neural Network.” Cryptology ePrint Archive (2017).
[6] Cui, Jinhua, et al. “SmashEx: Smashing SGX Enclaves Using Exceptions.” Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021.
Luca is a Ph.D. student at the Department of Computer Science of the University of Milan. His interests include Machine Learning, Data Analysis, IoT, Mobile Programming, and Indoor Positioning. His research currently focuses on Pervasive Computing, Context-Awareness, Explainable AI, and Human Activity Recognition in smart environments.