Amazon SageMaker is a service from Amazon Web Services (AWS) that makes it easy to build machine learning models and deploy them in the cloud. If a website is running a SageMaker model, site visitors can upload data and receive the results of running that data through the model.
An open-source prototype of the privacy-preserving version of XGBoost can be found at GitHub.
All transmissions to and from a SageMaker model are encrypted, but in some cases, customers may be wary of having sensitive data decrypted for analysis. Privacy-preserving machine learning (PPML) is a class of techniques that let machine learning models compute directly on encrypted data, returning encrypted results. Only the person who encrypted the input data can decrypt the result.
At the NeurIPS Workshop on Privacy-Preserving Machine Learning in December, we will present a privacy-preserving version of a machine learning algorithm called XGBoost, which produces models known as gradient-boosted decision trees. XGBoost is one of the most popular machine learning algorithms offered through Amazon SageMaker.
PPML is rarely used because it adds computational overhead, which often makes the resulting models too slow to be practical. But we tested our algorithm on a model that’s roughly 500 kilobytes in size and found that the privacy-preserving version takes around 0.4 seconds to produce a result (compared to 1 millisecond for the unencrypted version). That’s consistent with many cloud-based machine learning tasks currently initiated from smartphones.
We have open-sourced our prototype, and the code is available in AWS Labs.
XGBoost (for Extreme Gradient Boosting) is an optimized, distributed, gradient-boosting machine learning framework designed to be highly efficient and flexible. Given a set of training data, XGBoost produces a set of parallel classification and regression trees. Each tree evaluates an input query by making a branching decision at each node of the tree until a leaf node is reached, at which point it outputs a numerical score. The results across all of the trees are summed to give an overall output.
To design a privacy-preserving XGBoost inference algorithm, we use several cryptographic tools. One is order-preserving encryption (OPE), which allows data to be encrypted in such a way that the ciphertexts — the encrypted versions of the data — preserve the order for the plaintexts — the unencrypted versions. That is, for any plaintexts a and b, a > b if and only if the ciphertext of a is greater than that of b (Enc(a) > Enc(b)), and vice versa.
Another tool is pseudo-random functions (PRFs), which allow us to construct functions that are essentially indistinguishable from a random function. PRFs are used to generate pseudorandom “features names” for the values that the tree nodes are testing, and OPEs are used to encrypt the values those features are being tested against.
We also use additively homomorphic encryption (AHE), which is a semantically secure homomorphic encryption scheme that allows us to evaluate the addition function over ciphertexts. That is, there exists a homomorphic evaluation function that, given two ciphertexts, can compute the encryption of the sum of the corresponding plaintexts (Eval(Enc(a), Enc(b), +) = Enc(a+b)). AHE allows us to combine the outputs of the regression trees to obtain the final encrypted result.
During operation, the site visitor’s computer encrypts a plaintext query with OPE. It sends an encrypted query to the server hosting the PPML model. The server evaluates each tree on the encrypted query and obtains a set of encrypted leaf values. Then the server homomorphically sums all encrypted leaf values and returns the result to the visitor’s computer, which can decrypt it to obtain the final prediction. In our paper we show that the server can homomorphically compute the softmax function, commonly used in XGBoost, as well as the sum.
Future work includes support for more learning parameters in the privacy-preserving version and the use of secure multiparty computation that executes secure comparisons for each decision node in an encrypted regression tree.