[PPML] Refine readthedoc development guide (#6685)
* Add migrate without code change.
* Add SGX/TDX design guide.
* Refine link & add description.
parent ebe75782dd
commit 1b46df7e65
3 changed files with 51 additions and 15 deletions

@@ -1,10 +1,43 @@
# Develop your own Big Data & AI applications with BigDL PPML

### 0. Understand E2E Security with PPML

Basic design guidelines for PPML applications are as follows:

* Data in use/computation should be protected by SGX.
* Data in transit/network should be protected by encryption or TLS.
* Data at rest/storage should be protected by encryption.

This design ensures plain-text data is only used inside SGX, while at all other stages data is fully encrypted.



To our knowledge, most existing big data frameworks or systems have already provided network or storage protection. You can find more details in [Secure Your Services](https://bigdl.readthedocs.io/en/latest/doc/PPML/QuickStart/secure_your_services.html).

Please check with your admin or security department for the security features and services available. We recommend building PPML applications based on the following conditions:

1. If you have network and storage protection enabled and want to secure computation with SGX, you can directly migrate your application into SGX with BigDL PPML. Please jump to [Migrate existing applications with BigDL PPML](#1-migrate-existing-applications-with-bigdl-ppml).
2. If you don't have any security features enabled, especially storage protection, you can use `PPMLContext` and the recommended KMS. Please jump to [Enhance your applications with PPMLContext](#2-enhance-your-applications-with-ppmlcontext).

### 1. Migrate existing applications with BigDL PPML

This working model doesn't require any code change. You can reuse existing code and applications. The only difference is that your cluster manager/admin needs to set up a new execution environment for PPML applications.

You can find more details in these articles:

* [Installation for PPML](https://bigdl.readthedocs.io/en/latest/doc/PPML/Overview/install.html).
* [Hello World Example](https://bigdl.readthedocs.io/en/latest/doc/PPML/Overview/quicktour.html).
* [Deployment for production](https://bigdl.readthedocs.io/en/latest/doc/PPML/QuickStart/deploy_ppml_in_production.html).

### 2. Enhance your applications with PPMLContext

In this section, we will introduce how to secure your applications with `PPMLContext`. It requires a few code changes and configurations for your applications.

First, you need to create a `PPMLContext`, which wraps `SparkSession` and provides methods to read encrypted data files into a plain-text RDD/DataFrame and write a DataFrame to encrypted data files. Then you can read & write data through `PPMLContext`.

If you are familiar with Spark, you may find that the usage of `PPMLContext` is very similar to Spark.

#### 2.1 Create PPMLContext

- create a PPMLContext with `appName`
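
For instance, a minimal Python sketch of this step (assuming the `bigdl.ppml` Python package is installed; see the PPMLContext Python API linked at the end of this guide for the authoritative interface):

```python
# A minimal sketch, assuming the bigdl.ppml Python package is installed.
from bigdl.ppml.ppml_context import PPMLContext

# Create a PPMLContext with an application name only; key-management
# and crypto settings can be supplied through additional arguments.
sc = PPMLContext("MyApp")
```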

@@ -213,7 +246,7 @@
</details>

#### 2.2 Read and Write Files

To read/write data, you should set the `CryptoMode`:

@@ -225,8 +258,8 @@

To write data, you should set the `write` mode:

- `overwrite`: Overwrite existing data with the content of the dataframe.
- `append`: Append new content of the dataframe to existing data or table.
- `ignore`: Ignore the current write operation if data/table already exists, without any error.
- `error`: Throw an exception if data or table already exists.
- `errorifexists`: Throw an exception if data or table already exists.
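
For example, a sketch of writing a DataFrame as an encrypted CSV file (assuming `sc` is an initialized `PPMLContext`, `df` is a Spark DataFrame, and the chained `mode`/`option`/`csv` calls follow the PPMLContext Python API linked at the end of this guide):

```python
# A minimal sketch: `sc` is an initialized PPMLContext and `df` is a
# Spark DataFrame; CryptoMode is assumed to come from the bigdl.ppml
# package (e.g., via `from bigdl.ppml.ppml_context import *`).
# crypto_mode selects the encryption scheme for the output files;
# mode("overwrite") replaces any existing data at the target path.
sc.write(dataframe=df, crypto_mode=CryptoMode.AES_CBC_PKCS5PADDING) \
    .mode("overwrite") \
    .option("header", True) \
    .csv("/path/to/encrypted_output")
```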

@@ -268,7 +301,7 @@ sc.write(dataframe = df, crypto_mode = CryptoMode.AES_CBC_PKCS5PADDING)

<details><summary>expand to see the examples of reading/writing CSV, PARQUET, JSON and text file</summary>

The following examples use `sc` to represent an initialized `PPMLContext`.

**read/write CSV file**

@@ -501,4 +534,4 @@ rdd2 = sc.textfile(path=encrypted_csv_path, crypto_mode=CryptoMode.AES_CBC_PKCS5
</details>

For more usage of the `PPMLContext` Python API, please refer to [PPMLContext Python API](https://github.com/intel-analytics/BigDL/blob/main/python/ppml/src/bigdl/ppml/README.md).

@@ -2,22 +2,25 @@

This document is a gentle reminder for enabling security & privacy features for your services. To avoid privacy & security issues during deployment, we recommend developers/admins go through this document, which suits users/customers who want to apply BigDL in their production environment (not just for PPML).

## Security in the data lifecycle

Almost all Big Data & AI applications are built upon large-scale datasets, so we can simply go through the key security steps in the data lifecycle. That is, data protection:

* In transit, i.e., network.
* At rest, i.e., storage.
* In use, i.e., computation.

### Secure Network (in transit)

Big Data & AI applications are mainly distributed applications, which means we need to use lots of nodes to run our applications and get jobs done. During that period, not just control flows (commands used to control applications running on different nodes) but also data partitions (divisions of the data) may go through different nodes. So, we need to ensure all network traffic is fully protected.

For secure data transit, TLS is commonly used. The server provides a private key and a certificate chain; to be fully secure, a complete certificate chain (two or more certificates) is needed. In addition, use an up-to-date SSL/TLS protocol version and secure cipher suites, and prefer forward secrecy and strong key exchange. Such security measures generally bring some performance overhead; to mitigate it, approaches such as session resumption and caching are available. For the details of this section, please see [SSL-and-TLS-Deployment-Best-Practices](https://github.com/ssllabs/research/wiki/SSL-and-TLS-Deployment-Best-Practices).
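
As a generic illustration (not BigDL-specific), here is a minimal Python sketch of a hardened TLS client context using the standard `ssl` module; the minimum version and cipher selection below are illustrative choices, not a complete hardening policy:

```python
import ssl

# Secure defaults: certificate verification and hostname checking on.
context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)

# Refuse protocol versions older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2

# Prefer forward-secret (ECDHE) cipher suites; illustrative selection.
context.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")
```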

### Secure Storage (in storage)

Besides network traffic, we also need to ensure data is safely stored in storage. In Big Data & AI applications, data is mainly stored in distributed storage or cloud storage, e.g., HDFS, Ceph, AWS S3, etc. This makes storage security a bit different. We need to ensure each storage node is secured with the correct settings; meanwhile, we need to ensure the whole storage system is secured (network, access control, authentication, etc.).
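
Where the storage layer cannot provide transparent encryption, data can be encrypted client-side before it is written. A minimal sketch using the third-party `cryptography` package (an assumption; `pip install cryptography`, and in production the key should come from a KMS rather than be generated inline):

```python
from cryptography.fernet import Fernet

# In production, fetch this key from a KMS instead of generating it here.
key = Fernet.generate_key()
f = Fernet(key)

# Encrypt the file content before handing it to HDFS/S3/Ceph.
with open("data.csv", "rb") as src:
    ciphertext = f.encrypt(src.read())
with open("data.csv.enc", "wb") as dst:
    dst.write(ciphertext)

# Decrypt later, inside the trusted environment.
plaintext = f.decrypt(ciphertext)
```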

### Secure Computation (in use)

Even if data is fully encrypted in transit and in storage, we still need to decrypt it when we perform computation on it. If this stage is not secure, then the data was never really protected. That's why TEEs (SGX/TDX) are so important. In Big Data & AI, applications and data are distributed across different nodes. If any of these nodes is controlled by an adversary, they can simply dump sensitive data from memory or crash your applications. There are lots of security technologies to ensure computation safety. Please check that they are correctly enabled.

## Example: Spark on Kubernetes with data stored on HDFS

@@ -33,7 +36,7 @@ In most cases, AES encryption key is not necessary, because Hadoop KMS and Spark

### [HDFS Security](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SecureMode.html)

Please ensure authentication and [access control](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html) are correctly configured. Note that HDFS authentication relies on [Kerberos](http://web.mit.edu/kerberos/krb5-1.12/doc/user/user_commands/kinit.html).

Enable [Data_confidentiality](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SecureMode.html#Data_confidentiality) for the network. This will protect RPC, block transfer, and HTTP.

BIN docs/readthedocs/source/doc/PPML/images/ppml_dev_basic.png (new file, 89 KiB)
Binary file not shown.