[PPML] Refine Trusted FL Doc (#3967)

* Refine overview
* Refine Prerequisite
* Refine example
* Refine typo
2022-02-09 10:09:19 +08:00
commit 6846feb44e
parent 1e09804c37
1 changed files with 70 additions and 56 deletions
--- a/docs/readthedocs/source/doc/PPML/Overview/trusted_fl.md
+++ b/docs/readthedocs/source/doc/PPML/Overview/trusted_fl.md
@@ -1,41 +1,72 @@
 # Trusted FL (Federated Learning)

-SGX-based End-to-end Trusted FL platform
+[Federated Learning](https://en.wikipedia.org/wiki/Federated_learning) is a new tool in PPML (Privacy Preserving Machine Learning) that lets multiple parties build a joint model without compromising privacy, even if these parties hold different datasets or features. During the FL training stage, sensitive data is kept local; only temporary gradients or weights are securely aggregated by a trusted third party. In our design, this trusted third party is fully protected by Intel SGX.

-## ID & Feature align
+A number of FL tools and frameworks have been proposed to enable FL in different areas, e.g., OpenFL, TensorFlow Federated, FATE, Flower, and PySyft. However, none of them is designed for big data scenarios. To enable FL in the big data ecosystem, BigDL PPML provides an SGX-based end-to-end Trusted FL platform. With this platform, data scientists and developers can easily set up FL applications on distributed large-scale datasets with a few clicks. To achieve this goal, we provide the following features:

-Before we start Federated Learning, we need to align ID & Feature, and figure out portions of local data that will participate in later training stage.
+ * ID & feature alignment: identify the portions of local data that will participate in the training stage
+ * Horizontal FL: training across multiple parties with the same features and different entities
+ * Vertical FL: training across multiple parties with the same entities and different features

-Let RID1 and RID2 be randomized ID from party 1 and party 2.
+To ensure sensitive data is fully protected in the training and inference stages, we make sure that:

-## Vertical FL
+ * Sensitive data and weights are kept local; only temporary gradients or weights are securely aggregated by a trusted third party
+ * The trusted third party, i.e., the FL Server, is protected by SGX Enclaves
+ * The local training environment is protected by SGX Enclaves (recommended but not enforced)
+ * Network communication and storage (e.g., data and models) are protected by encryption and [Transport Layer Security (TLS)](https://en.wikipedia.org/wiki/Transport_Layer_Security)

-Vertical FL training across multi-parties with different features.
+That is, even when the program runs in an untrusted cloud environment, all the data and models are protected (e.g., using encryption) on disk and network, and the compute and memory are also protected using SGX Enclaves.

-Key features:
+## Prerequisite

-* FL Server in SGX
-    * ID & feature align
-    * Forward & backward aggregation
-* Training node in SGX
+Please ensure SGX is properly enabled and the SGX driver is installed. If not, please refer to [Install SGX Driver](https://bigdl.readthedocs.io/en/latest/doc/PPML/Overview/ppml.html#prerequisite).

-## Horizontal FL
+### Prepare Keys & Dataset

-Horizontal FL training across multi-parties.
+1. Generate the signing key for SGX Enclaves

-Key features:
+   Generate the enclave key using the command below, and keep it safe for future remote attestations and to start SGX Enclaves more securely. The command generates a file `enclave-key.pem` in the current working directory, which will be the enclave key. To store the key elsewhere, modify the output file path.

-* FL Server in SGX
-   * ID & feature align (optional)
-   * Weight/Gradient Aggregation in SGX
-* Training Worker in SGX
-## Example 
+    ```bash
+    cd scripts/
+    openssl genrsa -3 -out enclave-key.pem 3072
+    cd ..
+    ```

-### Before running code
+    Then modify `ENCLAVE_KEY_PATH` in `deploy_fl_container.sh` with your path to `enclave-key.pem`.

-#### **Prepare Docker Image**
+2. Prepare keys for TLS with root permission (for testing only; you will be asked to enter passwords for the keys). Please also install JDK/OpenJDK and add Java to your environment path so that `keytool` is available.

-##### **Build jar from Source**
+    ```bash
+    cd scripts/
+    ./generate-keys.sh
+    cd ..
+    ```
+
+    When prompted for a passphrase or password, you can enter the same password each time; these passwords can also be reused in the next step when generating other passwords. A password should be longer than 6 characters and contain both numbers and letters; one sample password is "3456abcd". These passwords will be used for future remote attestations and to start SGX enclaves more securely. This script generates 6 files in the `./ppml/scripts/keys` directory (you can replace them with your own TLS keys).
+
+    ```bash
+    keystore.jks
+    keystore.pkcs12
+    server.crt
+    server.csr
+    server.key
+    server.pem
+    ```
+
+    If running in a container, modify `KEYS_PATH` in `deploy_fl_container.sh` to point to the `keys/` directory generated in the last step. This directory will be mounted to the container's `/ppml/trusted-big-data-ml/work/keys`; then set `privateKeyFilePath` and `certChainFilePath` in `ppml-conf.yaml` to the corresponding absolute paths inside the container. If not running in a container, just set `privateKeyFilePath` and `certChainFilePath` in `ppml-conf.yaml` to your local paths. If you don't want to build a TLS channel with a certificate, just delete `privateKeyFilePath` and `certChainFilePath` from `ppml-conf.yaml`.
+
+3. Prepare the dataset for FL training. For demo purposes, we have added a public dataset in [BigDL PPML Demo data](https://github.com/intel-analytics/BigDL/tree/branch-2.0/scala/ppml/demo/data). Please download the data to your local machine. Then, in `deploy_fl_container.sh`, set `DATA_PATH` to the absolute path of `./data` on your machine and set your local IP. The `./data` path will be mounted to the container's `/ppml/trusted-big-data-ml/work/data`, so if you don't run in a container, you need to modify the data path in `runH_VflClient1_2.sh`.
+
+### Prepare Docker Image
+
+Pull the image from Docker Hub:
+
+```bash
+docker pull intelanalytics/bigdl-ppml-trusted-big-data-fl-scala-graphene:0.14.0-SNAPSHOT
+```
+
+If Docker Hub is not accessible, you can build the Docker image from the BigDL source code:

 ```bash
 cd BigDL/scala && bash make-dist.sh -DskipTests -Pspark_3.x
@@ -43,43 +74,17 @@ mv ppml/target/bigdl-ppml-spark_3.1.2-0.14.0-SNAPSHOT-jar-with-dependencies.jar
 cd ppml/demo
 ```

-##### **Build Image**
 Modify your `http_proxy` in `build-image.sh` then run:

 ```bash
 ./build-image.sh
 ```

-#### **Enclave key**
-You need to generate your enclave key using the command below, and keep it safely for future remote attestations and to start SGX enclaves more securely.
+## Start FLServer

-It will generate a file `enclave-key.pem` in your present working directory, which will be your enclave key. To store the key elsewhere, modify the outputted file path.
+Before starting any local training client or worker, we need to start a trusted third party, i.e., the FL Server, for secure aggregation. In the current design, this FL Server runs in SGX with the help of Graphene or Occlum. Local workers/clients can verify its integrity with SGX Remote Attestation.

-```bash
-openssl genrsa -3 -out enclave-key.pem 3072
-```
-
-Then modify `ENCLAVE_KEY_PATH` in `deploy_fl_container.sh` with your path to `enclave-key.pem`.
-
-#### **Tls certificate**
-If you want to build tls channel with certifacate, you need to prepare the secure keys. In this tutorial, you can generate keys with root permission (test only, need input security password for keys).
-
-**Note: Must enter `localhost` in step `Common Name` for test purpose.**
-
-```bash
-sudo bash ../../../ppml/scripts/generate-keys.sh
-```
-
-If run in container, please modify `KEYS_PATH` to `keys/` you generated in last step in `deploy_fl_container.sh`. This dir will mount to container's `/ppml/trusted-big-data-ml/work/keys`, then modify the `privateKeyFilePath` and `certChainFilePath` in `ppml-conf.yaml` with container's absolute path.
-
-If not in container, just modify the `privateKeyFilePath` and `certChainFilePath` in `ppml-conf.yaml` with your local path.
-
-If you don't want to build tls channel with cerfiticate, just delete the `privateKeyFilePath` and `certChainFilePath` in `ppml-conf.yaml`.
-
-Then modify `DATA_PATH` to `./data` with absolute path in your machine and your local ip in `deploy_fl_container.sh`. The `./data` path will mlount to container's `/ppml/trusted-big-data-ml/work/data`, so if you don't run in container, you need to modify the data path in `runH_VflClient1_2.sh`.
-
-### **Start container**
-Running this command will start a docker container and initialize the sgx environment.
+Running this command will start a docker container and initialize the SGX environment.

 ```bash
 bash deploy_fl_container.sh
@@ -87,15 +92,18 @@ sudo docker exec -it flDemo bash
 ./init.sh
 ```

-### **Start FLServer**
 In container, run:

 ```bash
 ./runFlServer.sh
 ```
+
 The fl-server will start and listen on 8980 port. Both horizontal fl-demo and vertical fl-demo need two clients. You can change the listening port and client number by editing `BigDL/scala/ppml/demo/ppml-conf.yaml`'s `serverPort` and `clientNum`.  

-### **HFL Logistic Regression**
+Note that we skip ID & feature alignment to simplify the demo. In practice, before starting Federated Learning, we need to align IDs and features and identify the portions of local data that will participate in the later training stage. In horizontal FL, feature alignment is required to ensure each party trains on the same features. In vertical FL, both ID and feature alignment are required to ensure each party trains on different features of the same records.
+
+## HFL Logistic Regression
+
 Open two new terminals, run:

 ```bash
@@ -116,8 +124,9 @@ in another terminal run:

 Then we start two horizontal fl-clients to cooperate in training a model.

-### **VFL Logistic Regression**
-Open two new windows, run:
+## VFL Logistic Regression
+
+Open two new terminals, run:

 ```bash
 sudo docker exec -it flDemo bash
@@ -140,4 +149,9 @@ Then we start two vertical fl-clients to cooperate in training a model.
 ## References

 1. [Intel SGX](https://software.intel.com/content/www/us/en/develop/topics/software-guard-extensions.html)
-2. Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated Machine Learning: Concept and Applications. ACM Trans. Intell. Syst. Technol. 10, 2, Article 12 (February 2019), 19 pages. DOI:https://doi.org/10.1145/3298981
+2. Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated Machine Learning: Concept and Applications. ACM Trans. Intell. Syst. Technol. 10, 2, Article 12 (February 2019), 19 pages. DOI:https://doi.org/10.1145/3298981
+3. [Federated Learning](https://en.wikipedia.org/wiki/Federated_learning)
+4. [TensorFlow Federated](https://www.tensorflow.org/federated)
+5. [FATE](https://github.com/FederatedAI/FATE)
+6. [PySyft](https://github.com/OpenMined/PySyft)
+7. [Federated XGBoost](https://github.com/mc2-project/federated-xgboost)
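
Step 2 of "Prepare Keys & Dataset" in the updated doc lists six files that `generate-keys.sh` is expected to produce. Before pointing `KEYS_PATH` at that directory, it may help to confirm they are all present. The snippet below is a minimal sketch of such a check, not part of BigDL: `check_tls_keys` is a hypothetical helper name, and the file list is simply copied from the doc.

```shell
# Hypothetical helper (not shipped with BigDL PPML): report which of the six
# files listed in the doc are missing from a keys directory.
check_tls_keys() {
    local keys_dir="$1"
    local missing=0
    local f
    for f in keystore.jks keystore.pkcs12 server.crt server.csr server.key server.pem; do
        if [ ! -f "${keys_dir}/${f}" ]; then
            echo "missing: ${f}"
            missing=$((missing + 1))
        fi
    done
    # Return success only when every expected file was found.
    [ "${missing}" -eq 0 ]
}
```

For example, `check_tls_keys ./ppml/scripts/keys && echo "all key files present"` prints the confirmation only when none of the six files is missing. The check looks at file names only and does not validate the keys or certificates themselves.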