	Trusted FL (Federated Learning)
Federated Learning (FL) is a new tool in PPML (Privacy Preserving Machine Learning) that empowers multiple parties to build a united model without compromising privacy, even if these parties have different datasets or features. During the FL training stage, sensitive data is kept local, and only temporary gradients or weights are securely aggregated by a trusted third party. In our design, this trusted third party is fully protected by Intel SGX.
A number of FL tools and frameworks have been proposed to enable FL in different areas, e.g., OpenFL, TensorFlow Federated, FATE, Flower and PySyft. However, none of them is designed for Big Data scenarios. To enable FL in the big data ecosystem, BigDL PPML provides an SGX-based end-to-end Trusted FL platform. With this platform, data scientists and developers can easily set up FL applications on distributed large-scale datasets with a few clicks. To achieve this goal, we provide the following features:
- ID & feature alignment: figure out the portions of local data that will participate in the training stage
- Horizontal FL: training across multiple parties with the same features and different entities
- Vertical FL: training across multiple parties with the same entities and different features
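
To make the difference between the two modes concrete, the sketch below splits a toy CSV both ways with standard shell tools. The file names, columns, and values are hypothetical and chosen only for illustration; they are not the layout of the BigDL PPML demo data.

```shell
set -eu

workdir="$(mktemp -d)"

# Toy dataset (hypothetical): one record per ID, two features and a label.
cat > "$workdir/all.csv" <<'EOF'
id,age,income,label
1,34,5200,0
2,51,7300,1
3,27,3100,0
4,45,6800,1
EOF

# Horizontal split: both parties hold the same columns but different records.
{ head -n 1 "$workdir/all.csv"; sed -n '2,3p' "$workdir/all.csv"; } > "$workdir/hfl_party1.csv"
{ head -n 1 "$workdir/all.csv"; sed -n '4,5p' "$workdir/all.csv"; } > "$workdir/hfl_party2.csv"

# Vertical split: both parties hold the same records (IDs) but different columns.
cut -d, -f1,2   "$workdir/all.csv" > "$workdir/vfl_party1.csv"   # id, age
cut -d, -f1,3,4 "$workdir/all.csv" > "$workdir/vfl_party2.csv"   # id, income, label

# Show the four resulting files.
head "$workdir"/hfl_party*.csv "$workdir"/vfl_party*.csv
```

In the horizontal case the two files share a header but no rows; in the vertical case they share the id column but no feature columns.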
 
To ensure sensitive data is fully protected in the training and inference stages, we make sure that:
- Sensitive data and weights are kept local; only temporary gradients or weights are securely aggregated by a trusted third party
- The trusted third party, i.e., the FL Server, is protected by SGX Enclaves
- The local training environment is protected by SGX Enclaves (recommended but not enforced)
- Network communication and storage (e.g., data and model) are protected by encryption and [Transport Layer Security (TLS)](https://en.wikipedia.org/wiki/Transport_Layer_Security)
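
As a rough illustration of what "aggregating weights" means, the sketch below averages two hypothetical local weight vectors element by element. This is a plain unweighted mean under assumed file names and values; it is not necessarily the aggregation algorithm the FL Server implements, and it leaves out the SGX and TLS protection described above.

```shell
set -eu
export LC_ALL=C   # keep "." as the decimal separator in awk output

workdir="$(mktemp -d)"

# Hypothetical local weight vectors, one value per line.
printf '%s\n' 0.2 0.4 0.6 > "$workdir/client1_weights.txt"
printf '%s\n' 0.4 0.8 1.0 > "$workdir/client2_weights.txt"

# Aggregate: element-wise mean of the two vectors.
paste "$workdir/client1_weights.txt" "$workdir/client2_weights.txt" \
  | awk '{ printf "%.2f\n", ($1 + $2) / 2 }' > "$workdir/global_weights.txt"

cat "$workdir/global_weights.txt"   # prints 0.30, 0.60, 0.80 (one per line)
```

Only the weight files leave each client in this picture; the training data that produced them stays where it is.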
 
That is, even when the program runs in an untrusted cloud environment, all the data and models are protected (e.g., using encryption) on disk and network, and the compute and memory are also protected using SGX Enclaves.
Prerequisite
Please ensure SGX is properly enabled and the SGX driver is installed. If not, please refer to Install SGX Driver.
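
One quick way to check for an installed driver is to look for an SGX device node. The helper below is a minimal sketch: the function name is hypothetical, and the device paths are assumptions covering common driver variants (in-kernel driver, out-of-tree DCAP driver, legacy driver), so your system may use a different path.

```shell
# Report whether an SGX device node is visible on this machine.
# The paths below are assumptions; adjust them to your driver.
check_sgx_device() {
  for dev in /dev/sgx_enclave /dev/sgx/enclave /dev/isgx; do
    if [ -e "$dev" ]; then
      echo "found SGX device: $dev"
      return 0
    fi
  done
  echo "no SGX device node found; the SGX driver may not be installed"
  return 1
}

check_sgx_device || true
```

A missing device node does not by itself say whether SGX is disabled in the BIOS or the driver is absent, so treat the result as a first hint only.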
Prepare Keys & Dataset
- Generate the signing key for SGX Enclaves

  Generate the enclave key using the commands below, and keep it safe for future remote attestation and for starting SGX Enclaves more securely. This generates a file enclave-key.pem in the current working directory, which is the enclave key. To store the key elsewhere, modify the output file path.

      cd scripts/
      openssl genrsa -3 -out enclave-key.pem 3072
      cd ..

  Then modify ENCLAVE_KEY_PATH in deploy_fl_container.sh with your path to enclave-key.pem.

- Prepare keys for TLS with root permission (test only; you need to input a security password for the keys). Please also install JDK/OpenJDK and add Java to your PATH so that keytool is available.

      cd scripts/
      ./generate-keys.sh
      cd ..

  When prompted for a passphrase or password, you can enter the same password each time; these passwords can also be reused in the next step when generating other passwords. The password should be longer than 6 characters and contain both numbers and letters; one sample password is "3456abcd". These passwords will be used for future remote attestation and to start SGX enclaves more securely. The script generates 6 files in the ./ppml/scripts/keys dir (you can replace them with your own TLS keys):

      keystore.jks
      keystore.pkcs12
      server.crt
      server.csr
      server.key
      server.pem

  If you run in a container, modify KEYS_PATH in deploy_fl_container.sh to point to the keys/ dir you generated in the last step. This dir is mounted to the container's /ppml/trusted-big-data-ml/work/keys; then modify privateKeyFilePath and certChainFilePath in ppml-conf.yaml with the container's absolute path. If you do not run in a container, just modify privateKeyFilePath and certChainFilePath in ppml-conf.yaml with your local path. If you don't want to build the TLS channel with a certificate, just delete privateKeyFilePath and certChainFilePath from ppml-conf.yaml.

- Prepare dataset for FL training. For demo purposes, we have added a public dataset in BigDL PPML Demo data. Please download the data to your local machine. Then, in deploy_fl_container.sh, set DATA_PATH to the absolute path of ./data on your machine and set your local IP. The ./data path is mounted to the container's /ppml/trusted-big-data-ml/work/data, so if you don't run in a container, you need to modify the data path in runH_VflClient1_2.sh.
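
The enclave signing key from the first step can be sanity-checked before it is wired into deploy_fl_container.sh. SGX enclave signing expects a 3072-bit RSA key with public exponent 3, which is what the genrsa command above requests. The sketch below generates a key in a scratch directory (a hypothetical location, so it does not touch scripts/) and inspects it with standard openssl subcommands.

```shell
set -eu

keydir="$(mktemp -d)"
key="$keydir/enclave-key.pem"

# Same generation command as in the first step, written to a scratch directory.
openssl genrsa -3 -out "$key" 3072 2>/dev/null

# Private keys should be readable by the owner only.
chmod 600 "$key"

# Verify that the key is internally consistent.
openssl rsa -in "$key" -check -noout

# Show the key size and public exponent (expect 3072 bit and exponent 3).
openssl rsa -in "$key" -noout -text | grep -E 'Private-Key|publicExponent'
```

To check an existing key instead, point the key variable at your own enclave-key.pem and skip the genrsa line.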
Prepare Docker Image
Pull image from Dockerhub
docker pull intelanalytics/bigdl-ppml-trusted-fl-graphene:2.1.0-SNAPSHOT
If Dockerhub is not accessible, you can build the Docker image yourself. Modify http_proxy in build-image.sh, then run:
./build-image.sh
Start FLServer
Before starting any local training client or worker, we need to start a trusted third party, i.e., the FL Server, for secure aggregation. In our design, this FL Server runs in SGX with the help of Graphene or Occlum. Local workers/clients can verify its integrity with SGX Remote Attestation.
Running the commands below will start a Docker container and initialize the SGX environment.
bash deploy_fl_container.sh
sudo docker exec -it flDemo bash
./init.sh
In the container, run:
./runFlServer.sh
The fl-server will start and listen on port 8980. Both the horizontal and the vertical FL demos need two clients. You can change the listening port and the number of clients by editing serverPort and clientNum in BigDL/scala/ppml/demo/ppml-conf.yaml.
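
Changing the port can be scripted rather than done by hand. The sketch below works on a scratch file and assumes that serverPort and clientNum appear as top-level `key: value` lines; the real ppml-conf.yaml may be laid out differently and contains other keys, so check the file before adapting the pattern.

```shell
set -eu

conf="$(mktemp)"

# Hypothetical minimal config standing in for ppml-conf.yaml.
cat > "$conf" <<'EOF'
serverPort: 8980
clientNum: 2
EOF

# Change the listening port in place, keeping a .bak copy of the original.
sed -i.bak -e 's/^serverPort:.*/serverPort: 8981/' "$conf"

# Show the resulting settings.
grep -E '^(serverPort|clientNum):' "$conf"
```

If the server port changes, the clients need the same value, since they connect to the server on that port.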
Note that we skip ID & feature alignment to simplify the demo. In practice, before starting Federated Learning, we need to align IDs and features and figure out the portions of local data that will participate in later training stages. In horizontal FL, feature alignment is required to ensure each party trains on the same features. In vertical FL, both ID and feature alignment are required to ensure each party trains on different features of the same records.
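
Conceptually, ID alignment for vertical FL amounts to finding the record IDs that both parties hold. The sketch below computes that intersection in the clear with sort and comm, using hypothetical ID lists. It only illustrates the idea: exchanging raw IDs like this reveals them to the other party, so a real deployment would typically use a privacy-preserving protocol such as private set intersection instead.

```shell
set -eu
export LC_ALL=C   # sort and comm must agree on collation order

workdir="$(mktemp -d)"

# Hypothetical record IDs held by each party.
printf '%s\n' 3 1 7 5 > "$workdir/party1_ids.txt"
printf '%s\n' 5 2 3 9 > "$workdir/party2_ids.txt"

# comm requires sorted input.
sort "$workdir/party1_ids.txt" > "$workdir/party1_sorted.txt"
sort "$workdir/party2_ids.txt" > "$workdir/party2_sorted.txt"

# -12 suppresses lines unique to either file, leaving only the common IDs.
comm -12 "$workdir/party1_sorted.txt" "$workdir/party2_sorted.txt" > "$workdir/common_ids.txt"

cat "$workdir/common_ids.txt"   # prints 3 and 5
```

Only the records with IDs in the common list would take part in the later training stages.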
HFL Logistic Regression
Open two new terminals, run:
sudo docker exec -it flDemo bash
to enter the container, then in a terminal run:
./runHflClient1.sh
in another terminal run:
./runHflClient2.sh
This starts two horizontal FL clients that cooperate in training a model.
VFL Logistic Regression
Open two new terminals, run:
sudo docker exec -it flDemo bash
to enter the container, then in a terminal run:
./runVflClient1.sh
in another terminal run:
./runVflClient2.sh
This starts two vertical FL clients that cooperate in training a model.
References
- Intel SGX
- Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated Machine Learning: Concept and Applications. ACM Trans. Intell. Syst. Technol. 10, 2, Article 12 (February 2019), 19 pages. DOI: https://doi.org/10.1145/3298981
- Federated Learning
- TensorFlow Federated
- FATE
- PySyft
- Federated XGBoost