# Develop your own Big Data & AI applications with BigDL PPML

First, you need to create a `PPMLContext`, which wraps `SparkSession` and provides methods to read an encrypted data file into a plain-text RDD/DataFrame and to write a DataFrame out as an encrypted data file. You can then read and write data through the `PPMLContext`.

If you are familiar with Spark, you will find that the usage of `PPMLContext` is very similar to Spark's.
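
As a preview, the following minimal sketch shows the whole flow in Python: create a `PPMLContext` (here with `SimpleKeyManagementService`), read an encrypted CSV file into a plain-text DataFrame, and write it back out encrypted. The KMS credentials and file paths are placeholders; each step is explained in the sections below.

```python
from bigdl.ppml.ppml_context import *

# KMS arguments (placeholder values; see "Create PPMLContext" below)
ppml_args = {"kms_type": "SimpleKeyManagementService",
             "simple_app_id": "your_app_id",
             "simple_app_key": "your_app_key",
             "primary_key_path": "/your/primary/key/path/primaryKey",
             "data_key_path": "/your/data/key/path/dataKey"}

sc = PPMLContext("MyApp", ppml_args)

# read an encrypted CSV file into a plain-text DataFrame
df = sc.read(CryptoMode.AES_CBC_PKCS5PADDING).option("header", "true").csv("/encrypted/csv/path")

# write the DataFrame back as an encrypted CSV file
sc.write(df, CryptoMode.AES_CBC_PKCS5PADDING) \
  .mode("overwrite") \
  .option("header", True) \
  .csv("/encrypted/output/path")
```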

## 1. Create PPMLContext

- **Create a PPMLContext with `appName`**

  This is the simplest way to create a `PPMLContext`. You can use it when you don't need to read or write encrypted files.

  **Scala**

  ```scala
  import com.intel.analytics.bigdl.ppml.PPMLContext

  val sc = PPMLContext.initPPMLContext("MyApp")
  ```

  **Python**

  ```python
  from bigdl.ppml.ppml_context import *

  sc = PPMLContext("MyApp")
  ```

  If you want to read or write encrypted files, you need to provide more information, as described below.

- **Create a PPMLContext with `appName` & `ppmlArgs`**

  `ppmlArgs` is a Map of PPML arguments; its contents vary according to the kind of Key Management Service (KMS) you are using. The KMS is used to generate a `primaryKey` and a `dataKey` to encrypt/decrypt data. Three types of KMS are provided: `SimpleKeyManagementService`, `EHSMKeyManagementService` and `AzureKeyManagementService`.

  Refer to KMS Utils to generate a `primaryKey` and a `dataKey` with your KMS; then you are ready to create a `PPMLContext` with `ppmlArgs`.

  - For `SimpleKeyManagementService`:

    **Scala**

    ```scala
    import com.intel.analytics.bigdl.ppml.PPMLContext

    val ppmlArgs: Map[String, String] = Map(
        "spark.bigdl.kms.type" -> "SimpleKeyManagementService",
        "spark.bigdl.kms.simple.id" -> "your_app_id",
        "spark.bigdl.kms.simple.key" -> "your_app_key",
        "spark.bigdl.kms.key.primary" -> "/your/primary/key/path/primaryKey",
        "spark.bigdl.kms.key.data" -> "/your/data/key/path/dataKey"
    )

    val sc = PPMLContext.initPPMLContext("MyApp", ppmlArgs)
    ```

    **Python**

    ```python
    from bigdl.ppml.ppml_context import *

    ppml_args = {"kms_type": "SimpleKeyManagementService",
                 "simple_app_id": "your_app_id",
                 "simple_app_key": "your_app_key",
                 "primary_key_path": "/your/primary/key/path/primaryKey",
                 "data_key_path": "/your/data/key/path/dataKey"
                }

    sc = PPMLContext("MyApp", ppml_args)
    ```
      
  - For `EHSMKeyManagementService`:

    **Scala**

    ```scala
    import com.intel.analytics.bigdl.ppml.PPMLContext

    val ppmlArgs: Map[String, String] = Map(
        "spark.bigdl.kms.type" -> "EHSMKeyManagementService",
        "spark.bigdl.kms.ehs.ip" -> "your_server_ip",
        "spark.bigdl.kms.ehs.port" -> "your_server_port",
        "spark.bigdl.kms.ehs.id" -> "your_app_id",
        "spark.bigdl.kms.ehs.key" -> "your_app_key",
        "spark.bigdl.kms.key.primary" -> "/your/primary/key/path/primaryKey",
        "spark.bigdl.kms.key.data" -> "/your/data/key/path/dataKey"
    )

    val sc = PPMLContext.initPPMLContext("MyApp", ppmlArgs)
    ```

    **Python**

    ```python
    from bigdl.ppml.ppml_context import *

    ppml_args = {"kms_type": "EHSMKeyManagementService",
                 "kms_server_ip": "your_server_ip",
                 "kms_server_port": "your_server_port",
                 "ehsm_app_id": "your_app_id",
                 "ehsm_app_key": "your_app_key",
                 "primary_key_path": "/your/primary/key/path/primaryKey",
                 "data_key_path": "/your/data/key/path/dataKey"
                }

    sc = PPMLContext("MyApp", ppml_args)
    ```
      
  - For `AzureKeyManagementService`:

    The `clientId` parameter is optional; you don't have to provide it.

    **Scala**

    ```scala
    import com.intel.analytics.bigdl.ppml.PPMLContext

    val ppmlArgs: Map[String, String] = Map(
        "spark.bigdl.kms.type" -> "AzureKeyManagementService",
        "spark.bigdl.kms.azure.vault" -> "key_vault_name",
        "spark.bigdl.kms.azure.clientId" -> "client_id",
        "spark.bigdl.kms.key.primary" -> "/your/primary/key/path/primaryKey",
        "spark.bigdl.kms.key.data" -> "/your/data/key/path/dataKey"
    )

    val sc = PPMLContext.initPPMLContext("MyApp", ppmlArgs)
    ```

    **Python**

    ```python
    from bigdl.ppml.ppml_context import *

    ppml_args = {"kms_type": "AzureKeyManagementService",
                 "azure_vault": "your_azure_vault",
                 "azure_client_id": "your_azure_client_id",
                 "primary_key_path": "/your/primary/key/path/primaryKey",
                 "data_key_path": "/your/data/key/path/dataKey"
                }

    sc = PPMLContext("MyApp", ppml_args)
    ```
      
- **Create a PPMLContext with `sparkConf` & `appName` & `ppmlArgs`**

  If you need to set Spark configurations, you can provide a `SparkConf` carrying your Spark configurations to create a `PPMLContext`.

  **Scala**

  ```scala
  import com.intel.analytics.bigdl.ppml.PPMLContext
  import org.apache.spark.SparkConf

  val ppmlArgs: Map[String, String] = Map(
      "spark.bigdl.kms.type" -> "SimpleKeyManagementService",
      "spark.bigdl.kms.simple.id" -> "your_app_id",
      "spark.bigdl.kms.simple.key" -> "your_app_key",
      "spark.bigdl.kms.key.primary" -> "/your/primary/key/path/primaryKey",
      "spark.bigdl.kms.key.data" -> "/your/data/key/path/dataKey"
  )

  val conf: SparkConf = new SparkConf().setMaster("local[4]")

  val sc = PPMLContext.initPPMLContext(conf, "MyApp", ppmlArgs)
  ```

  **Python**

  ```python
  from bigdl.ppml.ppml_context import *
  from pyspark import SparkConf

  ppml_args = {"kms_type": "SimpleKeyManagementService",
               "simple_app_id": "your_app_id",
               "simple_app_key": "your_app_key",
               "primary_key_path": "/your/primary/key/path/primaryKey",
               "data_key_path": "/your/data/key/path/dataKey"
              }

  conf = SparkConf()
  conf.setMaster("local[4]")

  sc = PPMLContext("MyApp", ppml_args, conf)
  ```
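
Since `PPMLContext` accepts a standard `SparkConf`, you can set any Spark configuration on it before passing it in. Below is a minimal sketch assuming `SimpleKeyManagementService`; the configuration values, credentials and paths are placeholders.

```python
from bigdl.ppml.ppml_context import *
from pyspark import SparkConf

# ppml_args as shown above (placeholder values)
ppml_args = {"kms_type": "SimpleKeyManagementService",
             "simple_app_id": "your_app_id",
             "simple_app_key": "your_app_key",
             "primary_key_path": "/your/primary/key/path/primaryKey",
             "data_key_path": "/your/data/key/path/dataKey"}

# any standard Spark configuration can be set before creating the PPMLContext
conf = SparkConf()
conf.setMaster("local[4]")
conf.set("spark.driver.memory", "4g")          # placeholder value
conf.set("spark.sql.shuffle.partitions", "8")  # placeholder value

sc = PPMLContext("MyApp", ppml_args, conf)
```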
    

## 2. Read and Write Files

To read/write data, you should set the `CryptoMode`:

- `plain_text`: no encryption
- `AES/CBC/PKCS5Padding`: for CSV, JSON and text files
- `AES_GCM_V1`: for Parquet only
- `AES_GCM_CTR_V1`: for Parquet only

To write data, you should also set the write mode:

- `overwrite`: overwrite the existing data with the content of the DataFrame.
- `append`: append the content of the DataFrame to the existing data or table.
- `ignore`: silently ignore the current write operation if the data/table already exists, without raising any error.
- `error`: throw an exception if the data or table already exists.
- `errorifexists`: throw an exception if the data or table already exists.

**Scala**

```scala
import com.intel.analytics.bigdl.ppml.crypto.{AES_CBC_PKCS5PADDING, PLAIN_TEXT}

// read data
val df = sc.read(cryptoMode = PLAIN_TEXT)
  ...

// write data
sc.write(dataFrame = df, cryptoMode = AES_CBC_PKCS5PADDING)
  .mode("overwrite")
  ...
```

**Python**

```python
from bigdl.ppml.ppml_context import *

# read data
df = sc.read(crypto_mode = CryptoMode.PLAIN_TEXT)
  ...

# write data
sc.write(dataframe = df, crypto_mode = CryptoMode.AES_CBC_PKCS5PADDING) \
  .mode("overwrite") \
  ...
```
The following examples show how to read/write CSV, Parquet, JSON and text files. They use `sc` to represent an initialized `PPMLContext`.

**Read/write CSV files**

**Scala**

```scala
import com.intel.analytics.bigdl.ppml.PPMLContext
import com.intel.analytics.bigdl.ppml.crypto.{AES_CBC_PKCS5PADDING, PLAIN_TEXT}

// read a plain csv file and return a DataFrame
val plainCsvPath = "/plain/csv/path"
val df1 = sc.read(cryptoMode = PLAIN_TEXT).option("header", "true").csv(plainCsvPath)

// write a DataFrame as a plain csv file
val plainOutputPath = "/plain/output/path"
sc.write(df1, PLAIN_TEXT)
  .mode("overwrite")
  .option("header", "true")
  .csv(plainOutputPath)

// read an encrypted csv file and return a DataFrame
val encryptedCsvPath = "/encrypted/csv/path"
val df2 = sc.read(cryptoMode = AES_CBC_PKCS5PADDING).option("header", "true").csv(encryptedCsvPath)

// write a DataFrame as an encrypted csv file
val encryptedOutputPath = "/encrypted/output/path"
sc.write(df2, AES_CBC_PKCS5PADDING)
  .mode("overwrite")
  .option("header", "true")
  .csv(encryptedOutputPath)
```

**Python**

```python
from bigdl.ppml.ppml_context import *

# read a plain csv file and return a DataFrame
plain_csv_path = "/plain/csv/path"
df1 = sc.read(CryptoMode.PLAIN_TEXT).option("header", "true").csv(plain_csv_path)

# write a DataFrame as a plain csv file
plain_output_path = "/plain/output/path"
sc.write(df1, CryptoMode.PLAIN_TEXT) \
  .mode('overwrite') \
  .option("header", True) \
  .csv(plain_output_path)

# read an encrypted csv file and return a DataFrame
encrypted_csv_path = "/encrypted/csv/path"
df2 = sc.read(CryptoMode.AES_CBC_PKCS5PADDING).option("header", "true").csv(encrypted_csv_path)

# write a DataFrame as an encrypted csv file
encrypted_output_path = "/encrypted/output/path"
sc.write(df2, CryptoMode.AES_CBC_PKCS5PADDING) \
  .mode('overwrite') \
  .option("header", True) \
  .csv(encrypted_output_path)
```
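
The DataFrame returned by `sc.read(...)` is a regular Spark DataFrame, so you can apply standard Spark transformations before writing the result back encrypted. A brief sketch (the paths and the `country` column are hypothetical):

```python
from bigdl.ppml.ppml_context import *

# read an encrypted CSV file into a DataFrame (path is a placeholder)
df = sc.read(CryptoMode.AES_CBC_PKCS5PADDING).option("header", "true").csv("/encrypted/csv/path")

# standard Spark DataFrame transformations work as usual
# ("country" is a hypothetical column used for illustration)
filtered = df.filter(df["country"] == "US")

# write the transformed DataFrame back as an encrypted CSV file
sc.write(filtered, CryptoMode.AES_CBC_PKCS5PADDING) \
  .mode("overwrite") \
  .option("header", True) \
  .csv("/encrypted/output/path")
```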

**Read/write Parquet files**

**Scala**

```scala
import com.intel.analytics.bigdl.ppml.PPMLContext
import com.intel.analytics.bigdl.ppml.crypto.{AES_GCM_CTR_V1, PLAIN_TEXT}

// read a plain parquet file and return a DataFrame
val plainParquetPath = "/plain/parquet/path"
val df1 = sc.read(PLAIN_TEXT).parquet(plainParquetPath)

// write a DataFrame as a plain parquet file
val plainOutputPath = "/plain/output/path"
sc.write(df1, PLAIN_TEXT)
  .mode("overwrite")
  .parquet(plainOutputPath)

// read an encrypted parquet file and return a DataFrame
val encryptedParquetPath = "/encrypted/parquet/path"
val df2 = sc.read(AES_GCM_CTR_V1).parquet(encryptedParquetPath)

// write a DataFrame as an encrypted parquet file
val encryptedOutputPath = "/encrypted/output/path"
sc.write(df2, AES_GCM_CTR_V1)
  .mode("overwrite")
  .parquet(encryptedOutputPath)
```

**Python**

```python
from bigdl.ppml.ppml_context import *

# read a plain parquet file and return a DataFrame
plain_parquet_path = "/plain/parquet/path"
df1 = sc.read(CryptoMode.PLAIN_TEXT).parquet(plain_parquet_path)

# write a DataFrame as a plain parquet file
plain_output_path = "/plain/output/path"
sc.write(df1, CryptoMode.PLAIN_TEXT) \
  .mode('overwrite') \
  .parquet(plain_output_path)

# read an encrypted parquet file and return a DataFrame
encrypted_parquet_path = "/encrypted/parquet/path"
df2 = sc.read(CryptoMode.AES_GCM_CTR_V1).parquet(encrypted_parquet_path)

# write a DataFrame as an encrypted parquet file
encrypted_output_path = "/encrypted/output/path"
sc.write(df2, CryptoMode.AES_GCM_CTR_V1) \
  .mode('overwrite') \
  .parquet(encrypted_output_path)
```

**Read/write JSON files**

**Scala**

```scala
import com.intel.analytics.bigdl.ppml.PPMLContext
import com.intel.analytics.bigdl.ppml.crypto.{AES_CBC_PKCS5PADDING, PLAIN_TEXT}

// read a plain json file and return a DataFrame
val plainJsonPath = "/plain/json/path"
val df1 = sc.read(PLAIN_TEXT).json(plainJsonPath)

// write a DataFrame as a plain json file
val plainOutputPath = "/plain/output/path"
sc.write(df1, PLAIN_TEXT)
  .mode("overwrite")
  .json(plainOutputPath)

// read an encrypted json file and return a DataFrame
val encryptedJsonPath = "/encrypted/json/path"
val df2 = sc.read(AES_CBC_PKCS5PADDING).json(encryptedJsonPath)

// write a DataFrame as an encrypted json file
val encryptedOutputPath = "/encrypted/output/path"
sc.write(df2, AES_CBC_PKCS5PADDING)
  .mode("overwrite")
  .json(encryptedOutputPath)
```

**Python**

```python
from bigdl.ppml.ppml_context import *

# read a plain json file and return a DataFrame
plain_json_path = "/plain/json/path"
df1 = sc.read(CryptoMode.PLAIN_TEXT).json(plain_json_path)

# write a DataFrame as a plain json file
plain_output_path = "/plain/output/path"
sc.write(df1, CryptoMode.PLAIN_TEXT) \
  .mode('overwrite') \
  .json(plain_output_path)

# read an encrypted json file and return a DataFrame
encrypted_json_path = "/encrypted/json/path"
df2 = sc.read(CryptoMode.AES_CBC_PKCS5PADDING).json(encrypted_json_path)

# write a DataFrame as an encrypted json file
encrypted_output_path = "/encrypted/output/path"
sc.write(df2, CryptoMode.AES_CBC_PKCS5PADDING) \
  .mode('overwrite') \
  .json(encrypted_output_path)
```

**Read text files**

**Scala**

```scala
import com.intel.analytics.bigdl.ppml.PPMLContext
import com.intel.analytics.bigdl.ppml.crypto.{AES_CBC_PKCS5PADDING, PLAIN_TEXT}

// read from a plain csv file and return an RDD
val plainCsvPath = "/plain/csv/path"
val rdd1 = sc.textfile(plainCsvPath) // the default cryptoMode is PLAIN_TEXT

// read from an encrypted csv file and return an RDD
val encryptedCsvPath = "/encrypted/csv/path"
val rdd2 = sc.textfile(path=encryptedCsvPath, cryptoMode=AES_CBC_PKCS5PADDING)
```

**Python**

```python
from bigdl.ppml.ppml_context import *

# read from a plain csv file and return an RDD
plain_csv_path = "/plain/csv/path"
rdd1 = sc.textfile(plain_csv_path)  # the default crypto_mode is "plain_text"

# read from an encrypted csv file and return an RDD
encrypted_csv_path = "/encrypted/csv/path"
rdd2 = sc.textfile(path=encrypted_csv_path, crypto_mode=CryptoMode.AES_CBC_PKCS5PADDING)
```
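
The RDD returned by `textfile` is an ordinary Spark RDD, so the usual RDD transformations and actions apply to the decrypted lines. A short sketch (the path is a placeholder, assuming a CSV file with a header row):

```python
from bigdl.ppml.ppml_context import *

# read an encrypted CSV file as an RDD of decrypted lines
rdd = sc.textfile(path="/encrypted/csv/path", crypto_mode=CryptoMode.AES_CBC_PKCS5PADDING)

# drop the header line and split each row into fields
header = rdd.first()
rows = rdd.filter(lambda line: line != header) \
          .map(lambda line: line.split(","))
print(rows.count())
```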

For more usage of the `PPMLContext` Python API, please refer to PPMLContext Python API.