Update part of Overview guide in mddocs (1/2) (#11378)

* Create install.md

* Update install_cpu.md

* Delete original docs/mddocs/Overview/install_cpu.md

* Update install_cpu.md

* Update install_gpu.md

* update llm.md and install.md

* Update docs in KeyFeatures

* Review and fix typos

* Fix on folded NOTE

* Small fix

* Small fix

* Remove empty known_issue.md

* Small fix

* Small fix

* Further fix

* Fixes

* Fix

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>

2024-06-21 10:45:17 +08:00

Native Format

You may also convert Hugging Face Transformers models into native INT4 format for maximum performance as follows.

Note

Currently, only the llama, bloom, gptneox, starcoder, and chatglm model families are supported; use the corresponding API to load the converted model. For other models, use the Hugging Face transformers format as described here.

# convert the model
from ipex_llm import llm_convert
ipex_llm_path = llm_convert(model="/path/to/model/",
                            outfile="/path/to/output/",
                            outtype="int4",
                            model_family="llama")

# load the converted model
# switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models
from ipex_llm.transformers import LlamaForCausalLM
llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", native=True, ...)

# run the converted model
input_ids = llm.tokenize(prompt)
output_ids = llm.generate(input_ids, ...)
output = llm.batch_decode(output_ids)
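Because the loader class depends on the model family, it can help to validate the family name before converting or loading. The following is a minimal sketch and not part of ipex_llm: `NATIVE_LOADERS` and `native_loader_name` are hypothetical names, and the mapping is an assumption that simply pairs each supported family with the class named in the comment above.

```python
# Hypothetical helper, not part of ipex_llm. The mapping pairs each supported
# model family with the loader class name listed in the snippet above.
NATIVE_LOADERS = {
    "llama": "LlamaForCausalLM",
    "chatglm": "ChatGLMForCausalLM",
    "gptneox": "GptneoxForCausalLM",
    "bloom": "BloomForCausalLM",
    "starcoder": "StarcoderForCausalLM",
}


def native_loader_name(model_family: str) -> str:
    """Return the loader class name for a supported model family."""
    key = model_family.strip().lower()
    try:
        return NATIVE_LOADERS[key]
    except KeyError:
        supported = ", ".join(sorted(NATIVE_LOADERS))
        raise ValueError(
            f"Unsupported model family {model_family!r}; supported: {supported}"
        ) from None
```

The returned name is the class to import from `ipex_llm.transformers` in the loading step. An unsupported family raises `ValueError` early, before any conversion work is done.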

Note

See the complete example here
