Commit graph

312 commits

Author SHA1 Message Date
Zijie Li
9e65cf00b3
Add openai-whisper pytorch gpu (#11736)
* Add openai-whisper pytorch gpu

* Update README.md

* Update README.md

* fix typo

* fix names update readme

* Update README.md
2024-08-08 12:32:59 +08:00
Jinhe
d0c89fb715
updated llama.cpp and ollama quickstart (#11732)
* updated llama.cpp and ollama quickstart.md

* added qwen2-1.5B sample output

* revision on quickstart updates

* revision on quickstart updates

* revision on qwen2 readme

* added 2 troubleshoots

* troubleshoot revision
2024-08-08 11:04:01 +08:00
Ch1y0q
4676af2054
add gemma2 example (#11724)
* add `gemma2`

* update `transformers` version

* update `README.md`
2024-08-06 21:17:50 +08:00
Jin, Qiao
11650b6f81
upgrade glm-4v example transformers version (#11719) 2024-08-06 14:55:09 +08:00
Jin, Qiao
7f241133da
Add MiniCPM-Llama3-V-2_5 GPU example (#11693)
* Add MiniCPM-Llama3-V-2_5 GPU example

* fix
2024-08-06 10:22:41 +08:00
Jin, Qiao
808d9a7bae
Add MiniCPM-V-2 GPU example (#11699)
* Add MiniCPM-V-2 GPU example

* add example in README.md

* add example in README.md
2024-08-06 10:22:33 +08:00
Zijie Li
8fb36b9f4a
add new benchmark_util.py (#11713)
* add new benchmark_util.py
2024-08-05 16:18:48 +08:00
Wang, Jian4
493cbd9a36
Support lightweight-serving with internlm-xcomposer2-vl-7b multimodal input (#11703)
* init image_list

* enable internlm-xcomposer2 image input

* update style

* add readme

* update model

* update readme
2024-08-05 09:36:04 +08:00
Qiyuan Gong
762ad49362
Add RANK_WAIT_TIME into DeepSpeed-AutoTP to avoid CPU memory OOM (#11704)
* DeepSpeed-AutoTP will start multiple processes to load models and convert them in CPU memory. If model/rank_num is large, this will lead to OOM. Add RANK_WAIT_TIME to reduce memory usage by controlling model reading parallelism.
2024-08-01 18:16:21 +08:00
Zijie Li
5079ed9e06
Add Llama3.1 example (#11689)
* Add Llama3.1 example

Add Llama3.1 example for Linux arc and Windows MTL

* Changes made to adjust compatibilities

transformers changed to 4.43.1

* Update index.rst

* Update README.md

* Update index.rst

* Update index.rst

* Update index.rst
2024-07-31 10:53:30 +08:00
Jin, Qiao
6e3ce28173
Upgrade glm-4 example transformers version (#11659)
* upgrade glm-4 example transformers version

* move pip install in one line
2024-07-31 10:24:50 +08:00
Guoqiong Song
336dfc04b1
fix 1482 (#11661)
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-07-26 12:39:09 -07:00
Wang, Jian4
23681fbf5c
Support codegeex4-9b for lightweight-serving (#11648)
* add options, support prompt and not return end_token

* enable openai parameter

* set do_sample None and update style
2024-07-26 09:41:03 +08:00
Wang, Jian4
1eed0635f2
Add lightweight serving and support tgi parameter (#11600)
* init tgi request

* update openai api

* update for pp

* update and add readme

* add to docker

* add start bash

* update

* update

* update
2024-07-19 13:15:56 +08:00
Guoqiong Song
380717f50d
fix gemma for 4.41 (#11531)
* fix gemma for 4.41
2024-07-18 15:02:50 -07:00
Guoqiong Song
5a6211fd56
fix minicpm for transformers>=4.39 (#11533)
* fix minicpm for transformers>=4.39
2024-07-18 15:01:57 -07:00
Guoqiong Song
bfcdc35b04
phi-3 on "transformers>=4.37.0,<=4.42.3" (#11534) 2024-07-17 17:19:57 -07:00
Guoqiong Song
d64711900a
Fix cohere model on transformers>=4.41 (#11575)
* fix cohere model for 4-41
2024-07-17 17:18:59 -07:00
Guoqiong Song
5b6eb85b85
phi model readme (#11595)
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-07-17 17:18:34 -07:00
Wang, Jian4
9c15abf825
Refactor fastapi-serving and add one card serving (#11581)
* init fastapi-serving one card

* mv api code to source

* update worker

* update for style-check

* add worker

* update bash

* update

* update worker name and add readme

* rename update

* rename to fastapi
2024-07-17 11:12:43 +08:00
Heyang Sun
365adad59f
Support LoRA ChatGLM with Alpaca Dataset (#11580)
* Support LoRA ChatGLM with Alpaca Dataset

* refine

* fix

* add 2-card alpaca
2024-07-16 15:40:02 +08:00
Ch1y0q
50cf563a71
Add example: MiniCPM-V (#11570) 2024-07-15 10:55:48 +08:00
Xiangyu Tian
0981b72275
Fix /generate_stream api in Pipeline Parallel FastAPI (#11569) 2024-07-12 13:19:42 +08:00
binbin Deng
2b8ad8731e
Support pipeline parallel for glm-4v (#11545) 2024-07-11 16:06:06 +08:00
Xiangyu Tian
7f5111a998
LLM: Refine start script for Pipeline Parallel Serving (#11557)
Refine start script and readme for Pipeline Parallel Serving
2024-07-11 15:45:27 +08:00
Jason Dai
099486afb7
Update README.md (#11530) 2024-07-08 20:18:41 +08:00
binbin Deng
66f6ffe4b2
Update GPU HF-Transformers example structure (#11526) 2024-07-08 17:58:06 +08:00
Xiangyu Tian
7d8bc83415
LLM: Partial Prefilling for Pipeline Parallel Serving (#11457)
LLM: Partial Prefilling for Pipeline Parallel Serving
2024-07-05 13:10:35 +08:00
binbin Deng
60de428b37
Support pipeline parallel for qwen-vl (#11503) 2024-07-04 18:03:57 +08:00
Wang, Jian4
61c36ba085
Add pp_serving verified models (#11498)
* add verified models

* update

* verify large model

* update commend
2024-07-03 14:57:09 +08:00
binbin Deng
9274282ef7
Support pipeline parallel for glm-4-9b-chat (#11463) 2024-07-03 14:25:28 +08:00
Wang, Jian4
4390e7dc49
Fix codegeex2 transformers version (#11487) 2024-07-02 15:09:28 +08:00
Heyang Sun
913e750b01
fix non-string deepspeed config path bug (#11476)
* fix non-string deepspeed config path bug

* Update lora_finetune_chatglm.py
2024-07-01 15:53:50 +08:00
Heyang Sun
07362ffffc
ChatGLM3-6B LoRA Fine-tuning Demo (#11450)
* ChatGLM3-6B LoRA Fine-tuning Demo

* refine

* refine

* add 2-card deepspeed

* refine format

* add mpi4py and deepspeed install
2024-07-01 09:18:39 +08:00
Xiangyu Tian
fd933c92d8
Fix: Correct num_requests in benchmark for Pipeline Parallel Serving (#11462) 2024-06-28 16:10:51 +08:00
binbin Deng
987017ef47
Update pipeline parallel serving for more model support (#11428) 2024-06-27 18:21:01 +08:00
binbin Deng
508c364a79
Add precision option in PP inference examples (#11440) 2024-06-27 09:24:27 +08:00
Shaojun Liu
ab9f7f3ac5
FIX: Qwen1.5-GPTQ-Int4 inference error (#11432)
* merge_qkv if quant_method is 'gptq'

* fix python style checks

* refactor

* update GPU example
2024-06-26 15:36:22 +08:00
binbin Deng
e473b8d946
Add more qwen1.5 and qwen2 support for pipeline parallel inference (#11423) 2024-06-25 15:49:32 +08:00
Xiangyu Tian
8ddae22cfb
LLM: Refactor Pipeline-Parallel-FastAPI example (#11319)
Initially Refactor for Pipeline-Parallel-FastAPI example
2024-06-25 13:30:36 +08:00
SONG Ge
34c15d3a10
update pp document (#11421) 2024-06-25 10:17:20 +08:00
Heyang Sun
c985912ee3
Add Deepspeed LoRA dependencies in document (#11410) 2024-06-24 15:29:59 +08:00
SONG Ge
0c67639539
Add more examples for pipeline parallel inference (#11372)
* add more model examples for pipeline parallel inference

* add mixtral and vicuna models

* add yi model and past_kv support for chatglm family

* add docs

* doc update

* add license

* update
2024-06-21 17:55:16 +08:00
ivy-lv11
21fc781fce
Add GLM-4V example (#11343)
* add example

* modify

* modify

* add line

* add

* add link and replace with phi-3-vision template

* fix generate options

* fix

* fix

---------

Co-authored-by: jinbridge <2635480475@qq.com>
2024-06-21 12:54:31 +08:00
binbin Deng
4ba82191f2
Support PP inference for chatglm3 (#11375) 2024-06-21 09:59:01 +08:00
Heyang Sun
67a1e05876
Remove zero3 context manager from LoRA (#11346) 2024-06-18 17:24:43 +08:00
Shaojun Liu
694912698e
Upgrade scikit-learn to 1.5.0 to fix dependabot issue (#11349) 2024-06-18 15:47:25 +08:00
Heyang Sun
00f322d8ee
Finetune ChatGLM with Deepspeed Zero3 LoRA (#11314)
* Finetune ChatGLM with Deepspeed Zero3 LoRA

* add deepspeed zero3 config

* rename config

* remove offload_param

* add save_checkpoint parameter

* Update lora_deepspeed_zero3_finetune_chatglm3_6b_arc_2_card.sh

* refine
2024-06-18 12:31:26 +08:00
binbin Deng
e50c890e1f
Support finishing PP inference once eos_token_id is found (#11336) 2024-06-18 09:55:40 +08:00
Qiyuan Gong
de4bb97b4f
Remove accelerate 0.23.0 install command in readme and docker (#11333)
* ipex-llm's accelerate has been upgraded to 0.23.0. Remove accelerate 0.23.0 install command in README and docker.
2024-06-17 17:52:12 +08:00