Zijie Li
9e65cf00b3
Add openai-whisper pytorch gpu ( #11736 )
* Add openai-whisper pytorch gpu
* Update README.md
* Update README.md
* fix typo
* fix names and update readme
* Update README.md
2024-08-08 12:32:59 +08:00
Jinhe
d0c89fb715
updated llama.cpp and ollama quickstart ( #11732 )
* updated llama.cpp and ollama quickstart.md
* added qwen2-1.5B sample output
* revision on quickstart updates
* revision on quickstart updates
* revision on qwen2 readme
* added 2 troubleshooting entries
* troubleshoot revision
2024-08-08 11:04:01 +08:00
Ch1y0q
4676af2054
add gemma2 example ( #11724 )
* add `gemma2`
* update `transformers` version
* update `README.md`
2024-08-06 21:17:50 +08:00
Jin, Qiao
11650b6f81
upgrade glm-4v example transformers version ( #11719 )
2024-08-06 14:55:09 +08:00
Jin, Qiao
7f241133da
Add MiniCPM-Llama3-V-2_5 GPU example ( #11693 )
* Add MiniCPM-Llama3-V-2_5 GPU example
* fix
2024-08-06 10:22:41 +08:00
Jin, Qiao
808d9a7bae
Add MiniCPM-V-2 GPU example ( #11699 )
* Add MiniCPM-V-2 GPU example
* add example in README.md
* add example in README.md
2024-08-06 10:22:33 +08:00
Zijie Li
8fb36b9f4a
add new benchmark_util.py ( #11713 )
* add new benchmark_util.py
2024-08-05 16:18:48 +08:00
Wang, Jian4
493cbd9a36
Support lightweight-serving with internlm-xcomposer2-vl-7b multimodal input ( #11703 )
* init image_list
* enable internlm-xcomposer2 image input
* update style
* add readme
* update model
* update readme
2024-08-05 09:36:04 +08:00
Qiyuan Gong
762ad49362
Add RANK_WAIT_TIME into DeepSpeed-AutoTP to avoid CPU memory OOM ( #11704 )
* DeepSpeed-AutoTP starts multiple processes to load models and convert them in CPU memory. If model/rank_num is large, this will lead to OOM. Add RANK_WAIT_TIME to reduce memory usage by controlling model-reading parallelism.
2024-08-01 18:16:21 +08:00
Zijie Li
5079ed9e06
Add Llama3.1 example ( #11689 )
* Add Llama3.1 example
Add Llama3.1 example for Linux Arc and Windows MTL
* Changes made to adjust compatibility
transformers changed to 4.43.1
* Update index.rst
* Update README.md
* Update index.rst
* Update index.rst
* Update index.rst
2024-07-31 10:53:30 +08:00
Jin, Qiao
6e3ce28173
Upgrade glm-4 example transformers version ( #11659 )
* upgrade glm-4 example transformers version
* move pip install in one line
2024-07-31 10:24:50 +08:00
Jin, Qiao
a44ab32153
Switch to conhost when running on NPU ( #11687 )
2024-07-30 17:08:06 +08:00
Guoqiong Song
336dfc04b1
fix 1482 ( #11661 )
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-07-26 12:39:09 -07:00
Wang, Jian4
23681fbf5c
Support codegeex4-9b for lightweight-serving ( #11648 )
* add options: support prompt and do not return end_token
* enable openai parameter
* set do_sample to None and update style
2024-07-26 09:41:03 +08:00
Wang, Jian4
1eed0635f2
Add lightweight serving and support tgi parameter ( #11600 )
* init tgi request
* update openai api
* update for pp
* update and add readme
* add to docker
* add start bash
* update
* update
* update
2024-07-19 13:15:56 +08:00
Guoqiong Song
380717f50d
fix gemma for 4.41 ( #11531 )
* fix gemma for 4.41
2024-07-18 15:02:50 -07:00
Guoqiong Song
5a6211fd56
fix minicpm for transformers>=4.39 ( #11533 )
* fix minicpm for transformers>=4.39
2024-07-18 15:01:57 -07:00
Guoqiong Song
bfcdc35b04
phi-3 on "transformers>=4.37.0,<=4.42.3" ( #11534 )
2024-07-17 17:19:57 -07:00
Guoqiong Song
d64711900a
Fix cohere model on transformers>=4.41 ( #11575 )
* fix cohere model for 4.41
2024-07-17 17:18:59 -07:00
Guoqiong Song
5b6eb85b85
phi model readme ( #11595 )
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-07-17 17:18:34 -07:00
Wang, Jian4
9c15abf825
Refactor fastapi-serving and add one-card serving ( #11581 )
* init fastapi-serving one card
* mv api code to source
* update worker
* update for style-check
* add worker
* update bash
* update
* update worker name and add readme
* rename update
* rename to fastapi
2024-07-17 11:12:43 +08:00
Heyang Sun
365adad59f
Support LoRA ChatGLM with Alpaca Dataset ( #11580 )
* Support LoRA ChatGLM with Alpaca Dataset
* refine
* fix
* add 2-card alpaca
2024-07-16 15:40:02 +08:00
Ch1y0q
50cf563a71
Add example: MiniCPM-V ( #11570 )
2024-07-15 10:55:48 +08:00
Zhao Changmin
06745e5742
Add npu benchmark all-in-one script ( #11571 )
* npu benchmark
2024-07-15 10:42:37 +08:00
Xiangyu Tian
0981b72275
Fix /generate_stream api in Pipeline Parallel FastAPI ( #11569 )
2024-07-12 13:19:42 +08:00
Zhao Changmin
b9c66994a5
add npu sdp ( #11562 )
2024-07-11 16:57:35 +08:00
binbin Deng
2b8ad8731e
Support pipeline parallel for glm-4v ( #11545 )
2024-07-11 16:06:06 +08:00
Xiangyu Tian
7f5111a998
LLM: Refine start script for Pipeline Parallel Serving ( #11557 )
Refine start script and readme for Pipeline Parallel Serving
2024-07-11 15:45:27 +08:00
Zhao Changmin
105e124752
optimize phi3-v encoder npu performance and add multimodal example ( #11553 )
* phi3-v
* readme
2024-07-11 13:59:14 +08:00
Zhao Changmin
3c16c9f725
Optimize baichuan on NPU ( #11548 )
* baichuan_npu
2024-07-10 13:18:48 +08:00
Zhao Changmin
76a5802acf
update NPU examples ( #11540 )
* update NPU examples
2024-07-09 17:19:42 +08:00
Jason Dai
099486afb7
Update README.md ( #11530 )
2024-07-08 20:18:41 +08:00
binbin Deng
66f6ffe4b2
Update GPU HF-Transformers example structure ( #11526 )
2024-07-08 17:58:06 +08:00
Xiangyu Tian
7d8bc83415
LLM: Partial Prefilling for Pipeline Parallel Serving ( #11457 )
LLM: Partial Prefilling for Pipeline Parallel Serving
2024-07-05 13:10:35 +08:00
binbin Deng
60de428b37
Support pipeline parallel for qwen-vl ( #11503 )
2024-07-04 18:03:57 +08:00
Wang, Jian4
61c36ba085
Add pp_serving verified models ( #11498 )
* add verified models
* update
* verify large model
* update command
2024-07-03 14:57:09 +08:00
binbin Deng
9274282ef7
Support pipeline parallel for glm-4-9b-chat ( #11463 )
2024-07-03 14:25:28 +08:00
Wang, Jian4
4390e7dc49
Fix codegeex2 transformers version ( #11487 )
2024-07-02 15:09:28 +08:00
Heyang Sun
913e750b01
fix non-string deepspeed config path bug ( #11476 )
* fix non-string deepspeed config path bug
* Update lora_finetune_chatglm.py
2024-07-01 15:53:50 +08:00
Yishuo Wang
319a3b36b2
fix npu llama2 ( #11471 )
2024-07-01 10:14:11 +08:00
Heyang Sun
07362ffffc
ChatGLM3-6B LoRA Fine-tuning Demo ( #11450 )
* ChatGLM3-6B LoRA Fine-tuning Demo
* refine
* refine
* add 2-card deepspeed
* refine format
* add mpi4py and deepspeed install
2024-07-01 09:18:39 +08:00
Xiangyu Tian
fd933c92d8
Fix: Correct num_requests in benchmark for Pipeline Parallel Serving ( #11462 )
2024-06-28 16:10:51 +08:00
binbin Deng
987017ef47
Update pipeline parallel serving for more model support ( #11428 )
2024-06-27 18:21:01 +08:00
Yishuo Wang
cf0f5c4322
change npu document ( #11446 )
2024-06-27 13:59:59 +08:00
binbin Deng
508c364a79
Add precision option in PP inference examples ( #11440 )
2024-06-27 09:24:27 +08:00
Shaojun Liu
ab9f7f3ac5
FIX: Qwen1.5-GPTQ-Int4 inference error ( #11432 )
* merge_qkv if quant_method is 'gptq'
* fix python style checks
* refactor
* update GPU example
2024-06-26 15:36:22 +08:00
Jiao Wang
40fa23560e
Fix LLAVA example on CPU ( #11271 )
* update
* update
* update
* update
2024-06-25 20:04:59 -07:00
binbin Deng
e473b8d946
Add more qwen1.5 and qwen2 support for pipeline parallel inference ( #11423 )
2024-06-25 15:49:32 +08:00
Yishuo Wang
3b23de684a
update npu examples ( #11422 )
2024-06-25 13:32:53 +08:00
Xiangyu Tian
8ddae22cfb
LLM: Refactor Pipeline-Parallel-FastAPI example ( #11319 )
Initial refactor of the Pipeline-Parallel-FastAPI example
2024-06-25 13:30:36 +08:00