ipex-llm

Author	SHA1	Message	Date
Yishuo Wang	f0fdfa081b	Optimize qwen 1.5 14B batch performance (#11370 )	2024-06-20 17:23:39 +08:00
Qiyuan Gong	1eb884a249	IPEX Duplicate importer V2 (#11310 ) * Add gguf support. * Avoid error when import ipex-llm for multiple times. * Add check to avoid duplicate replace and revert. * Add calling from check to avoid raising exceptions in the submodule. * Add BIGDL_CHECK_DUPLICATE_IMPORT for controlling duplicate checker. Default is true.	2024-06-19 16:29:19 +08:00
Guoqiong Song	c44b1942ed	fix mistral for transformers>=4.39 (#11191 ) * fix mistral for transformers>=4.39	2024-06-18 13:39:35 -07:00
Yina Chen	5dad33e5af	Support fp8_e4m3 scale search (#11339 ) * fp8e4m3 switch off * fix style	2024-06-18 11:47:43 +08:00
Xin Qiu	183e0c6cf5	glm-4v-9b support (#11327 ) * chatglm4v support * fix style check * update glm4v	2024-06-17 13:52:37 +08:00
Yina Chen	0af0102e61	Add quantization scale search switch (#11326 ) * add scale_search switch * remove llama3 instruct * remove print	2024-06-14 18:46:52 +08:00
Yishuo Wang	5e25766855	fix and optimize chatglm2-32k and chatglm3-128k (#11306 )	2024-06-13 17:37:58 +08:00
Guancheng Fu	57a023aadc	Fix vllm tp (#11297 )	2024-06-13 10:47:48 +08:00
Yishuo Wang	10e480ee96	refactor internlm and internlm2 (#11274 )	2024-06-11 14:19:19 +08:00
Yishuo Wang	ea0d03fd28	Refactor baichuan1 7B and 13B (#11258 )	2024-06-07 14:29:20 +08:00
Yishuo Wang	ef8e9b2ecd	Refactor qwen2 moe (#11244 )	2024-06-07 13:14:54 +08:00
Xin Qiu	2f809116e2	optimize Chatglm4 (#11239 ) * chatglm4 * update * update * add rms norm * chatglm4	2024-06-06 18:25:20 +08:00
Yishuo Wang	2e4ccd541c	fix qwen2 cpu (#11240 )	2024-06-06 16:24:19 +08:00
Yishuo Wang	ba27e750b1	refactor yuan2 (#11235 )	2024-06-06 13:17:54 +08:00
Guoqiong Song	f6d5c6af78	fix issue 1407 (#11171 )	2024-06-05 13:35:57 -07:00
Xin Qiu	566691c5a3	quantized attention forward for minicpm (#11200 ) * quantized minicpm * fix style check	2024-06-05 09:15:25 +08:00
Jiao Wang	bb83bc23fd	Fix Starcoder issue on CPU on transformers 4.36+ (#11190 ) * fix starcoder for sdpa * update * style	2024-06-04 10:05:40 -07:00
Xiangyu Tian	ac3d53ff5d	LLM: Fix vLLM CPU version error (#11206 ) Fix vLLM CPU version error	2024-06-04 19:10:23 +08:00
Xin Qiu	5f13700c9f	optimize Minicpm (#11189 ) * minicpm optimize * update	2024-06-03 18:28:29 +08:00
ZehuaCao	4127b99ed6	Fix null pointer dereferences error. (#11125 ) * delete unused function on tgi_server * update * update * fix style	2024-05-30 16:16:10 +08:00
Guancheng Fu	50ee004ac7	Fix vllm condition (#11169 ) * add use-vllm * done * fix style * fix done	2024-05-30 15:23:17 +08:00
Zhao Changmin	65f4212f89	Fix qwen 14b run into register attention fwd (#11128 ) * fix qwen 14b	2024-05-24 14:45:07 +08:00
Yishuo Wang	797dbc48b8	fix phi-2 and phi-3 convert (#11116 )	2024-05-23 17:37:37 +08:00
Yishuo Wang	37b98a531f	support running internlm xcomposer2 on gpu and add sdp optimization (#11115 )	2024-05-23 17:26:24 +08:00
Zhao Changmin	c5e8b90c8d	Add Qwen register attention implemention (#11110 ) * qwen_register	2024-05-23 17:17:45 +08:00
Yishuo Wang	0e53f20edb	support running internlm-xcomposer2 on cpu (#11111 )	2024-05-23 16:36:09 +08:00
Yishuo Wang	cd4dff09ee	support phi-3 vision (#11101 )	2024-05-22 17:43:50 +08:00
Yishuo Wang	f00625f9a4	refactor qwen2 (#11087 )	2024-05-21 16:53:42 +08:00
Yishuo Wang	d830a63bb7	refactor qwen (#11074 )	2024-05-20 18:08:37 +08:00
Ruonan Wang	f1156e6b20	support gguf_q4k_m / gguf_q4k_s (#10887 ) * initial commit * UPDATE * fix style * fix style * add gguf_q4k_s * update comment * fix	2024-05-17 14:30:09 +08:00
Yishuo Wang	981d668be6	refactor baichuan2-7b (#11062 )	2024-05-17 13:01:34 +08:00
SONG Ge	192ae35012	Add support for llama2 quantize_kv with transformers 4.38.0 (#11054 ) * add support for llama2 quantize_kv with transformers 4.38.0 * fix code style * fix code style	2024-05-16 22:23:39 +08:00
Yishuo Wang	8cae897643	use new rope in phi3 (#11047 )	2024-05-16 15:12:35 +08:00
SONG Ge	9942a4ba69	[WIP] Support llama2 with transformers==4.38.0 (#11024 ) * support llama2 with transformers==4.38.0 * add supprot for quantize_qkv * add original support for 4.38.0 now * code style fix	2024-05-15 18:07:00 +08:00
Yishuo Wang	ee325e9cc9	fix phi3 (#11022 )	2024-05-15 09:32:12 +08:00
Zhao Changmin	0a732bebe7	Add phi3 cached RotaryEmbedding (#11013 ) * phi3cachedrotaryembed * pep8	2024-05-15 08:16:43 +08:00
Zhao Changmin	b03c859278	Add phi3RMS (#10988 ) * phi3RMS	2024-05-14 15:16:27 +08:00
Yishuo Wang	1b3c7a6928	remove phi3 empty cache (#10997 )	2024-05-13 14:09:55 +08:00
Kai Huang	a6342cc068	Empty cache after phi first attention to support 4k input (#10972 ) * empty cache * fix style	2024-05-09 19:50:04 +08:00
Yishuo Wang	2ebec0395c	optimize phi-3-mini-128 (#10959 )	2024-05-08 16:33:17 +08:00
Wang, Jian4	191b184341	LLM: Optimize cohere model (#10878 ) * use mlp and rms * optimize kv_cache * add fuse qkv * add flash attention and fp16 sdp * error fp8 sdp * fix optimized * fix style * update * add for pp	2024-05-07 10:19:50 +08:00
Guancheng Fu	49ab5a2b0e	Add embeddings (#10931 )	2024-05-07 09:07:02 +08:00
Guancheng Fu	2c64754eb0	Add vLLM to ipex-llm serving image (#10807 ) * add vllm * done * doc work * fix done * temp * add docs * format * add start-fastchat-service.sh * fix	2024-04-29 17:25:42 +08:00
Guancheng Fu	990535b1cf	Add tensor parallel for vLLM (#10879 ) * initial * test initial tp * initial sup * fix format * fix * fix	2024-04-26 17:10:49 +08:00
Yang Wang	1ce8d7bcd9	Support the `desc_act` feature in GPTQ model (#10851 ) * support act_order * update versions * fix style * fix bug * clean up	2024-04-24 10:17:13 -07:00
Yishuo Wang	2d210817ff	add phi3 optimization (#10871 )	2024-04-24 15:17:40 +08:00
Yishuo Wang	fe5a082b84	add phi-2 optimization (#10843 )	2024-04-22 18:56:47 +08:00
Ruonan Wang	439c834ed3	LLM: add mixed precision for lm_head (#10795 ) * add mixed_quantization * meet code review * update * fix style * meet review	2024-04-18 19:11:31 +08:00
Guancheng Fu	cbe7b5753f	Add vLLM[xpu] related code (#10779 ) * Add ipex-llm side change * add runable offline_inference * refactor to call vllm2 * Verified async server * add new v2 example * add README * fix * change dir * refactor readme.md * add experimental * fix	2024-04-18 15:29:20 +08:00
Wang, Jian4	209c3501e6	LLM: Optimize qwen1.5 moe model (#10706 ) * update moe block * fix style * enable optmize MLP * enabel kv_cache * enable fuse rope * enable fused qkv * enable flash_attention * error sdp quantize * use old api * use fuse * use xetla * fix python style * update moe_blocks num * fix output error * add cpu sdpa * update * update * update	2024-04-18 14:54:05 +08:00

62 commits