hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Nicolas Patry	fb2f74e2b9	Refactor dead code - Removing all `flash_xxx.py` files. (#2166) * Refactor dead code. * First working step. * Remove a lot of duplicated code. * More dead code. * More cleanup. * Fix Santacoder test. * Fixing the simple tests. * Fixing sharding. * Fixes for VLM. * Fixing santacoder (num_kv_heads hardcoded). * Removing more dead code. * Fixing `config.n_head`. * Stopping earlier because of `<end_of_utterance>` in idefics2. * Addresses comments. * Removing the dead code. * Fuse back mistral into FlashCausalLM. * Finish removal. * Fixing docs + causal_lm `batch_class`. * Fixing docs + causal.lm. * Add default to Gemma Causality. * Default value for gemma/gemma2. * Wrong default.	2024-07-05 10:29:56 +02:00
Daniël de Kok	bf3c813782	server: use chunked inputs The router will now send the input as chunks besides as a single string. This change modifies the server to process chunked input rather than strings. This also allows us to remove the image extraction code from the server.	2024-06-07 08:09:04 +02:00
OlivierDehaene	50b495f3d8	feat: add more latency metrics in forward (#1346)	2023-12-14 15:59:38 +01:00
OlivierDehaene	72ee382ded	chore: formatting	2023-12-11 14:49:52 +01:00
Nicolas Patry	9ecfa16b12	Speculative (#1308)	2023-12-11 12:46:30 +01:00
OlivierDehaene	895c5f1562	feat(server): only compute prefill logprobs when asked (#406) Close #288	2023-06-02 17:12:30 +02:00
OlivierDehaene	62f91f78ac	feat(server): support vectorized warpers in flash causal lm (#317) Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>	2023-05-26 12:30:27 +02:00
OlivierDehaene	218c9adaa5	feat: decrease IPC proto size (#367) Closes #307 #308	2023-05-24 19:19:57 +02:00
OlivierDehaene	5a58226130	fix(server): fix decode token (#334) Fixes #333 --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-05-16 23:23:27 +02:00
Nick Hill	4a7dd4085a	feat(server): reduce memory requirement (#214)	2023-04-24 14:15:42 +02:00
OlivierDehaene	709d8936f6	feat(router): drop requests when client closes the channel (#202)	2023-04-20 11:07:40 +02:00
OlivierDehaene	299217c95c	feat(server): add flash attention llama (#144)	2023-04-11 16:38:22 +02:00
OlivierDehaene	9987960062	feat(router): make router input validation optional (#164)	2023-04-09 20:22:27 +02:00
OlivierDehaene	b49dbf2d88	fix(server): use server tokenizer as gt (#128)	2023-03-16 12:12:26 +01:00
OlivierDehaene	3fef90d50f	feat(clients): Python client (#103)	2023-03-07 18:52:22 +01:00
OlivierDehaene	9b205d33cc	fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100)	2023-03-06 13:22:58 +01:00
OlivierDehaene	44ce098c10	feat(server): pre-allocate max attention mask (#75)	2023-02-24 12:49:21 +01:00
OlivierDehaene	017a2a8c2f	feat: Add token streaming using ServerSideEvents support (#41)	2023-01-31 17:04:00 +01:00
OlivierDehaene	4f9ac67cfa	Revert "feat: Add token streaming using ServerSideEvents support" (#40) Reverts huggingface/text-generation-inference#36	2023-01-31 14:21:51 +01:00
OlivierDehaene	7fbfbb0dc5	feat: Add token streaming using ServerSideEvents support (#36) Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is: ```rust struct Details { finish_reason: String, generated_tokens: u32, seed: Option<u64>, } struct StreamResponse { token: Token, generated_text: Option<String>, details: Option<Details>, } struct ErrorResponse { error: String, } ```	2023-01-31 11:49:43 +01:00
OlivierDehaene	15511edc01	feat(server): Support SantaCoder (#26)	2023-01-20 12:24:39 +01:00
OlivierDehaene	32a253063d	feat: Return logprobs (#8)	2022-12-15 17:03:56 +01:00
OlivierDehaene	718096f695	feat: Support stop sequences (#7)	2022-12-12 18:25:22 +01:00
OlivierDehaene	a2985036aa	feat(server): Add model tests (#6)	2022-12-08 18:49:33 +01:00

24 Commits
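
The commit message for 7fbfbb0dc5 (#36) lists the SSE event signature as three Rust structs. The sketch below lays those structs out and shows how one event could be framed as a Server-Sent Events `data:` block. It is an illustration only, under these assumptions: the fields of `Token` are hypothetical (the commit message does not define that type), `to_sse_frame` and `is_final` are hypothetical helpers, and the `Debug` output stands in for whatever serialization the router actually uses.

```rust
// Event shapes as listed in the commit message for #36.
#[derive(Debug, Clone, PartialEq)]
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

// Hypothetical: the commit message does not define `Token`; these fields are assumed.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    id: u32,
    text: String,
}

#[derive(Debug, Clone, PartialEq)]
struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

#[derive(Debug, Clone, PartialEq)]
struct ErrorResponse {
    error: String,
}

// Hypothetical helper: wrap a payload in one SSE frame.
// Each payload line becomes a `data:` field; a blank line ends the event.
fn to_sse_frame(payload: &str) -> String {
    let mut frame = String::new();
    for line in payload.lines() {
        frame.push_str("data: ");
        frame.push_str(line);
        frame.push('\n');
    }
    frame.push('\n');
    frame
}

// Hypothetical helper, assuming the optional fields are filled only on the last event.
fn is_final(event: &StreamResponse) -> bool {
    event.generated_text.is_some() && event.details.is_some()
}

fn main() {
    let intermediate = StreamResponse {
        token: Token { id: 1, text: "Hello".to_string() },
        generated_text: None,
        details: None,
    };
    let last = StreamResponse {
        token: Token { id: 2, text: " world".to_string() },
        generated_text: Some("Hello world".to_string()),
        details: Some(Details {
            finish_reason: "length".to_string(),
            generated_tokens: 2,
            seed: None,
        }),
    };
    for event in [&intermediate, &last] {
        // Debug formatting is a placeholder for real serialization.
        print!("{}", to_sse_frame(&format!("{:?}", event)));
    }
    let failure = ErrorResponse { error: "example error".to_string() };
    print!("{}", to_sse_frame(&format!("{:?}", failure)));
    println!("last event is final: {}", is_final(&last));
}
```

Keeping `generated_text` and `details` as `Option` lets every streamed token share one event type, so a client can distinguish the closing event by checking whether those fields are present.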