hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Nicolas Patry	f555dabca8	Putting back header inclusion (seems unused but still)	2023-07-20 15:46:51 +00:00
Nicolas Patry	5ca0508d02	Simpler exllama	2023-07-20 15:36:53 +00:00
Felix Marty	6bf7090ecd	fix per-column quantization	2023-07-19 17:55:41 +00:00
Félix Marty	edfbfdfb3f	Merge branch 'main' into gptq-cuda-kernels	2023-07-19 16:58:54 +02:00
Nicolas Patry	5a1512c025	docs: Update README.md (#643)	2023-07-19 13:39:12 +02:00
Nicolas Patry	1c81df15cd	docs: Update README.md (#639)	2023-07-19 13:38:52 +02:00
OlivierDehaene	b66b190403	feat(router): ngrok edge (#642)	2023-07-19 11:59:58 +02:00
OlivierDehaene	fe80f5360c	feat(server): auto max_batch_total_tokens for flash att models (#630)	2023-07-19 09:31:25 +02:00
OlivierDehaene	5e6ddfd6a4	fix(server): fix llamav2 config (#635)	2023-07-18 18:49:42 +02:00
OlivierDehaene	cf83f9b66f	v0.9.3 (#634)	2023-07-18 18:11:20 +02:00
Nicolas Patry	211b211ec0	feat(server): add support for llamav2 (#633)	2023-07-18 18:09:53 +02:00
OlivierDehaene	3b71c38558	feat(server): flash attention v2 (#624)	2023-07-18 16:21:18 +02:00
Nicolas Patry	4d38a1c4ad	feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587) but should work on more configurations (no need for 2 GPUs, less RAM usage). Still need to investigate the potential differences in quantization results.	2023-07-18 12:19:05 +02:00
OlivierDehaene	44acf72a73	fea(launcher): debug logs (#623)	2023-07-17 19:03:07 +02:00
Nicolas Patry	bc2873246c	fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)	2023-07-17 18:38:16 +02:00
OlivierDehaene	a2cf1bdb2f	fix(server): empty_cache when stopped	2023-07-15 13:58:19 +02:00
OlivierDehaene	c58a0c185b	v0.9.2 (#616)	2023-07-14 16:31:48 +02:00
OlivierDehaene	5b9de4a1d3	fix(server): blacklist local files (#609) Close #589 #602	2023-07-13 21:54:55 +02:00
Victor Muštar	c8b077be79	docs: README: Add logo + baseline (#611) ![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984)	2023-07-13 21:45:20 +02:00
OlivierDehaene	982ce3227b	feat(router): explicit warning if revision is not set (#608)	2023-07-13 18:59:38 +02:00
Felix Marty	74e6d6e54e	fix the usual merge mess	2023-07-13 15:48:55 +00:00
Félix Marty	9401e10210	Merge branch 'main' into gptq-cuda-kernels	2023-07-13 17:45:52 +02:00
Felix Marty	0036084294	support all, test llama	2023-07-13 15:41:57 +00:00
OlivierDehaene	b7327205a6	feat(launcher): add arg validation and drop subprocess (#595)	2023-07-13 14:22:37 +02:00
Felix Marty	2ae65b45a8	fix tests	2023-07-13 10:38:08 +00:00
Felix Marty	38c2be5926	fix test	2023-07-12 18:31:49 +00:00
ssmi153	3628559516	GPTQ Env vars: catch correct type of error (#596) # What does this PR do? When passing in environment variables like gptq_bits, we still get errors thrown from TGI because the try/catch block is catching the wrong type of error. This PR aims to fix that. @Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct.	2023-07-12 19:57:46 +02:00
Félix Marty	faa5b52fdc	Merge branch 'main' into gptq-cuda-kernels	2023-07-12 18:47:30 +02:00
Felix Marty	8645fd39e1	tests	2023-07-12 16:42:34 +00:00
Felix Marty	f90c61a340	support bits different than 4	2023-07-12 16:19:25 +00:00
Felix Marty	67d687609b	cleanup	2023-07-12 16:16:58 +00:00
Felix Marty	67a46b7361	move exllama buffer init to the top level	2023-07-12 16:09:26 +00:00
Felix Marty	4462854e1b	have a single gptq quantization type	2023-07-12 15:43:20 +00:00
OlivierDehaene	f2f0289fb9	feat(server): empty cache on errors	2023-07-12 17:06:19 +02:00
Nicolas Patry	67347950b7	feat(server): Implements sharding for non divisible `vocab_size`. (#583) - The code is relatively easy (just disable the checks on Embedding and Head) This cannot be done in the same easy fashion for hidden_dim/head_dim. It's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization another quite hairy piece of code.	2023-07-12 16:43:31 +02:00
ssmi153	2c4bf88268	fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590) # What does this PR do? This fixes a typo and extends the GPTP_BITS environment variables through to the second method which requires the same logic. Please let me know if there's anything I've misunderstood in this change. Thanks @Narsil for the original fix.	2023-07-12 14:17:35 +02:00
Adam Kowalski	7f9072228a	fix(server): Adding logger import to t5_modeling.py (#585) Logger is referenced during the apex importing but is not imported, causing a NameError	2023-07-12 10:40:32 +02:00
Nicolas Patry	db4efbf4bc	fix(server): T5 weights names. (#582) Fixes #541	2023-07-12 10:01:42 +02:00
Nicolas Patry	f063ebde10	chore: migrate ci region for more availability. (#581)	2023-07-12 10:01:01 +02:00
Nicolas Patry	5bd2ab6583	feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580) # What does this PR do? Some models are already converted, and do not have those values in the file, this enables users to use them with less friction. Went for pure env based because adding flags would end up (imo) very tedious to maintain. There's a lot of sanitation to do: those flags would be errors if not used in conjuction with `--quantize gptq`. Then the flags need to exist in the launcher and the server passing them all throughout all function calls. This PR is intended as an easy escape hatch, not the defacto method to use gptq in TGI. Fixes #500	2023-07-12 10:00:02 +02:00
Nicolas Patry	f0181436f4	fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579) Fixes #555	2023-07-12 09:51:34 +02:00
OlivierDehaene	b4024edd45	feat: better errors for warmup and TP (#575) Close #571	2023-07-10 14:47:15 +02:00
Nicolas Patry	e943a294bc	fix(server): harden the weights choice to save on disk. (#561) - Look at `transformers` base class to check for `_key_to_ignore_on_load_missing` or `_tied_weights` which are the standard attributes to select the keys to NOT save on disk (since they are ignored) - Modified safetensors code (to be reflected in safetensors even if it's an internal function). - Will not work for trust_remote_code=True repos (like santacoder). Should help with : https://github.com/huggingface/text-generation-inference/issues/555 and : https://github.com/huggingface/text-generation-inference/pull/501 and https://github.com/huggingface/text-generation-inference/issues/556 and https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593	2023-07-07 14:50:12 +02:00
OlivierDehaene	31b36cca21	v0.9.1 (#558)	2023-07-06 16:05:42 +02:00
OlivierDehaene	c4bb5264ac	fix(server): decrease memory fragmentation (#557)	2023-07-06 14:28:33 +02:00
Felix Marty	a6e387404d	try-catch to load the cuda extension, quite ugly practice tbh	2023-07-05 17:53:56 +00:00
Felix Marty	620ed7d8aa	Merge branch 'gptq-cuda-kernels' of https://github.com/fxmarty/text-generation-inference into gptq-cuda-kernels	2023-07-05 16:42:37 +00:00
Felix Marty	2272b3a456	some more cleanup	2023-07-05 16:42:13 +00:00
Félix Marty	0ff8219fdb	Merge branch 'main' into gptq-cuda-kernels	2023-07-06 01:31:05 +09:00
OlivierDehaene	6f42942772	feat(router): add argument for hostname in router (#545) (#550) # What does this PR do? In title. Adds argument `--hostname` in router to support something like `--hostname ::`. Tested with ```commandline cargo run -- --port 8080 --hostname :: curl -I -X GET 'http://[::1]:8080/health' # failed before this commit ``` Trigger CI --------- Co-authored-by: Phil Chen <philchen2000@gmail.com>	2023-07-05 18:28:45 +02:00

352 Commits (All Branches)