Commit Graph

  • 972e9a7f7c update causal batch for ct2 and fix nf4 (#17) main Michael Feil 2024-02-09 11:07:14 -0800
  • 7f55c3ceaa bump the ctranslate2 version #17 Michael Feil 2023-12-01 17:28:51 +0100
  • c5a294b76b update causal batch for ct2 and fix nf4 michaelfeil 2023-11-04 00:00:00 +0000
  • 339ede9e90 Update Readme.md / documentation (#15) Michael Feil 2023-10-04 08:01:06 +0200
  • bbd02184ce Update README.md #15 Michael Feil 2023-10-03 15:16:27 +0200
  • 229a1bc985 update readme michaelfeil 2023-10-03 10:53:33 +0200
  • 393647af37 add documentation updates michaelfeil 2023-10-03 10:48:40 +0200
  • ff703cb867 Adding ctranslate2 quantization and inference: moving the contribution (#1) Michael Feil 2023-10-02 20:12:49 +0200
  • 09e88f2470 Merge branch 'main' into ct2_support #1 michaelfeil 2023-10-02 00:00:00 +0000
  • 012c917b6f Wrapping completions and chat/completions endpoint (#2) Michael Feil 2023-09-27 17:58:07 +0200
  • f93012d59c Merge pull request #4 from michaelfeil/bnb_4bit Yang, Bo 2023-09-08 14:52:32 -0700
  • 072f267cc3 Initialize v_cache to avoid NaNs (#12) Yang, Bo 2023-08-23 14:23:59 -0700
  • 57deda586e Update flash_causal_lm.py #12 Yang, Bo 2023-08-23 14:22:08 -0700
  • a5f96fd18e Update flash_causal_lm.py Yang, Bo 2023-08-23 14:21:52 -0700
  • c360d45c9f Initialize v_cache to avoid NaNs Yang, Bo 2023-08-23 14:15:35 -0700
  • 2fda8fe812 Initialize v_cache to avoid NaNs (#11) Yang, Bo 2023-08-23 14:07:06 -0700
  • c6114f4b0d Initialize v_cache to avoid NaNs #11 Yang, Bo 2023-08-23 21:00:58 +0000
  • 1e646fb41d Compilation fix: Correct method argument types in generation.rs and validation.rs (#10) Jason Sun 2023-08-23 16:52:49 -0400
  • 45dc82b8b4 Update router/src/validation.rs #10 Jason Sun 2023-08-22 09:32:12 -0700
  • e5a5db61ea Update benchmark/src/generation.rs Jason Sun 2023-08-22 09:32:06 -0700
  • ac2fe4f8c6 fix: Correct method argument types in generation and validation Jason Sun 2023-08-21 15:14:55 -0700
  • 8130300c9a fix: 2038y problem #2 michaelfeil 2023-08-07 13:02:34 +0200
  • ab58232a3d cargo fmt michaelfeil 2023-08-07 12:28:59 +0200
  • e8ca636eea rebase and squash commits on latest main michaelfeil 2023-08-07 12:25:17 +0200
  • 8ddfbaafb9 Merge branch 'ct2_support' of https://github.com/michaelfeil/preemo-text-generation-inference into ct2_support michaelfeil 2023-08-07 11:06:54 +0200
  • b9326ace1a adapt path michaelfeil 2023-08-06 18:26:30 +0200
  • a732244687 update changes for dockerfile michaelfeil 2023-08-06 17:10:44 +0200
  • c089b19487 update dockerfile michaelfeil 2023-08-04 14:33:49 +0200
  • df1e7b513a reformatting and changes. michaelfeil 2023-08-04 14:18:56 +0200
  • ee81780ba4 rebaseing the commit on preemo fork. michaelfeil 2023-07-30 14:07:28 +0200
  • 5963554641 adapt path michaelfeil 2023-08-06 18:26:30 +0200
  • 24632c5105 update changes for dockerfile michaelfeil 2023-08-06 17:10:44 +0200
  • 2ac9db513a update dockerfile michaelfeil 2023-08-04 14:33:49 +0200
  • bc4b3f97ec reformatting and changes. michaelfeil 2023-08-04 14:18:56 +0200
  • da9746586b Update README.md #4 Michael Feil 2023-08-03 23:23:02 +0200
  • a9838bba2f Modify exllama weight Michael Feil 2023-08-03 23:20:59 +0200
  • d2ae3581bf Claim copyright (#7) Yang, Bo 2023-08-02 17:23:54 -0700
  • 13f559c305 Claim copyright #7 Yang, Bo 2023-08-02 16:15:53 -0700
  • 8af4a7a0b0 Merge branch 'main' into bnb_4bit Yang, Bo 2023-08-02 12:47:17 -0700
  • b5fadc4c28 Don't enable custom kernels if CUDA is not available (#6) Yang, Bo 2023-08-02 09:51:54 -0700
  • 8a5f80bb61 Add AutoCausalLM (#5) Yang, Bo 2023-08-02 09:35:40 -0700
  • 656f2fe4dc fix: typo michaelfeil 2023-08-02 16:56:14 +0200
  • ec8590a3f1 Don't enable custom kernels if CUDA is not available #6 Yang, Bo 2023-08-01 17:58:00 -0700
  • ef006ccee2 Merge branch 'AutoCausalLM' of https://github.com/Atry/hf-text-generation-inference into HEAD #5 Yang, Bo 2023-08-01 12:30:08 -0700
  • 9048a80f8f Add a new README (#3) Yang, Bo 2023-08-01 12:22:07 -0700
  • 4c2237b2a0 update PR template michaelfeil 2023-08-01 18:18:28 +0200
  • 44fa36b5bf restoring commit from dev branch, rebase on current master michaelfeil 2023-08-01 18:15:18 +0200
  • 220b2afc8a Update README.md #3 Yang, Bo 2023-07-31 21:39:29 -0700
  • 76206a513f Add Preemo's README Yang, Bo 2023-07-31 21:35:16 -0700
  • 8c3d8a10cd Rename README.md to README-HuggingFace.md Yang, Bo 2023-07-31 21:34:47 -0700
  • 08b50a5bb9 rebaseing the commit on preemo fork. michaelfeil 2023-07-30 14:07:28 +0200
  • afd04dc71e feat(server): update vllm version (#723) OlivierDehaene 2023-07-28 15:36:38 +0200
  • f848decee6 docs: Add hardware section to TOC in README (#721) regisss 2023-07-28 11:20:03 +0200
  • 5a1cccbb98 Add section about TGI on other AI hardware accelerators in README (#715) regisss 2023-07-28 09:14:03 +0200
  • 9f18f4c006 v0.9.4 (#713) OlivierDehaene 2023-07-27 19:25:15 +0200
  • ab96b9aec3 feat(server): support new falcon config (#712) OlivierDehaene 2023-07-27 18:38:57 +0200
  • 2efd46ef95 fix(server): fix missing datasets in quantize OlivierDehaene 2023-07-27 14:50:45 +0200
  • 8bd0adb135 fix(server): fix quantization python requirements (#708) OlivierDehaene 2023-07-27 12:28:10 +0200
  • e64a65891b docs(README): update readme OlivierDehaene 2023-07-25 19:45:25 +0200
  • a0d55358d2 feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671) Nicolas Patry 2023-07-25 12:00:27 +0100
  • 9bb64c92a9 Add AutoCausalLM Yang, Bo 2023-07-12 01:07:10 +0000
  • 37df6df38e fix(server): fix exllama buffers (#689) OlivierDehaene 2023-07-24 14:25:43 +0200
  • 73a4d65d26 feat: add cuda memory fraction (#659) OlivierDehaene 2023-07-24 11:43:58 +0200
  • 1da642bd0e feat(server): add local prom and health routes if running w/ ngrok OlivierDehaene 2023-07-21 16:56:30 +0200
  • 15b3e9ffb0 Directly load GPTBigCode to specified device (#618) Yang, Bo 2023-07-21 02:27:31 -0700
  • d5b5bc750f feat(server): Add exllama GPTQ CUDA kernel support #553 (#666) Nicolas Patry 2023-07-21 10:59:00 +0200
  • bf94df3c71 fix(server): use mem_get_info to get kv cache size (#664) OlivierDehaene 2023-07-20 17:23:49 +0200
  • 08b8eec1d7 fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661) Nicolas Patry 2023-07-20 16:04:15 +0200
  • 362883f259 fix(server): llama v2 GPTQ (#648) fxmarty 2023-07-20 15:02:54 +0200
  • 214c06f510 Add trust_remote_code to quantize script (#647) cdawg 2023-07-20 13:53:08 +0200
  • 5a1512c025 docs: Update README.md (#643) Nicolas Patry 2023-07-19 13:39:12 +0200
  • 1c81df15cd docs: Update README.md (#639) Nicolas Patry 2023-07-19 13:38:52 +0200
  • b66b190403 feat(router): ngrok edge (#642) OlivierDehaene 2023-07-19 11:59:58 +0200
  • fe80f5360c feat(server): auto max_batch_total_tokens for flash att models (#630) OlivierDehaene 2023-07-19 09:31:25 +0200
  • 5e6ddfd6a4 fix(server): fix llamav2 config (#635) OlivierDehaene 2023-07-18 18:49:42 +0200
  • cf83f9b66f v0.9.3 (#634) OlivierDehaene 2023-07-18 18:11:20 +0200
  • 211b211ec0 feat(server): add support for llamav2 (#633) Nicolas Patry 2023-07-18 18:09:53 +0200
  • 3b71c38558 feat(server): flash attention v2 (#624) OlivierDehaene 2023-07-18 16:21:18 +0200
  • 4d38a1c4ad feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587) Nicolas Patry 2023-07-18 12:19:05 +0200
  • 44acf72a73 fea(launcher): debug logs (#623) OlivierDehaene 2023-07-17 19:03:07 +0200
  • bc2873246c fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621) Nicolas Patry 2023-07-17 18:38:16 +0200
  • a2cf1bdb2f fix(server): empty_cache when stopped OlivierDehaene 2023-07-15 13:57:31 +0200
  • c58a0c185b v0.9.2 (#616) OlivierDehaene 2023-07-14 16:31:48 +0200
  • 5b9de4a1d3 fix(server): blacklist local files (#609) OlivierDehaene 2023-07-13 21:54:55 +0200
  • c8b077be79 docs: README: Add logo + baseline (#611) Victor Muštar 2023-07-13 21:45:20 +0200
  • 982ce3227b feat(router): explicit warning if revision is not set (#608) OlivierDehaene 2023-07-13 18:59:38 +0200
  • b7327205a6 feat(launcher): add arg validation and drop subprocess (#595) OlivierDehaene 2023-07-13 14:22:37 +0200
  • 3628559516 GPTQ Env vars: catch correct type of error (#596) ssmi153 2023-07-13 01:57:46 +0800
  • f2f0289fb9 feat(server): empty cache on errors OlivierDehaene 2023-07-12 17:05:50 +0200
  • 67347950b7 feat(server): Implements sharding for non divisible `vocab_size`. (#583) Nicolas Patry 2023-07-12 16:43:31 +0200
  • 2c4bf88268 fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590) ssmi153 2023-07-12 20:17:35 +0800
  • 7f9072228a fix(server): Adding logger import to t5_modeling.py (#585) Adam Kowalski 2023-07-12 03:40:32 -0500
  • db4efbf4bc fix(server): T5 weights names. (#582) Nicolas Patry 2023-07-12 10:01:42 +0200
  • f063ebde10 chore: migrate ci region for more availability. (#581) Nicolas Patry 2023-07-12 10:01:01 +0200
  • 5bd2ab6583 feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580) Nicolas Patry 2023-07-12 10:00:02 +0200
  • f0181436f4 fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579) Nicolas Patry 2023-07-12 09:51:34 +0200
  • b4024edd45 feat: better errors for warmup and TP (#575) OlivierDehaene 2023-07-10 14:47:15 +0200
  • e943a294bc fix(server): harden the weights choice to save on disk. (#561) Nicolas Patry 2023-07-07 14:50:12 +0200
  • 31b36cca21 v0.9.1 (#558) OlivierDehaene 2023-07-06 16:05:42 +0200
  • c4bb5264ac fix(server): decrease memory fragmentation (#557) OlivierDehaene 2023-07-06 14:28:33 +0200