Commit Graph

  • 972e9a7f7c update causal batch for ct2 and fix nf4 (#17) main Michael Feil 2024-02-09 11:07:14 -0800
  • 7f55c3ceaa bump the ctranslate2 version #17 Michael Feil 2023-12-01 17:28:51 +0100
  • c5a294b76b update causal batch for ct2 and fix nf4 michaelfeil 2023-11-04 00:00:00 +0000
  • 339ede9e90 Update Readme.md / documentation (#15) Michael Feil 2023-10-04 08:01:06 +0200
  • bbd02184ce Update README.md #15 Michael Feil 2023-10-03 15:16:27 +0200
  • 229a1bc985 update readme michaelfeil 2023-10-03 10:53:33 +0200
  • 393647af37 add documentation updates michaelfeil 2023-10-03 10:48:40 +0200
  • ff703cb867 Adding ctranslate2 quantization and inference: moving the contribution (#1) Michael Feil 2023-10-02 20:12:49 +0200
  • 09e88f2470 Merge branch 'main' into ct2_support #1 michaelfeil 2023-10-02 00:00:00 +0000
  • 012c917b6f Wrapping completions and chat/completions endpoint (#2) Michael Feil 2023-09-27 17:58:07 +0200
  • f93012d59c Merge pull request #4 from michaelfeil/bnb_4bit Yang, Bo 2023-09-08 14:52:32 -0700
  • 072f267cc3 Initialize v_cache to avoid NaNs (#12) Yang, Bo 2023-08-23 14:23:59 -0700
  • 57deda586e Update flash_causal_lm.py #12 Yang, Bo 2023-08-23 14:22:08 -0700
  • a5f96fd18e Update flash_causal_lm.py Yang, Bo 2023-08-23 14:21:52 -0700
  • c360d45c9f Initialize v_cache to avoid NaNs Yang, Bo 2023-08-23 14:15:35 -0700
  • 2fda8fe812 Initialize v_cache to avoid NaNs (#11) Yang, Bo 2023-08-23 14:07:06 -0700
  • c6114f4b0d Initialize v_cache to avoid NaNs #11 Yang, Bo 2023-08-23 21:00:58 +0000
  • 1e646fb41d Compilation fix: Correct method argument types in generation.rs and validation.rs (#10) Jason Sun 2023-08-23 16:52:49 -0400
  • 45dc82b8b4 Update router/src/validation.rs #10 Jason Sun 2023-08-22 09:32:12 -0700
  • e5a5db61ea Update benchmark/src/generation.rs Jason Sun 2023-08-22 09:32:06 -0700
  • ac2fe4f8c6 fix: Correct method argument types in generation and validation Jason Sun 2023-08-21 15:14:55 -0700
  • 8130300c9a fix: 2038y problem #2 michaelfeil 2023-08-07 13:02:34 +0200
  • ab58232a3d cargo fmt michaelfeil 2023-08-07 12:28:59 +0200
  • e8ca636eea rebase and squash commits on latest main michaelfeil 2023-08-07 12:25:17 +0200
  • 8ddfbaafb9 Merge branch 'ct2_support' of https://github.com/michaelfeil/preemo-text-generation-inference into ct2_support michaelfeil 2023-08-07 11:06:54 +0200
  • b9326ace1a adapt path michaelfeil 2023-08-06 18:26:30 +0200
  • a732244687 update changes for dockerfile michaelfeil 2023-08-06 17:10:44 +0200
  • c089b19487 update dockerfile michaelfeil 2023-08-04 14:33:49 +0200
  • df1e7b513a reformatting and changes. michaelfeil 2023-08-04 14:18:56 +0200
  • ee81780ba4 rebaseing the commit on preemo fork. michaelfeil 2023-07-30 14:07:28 +0200
  • 5963554641 adapt path michaelfeil 2023-08-06 18:26:30 +0200
  • 24632c5105 update changes for dockerfile michaelfeil 2023-08-06 17:10:44 +0200
  • 2ac9db513a update dockerfile michaelfeil 2023-08-04 14:33:49 +0200
  • bc4b3f97ec reformatting and changes. michaelfeil 2023-08-04 14:18:56 +0200
  • da9746586b Update README.md #4 Michael Feil 2023-08-03 23:23:02 +0200
  • a9838bba2f Modify exllama weight Michael Feil 2023-08-03 23:20:59 +0200
  • d2ae3581bf Claim copyright (#7) Yang, Bo 2023-08-02 17:23:54 -0700
  • 13f559c305 Claim copyright #7 Yang, Bo 2023-08-02 16:15:53 -0700
  • 8af4a7a0b0 Merge branch 'main' into bnb_4bit Yang, Bo 2023-08-02 12:47:17 -0700
  • b5fadc4c28 Don't enable custom kernels if CUDA is not available (#6) Yang, Bo 2023-08-02 09:51:54 -0700
  • 8a5f80bb61 Add AutoCausalLM (#5) Yang, Bo 2023-08-02 09:35:40 -0700
  • 656f2fe4dc fix: typo michaelfeil 2023-08-02 16:56:14 +0200
  • ec8590a3f1 Don't enable custom kernels if CUDA is not available #6 Yang, Bo 2023-08-01 17:58:00 -0700
  • ef006ccee2 Merge branch 'AutoCausalLM' of https://github.com/Atry/hf-text-generation-inference into HEAD #5 Yang, Bo 2023-08-01 12:30:08 -0700
  • 9048a80f8f Add a new README (#3) Yang, Bo 2023-08-01 12:22:07 -0700
  • 4c2237b2a0 update PR template michaelfeil 2023-08-01 18:18:28 +0200
  • 44fa36b5bf restoring commit from dev branch, rebase on current master michaelfeil 2023-08-01 18:15:18 +0200
  • 220b2afc8a Update README.md #3 Yang, Bo 2023-07-31 21:39:29 -0700
  • 76206a513f Add Preemo's README Yang, Bo 2023-07-31 21:35:16 -0700
  • 8c3d8a10cd Rename README.md to README-HuggingFace.md Yang, Bo 2023-07-31 21:34:47 -0700
  • 08b50a5bb9 rebaseing the commit on preemo fork. michaelfeil 2023-07-30 14:07:28 +0200
  • afd04dc71e feat(server): update vllm version (#723) OlivierDehaene 2023-07-28 15:36:38 +0200
  • f848decee6 docs: Add hardware section to TOC in README (#721) regisss 2023-07-28 11:20:03 +0200
  • 5a1cccbb98 Add section about TGI on other AI hardware accelerators in README (#715) regisss 2023-07-28 09:14:03 +0200
  • 9f18f4c006 v0.9.4 (#713) OlivierDehaene 2023-07-27 19:25:15 +0200
  • ab96b9aec3 feat(server): support new falcon config (#712) OlivierDehaene 2023-07-27 18:38:57 +0200
  • 2efd46ef95 fix(server): fix missing datasets in quantize OlivierDehaene 2023-07-27 14:50:45 +0200
  • 8bd0adb135 fix(server): fix quantization python requirements (#708) OlivierDehaene 2023-07-27 12:28:10 +0200
  • e64a65891b docs(README): update readme OlivierDehaene 2023-07-25 19:45:25 +0200
  • a0d55358d2 feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671) Nicolas Patry 2023-07-25 12:00:27 +0100
  • 9bb64c92a9 Add AutoCausalLM Yang, Bo 2023-07-12 01:07:10 +0000
  • 37df6df38e fix(server): fix exllama buffers (#689) OlivierDehaene 2023-07-24 14:25:43 +0200
  • 73a4d65d26 feat: add cuda memory fraction (#659) OlivierDehaene 2023-07-24 11:43:58 +0200
  • 1da642bd0e feat(server): add local prom and health routes if running w/ ngrok OlivierDehaene 2023-07-21 16:56:30 +0200
  • 15b3e9ffb0 Directly load GPTBigCode to specified device (#618) Yang, Bo 2023-07-21 02:27:31 -0700
  • d5b5bc750f feat(server): Add exllama GPTQ CUDA kernel support #553 (#666) Nicolas Patry 2023-07-21 10:59:00 +0200
  • bf94df3c71 fix(server): use mem_get_info to get kv cache size (#664) OlivierDehaene 2023-07-20 17:23:49 +0200
  • 08b8eec1d7 fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661) Nicolas Patry 2023-07-20 16:04:15 +0200
  • 362883f259 fix(server): llama v2 GPTQ (#648) fxmarty 2023-07-20 15:02:54 +0200
  • 214c06f510 Add trust_remote_code to quantize script (#647) cdawg 2023-07-20 13:53:08 +0200
  • 5a1512c025 docs: Update README.md (#643) Nicolas Patry 2023-07-19 13:39:12 +0200
  • 1c81df15cd docs: Update README.md (#639) Nicolas Patry 2023-07-19 13:38:52 +0200
  • b66b190403 feat(router): ngrok edge (#642) OlivierDehaene 2023-07-19 11:59:58 +0200
  • fe80f5360c feat(server): auto max_batch_total_tokens for flash att models (#630) OlivierDehaene 2023-07-19 09:31:25 +0200
  • 5e6ddfd6a4 fix(server): fix llamav2 config (#635) OlivierDehaene 2023-07-18 18:49:42 +0200
  • cf83f9b66f v0.9.3 (#634) OlivierDehaene 2023-07-18 18:11:20 +0200
  • 211b211ec0 feat(server): add support for llamav2 (#633) Nicolas Patry 2023-07-18 18:09:53 +0200
  • 3b71c38558 feat(server): flash attention v2 (#624) OlivierDehaene 2023-07-18 16:21:18 +0200
  • 4d38a1c4ad feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587) Nicolas Patry 2023-07-18 12:19:05 +0200
  • 44acf72a73 fea(launcher): debug logs (#623) OlivierDehaene 2023-07-17 19:03:07 +0200
  • bc2873246c fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621) Nicolas Patry 2023-07-17 18:38:16 +0200
  • a2cf1bdb2f fix(server): empty_cache when stopped OlivierDehaene 2023-07-15 13:57:31 +0200
  • c58a0c185b v0.9.2 (#616) OlivierDehaene 2023-07-14 16:31:48 +0200
  • 5b9de4a1d3 fix(server): blacklist local files (#609) OlivierDehaene 2023-07-13 21:54:55 +0200
  • c8b077be79 docs: README: Add logo + baseline (#611) Victor Muštar 2023-07-13 21:45:20 +0200
  • 982ce3227b feat(router): explicit warning if revision is not set (#608) OlivierDehaene 2023-07-13 18:59:38 +0200
  • b7327205a6 feat(launcher): add arg validation and drop subprocess (#595) OlivierDehaene 2023-07-13 14:22:37 +0200
  • 3628559516 GPTQ Env vars: catch correct type of error (#596) ssmi153 2023-07-13 01:57:46 +0800
  • f2f0289fb9 feat(server): empty cache on errors OlivierDehaene 2023-07-12 17:05:50 +0200
  • 67347950b7 feat(server): Implements sharding for non divisible `vocab_size`. (#583) Nicolas Patry 2023-07-12 16:43:31 +0200
  • 2c4bf88268 fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590) ssmi153 2023-07-12 20:17:35 +0800
  • 7f9072228a fix(server): Adding logger import to t5_modeling.py (#585) Adam Kowalski 2023-07-12 03:40:32 -0500
  • db4efbf4bc fix(server): T5 weights names. (#582) Nicolas Patry 2023-07-12 10:01:42 +0200
  • f063ebde10 chore: migrate ci region for more availability. (#581) Nicolas Patry 2023-07-12 10:01:01 +0200
  • 5bd2ab6583 feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580) Nicolas Patry 2023-07-12 10:00:02 +0200
  • f0181436f4 fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579) Nicolas Patry 2023-07-12 09:51:34 +0200
  • b4024edd45 feat: better errors for warmup and TP (#575) OlivierDehaene 2023-07-10 14:47:15 +0200
  • e943a294bc fix(server): harden the weights choice to save on disk. (#561) Nicolas Patry 2023-07-07 14:50:12 +0200
  • 31b36cca21 v0.9.1 (#558) OlivierDehaene 2023-07-06 16:05:42 +0200
  • c4bb5264ac fix(server): decrease memory fragmentation (#557) OlivierDehaene 2023-07-06 14:28:33 +0200