feat: bump launcher and add new lora docs
parent 81707bfbfa
commit 43ec9dfe32
docs/source/_toc.yml
@@ -58,6 +58,9 @@
     - local: conceptual/speculation
       title: Speculation (Medusa, ngram)
     - local: conceptual/guidance
       title: How Guidance Works (via outlines)
+    - local: conceptual/mutli_lora
+      title: LoRA (Low-Rank Adaptation)
   title: Conceptual Guides

docs/source/basic_tutorials/launcher.md
@@ -409,6 +409,14 @@ Options:
           [env: MAX_CLIENT_BATCH_SIZE=]
           [default: 4]
 
+```
+## LORA_ADAPTERS
+```shell
+      --lora-adapters <LORA_ADAPTERS>
+          Lora Adapters: a list of adapter ids, i.e. `repo/adapter1,repo/adapter2`, to load during startup that will be available to callers via the `adapter_id` field in a request
+
+          [env: LORA_ADAPTERS=]
+
 ```
 ## HELP
 ```shell

docs/source/conceptual/mutli_lora.md
@@ -0,0 +1,65 @@

# LoRA (Low-Rank Adaptation)

## What is LoRA?

LoRA is a technique that allows a model to be fine-tuned efficiently while updating only a small portion of its weights. This is useful when you have a large model that has been pre-trained on a large dataset but want to fine-tune it on a smaller dataset or for a specific task.

LoRA works by adding a small number of additional weights to the model, which are used to adapt it to the new dataset or task. These additional weights are learned during the fine-tuning process, while the rest of the model's weights are kept fixed.
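
Concretely, for a frozen weight matrix `W`, LoRA learns two small matrices `A` and `B` of rank `r` and computes `h = W x + (alpha / r) * B A x`. The sketch below is purely illustrative; the class name, shapes, and hyperparameters are hypothetical, not TGI internals:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a LoRA update: h = W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # pre-trained weights stay frozen
        # Low-rank factors: only these are trained during fine-tuning.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```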

## How is it used?

LoRA can be used in many ways, and the community is always finding new ones. Technically, LoRA is a way to fine-tune a large language model on a small dataset, but in practice that covers a wide range of applications, such as:

- fine-tuning a language model on a small dataset
- fine-tuning a language model on a domain-specific dataset
- fine-tuning a language model on a dataset with limited labels

## Optimizing Inference with LoRA

LoRA adapters can be used during inference by multiplying the adapter weights with the model weights at each specified layer. This process can be computationally expensive, but thanks to the excellent work of [punica-ai](https://github.com/punica-ai/punica) and the [lorax](https://github.com/predibase/lorax) team, optimized kernels and frameworks have been developed to make it more efficient. TGI leverages these optimizations to provide fast and efficient inference with multiple LoRA models.
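
To make the trade-off concrete: a single adapter can also be folded into the base weights ahead of time, as in the sketch below, which reuses the `W' = W + (alpha / r) * B A` formulation from above (shapes and values are hypothetical). Merging removes the per-request overhead but fixes one adapter per model, which is exactly the limitation the batched multi-adapter kernels avoid:

```python
import torch
import torch.nn as nn

# Hypothetical shapes and values, for illustration only.
base = nn.Linear(1024, 1024, bias=False)
r, alpha = 8, 16
A = torch.randn(r, 1024) * 0.01
B = torch.zeros(1024, r)

# Fold the low-rank update into the frozen weights: W' = W + (alpha / r) * B A.
with torch.no_grad():
    base.weight += (alpha / r) * (B @ A)
```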

## Serving multiple LoRA adapters with TGI

Once a LoRA model has been trained, it can be used to generate text or perform other tasks just like a regular language model. However, because the model has been fine-tuned on a specific dataset, it may perform better on that dataset than a model that has not been fine-tuned.

In practice, it's often useful to have multiple LoRA models, each fine-tuned on a different dataset or for a different task. This allows you to use the model that is best suited for a particular task or dataset.

Text Generation Inference (TGI) now supports loading multiple LoRA models at startup that can be used in generation requests. This feature is available starting from version `~2.0.6` and is compatible with LoRA models trained using the `peft` library.
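
For reference, an adapter like this is typically produced with `peft` along the following lines. This is only a sketch: the base model id and target modules below are placeholders, not recommendations from this release:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder base model
config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # placeholder; depends on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# ...train as usual, then save with model.save_pretrained(...) and push to the Hub.
```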

### Specifying LoRA models

To use LoRA in TGI, when starting the server, you can specify the list of LoRA models to load using the `LORA_ADAPTERS` environment variable. For example:

```bash
LORA_ADAPTERS=predibase/customer_support,predibase/dbpedia
```
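
The same list can also be passed on the command line via the launcher's `--lora-adapters` flag added in this commit (the base model id below is a placeholder):

```bash
text-generation-launcher --model-id mistralai/Mistral-7B-v0.1 \
    --lora-adapters predibase/customer_support,predibase/dbpedia
```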

In the server logs, you will see the following message:

```txt
Loading adapter weights into model: predibase/customer_support
Loading adapter weights into model: predibase/dbpedia
```

## Generate text

You can then use these adapters in generation requests by specifying the `adapter_id` parameter in the request payload. For example:

```bash
curl 127.0.0.1:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "Hello who are you?",
        "parameters": {
            "max_new_tokens": 40,
            "adapter_id": "predibase/customer_support"
        }
    }'
```
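
The same request from Python, as a sketch using the `requests` library (assuming the server above is listening on `127.0.0.1:3000`):

```python
import requests

# Same payload as the curl example; adapter_id selects the loaded LoRA adapter.
response = requests.post(
    "http://127.0.0.1:3000/generate",
    json={
        "inputs": "Hello who are you?",
        "parameters": {"max_new_tokens": 40, "adapter_id": "predibase/customer_support"},
    },
)
print(response.json())
```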

> **Note:** The LoRA feature is new and still being improved. If you encounter any issues or have any feedback, please let us know by opening an issue on the [GitHub repository](https://github.com/huggingface/text-generation-inference/issues/new/choose). Additionally, documentation and an improved client library will be published soon.

An updated tutorial with detailed examples will be published soon. Stay tuned!

launcher/src/main.rs
@@ -449,6 +449,11 @@ struct Args {
     /// Control the maximum number of inputs that a client can send in a single request
     #[clap(default_value = "4", long, env)]
     max_client_batch_size: usize,
+
+    /// Lora Adapters: a list of adapter ids, i.e. `repo/adapter1,repo/adapter2`, to load during
+    /// startup that will be available to callers via the `adapter_id` field in a request.
+    #[clap(long, env)]
+    lora_adapters: Option<String>,
 }
 
 #[derive(Debug)]

@@ -482,6 +487,7 @@ fn shard_manager(
     max_total_tokens: usize,
     max_batch_size: Option<usize>,
     max_input_tokens: usize,
+    lora_adapters: Option<String>,
     otlp_endpoint: Option<String>,
     log_level: LevelFilter,
     status_sender: mpsc::Sender<ShardStatus>,

@@ -612,6 +618,11 @@ fn shard_manager(
         envs.push(("MAX_BATCH_SIZE".into(), max_batch_size.to_string().into()));
     }
 
+    // Pass the configured LoRA adapters to the shard via the environment
+    if let Some(lora_adapters) = lora_adapters {
+        envs.push(("LORA_ADAPTERS".into(), lora_adapters.into()));
+    }
+
     // If huggingface_hub_cache is some, pass it to the shard
     // Useful when running inside a docker container
     if let Some(huggingface_hub_cache) = huggingface_hub_cache {

@@ -1048,6 +1059,7 @@ fn spawn_shards(
         let rope_scaling = args.rope_scaling;
         let rope_factor = args.rope_factor;
         let max_batch_size = args.max_batch_size;
+        let lora_adapters = args.lora_adapters.clone();
         thread::spawn(move || {
             shard_manager(
                 model_id,

@@ -1073,6 +1085,7 @@ fn spawn_shards(
                 max_total_tokens,
                 max_batch_size,
                 max_input_tokens,
+                lora_adapters,
                 otlp_endpoint,
                 max_log_level,
                 status_sender,