<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Memory and speed

We present some techniques and ideas to optimize 🤗 Diffusers _inference_ for memory or speed.

## CUDA `autocast`

If you use a CUDA GPU, you can take advantage of `torch.autocast` to perform inference roughly twice as fast at the cost of slightly lower precision. All you need to do is put your inference call inside an `autocast` context manager. The following example shows how to do this with Stable Diffusion text-to-image generation:

```Python
from torch import autocast
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=True)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt).images[0]
```

Despite the precision loss, in our experience the final image results look the same as the `float32` versions. Feel free to experiment and report back!
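
If you want to measure the speedup on your own hardware, a minimal sketch along these lines can compare a plain `float32` run against one wrapped in `autocast`. This is an illustration, not part of the library, and exact timings will vary by GPU:

```Python
import time

import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=True)
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"

# Warm-up run so one-time CUDA initialization doesn't skew the comparison
_ = pipe(prompt)

# Full float32 inference
torch.cuda.synchronize()
start = time.perf_counter()
image_fp32 = pipe(prompt).images[0]
torch.cuda.synchronize()
print(f"float32:  {time.perf_counter() - start:.1f} s")

# Same pipeline, wrapped in autocast
torch.cuda.synchronize()
start = time.perf_counter()
with autocast("cuda"):
    image_autocast = pipe(prompt).images[0]
torch.cuda.synchronize()
print(f"autocast: {time.perf_counter() - start:.1f} s")
```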

## Half precision weights

To save more GPU memory, you can load the model weights directly in half precision. This involves loading the `float16` version of the weights, which was saved to a branch named `fp16`, and telling PyTorch to use the `float16` type when loading them:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True
)
```

## Sliced attention for additional memory savings

For even more memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.

<Tip>
Attention slicing is useful even with a batch size of just 1, as long as the model uses more than one attention head. If there is more than one attention head, the *QK^T* attention matrix can be computed sequentially for each head, which can save a significant amount of memory.
</Tip>
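
To build intuition for what slicing buys you, here is a minimal, self-contained sketch of the idea. It is *not* the Diffusers implementation, and the function name and tensor shapes are assumptions for illustration only; Diffusers applies the same principle internally when slicing is enabled, so you only need to call the method shown below.

```Python
import torch

def sliced_attention(q, k, v, slice_size=1):
    # q, k, v have shape (batch * heads, seq_len, head_dim).
    # Instead of materializing the full QK^T matrix for every head at once,
    # process `slice_size` heads at a time so only a small slice of it
    # lives in memory at any given moment.
    out = torch.empty_like(q)
    scale = q.shape[-1] ** -0.5
    for i in range(0, q.shape[0], slice_size):
        s = slice(i, i + slice_size)
        attn = (q[s] @ k[s].transpose(-1, -2)) * scale  # QK^T for this slice only
        out[s] = attn.softmax(dim=-1) @ v[s]
    return out
```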

To perform the attention computation sequentially over each head, you only need to invoke [`~StableDiffusionPipeline.enable_attention_slicing`] on your pipeline before inference, like here:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
with torch.autocast("cuda"):
    image = pipe(prompt).images[0]
```

There's a small performance penalty (inference is about 10% slower), but this method allows you to use Stable Diffusion in as little as 3.2 GB of VRAM!
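
If you want to verify the memory footprint on your own setup, one quick check is to continue the snippet above and read PyTorch's CUDA peak-memory statistics. This is just a sketch, and it only counts memory allocated through PyTorch:

```Python
import torch

# Reset the peak-memory counter, run a generation, then read the high-water mark
# of memory allocated by PyTorch since the reset (weights plus activations).
torch.cuda.reset_peak_memory_stats()

with torch.autocast("cuda"):
    image = pipe(prompt).images[0]

print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```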