Stable Diffusion Latent Upscaler (#2059)
* Modify UNet2DConditionModel
- allow skipping mid_block
- add a `norm_group_size` argument so that we can set the `num_groups` for group norm using `num_channels // norm_group_size`
- allow user to set dimension for the timestep embedding (`time_embed_dim`)
- the kernel_size for `conv_in` and `conv_out` is now configurable
- add random fourier feature layer (`GaussianFourierProjection`) for `time_proj`
- allow the user to add the time and class embeddings together before passing them through the projection layer - `time_embedding(t_emb + class_label)`
- added 2 arguments `attn1_types` and `attn2_types`
* currently we have the argument `only_cross_attention`: when it's set to `True`, the `BasicTransformerBlock` is built with 2 cross-attention layers; otherwise we
get a self-attention followed by a cross-attention. In the k-upscaler we need blocks with either just one cross-attention, or a self-attention followed by a cross-attention,
so I added `attn1_types` and `attn2_types` to the unet's argument list to let the user specify the attention type for each of the 2 positions in a block. Note that I still kept
the `only_cross_attention` argument on the unet for easy configuration, but it is converted to `attn1_type` and `attn2_type` when passed down to the down blocks
- the position of downsample layer and upsample layer is now configurable
- in the k-upscaler unet there is only one skip connection per up/down block (instead of one per layer as in the stable diffusion unet); added `skip_freq = "block"` to support
this use case
- if the user passes an attention_mask to the unet, it will prepare the mask and pass a flag to the cross attention processor to skip the `prepare_attention_mask` step
inside the cross attention block
add up/down blocks for k-upscaler
modify CrossAttention class
- make the `dropout` layer in `to_out` optional
- `use_conv_proj` - use conv instead of linear for all projection layers (i.e. `to_q`, `to_k`, `to_v`, `to_out`) whenever possible; note that when it's used for cross
attention, `to_k` and `to_v` have to be linear because `encoder_hidden_states` is not 2D
- `cross_attention_norm` - add an optional layernorm on encoder_hidden_states
- `attention_dropout`: add an optional dropout on attention score
adapt BasicTransformerBlock
- add an ada groupnorm layer to condition the attention input on the timestep embedding
- allow skipping the FeedForward layer in between the attentions
- replaced the only_cross_attention argument with attn1_type and attn2_type for more flexible configuration
update timestep embedding: add new act_fn gelu and an optional act_2
modified ResnetBlock2D
- refactored with AdaGroupNorm class (the timestep scale shift normalization)
- add `mid_channel` argument - allow the first conv to have a different output dimension from the second conv
- add an option to use AdaGroupNorm on the input instead of group norm
- add an option to add a dropout layer after each conv
- allow the user to set the bias in conv_shortcut (needed for k-upscaler)
- add gelu
add conversion script for the k-upscaler unet
add pipeline
* fix attention mask
* fix a typo
* fix a bug
* make sure model can be used with GPU
* make pipeline work with fp16
* fix an error in BasicTransformerBlock
* make style
* fix typo
* some more fixes
* uP
* up
* correct more
* some clean-up
* clean time proj
* up
* uP
* more changes
* remove the upcast_attention=True from unet config
* remove attn1_types, attn2_types etc
* fix
* revert incorrect changes up/down samplers
* make style
* remove outdated files
* Apply suggestions from code review
* attention refactor
* refactor cross attention
* Apply suggestions from code review
* update
* up
* update
* Apply suggestions from code review
* finish
* Update src/diffusers/models/cross_attention.py
* more fixes
* up
* up
* up
* finish
* more corrections of conversion state
* act_2 -> act_2_fn
* remove dropout_after_conv from ResnetBlock2D
* make style
* simplify KAttentionBlock
* add fast test for latent upscaler pipeline
* add slow test
* slow test fp16
* make style
* add doc string for pipeline_stable_diffusion_latent_upscale
* add api doc page for latent upscaler pipeline
* deprecate attention mask
* clean up embeddings
* simplify resnet
* up
* clean up resnet
* up
* correct more
* up
* up
* improve a bit more
* correct more
* more clean-ups
* Update docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* add docstrings for new unet config
* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* # Copied from
* encode the image if not latent
* remove force casting vae to fp32
* fix
* add comments about preconditioning parameters from k-diffusion paper
* attn1_type, attn2_type -> add_self_attention
* clean up get_down_block and get_up_block
* fix
* fixed a typo(?) in ada group norm
* update slice attention processer for cross attention
* update slice
* fix fast test
* update the checkpoint
* finish tests
* fix-copies
* fix-copy for modeling_text_unet.py
* make style
* make style
* fix f-string
* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* fix import
* correct changes
* fix resnet
* make fix-copies
* correct euler scheduler
* add missing #copied from for preprocess
* revert
* fix
* fix copies
* Update docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update src/diffusers/models/cross_attention.py
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* clean up conversion script
* KDownsample2d,KUpsample2d -> KDownsample2D,KUpsample2D
* more
* Update src/diffusers/models/unet_2d_condition.py
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* remove prepare_extra_step_kwargs
* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* fix a typo in timestep embedding
* remove num_image_per_prompt
* fix fasttest
* make style + fix-copies
* fix
* fix xformer test
* fix style
* doc string
* make style
* fix-copies
* docstring for time_embedding_norm
* make style
* final finishes
* make fix-copies
* fix tests
---------
Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
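
The pipeline added here can be chained after a regular Stable Diffusion pipeline to upscale its latents. A minimal usage sketch (the `stabilityai/sd-x2-latent-upscaler` checkpoint name and the CUDA device are assumptions for illustration, not part of this file):

    import torch
    from diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")
    upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
        "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a photo of an astronaut riding a horse"
    generator = torch.manual_seed(33)

    # generate latents with the base pipeline, then upscale them 2x in latent space
    low_res_latents = pipe(prompt, generator=generator, output_type="latent").images
    image = upscaler(
        prompt=prompt, image=low_res_latents, num_inference_steps=20, guidance_scale=0, generator=generator
    ).images[0]
    image.save("astronaut_1024.png")
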
import argparse

import huggingface_hub
import k_diffusion as K
import torch

from diffusers import UNet2DConditionModel


UPSCALER_REPO = "pcuenq/k-upscaler"


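# The helpers below rename the weights of the original k-diffusion latent upscaler
# checkpoint (downloaded from UPSCALER_REPO) into the parameter names expected by
# diffusers' UNet2DConditionModel, so the converted state dict can be loaded with
# load_state_dict(strict=True) and saved with save_pretrained.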
def resnet_to_diffusers_checkpoint(resnet, checkpoint, *, diffusers_resnet_prefix, resnet_prefix):
    rv = {
        # norm1
        f"{diffusers_resnet_prefix}.norm1.linear.weight": checkpoint[f"{resnet_prefix}.main.0.mapper.weight"],
        f"{diffusers_resnet_prefix}.norm1.linear.bias": checkpoint[f"{resnet_prefix}.main.0.mapper.bias"],
        # conv1
        f"{diffusers_resnet_prefix}.conv1.weight": checkpoint[f"{resnet_prefix}.main.2.weight"],
        f"{diffusers_resnet_prefix}.conv1.bias": checkpoint[f"{resnet_prefix}.main.2.bias"],
        # norm2
        f"{diffusers_resnet_prefix}.norm2.linear.weight": checkpoint[f"{resnet_prefix}.main.4.mapper.weight"],
        f"{diffusers_resnet_prefix}.norm2.linear.bias": checkpoint[f"{resnet_prefix}.main.4.mapper.bias"],
        # conv2
        f"{diffusers_resnet_prefix}.conv2.weight": checkpoint[f"{resnet_prefix}.main.6.weight"],
        f"{diffusers_resnet_prefix}.conv2.bias": checkpoint[f"{resnet_prefix}.main.6.bias"],
    }

    if resnet.conv_shortcut is not None:
        rv.update(
            {
                f"{diffusers_resnet_prefix}.conv_shortcut.weight": checkpoint[f"{resnet_prefix}.skip.weight"],
            }
        )

    return rv


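# In the original k-diffusion attention blocks, the q/k/v (or k/v for cross-attention)
# projections appear to be stored as a single fused 1x1-conv weight. The two helpers
# below chunk that fused weight into diffusers' separate to_q / to_k / to_v linear
# layers and squeeze away the trailing 1x1 spatial dims of the conv kernels.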
def self_attn_to_diffusers_checkpoint(checkpoint, *, diffusers_attention_prefix, attention_prefix):
    weight_q, weight_k, weight_v = checkpoint[f"{attention_prefix}.qkv_proj.weight"].chunk(3, dim=0)
    bias_q, bias_k, bias_v = checkpoint[f"{attention_prefix}.qkv_proj.bias"].chunk(3, dim=0)

    rv = {
        # norm
        f"{diffusers_attention_prefix}.norm1.linear.weight": checkpoint[f"{attention_prefix}.norm_in.mapper.weight"],
        f"{diffusers_attention_prefix}.norm1.linear.bias": checkpoint[f"{attention_prefix}.norm_in.mapper.bias"],
        # to_q
        f"{diffusers_attention_prefix}.attn1.to_q.weight": weight_q.squeeze(-1).squeeze(-1),
        f"{diffusers_attention_prefix}.attn1.to_q.bias": bias_q,
        # to_k
        f"{diffusers_attention_prefix}.attn1.to_k.weight": weight_k.squeeze(-1).squeeze(-1),
        f"{diffusers_attention_prefix}.attn1.to_k.bias": bias_k,
        # to_v
        f"{diffusers_attention_prefix}.attn1.to_v.weight": weight_v.squeeze(-1).squeeze(-1),
        f"{diffusers_attention_prefix}.attn1.to_v.bias": bias_v,
        # to_out
        f"{diffusers_attention_prefix}.attn1.to_out.0.weight": checkpoint[f"{attention_prefix}.out_proj.weight"]
        .squeeze(-1)
        .squeeze(-1),
        f"{diffusers_attention_prefix}.attn1.to_out.0.bias": checkpoint[f"{attention_prefix}.out_proj.bias"],
    }

    return rv


def cross_attn_to_diffusers_checkpoint(
    checkpoint, *, diffusers_attention_prefix, diffusers_attention_index, attention_prefix
):
    weight_k, weight_v = checkpoint[f"{attention_prefix}.kv_proj.weight"].chunk(2, dim=0)
    bias_k, bias_v = checkpoint[f"{attention_prefix}.kv_proj.bias"].chunk(2, dim=0)

    rv = {
        # norm2 (ada groupnorm)
        f"{diffusers_attention_prefix}.norm{diffusers_attention_index}.linear.weight": checkpoint[
            f"{attention_prefix}.norm_dec.mapper.weight"
        ],
        f"{diffusers_attention_prefix}.norm{diffusers_attention_index}.linear.bias": checkpoint[
            f"{attention_prefix}.norm_dec.mapper.bias"
        ],
        # layernorm on encoder_hidden_states
        f"{diffusers_attention_prefix}.attn{diffusers_attention_index}.norm_cross.weight": checkpoint[
            f"{attention_prefix}.norm_enc.weight"
        ],
        f"{diffusers_attention_prefix}.attn{diffusers_attention_index}.norm_cross.bias": checkpoint[
            f"{attention_prefix}.norm_enc.bias"
        ],
        # to_q
        f"{diffusers_attention_prefix}.attn{diffusers_attention_index}.to_q.weight": checkpoint[
            f"{attention_prefix}.q_proj.weight"
        ]
        .squeeze(-1)
        .squeeze(-1),
        f"{diffusers_attention_prefix}.attn{diffusers_attention_index}.to_q.bias": checkpoint[
            f"{attention_prefix}.q_proj.bias"
        ],
        # to_k
        f"{diffusers_attention_prefix}.attn{diffusers_attention_index}.to_k.weight": weight_k.squeeze(-1).squeeze(-1),
        f"{diffusers_attention_prefix}.attn{diffusers_attention_index}.to_k.bias": bias_k,
        # to_v
        f"{diffusers_attention_prefix}.attn{diffusers_attention_index}.to_v.weight": weight_v.squeeze(-1).squeeze(-1),
        f"{diffusers_attention_prefix}.attn{diffusers_attention_index}.to_v.bias": bias_v,
        # to_out
        f"{diffusers_attention_prefix}.attn{diffusers_attention_index}.to_out.0.weight": checkpoint[
            f"{attention_prefix}.out_proj.weight"
        ]
        .squeeze(-1)
        .squeeze(-1),
        f"{diffusers_attention_prefix}.attn{diffusers_attention_index}.to_out.0.bias": checkpoint[
            f"{attention_prefix}.out_proj.bias"
        ],
    }

    return rv


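# Each original up/down block is stored as a flat list of modules under
# inner_model.u_net.{u,d}_blocks.<block_idx>. Per "layer" there are 1, 2 or 3 modules
# (resnet only / resnet + cross-attention / resnet + self-attention + cross-attention),
# which is what `n` below counts. The extra +1 offset applied to down blocks accounts
# for an additional leading module at the start of each d_block (presumably the
# downsampling layer).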
def block_to_diffusers_checkpoint(block, checkpoint, block_idx, block_type):
    block_prefix = "inner_model.u_net.u_blocks" if block_type == "up" else "inner_model.u_net.d_blocks"
    block_prefix = f"{block_prefix}.{block_idx}"

    diffusers_checkpoint = {}

    if not hasattr(block, "attentions"):
        n = 1  # resnet only
    elif not block.attentions[0].add_self_attention:
        n = 2  # resnet -> cross-attention
    else:
        n = 3  # resnet -> self-attention -> cross-attention

    for resnet_idx, resnet in enumerate(block.resnets):
        diffusers_resnet_prefix = f"{block_type}_blocks.{block_idx}.resnets.{resnet_idx}"
        idx = n * resnet_idx if block_type == "up" else n * resnet_idx + 1
        resnet_prefix = f"{block_prefix}.{idx}"

        diffusers_checkpoint.update(
            resnet_to_diffusers_checkpoint(
                resnet, checkpoint, diffusers_resnet_prefix=diffusers_resnet_prefix, resnet_prefix=resnet_prefix
            )
        )

    if hasattr(block, "attentions"):
        for attention_idx, attention in enumerate(block.attentions):
            diffusers_attention_prefix = f"{block_type}_blocks.{block_idx}.attentions.{attention_idx}"
            idx = n * attention_idx + 1 if block_type == "up" else n * attention_idx + 2
            self_attention_prefix = f"{block_prefix}.{idx}"

            cross_attention_index = 1 if not attention.add_self_attention else 2
            idx = (
                n * attention_idx + cross_attention_index
                if block_type == "up"
                else n * attention_idx + cross_attention_index + 1
            )
            cross_attention_prefix = f"{block_prefix}.{idx}"

            diffusers_checkpoint.update(
                cross_attn_to_diffusers_checkpoint(
                    checkpoint,
                    diffusers_attention_prefix=diffusers_attention_prefix,
                    diffusers_attention_index=2,
                    attention_prefix=cross_attention_prefix,
                )
            )

            if attention.add_self_attention:
                diffusers_checkpoint.update(
                    self_attn_to_diffusers_checkpoint(
                        checkpoint,
                        diffusers_attention_prefix=diffusers_attention_prefix,
                        attention_prefix=self_attention_prefix,
                    )
                )

    return diffusers_checkpoint


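# Top-level layout of the original checkpoint, as consumed below:
#   inner_model.proj_in.*         -> conv_in.*
#   inner_model.timestep_embed.*  -> time_proj.* (fourier features)
#   inner_model.mapping.{0,2}.*   -> time_embedding.linear_{1,2}.*
#   inner_model.mapping_cond.*    -> time_embedding.cond_proj.*
#   inner_model.u_net.d_blocks.*  -> down_blocks.*
#   inner_model.u_net.u_blocks.*  -> up_blocks.*
#   inner_model.proj_out.*        -> conv_out.*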
def unet_to_diffusers_checkpoint(model, checkpoint):
    diffusers_checkpoint = {}

    # pre-processing
    diffusers_checkpoint.update(
        {
            "conv_in.weight": checkpoint["inner_model.proj_in.weight"],
            "conv_in.bias": checkpoint["inner_model.proj_in.bias"],
        }
    )

    # timestep and class embedding
    diffusers_checkpoint.update(
        {
            "time_proj.weight": checkpoint["inner_model.timestep_embed.weight"].squeeze(-1),
            "time_embedding.linear_1.weight": checkpoint["inner_model.mapping.0.weight"],
            "time_embedding.linear_1.bias": checkpoint["inner_model.mapping.0.bias"],
            "time_embedding.linear_2.weight": checkpoint["inner_model.mapping.2.weight"],
            "time_embedding.linear_2.bias": checkpoint["inner_model.mapping.2.bias"],
            "time_embedding.cond_proj.weight": checkpoint["inner_model.mapping_cond.weight"],
        }
    )

    # down_blocks
    for down_block_idx, down_block in enumerate(model.down_blocks):
        diffusers_checkpoint.update(block_to_diffusers_checkpoint(down_block, checkpoint, down_block_idx, "down"))

    # up_blocks
    for up_block_idx, up_block in enumerate(model.up_blocks):
        diffusers_checkpoint.update(block_to_diffusers_checkpoint(up_block, checkpoint, up_block_idx, "up"))

    # post-processing
    diffusers_checkpoint.update(
        {
            "conv_out.weight": checkpoint["inner_model.proj_out.weight"],
            "conv_out.bias": checkpoint["inner_model.proj_out.bias"],
        }
    )

    return diffusers_checkpoint


def unet_model_from_original_config(original_config):
    in_channels = original_config["input_channels"] + original_config["unet_cond_dim"]
    out_channels = original_config["input_channels"] + (1 if original_config["has_variance"] else 0)

    block_out_channels = original_config["channels"]

    assert (
        len(set(original_config["depths"])) == 1
    ), "UNet2DConditionModel currently does not support blocks with different numbers of layers"
    layers_per_block = original_config["depths"][0]

    class_labels_dim = original_config["mapping_cond_dim"]
    cross_attention_dim = original_config["cross_cond_dim"]

    attn1_types = []
    attn2_types = []
    for s, c in zip(original_config["self_attn_depths"], original_config["cross_attn_depths"]):
        if s:
            a1 = "self"
            a2 = "cross" if c else None
        elif c:
            a1 = "cross"
            a2 = None
        else:
            a1 = None
            a2 = None
        attn1_types.append(a1)
        attn2_types.append(a2)

    unet = UNet2DConditionModel(
        in_channels=in_channels,
        out_channels=out_channels,
        down_block_types=("KDownBlock2D", "KCrossAttnDownBlock2D", "KCrossAttnDownBlock2D", "KCrossAttnDownBlock2D"),
        mid_block_type=None,
        up_block_types=("KCrossAttnUpBlock2D", "KCrossAttnUpBlock2D", "KCrossAttnUpBlock2D", "KUpBlock2D"),
        block_out_channels=block_out_channels,
        layers_per_block=layers_per_block,
        act_fn="gelu",
        norm_num_groups=None,
        cross_attention_dim=cross_attention_dim,
        attention_head_dim=64,
        time_cond_proj_dim=class_labels_dim,
        resnet_time_scale_shift="scale_shift",
        time_embedding_type="fourier",
        timestep_post_act="gelu",
        conv_in_kernel=1,
        conv_out_kernel=1,
    )

    return unet


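# For reference, a minimal, hypothetical config dict (the keys are the ones read above,
# the values are made up for illustration; the real values come from
# config_laion_text_cond_latent_upscaler_2.json on the Hub):
#
#   example_config = {
#       "input_channels": 4,
#       "unet_cond_dim": 4,
#       "has_variance": True,
#       "channels": [128, 256, 512, 768],
#       "depths": [2, 2, 2, 2],
#       "self_attn_depths": [False, False, False, True],
#       "cross_attn_depths": [True, True, True, True],
#       "mapping_cond_dim": 896,
#       "cross_cond_dim": 768,
#   }
#   unet = unet_model_from_original_config(example_config)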
def main(args):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    orig_config_path = huggingface_hub.hf_hub_download(UPSCALER_REPO, "config_laion_text_cond_latent_upscaler_2.json")
    orig_weights_path = huggingface_hub.hf_hub_download(
        UPSCALER_REPO, "laion_text_cond_latent_upscaler_2_1_00470000_slim.pth"
    )
    print(f"loading original model configuration from {orig_config_path}")
    print(f"loading original model checkpoint from {orig_weights_path}")

    print("converting to diffusers unet")
    orig_config = K.config.load_config(open(orig_config_path))["model"]
    model = unet_model_from_original_config(orig_config)

    orig_checkpoint = torch.load(orig_weights_path, map_location=device)["model_ema"]
    converted_checkpoint = unet_to_diffusers_checkpoint(model, orig_checkpoint)

    model.load_state_dict(converted_checkpoint, strict=True)
    model.save_pretrained(args.dump_path)
    print(f"saving converted unet model in {args.dump_path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--dump_path", default=None, type=str, required=True, help="Path to the output model.")
    args = parser.parse_args()

    main(args)
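# Usage sketch (the script name and output path below are examples, not fixed by this file):
#
#   python convert_k_upscaler_to_diffusers.py --dump_path ./k-upscaler-unet
#
# The converted weights can then be loaded back with:
#
#   from diffusers import UNet2DConditionModel
#   unet = UNet2DConditionModel.from_pretrained("./k-upscaler-unet")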