硬核解读Stable Diffusion(系列二)
设为星标,干货直达!
SD的主要应用
下面来介绍SD的主要应用,这包括文生图,图生图以及图像inpainting。其中文生图是SD的基础功能:根据输入文本生成相应的图像,而图生图和图像inpainting是在文生图的基础上延伸出来的两个功能。
文生图
根据文本生成图像这是文生图的最核心的功能,下图为SD的文生图的推理流程图:首先根据输入text用text encoder提取text embeddings,同时初始化一个随机噪音noise(latent上的,512x512图像对应的noise维度为64x64x4),然后将text embeddings和noise送入扩散模型UNet中生成去噪后的latent,最后送入autoencoder的decoder模块得到生成的图像。StableDiffusionPipeline
来实现文生图,具体代码如下所示:
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image
# 组合图像,生成grid
def image_grid(imgs, rows, cols):
assert len(imgs) == rows*cols
w, h = imgs[0].size
grid = Image.new('RGB', size=(cols*w, rows*h))
grid_w, grid_h = grid.size
for i, img in enumerate(imgs):
grid.paste(img, box=(i%cols*w, i//cols*h))
return grid
# 加载文生图pipeline
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", # 或者使用 SD v1.4: "CompVis/stable-diffusion-v1-4"
torch_dtype=torch.float16
).to("cuda")
# 输入text,这里text又称为prompt
prompts = [
"a photograph of an astronaut riding a horse",
"A cute otter in a rainbow whirlpool holding shells, watercolor",
"An avocado armchair",
"A white dog wearing sunglasses"
]
generator = torch.Generator("cuda").manual_seed(42) # 定义随机seed,保证可重复性
# 执行推理
images = pipe(
prompts,
height=512,
width=512,
num_inference_steps=50,
guidance_scale=7.5,
negative_prompt=None,
num_images_per_prompt=1,
generator=generator
).images
grid = image_grid(images, rows=1, cols=4)
grid
生成的图像效果如下所示:
另外一个参数是num_inference_steps
,它是指推理过程中的去噪步数或者采样步数。SD在训练过程采用的是步数为1000的noise scheduler,但是在推理时往往采用速度更快的scheduler:只需要少量的采样步数就能生成不错的图像,比如SD默认采用PNDM scheduler,它只需要采样50步就可以出图。当然我们也可以换用其它类型的scheduler,比如DDIM scheduler和DPM-Solver scheduler。我们可以在diffusers中直接替换scheduler,比如我们想使用DDIM:
from diffusers import DDIMScheduler
# 注意这里的clip_sample要关闭,否则生成图像存在问题,因为不能对latent进行clip
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, clip_sample=False)
换成DDIM后,同样的采样步数生成的图像如下所示,在部分细节上和PNDM有差异:guidance_scale
,前面说过当CFG的guidance_scale
越大时,生成的图像应该会和输入文本更一致,这里我们同样以宇航员骑马为例来测试不同guidance_scale下的图像生成效果。下图为guidance_scale为1,3,5,7,9和11下生成的图像对比,可以看到当guidance_scale较低时生成的图像效果是比较差的,当guidance_scale在7~9时,生成的图像效果是可以的,当采用更大的guidance_scale比如11,图像的色彩过饱和而看起来不自然,所以SD默认采用的guidance_scale为7.5。
另外一个比较容易忽略的参数是negative_prompt
,这个参数和CFG有关,前面说过,SD采用了CFG来提升生成图像的质量。使用CFG,去噪过程的噪音预测不仅仅依赖条件扩散模型,也依赖无条件扩散模型:这里的negative_prompt
便是无条件扩散模型的text输入,前面说过训练过程中我们将text置为空字符串来实现无条件扩散模型,所以这里:negative_prompt = None = ""
。但是有时候我们可以使用不为空的negative_prompt来避免模型生成的图像包含不想要的东西,因为从上述公式可以看到这里的无条件扩散模型是我们想远离的部分。下面我们来举几个具体的例子,首先来看生成人物图像的一个例子,这里的输入文本为"a portrait of a beautiful blonde woman",其生成的图像如下所示:
第二个例子是一个建筑物,这里的输入文本为"A Hyperrealistic photograph of German architectural modern home",默认图像生成效果如下所示:
SD默认生成512x512大小的图像,但实际上可以生成其它分辨率的图像,但是可能会出现不协调,如果采用多尺度策略训练,会改善这种情况; 采用快速的noise scheduler,SD在去噪步数为30~50步时就能生成稳定的图像; SD的guidance_scale设置为7~9是比较稳定的,过小和过大都会出现图像质量下降,实际使用中可以根据具体情况灵活调节; 可以使用negative prompt来去除不想要的东西来改善图像生成效果; 好的prompt对图像生成效果是至关重要的。
上边我们介绍了如何使用SD进行文生图以及一些主要参数,在最后我们也给出文生图这个pipeline的内部流程代码,如下所示:
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from tqdm.auto import tqdm
model_id = "runwayml/stable-diffusion-v1-5"
# 1. 加载autoencoder
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
# 2. 加载tokenizer和text encoder
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
# 3. 加载扩散模型UNet
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
# 4. 定义noise scheduler
noise_scheduler = DDIMScheduler(
num_train_timesteps=1000,
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False, # don't clip sample, the x0 in stable diffusion not in range [-1, 1]
set_alpha_to_one=False,
)
# 将模型复制到GPU上
device = "cuda"
vae.to(device, dtype=torch.float16)
text_encoder.to(device, dtype=torch.float16)
unet = unet.to(device, dtype=torch.float16)
# 定义参数
prompt = [
"A dragon fruit wearing karate belt in the snow",
"A small cactus wearing a straw hat and neon sunglasses in the Sahara desert",
"A photo of a raccoon wearing an astronaut helmet, looking out of the window at night",
"A cute otter in a rainbow whirlpool holding shells, watercolor"
]
height = 512
width = 512
num_inference_steps = 50
guidance_scale = 7.5
negative_prompt = ""
batch_size = len(prompt)
# 随机种子
generator = torch.Generator(device).manual_seed(2023)
with torch.no_grad():
# 获取text_embeddings
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
text_embeddings = text_encoder(text_input.input_ids.to(device))[0]
# 获取unconditional text embeddings
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
[negative_prompt] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]
# 拼接为batch,方便并行计算
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
# 生成latents的初始噪音
latents = torch.randn(
(batch_size, unet.in_channels, height // 8, width // 8),
generator=generator, device=device
)
latents = latents.to(device, dtype=torch.float16)
# 设置采样步数
noise_scheduler.set_timesteps(num_inference_steps, device=device)
# scale the initial noise by the standard deviation required by the scheduler
latents = latents * noise_scheduler.init_noise_sigma # for DDIM, init_noise_sigma = 1.0
timesteps_tensor = noise_scheduler.timesteps
# Do denoise steps
for t in tqdm(timesteps_tensor):
# 这里latens扩展2份,是为了同时计算unconditional prediction
latent_model_input = torch.cat([latents] * 2)
latent_model_input = noise_scheduler.scale_model_input(latent_model_input, t) # for DDIM, do nothing
# 使用UNet预测噪音
noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
# 执行CFG
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
# 计算上一步的noisy latents:x_t -> x_t-1
latents = noise_scheduler.step(noise_pred, t, latents).prev_sample
# 注意要对latents进行scale
latents = 1 / 0.18215 * latents
# 使用vae解码得到图像
image = vae.decode(latents).sample
图生图
图生图(image2image)是对文生图功能的一个扩展,这个功能来源于SDEdit这个工作,其核心思路也非常简单:给定一个笔画的色块图像,可以先给它加一定的高斯噪音(执行扩散过程)得到噪音图像,然后基于扩散模型对这个噪音图像进行去噪,就可以生成新的图像,但是这个图像在结构和布局和输入图像基本一致。StableDiffusionImg2ImgPipeline
来实现文生图,具体代码如下所示:
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
# 加载图生图pipeline
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
# 读取初始图片
init_image = Image.open("init_image.png").convert("RGB")
# 推理
prompt = "A fantasy landscape, trending on artstation"
generator = torch.Generator(device="cuda").manual_seed(2023)
image = pipe(
prompt=prompt,
image=init_image,
strength=0.8,
guidance_scale=7.5,
generator=generator
).images[0]
image
相比文生图的pipeline,图生图的pipeline还多了一个参数strength
,这个参数介于0-1之间,表示对输入图片加噪音的程度,这个值越大加的噪音越多,对原始图片的破坏也就越大,当strength=1时,其实就变成了一个随机噪音,此时就相当于纯粹的文生图pipeline了。下面展示了一个具体的实例,这里的第一张图为输入的初始图片,它是一个笔画的色块,我们可以通过图生图将它生成一幅具体的图像,其中第2张图和第3张图的strength分别是0.5和0.8,可以看到当strength=0.5时,生成的图像和原图比较一致,但是就比较简单了,当strength=0.8时,生成的图像偏离原图更多,但是图像的质感有一个明显的提升。
图生图这个功能一个更广泛的应用是在风格转换上,比如给定一张人像,想生成动漫风格的图像。这里我们可以使用动漫风格的开源模型anything-v4.0,它是基于SD v1.5在动漫风格数据集上finetune的,使用它可以更好地利用图生图将人物动漫化。下面的第1张为输入人物图像,采用的prompt为"masterpiece, best quality, 1girl, red hair, medium hair, green eyes",后面的图像是strength分别为0.3-0.9下生成的图像。可以看到在不同的strength下图像有不同的生成效果,其中strength=0.6时我觉得效果是最好的。
总结来看,图生图其实核心也是依赖了文生图的能力,其中strength这个参数需要灵活调节来得到满意的图像。在最后,我们也给出图生图pipeline的内部主要代码,如下所示:
import PIL
import numpy as np
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from tqdm.auto import tqdm
model_id = "runwayml/stable-diffusion-v1-5"
# 1. 加载autoencoder
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
# 2. 加载tokenizer和text encoder
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
# 3. 加载扩散模型UNet
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
# 4. 定义noise scheduler
noise_scheduler = DDIMScheduler(
num_train_timesteps=1000,
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False, # don't clip sample, the x0 in stable diffusion not in range [-1, 1]
set_alpha_to_one=False,
)
# 将模型复制到GPU上
device = "cuda"
vae.to(device, dtype=torch.float16)
text_encoder.to(device, dtype=torch.float16)
unet = unet.to(device, dtype=torch.float16)
# 预处理init_image
def preprocess(image):
w, h = image.size
w, h = map(lambda x: x - x % 32, (w, h)) # resize to integer multiple of 32
image = image.resize((w, h), resample=PIL.Image.LANCZOS)
image = np.array(image).astype(np.float32) / 255.0
image = image[None].transpose(0, 3, 1, 2)
image = torch.from_numpy(image)
return 2.0 * image - 1.0
# 参数设置
prompt = ["A fantasy landscape, trending on artstation"]
num_inference_steps = 50
guidance_scale = 7.5
strength = 0.8
batch_size = 1
negative_prompt = ""
generator = torch.Generator(device).manual_seed(2023)
init_image = PIL.Image.open("init_image.png").convert("RGB")
with torch.no_grad():
# 获取prompt的text_embeddings
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
text_embeddings = text_encoder(text_input.input_ids.to(device))[0]
# 获取unconditional text embeddings
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
[negative_prompt] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]
# 拼接batch
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
# 设置采样步数
noise_scheduler.set_timesteps(num_inference_steps, device=device)
# 根据strength计算timesteps
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)
timesteps = noise_scheduler.timesteps[t_start:]
# 预处理init_image
init_input = preprocess(init_image)
init_latents = vae.encode(init_input.to(device, dtype=torch.float16)).latent_dist.sample(generator)
init_latents = 0.18215 * init_latents
# 给init_latents加噪音
noise = torch.randn(init_latents.shape, generator=generator, device=device, dtype=init_latents.dtype)
init_latents = noise_scheduler.add_noise(init_latents, noise, timesteps[:1])
latents = init_latents # 作为初始latents
# Do denoise steps
for t in tqdm(timesteps):
# 这里latens扩展2份,是为了同时计算unconditional prediction
latent_model_input = torch.cat([latents] * 2)
latent_model_input = noise_scheduler.scale_model_input(latent_model_input, t) # for DDIM, do nothing
# 预测噪音
noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
# CFG
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
# 计算上一步的noisy latents:x_t -> x_t-1
latents = noise_scheduler.step(noise_pred, t, latents).prev_sample
# 注意要对latents进行scale
latents = 1 / 0.18215 * latents
# 解码
image = vae.decode(latents).sample
图像inpainting
最后我们要介绍的一项功能是图像inpainting,它和图生图一样也是文生图功能的一个扩展。SD的图像inpainting不是用在图像修复上,而是主要用在图像编辑上:给定一个输入图像和想要编辑的区域mask,我们想通过文生图来编辑mask区域的内容。SD的图像inpainting原理可以参考论文Blended Latent Diffusion,其主要原理图如下所示:StableDiffusionInpaintPipelineLegacy
可以实现文本引导下的图像inpainting,具体代码如下所示:
import torch
from diffusers import StableDiffusionInpaintPipelineLegacy
from PIL import Image
# 加载inpainting pipeline
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionInpaintPipelineLegacy.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
# 读取输入图像和输入mask
input_image = Image.open("overture-creations-5sI6fQgYIuo.png").resize((512, 512))
input_mask = Image.open("overture-creations-5sI6fQgYIuo_mask.png").resize((512, 512))
# 执行推理
prompt = ["a mecha robot sitting on a bench", "a cat sitting on a bench"]
generator = torch.Generator("cuda").manual_seed(0)
with torch.autocast("cuda"):
images = pipe(
prompt=prompt,
image=input_image,
mask_image=input_mask,
num_inference_steps=50,
strength=0.75,
guidance_scale=7.5,
num_images_per_prompt=1,
generator=generator,
).images
下面是一个具体的生成效果,这里我们将输入图像的dog换成了mecha robot或者cat,从而实现了图像编辑。StableDiffusionInpaintPipelineLegacy
这个pipeline内部的核心代码:
import PIL
import numpy as np
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from tqdm.auto import tqdm
def preprocess_mask(mask):
mask = mask.convert("L")
w, h = mask.size
w, h = map(lambda x: x - x % 32, (w, h)) # resize to integer multiple of 32
mask = mask.resize((w // 8, h // 8), resample=PIL.Image.NEAREST)
mask = np.array(mask).astype(np.float32) / 255.0
mask = np.tile(mask, (4, 1, 1))
mask = mask[None].transpose(0, 1, 2, 3) # what does this step do?
mask = 1 - mask # repaint white, keep black
mask = torch.from_numpy(mask)
return mask
def preprocess(image):
w, h = image.size
w, h = map(lambda x: x - x % 32, (w, h)) # resize to integer multiple of 32
image = image.resize((w, h), resample=PIL.Image.LANCZOS)
image = np.array(image).astype(np.float32) / 255.0
image = image[None].transpose(0, 3, 1, 2)
image = torch.from_numpy(image)
return 2.0 * image - 1.0
model_id = "runwayml/stable-diffusion-v1-5"
# 1. 加载autoencoder
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
# 2. 加载tokenizer和text encoder
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
# 3. 加载扩散模型UNet
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
# 4. 定义noise scheduler
noise_scheduler = DDIMScheduler(
num_train_timesteps=1000,
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False, # don't clip sample, the x0 in stable diffusion not in range [-1, 1]
set_alpha_to_one=False,
)
# 将模型复制到GPU上
device = "cuda"
vae.to(device, dtype=torch.float16)
text_encoder.to(device, dtype=torch.float16)
unet = unet.to(device, dtype=torch.float16)
prompt = "a mecha robot sitting on a bench"
strength = 0.75
guidance_scale = 7.5
batch_size = 1
num_inference_steps = 50
negative_prompt = ""
generator = torch.Generator(device).manual_seed(0)
with torch.no_grad():
# 获取prompt的text_embeddings
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
text_embeddings = text_encoder(text_input.input_ids.to(device))[0]
# 获取unconditional text embeddings
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
[negative_prompt] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]
# 拼接batch
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
# 设置采样步数
noise_scheduler.set_timesteps(num_inference_steps, device=device)
# 根据strength计算timesteps
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)
timesteps = noise_scheduler.timesteps[t_start:]
# 预处理init_image
init_input = preprocess(input_image)
init_latents = vae.encode(init_input.to(device, dtype=torch.float16)).latent_dist.sample(generator)
init_latents = 0.18215 * init_latents
init_latents = torch.cat([init_latents] * batch_size, dim=0)
init_latents_orig = init_latents
# 处理mask
mask_image = preprocess_mask(input_mask)
mask_image = mask_image.to(device=device, dtype=init_latents.dtype)
mask = torch.cat([mask_image] * batch_size)
# 给init_latents加噪音
noise = torch.randn(init_latents.shape, generator=generator, device=device, dtype=init_latents.dtype)
init_latents = noise_scheduler.add_noise(init_latents, noise, timesteps[:1])
latents = init_latents # 作为初始latents
# Do denoise steps
for t in tqdm(timesteps):
# 这里latens扩展2份,是为了同时计算unconditional prediction
latent_model_input = torch.cat([latents] * 2)
latent_model_input = noise_scheduler.scale_model_input(latent_model_input, t) # for DDIM, do nothing
# 预测噪音
noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
# CFG
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
# 计算上一步的noisy latents:x_t -> x_t-1
latents = noise_scheduler.step(noise_pred, t, latents).prev_sample
# 将unmask区域替换原始图像的nosiy latents
init_latents_proper = noise_scheduler.add_noise(init_latents_orig, noise, torch.tensor([t]))
latents = (init_latents_proper * mask) + (latents * (1 - mask))
# 注意要对latents进行scale
latents = 1 / 0.18215 * latents
image = vae.decode(latents).sample
另外,runwayml在发布SD 1.5版本的同时还发布了一个inpainting模型:runwayml/stable-diffusion-inpainting,与前面所讲不同的是,这是一个在SD 1.2上finetune的模型。原来SD的UNet的输入是64x64x4,为了实现inpainting,现在给UNet的第一个卷机层增加5个channels,分别为masked图像的latents(经过autoencoder编码,64x64x4)和mask图像(直接下采样8x,64x64x1),增加的权重填零初始化。在diffusers中,可以使用StableDiffusionInpaintPipeline
来调用这个模型,具体代码如下:
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image
from tqdm.auto import tqdm
import PIL
# Load pipeline
model_id = "runwayml/stable-diffusion-inpainting/"
pipe = StableDiffusionInpaintPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
prompt = ["a mecha robot sitting on a bench", "a dog sitting on a bench", "a bench"]
generator = torch.Generator("cuda").manual_seed(2023)
input_image = Image.open("overture-creations-5sI6fQgYIuo.png").resize((512, 512))
input_mask = Image.open("overture-creations-5sI6fQgYIuo_mask.png").resize((512, 512))
images = pipe(
prompt=prompt,
image=input_image,
mask_image=input_mask,
num_inference_steps=50,
generator=generator,
).images
其生成的效果图如下所示:
推荐阅读
辅助模块加速收敛,精度大幅提升!移动端实时的NanoDet-Plus来了!
机器学习算法工程师
一个用心的公众号