Datawhale X ModelScope AI Summer Camp, Session 4 (AIGC Study Notes)

zhouquan.liu · Published 2024-08-11 15:49:54

00. Diffusion Models
01. Task 2 Workflow
- Step 1. Generate text-to-image prompts with Tongyi Qianwen
- Step 2. Replace the prompts and run the baseline
02. Task 3 Workflow
- Step 1. Install ComfyUI
- Step 2. Generate images with ComfyUI

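00. Diffusion Models

This chapter summarizes what I learned about diffusion models.

Denoising diffusion probabilistic models (DDPM)

My notes on DDPM mainly follow the explainer video on diffusion models by the Bilibili uploader 梗直哥丶¹.

A diffusion model (diffusion probabilistic model) is essentially a parameterized Markov chain trained with variational inference.

The goal of a diffusion model is to learn a transition distribution $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_{t})$ that reverses the diffusion process. The diffusion process gradually adds noise to an image until the image is completely drowned out, and a network is trained to predict that noise. If the noise is predicted accurately, removing the predicted noise from a noisy image recovers the original image. The DDPM paper² makes a strong assumption: both the noise and the conditional transition distributions are Gaussian, so only a mean and a variance need to be learned.

A diffusion model is a latent variable model in which $\mathbf{x}_1,...,\mathbf{x}_T$ are latent variables with the same dimensionality as $\mathbf{x}_0$.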

[Figure: the diffusion Markov chain between $\mathbf{x}_0$ and $\mathbf{x}_T$]

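Forward diffusion process

In the figure above, the process running from right to left is the diffusion process. The initial data $\mathbf{x}_0$ follows the distribution $q(\mathbf{x}_0)$, that is, the distribution of the training set. Gaussian noise is then added to it step by step. The mean and variance of this noise are fixed in advance rather than learned; only the variance coefficient $\beta_t$ controls the noise strength at step $t$. With $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, the following derivation gives $\mathbf{x}_t$ directly from $\mathbf{x}_0$:
$\begin{aligned} \mathbf{x}_{t}& =\sqrt{\alpha_t}\mathbf{x}_{t-1}+\sqrt{1-\alpha_t}\boldsymbol{\epsilon}_{t-1}\qquad \text{where } \alpha_t=1-\beta_t,\ \bar{\alpha}_t=\prod_{i=1}^{t}\alpha_i \\ &=\sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{1-\alpha_t\alpha_{t-1}}\bar{\boldsymbol{\epsilon}}_{t-2} \\ &=\ldots \\ &=\sqrt{\bar{\alpha}_t}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon} \\ q(\mathbf{x}_t|\mathbf{x}_0)& =\mathcal{N}(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I}) \end{aligned}$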
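
The closed form above means $\mathbf{x}_t$ can be sampled in one step without iterating through the chain. Below is a minimal NumPy sketch of that idea. The linear schedule from 1e-4 to 0.02 over 1000 steps is assumed from the DDPM paper's defaults, and the names `make_schedule` and `q_sample` are illustrative, not taken from any library.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (values assumed from the DDPM paper's defaults)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bar_t = product of alpha_i for i <= t
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, noise):
    """Draw x_t from q(x_t | x_0) in one shot; t is a 0-based schedule index."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.standard_normal((3, 8, 8))           # stand-in for an image
eps = rng.standard_normal(x0.shape)           # epsilon ~ N(0, I)
x_early = q_sample(x0, 10, alpha_bars, eps)   # still dominated by x0
x_late = q_sample(x0, 999, alpha_bars, eps)   # almost pure noise
```

Because $\bar{\alpha}_t$ shrinks toward zero as $t$ grows, the last step is nearly indistinguishable from the noise itself, which is what "completely drowned out" means in practice.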

Reverse denoising process

The reverse denoising process runs from left to right in the figure above. Here a neural network learns the transition distribution $p_\theta$; its inputs are $\mathbf{x}_t$ and $t$. In DDPM, the reverse transition $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is trained to approximate the posterior of the forward process, $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_{t-1};\tilde{\boldsymbol{\mu}}(\mathbf{x}_t,\mathbf{x}_0),\tilde{\beta}_t\mathbf{I})$, which is tractable once it is conditioned on $\mathbf{x}_0$. The reverse process is defined as:
$p_\theta(\mathbf{x}_{0:T})=p(\mathbf{x}_T)\prod_{t=1}^Tp_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)\qquad p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)=\mathcal{N}(\mathbf{x}_{t-1};\boldsymbol{\mu}_\theta(\mathbf{x}_t,t),\boldsymbol{\Sigma}_\theta(\mathbf{x}_t,t))$
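
For reference, the posterior parameters $\tilde{\boldsymbol{\mu}}$ and $\tilde{\beta}_t$ used above have closed forms. The block below restates them as given in the DDPM paper², using the same $\alpha_t$, $\bar{\alpha}_t$, and $\beta_t$ as in the forward process. The last line is the noise-prediction parameterization, where $\boldsymbol{\epsilon}_\theta$ denotes the network output; it follows from substituting $\mathbf{x}_0=(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$ into the first line.

```latex
\begin{aligned}
\tilde{\boldsymbol{\mu}}(\mathbf{x}_t,\mathbf{x}_0)
  &= \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0
   + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t \\
\tilde{\beta}_t
  &= \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t \\
\boldsymbol{\mu}_\theta(\mathbf{x}_t,t)
  &= \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t
   - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right)
\end{aligned}
```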

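Training and inference

The training and inference procedures are illustrated with slides taken from the MIT 6.5940 course³.

Training:

[Figure: MIT 6.5940 slide showing the training procedure]

Inference:

[Figure: MIT 6.5940 slide showing the inference procedure]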
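
Since the slides are not reproduced here, the following NumPy sketch shows what the two procedures look like in the DDPM paper² (its Algorithms 1 and 2), under some loud assumptions: `eps_model` is a hypothetical placeholder that always predicts zero noise instead of a trained U-Net, the 50-step schedule is chosen only so the demo runs quickly, and the sampling variance is set to $\sigma_t^2=\beta_t$.

```python
import numpy as np

T = 50                                   # short chain so the demo runs quickly
betas = np.linspace(1e-4, 0.2, T)        # illustrative schedule, not tuned
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t):
    """Hypothetical stand-in for the trained noise-prediction network."""
    return np.zeros_like(x_t)

def training_loss(x0, rng, model=eps_model):
    """One Monte Carlo estimate of the simplified noise-prediction objective."""
    t = rng.integers(0, T)                           # uniformly random timestep
    eps = rng.standard_normal(x0.shape)              # target noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - model(x_t, t)) ** 2)       # MSE between true and predicted noise

def sample(shape, rng, model=eps_model):
    """Ancestral sampling: start from noise and apply T reverse steps."""
    x = rng.standard_normal(shape)                   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        # No fresh noise is added on the final step
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * model(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z             # sigma_t^2 = beta_t
    return x

rng = np.random.default_rng(0)
loss = training_loss(rng.standard_normal((3, 8, 8)), rng)
img = sample((3, 8, 8), rng)
```

In a real setup the loss would be minimized by gradient descent over the network parameters, and `sample` would call the trained network at every step.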

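Glossary

Markov chain: a Markov chain is a sequence of random variables $X_1, X_2, ...$ that satisfies the Markov property. The probability distribution of the next state is determined only by the current state; that is, given the current state, the future states and the past states are independent of each other.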
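
Written as a formula, the Markov property stated above reads as follows (standard notation, with $x_1,\ldots,x_{n+1}$ denoting particular values of the states):

```latex
P(X_{n+1}=x_{n+1}\mid X_1=x_1,\ldots,X_n=x_n)
  = P(X_{n+1}=x_{n+1}\mid X_n=x_n)
```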

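Conditional diffusion models

Conditional diffusion models introduce additional information into the model to guide image generation. Introducing the extra information $y$ has no effect on the forward diffusion process; it only affects the reverse denoising (sampling) process. The derivation can be found in⁴.

[Figure: conditional diffusion model]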
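
As a compact reminder of what "only the reverse process changes" means, the block below writes the statement in symbols and adds the Bayes-rule identity that guidance methods typically build on. This is a sketch in the notation of the earlier sections, not a reproduction of the derivation in⁴, which may be organized differently.

```latex
\begin{aligned}
q(\mathbf{x}_t\mid\mathbf{x}_{t-1},y) &= q(\mathbf{x}_t\mid\mathbf{x}_{t-1})
  && \text{(forward process ignores } y\text{)} \\
p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) &\;\longrightarrow\; p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t,y)
  && \text{(reverse process is conditioned on } y\text{)} \\
\nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t\mid y)
  &= \nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t}\log p(y\mid\mathbf{x}_t)
  && \text{(Bayes' rule; } \log p(y) \text{ has zero gradient)}
\end{aligned}
```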

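Latent diffusion models

In a denoising diffusion model, the elements of $\mathbf{x}_t$ correspond one-to-one to the pixels of the image, so generating a high-resolution image requires a very large $\mathbf{x}_t$. To address this, High-Resolution Image Synthesis with Latent Diffusion Models⁵ introduces an autoencoder (such as a variational autoencoder, VAE) that first compresses the original input into an encoding; the diffusion model is then applied to the encoded vector.

[Figure: latent diffusion model architecture]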

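Stable Diffusion

Stable Diffusion is a latent text-to-image diffusion model.
Stable Diffusion v1 refers to a specific configuration that uses an autoencoder with a downsampling factor of 8, together with an 860M UNet and a CLIP ViT-L/14 text encoder for the diffusion model.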
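
To make the effect of the downsampling factor concrete, here is a small shape calculation. The factor of 8 comes from the text above; the 4 latent channels and the 512×512 RGB input are assumptions based on the usual Stable Diffusion v1 setup, and both helper functions are illustrative names rather than library calls.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Shape (C, H, W) of the latent for an image of the given size."""
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (latent_channels, height // factor, width // factor)

def compression_ratio(height, width, image_channels=3, **kwargs):
    """How many times fewer elements the latent has than the pixel image."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (image_channels * height * width) / (c * h * w)

shape = latent_shape(512, 512)        # (4, 64, 64)
ratio = compression_ratio(512, 512)   # 48.0
```

Under these assumptions the UNet denoises a tensor with 48 times fewer elements than the pixel image, which is the main reason latent diffusion is cheaper at high resolution.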

The Hugging Face diffusers library includes Stable Diffusion:

from diffusers import StableDiffusionPipeline
import torch

# Load the pretrained pipeline and move it to the GPU (a local path also works)
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
prompt = "portrait photo of a old warrior chief"
# Fix the random seed so the result is reproducible
generator = torch.Generator("cuda").manual_seed(0)
image = pipe(prompt, generator=generator).images[0]
image  # in a notebook, this displays the generated PIL image

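Using reference⁶ and the Hugging Face source code for diffusers.StableDiffusionPipeline, we can get familiar with how Stable Diffusion (SD) inference is implemented:

The initialization code shows the key components of SD:
vae: the variational autoencoder, which compresses data into the latent space and decodes latents back.
text_encoder: the text encoder, by default the text encoder of the CLIP model, which encodes the text that controls image generation.
unet: the noise-prediction model, which runs in the latent space.
scheduler: performs the computation of each denoising step.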

def __init__(
    self,
    vae: AutoencoderKL,
    text_encoder: CLIPTextModel,
    tokenizer: CLIPTokenizer,
    unet: UNet2DConditionModel,
    scheduler: KarrasDiffusionSchedulers,
    safety_checker: StableDiffusionSafetyChecker,
    feature_extractor: CLIPImageProcessor,
    image_encoder: CLIPVisionModelWithProjection = None,
    requires_safety_checker: bool = True,
):
    ...

The core logic is implemented in the __call__ method:

def __call__(
  ...
):
  ...
  # Steps 0-2 are omitted: setting the latent-space size, checking inputs, and defining batch_size, device, and other parameters.

  # 3. Encode the input prompt; this also handles the negative prompt
  prompt_embeds, negative_prompt_embeds = self.encode_prompt(
      prompt,
      device,
      num_images_per_prompt,
      self.do_classifier_free_guidance,
      negative_prompt,
      prompt_embeds=prompt_embeds,
      negative_prompt_embeds=negative_prompt_embeds,
      lora_scale=lora_scale,
      clip_skip=self.clip_skip,
  )

  # 4. Prepare timesteps
  timesteps, num_inference_steps = retrieve_timesteps(
      self.scheduler, num_inference_steps, device, timesteps, sigmas
  )

  # 5. Prepare latent variables
  num_channels_latents = self.unet.config.in_channels
  latents = self.prepare_latents(
      batch_size * num_images_per_prompt,
      num_channels_latents,
      height,
      width,
      prompt_embeds.dtype,
      device,
      generator,
      latents,
  )

  # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
  extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)

  # 6.1 Add image embeds for IP-Adapter
  added_cond_kwargs = (
      {"image_embeds": image_embeds}
      if (ip_adapter_image is not None or ip_adapter_image_embeds is not None)
      else None
  )

  # 6.2 Optionally get Guidance Scale Embedding
  timestep_cond = None
  if self.unet.config.time_cond_proj_dim is not None:
      guidance_scale_tensor = torch.tensor(self.guidance_scale - 1).repeat(batch_size * num_images_per_prompt)
      timestep_cond = self.get_guidance_scale_embedding(
          guidance_scale_tensor, embedding_dim=self.unet.config.time_cond_proj_dim
      ).to(device=device, dtype=latents.dtype)

  # 7. Denoising loop
  num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
  self._num_timesteps = len(timesteps)
  with self.progress_bar(total=num_inference_steps) as progress_bar:
      for i, t in enumerate(timesteps):
          if self.interrupt:
              continue

          # expand the latents if we are doing classifier free guidance
          latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
          latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

          # predict the noise residual (in the latent space)
          noise_pred = self.unet(
              latent_model_input,
              t,
              encoder_hidden_states=prompt_embeds,
              timestep_cond=timestep_cond,
              cross_attention_kwargs=self.cross_attention_kwargs,
              added_cond_kwargs=added_cond_kwargs,
              return_dict=False,
          )[0]

          # perform guidance
          if self.do_classifier_free_guidance:
              noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
              noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)

          if self.do_classifier_free_guidance and self.guidance_rescale > 0.0:
              # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
              noise_pred = rescale_noise_cfg(noise_pred, noise_pred_text, guidance_rescale=self.guidance_rescale)

          # compute the previous noisy sample x_t -> x_t-1
          latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]

          if callback_on_step_end is not None:
              callback_kwargs = {}
              for k in callback_on_step_end_tensor_inputs:
                  callback_kwargs[k] = locals()[k]
              callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

              latents = callback_outputs.pop("latents", latents)
              prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
              negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

          # call the callback, if provided
          if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
              progress_bar.update()
              if callback is not None and i % callback_steps == 0:
                  step_idx = i // getattr(self.scheduler, "order", 1)
                  callback(step_idx, t, latents)

  # Decode the latents back into a full-size image
  if not output_type == "latent":
      image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
          0
      ]
      image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
  else:
      image = latents
      has_nsfw_concept = None

  if has_nsfw_concept is None:
      do_denormalize = [True] * image.shape[0]
  else:
      do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]

  image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)

  # Offload all models
  self.maybe_free_model_hooks()

  if not return_dict:
      return (image, has_nsfw_concept)

  return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
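
The "perform guidance" step in the loop above is classifier-free guidance: the UNet is run on an unconditional and a text-conditioned input, and the two noise predictions are combined by extrapolation. The sketch below isolates that one line with NumPy arrays standing in for the tensors; the function name `classifier_free_guidance` and the toy values are illustrative only.

```python
import numpy as np

def classifier_free_guidance(noise_uncond, noise_text, guidance_scale):
    """Move from the unconditional prediction toward (and past) the text-conditioned one."""
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)

uncond = np.array([0.0, 1.0])   # toy unconditional noise prediction
text = np.array([1.0, 3.0])     # toy text-conditioned noise prediction
guided = classifier_free_guidance(uncond, text, 7.5)   # array([ 7.5, 16. ])
```

A scale of 1 reproduces the text-conditioned prediction, a scale of 0 ignores the prompt, and larger values push the result further in the direction the prompt suggests.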

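01. Task 2 Workflow

Step 1. Generate text-to-image prompts with Tongyi Qianwen

The story text is taken from the summary of The Little Prince at link⁷.

Step 2. Replace the prompts and run the baseline

Image 1: Color field painting, a pilot in a flight suit, upper-body close-up, worried expression, beside him the wreckage of a badly damaged plane, Sahara Desert background, the golden-haired Little Prince standing nearby, looking intently at the drawing paper in the pilot's hand and asking the pilot to draw a sheep.
Image 2: Color field painting, the Little Prince standing on asteroid B612, full-body shot, behind him a single rose blooming alone, surrounded by a vast starry sky, the Little Prince lowering his head in thought, as if considering whether to leave.
Image 3: Color field painting, the pilot's dream after falling asleep, upper-body close-up of the pilot, eyes closed, smiling, starry sky and desert in the background, as if standing beside a road leading to the unknown.
Image 4: Color field painting, the Little Prince travelling among the planets, full-body shot, the horizons of different planets in the background, with the figures of a king, a vain man, a drunkard, a businessman, a lamplighter, and a geographer, the Little Prince looking somewhat confused and disappointed.
Image 5: Color field painting, the Little Prince standing in a desert on Earth, upper-body close-up, surprised expression, a mysterious snake beside him, a garden of blooming roses ahead, the Little Prince looking very surprised and dejected.
Image 6: Color field painting, the Little Prince making friends with a fox, full-body shot, the Little Prince crouching down to talk with the fox, a forest in the background, the Little Prince coming to understand the truth that "what is essential is invisible to the eye".
Image 7: Color field painting, the Little Prince and the pilot finding a well, full-body shot, the two standing side by side at the well, desert and starry sky in the background, the Little Prince getting ready to leave while the pilot looks sad.
Image 8: Color field painting, the pilot sitting in the desert, upper-body close-up, holding a sheet of paper with a sheep drawn on it, smiling up at the stars, desert under a starry sky in the background, the pilot missing the Little Prince and hoping he will come back.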

[Figure: output of the baseline run with the prompts above]

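02. Task 3 Workflow

Task 3 is mainly about getting familiar with the ComfyUI tool.
ComfyUI is a powerful, highly modular graphical user interface and backend for Stable Diffusion that provides a visual text-to-image workflow.⁸

Step 1. Install ComfyUI

Run the installation script and open the link:

Step 2. Generate images with ComfyUI

Load the workflow configuration file and run it: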

¹ https://www.bilibili.com/video/BV1BP411S7Mg/?spm_id_from=333.999.0.0&vd_source=b7b0278844e86ae043f7069b8064a457
² https://arxiv.org/abs/2006.11239
³ https://www.dropbox.com/scl/fi/q8y9ap7mlucmiimyh3zl5/lec16.pdf?rlkey=6wx4z3pnhic8pq0oju8ro3qzr&e=1&dl=0
⁴ https://www.zhangzhenhu.com/aigc/Guidance.html
⁵ https://arxiv.org/abs/2112.10752
⁶ https://www.zhangzhenhu.com/aigc/%E7%A8%B3%E5%AE%9A%E6%89%A9%E6%95%A3%E6%A8%A1%E5%9E%8B.html
⁷ https://www.bilibili.com/read/cv9560704/
⁸ https://comfyuidoc.com/zh/