Fine-tuning llava-llama3 with XTuner
A brief introduction to the llava-llama3-8b model and how to fine-tune it.
I recently tried training the multimodal version of llama3, mainly following the official tutorial: Tutorial/xtuner/llava/xtuner_llava.md at camp2 · InternLM/Tutorial · GitHub
The official tutorial is already very detailed, so here I will only briefly go over the workflow; the focus is on the structure of this multimodal model and the problems I ran into while following the tutorial.
1. Model basics
First, let's look at the structure of this multimodal llama3. The architecture is simple and follows the LLaVA design: a projection layer (the projector) is built between the vision model and the LLM, mapping visual features into the LLM's input space so they can be aligned and fed in. There are many kinds of multimodal models today, such as Q-Former-based and LLaVA-style ones, but more and more models adopt the LLaVA structure. My personal guess is that approaches like Q-Former compress the visual features: on the one hand the compression loses information, and on the other hand the learned queries it produces are not very general.
The figure above is the LLaVA architecture diagram. As it shows, the model consists of three parts: the Vision Encoder, the Projection, and the LLM. The Vision Encoder turns the image into a feature map of shape [N=1, grid_H x grid_W, hidden_dim]; a projection layer (the projection W) then aligns the image features with the text features dimensionally, producing [N=1, grid_H x grid_W = image_seqlen, emb_dim]. The image token embeddings and the text token embeddings are then concatenated and fed into the language model as input, which generates the descriptive text.
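To make the data flow concrete, here is a minimal sketch with dummy tensors. The dimensions are assumptions based on CLIP-ViT-L/14-336 (1024-dim features, a 24x24 patch grid) and LLaMA-3-8B (4096-dim hidden states); it is not the actual model code.
import torch
import torch.nn as nn

# Assumed dimensions: CLIP-ViT-L/14-336 features (1024), LLaMA-3-8B hidden size (4096),
# a 24x24 patch grid (576 image tokens) and a 32-token text prompt.
vision_hidden, llm_hidden = 1024, 4096
image_seqlen, text_seqlen = 24 * 24, 32

image_features = torch.randn(1, image_seqlen, vision_hidden)   # vision encoder output
projector = nn.Linear(vision_hidden, llm_hidden)               # stand-in; the real projector is a small MLP (see below)
image_embeds = projector(image_features)                       # [1, 576, 4096]

text_embeds = torch.randn(1, text_seqlen, llm_hidden)          # text token embeddings from the LLM
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # [1, 608, 4096], fed to the LLM
print(inputs_embeds.shape)                                     # torch.Size([1, 608, 4096])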
With this structure in mind, let's walk through the llava-llama3-8b code, using chat.py as the example:
1. Vision Encoder:
The script loads the visual encoder with the following code; the clip-vit-large-patch14-336 checkpoint is used here.
# build visual_encoder
if 'visual_encoder' in os.listdir(llava_path):
    assert args.visual_encoder is None, (
        "Please don't specify the `--visual-encoder` since passed "
        '`--llava` contains a visual encoder!')
    visual_encoder_path = osp.join(llava_path, 'visual_encoder')
else:
    assert args.visual_encoder is not None, (
        'Please specify the `--visual-encoder`!')
    visual_encoder_path = args.visual_encoder
visual_encoder = CLIPVisionModel.from_pretrained(
    visual_encoder_path,
    torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype])
image_processor = CLIPImageProcessor.from_pretrained(
    visual_encoder_path)
print(f'Load visual_encoder from {visual_encoder_path}')
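For reference, here is a hypothetical snippet showing what the loaded encoder produces for a single image; it reuses visual_encoder and image_processor from the code above, and 'demo.jpg' is a placeholder path:
import torch
from PIL import Image

image = Image.open('demo.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt')['pixel_values']
pixel_values = pixel_values.to(visual_encoder.device, visual_encoder.dtype)  # match the model's device/dtype
with torch.no_grad():
    visual_outputs = visual_encoder(pixel_values=pixel_values, output_hidden_states=True)
# One CLS token plus 24x24 patch tokens, each 1024-dimensional.
print(visual_outputs.last_hidden_state.shape)  # torch.Size([1, 577, 1024])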
2. Loading the LLM:
The LLM weights are loaded directly with AutoModelForCausalLM.
# build llm
llm = AutoModelForCausalLM.from_pretrained(args.model_name_or_path,
                                           **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(
    args.model_name_or_path,
    trust_remote_code=True,
    encode_special_tokens=True)
print(f'Load LLM from {args.model_name_or_path}')
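As a quick, hypothetical sanity check of the loaded llm and tokenizer, a text-only generation call is enough (the chat template is skipped here for brevity):
import torch

prompt = 'Briefly explain what a projector does in a LLaVA-style model.'
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(llm.device)
with torch.no_grad():
    output_ids = llm.generate(input_ids, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))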
3. Loading the projector:
# build projector
projector_path = osp.join(llava_path, 'projector')
projector = AutoModel.from_pretrained(
    projector_path,
    torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype],
    trust_remote_code=True)
print(f'Load projector from {args.llava}')
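Continuing the encoder snippet above, the projector simply maps those patch features into the LLM embedding space. This mirrors what chat.py does before building the LLM inputs, although the hidden-layer index used below is an assumption:
import torch

with torch.no_grad():
    # Take a late hidden layer (index assumed) and drop the CLS token before projecting.
    patch_features = visual_outputs.hidden_states[-2][:, 1:]  # [1, 576, 1024]
    pixel_embeds = projector(patch_features)                  # [1, 576, 4096]
print(pixel_embeds.shape)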
Let's also look at the projector code itself (you can find it in the model folder you downloaded); it is really just two linear layers.
def __init__(self, config: ProjectorConfig) -> None:
    super().__init__(config)
    self.gradient_checkpointing = False
    modules = [
        nn.Linear(
            config.visual_hidden_size,
            config.llm_hidden_size,
            bias=config.bias)
    ]
    for _ in range(1, config.depth):
        modules.append(ACT2FN[config.hidden_act])
        modules.append(
            nn.Linear(
                config.llm_hidden_size,
                config.llm_hidden_size,
                bias=config.bias))
    self.model = nn.Sequential(*modules)
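For llava-llama3-8b this config works out to roughly the following; the values are assumptions (visual_hidden_size=1024, llm_hidden_size=4096, depth=2, GELU activation, bias=True), so treat it as a rough equivalent rather than the actual ProjectorModel:
import torch
import torch.nn as nn

projector_sketch = nn.Sequential(
    nn.Linear(1024, 4096, bias=True),  # visual_hidden_size -> llm_hidden_size
    nn.GELU(),                         # ACT2FN[config.hidden_act]
    nn.Linear(4096, 4096, bias=True),  # llm_hidden_size -> llm_hidden_size
)
print(projector_sketch(torch.randn(1, 576, 1024)).shape)  # torch.Size([1, 576, 4096])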
With that, we understand the whole llava-llama3 architecture: it is essentially CLIP + LLM (LLaMA), glued together by the projector.
2. Training procedure
The figure above shows the complete training pipeline, which consists of pretraining and fine-tuning; here we focus on how to fine-tune the model.
1. Install xtuner:
git clone -b v0.1.17 https://github.com/InternLM/xtuner
cd /root/xtuner0117/xtuner
pip install -e '.[all]' && cd ~
2. Download the model files:
For downloading, the Hugging Face mirror hf-mirror.com is recommended (a scripted download option is sketched after the list):
a. llava-llama-3-8b mirror: https://hf-mirror.com/xtuner/llava-llama-3-8b
b. CLIP mirror: openai/clip-vit-large-patch14-336 (also available on hf-mirror.com)
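If you prefer to script the download, one option is huggingface_hub with the mirror set via the HF_ENDPOINT environment variable; this is just a sketch (the local paths are up to you), and huggingface-cli or git clone work just as well:
import os

# Point huggingface_hub at the mirror *before* importing it.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

from huggingface_hub import snapshot_download

snapshot_download('xtuner/llava-llama-3-8b', local_dir='./llava-llama-3-8b')
snapshot_download('openai/clip-vit-large-patch14-336',
                  local_dir='./clip-vit-large-patch14-336')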
Once the download finishes, you can give the model a try. One thing to note: you must pass an image argument at launch, and all subsequent questions are answered with respect to that image unless you restart with a different one:
xtuner chat ./llava-llama-3-8b \
--visual-encoder ./clip-vit-large-patch14-336 \
--llava ./llava-llama-3-8b \
--prompt-template llama3_chat \
--image $IMAGE_PATH
I tweaked the script a bit so that a different image can be supplied for each question. The answers were quite good.
3. Prepare the dataset:
The dataset format is as follows:
[
    {
        "id": "any string",
        "image": "relative path of the image file. Relative to what? Relative to the image_folder path you specify later in the config file.",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nQuestion 1."
            },
            {
                "from": "gpt",
                "value": "Answer 1."
            },
            {
                "from": "human",
                "value": "Question 2."
            },
            {
                "from": "gpt",
                "value": "Answer 2."
            },
            # ......
            {
                "from": "human",
                "value": "Question n."
            },
            {
                "from": "gpt",
                "value": "Answer n."
            }
        ]
    },
    # The second training sample starts below.
    {
        "id": "any string",
        "image": "relative path of the image file, again relative to the image_folder set in the config file.",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nQuestion 1."
            },
            # ......
            {
                "from": "gpt",
                "value": "Answer n."
            }
        ]
    }
]
For example:
[
    {
        "id": "000000033471",
        "image": "coco/train2017/000000033471.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhat are the colors of the bus in the image?"
            },
            {
                "from": "gpt",
                "value": "The bus in the image is white and red."
            },
            {
                "from": "human",
                "value": "What feature can be seen on the back of the bus?"
            },
            {
                "from": "gpt",
                "value": "The back of the bus features an advertisement."
            },
            {
                "from": "human",
                "value": "Is the bus driving down the street or pulled off to the side?"
            },
            {
                "from": "gpt",
                "value": "The bus is driving down the street, which is crowded with people and other vehicles."
            }
        ]
    },
    {
        "id": "000000052846",
        "image": "coco/train2017/000000052846.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhere is the cat positioned in the image?"
            },
            {
                "from": "gpt",
                "value": "The cat is positioned on top of the back of the couch in the living room."
            }
        ]
    }
]
The image field should be written relative to the root directory of your dataset, i.e. the image_folder you set in the config file.
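As a sanity check, you can assemble and dump a sample in this format with a few lines of Python; the file name repeated_data.json matches the data_path set in the config below, while the image path is just a placeholder:
import json

sample = {
    'id': '000000033471',
    'image': 'coco/train2017/000000033471.jpg',  # relative to image_folder in the config
    'conversations': [
        {'from': 'human', 'value': '<image>\nWhat are the colors of the bus in the image?'},
        {'from': 'gpt', 'value': 'The bus in the image is white and red.'},
    ],
}

with open('repeated_data.json', 'w', encoding='utf-8') as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)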
4. Modify the training config:
Edit xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/finetune/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune.py:
# Model
- llm_name_or_path = 'internlm/internlm2-chat-1_8b'
+ llm_name_or_path = '/root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b'
- visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
+ visual_encoder_name_or_path = '/root/share/new_models/openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
- pretrained_pth = './work_dirs/llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
+ pretrained_pth = '/root/share/new_models/xtuner/iter_2181.pth'
# Data
- data_root = './data/llava_data/'
+ data_root = '/root/tutorial/xtuner/llava/llava_data/'
- data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
+ data_path = data_root + 'repeated_data.json'
- image_folder = data_root + 'llava_images'
+ image_folder = data_root
# Scheduler & Optimizer
- batch_size = 16 # per_device
+ batch_size = 1 # per_device
# evaluation_inputs
- evaluation_inputs = ['请描述一下这张图片','Please describe this picture']
+ evaluation_inputs = ['Please describe this picture','What is the equipment in the image?']
5. Fine-tune the model:
Following the official tutorial, you need to download a pretrained projector checkpoint to fine-tune from, available at: Tutorial/xtuner/llava/iter_2181.pth at camp2 · InternLM/Tutorial · GitHub
Then run the fine-tuning command:
xtuner train /root/tutorial/xtuner/llava/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py --deepspeed deepspeed_zero2
Running this command, you will hit an awkward problem: loading the projector fails with a dimension-mismatch error. The projector inside iter_2181.pth has a hidden dimension of 2048 (it was pretrained for a 2048-dim LLM), while this setup expects 4096 (the hidden size of LLaMA-3-8B), so training cannot proceed.
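The root cause is simply that the linear layers saved in iter_2181.pth were shaped for a 2048-dim LLM, while the projector built from this config expects 4096. A toy reproduction of the error message (dimensions assumed, not the actual xtuner loading code):
import torch.nn as nn

pretrained_projector = nn.Linear(1024, 2048)  # shape of the weights inside iter_2181.pth (assumed)
llama3_projector = nn.Linear(1024, 4096)      # shape this config builds

try:
    llama3_projector.load_state_dict(pretrained_projector.state_dict())
except RuntimeError as err:
    print(err)  # size mismatch for weight: copying a param with shape [2048, 1024] ...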
However, the llava-llama3-8b weights we downloaded earlier already contain a projector. Unfortunately, pointing the config at that model.safetensors file leads to a loading error, because the checkpoint loader only understands .pth files. To work around this, I changed the way the projector checkpoint is read (the helper that llava.py calls) so that it also supports the safetensors format.
The function to modify is guess_load_checkpoint in xtuner/xtuner/model/utils.py:
# Note: this version additionally needs `import os` and `import safetensors`
# at the top of utils.py, alongside the existing `os.path as osp` and `torch` imports.
def guess_load_checkpoint(pth_model):
    if osp.isfile(pth_model):
        file_name, file_extension = os.path.splitext(pth_model)
        if file_extension == '.safetensors':
            # New branch: read a safetensors checkpoint into a flat state dict.
            state_dict = {}
            # Load on CPU, matching the torch.load branch below.
            with safetensors.safe_open(pth_model, framework='pt', device='cpu') as f:
                for k in f.keys():
                    state_dict[k] = f.get_tensor(k)
        else:
            state_dict = torch.load(pth_model, map_location='cpu')
            if 'state_dict' in state_dict:
                state_dict = state_dict['state_dict']
    elif osp.isdir(pth_model):
        try:
            from xtuner.utils.zero_to_any_dtype import \
                get_state_dict_from_zero_checkpoint
        except ImportError:
            raise ImportError(
                'The provided PTH model appears to be a DeepSpeed checkpoint. '
                'However, DeepSpeed library is not detected in current '
                'environment. This suggests that DeepSpeed may not be '
                'installed or is incorrectly configured. Please verify your '
                'setup.')
        state_dict = get_state_dict_from_zero_checkpoint(
            osp.dirname(pth_model), osp.basename(pth_model))
    else:
        raise FileNotFoundError(f'Cannot find {pth_model}')
    return state_dict
With the added safetensors branch, the projector checkpoint loads and training can proceed.
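A quick way to verify the patch is to call the function on the downloaded projector weights directly; the path below assumes the safetensors file sits in the projector folder of the llava-llama-3-8b download:
from xtuner.model.utils import guess_load_checkpoint

state_dict = guess_load_checkpoint('./llava-llama-3-8b/projector/model.safetensors')
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))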
Of course, after making this change you need to go back to the xtuner root directory and reinstall it:
pip install -e '.[all]' && cd ~
6. Train and check the results:
And that's a wrap!!!