DPO

2024-05-08

Introduction

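DPO stands for Direct Preference Optimization. It belongs to the RLHF family of alignment methods, but compared with PPO it needs only the actor and the ref model; the critic and the reward model are gone. Its core goal is to make the gap between the good response's log-probability and the bad response's log-probability grow larger and larger, which is very similar to the rank loss in ranking models, while at the same time not drifting too far from the ref model.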
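To make the objective concrete, here is a minimal pure-Python sketch of the loss for a single good/bad pair. The function name `dpo_loss` and the sample log-probabilities are made up for illustration; `beta` plays the same role as in the minimal implementation in this post.

```python
import math


def dpo_loss(policy_good, policy_bad, ref_good, ref_bad, beta=0.1):
    """DPO loss for one (good, bad) pair of summed response log-probabilities."""
    # How much more the policy prefers good over bad than the reference does.
    margin = (policy_good - ref_good) - (policy_bad - ref_bad)
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
same = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raised the good response and lowered the bad one: the loss drops.
better = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(same, better)
```

The loss only falls when the policy separates good from bad more than the reference already does, which is how the ref model acts as an anchor.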

Minimal implementation


from copy import deepcopy

import torch
import torch.nn.functional as F
from transformers import LlamaConfig, LlamaForCausalLM

torch.manual_seed(0)
# Hyperparameter: beta scales the preference margin
beta = 0.1

# data
prompt_ids = [1, 2, 3, 4, 5, 6]
good_response_ids = [7, 8, 9, 10]
# With a small change to the loss, one good response can be paired with several bad ones
bad_response_ids_list = [[1, 2, 3, 0], [4, 5, 6, 0]]

# Convert to model input: row 0 is prompt + good, the remaining rows are prompt + bad
input_ids = torch.LongTensor(
    [prompt_ids + good_response_ids, *[prompt_ids + bad_response_ids for bad_response_ids in bad_response_ids_list]]
)
# labels: mask the prompt with -100 and apply the one-position shift in advance
labels = torch.LongTensor(
    [
        [-100] * len(prompt_ids) + good_response_ids,
        *[[-100] * len(prompt_ids) + bad_response_ids for bad_response_ids in bad_response_ids_list]
    ]
)[:, 1:]
loss_mask = (labels != -100)
# gather needs valid indices; the masked positions are zeroed out by loss_mask later
labels[labels == -100] = 0


# Build the models: a tiny randomly initialised policy and a copy of it as the reference
policy_model = LlamaForCausalLM(config=LlamaConfig(vocab_size=1000, num_hidden_layers=1, hidden_size=128))
reference_model = deepcopy(policy_model)


# Log-probabilities under the policy model
logits = policy_model(input_ids)["logits"][:, :-1, :]
per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
all_logps = (per_token_logps * loss_mask).sum(-1)
# For now, hard-code that the first row is the good response
policy_good_logps, policy_bad_logps = all_logps[:1], all_logps[1:]

# Log-probabilities under the reference model (no gradients needed)
with torch.no_grad():
    logits = reference_model(input_ids)["logits"][:, :-1, :]
    per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
    all_logps = (per_token_logps * loss_mask).sum(-1)
    # For now, hard-code that the first row is the good response
    reference_good_logps, reference_bad_logps = all_logps[:1], all_logps[1:]

# Compute the loss; broadcasting pairs the single good response with every bad one
logits = (policy_good_logps - reference_good_logps) - (policy_bad_logps - reference_bad_logps)
loss = -F.logsigmoid(beta * logits).mean()
print(loss)


RLHF-Actor-Critic

2024-05-07

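Introduction

Value-based methods aim to learn a value function whose output scores the current decision. Policy-based methods aim to learn a policy function that gives the probability distribution over actions.

Actor-Critic builds on the policy function and additionally learns a value function, which helps the policy function learn better.

The figure below shows the relationship between the two well.

Focus on the log_probs part of the update function in the actor-critic algorithm: the actor acts according to the policy, and the critic evaluates it.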
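As a rough illustration of how the critic's evaluation feeds the actor's update, here is a pure-Python sketch of a one-step update for a single transition. All names and numbers (`gamma`, the value estimates, the action probability) are assumptions for illustration and are not taken from any particular implementation.

```python
import math

gamma = 0.98          # discount factor (assumed)
reward = 1.0          # reward received for the action
done = False          # whether the episode ended on this step
value_s = 0.5         # critic's estimate V(s)
value_next = 1.0      # critic's estimate V(s')
action_prob = 0.25    # actor's probability pi(a|s) of the action taken

# The critic's target: reward plus the discounted value of the next state.
td_target = reward + gamma * value_next * (0.0 if done else 1.0)
# TD error: how much better the outcome was than the critic expected.
td_delta = td_target - value_s

# Actor: raise log pi(a|s) in proportion to the TD error (treated as a constant).
actor_loss = -math.log(action_prob) * td_delta
# Critic: regress V(s) towards the target.
critic_loss = (value_s - td_target) ** 2

print(td_target, td_delta, actor_loss, critic_loss)
```

A positive TD error makes the sampled action more likely, a negative one makes it less likely; the critic only supplies the weight.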


RLHF-policy_gradient

2024-04-26

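Preface

This is the policy gradient part of the RLHF series. After reading the implementations in Hands-on-RL and parl, the overall difficulty did not seem high, but implementing it from scratch still produced some puzzling problems. Compared with ordinary deep learning, there are quite a few small details that need extra attention.

Things to watch out for

1. log smoothing

This refers to the step in the learn phase that maximizes the expectation, as in the following code:

output = self.model(obs_bs)
output = torch.log(output.gather(-1, action_bs.reshape(-1, 1)))

In my first implementation I did not apply the log, and the model would not converge (the maximum reward on CartPole-v0 is 200); it kept hovering around 8 or 9. Then I looked at the implementations above and found the extra log. This confused me, because I thought the step was unnecessary, for the following reasons:

The model.forward part already normalizes with softmax, which already avoids large differences in scale.
From a certain angle, I even felt the softmax in model.forward should not be there either, since in essence we just want to maximize the expectation.

However, without the log step the model simply does not converge.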
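One likely explanation for why the log matters: when actions are sampled from the policy, the gradient of the expected reward equals the expectation of reward times the gradient of the log-probability, not of the raw probability. The pure-Python sketch below checks this on a two-action softmax policy. The logits and rewards are made-up numbers, and the helper names are only for illustration.

```python
import math


def softmax(logits):
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_prob(probs, a):
    """Gradient of pi(a) with respect to each logit."""
    return [probs[a] * ((1.0 if a == k else 0.0) - probs[k]) for k in range(len(probs))]


def grad_log_prob(probs, a):
    """Gradient of log pi(a) with respect to each logit."""
    return [(1.0 if a == k else 0.0) - probs[k] for k in range(len(probs))]


rewards = [1.0, 3.0]               # reward of each action (made up)
probs = softmax([0.2, -0.4])       # current policy over two actions
actions = range(len(probs))

# Exact gradient of the expected reward J = sum_a pi(a) * r(a).
true_grad = [sum(rewards[a] * grad_prob(probs, a)[k] for a in actions) for k in actions]
# What sampling a ~ pi and using r(a) * grad log pi(a) gives on average.
with_log = [sum(probs[a] * rewards[a] * grad_log_prob(probs, a)[k] for a in actions) for k in actions]
# What sampling a ~ pi and using r(a) * grad pi(a) gives on average.
without_log = [sum(probs[a] * rewards[a] * grad_prob(probs, a)[k] for a in actions) for k in actions]

print(true_grad, with_log, without_log)
```

The version with the log matches the true gradient, while the version without it weights every action by its probability twice, so it is a biased update direction.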

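2. Discount factor

This refers to what happens after each done, once a batch of states, rewards and actions has been collected: when computing the loss, the value used for each step is not its raw reward but the accumulated reward from that step onward, so (with CartPole's positive rewards) a later step's value is always smaller than the current step's. This is the calc_reward_to_go function.

This was also confusing at first, because if the goal is to maximize the expectation, it would seem enough to maximize reward * prob.

The Hands-on-RL implementation is not confusing at all, because it does this with a for loop. The parl implementation, on the other hand, adds the calc_reward_to_go function on top of the reward * prob step, which forcibly turns what was a set of separate steps into a chained, continuous process. The catch is that it does not look continuous at all, because the loss for the current step has no direct connection to the previous step.

This point is fine, though. In terms of understandability I lean towards the Hands-on-RL implementation.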
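For reference, the reward-to-go computation can be sketched in pure Python as below. It follows the same recurrence as calc_reward_to_go in the source code of this post, but returns a new list instead of modifying its input. The function name `reward_to_go` and the sample rewards are assumptions for illustration.

```python
def reward_to_go(rewards, gamma=1.0):
    """Return G_i = r_i + gamma * G_{i+1} for every step i."""
    returns = list(rewards)  # copy so the caller's list is left untouched
    for i in range(len(returns) - 2, -1, -1):
        returns[i] += gamma * returns[i + 1]
    return returns


undiscounted = reward_to_go([1.0, 1.0, 1.0])
discounted = reward_to_go([1.0, 1.0, 1.0], gamma=0.5)
print(undiscounted)  # [3.0, 2.0, 1.0]
print(discounted)    # [1.75, 1.5, 1.0]
```

Each step is weighted by the reward it can still influence, which is why later steps end up with smaller values.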

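Summary

On the first point, my feeling is that the steps should not be too big: converging slowly is better than not converging at all. For example, I tried changing the model as follows: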

class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        # return F.softmax(self.fc2(x), dim=-1)
        return self.fc2(x)
    @torch.no_grad()
    def sample(self, obs):
        result = self(torch.FloatTensor(obs))
        result = F.softmax(result, dim=-1)
        return torch.multinomial(result, 1).item()

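That is, forward no longer applies softmax and the rest of the code stays the same. It also fails to converge, so it seems that softmax + log is the key to making this algorithm work.

This reminds me of parameter initialization for layers in deep learning, for example:

a = nn.Linear(2222, 11)

This is how we usually write it, and we rarely care how the parameters of this linear layer a are initialized. That is because the framework already takes care of initialization techniques such as kaiming, xavier and uniform internally. Without them, the model would also have a hard time converging.

So for applications, use an existing implementation; for research, these details can be adjusted gradually.

Source code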

import random

import gym
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import rl_utils
from parl.env import CompatWrapper


class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=-1)

    @torch.no_grad()
    def sample(self, obs):
        result = self(torch.FloatTensor(obs))
        return torch.multinomial(result, 1).item()

def calc_reward_to_go(reward_list, gamma=1.0):
    for i in range(len(reward_list) - 2, -1, -1):
        # G_i = r_i + γ·G_i+1
        reward_list[i] += gamma * reward_list[i + 1]  # Gt
    return np.array(reward_list)

class Agent(object):
    def __init__(self, state_dim, action_dim, hidden_dim):
        self.model = PolicyNet(state_dim=state_dim, hidden_dim=hidden_dim, action_dim=action_dim)
        self.optim = torch.optim.Adam(self.model.parameters(), lr=1e-3)

    def train(self, env):
        total_reward = 0
        obs_bs, action_bs, next_obs_bs, reward_bs, done_bs = [], [], [], [], []
        obs = env.reset()
        while True:
            action = self.model.sample(obs=obs)
            next_obs, reward, done, info = env.step(action=action)
            if done:
                break
            obs_bs.append(obs)
            action_bs.append(action)
            next_obs_bs.append(next_obs)
            reward_bs.append(reward)
            done_bs.append(done)

            total_reward += reward
            obs = next_obs
        if not obs_bs: return total_reward
        # learn
        reward_bs = calc_reward_to_go(reward_bs)

        obs_bs = torch.FloatTensor(np.array(obs_bs))
        action_bs = torch.LongTensor(np.array(action_bs))
        reward_bs = torch.FloatTensor(np.array(reward_bs))

        output = self.model(obs_bs)
        output = torch.log(output.gather(-1, action_bs.reshape(-1, 1)))

        loss = torch.mean(-1 * output.flatten() * reward_bs)

        self.optim.zero_grad()
        loss.backward()
        self.optim.step()

        return total_reward


if __name__ == '__main__':
    env = CompatWrapper(gym.make('CartPole-v0'))
    agent = Agent(env.observation_space.shape[0], action_dim=env.action_space.n, hidden_dim=64)
    for epoch in range(10000):
        total_reward = agent.train(env=env)
        print(f'Epoch[{epoch}]-->{total_reward}')


RLHF-DQN

2024-04-24

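Preface

Below are notes on the DQN algorithm and some of its details. Note that this blog is mainly a record of my understanding at the time; it is not complete or rigorous, and may contain misunderstandings.

On DQN, I read some introductions online, as well as how going from Q-Learning to DQN handles the case where states and actions cannot be enumerated. I also strongly recommend the following links:

A DQN implementation by a Zhihu user: you can run it directly and get a feel for the results.
PaddlePaddle/PARL: a reinforcement learning library from paddle that provides examples for getting started and going deeper, and also takes care of the environment. If you are good at debugging, I suggest reading it directly.
PyTorch DQN implementation: the official PyTorch implementation of the DQN algorithm.

Some special points

1. I am value based, so no softmax is needed

Look at the DQN network below. Do you see anything wrong with it?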

class DQN(nn.Module):
    def __init__(self, obs, action):
        super().__init__()
        self.fc1 = nn.Linear(obs, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action)

    def forward(self, obs):
        x = torch.relu(self.fc1(obs))
        x = torch.relu(self.fc2(x))
        x = torch.softmax(self.fc3(x), dim=-1)
        return x

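Why add softmax here? It is probably a habit carried over from classification in deep learning. Think again: a DQN outputs Q-values, not probabilities, so softmax should not be applied.

2. What is eps for, and is it necessary?

Looking closely at the code below, eps takes effect during sampling, that is, in select_action. At the end of each epoch eps is updated, going from 1 at the start down to eps_end=0.1 along a fairly smooth curve.

From select_action we can see that actions are sampled randomly early on, and as the epochs go by, the decision gradually shifts to the model.

Is this necessary?

No, it is not strictly necessary, because one could add a warmup step before training so that the model learns the state-to-action mapping to some degree. After that comes the normal training loop, with no need for this sampling process.

So what role does it mainly play?

One way to understand it is as the process of going from a complete beginner, through gradual learning, to real understanding. Warmup means producing a batch of training samples in advance, while eps means learning gradually along the way. Compared with warmup, though, a better option is weighted random sampling, i.e. torch.multinomial(torch.softmax(self(obs), dim=-1), 1).item(). Its characteristic is that actions with higher weight show up more often across repeated samples, so as the model improves, the chance of sampling better actions also rises.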
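To get a feel for the schedule, here is a pure-Python sketch that replays the update eps = max(eps * eps_decay, eps_end) with the default values of the train_dqn function in this post and counts how many episodes it takes to reach the floor. The helper name `episodes_until_floor` is made up for illustration.

```python
def episodes_until_floor(eps_start=1.0, eps_end=0.1, eps_decay=0.995):
    """Count how many per-episode updates it takes for eps to hit eps_end."""
    eps, episodes = eps_start, 0
    while eps > eps_end:
        eps = max(eps * eps_decay, eps_end)
        episodes += 1
    return episodes


n = episodes_until_floor()
print(n)
```

With these defaults the agent spends a little under half of the default 1000 episodes shifting from random actions to model-driven ones.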

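3. Is target_net necessary?

No, it is not strictly necessary either. Looking at the whole process, policy_net is just one epoch newer than target_net, and the two are really doing the same thing. So target_net can simply point to policy_net, and the reward still increases normally as the epochs go by.

4. What is the replay memory for?

The replay memory records the whole history of states, actions and so on. It has a maxlen that limits its size, so data that is too old gets dropped.

It should be pointed out, however, that with the changes above the model can still converge, but convergence may become slower.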
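The replay memory from point 4 can be sketched with the standard library alone. The transitions below are toy tuples and the tiny maxlen is chosen only to make the eviction visible.

```python
import random
from collections import deque

memory = deque(maxlen=3)  # keep only the 3 most recent transitions
for step in range(5):
    # (state, action, reward, next_state, done) with toy values
    memory.append((step, 0, 1.0, step + 1, False))

stored_states = [t[0] for t in memory]
print(stored_states)  # [2, 3, 4]

random.seed(0)
batch = random.sample(list(memory), 2)  # sample a mini-batch without replacement
print(len(batch))
```

Sampling at random from the buffer, instead of training on transitions in the order they occurred, is what the train method in the source code does with random.sample.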

Source code

This borrows from the Zhihu user's DQN implementation.

pygame==2.1.0


import gym
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import random
from collections import deque
from tqdm import tqdm


class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x


class Agent():
    def __init__(self, state_dim, action_dim, memory_size=10000, batch_size=64, gamma=0.99, lr=1e-4):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.state_dim = state_dim
        self.action_dim = action_dim
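        # deque is a double-ended queue that supports insertion and removal at both ends.
        # DQN uses it as the experience pool: new experience is appended at the tail and,
        # once full, the oldest experience is dropped from the head, so the pool size stays bounded.
        self.memory = deque(maxlen=memory_size)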
        self.batch_size = batch_size
        self.gamma = gamma
        self.lr = lr
        self.policy_net = DQN(state_dim, action_dim).to(self.device)
        self.target_net = DQN(state_dim, action_dim).to(self.device)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=self.lr)
        self.loss_fn = nn.MSELoss()
        self.steps = 0
        self.writer = SummaryWriter()

    def select_action(self, state, eps):
        if random.random() < eps:
            return random.randint(0, self.action_dim - 1)
        else:
            state = torch.FloatTensor(state).to(self.device)
            with torch.no_grad():
                action = self.policy_net(state).argmax().item()
            return action

    def store_transition(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def train(self):
        if len(self.memory) < self.batch_size:
            return
        transitions = random.sample(self.memory, self.batch_size)
        batch = list(zip(*transitions))

        state_batch = torch.FloatTensor(batch[0]).to(self.device)
        action_batch = torch.LongTensor(batch[1]).to(self.device)
        reward_batch = torch.FloatTensor(batch[2]).to(self.device)
        next_state_batch = torch.FloatTensor(batch[3]).to(self.device)
        done_batch = torch.FloatTensor(batch[4]).to(self.device)

        q_values = self.policy_net(state_batch).gather(1, action_batch.unsqueeze(1)).squeeze(1)
        next_q_values = self.target_net(next_state_batch).max(1)[0]
        expected_q_values = reward_batch + self.gamma * next_q_values * (1 - done_batch)

        loss = self.loss_fn(q_values, expected_q_values.detach())

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        self.steps += 1
        self.writer.add_scalar("Loss", loss.item(), self.steps)

    def update_target(self):
        self.target_net.load_state_dict(self.policy_net.state_dict())


def train_epoch(env, eps):
    state = env.reset()
    total_reward = 0
    while True:
        action = agent.select_action(state, eps)

        next_state, reward, done, _ = env.step(action)

        agent.store_transition(state, action, reward, next_state, done)
        state = next_state
        agent.train()

        # env.render()
        total_reward += reward
        if done:
            break
    return total_reward


def train_dqn(env, agent: Agent, eps_start=1, eps_end=0.1, eps_decay=0.995, max_episodes=1000, max_steps=1000):
    eps = eps_start
    for episode in tqdm(range(max_episodes)):
        reward = train_epoch(env, eps)
        agent.update_target()
        eps = max(eps * eps_decay, eps_end)
        print(f'{episode} --> {reward}')


if __name__ == "__main__":
    env = gym.make("CartPole-v1")
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    agent = Agent(state_dim, action_dim)
    train_dqn(env, agent)


How LLM-RLHF works, part 1

2024-04-07

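Preface

Large models are very hot lately: the media coverage is everywhere, and candidates write LLM fine-tuning on their resumes. Using the examples in the huggingface trl/RLHF notebooks as an entry point, this post introduces where RLHF sits in the overall training workflow and what role it plays, to make it easier to understand and apply later.

Code analysis

The huggingface trl/RLHF notebooks folder contains three examples:

gpt2-sentiment.ipynb
gpt2-sentiment-control.ipynb
best_of_n.ipynb

They are analyzed below in this order.

I. gpt2-sentiment.ipynb

Purpose: this file shows how to use RLHF to learn to generate positive reviews.

1. Load IMDB dataset

The dataset has two fields by default, text and label: a user's review of a movie and the sentiment of that review (positive or negative).

The text field is truncated at a random length n and everything after it is dropped. For example, text=这个电影我觉得很棒。 ("I think this movie is great.") becomes query=这个电影 ("this movie") after truncation.

2. Model and Ref Model

GPT2 is used as the model to be trained, and the ref model is identical to it. For now, think of the model as the one being trained and the ref model as the one used for reference.
The ref model is an indispensable part of RLHF training, in the same way as the ValueHead layer added on top of the generation model. The finer-grained reinforcement learning details are skipped in this post.

3. reward model

The distilbert-imdb model is used as the scoring model. Given a review, it outputs scores for positive and negative.

4. Training

The model generates up to max_new_tokens of text from the query, the reward model scores it, and with the positive score as the objective the model is optimized continually until it can generate positive reviews from a user-supplied opening.

max_new_tokens is also interesting here, because it can be read in two different ways:

The text will not be very long
Discount factor

The latter is the more interesting one. In RLHF, a score can be given at every step, or a single score can be given for the whole path after it is finished. Can max_new_tokens then be seen as an intermediate state between the two?
It neither makes training inefficient by scoring every step, nor suffers the larger deviation that comes from a wrong decision at some point when only the whole path is scored; it tries to mitigate both.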
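The scoring step described under "4. Training" can be sketched as follows. The two sample outputs are copied from the reward-model test in the experiment section of this post. The helper `positive_reward` is a made-up name, and treating LABEL_1 as the positive class is an assumption that matches how the training script in this post reads the score.

```python
def positive_reward(pipe_output):
    """Pick the raw positive-class score (LABEL_1) from one pipeline result."""
    scores = {item["label"]: item["score"] for item in pipe_output}
    return scores["LABEL_1"]


pipe_outputs = [
    [{"label": "LABEL_0", "score": -2.200057029724121},
     {"label": "LABEL_1", "score": 2.2879886627197266}],
    [{"label": "LABEL_0", "score": 4.0778069496154785},
     {"label": "LABEL_1", "score": -4.0350117683410645}],
]
rewards = [positive_reward(output) for output in pipe_outputs]
print(rewards)
```

A positive-sounding text gets a positive reward and a negative-sounding one gets a negative reward, which is the signal PPO then tries to increase.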

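That is all.

II. gpt2-sentiment-control.ipynb

Purpose: control the sentiment of the generated review by adding a prompt.

There are three kinds of prompt: positive, negative and neutral. Since neutral is something the reward model itself cannot judge, it can be skipped here.

An example of how the query is built:

query="[positive]这个电影很"

The expected continuation is then 好 (good).

If it is

query="[negative]这个电影很"

the expected continuation is something negative such as 不好 (not good) or 差 (bad).

The rest of the flow is the same as in the file above and is omitted here.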
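A minimal sketch of the idea, assuming the reward is the reward model's positive score and that it is simply negated for [negative] prompts. The helper names `build_query` and `controlled_reward` are hypothetical, and neutral is left out, as in the text above.

```python
def build_query(sentiment, text):
    """Prefix the text with a control token such as [positive]."""
    return f"[{sentiment}]{text}"


def controlled_reward(sentiment, positive_score):
    """Reward agreement between the requested sentiment and the score."""
    if sentiment == "positive":
        return positive_score
    if sentiment == "negative":
        return -positive_score
    raise ValueError(f"unsupported sentiment: {sentiment}")


query = build_query("negative", "这个电影很")
print(query)                                # [negative]这个电影很
print(controlled_reward("negative", -4.0))  # 4.0
```

With this shaping, the same reward model can push generation in either direction depending on the prefix.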

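III. best_of_n.ipynb

Purpose: the goal of RLHF is to push past the original ceiling, so this file picks the best of n samples from the ref model and compares that against the model after RLHF training.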
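Best-of-n itself needs no training: sample several candidates and keep the one the reward model likes most. Below is a pure-Python sketch with toy stand-ins. The `toy_generate` and `toy_score` functions are assumptions that replace the real language model and reward model.

```python
import itertools


def best_of_n(generate, score, query, n=4):
    """Sample n candidates for the query and return the highest-scoring one."""
    candidates = [generate(query) for _ in range(n)]
    return max(candidates, key=score)


# Toy "model": appends the next word from a fixed cycle (deterministic on purpose).
_endings = itertools.cycle(["bad", "okay", "great"])


def toy_generate(query):
    return f"{query} {next(_endings)}"


# Toy "reward model": scores a text by its last word.
def toy_score(text):
    return {"bad": -1.0, "okay": 0.0, "great": 1.0}[text.split()[-1]]


best = best_of_n(toy_generate, toy_score, "this movie is", n=4)
print(best)  # this movie is great
```

The cost is n generations per query at inference time, which is the trade-off against paying the cost once during RLHF training.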

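Overall, the reward model plays a very important role and determines how well RLHF works, so it deserves attention.

For more, read the original code; the overall flow is not very complicated. Another filler post, done.

Experiment

1. Dataset

ChnSentiCorp is used as the sentiment classification dataset.

2. reward model

train_score.py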


import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

checkpoint = 'chinese-roberta-wwm-ext'

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_dataset = load_dataset(
    'csv',
    data_files={
        "train": "train.tsv",
        "dev": "dev.tsv",
    },
    delimiter='\t'
)


def data_collator(batch_data):
    text_a = [_['text_a'] for _ in batch_data]
    data = tokenizer.batch_encode_plus(text_a, max_length=510, truncation=True, return_tensors='pt', padding=True)
    data.update({"labels": torch.tensor([_['label'] for _ in batch_data]).reshape(-1, 1)})
    return data


def compute_metrics(data):
    from umetrics.macrometrics import MacroMetrics
    macro = MacroMetrics(labels=[0, 1])
    predictions = data.predictions.argmax(-1).tolist()
    labels = data.label_ids.flatten().tolist()
    macro.step(y_trues=labels, y_preds=predictions)
    macro.classification_report(print)
    return {"f1": macro.f1_score()}



args = TrainingArguments(
    output_dir='score_model',
    remove_unused_columns=False,
    seed=1,
    do_train=True,
    do_eval=True,
    evaluation_strategy='epoch',
    learning_rate=1e-5,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    fp16=True,
    save_total_limit=1,
    metric_for_best_model='f1',
  
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=data_collator,
    train_dataset=raw_dataset['train'],
    eval_dataset=raw_dataset['dev'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

After training, F1 reaches 95%, so the scoring model is done at this point.

3. train RLHF

import pandas as pd
import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoTokenizer, set_seed, pipeline
from trl import AutoModelForCausalLMWithValueHead, create_reference_model, PPOConfig, PPOTrainer
from trl.core import LengthSampler

model_name = "Wenzhong-GPT2-110M"
config = PPOConfig(
    model_name=model_name,
    learning_rate=1.41e-5,
    log_with="tensorboard",
    batch_size=64,
    gradient_accumulation_steps=8,
    mini_batch_size=8,
)
sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}

gpt2_tokenizer = AutoTokenizer.from_pretrained(config.model_name)
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

set_seed(1)


# ################

def build_dataset(input_min_text_length=2, input_max_text_length=8):
    ds = load_dataset(
        'csv',
        data_files={
            "train": "train.tsv",
            "dev": "dev.tsv",
        },
        delimiter='\t'
    )['train']
    ds = ds.rename_columns({"text_a": "review"})
    ds = ds.filter(lambda x: len(x['review']) > 10, batched=False)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = gpt2_tokenizer.encode(sample["review"][: input_size()])
        sample["query"] = gpt2_tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds


dataset = build_dataset()


# ##################

def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])


# ##################

sentiment_pipe = pipeline(
    "sentiment-analysis",
    model='/score_model/checkpoint-1000',
    device='cuda:1'
)

text = "今天天气很好"
print(sentiment_pipe(text, **sent_kwargs))
# [[{'label': 'LABEL_0', 'score': -2.200057029724121}, {'label': 'LABEL_1', 'score': 2.2879886627197266}]]
text = '这个电影主角演的真心一般般'
print(sentiment_pipe(text, **sent_kwargs))
# [[{'label': 'LABEL_0', 'score': 4.0778069496154785}, {'label': 'LABEL_1', 'score': -4.0350117683410645}]]

# # ##############
#
output_min_length = 16
output_max_length = 32
output_length_sampler = LengthSampler(output_min_length, output_max_length)

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = create_reference_model(model)

ppo_trainer = PPOTrainer(config, model, ref_model, gpt2_tokenizer, dataset, data_collator=collator)

generation_kwargs = {
    "min_length": -1,

    "top_k": 0,
    "top_p": 1,
    "do_sample": True,
    "pad_token_id": gpt2_tokenizer.eos_token_id,
    "bos_token_id": gpt2_tokenizer.bos_token_id,
    "eos_token_id": gpt2_tokenizer.eos_token_id,
}
device = 'cuda:0'

for i in range(3):
    for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
        query_tensors = batch['input_ids']
        response_tensors = []
        for query in query_tensors:
            gen_len = output_length_sampler()

            response = ppo_trainer.generate(query, **{**generation_kwargs, "max_new_tokens": gen_len})
            response_tensors.append(response.squeeze()[-gen_len:])
        batch["response"] = [gpt2_tokenizer.decode(r.squeeze()) for r in response_tensors]
        #### Compute sentiment score
        texts = [q + r for q, r in zip(batch["query"], batch["response"])]
        pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
        rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

        #### Run PPO step
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)

        #### get a batch from the dataset
        bs = 16
        game_data = dict()
        dataset.set_format("pandas")
        df_batch = dataset[:].sample(bs)
        game_data["query"] = df_batch["query"].tolist()
        query_tensors = df_batch["input_ids"].tolist()

        response_tensors_ref, response_tensors = [], []

        #### get response from gpt2 and gpt2_ref
        for i in range(bs):
            gen_len = output_length_sampler()
            output = ref_model.generate(
                torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **generation_kwargs
            ).squeeze()[-gen_len:]
            response_tensors_ref.append(output)
            output = model.generate(
                torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **generation_kwargs
            ).squeeze()[-gen_len:]
            response_tensors.append(output)

        #### decode responses
        game_data["response (before)"] = [gpt2_tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
        game_data["response (after)"] = [gpt2_tokenizer.decode(response_tensors[i]) for i in range(bs)]

        #### sentiment analysis of query/response pairs before/after
        texts = [q + r for q, r in zip(game_data["query"], game_data["response (before)"])]
        game_data["rewards (before)"] = [output[1]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]

        texts = [q + r for q, r in zip(game_data["query"], game_data["response (after)"])]
        game_data["rewards (after)"] = [output[1]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]

        # store results in a dataframe
        df_results = pd.DataFrame(game_data)
        print(df_results)

        print("mean:")
        print(df_results[["rewards (before)", "rewards (after)"]].mean())
        print()
        print("median:")
        print(df_results[["rewards (before)", "rewards (after)"]].median())

model.save_pretrained("gpt2-imdb-pos-v2", push_to_hub=False)
gpt2_tokenizer.save_pretrained("gpt2-imdb-pos-v2", push_to_hub=False)

4. eval RLHF

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_path1 = 'Wenzhong-GPT2-110M'
model_path2 = 'gpt2-imdb-pos-v2'

for model_path in (model_path1, model_path2):
    model = AutoModelForCausalLM.from_pretrained(model_path).to("cuda")

    tokenizer = AutoTokenizer.from_pretrained(model_path)

    generation_kwargs = {
    "min_length": -1,

    "top_k": 0,
    "top_p": 1,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "bos_token_id": tokenizer.bos_token_id,
    "eos_token_id": tokenizer.eos_token_id,
    "max_new_tokens": 100,
    }

    for query in ("这个电影", "今天要下雨，心情太", "今天天气很差", '这个手机屏幕很差，手感'):
        input_ids = torch.tensor(tokenizer.encode(query)).to("cuda").reshape(1, -1)
        response = model.generate(input_ids, **generation_kwargs)[0]
        print(tokenizer.decode(response))

Sample generations after RLHF training:

这个电影，见证点深的艺术价值。<|endoftext|>
今天要下雨，心情太好了，很舒服了。整个人还算可以，心情的状态都还可以的好。所以，周末的时候接触我的朋友们都会给我一种�
今天天气很差，但是连续8天都是安安静静休息，也很好！水的这出入口很好，很适合我们的婚了。<|endoftext|>
这个手机屏幕很差，手感显示是手机配置高品质，但真是非常棒！<|endoftext|>

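The first two examples are fine, since it is relatively easy to generate the desired positive review for them. In the third and fourth examples the prefix already says the weather is bad and the screen is bad, both leaning negative, yet the generated continuation still turns out positive, which shows that reinforcement learning did produce the expected effect.

As for fluency, one issue is that Wenzhong-GPT2 itself is fairly weak at producing fluent sentences. Still, after several rounds of training it also shows effects beyond just generating positive reviews, which is nice.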


Trying out qwen1.8B

2023-12-21

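Introduction

Alibaba released qwen1.8B. For scenarios with limited resources or that need long-text support, it is probably the best choice at this scale in China right now. Below I use it to go through the whole flow of fine-tuning and deployment, as a record.

Fine-tuning

First, get it running by following the requirements and quickstart, install flash-attn, and run inference. That works, so next comes fine-tuning.

Following the fine-tuning guide, LoRA is used here. Note that although the official docs give memory usage and training speed, the memory usage I observed on a 1080Ti was somewhat higher. Treat the official figure as the minimum memory needed to get it running; during training it will climb a bit.

Training uses the default configuration of finetune_lora_single_gpu.sh. Luckily I used the dev set rather than train: 7,500 examples took more than 8 hours. The loss looked normal throughout, though, with none of the problems reported in the issues.

Usage

I saved the result to output_qwen. The following loads the LoRA fine-tuned model.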

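from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig

path = 'Qwen-1_8B-Chat'
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# Load the base model together with the LoRA adapter saved by fine-tuning
model = AutoPeftModelForCausalLM.from_pretrained(
     'Qwen/output_qwen',
     device_map='auto',
     trust_remote_code=True
)
model.generation_config = GenerationConfig.from_pretrained(path, trust_remote_code=True)
model.eval()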

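Note that if you plan to deploy at this point, base_model_name_or_path in adapter_config.json must point to the right location. The official docs also provide merge code that merges LoRA and qwen together.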

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()

merged_model = model.merge_and_unload()
# max_shard_size and safe serialization are not necessary. 
# They respectively work for sharding checkpoint and save the model to safetensors
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)

llama.cpp

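If you need llama.cpp, I still recommend doing the merge step above first. Using llama.cpp is not complicated: follow the official README, install cmake, and build. The later steps are not complicated either, and the official usage instructions are very clear.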

mkdir build
cd build
cmake ..
cmake --build . --config Release

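Note, however, that whether to quantize or convert at all should be decided by the quality and efficiency after conversion. Since we currently run directly on GPU, I will try going deeper into llama.cpp separately when there is a chance.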


LoRA: principles and implementation

2023-11-29

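Introduction

LoRA is a method from Microsoft for fine-tuning large models in low-resource settings. In the transformers ecosystem it is available through the peft package. It fine-tunes by freezing the pretrained model weights and training only the newly added lora layers. There is related material for models such as Baichuan2 and ChatGLM; search for more details.

A simple way to understand it

Put simply, suppose the linear layer for qkv is 768*768 (larger models may be bigger). LoRA adds two linear layers (lora_A and lora_B) and introduces a hyperparameter r to reduce the number of trainable parameters. The pseudocode is as follows: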


in_feature, out_feature = 768, 768
r = 8  # LoRA rank, a hyperparameter much smaller than 768

# old
self.q = nn.Linear(in_feature, out_feature)

# Lora

self.lora_A = nn.Linear(in_feature, r, bias=False)
self.lora_B = nn.Linear(r, out_feature, bias=False)

For example, with r set to 1, the change in the number of trainable parameters is obvious.
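The difference is easy to quantify. The sketch below counts parameters for the 768*768 example, assuming the original layer has a bias and the two LoRA matrices do not; the helper names are made up for illustration.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of parameters in a dense layer."""
    return in_features * out_features + (out_features if bias else 0)


def lora_params(in_features, out_features, r):
    """Parameters of the two bias-free low-rank matrices A (in x r) and B (r x out)."""
    return linear_params(in_features, r, bias=False) + linear_params(r, out_features, bias=False)


full = linear_params(768, 768)
for r in (1, 8):
    extra = lora_params(768, 768, r)
    print(r, extra, f"{extra / full:.2%}")
```

With r=1 the trainable part is 1,536 parameters against 590,592 in the frozen layer, and with r=8 it is 12,288.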

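Note: LoRA mainly pays off during fine-tuning. It speeds up fine-tuning but does not speed up inference, and because it introduces extra computation, inference becomes slightly slower when the lora layers are not merged into the original weights.

paddle implementation

Note that this post explains how LoRA works by walking through the implementation of LoraLinear.

Its implementation is here.

Point 1

LoraLinear inherits from nn.Linear, so it is a modification on top of Linear.

Point 2

The weight of Linear is not trained, which matches the description of how LoRA works.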

self.weight.stop_gradient = True

Point 3

The forward part looks like this:

def forward(self, input: paddle.Tensor):
    result = F.linear(x=input, weight=self.weight, bias=self.bias, name=self.name)
    if not self.merged:
        result += (self.lora_dropout(input) @ self.lora_A @ self.lora_B) * self.scaling
    return result

When the pretrained model and lora are not merged, inference first runs the pretrained model to get result, then adds the lora branch's output to obtain the final result.
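The merged and unmerged paths give the same output, which is why merging is safe. Here is a tiny pure-Python check that follows the same input @ lora_A @ lora_B layout as the forward above, with dropout and bias left out. All matrices are made-up numbers and the helper functions are only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1.0, 2.0]]                      # one input row, in_features = 2
weight = [[1.0, 0.0], [0.0, 1.0]]     # frozen weight, shape [in, out]
lora_A = [[0.5], [-0.5]]              # shape [in, r] with r = 1
lora_B = [[2.0, 4.0]]                 # shape [r, out]
scaling = 0.5

# Not merged: base output plus the scaled low-rank branch.
unmerged = add(matmul(x, weight), matmul(matmul(x, lora_A), lora_B), scaling)
# Merged: fold scaling * A @ B into the weight once, then do a single matmul.
merged_weight = add(weight, matmul(lora_A, lora_B), scaling)
merged = matmul(x, merged_weight)

print(unmerged, merged)  # [[0.5, 1.0]] [[0.5, 1.0]]
```

After merging, inference costs the same as the original layer, since the extra branch has been absorbed into the weight.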

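Summary

Within Parameter Efficient fine-tuning, LoRA is one approach; others include p-tuning. Through LoRA we get a rough picture of the shift from fine-tuning the pretrained model directly to introducing new parameters so that the capabilities of an LLM can be used with few resources.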


document-QA-layoutLMv2

2023-10-31

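Introduction

Picking up from earlier posts: "Fine-tuning layoutLM on the FUNSD dataset" covered how layoutlm and layoutxlm do named entity recognition, and "Multimodal: CLIP" and "Multimodal: caption generation" covered how modalities are fused. This post continues with the layoutLM series and steps through huggingface document_question_answering in a debugger to see how it is implemented.

Update: the training code for layoutxlm on docvqa_zh is now available in document-qa.

Raw data

Everything before the model is about how to process the data, namely the following code: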


from datasets import load_dataset

dataset = load_dataset("nielsr/docvqa_1200_examples")

updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"])
updated_dataset = updated_dataset.map(
    lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"]
)
updated_dataset = updated_dataset.filter(lambda x: len(x["words"]) + len(x["question"].split()) < 512)

updated_dataset = updated_dataset.remove_columns("words")
updated_dataset = updated_dataset.remove_columns("bounding_boxes")

updated_dataset['train'] = updated_dataset['train'].select(range(10))
updated_dataset['test'] = updated_dataset['test'].select(range(5))

>>> dataset['test'].select(range(1)).to_dict().keys()
dict_keys(['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'])

>>> dataset['test'].select(range(1)).to_dict()['query']
[{'de': 'Was ist die Standortadresse von NSDA?', 'en': 'What the location address of NSDA?', 'es': '¿Cuál es la dirección de ubicación del NSDA?', 'fr': "Quelle est l'adresse de la NSDA?", 'it': "Qual e' l'indirizzo della NSDA?"}]

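As shown, the dataset has the fields above by default, and query is available in several languages, including German and English. updated_dataset then keeps only the English question and filters to examples shorter than 512, leaving these fields: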

>>> updated_dataset['test']
Dataset({
    features: ['id', 'image', 'answer', 'question'],
    num_rows: 5
})

>>> updated_dataset['test'].select(range(1)).to_dict()['question']
['What the location address of NSDA?']
>>> updated_dataset['test'].select(range(1)).to_dict()['answer']
['1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036']

It actually got simpler, so there is no need to pay special attention to the dataset any more.

One example looks like this:

>>> aaa = updated_dataset['test'].select(range(1)).to_dict()

>>> import io
>>> from PIL import Image
>>> Image.open(io.BytesIO(aaa['image'][0]['bytes'])).show()
>>> aaa['question']
['What the location address of NSDA?']
>>> aaa['answer']
['1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036']

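Image processing

Note that the annotated bboxes and the like are thrown away; we have to produce them ourselves. Still, this helps in understanding how the processing works.

This part corresponds to Preprocessing document images, namely the following code.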

image_processor = processor.image_processor


def get_ocr_words_and_boxes(examples):
    images = [image.convert("RGB") for image in examples["image"]]
    encoded_inputs = image_processor(images)

    examples["image"] = encoded_inputs.pixel_values
    examples["words"] = encoded_inputs.words
    examples["boxes"] = encoded_inputs.boxes

    return examples

dataset_with_ocr = updated_dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2)

Stepping in from image_processor eventually leads to apply_tesseract, whose code is shown below:


def apply_tesseract(
    image: np.ndarray,
    lang: Optional[str],
    tesseract_config: Optional[str] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
    """Applies Tesseract OCR on a document image, and returns recognized words + normalized bounding boxes."""
    tesseract_config = tesseract_config if tesseract_config is not None else ""

    # apply OCR
    pil_image = to_pil_image(image, input_data_format=input_data_format)
    image_width, image_height = pil_image.size
    data = pytesseract.image_to_data(pil_image, lang=lang, output_type="dict", config=tesseract_config)
    words, left, top, width, height = data["text"], data["left"], data["top"], data["width"], data["height"]

    # filter empty words and corresponding coordinates
    irrelevant_indices = [idx for idx, word in enumerate(words) if not word.strip()]
    words = [word for idx, word in enumerate(words) if idx not in irrelevant_indices]
    left = [coord for idx, coord in enumerate(left) if idx not in irrelevant_indices]
    top = [coord for idx, coord in enumerate(top) if idx not in irrelevant_indices]
    width = [coord for idx, coord in enumerate(width) if idx not in irrelevant_indices]
    height = [coord for idx, coord in enumerate(height) if idx not in irrelevant_indices]

    # turn coordinates into (left, top, left+width, top+height) format
    actual_boxes = []
    for x, y, w, h in zip(left, top, width, height):
        actual_box = [x, y, x + w, y + h]
        actual_boxes.append(actual_box)

    # finally, normalize the bounding boxes
    normalized_boxes = []
    for box in actual_boxes:
        normalized_boxes.append(normalize_box(box, image_width, image_height))

    assert len(words) == len(normalized_boxes), "Not as many words as there are bounding boxes"

    return words, normalized_boxes

Let's look at the tesseract recognition result:

from PIL import ImageDraw
draw = ImageDraw.ImageDraw(pil_image)
import random
for b in actual_boxes:
    draw.rectangle(b, outline=(random.randint(0, 255), random.randint(0, 255), random.randint(0, 255)))
    
pil_image.show()

The recognition result is as follows:

As shown, tesseract returns the coordinates of each recognized word.
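The normalization at the end of apply_tesseract rescales pixel coordinates to a 0 to 1000 range so that boxes do not depend on image size. Below is a rough pure-Python sketch of that idea. The helper `to_normalized_box` is a made-up name that uses integer arithmetic; it is not the library's normalize_box, and the sample box and image size are invented.

```python
def to_normalized_box(box, image_width, image_height):
    """Scale a pixel box (left, top, right, bottom) to the 0-1000 range."""
    left, top, right, bottom = box
    return [
        1000 * left // image_width,
        1000 * top // image_height,
        1000 * right // image_width,
        1000 * bottom // image_height,
    ]


# A word at x=50..150, y=100..200 on a 500x1000 image.
print(to_normalized_box([50, 100, 150, 200], 500, 1000))  # [100, 100, 300, 200]
```

Horizontal coordinates are divided by the width and vertical ones by the height, so the same word gets the same box regardless of scan resolution.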

Note: image-level operations such as resize and reshape are ignored here.

[Figure: original image 1 (left) and the OCR result, OCR1 (right)]

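Text processing

This part corresponds to Preprocessing text data.

From the figure above the answer is T.F. Riehl. The subfinder function locates it in the source words at start_index=17 and end_index=18, and its exact position can be seen in the OCR1 figure.

Next, the tokenizer is given question, words (the raw OCR output) and boxes. Let's look at how this is implemented and what it is for.

encoding = tokenizer(example["question"], example["words"], example["boxes"])
tokenizer.decode(encoding["input_ids"])

Up to this point, what it does is simply encode to get input_ids, attention_mask and token_type_ids, as shown below: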


>>> self.decode(sanitized_tokens['input_ids'][0])
'[CLS] who is in cc in this letter? [SEP] wie baw brown & williamson tobacco corporation research & development internal correspondence to : r. h. honeycutt ce : t. f. riehl from :. c. j. cook date : may 8, 1995 subject : review of existing brainstorming ideas / 483 the major function of the product innovation graup is to develop marketable nove! products that would be profitable to manufacture and sell. novel is defined as : of a new kind, or different from anything seen or known before. innovation is defined as : something new or different introduced ; act of innovating ; introduction of new things or methods. the products may incorporate the latest technologies, materials and know - how available to give then a unique taste or look. the first task of the product innovation group was to assemble, review and categorize a list of existing brainstorming ideas. ideas were grouped into two major categories labeled appearance and taste / aroma. these categories are used for novel products that may differ from a visual and / or taste / aroma point of view compared to canventional cigarettes. other categories include a combination of the above, filters, packaging and brand extensions. appearance this category is used for novel cigarette constructions that yield visually different products with minimal changes in smoke chemistry two cigarettes in cne. emulti - plug te build yaur awn cigarette. eswitchable menthol or non menthol cigarette. * cigarettes with interspaced perforations to enable smoker to separate unburned section for future smoking. « short cigarette, tobacco section 30 mm. « extremely fast buming cigarette. « novel cigarette constructions that permit a significant reduction iretobacco weight while maintaining smoking mechanics and visual characteristics. higher basis weight paper : potential reduction in tobacco weight. « more rigid tobacco column ; stiffing agent for tobacco ; e. g. starch * colored tow and cigarette papers ; seasonal promotions, e. g. 
pastel colored cigarettes for easter or in an ebony and ivory brand containing a mixture of all black ( black paper and tow ) and ail white cigarettes. 499150498 [SEP]'

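However, this is also where the code starts to show how each bbox is aligned with the words.

The code is as follows:

The final result looks like this: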


>>> for input_id, box in zip(sanitized_tokens['input_ids'][0], sanitized_tokens['bbox'][0]):
>>>     print(self.decode(input_id),'\t', box)
    
[CLS] 	 [0, 0, 0, 0]
who 	 [0, 0, 0, 0]
is 	 [0, 0, 0, 0]
in 	 [0, 0, 0, 0]
cc 	 [0, 0, 0, 0]
in 	 [0, 0, 0, 0]
this 	 [0, 0, 0, 0]
letter 	 [0, 0, 0, 0]
? 	 [0, 0, 0, 0]
[SEP] 	 [1000, 1000, 1000, 1000]
wi 	 [455, 66, 502, 91]
##e 	 [455, 66, 502, 91]
ba 	 [455, 93, 503, 103]
##w 	 [455, 93, 503, 103]
brown 	 [296, 116, 348, 133]
& 	 [356, 121, 367, 129]
williamson 	 [372, 120, 470, 129]
tobacco 	 [475, 120, 547, 128]
corporation 	 [552, 118, 661, 127]
research 	 [372, 133, 452, 142]
& 	 [457, 133, 468, 141]
development 	 [473, 133, 585, 142]
internal 	 [623, 158, 691, 165]
correspondence 	 [694, 158, 823, 165]
to 	 [143, 200, 168, 215]
: 	 [143, 200, 168, 215]
r 	 [239, 201, 253, 211]
. 	 [239, 201, 253, 211]
h 	 [259, 201, 273, 211]
. 	 [259, 201, 273, 211]
honey 	 [279, 201, 351, 212]
##cut 	 [279, 201, 351, 212]
##t 	 [279, 201, 351, 212]
ce 	 [144, 224, 168, 245]
: 	 [144, 224, 168, 245]
t 	 [231, 224, 265, 244]
. 	 [231, 224, 265, 244]
f 	 [231, 224, 265, 244]
. 	 [231, 224, 265, 244]
ri 	 [267, 224, 307, 244]
##eh 	 [267, 224, 307, 244]
##l 	 [267, 224, 307, 244]
from 	 [145, 259, 193, 269]
: 	 [145, 259, 193, 269]
. 	 [211, 268, 212, 269]
c 	 [239, 259, 269, 268]
. 	 [239, 259, 269, 268]
j 	 [239, 259, 269, 268]
. 	 [239, 259, 269, 268]
cook 	 [276, 259, 313, 268]
date 	 [145, 285, 189, 302]

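Focus on lines 40~46 of the figure above: after the tokenizer splits a word into subwords, every subword is assigned the box of the original word. In the output, for example, `honey`, `##cut` and `##t` all share the box [279, 201, 351, 212]. This is the same as what I did earlier with LayoutXLM, which can be found here.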
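The alignment described above can be sketched in a few lines. This is only an illustration, not the notebook's code: `toy_wordpiece` and `align_boxes` are hypothetical names, and the toy splitter merely stands in for a real WordPiece tokenizer (so its pieces differ from the real `honey`, `##cut`, `##t`).

```python
def toy_wordpiece(word, max_len=5):
    """Hypothetical stand-in for WordPiece: chunks of max_len chars, '##' on continuations."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def align_boxes(words, boxes, tokenize=toy_wordpiece):
    """Repeat each word-level box once per subword produced from that word."""
    tokens, token_boxes = [], []
    for word, box in zip(words, boxes):
        pieces = tokenize(word)
        tokens.extend(pieces)
        token_boxes.extend([box] * len(pieces))
    return tokens, token_boxes


# Word-level OCR output: one box per word (values taken from the dump above).
words = ["honeycutt", "date"]
boxes = [[279, 201, 351, 212], [145, 285, 189, 302]]

tokens, token_boxes = align_boxes(words, boxes)
for token, box in zip(tokens, token_boxes):
    print(token, "\t", box)
```

The point is only that the box list ends up with the same length as the token list, which is what the model needs as input.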

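What remains is the encode_dataset function. Besides aligning tokens with boxes, its other job is to use the subfinder function to locate start_positions and end_positions, which serve as the labels.

At this point I roughly understand how the text is processed and how it is aligned with the boxes. One caveat about subfinder: if the answer cannot be found in words (that is, in the raw OCR text), that sample is effectively wasted.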
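To make the caveat concrete, here is a minimal sketch of an exact-match span search. `find_answer_span` is a hypothetical name and the behavior is my assumption of what such a helper does; the notebook's own subfinder may differ in details.

```python
def find_answer_span(words, answer_words):
    """Return (start, end) word indices of the first exact match, or None if absent."""
    n = len(answer_words)
    if n == 0:
        return None
    for start in range(len(words) - n + 1):
        if words[start:start + n] == answer_words:
            return start, start + n - 1
    return None


# OCR words from the letter above; note the OCR read "cc" as "ce".
words = "to : r. h. honeycutt ce : t. f. riehl".split()

hit = find_answer_span(words, ["t.", "f.", "riehl"])
miss = find_answer_span(words, ["t.", "f.", "riehi"])  # one OCR-style typo in the answer

print(hit)   # span found
print(miss)  # no span, so this sample would have no usable label
```

A single character of OCR noise is enough to lose the label, which is why exact matching discards samples.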

Model

The model, abbreviated, looks like this:

self
LayoutLMv2ForQuestionAnswering(
  (layoutlmv2): LayoutLMv2Model(
    (embeddings): LayoutLMv2Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (x_position_embeddings): Embedding(1024, 128)
      (y_position_embeddings): Embedding(1024, 128)
      (h_position_embeddings): Embedding(1024, 128)
      (w_position_embeddings): Embedding(1024, 128)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (visual): LayoutLMv2VisualBackbone(
      (backbone): FPN(
        (fpn_lateral2): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
        (fpn_output2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (fpn_lateral3): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
        (fpn_output3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (fpn_lateral4): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
        (fpn_output4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (fpn_lateral5): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
        (fpn_output5): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (top_block): LastLevelMaxPool()
        (bottom_up): ResNet(
          )
    )
    (visual_proj): Linear(in_features=256, out_features=768, bias=True)
    (visual_LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (visual_dropout): Dropout(p=0.1, inplace=False)
    (encoder): LayoutLMv2Encoder(
    
  )
  (qa_outputs): Linear(in_features=768, out_features=2, bias=True)
)

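I never quite figured out whether the visual backbone is a plain ResNet or a variant, so I will skip that here (LayoutLMv3 uses ViT instead).

That leaves one goal: seeing how the visual features and text features are fused.

This part actually left me confused, for example why a visual_bbox is generated. The rest (building the embeddings, the image branch, the transformer) is routine, so I will skip it for now.

This two-pointer (start/end) approach can handle some document question answering, but it struggles with things like tables, for example:

Q1: What is the name of the teacher of Class 1?
Q2: What are the names of the Chinese teacher and the math teacher of Class 1?

That is, when a table contains multiple answers, or a single question asks about several things at once, this kind of model cannot cope.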
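The limitation follows from how a start/end head is decoded. The `qa_outputs` layer above produces two logits per token, and decoding picks one best (start, end) pair. The sketch below is illustrative only: `best_span`, the tokens and the logit values are all made up.

```python
def best_span(start_logits, end_logits, max_len=10):
    """Return the single (start, end) pair with the highest summed score, start <= end."""
    best = None
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best[1], best[2]


# Made-up example: two plausible answers ("zhang" and "li") in one flattened table row.
tokens = ["chinese", ":", "zhang", ";", "math", ":", "li"]
start_logits = [0.1, 0.0, 2.0, 0.0, 0.2, 0.0, 1.9]
end_logits = [0.0, 0.1, 2.1, 0.0, 0.1, 0.0, 2.0]

s, e = best_span(start_logits, end_logits)
print(tokens[s:e + 1])  # only one contiguous span comes out
```

Even though both names score highly, only one contiguous span is returned, so the second answer is dropped.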


Multimodal: Caption Generation

2023-10-16

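Introduction

The previous post, Multimodal: CLIP, covered how text and images are matched in CLIP. This post covers how to generate captions from an image, i.e. Image Captioning, which is an image-to-text task.

The overall pipeline uses the vision_encoder_decoder architecture from transformers: ViT serves as the image encoder and gpt2 as the text decoder. You can of course swap in other models. The overall architecture is shown in the figure below.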

References

1. zero_nlp vit-gpt2-image-chinese-captioning
2. The Illustrated Image Captioning using transformers


Multimodal: CLIP

2023-10-11

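Question

How is multimodal fusion done? This post records my understanding of the CLIP model.

Prerequisites

There are open-source Chinese versions available, such as Chinese-CLIP and the Taiyi series from IDEA/Fengshenbang-LM. This post uses Chinese-CLIP to walk through the pipeline.

The dataset is wukong-dataset, and the experiments use the pretrained model chinese-clip-vit-base-patch16.

Workflow

1. Text processing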

import pandas as pd
import torch
from PIL import Image
from datasets import Dataset
from transformers import ChineseCLIPProcessor, ChineseCLIPModel, Trainer, TrainingArguments

model_name_or_path = "chinese-clip-vit-base-patch16"

model = ChineseCLIPModel.from_pretrained(model_name_or_path)
processor = ChineseCLIPProcessor.from_pretrained(model_name_or_path)

text_str = ['中国', '哈哈哈，我在这里']
text_res = processor(text=text_str)
print(text_res)

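According to the official Chinese-CLIP documentation, the text encoder uses chinese-roberta-wwm. Another way to verify this is that the md5 of its vocab.txt is identical to that of chinese-roberta-wwm. So text processing simply relies on a Chinese BERT for Chinese support, and that is all there is to this part.

2. Image processing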

import pandas as pd
import torch
from PIL import Image
from datasets import Dataset
from transformers import ChineseCLIPProcessor, ChineseCLIPModel, Trainer, TrainingArguments

model_name_or_path = "chinese-clip-vit-base-patch16"

model = ChineseCLIPModel.from_pretrained(model_name_or_path)
processor = ChineseCLIPProcessor.from_pretrained(model_name_or_path)

image_str = '00010405-0083.jpg'
image_input = Image.open(image_str)
img_res = processor(images=image_input)
print(img_res)

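The pipeline is shown below: convert to RGB, resize, rescale, normalize, then convert to CHW channel order.

Note that the resize step scales the image to (224, 224), which matters for the later steps.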
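As a rough illustration of the rescale, normalize and CHW steps (resize and RGB conversion are left out), here is a pure-Python sketch on a tiny 2x2 image. The mean and std values are assumed to be the OpenAI CLIP defaults; check the checkpoint's preprocessor config for the values actually used. `rescale_normalize_chw` is a hypothetical name.

```python
# Assumed normalization constants (OpenAI CLIP defaults), one value per RGB channel.
MEAN = [0.48145466, 0.4578275, 0.40821073]
STD = [0.26862954, 0.26130258, 0.27577711]


def rescale_normalize_chw(image_hwc):
    """Map uint8 HWC pixels to normalized floats in CHW order."""
    height, width = len(image_hwc), len(image_hwc[0])
    chw = []
    for c in range(3):
        channel = [
            [(image_hwc[y][x][c] / 255.0 - MEAN[c]) / STD[c] for x in range(width)]
            for y in range(height)
        ]
        chw.append(channel)
    return chw


# A 2x2 RGB image in HWC layout: red, green / blue, white.
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

out = rescale_normalize_chw(image)
print(len(out), len(out[0]), len(out[0][0]))  # channels, height, width
```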

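3. Fusion

Now for the key part.

The overall flow is shown below.

3.1 vision model

The embedding is obtained inside vision_model as follows.

After the conv2d, patch_embeds has shape torch.Size([10, 768, 14, 14]), where 14 = (224 - 16) / 16 + 1. Next, line 201 of the code is:

patch_embeds = patch_embeds.flatten(2).transpose(1, 2)

Since 14 * 14 = 196, the result has shape (10, 196, 768), and at this point everything is clear.

However, after patch_embeds a class_embeds is introduced. This plays a role similar to the [CLS] position in BERT and is used for downstream classification.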
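The shape arithmetic above can be checked with a few lines of plain Python. This is just bookkeeping on tuples, not the model code, and `conv_out_size` is a hypothetical helper name.

```python
def conv_out_size(size, kernel_size, stride, padding=0):
    """Output length of a convolution along one axis."""
    return (size - kernel_size + 2 * padding) // stride + 1


batch, hidden = 10, 768
image_size, patch_size = 224, 16

# Patch embedding is a conv2d whose kernel size and stride both equal the patch size.
grid = conv_out_size(image_size, patch_size, patch_size)

conv_shape = (batch, hidden, grid, grid)       # after conv2d
flat_shape = (batch, grid * grid, hidden)      # after flatten(2).transpose(1, 2)
with_cls = (batch, grid * grid + 1, hidden)    # after prepending the class embedding

print(conv_shape, flat_shape, with_cls)
```

The extra class position is why the vision last_hidden_state has 197 positions rather than 196.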

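Once the vision embedding is available, what follows is the encoder, which here is ChineseCLIPVisionEncoder. This post skips that part for now.

3.2 text model

This is just the BERT pipeline. The authors state this clearly: it is the standard BERT setup.

3.3 Fusion

At this point we have vision_outputs and text_outputs. vision_outputs is:

last_hidden_state=(10, 197, 768) pooler_output=(10, 768)

text_outputs is:

last_hidden_state=(10, 64, 768)

A curiosity: both sides are already 768-dimensional at this point, so why does each still pass through self.visual_projection and self.text_projection to be mapped to 512 dimensions?

3.4 Computing the loss

The procedure is quite similar to how SimCSE computes its loss, but there is an interesting detail in the loss computation here: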


def chinese_clip_loss(similarity: torch.Tensor) -> torch.Tensor:
    caption_loss = contrastive_loss(similarity)
    image_loss = contrastive_loss(similarity.t())
    return (caption_loss + image_loss) / 2.0

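Normally computing the loss once would be enough. Here it is computed separately in both directions (on the similarity matrix and on its transpose) and then averaged, which is an interesting detail.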
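To see why the two directions are not redundant, here is a pure-Python sketch of the same idea. It assumes contrastive_loss is a cross-entropy where row i's correct class is column i (matching pairs sit on the diagonal); `cross_entropy_diag`, `transpose` and `clip_loss` are hypothetical names, and the similarity values are made up.

```python
import math


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def clip_loss(similarity):
    """Average of the row-wise and column-wise contrastive losses."""
    caption_loss = cross_entropy_diag(similarity)
    image_loss = cross_entropy_diag(transpose(similarity))
    return (caption_loss + image_loss) / 2.0


# Rows are texts, columns are images; entry [i][j] is the similarity of text i and image j.
similarity = [[5.0, 4.0], [0.0, 2.0]]

print(cross_entropy_diag(similarity))             # each text picks its image
print(cross_entropy_diag(transpose(similarity)))  # each image picks its text
print(clip_loss(similarity))
```

For this matrix the two directional losses differ, because softmax normalizes over rows in one case and over columns in the other. Averaging them penalizes a mismatch in either direction.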

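That roughly completes the overall model flow. The model can be used to retrieve images from text.

# Memo: new_steps = (steps - kernel_size + 2 * padding) / strides + 1

So is there an approach that aligns text and images at the semantic level? I will leave that for later.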
