[AI LLM Interview Questions] Interviewer: What methods are there for extrapolating beyond the context length in LLM inference?

bugyinyin · Published 2025-01-18 10:56:59

Answer: the main approaches are ALiBi, position interpolation (PI), NTK-aware methods, and YaRN.

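In more detail:
Length extrapolation, also called length generalization, studies how to pretrain on relatively short sequences and still generalize to longer ones at inference time. It remains an open problem for Transformers. Good extrapolation means that when the model is applied to very long sequences, the relevant metrics do not drop noticeably and the model stays robust.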

The classic lines of work on length generalization, mostly from the last two years, are as follows:

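ALiBi (Attention with Linear Biases), direct extrapolation. ALiBi adds a fixed, non-learnable bias to the attention scores before the softmax: for query position i, the weights become softmax(q_i K^T + m · [-(i-1), ..., -2, -1, 0]), so each key is penalized in proportion to its distance from the query. Here m is a predefined, head-specific slope. With 8 heads, the slopes are 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256. The slope shrinks from head to head, so heads with a large m focus sharply on nearby tokens, while heads with a small m decay slowly and can look further back.

Because the linear bias suppresses distant tokens, a single ALiBi attention layer cannot effectively capture long-range dependencies. Unlike standard self-attention, it behaves much like local attention. It can still extrapolate because stacking attention layers gradually widens the range the model perceives. In other words, ALiBi's ability to sense distant information depends on network depth, and that ability is limited, because the receptive field grows only linearly with the number of layers.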
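
The ALiBi bias described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function names `alibi_slopes` and `alibi_bias` are hypothetical, and the slope rule assumes the number of heads is a power of two.

```python
def alibi_slopes(num_heads):
    """Geometric slopes; for 8 heads this gives 1/2, 1/4, ..., 1/256.

    Assumption: num_heads is a power of two.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(slope, seq_len):
    """Causal bias matrix for one head.

    bias[i][j] = -slope * (i - j) for keys j <= i (a linear distance penalty);
    future keys (j > i) are masked with -inf.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(slopes[0], 4)  # head 0 uses the steepest slope, 1/2
for row in bias:
    print(row)
```

The bias is added to the raw attention scores before the softmax, so a head with a steep slope effectively attends only to a short local window.
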
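PI (Position Interpolation). The positions of the long test text are multiplied by the factor L_train / L_test so that they are scaled back into the range seen during training: (1) training: (1, 2, 3, 4, ..., n); (2) testing: (1, 2, 3, 4, ..., n, ..., 2n) -> (0.5, 1, 1.5, ..., n) [achieved by interpolation]. PI avoids out-of-range positions for distant tokens, but it also compresses the distance between adjacent tokens, which can seriously hurt the model's local resolution and shows up as problems such as a higher PPL (perplexity). Studies nevertheless show that after fine-tuning on ordinary text, PI still works well. Overall, this amounts to scaling m in the positional term sin(m · base^{-2i/d}) (sin is only an example here, and base is 10000). For example, a position that was originally m = 1 becomes 0.5 after compression.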

def _compute_linear_scaling_rope_parameters(
    config: Optional[PretrainedConfig] = None,
    device: Optional["torch.device"] = None,
    seq_len: Optional[int] = None,
    **rope_kwargs,
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with linear scaling. Credits to the Reddit user /u/kaiokendev
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    if config is not None and len(rope_kwargs) > 0:
        raise ValueError(
            "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
            f"`_compute_linear_scaling_rope_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
        )
    if len(rope_kwargs) > 0:
        factor = rope_kwargs["factor"]
    elif config is not None:
        factor = config.rope_scaling["factor"]

    # Gets the default RoPE parameters
    inv_freq, attention_factor = _compute_default_rope_parameters(config, device, seq_len, **rope_kwargs)

    # Then applies linear scaling to the frequencies.
    # NOTE: originally, scaling was applied to the position_ids. However, we get `embs = inv_freq @ position_ids`, so
    # applying scaling to the inverse frequencies is equivalent.
    inv_freq /= factor
    return inv_freq, attention_factor
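
The note in the code above says that scaling the position ids and scaling the inverse frequencies are equivalent. A minimal pure-Python sketch of that equivalence follows. The helper names `rope_inv_freq` and `rope_angles`, and the sizes used (dim = 8, training length 2048, test length 4096), are illustrative assumptions and are not part of transformers.

```python
import math


def rope_inv_freq(dim, base=10000.0):
    """Inverse frequencies base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_angles(position, inv_freq):
    """Rotation angle of every frequency pair at a given position."""
    return [position * f for f in inv_freq]


dim, train_len, test_len = 8, 2048, 4096
factor = test_len / train_len  # PI compresses positions by this factor

inv_freq = rope_inv_freq(dim)
m = 3000  # a position beyond the training length

# Option A: interpolate the position (m -> m / factor).
angles_scaled_position = rope_angles(m / factor, inv_freq)
# Option B: keep the position and divide the inverse frequencies instead.
angles_scaled_freq = rope_angles(m, [f / factor for f in inv_freq])

same = all(
    math.isclose(a, b) for a, b in zip(angles_scaled_position, angles_scaled_freq)
)
print(same)
```

Either way, the largest test position maps back inside the trained range, which is why PI never produces unseen rotation angles.
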

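NTK-aware family: NTK-aware Scaled RoPE. So many large models are now built on RoPE that a lot of extrapolation work targets this positional encoding. NTK-aware Scaled RoPE was proposed by a community member, and the idea is simple: change the base in sin(m · base^{-2i/d}) above, so that the term becomes sin(m · (base · α)^{-2i/d}) (the code below additionally raises α to the power d/(d-2)). According to the analysis by Su Jianlin and others, this is similar to a change of radix, converting base-b digits into base-(λb) digits, where λ = k^(2/d). In theory, this achieves extrapolation at high frequencies and interpolation at low frequencies (for what high frequency means here, see reference 2). Su Jianlin further argues that spreading the scaling evenly across dimensions, as NTK-aware Scaled RoPE does, is not optimal: the later dimensions (for example, dims 35 to 64) should absorb more of it and the earlier dimensions (dims 10 to 35) less, which leads to a mixed-radix scheme (not covered in detail here). The code is as follows: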

import transformers

# Keep a handle on the original constructor so the patch can delegate to it.
# NOTE: this assumes an older transformers release in which
# LlamaRotaryEmbedding.__init__ takes (dim, max_position_embeddings, base, device).
old_init = transformers.models.llama.modeling_llama.LlamaRotaryEmbedding.__init__


def ntk_scaled_init(self, dim, max_position_embeddings=2048, base=10000, device=None):
    # The method is just these three lines
    max_position_embeddings = 16384
    a = 8  # Alpha value
    base = base * a ** (dim / (dim - 2))  # Base change formula
    old_init(self, dim, max_position_embeddings, base, device)


transformers.models.llama.modeling_llama.LlamaRotaryEmbedding.__init__ = ntk_scaled_init
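
To see why this base change gives "high-frequency extrapolation, low-frequency interpolation", the sketch below compares the inverse frequencies before and after the change, without any dependency on transformers. The function name `ntk_inv_freq` and the values dim = 128, alpha = 8 are assumptions chosen for illustration.

```python
import math


def ntk_inv_freq(dim, base=10000.0, alpha=1.0):
    """RoPE inverse frequencies after the NTK-aware base change.

    alpha = 1 reproduces plain RoPE.
    """
    scaled_base = base * alpha ** (dim / (dim - 2))
    return [scaled_base ** (-2.0 * i / dim) for i in range(dim // 2)]


dim, alpha = 128, 8.0
original = ntk_inv_freq(dim)
scaled = ntk_inv_freq(dim, alpha=alpha)

# Ratio of new to old frequency for every dimension pair.
ratios = [s / o for s, o in zip(scaled, original)]
print(ratios[0], ratios[-1])
```

The first (highest-frequency) pair keeps its frequency, so it is extrapolated as is, while the last (lowest-frequency) pair is slowed down by exactly 1/alpha, which matches position interpolation with factor alpha. The exponent d/(d-2) is what makes the last pair land exactly on 1/alpha.
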

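Dynamically Scaled RoPE: here is the implementation from transformers, located at https://github.com/huggingface/transformers/blob/10feacd88aef9569e240b7e3833ab32b297e4460/src/transformers/modeling_rope_utils.py#L112. For dynamic NTK, a hyperparameter α is set, and the scaling applied is (α * current sequence length / original model context length) - (α - 1), so the scaling grows dynamically as the sequence gets longer. The following code shows how the base is scaled: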

def _compute_dynamic_ntk_parameters(
    config: Optional[PretrainedConfig] = None,
    device: Optional["torch.device"] = None,
    seq_len: Optional[int] = None,
    **rope_kwargs,
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length, used to update the dynamic RoPE at inference time.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    # TODO (joao): use the new `original_max_position_embeddings` from rope_scaling
    if config is not None and len(rope_kwargs) > 0:
        raise ValueError(
            "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
            f"`_compute_dynamic_ntk_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
        )
    if len(rope_kwargs) > 0:
        base = rope_kwargs["base"]
        dim = rope_kwargs["dim"]
        max_position_embeddings = rope_kwargs["max_position_embeddings"]
        factor = rope_kwargs["factor"]
    elif config is not None:
        base = config.rope_theta
        partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
        head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
        dim = int(head_dim * partial_rotary_factor)
        max_position_embeddings = config.max_position_embeddings
        factor = config.rope_scaling["factor"]

    attention_factor = 1.0  # Unused in this type of RoPE

    # seq_len: default to max_position_embeddings, e.g. at init time
    seq_len = seq_len if seq_len is not None and seq_len > max_position_embeddings else max_position_embeddings

    # Compute the inverse frequencies (core code)
    base = base * ((factor * seq_len / max_position_embeddings) - (factor - 1)) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).float().to(device) / dim))
    return inv_freq, attention_factor
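
The base update in the function above can be isolated into a small pure-Python helper to see how it behaves as the sequence grows. This is only a sketch: `dynamic_ntk_base` is a hypothetical name, and the values base = 10000, dim = 128, factor = 2 and an original context of 2048 are assumptions for illustration.

```python
def dynamic_ntk_base(base, dim, factor, seq_len, max_position_embeddings):
    """Base used by dynamic NTK scaling for the current sequence length."""
    # Sequences inside the original context leave the base untouched.
    seq_len = max(seq_len, max_position_embeddings)
    scale = (factor * seq_len / max_position_embeddings) - (factor - 1)
    return base * scale ** (dim / (dim - 2))


for n in (1024, 2048, 4096, 8192):
    print(n, dynamic_ntk_base(10000.0, 128, 2.0, n, 2048))
```

Up to the original context length the scale term is exactly 1, so the model behaves like plain RoPE; past that point the base grows with the sequence length, which is what "dynamic" refers to.
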

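Several methods proposed by Su Jianlin, including ReRoPE and Leaky ReRoPE. The main idea is to combine extrapolation and interpolation: choose a window size w, use a position interval of 1 inside the window and an interval of 1/k outside it (Leaky ReRoPE). In the limiting case k → ∞ (ReRoPE), the position never exceeds w no matter how long the input is, so this approach can support a context of arbitrary length. The cost is extra computation, because attention is computed once with the in-window positions and once with the out-of-window positions before the two are merged (that is a topic for later). In addition, Su Jianlin proposed a log n scaling factor: at test time each q is multiplied by log n to stabilize the distribution of attention scores. This may look like a trick, but it has a theoretical justification and works well. [See his blog posts via reference 2.]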
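
The position mapping described above can be sketched as follows. The function names are hypothetical, and the exact form of the log n factor (normalizing by the log of the training length and clamping at 1) is an assumption about one reasonable formulation, not a quotation of the original code.

```python
import math


def leaky_rerope_position(rel_pos, window, k):
    """Leaky ReRoPE: step 1 inside the window, step 1/k beyond it."""
    if rel_pos < window:
        return float(rel_pos)
    return window + (rel_pos - window) / k


def rerope_position(rel_pos, window):
    """ReRoPE: the k -> infinity limit, positions are clipped at the window."""
    return min(rel_pos, window)


def log_n_scale(n, train_len):
    """Assumed form of the log n factor applied to a query at position n."""
    return max(1.0, math.log(n) / math.log(train_len))


print(leaky_rerope_position(672, 512, 16))
print(rerope_position(100000, 512))
print(log_n_scale(2048 ** 2, 2048))
```

With ReRoPE the mapped relative position is bounded by the window for any input length, which is the property that lets it handle arbitrarily long contexts.
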
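YaRN consists of two parts. The first is a flexible radix design; put simply, different dimensions use different radices, following the analogy in which a token position is a number and each dimension of the rotary position vector encodes one of its digits. The low dimensions (the highest frequencies) are extrapolated, meaning the rotation angle is unchanged. The middle dimensions are interpolated progressively, meaning the angle is gradually reduced, which is equivalent to gradually enlarging the radix. The high dimensions (the lowest frequencies) are fully interpolated, which in the cited example amounts to making the radix 40 times the original. [See reference 7.] Consistent with the code listed further down, the calculation is: θ'_i = (1 - γ_i) · θ_i / s + γ_i · θ_i, where θ_i = base^{-2i/d}, s is the scaling factor, and γ_i ramps from 1 (pure extrapolation) on the low dimensions down to 0 (pure interpolation) on the high dimensions.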

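The second part builds on the observation that introducing a temperature t on the logits before the softmax affects perplexity uniformly, regardless of the data sample or the token position within the extended context window. More precisely, the attention weights are computed as softmax(q_m^T k_n / (t · sqrt(d))), where t is the temperature parameter. According to the paper, YaRN surpasses all earlier methods in both fine-tuned and non-fine-tuned settings. Because its footprint is small, it is directly compatible with libraries that modify the attention mechanism, such as Flash Attention 2 [13]. After fine-tuning on less than 0.1% of the original pretraining data, YaRN reaches state-of-the-art performance in context window extension. Combined with the dynamic scaling inference technique, the resulting Dynamic-YaRN allows more than 2x context window extension without any fine-tuning. The code is at the same link as above and is also listed here: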

def _compute_yarn_parameters(
    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None, **rope_kwargs
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with NTK scaling. Please refer to the
    [original paper](https://arxiv.org/abs/2309.00071)
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin.
    """
    # No need to keep BC with yarn, unreleased when this new pattern was created.
    if len(rope_kwargs) > 0:
        raise ValueError(
            f"Unexpected arguments: `**rope_kwargs` should be unset in `_compute_yarn_parameters`, got {rope_kwargs}"
        )

    base = config.rope_theta
    partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    dim = int(head_dim * partial_rotary_factor)
    max_position_embeddings = config.max_position_embeddings
    factor = config.rope_scaling["factor"]

    # Sets the attention factor as suggested in the paper
    attention_factor = config.rope_scaling.get("attention_factor")
    if attention_factor is None:
        attention_factor = 0.1 * math.log(factor) + 1.0

    # Optional config options
    # beta_fast/beta_slow: as suggested in the paper, default to 32/1 (correspondingly)
    beta_fast = config.rope_scaling.get("beta_fast") or 32
    beta_slow = config.rope_scaling.get("beta_slow") or 1

    # Compute the inverse frequencies
    def find_correction_dim(num_rotations, dim, base, max_position_embeddings):
        """Inverse dimension formula to find the dimension based on the number of rotations"""
        return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))

    def find_correction_range(low_rot, high_rot, dim, base, max_position_embeddings):
        """Find dimension range bounds based on rotations"""
        low = math.floor(find_correction_dim(low_rot, dim, base, max_position_embeddings))
        high = math.ceil(find_correction_dim(high_rot, dim, base, max_position_embeddings))
        return max(low, 0), min(high, dim - 1)

    def linear_ramp_factor(min, max, dim):
        if min == max:
            max += 0.001  # Prevent singularity

        linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
        ramp_func = torch.clamp(linear_func, 0, 1)
        return ramp_func

    # Note on variable naming: "interpolation" comes from the original technique, where we interpolate the position IDs
    # to expand the possible context length. In other words, interpolation = apply scaling factor.
    pos_freqs = base ** (torch.arange(0, dim, 2).float().to(device) / dim)
    inv_freq_extrapolation = 1.0 / pos_freqs
    inv_freq_interpolation = 1.0 / (factor * pos_freqs)

    low, high = find_correction_range(beta_fast, beta_slow, dim, base, max_position_embeddings)

    # Get n-dimensional rotational scaling corrected for extrapolation
    inv_freq_extrapolation_factor = 1 - linear_ramp_factor(low, high, dim // 2).float().to(device)
    inv_freq = (
        inv_freq_interpolation * (1 - inv_freq_extrapolation_factor)
        + inv_freq_extrapolation * inv_freq_extrapolation_factor
    )

    return inv_freq, attention_factor
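
For readers without torch at hand, the blend computed by the function above can be mirrored in plain Python. This is a simplified sketch that follows the same formulas; the names `yarn_inv_freq` and `yarn_attention_factor` and the example values (dim = 128, base = 10000, factor = 4, original context 4096) are assumptions for illustration.

```python
import math


def yarn_inv_freq(dim, base, factor, max_pos, beta_fast=32, beta_slow=1):
    """Blend extrapolated and interpolated RoPE frequencies per dimension pair."""

    def correction_dim(num_rotations):
        # Dimension index at which a pair completes num_rotations over max_pos.
        return (dim * math.log(max_pos / (num_rotations * 2 * math.pi))) / (
            2 * math.log(base)
        )

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp

    inv_freq = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        extrapolation_weight = 1.0 - ramp  # 1 on low dims, 0 on high dims
        inv_freq.append(
            (theta / factor) * (1.0 - extrapolation_weight)
            + theta * extrapolation_weight
        )
    return inv_freq


def yarn_attention_factor(factor):
    """Scaling applied to cos/sin, as in the function above."""
    return 0.1 * math.log(factor) + 1.0


dim, base, factor, max_pos = 128, 10000.0, 4.0, 4096
inv_freq = yarn_inv_freq(dim, base, factor, max_pos)
plain = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
print(inv_freq[0] / plain[0], inv_freq[-1] / plain[-1])
print(yarn_attention_factor(factor))
```

The first pair keeps its original frequency and the last pair is divided by the full factor, with a linear ramp in between; the attention factor is 1 when no scaling is applied and grows slowly with the factor.
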