Ultra-Detailed! Dissecting the Key Points of Self-Attention from the Source Code

Original · Hai Chenwei (海晨威) · PaperWeekly · 2022-03-17

©PaperWeekly Original · Author | Hai Chenwei (海晨威)

Affiliation | Master's student, Tongji University

Research area | Natural language processing

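Transformer and BERT have become standard building blocks in NLP, and Self-Attention is the core of both. This article digs into the details of Self-Attention through a Q&A and an annotated source-code walkthrough.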

Q&A

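1. What is the core idea of Self-Attention?

The core idea of Self-Attention is to use the other words in a text to enrich the semantic representation of a target word, so that contextual information is put to better use.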

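2. How is the time complexity of Self-Attention computed?

The time complexity of Self-Attention is $O(n^2 \cdot d)$, where n is the sequence length and d is the embedding dimension; the batch dimension is ignored.

Self-Attention consists of three steps: similarity computation, softmax, and weighted averaging.

Their respective time complexities are:

Similarity computation can be viewed as multiplying two matrices of sizes $(n, d)$ and $(d, n)$: $(n, d) \times (d, n)$ costs $O(n^2 \cdot d)$ and yields an $(n, n)$ matrix.

Softmax is computed directly over that matrix, with time complexity $O(n^2)$.

Weighted averaging can be viewed as multiplying two matrices of sizes $(n, n)$ and $(n, d)$: $(n, n) \times (n, d)$ costs $O(n^2 \cdot d)$ and yields an $(n, d)$ matrix.

Therefore, the time complexity of Self-Attention is $O(n^2 \cdot d)$.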
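The three-step count above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `attention_cost` and the choice to count one unit of work per scalar multiply-add (and one per softmax entry) are assumptions made for this example, not part of the original analysis.

```python
def attention_cost(n: int, d: int) -> dict:
    """Count unit operations for single-head self-attention (no batch dim)."""
    similarity = n * n * d      # (n, d) @ (d, n) -> (n, n)
    softmax = n * n             # one unit of work per score entry
    weighted_avg = n * n * d    # (n, n) @ (n, d) -> (n, d)
    return {
        "similarity": similarity,
        "softmax": softmax,
        "weighted_average": weighted_avg,
        "total": similarity + softmax + weighted_avg,
    }


cost = attention_cost(n=512, d=768)

# Doubling the sequence length quadruples the total, as O(n^2 * d) predicts.
ratio = attention_cost(n=1024, d=768)["total"] / cost["total"]
```

The two matrix products dominate; the softmax term is smaller by a factor of d and disappears in the big-O bound.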

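A word on Multi-Head Attention in the Transformer: put simply, it is a combination of several Self-Attention operations, and its role is similar to that of multiple kernels in a CNN.

The heads are not computed one at a time in a loop; instead, they are handled with transposes and reshapes and computed with matrix multiplication.

"In practice, the multi-headed attention are done with transposes and reshapes rather than actual separate tensors." (from a comment in the Google BERT source code)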

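Transformer/BERT reshape and split the dimension d, that is, hidden_size/embedding_size; see Google's TF source code or the PyTorch source code below:

hidden_size (d) = num_attention_heads (m) * attention_head_size (a), that is, d = m * a.

The num_attention_heads dimension is then transposed to the front, so that Q and K both have shape (m, n, a), again ignoring the batch dimension.

The dot product can then be viewed as multiplying two tensors of sizes (m, n, a) and (m, a, n), which yields an (m, n, n) tensor. This is equivalent to m heads, and the time complexity is $O(n^2 \cdot m \cdot a) = O(n^2 \cdot d)$.

For an analysis of the time complexity of tensor multiplication, see: time complexity analysis of matrix and tensor multiplication [1].

The time complexity of Multi-Head Attention is therefore also $O(n^2 \cdot d)$. In practice, tensor multiplication can be accelerated, so the actual cost is somewhat lower.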
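To make the "transposes and reshapes instead of a loop" point concrete, here is a minimal NumPy sketch (NumPy rather than PyTorch, so it runs without a deep-learning framework). The sizes and the helper name `split_heads` are assumptions chosen for illustration. It checks that one batched product over shape (m, n, a) gives the same scores as an explicit loop over heads.

```python
import numpy as np

n, d, m = 6, 16, 4          # sequence length, hidden size, number of heads
a = d // m                  # attention_head_size, so d = m * a

rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))


def split_heads(x):
    # (n, d) -> (n, m, a) -> (m, n, a)
    return x.reshape(n, m, a).transpose(1, 0, 2)


# Batched product: (m, n, a) @ (m, a, n) -> (m, n, n)
scores_batched = split_heads(Q) @ split_heads(K).transpose(0, 2, 1)

# Reference: loop over heads, each head owning a contiguous block of a columns
scores_loop = np.stack([
    Q[:, h * a:(h + 1) * a] @ K[:, h * a:(h + 1) * a].T
    for h in range(m)
])

same = np.allclose(scores_batched, scores_loop)
mult_adds = m * n * n * a   # multiply-adds for the batched product
```

Because d = m * a, the m heads together cost m * n^2 * a = n^2 * d multiply-adds, the same order as a single head over the full dimension.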

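3. Setting multi-head aside, what happens if the word vectors in self-attention are not multiplied by the Q/K/V parameter matrices?

Every attention mechanism can be explained with the same query/key/value pattern. For self-attention, people usually say q=k=v, but this only means that the three come from the same underlying vector. In the actual computation they differ, because each is multiplied by its own Q, K, or V parameter matrix. Without that multiplication, the q, k, and v of each word would be exactly identical.

In self-attention, every word in the sequence takes a dot product with every word in the sequence, itself included, to compute similarity.

When the vectors are of comparable magnitude, the dot product of qi with ki tends to be the largest (by analogy with "for two numbers with a fixed sum, the product is largest when the two are equal").

In the weighted average after softmax, the word itself would then carry the largest weight and leave little for the other words, so the context could not be used effectively to enrich the representation of the current word.

Multiplying by the Q/K/V parameter matrices makes the q, k, and v of each word differ, which largely mitigates this effect.

The Q/K/V parameter matrices also make multi-head attention possible, which, like multiple kernels in a CNN, captures richer features and information.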
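The self-dominance argument can be illustrated numerically. The sketch below makes a simplifying assumption: every word vector is normalized to unit length, which is one way to read "comparable magnitude". Under that assumption the Cauchy-Schwarz inequality guarantees that each row of the unprojected score matrix peaks on its own diagonal entry. The random matrices `Wq` and `Wk` are stand-ins for learned parameters, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 32
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # every token has unit length

# Without projections q = k = x, so the score matrix is X @ X.T
scores = X @ X.T
self_is_max = bool(np.all(np.argmax(scores, axis=1) == np.arange(n)))

# With projections (random here, learned in practice) q and k differ,
# so nothing forces the diagonal to be the row maximum any more.
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
scores_proj = (X @ Wq) @ (X @ Wk).T
```

With unit-length vectors every diagonal entry equals 1 and every off-diagonal entry is strictly smaller, so `self_is_max` is true; for `scores_proj` no such guarantee holds.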

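4. Conventional attention usually has k=v. Can self-attention do the same?

Self-attention is just a special case of attention, so k=v is fine; this means the K and V parameter matrices are the same.

Extending this to Multi-Head Attention: multiplying by the Q and K parameter matrices already ensures that the heads differ from one another. After the dot product of q and k followed by softmax gives the similarities, multiplying by a v equal to k seems, from the viewpoint of conventional attention, the more natural choice.

In Transformer/BERT, fully independent Q, K, and V parameter matrices increase the capacity and expressive power of the model.

Still, I believe a parameter scheme with Q and K=V would also work: it reduces the number of parameters without affecting the multi-head implementation.

Note that I have not tested this idea experimentally; it is a personal opinion, offered for reference only.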
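As a rough illustration of the Q, K=V idea, the NumPy sketch below counts the projection parameters saved and confirms that the multi-head reshape still goes through. Everything here is hypothetical: the names `Wkv`, `linear_params`, and `split_heads`, the BERT-base-like sizes, and the omission of bias terms in the forward pass are assumptions for this example. It says nothing about model quality, which would need real experiments.

```python
import numpy as np


def linear_params(d_in, d_out):
    # weight matrix plus bias, as in a standard linear layer
    return d_in * d_out + d_out


d, m, n = 768, 12, 4                      # hidden size, heads, sequence length
a = d // m
separate_qkv = 3 * linear_params(d, d)    # query, key, value projections
shared_kv = 2 * linear_params(d, d)       # query, and key reused as value

rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
Wq = rng.standard_normal((d, d)) * 0.02
Wkv = rng.standard_normal((d, d)) * 0.02  # hypothetical shared K/V projection


def split_heads(t):
    return t.reshape(n, m, a).transpose(1, 0, 2)    # (n, d) -> (m, n, a)


q = split_heads(x @ Wq)
k = v = split_heads(x @ Wkv)                        # k and v are the same tensor
scores = q @ k.transpose(0, 2, 1) / np.sqrt(a)      # (m, n, n)
scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
probs = np.exp(scores)
probs /= probs.sum(axis=-1, keepdims=True)
context = (probs @ v).transpose(1, 0, 2).reshape(n, d)   # back to (n, d)
```

Sharing K and V drops one of the three projections, so the attention projections shrink by one third, while the per-head split and the output shape are unchanged.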

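Source Code

Across the whole Transformer/BERT codebase, the (Multi-Head Scaled Dot-Product) Self-Attention part is comparatively the most complex, and it is also the essence of Transformer/BERT. A PyTorch implementation [2] is given here, with comments and shape annotations added to the important lines.

Everything is in the code, which has three main parts:

Initialization: the number of heads, the size of each head, and the three Q/K/V parameter matrices.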

import math

import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, config):
        super(SelfAttention, self).__init__()
        if config.hidden_size % config.num_attention_heads != 0:
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads))
        # In Transformer/BERT, all_head_size equals config.hidden_size.
        # This is presumably a simplification that keeps the dimension consistent
        # from the embedding to the final output, so that the concatenated
        # attention heads still have dimension config.hidden_size.
        # attention_head_size is the dimension of each head; it must divide evenly.
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = config.hidden_size // config.num_attention_heads
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        # The three parameter matrices
        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

transposes and reshapes: this function reshapes the q, k, and v tensors from [batch_size * seq_length * hidden_size] to [batch_size * num_attention_heads * seq_length * attention_head_size], in preparation for Multi-Head Attention.

    def transpose_for_scores(self, x):
        """
        shape of x: batch_size * seq_length * hidden_size
        This splits hidden_size into self.num_attention_heads * self.attention_head_size
        and then swaps the seq_length and num_attention_heads dimensions.
        Why this step: attention takes the dot product of every token in the query with
        every token in the key, i.e. it works along the seq_length dimension.
        The query-key dot product is
        [seq_length * attention_head_size] * [attention_head_size * seq_length] = [seq_length * seq_length]
        """
        # Tuple concatenation of shapes: (1, 2) + (3, 4) -> (1, 2, 3, 4)
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)
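The effect of this reshape-then-permute can be traced on a small array. The following is a NumPy analogue of the function above (using `reshape`/`transpose` in place of `view`/`permute`); the sizes are arbitrary values picked for illustration.

```python
import numpy as np

batch_size, seq_length = 2, 3
num_heads, head_size = 4, 5
hidden_size = num_heads * head_size

x = np.arange(batch_size * seq_length * hidden_size, dtype=np.float32)
x = x.reshape(batch_size, seq_length, hidden_size)

# Same two steps as transpose_for_scores: split the last dim, then swap axes 1 and 2
new_shape = x.shape[:-1] + (num_heads, head_size)     # tuple concatenation
heads = x.reshape(new_shape).transpose(0, 2, 1, 3)    # (batch, heads, seq, head_size)

# Head h of token t in batch b is a contiguous slice of that token's hidden vector
b, h, t = 1, 2, 0
slice_matches = np.array_equal(
    heads[b, h, t], x[b, t, h * head_size:(h + 1) * head_size]
)
```

Each head therefore sees its own block of `head_size` consecutive features for every token, with no data copied or mixed between heads.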

Forward pass: multiply by the Q/K/V parameter matrices -> transposes and reshapes -> query-key dot product -> scaling -> add attention mask -> softmax -> weighted average -> restore dimensions.

    def forward(self, hidden_states, attention_mask):
        # shape of hidden_states and mixed_*_layer: batch_size * seq_length * hidden_size
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)

        # shape of *_layer: batch_size * num_attention_heads * seq_length * attention_head_size
        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        # shape of attention_scores: batch_size * num_attention_heads * seq_length * seq_length
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        # Scaling: normalizes the variance to 1 so the scores do not depend on the head dimension
        attention_scores /= math.sqrt(self.attention_head_size)

        # shape of attention_mask: batch_size * 1 * 1 * seq_length; it broadcasts
        # automatically to the shape of attention_scores.
        # The attention_mask we start from is batch_size * seq_length; two unsqueeze
        # calls produce the current attention_mask.
        attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities. Softmax does not change the shape.
        # shape of attention_scores: batch_size * num_attention_heads * seq_length * seq_length
        attention_probs = nn.Softmax(dim=-1)(attention_scores)
        attention_probs = self.dropout(attention_probs)

        # shape of value_layer: batch_size * num_attention_heads * seq_length * attention_head_size
        # shape of first context_layer: batch_size * num_attention_heads * seq_length * attention_head_size
        # shape of second context_layer: batch_size * seq_length * num_attention_heads * attention_head_size
        # context_layer is restored to: batch_size * seq_length * hidden_size
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)
        return context_layer
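The additive attention mask in the forward pass is worth tracing once by hand. The NumPy sketch below builds a mask of shape batch_size * 1 * 1 * seq_length from a 0/1 padding mask and shows that padded key positions receive essentially zero probability after softmax. The fill value -10000.0 follows what common BERT implementations use and should be treated as an assumption here, since the code above only receives the finished mask; the example inputs are made up.

```python
import numpy as np

# 1 = real token, 0 = padding; shape (batch_size, seq_length)
input_mask = np.array([[1, 1, 1, 0],
                       [1, 1, 0, 0]], dtype=np.float32)

# Two added axes (the "unsqueeze" steps) -> (batch_size, 1, 1, seq_length);
# 0 for real tokens, -10000 for padding (assumed fill value)
attention_mask = (1.0 - input_mask[:, None, None, :]) * -10000.0

batch_size, num_heads, seq_length = 2, 3, 4
rng = np.random.default_rng(0)
attention_scores = rng.standard_normal(
    (batch_size, num_heads, seq_length, seq_length)
)

# Broadcasting expands the mask over the head and query dimensions
masked = attention_scores + attention_mask
masked -= masked.max(axis=-1, keepdims=True)          # numerically stable softmax
attention_probs = np.exp(masked)
attention_probs /= attention_probs.sum(axis=-1, keepdims=True)
```

Adding a large negative number before softmax, rather than zeroing entries afterwards, keeps each row a valid probability distribution over the real tokens only.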

Attention is all you need! Hopefully this article gives you a deeper understanding of Self-Attention.

References

[1] https://liwt31.github.io/2018/10/12/mul-complexity/

[2] https://github.com/hichenway/CodeShare/tree/master/bert_pytorch_source_code
