Deep Learning: Distillation

-柚子皮- · Published 2021-05-25 15:50:07

Distilling the knowledge in a neural network

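Hinton's paper proposes a simple method: train the student model so that its predicted distribution fits the predicted distribution of the teacher model (which can be an ensemble). Dividing the teacher's logits by a temperature controls how smooth that distribution is and also reduces the influence of extreme values.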

Note: as a reference setting, use a hard-loss weight of 0.8 and a soft-loss weight of p = 0.2 (the soft weight should generally not exceed 0.5).

Softmax in distillation

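For a classification problem, define the soft label as the model's output (the probability of each label) and the hard label as the correct label (the ground truth). Training usually maximizes the probability of the correct label, but the near-zero probabilities of the incorrect labels still differ in size. This is called "dark knowledge", and it reflects the model's ability to generalize. Probabilities that are too close to 0 are hard for the student to learn from, so to make the teacher's output easier to learn, the teacher's output uses a softmax with temperature T:

q_i = exp(z_i / T) / ∑_j exp(z_j / T)

where z_i is the logit for class i and q_i is the resulting probability.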

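Compared with the ordinary softmax, there is one extra parameter T (temperature); T = 1 gives back the ordinary softmax. The larger T is, the smoother the resulting probability distribution: an output that was originally 0 & 1 becomes, for example, 0.2 & 0.8, which carries more distinguishing information and is easier for the student to learn from.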
[Distilling the knowledge in a neural network]
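To make the effect of T concrete, here is a minimal sketch of a temperature softmax in plain Python. The function name `softmax_with_temperature` and the example logits are illustrative assumptions, not taken from the paper.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 0.0]  # illustrative values only
sharp = softmax_with_temperature(logits, temperature=1.0)
smooth = softmax_with_temperature(logits, temperature=5.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in smooth])
```

With the higher temperature, the largest probability drops and the small ones grow, so the relative ordering of the "wrong" classes becomes visible to the student.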

Example

1 Temperature scaling
Divide the raw logits by the temperature parameter T to adjust how smooth the probability distribution is:

scaled_logits = logits / temperature

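2 Compute the cross-entropy
Compute the loss on the scaled logits with one of the tf.nn.*_cross_entropy_with_logits functions (for example tf.nn.softmax_cross_entropy_with_logits).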

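3 Gradient adjustment (optional)
Dividing the logits by T shrinks the soft-target gradients by roughly 1/T². To keep the gradient magnitude consistent with the original task (as in knowledge distillation), multiply the loss by T²: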

loss = loss * (temperature ** 2)
[deepseek]
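The three steps can be put together as a single loss. The sketch below uses only the Python standard library rather than TensorFlow, so that it runs anywhere. The function names, the default temperature of 2.0, and the example logits are assumptions for illustration; the 0.8 / 0.2 weights follow the reference setting in the note above.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, hard_weight=0.8, soft_weight=0.2):
    # Hard loss: ordinary cross-entropy against the ground-truth label (T = 1).
    hard = -log_softmax(student_logits)[label]
    # Soft loss: cross-entropy between the softened teacher and student outputs.
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    soft = -sum(t * ls for t, ls in zip(teacher_probs, student_log_probs))
    # Step 3: multiply by T^2 so the soft-loss gradients keep a comparable scale.
    soft *= temperature ** 2
    return hard_weight * hard + soft_weight * soft

teacher = [6.0, 2.0, -1.0]          # illustrative logits
good_student = [5.5, 2.1, -0.8]     # close to the teacher
bad_student = [-1.0, 2.0, 6.0]      # ranks the classes the other way round
loss_good = distillation_loss(good_student, teacher, label=0)
loss_bad = distillation_loss(bad_student, teacher, label=0)
print(loss_good, loss_bad)
```

A student whose logits track the teacher's gets the lower loss, as expected. In this sketch both student and teacher logits are softened with the same T in the soft term, as in Hinton's formulation.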

DistilBERT

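Compared with BERT-PKD, DistilBERT's approach is simple and direct: it keeps the model width unchanged and halves the depth. The effort goes mainly into initialization and the loss function:

Loss function: the sum of a knowledge-distillation loss, a masked language model loss, and a cosine embedding loss.
Initialization: the student is initialized from the teacher's parameters, taking one layer out of every two.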
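As a rough sketch of the "one layer out of every two" initialization, the code below copies every second layer from a toy teacher. Layers are represented as plain dicts of placeholder parameters, and the helper names, the `step` and `offset` parameters, and the choice of which layer in each pair to keep are all assumptions for illustration.

```python
import copy

def select_teacher_layers(num_teacher_layers, step=2, offset=0):
    """Indices (0-based) of the teacher layers copied into the student."""
    return list(range(offset, num_teacher_layers, step))

def init_student_from_teacher(teacher_layers, step=2, offset=0):
    """Build the student's layer list by copying every `step`-th teacher layer."""
    kept = select_teacher_layers(len(teacher_layers), step, offset)
    # Deep-copy so later student updates do not modify the teacher's parameters.
    return [copy.deepcopy(teacher_layers[i]) for i in kept]

# Toy teacher: 12 layers, each a dict of placeholder parameters.
teacher_layers = [{"name": f"layer_{i}", "weight": [float(i)]} for i in range(12)]
student_layers = init_student_from_teacher(teacher_layers)
print([layer["name"] for layer in student_layers])
# prints ['layer_0', 'layer_2', 'layer_4', 'layer_6', 'layer_8', 'layer_10']
```

Copying works here only because the student keeps the teacher's width, so each kept layer has exactly the same parameter shapes.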

Student architecture

Similar to BERT, but with half the number of layers.

Student initialization

Because the student's layers have the same structure as the teacher's, one of every two teacher layers is kept and its parameters are reused.

Distillation

Training follows RoBERTa's optimization strategy: dynamic masking, a larger batch size, and no NSP loss.

Training Loss

The final training objective is a linear combination of the distillation loss L_{ce} with the supervised training loss, in our case the masked language modeling loss L_{mlm}. We found it beneficial to add a cosine embedding loss ( L_{cos} ) which will tend to align the directions of the student and teacher hidden states vectors.

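The final loss has three parts:

    1 Distillation loss: L_{ce} = -∑_i t_i ∗ log(s_i), where s_i is the probability output by the student and t_i is the probability output by the teacher. The higher the t_i predicted by BERT and the lower the s_i predicted by DistilBERT, the larger the loss.
    2 Masked language model loss, as in BERT; this is the hard loss.
    3 Cosine embedding loss, which encourages the student to learn hidden state vectors aligned with the teacher's.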
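The three parts can be sketched for a single masked position as follows. This is a plain-Python illustration, not DistilBERT's training code: the function names, the default temperature, and the weights `w_ce`, `w_mlm`, `w_cos` are assumptions (with all weights at 1.0 the result is the plain sum), and the cosine term is written as 1 minus the cosine similarity.

```python
import math

def log_softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def distill_ce(student_logits, teacher_logits, temperature):
    """L_ce: cross-entropy between softened teacher (t_i) and student (s_i)."""
    t = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    log_s = log_softmax(student_logits, temperature)
    return -sum(ti * lsi for ti, lsi in zip(t, log_s))

def mlm_loss(student_logits, target_id):
    """L_mlm: hard cross-entropy against the true token at the masked position."""
    return -log_softmax(student_logits)[target_id]

def cosine_embedding_loss(student_hidden, teacher_hidden):
    """L_cos: 1 - cosine similarity, zero when the two directions coincide."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def distilbert_loss(student_logits, teacher_logits, target_id,
                    student_hidden, teacher_hidden,
                    temperature=2.0, w_ce=1.0, w_mlm=1.0, w_cos=1.0):
    return (w_ce * distill_ce(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_loss(student_logits, target_id)
            + w_cos * cosine_embedding_loss(student_hidden, teacher_hidden))

# Illustrative values: a 3-token vocabulary and 2-dimensional hidden states.
aligned = distilbert_loss([4.0, 1.0, 0.0], [5.0, 1.0, 0.0], 0, [0.9, 0.1], [1.0, 0.0])
misaligned = distilbert_loss([0.0, 1.0, 4.0], [5.0, 1.0, 0.0], 0, [-1.0, 0.2], [1.0, 0.0])
print(aligned, misaligned)
```

The student that agrees with the teacher on both the output distribution and the hidden-state direction gets the smaller total loss.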

[DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter]

[DistilBert解读]

[模型训练损失值不变_Bert与模型蒸馏: PKD和DistillBert]

BERT-PKD (Patient Knowledge Distillation)

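On top of the two losses from Hinton's method, PKD adds one more loss: L_PT.

The PKD paper compares reducing model width with reducing model depth and concludes that reducing width yields a smaller efficiency gain than reducing depth.

The paper proposes multi-layer distillation: besides learning the teacher's output probabilities, the student also learns the outputs of some intermediate layers. Two strategies are proposed. The first is the Skip strategy, where the student learns from one intermediate teacher layer every few layers. The second is the Last strategy, where the student learns from the teacher's last few layers. Learning the full intermediate-layer outputs would be computationally expensive. To avoid this, the paper uses BERT's special [CLS] token: because of its importance in BERT classification tasks, the student learns only the intermediate [CLS] outputs during distillation. The hidden states are first normalized, and the loss is then their mean squared error (MSE).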
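The two layer-selection strategies and the normalize-then-MSE loss on [CLS] states can be sketched as below. This is an illustration under stated assumptions, not the paper's code: the function names are made up, layers are numbered from 1, only the student's first M-1 layers are matched (its last layer is assumed to be supervised through the output distribution), and the per-layer MSE is averaged over dimensions and summed over layers.

```python
import math

def pkd_layer_map(num_teacher_layers, num_student_layers, strategy="skip"):
    """Teacher layers (1-based) matched to student layers 1..M-1."""
    k = num_student_layers - 1
    if strategy == "skip":
        step = num_teacher_layers // num_student_layers
        return [step * i for i in range(1, k + 1)]
    if strategy == "last":
        return list(range(num_teacher_layers - k, num_teacher_layers))
    raise ValueError("strategy must be 'skip' or 'last'")

def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def pt_loss(student_cls, teacher_cls, layer_map):
    """Sum over matched layers of the MSE between normalized [CLS] states.

    student_cls[i] is the [CLS] state of student layer i + 1,
    teacher_cls[j] is the [CLS] state of teacher layer j + 1.
    """
    total = 0.0
    for s_idx, t_layer in enumerate(layer_map):
        s = l2_normalize(student_cls[s_idx])
        t = l2_normalize(teacher_cls[t_layer - 1])
        total += sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
    return total

skip_map = pkd_layer_map(12, 6, "skip")
last_map = pkd_layer_map(12, 6, "last")
print(skip_map, last_map)

# Toy [CLS] states: 12 teacher layers, 3 dimensions each (illustrative values).
teacher_cls = [[float(i + 1), 1.0, -0.5 * i] for i in range(12)]
# A student whose states are scaled copies of the Skip-matched teacher layers.
student_cls = [[3.0 * x for x in teacher_cls[t - 1]] for t in skip_map] + [[1.0, 0.0, 0.0]]
print(pt_loss(student_cls, teacher_cls, skip_map))
print(pt_loss(student_cls, teacher_cls, last_map))
```

Because the states are normalized first, a student state that is only a rescaled copy of the teacher's contributes essentially zero loss; only a difference in direction is penalized.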

Note:

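1 As for how the student's intermediate layers map to the teacher's, the paper finds that the best strategy is simply to take teacher layers at a fixed multiple, for example student layer 1 to teacher layer 2, student layer 2 to teacher layer 4, and so on.

2 For initialization, the student is initialized from the first few layers of the teacher.

3 Does a better teacher model bring an improvement? The answer is no. As the figure above shows, replacing the 12-layer BERT teacher with a 24-layer BERT actually made the results worse. A possible reason is that the experiments initialize the student from the teacher's first N layers, and for a 24-layer model the first N layers are more likely to cause a mismatch. A better approach would be to train the student model well first and then have it learn from the teacher.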

[Patient Knowledge Distillation for BERT Model Compression]