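Since BERT appeared, natural language processing (NLP) has been in the era of "pre-training + fine-tuning". Released by Google in 2018, this milestone model uses bidirectional context modeling to achieve breakthroughs on tasks such as text classification and question answering, and it remains a core tool for developers getting started with NLP. Drawing on hands-on experience, this article covers BERT from principles and applications to troubleshooting common problems, so that you can use it end to end.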

I. BERT Core Principles at a Glance

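The core strengths of BERT (Bidirectional Encoder Representations from Transformers) are "bidirectional understanding" and "general-purpose transfer". Its underlying logic comes down to three points: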

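  • Architecture: built entirely on the Transformer encoder. Self-attention dynamically captures the contextual relations between tokens, while residual connections and layer normalization stabilize training and mitigate vanishing gradients.
  • Pre-training tasks: two self-supervised tasks that need no manual labels, Masked Language Model (MLM, predicting masked tokens) and Next Sentence Prediction (NSP, judging whether one sentence follows another), teach the model general language patterns.
  • Input representation: the element-wise sum (not a concatenation) of token embeddings, segment embeddings (which tell the two sentences of a pair apart), and position embeddings (which encode word order), so the model sees token meaning, sentence boundaries, and position information.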
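
To make the MLM objective concrete, here is a minimal pure-Python sketch of the masking rule described in the BERT paper: roughly 15% of positions are selected for prediction, and of those 80% become [MASK], 10% become a random token, and 10% stay unchanged. The function name `mask_tokens`, the toy character-level input, and the seed are assumptions for illustration only; real pipelines operate on token IDs and skip special tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token at positions selected for
    prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

tokens = list("我今天去银行取钱") * 20  # toy character-level input
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab)
```

Keeping some selected tokens unchanged or randomized means the model cannot rely on seeing [MASK] at every position it has to predict, which narrows the gap between pre-training and fine-tuning inputs.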

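Compared with traditional unidirectional models such as RNNs and LSTMs, BERT resolves ambiguous words more accurately (for example, the different senses of "bank"), and because it is pre-trained on a large corpus (about 3.3 billion words), it transfers well to downstream tasks.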
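
The "bidirectional" part can be seen in a toy version of self-attention. The sketch below is a simplification under stated assumptions: a single head with identity projections (Q = K = V = x), whereas BERT uses learned projections and multiple heads. What it does show is that, without a causal mask, the token in the middle receives non-zero attention weight from both its left and its right neighbor.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    out, weights = [], []
    for q in x:
        # Score the query against every position, left and right alike
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)  # no causal mask: every position stays visible
        weights.append(w)
        # Output is the attention-weighted average of all value vectors
        out.append([sum(wj * v[c] for wj, v in zip(w, x)) for c in range(d)])
    return out, weights

x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # three toy token vectors
out, weights = self_attention(x)
```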

II. Typical BERT Application Scenarios

The "pre-training + fine-tuning" paradigm lets BERT adapt quickly to many NLP tasks without redesigning the model architecture for each one:

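  1. Text classification: sentiment analysis, spam detection, topic classification, and so on, for example judging whether an e-commerce review is positive or negative.
  2. Question answering: extractive QA (locating the answer span inside a passage, as in SQuAD) and open-domain QA (producing answers with the help of a knowledge base).
  3. Named entity recognition (NER): identifying people, locations, organizations, and other entities in text, for example extracting key facts from news.
  4. Semantic similarity: judging whether two sentences express the same meaning, which suits intent matching in customer-service bots.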
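
For extractive QA, a fine-tuned model typically emits one start score and one end score per token, and the answer is the span with the highest combined score. The sketch below illustrates only that selection step; the function name `best_span`, the `max_answer_len` limit, and the logits are made up for illustration and do not come from a real model.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start + end score with start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

tokens = ["BERT", "was", "released", "by", "Google", "in", "2018"]
start_logits = [0.1, 0.0, 0.2, 0.1, 3.0, 0.3, 1.0]  # made-up scores
end_logits = [0.0, 0.1, 0.1, 0.2, 2.5, 0.1, 1.5]    # made-up scores
s, e, _ = best_span(start_logits, end_logits)
answer = " ".join(tokens[s:e + 1])  # "Google" for these scores
```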

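In practice, using BERT comes down to four steps:

  • Choose a model and load the tools: use Hugging Face's transformers library, pick a BERT that matches your language and size requirements (e.g. bert-base-chinese), and load the tokenizer (which processes the text) and the pre-trained model (which provides the base capability).

  • Convert text into a format the model can read: use the tokenizer to encode the text into tensors such as input IDs and an attention mask, and truncate or pad every sequence to one fixed length (BERT accepts at most 512 tokens).

  • Attach the task and set the device: add a "task head" to the pre-trained model (for classification, a fully connected layer) and move both the model and the data to the CPU or GPU.

  • Train or run inference

    • Training: a forward pass computes the loss, then backpropagation updates the parameters;
    • Inference: switch the model to evaluation mode and read the predictions directly (e.g. class labels or entity positions).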
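
The encoding step is easier to picture with a small example. The following is a sketch of what truncation and padding do to a batch, not the tokenizer's actual implementation: the function name `encode_batch` and the token IDs are invented for illustration, and a real tokenizer also keeps the closing [SEP] token when it truncates.

```python
def encode_batch(batch_ids, max_length, pad_id=0):
    """Truncate or pad each ID sequence to max_length and build attention masks."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]               # truncate sequences that are too long
        pad = max_length - len(ids)          # number of padding slots needed
        input_ids.append(ids + [pad_id] * pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * pad)
    return input_ids, attention_mask

batch = [[101, 7, 8, 102], [101, 5, 6, 7, 8, 9, 102]]  # toy token IDs
input_ids, attention_mask = encode_batch(batch, max_length=6)
```

After this step every row has the same length, so the batch can be stacked into a single tensor, and the attention mask keeps the padded positions from influencing the result.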


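# main (entry script)

import os
import random

import numpy as np
import torch
import torch.nn as nn

# Make CUDA kernel launches synchronous so that errors are reported at the failing call (debugging aid, slows training)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from model_utils.data import get_data_loader
from model_utils.model import myBertModel
from model_utils.train import train_val

# Core configuration: must run after importing torch and before the model is created.
# Disable all cuDNN optimizations to avoid incompatible instructions.
torch.backends.cudnn.enabled = False
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Device definition and model initialization follow as usual
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")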
def seed_everything(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
#################################################################
seed_everything(0)
###############################################


lr = 0.0001
batchsize = 2
loss = nn.CrossEntropyLoss()
bert_path = "bert-base-chinese"
num_class = 2
data_path = "jiudian.txt"
max_acc= 0.6



model = myBertModel(bert_path, num_class, device).to(device,dtype=torch.float32)

optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.00001)

train_loader, val_loader = get_data_loader(data_path, batchsize)

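epochs = 5
save_path = "model_save/best_model.pth"
os.makedirs(os.path.dirname(save_path), exist_ok=True)  # torch.save does not create missing directories

# Cosine annealing with warm restarts; train_val steps it once per batch, so the learning rate restarts every T_0 = 20 batches
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=20, eta_min=1e-9)
val_epoch = 1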


para = {
    "model": model,
    "train_loader": train_loader,
    "val_loader": val_loader,
    "scheduler": scheduler,
    "optimizer": optimizer,
    "loss": loss,
    "epoch": epochs,
    "device": device,
    "save_path": save_path,
    "max_acc": max_acc,
    "val_epoch": val_epoch   # validate once every val_epoch training epochs
}

train_val(para)
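
To see what the scheduler in this script does to the learning rate, here is a small sketch of the cosine-annealing-with-warm-restarts formula, written out by hand rather than calling PyTorch. It assumes the settings used above (lr = 1e-4, eta_min = 1e-9, T_0 = 20) and a constant cycle length (T_mult = 1); the function name `cosine_warm_restart_lr` is hypothetical.

```python
import math

def cosine_warm_restart_lr(step, base_lr=1e-4, eta_min=1e-9, t_0=20):
    """Learning rate after `step` scheduler steps, with restarts every t_0 steps."""
    t_cur = step % t_0  # position inside the current cycle
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_0)) / 2

# Two full cycles: the rate decays along a cosine curve, then jumps back to base_lr
schedule = [cosine_warm_restart_lr(s) for s in range(41)]
```

Because the training loop steps the scheduler once per batch, a cycle here lasts 20 batches, not 20 epochs. With a batch size of 2 that is only 40 samples per cycle, so it may be worth checking whether such frequent restarts are intended.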










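# data (model_utils/data.py)

# Builds the two DataLoaders: one for training, one for validation
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split   # given X, y and a split ratio, returns the training and validation X and y
import torch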
import torch


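def read_file(path):
    """Read 'label,text' lines and return (texts, labels) as two lists of strings."""
    data = []
    label = []

    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i == 0:
                continue  # skip the header line
            if i > 200 and i < 7500:
                continue  # skip lines 201-7499 to keep only a small subset of the file
            line = line.strip("\n")
            if not line:
                continue  # ignore blank lines, which would otherwise raise an IndexError below
            line = line.split(",", 1)  # split on the first comma only: label, then text
            data.append(line[1])
            label.append(line[0])
    print("Read %d samples" % len(data))
    return data, label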


# file = "../jiudian.txt"
# read_file(file)
class jdDataset(Dataset):
    def __init__(self, data, label):
        self.X = data
        self.Y = torch.LongTensor([int(i) for i in label])

    def __getitem__(self, item):
        return self.X[item], self.Y[item]

    def __len__(self):
        return len(self.Y)





def get_data_loader(path, batchsize, val_size=0.2):          # read the data, then split it
    data, label = read_file(path)
    # stratify=label keeps the class ratio the same in the training and validation sets
    train_x, val_x, train_y, val_y = train_test_split(data, label, test_size=val_size, shuffle=True, stratify=label)
    train_set = jdDataset(train_x, train_y)
    val_set = jdDataset(val_x, val_y)
    train_loader = DataLoader(train_set, batchsize, shuffle=True)
    val_loader = DataLoader(val_set, batchsize, shuffle=False)  # validation does not need shuffling
    return train_loader, val_loader

if __name__ == "__main__":
    get_data_loader("../jiudian.txt", 2)
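
The `stratify=label` argument is what keeps the positive/negative ratio identical in both splits. The sketch below illustrates the idea in pure Python. It is not how scikit-learn implements it, and the function name `stratified_split` and the toy data are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(data, labels, val_size=0.2, seed=0):
    """Split so that each label contributes the same fraction to the validation set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)  # group sample indices by class

    train_idx, val_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_size)  # per-class validation quota
        val_idx += idxs[:n_val]
        train_idx += idxs[n_val:]

    def pick(idxs):
        return [data[i] for i in idxs], [labels[i] for i in idxs]

    return pick(train_idx), pick(val_idx)

data = [f"review {i}" for i in range(100)]  # toy reviews
labels = ["1"] * 70 + ["0"] * 30             # imbalanced toy labels
(train_x, train_y), (val_x, val_y) = stratified_split(data, labels)
```

With a plain random split, a small validation set can end up with a noticeably different class ratio from the training set, which makes validation accuracy harder to interpret.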


# model (model_utils/model.py)

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, BertConfig


class myBertModel(nn.Module):
    def __init__(self, bert_path, num_class, device):
        super(myBertModel, self).__init__()

        self.bert = BertModel.from_pretrained(bert_path)
        # config = BertConfig.from_pretrained(bert_path)
        # self.bert = BertModel(config)



        self.device = device
        self.cls_head = nn.Linear(768, num_class)
        self.tokenizer = BertTokenizer.from_pretrained(bert_path)

    def forward(self, text):
        # Tokenize inside the model: text is a single string or a batch of strings
        encoded = self.tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
        input_ids = encoded["input_ids"].to(self.device)
        token_type_ids = encoded["token_type_ids"].to(self.device)
        attention_mask = encoded["attention_mask"].to(self.device)

        # return_dict=False makes BertModel return a (sequence_output, pooled_output) tuple
        sequence_out, pooler_out = self.bert(input_ids=input_ids,
                        token_type_ids=token_type_ids,
                        attention_mask=attention_mask,
                        return_dict=False)

        # Classify from the pooled [CLS] representation; the output is raw logits
        pred = self.cls_head(pooler_out)
        return pred

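if __name__ == "__main__":
    # The constructor requires a device argument, which the original smoke test omitted
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = myBertModel("../bert-base-chinese", 2, device).to(device)
    pred = model("今天天气真好")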
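
Note that the classification head returns raw logits and the model applies no softmax. That is intentional, because `nn.CrossEntropyLoss` applies log-softmax internally. The pure-Python sketch below spells out that computation for a single sample; the function names and the example logits are illustrative assumptions.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class, computed from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 0.5]                     # made-up head output for one review
probs = softmax(logits)
pred = probs.index(max(probs))          # predicted class = argmax
loss_right = cross_entropy(logits, 0)   # small loss when the label matches
loss_wrong = cross_entropy(logits, 1)   # larger loss when it does not
```

At inference time the argmax of the logits is enough to get the label; softmax is only needed when you want to report probabilities.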


# train (model_utils/train.py)

import torch
import time
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm



def train_val(para):

########################################################
    model = para['model']
    train_loader =para['train_loader']
    val_loader = para['val_loader']
    scheduler = para['scheduler']
    optimizer = para['optimizer']
    loss = para['loss']
    epoch = para['epoch']
    device = para['device']
    save_path = para['save_path']
    max_acc = para['max_acc']
    val_epoch = para['val_epoch']


#################################################
    plt_train_loss = []
    plt_train_acc = []
    plt_val_loss = []
    plt_val_acc = []
    val_rel = []

    for i in range(epoch):
        start_time = time.time()
        model.train()
        train_loss = 0.0
        train_acc = 0.0
        val_acc = 0.0
        val_loss = 0.0
        for batch in tqdm(train_loader):
            optimizer.zero_grad()
            text, labels = batch[0], batch[1].to(device)
            pred = model(text)
            bat_loss = loss(pred, labels)
            bat_loss.backward()
            # Gradient clipping must run after backward() and before optimizer.step() to have any effect
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()              # adjust the learning rate once per batch
            train_loss += bat_loss.item()    # .item() returns a plain Python float, detached from the graph
            train_acc += np.sum(np.argmax(pred.detach().cpu().numpy(), axis=1) == labels.cpu().numpy())
        # bat_loss is already a per-batch mean, so average over the number of batches
        plt_train_loss.append(train_loss / len(train_loader))
        plt_train_acc.append(train_acc / len(train_loader.dataset))
        if i % val_epoch == 0:
            model.eval()
            with torch.no_grad():
                for batch in tqdm(val_loader):
                    val_text, val_labels = batch[0], batch[1].to(device)
                    val_pred = model(val_text)
                    val_bat_loss = loss(val_pred, val_labels)
                    val_loss += val_bat_loss.item()

                    val_acc += np.sum(np.argmax(val_pred.cpu().numpy(), axis=1) == val_labels.cpu().numpy())
                    val_rel.append(val_pred.cpu())  # keep predictions on the CPU so they do not hold GPU memory

            plt_val_loss.append(val_loss / len(val_loader))          # mean loss per batch
            plt_val_acc.append(val_acc / len(val_loader.dataset))    # accuracy as a ratio in [0, 1]
            # Compare ratios: max_acc starts as a ratio, whereas val_acc is a count of correct predictions
            if plt_val_acc[-1] > max_acc:
                torch.save(model, save_path + str(i) + "ckpt")  # tag with the current epoch index, not the epoch total
                max_acc = plt_val_acc[-1]
            print('[%03d/%03d] %2.2f sec(s) TrainAcc : %3.6f TrainLoss : %3.6f | valAcc: %3.6f valLoss: %3.6f  ' % \
                  (i, epoch, time.time()-start_time, plt_train_acc[-1], plt_train_loss[-1], plt_val_acc[-1], plt_val_loss[-1])
                  )
            if i % 50 == 0:
                torch.save(model, save_path+'-epoch:'+str(i)+ '-%.2f'%plt_val_acc[-1])
        else:
            plt_val_loss.append(plt_val_loss[-1])
            plt_val_acc.append(plt_val_acc[-1])
            print('[%03d/%03d] %2.2f sec(s) TrainAcc : %3.6f TrainLoss : %3.6f   ' % \
                  (i, epoch, time.time()-start_time, plt_train_acc[-1], plt_train_loss[-1])
                  )
    plt.plot(plt_train_loss)
    plt.plot(plt_val_loss)
    plt.title('loss')
    plt.legend(['train', 'val'])
    plt.show()

    plt.plot(plt_train_acc)
    plt.plot(plt_val_acc)
    plt.title('Accuracy')
    plt.legend(['train', 'val'])
    plt.savefig('acc.png')
    plt.show()
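
Gradient clipping, as done by `clip_grad_norm_` in the training loop, only has an effect when it runs after `backward()` and before `optimizer.step()`. The sketch below shows the underlying idea in pure Python: compute one global L2 norm over all gradients and rescale them if that norm exceeds the limit. The function name `clip_by_global_norm` and the toy gradients are assumptions for illustration, and the small `eps` term follows the usual numerical-stability convention.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

grads = [[3.0], [4.0]]  # toy gradients of two parameters, global norm 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
```

Scaling every gradient by the same factor preserves the update direction and only limits its size, which helps when an occasional batch produces a very large loss.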
