🚀 Day02 - Advanced NLP (Natural Language Processing)

📖 Overview
Day two of study: digging deeper into each NLP module.


🗺️ Review and Next Steps

1.1 Text Preprocessing, Reinforced

import jieba
import jieba.posseg as pseg

# Precise (exact) mode segmentation
text = "传智教育是一家上市公司"  # "Chuanzhi Education is a listed company"
words = jieba.lcut(text)

# POS (part-of-speech) tagging
result = pseg.lcut(text)
for w, f in result:
    print(f"{w}: {f}")

1.2 Custom Dictionaries

jieba.add_word("深度学习", freq=10, tag='n')
jieba.load_userdict("custom_dict.txt")
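The user dictionary file passed to load_userdict is plain text with one entry per line: the word, then an optional frequency and an optional POS tag, space-separated. The entries below are made-up examples of that format:

```
深度学习 10 n
自然语言处理 5 n
传智教育 3 nt
```

Words added this way (or via add_word) are kept whole during segmentation instead of being split into sub-words.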

💻 Text Vectorization

2.1 One-Hot Encoding, Deepened

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)  # corpus: a list of (whitespace-separated) texts
one_hot = tokenizer.texts_to_matrix(corpus, mode='binary')
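A pure-Python sketch of what binary mode produces: one row per document, one column per vocabulary word, 1 if the word occurs. (Keras additionally reserves column 0 and orders words by frequency; this toy version just sorts the vocabulary.)

```python
# Toy binary bag-of-words matrix, mirroring texts_to_matrix(mode='binary')
corpus = ["deep learning is fun", "learning is hard"]
vocab = sorted({w for doc in corpus for w in doc.split()})
matrix = [[1 if w in doc.split() else 0 for w in vocab] for doc in corpus]

print(vocab)   # ['deep', 'fun', 'hard', 'is', 'learning']
print(matrix)  # [[1, 1, 0, 1, 1], [0, 0, 1, 1, 1]]
```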

2.2 Training Word2Vec

import fasttext

model = fasttext.train_unsupervised('corpus.txt', model='cbow', dim=100)
vec = model.get_word_vector("关键词")
similar = model.get_nearest_neighbors("关键词")
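get_nearest_neighbors ranks words by cosine similarity between their vectors. A minimal illustration of that metric (not fastText's actual implementation):

```python
import math

def cosine(u, v):
    # cosine similarity: dot product normalized by both vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```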

📊 Data Analysis

3.1 Distribution Analysis

import pandas as pd
import seaborn as sns

df = pd.read_csv('data.tsv', sep='\t')
df['length'] = df['text'].apply(len)   # sentence length per sample
sns.countplot(x='label', data=df)      # class/label distribution

3.2 Word Cloud Generation

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# text: the raw string to visualize; font_path must point to a font
# with CJK glyphs (e.g. SimHei), or Chinese characters render as boxes
wc = WordCloud(font_path='simhei.ttf').generate(text)
plt.imshow(wc)
plt.axis('off')

🧠 The RNN Family

4.1 RNN Implementation

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=256, hidden_size=512, num_layers=2, batch_first=True)
# x: (batch, seq_len, 256); h0: (num_layers, batch, 512)
output, hidden = rnn(x, h0)

4.2 LSTM/GRU

# positional args: input_size, hidden_size, num_layers;
# bidirectional=True doubles the output feature dimension
lstm = nn.LSTM(256, 512, 2, bidirectional=True)
gru = nn.GRU(256, 512, 2, bidirectional=True)
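Quick shape bookkeeping for the bidirectional LSTM above, based on PyTorch's documented output shapes (batch and seq_len values here are made up; batch_first is False, so seq_len comes first):

```python
# Output shapes of a bidirectional, 2-layer LSTM (per PyTorch docs)
input_size, hidden_size, num_layers = 256, 512, 2
num_directions = 2  # bidirectional=True
batch, seq_len = 8, 20  # hypothetical

output_shape = (seq_len, batch, num_directions * hidden_size)
h_n_shape = (num_layers * num_directions, batch, hidden_size)
print(output_shape)  # (20, 8, 1024)
print(h_n_shape)     # (4, 8, 512)
```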

🎯 Attention Mechanism

class Attention(nn.Module):
    def forward(self, hidden, encoder_outputs):
        # hidden: (batch, hidden_size); encoder_outputs: (batch, seq_len, hidden_size)
        # 1. attention weights from dot-product scores
        scores = torch.bmm(encoder_outputs, hidden.unsqueeze(2)).squeeze(2)
        weights = torch.softmax(scores, dim=1)
        # 2. weighted sum of encoder outputs -> context vector
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights
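A toy numeric walkthrough of the same two steps (scores → softmax weights → weighted sum), with made-up values:

```python
import math

hidden = [1.0, 0.0]                                    # decoder state
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 encoder steps

# step 1: dot-product scores, then softmax into weights
scores = [sum(h * e for h, e in zip(hidden, enc)) for enc in encoder_outputs]
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# step 2: weighted sum of encoder outputs -> context vector
context = [sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
           for d in range(len(hidden))]
print(scores)  # [1.0, 0.0, 0.5]
```

The encoder step most aligned with the decoder state gets the largest weight, and the weights always sum to 1.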

🔥 Transformer

Encoder

class Encoder(nn.Module):
    def __init__(self, embed_size, heads, num_layers):
        super().__init__()
        # SelfAttention (multi-head self-attention) is assumed to be defined elsewhere
        self.attention = SelfAttention(embed_size, heads)
        self.norm = nn.LayerNorm(embed_size)
    
    def forward(self, x, mask):
        # self-attention, then a residual connection followed by LayerNorm
        attn = self.attention(x, x, x, mask)
        return self.norm(x + attn)
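The residual + LayerNorm step can be sketched in pure Python (no learned gain/bias, illustrative values only):

```python
def layer_norm(x, eps=1e-5):
    # normalize to zero mean and (near-)unit variance across features
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

x = [1.0, 2.0, 3.0, 4.0]       # hypothetical token embedding
attn = [0.1, 0.2, 0.3, 0.4]    # hypothetical attention output
out = layer_norm([a + b for a, b in zip(x, attn)])  # residual, then normalize
```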

Decoder

class Decoder(nn.Module):
    def forward(self, x, encoder_out, src_mask, trg_mask):
        # masked self-attention + encoder-decoder (cross) attention
        return output
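The trg_mask passed to the decoder is a causal (look-ahead) mask: position i may only attend to positions j ≤ i, so the model cannot peek at future tokens. A minimal sketch:

```python
def causal_mask(n):
    # lower-triangular matrix: 1 = attend, 0 = masked out
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

print(causal_mask(3))  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```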

🚀 Transfer Learning

FastText Classification

model = fasttext.train_supervised('train.txt', lr=0.1, epoch=5)
# predict returns a tuple: (labels, probabilities)
pred, prob = model.predict('文本')
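train.txt uses fastText's supervised format: one sample per line, each label prefixed with `__label__`, followed by the (pre-tokenized) text. The labels and texts below are hypothetical:

```
__label__positive this product works great
__label__negative arrived broken and late
```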

BERT

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('bert-base-chinese')

📝 Summary

Day02 digs deeper into each NLP module, laying the groundwork for the hands-on projects to come.
