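[Vector Database] Building and Searching a Vector Database

Use the sentence-transformers library to encode text as vectors, build a vector database, and run vector search with faiss.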

风好衣轻


风好衣轻 · Published 2024-08-11 18:22:14

1. Encoding text as vectors with sentence-transformers

Install sentence-transformers:

pip install -U sentence-transformers

Download the all-MiniLM-L6-v2 model weights from Hugging Face (1_Pooling is a directory that contains a single config.json file):

~$ ls
1_Pooling    config_sentence_transformers.json  model.safetensors  sentence_bert_config.json  tokenizer_config.json  train_script.py
config.json  data_config.json                   modules.json       special_tokens_map.json    tokenizer.json         vocab.txt

Run the example script below to encode a sentence as a vector:

from sentence_transformers import SentenceTransformer

model_path = "/hub/weights/all-MiniLM-L6-v2"
model = SentenceTransformer(model_path)
sentence = ['This framework generates embeddings for each input sentence']
embedding = model.encode(sentence)
print(len(embedding), len(embedding[0]))  # 1 384

2. Building a vector database from the SQuAD-explorer dataset


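Download the SQuAD-explorer dataset. It is split into a Training Set and a Dev Set. The Dev Set is smaller, which makes it easier to pretty-print and inspect the structure of the data, and easier to debug with.

Other datasets work too: as shown in section 1, any sentence in a language the model supports can be encoded as a vector.

Once loaded, the top level of the .json file is a Python dict with two keys, "version" and "data". The value of "data" is a list, and we can check its length: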

import json

with open("dev-v2.0.json", "r") as f:
    data = json.load(f)

data = data["data"]
print(len(data))  # 35

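The dev set has 35 entries and the train set has 442. Each entry is again a dict with two keys, "title" and "paragraphs". The value of "paragraphs" is a list whose elements are dicts. The only part we need from each of them is "qas", the QA pairs, which we use below to build the vector database.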
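To make the nesting concrete, here is a minimal sketch that walks a hand-written sample with the same shape as the file. The sample content and the `summarize` helper are made up for illustration and are not taken from the dataset or from the original post; only the keys mirror the layout described above.

```python
# Hand-written sample mimicking the layout described above (not real dataset content).
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Normans",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "In what country is Normandy located?",
                            "answers": [{"text": "France"}, {"text": "France"}],
                        },
                        {
                            # A question without an answer has an empty list here.
                            "question": "A hypothetical question with no answer?",
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}


def summarize(dataset):
    """Count entries, paragraphs, and questions with or without answers."""
    counts = {"entries": 0, "paragraphs": 0, "answerable": 0, "unanswerable": 0}
    for entry in dataset["data"]:
        counts["entries"] += 1
        for paragraph in entry["paragraphs"]:
            counts["paragraphs"] += 1
            for qa in paragraph["qas"]:
                key = "answerable" if qa["answers"] else "unanswerable"
                counts[key] += 1
    return counts


print(summarize(sample))
```

Running the same helper on the loaded dev-v2.0.json should give a quick overview of how many questions would survive a filter on non-empty answers.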

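First, read the data and collect every QA pair in the dataset. Some questions have several answers and some have none, so we drop the questions without an answer and keep only the first answer for questions with several. This gives a one-to-one mapping from question to answer: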

import json

with open("dev-v2.0.json", "r") as f:
    dataset = json.load(f)

qas = [
    (qa["question"], qa["answers"][0]["text"])
    for entry in dataset["data"]
    for paragraph in entry["paragraphs"]
    for qa in paragraph["qas"]
    if qa["answers"]  # skip questions that have no answer
]

print(len(qas))
print(qas[0])

The output shows the number of QA pairs and the first pair that was extracted:

5928
('In what country is Normandy located?', 'France')

Next, look up the answer to a question. If the question is exactly identical to one that exists in the data, the answer is found, for example:

qas_dict = dict(qas)
q = 'In what country is Normandy located?'
a = qas_dict.get(q)
print(a)

Treat qas_dict as a simple key-value database: querying it with the exact question returns the answer:

France

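But if the question differs even slightly from the key in the database (the strings are not equal), the lookup finds nothing. This is one of the main differences between an ordinary database and a vector database. To make sure that similar questions also retrieve the right answer, we can use a vector database.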
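A minimal sketch of that failure mode, using a one-entry dict in place of the full qas_dict. The rephrased and lowercased queries are illustrative inputs chosen here, not results from the original run.

```python
# One-entry stand-in for the qas_dict built from the dataset.
qas_dict = {"In what country is Normandy located?": "France"}

# The exact key is found.
exact = qas_dict.get("In what country is Normandy located?")

# A paraphrase with the same meaning is a different string, so it misses.
paraphrase = qas_dict.get("What country does Normandy belong to?")

# Even a change in letter case is enough to miss.
lowercase = qas_dict.get("in what country is normandy located?")

print(exact)       # France
print(paraphrase)  # None
print(lowercase)   # None
```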

# Use the CPU build of faiss here
pip install faiss-cpu

The complete example code:

import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model_path = "/path/to/all-MiniLM-L6-v2"
model = SentenceTransformer(model_path)


def load_qa_data(data_path):
    with open(data_path, "r") as f:
        dataset = json.load(f)
    qas = [
        (qa["question"], qa["answers"][0]["text"])
        for entry in dataset["data"]
        for paragraph in entry["paragraphs"]
        for qa in paragraph["qas"]
        if qa["answers"]  # skip questions that have no answer
    ]
    return np.array(qas)


def str_to_vec(sentence_list):
    embedding = model.encode(sentence_list)
    return embedding


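def build_faiss_index(vectors, nlist=100, pq_m=8, nbits=8):
    d = vectors.shape[1]  # embedding dimension (384 for all-MiniLM-L6-v2)
    quantizer = faiss.IndexFlatL2(d)  # coarse quantizer that assigns vectors to nlist cells
    # Argument order is (quantizer, d, nlist, M, nbits): M sub-quantizers, nbits bits per code
    index = faiss.IndexIVFPQ(quantizer, d, nlist, pq_m, nbits)
    index.train(vectors)  # learn the coarse centroids and the PQ codebooks
    index.add(vectors)
    return index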

k = 10

qas = load_qa_data("dev-v2.0.json")
q, a = [qa[0] for qa in qas], [qa[1] for qa in qas]
q_vec = str_to_vec(q)
print(q_vec.shape)
index = build_faiss_index(q_vec)

query_vector = str_to_vec(["What country does Normandy belong to?"])
distances, indices = index.search(query_vector, k)

for i in range(k):
    idx = indices[0][i]  # position of the i-th nearest stored question
    print(f"==> distance: {distances[0][i]:.4f}, indice: {idx}, {q[idx]}")

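Output of the run. The distances and indices are the search results. The question text on each row was printed from q[i] rather than q[indices[0][i]], so it is guaranteed to match its index only on the first row (index 0):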

==> distance: 0.5488, indice: 0, In what country is Normandy located?
==> distance: 0.6171, indice: 10, When were the Normans in Normandy?
==> distance: 0.6253, indice: 6, From which countries did the Norse originate?
==> distance: 0.6960, indice: 5255, Who was the Norse leader?
==> distance: 0.7582, indice: 12, What century did the Normans first gain their separate identity?
==> distance: 0.7674, indice: 4725, Who was the duke in the battle of Hastings?
==> distance: 0.7722, indice: 22, Who ruled the duchy of Normandy
==> distance: 0.7759, indice: 4488, What religion were the Normans
==> distance: 0.7777, indice: 5259, What is the original meaning of the word Norman?
==> distance: 0.8038, indice: 4706, When was the Latin version of the word Norman first recorded?

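Even with the question rephrased as:

“What country does Normandy belong to?”

the search still finds the closest sentence in the 384-dimensional space (reported distance 0.5488; faiss reports squared L2 distances, and IndexIVFPQ computes them approximately from compressed codes):

“In what country is Normandy located?”

We can then use its index, 0, to look up the corresponding answer.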
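The last step, going from a returned index back to an answer, can be sketched without faiss. The snippet below is a pure-Python exact search standing in for index.search, not the IVFPQ index used above. The 2-D vectors, the `search` and `squared_l2` helpers, and the placeholder answers are assumptions made for illustration; only the first question and its answer come from the dataset output shown earlier.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def search(query, vectors, k):
    """Exact nearest-neighbour search: (distances, indices), closest first."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Parallel lists: position i in every list refers to the same QA pair.
questions = [
    "In what country is Normandy located?",
    "When were the Normans in Normandy?",
    "What religion were the Normans",
]
answers = ["France", "(placeholder answer 1)", "(placeholder answer 2)"]

# Toy 2-D "embeddings" standing in for the 384-dimensional model output.
vectors = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.3]]
query_vector = [1.0, 0.0]

distances, indices = search(query_vector, vectors, k=2)
for dist, idx in zip(distances, indices):
    # Use the returned index, not the rank, to fetch the question and answer.
    print(f"distance: {dist:.4f}, index: {idx}, {questions[idx]} -> {answers[idx]}")

best_answer = answers[indices[0]]
```

Keeping questions, answers, and vectors in the same order is what makes the index returned by the search usable as a key into the answer list.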
