【LangChain系列】第四篇:向量数据库与嵌入简介及实践
在这篇博文中,我们将探讨向量存储和嵌入,它们是构建聊天机器人和对数据语料库执行语义搜索的最重要组件。
在这篇博文中,我们将探讨向量存储和嵌入,它们是构建聊天机器人和对数据语料库执行语义搜索的最重要组件。
一、工作流
回想一下检索增强生成 (RAG) 的整个工作流程:
我们从文档开始,创建这些文档的较小拆分,为这些拆分生成嵌入,然后将它们存储在矢量存储中。向量存储是一个数据库,您可以在以后轻松查找类似的向量。
二、安装
设置适当的环境变量并加载我们将要处理的文档 - cs229_lectures:
import os
from langchain_openai import OpenAI
from dotenv import load_dotenv, find_dotenv
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
_ = load_dotenv(find_dotenv())
client = OpenAI(
api_key=os.getenv("OPENAI_API_KEY")
)
loaders = [
PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"), # Duplicate documents on purpose - messy data
PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf"),
]
docs = []
for loader in loaders:
docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_documents(docs)
print("Length of splits: ", len(splits)) # Length of splits: 209
三、嵌入
现在,我们已经将文档拆分为更小的、语义上有意义的块,是时候为它们创建嵌入了。嵌入获取一段文本并创建该文本的数字表示,以便具有相似内容的文本在此数字空间中具有相似的向量。这使我们能够比较这些向量并找到相似的文本片段。
为了说明这一点,让我们尝试一些玩具示例:
from langchain_openai import OpenAIEmbeddings
import numpy as np
embedding = OpenAIEmbeddings()
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)
print(np.dot(embedding1, embedding2)) # 0.9631227500523609
print(np.dot(embedding1, embedding3)) # 0.7703257495981695
print(np.dot(embedding2, embedding3)) # 0.7591627401108028
不出所料,关于宠物的前两句话有非常相似的嵌入(点积为 0.96),而关于天气的句子与两个与宠物相关的句子不太相似(点积为 0.77 和 0.76)。
四、向量存储
接下来,我们将这些嵌入存储在向量存储中,这将使我们能够在以后尝试查找给定问题的相关文档时轻松查找类似的向量。
在本课中,我们将使用 Chroma 矢量存储,因为它是轻量级的,并且在内存中,因此很容易上手:
from langchain.vectorstores import Chroma
persist_directory = "docs/chroma/"
# this code is only for ipynb files
# !rm -rf ./docs/chroma # remove old database files if any
vectordb = Chroma.from_documents(
documents=splits, embedding=embedding, persist_directory=persist_directory
)
print(vectordb._collection.count()) # 209
五、相似性检索
相似性搜索是如何工作的:
question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question, k=3)
print("Length of context docs: ", len(docs)) # 3
print(docs[0].page_content)
cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So
rather than sending us email individually, if you send email to this account, it will
actually let us get back to you maximally quickly with answers to your questions.
If you're asking questions about homework probl ems, please say in the subject line which
assignment and which question the email refers to, since that will also help us to route
your question to the appropriate TA or to me appropriately and get the response back to
you quickly.
Let's see. Skipping ahead — let's see — for homework, one midterm, one open and term
project. Notice on the honor code. So one thi ng that I think will help you to succeed and
do well in this class and even help you to enjoy this cla ss more is if you form a study
group.
So start looking around where you' re sitting now or at the end of class today, mingle a
little bit and get to know your classmates. I strongly encourage you to form study groups
and sort of have a group of people to study with and have a group of your fellow students
to talk over these concepts with. You can also post on the class news group if you want to
use that to try to form a study group.
But some of the problems sets in this cla ss are reasonably difficult. People that have
taken the class before may tell you they were very difficult. And just I bet it would be
more fun for you, and you'd probably have a be tter learning experience if you form a
这将返回提及 cs229-qa@cs.stanford.edu 电子邮件地址的相关块,用于询问有关课程材料的问题。
在此之后,让我们保留向量数据库以备将来使用:
vectordb.persist()
六、故障模式
虽然基本的语义搜索效果很好,但可能会出现一些边缘情况和故障模式。让我们来探讨其中的一些。
1.重复文档
question = "what did they say about matlab?"
docs = vectordb.similarity_search(question, k=5)
print(docs[0].page_content)
# Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort of is, sort of isn't. \nSo I guess for those of you that haven't s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms. \nAnd in case some of you want to work on your own home computer or something if you \ndon't have a MATLAB license, for the purposes of this class, there's also — [inaudible] \nwrite that down [inaudible] MATLAB — there' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about \neverything. \nSo actually I, well, so yeah, just a side comment for those of you that haven't seen \nMATLAB before I guess, once a colleague of mine at a different university, not at \nStanford, actually teaches another machine l earning course. He's taught it for many years. \nSo one day, he was in his office, and an old student of his from, lik e, ten years ago came \ninto his office and he said, "Oh, professo r, professor, thank you so much for your', metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'})
print(docs[1].page_content)
# Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort of is, sort of isn't. \nSo I guess for those of you that haven't s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms. \nAnd in case some of you want to work on your own home computer or something if you \ndon't have a MATLAB license, for the purposes of this class, there's also — [inaudible] \nwrite that down [inaudible] MATLAB — there' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about \neverything. \nSo actually I, well, so yeah, just a side comment for those of you that haven't seen \nMATLAB before I guess, once a colleague of mine at a different university, not at \nStanford, actually teaches another machine l earning course. He's taught it for many years. \nSo one day, he was in his office, and an old student of his from, lik e, ten years ago came \ninto his office and he said, "Oh, professo r, professor, thank you so much for your', metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'})
注意,前两个结果是相同的。这是因为我们之前有意复制了第一讲的 PDF,导致相同的信息出现在两个不同的块中。理想情况下,我们希望检索不同的块。
2.未捕获结构化信息
question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question, k=5)
for doc in docs:
print(doc.metadata)
# {'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
# {'page': 14, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
# {'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf'}
# {'page': 6, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
# {'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
print(docs[4].page_content)
into his office and he said, "Oh, professo r, professor, thank you so much for your
machine learning class. I learned so much from it. There's this stuff that I learned in your
class, and I now use every day. And it's help ed me make lots of money, and here's a
picture of my big house."
So my friend was very excited. He said, "W ow. That's great. I'm glad to hear this
machine learning stuff was actually useful. So what was it that you learned? Was it
logistic regression? Was it the PCA? Was it the data ne tworks? What was it that you
learned that was so helpful?" And the student said, "Oh, it was the MATLAB."
So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard,
and we'll actually have a short MATLAB tutori al in one of the discussion sections for
those of you that don't know it.
Okay. The very last piece of logistical th ing is the discussion s ections. So discussion
sections will be taught by the TAs, and atte ndance at discussion sections is optional,
although they'll also be recorded and televi sed. And we'll use the discussion sections
mainly for two things. For the next two or th ree weeks, we'll use the discussion sections
to go over the prerequisites to this class or if some of you haven't seen probability or
statistics for a while or maybe algebra, we'll go over those in the discussion sections as a
refresher for those of you that want one.
在这种情况下,我们期望所有检索到的文件都来自问题中指定的第三讲。但是,我们看到结果也包括来自其他讲座的块。这里的直觉是,关于仅查询第三讲的结构化信息并未在语义嵌入中捕获,语义嵌入更侧重于回归本身的概念。
小节
在这篇博文中,我们介绍了使用向量存储和嵌入进行语义搜索的基础知识,以及可能出现的一些边缘情况和故障模式。在下一篇博客中,我们将讨论如何解决这些故障模式并增强我们的检索能力,确保我们检索相关和不同的块,同时将结构化信息合并到搜索过程中。
如何系统的去学习大模型LLM ?
作为一名热心肠的互联网老兵,我意识到有很多经验和知识值得分享给大家,也可以通过我们的能力和经验解答大家在人工智能学习中的很多困惑,所以在工作繁忙的情况下还是坚持各种整理和分享。
但苦于知识传播途径有限,很多互联网行业朋友无法获得正确的资料得到学习提升,故此将并将重要的 AI大模型资料
包括AI大模型入门学习思维导图、精品AI大模型学习书籍手册、视频教程、实战学习等录播视频免费分享出来。
所有资料 ⚡️ ,朋友们如果有需要全套 《LLM大模型入门+进阶学习资源包》,扫码获取~
一、全套AGI大模型学习路线
AI大模型时代的学习之旅:从基础到前沿,掌握人工智能的核心技能!
二、640套AI大模型报告合集
这套包含640份报告的合集,涵盖了AI大模型的理论研究、技术实现、行业应用等多个方面。无论您是科研人员、工程师,还是对AI大模型感兴趣的爱好者,这套报告合集都将为您提供宝贵的信息和启示。
三、AI大模型经典PDF籍
随着人工智能技术的飞速发展,AI大模型已经成为了当今科技领域的一大热点。这些大型预训练模型,如GPT-3、BERT、XLNet等,以其强大的语言理解和生成能力,正在改变我们对人工智能的认识。 那以下这些PDF籍就是非常不错的学习资源。
四、AI大模型商业化落地方案
阶段1:AI大模型时代的基础理解
- 目标:了解AI大模型的基本概念、发展历程和核心原理。
- 内容:
- L1.1 人工智能简述与大模型起源
- L1.2 大模型与通用人工智能
- L1.3 GPT模型的发展历程
- L1.4 模型工程
- L1.4.1 知识大模型
- L1.4.2 生产大模型
- L1.4.3 模型工程方法论
- L1.4.4 模型工程实践 - L1.5 GPT应用案例
阶段2:AI大模型API应用开发工程
- 目标:掌握AI大模型API的使用和开发,以及相关的编程技能。
- 内容:
- L2.1 API接口
- L2.1.1 OpenAI API接口
- L2.1.2 Python接口接入
- L2.1.3 BOT工具类框架
- L2.1.4 代码示例 - L2.2 Prompt框架
- L2.2.1 什么是Prompt
- L2.2.2 Prompt框架应用现状
- L2.2.3 基于GPTAS的Prompt框架
- L2.2.4 Prompt框架与Thought
- L2.2.5 Prompt框架与提示词 - L2.3 流水线工程
- L2.3.1 流水线工程的概念
- L2.3.2 流水线工程的优点
- L2.3.3 流水线工程的应用 - L2.4 总结与展望
- L2.1 API接口
阶段3:AI大模型应用架构实践
- 目标:深入理解AI大模型的应用架构,并能够进行私有化部署。
- 内容:
- L3.1 Agent模型框架
- L3.1.1 Agent模型框架的设计理念
- L3.1.2 Agent模型框架的核心组件
- L3.1.3 Agent模型框架的实现细节 - L3.2 MetaGPT
- L3.2.1 MetaGPT的基本概念
- L3.2.2 MetaGPT的工作原理
- L3.2.3 MetaGPT的应用场景 - L3.3 ChatGLM
- L3.3.1 ChatGLM的特点
- L3.3.2 ChatGLM的开发环境
- L3.3.3 ChatGLM的使用示例 - L3.4 LLAMA
- L3.4.1 LLAMA的特点
- L3.4.2 LLAMA的开发环境
- L3.4.3 LLAMA的使用示例 - L3.5 其他大模型介绍
- L3.1 Agent模型框架
阶段4:AI大模型私有化部署
- 目标:掌握多种AI大模型的私有化部署,包括多模态和特定领域模型。
- 内容:
- L4.1 模型私有化部署概述
- L4.2 模型私有化部署的关键技术
- L4.3 模型私有化部署的实施步骤
- L4.4 模型私有化部署的应用场景
学习计划:
- 阶段1:1-2个月,建立AI大模型的基础知识体系。
- 阶段2:2-3个月,专注于API应用开发能力的提升。
- 阶段3:3-4个月,深入实践AI大模型的应用架构和私有化部署。
- 阶段4:4-5个月,专注于高级模型的应用和部署。
这份完整版的所有 ⚡️ 大模型 LLM 学习资料已经上传CSDN,朋友们如果需要可以微信扫描下方CSDN官方认证二维码免费领取【保证100%免费
】
全套 《LLM大模型入门+进阶学习资源包》↓↓↓ 获取~
更多推荐
所有评论(0)