
Reading PDFs with LangChain (Part 1)

renee创业狗 Renee 创业随笔
2024-10-09

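Today I'll share how LangChain can interact with a PDF to do something similar to ChatPDF.

The code you need is here: https://colab.research.google.com/drive/13FpBqmhYa5Ex4smVhivfEhk2k4S5skwG

As before, copy it into your own Google Drive and run it step by step.

We'll use LangChain to read Reid Hoffman's new book from this year, "Impromptu: Amplifying Our Humanity Through AI".

Once the PDF has been loaded, you can start asking all kinds of questions: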

len(raw_text)

Number of characters in the book: 356630

raw_text[:100]

Print the first 100 characters:

Impromptu

Amplifying Our Humanity

Through AI

By Reid Hoffman

with GPT-4Impromptu: AmplIfyIng our
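
The snippets above assume raw_text already holds the book's text; the loading step itself is not shown in this post. Here is a minimal sketch of how it could be built, assuming a PDF reader whose page objects expose an extract_text() method (as the pages of PyPDF2's PdfReader do). FakePage, pages_to_text, and the file name in the comment are hypothetical, introduced only so the example runs without a PDF file.

```python
class FakePage:
    """Hypothetical stand-in for a PDF page object (not part of any library)."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def pages_to_text(pages):
    """Concatenate the text of every page, skipping pages with no text."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # some readers return None or "" for pages without extractable text
            parts.append(text)
    return "".join(parts)


# With a real file this might look like the following (assumption, needs PyPDF2):
#   from PyPDF2 import PdfReader
#   raw_text = pages_to_text(PdfReader("impromptu.pdf").pages)
pages = [FakePage("Impromptu\n"), FakePage(None), FakePage("By Reid Hoffman\n")]
raw_text = pages_to_text(pages)
print(len(raw_text))
print(raw_text[:100])
```

Joining pages with no separator would be consistent with the output above, where the end of one page ("with GPT-4") runs straight into the start of the next ("Impromptu: ...").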

How it works is shown in the diagram below:




Before asking questions, the PDF has to be processed first:

  1. Text Splitter
  2. Embeddings

Text Splitter

When working with long text, it needs to be split into smaller chunks. Here we use CharacterTextSplitter to split the raw text.

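from langchain.text_splitter import CharacterTextSplitter

# Splitting up the text into smaller chunks for indexing
text_splitter = CharacterTextSplitter(
    separator="\n",        # split on line breaks
    chunk_size=1000,       # maximum chunk length, measured by length_function
    chunk_overlap=200,     # striding over the text
    length_function=len,
)
texts = text_splitter.split_text(raw_text)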
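
To get a feel for what chunk_size and chunk_overlap do, here is a simplified pure-Python sketch of greedy splitting with overlap. It is an illustration under my own assumptions, not CharacterTextSplitter's actual implementation; split_with_overlap is a hypothetical helper.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Pack separator-delimited pieces into chunks of at most chunk_size
    characters, carrying up to chunk_overlap characters into the next chunk.
    A single piece longer than chunk_size is emitted as its own oversized chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only a tail of the finished chunk that fits the overlap budget
            while current and joined_len(current) > chunk_overlap:
                current.pop(0)
            # Shrink the tail further if the new piece would not fit next to it
            while current and joined_len(current + [piece]) > chunk_size:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


chunks = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=10, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk in this toy run repeats the last line of the previous one. That repetition is the point of the overlap: a sentence cut at a chunk boundary still appears whole in at least one chunk.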

Making the embeddings

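OpenAI's embeddings are mainly used for natural language processing (NLP) tasks. An embedding converts a word or phrase into a fixed-size vector. These vectors capture the semantic and syntactic relationships between pieces of text, which helps improve performance on NLP tasks such as text classification, sentiment analysis, named entity recognition, and machine translation. Here, they let us find the chunks whose meaning is closest to a question.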

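from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
# Embed every chunk and index the vectors in FAISS
embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_texts(texts, embeddings)
docsearch.embedding_function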

Now ask: how does GPT-4 change education?

query = "how does GPT-4 change education?"
docs = docsearch.similarity_search(query)
docs[0]

This retrieves the passages of the book that are relevant to the topic.

Document(page_content='told a reporter from Quartz that the precise script “lets you talk \nless and engage students more.”\nDespite these successes, Bridge’s financial and business model \nfailed. Rather than growing from 100,000 students to ten \nmillion, per the plan, the company has exited business lines \nand shifted its model to licensing its ideas to governments.\nYet as we think of tools-plus-teachers helping as many as 600 \nmillion children worldwide who can’t currently expect to get \nany real education, the possibilities of LLMs on top of the New -\nGlobe tablet model become exciting.43Education\nReid: GPT-4, can you write the next 300 words from your \nperspective as a large language model and describe \nhow you could extend the Bridge/NewGlobe tablets into \na tool to help tens of millions of students throughout \nAfrica and Asia?\nGPT-4:  As a large language model, I can help by enhanc -\ning the quality, personalization, and scalability of their \nlearning content. Here is how I would do it:', metadata={})
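
To illustrate the idea behind the similarity search, here is a toy sketch with no external dependencies. It swaps in a bag-of-words count vector for the OpenAI embedding and a brute-force cosine ranking for the FAISS index, so it shows the principle only; toy_embed, cosine, and toy_similarity_search are hypothetical names, and the real index may use a different distance metric.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_similarity_search(query, texts, k=4):
    """Return the k texts most similar to the query, best match first."""
    q = toy_embed(query)
    ranked = sorted(texts, key=lambda t: cosine(q, toy_embed(t)), reverse=True)
    return ranked[:k]


texts = [
    "GPT-4 can help teachers personalize education for students",
    "Bridge shifted its business model to licensing",
    "Journalism and social media in the age of AI",
]
top = toy_similarity_search("how does GPT-4 change education", texts, k=1)
print(top[0])
```

A real embedding model goes further than word overlap: it can match a question to a passage that shares its meaning but none of its words, which is why the notebook uses OpenAIEmbeddings rather than counts like these.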



👆 With the preparation done, we can start using LangChain's functions to interact with the PDF.

Plain QA Chain

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# we are going to stuff all the docs in at once
chain = load_qa_chain(OpenAI(), chain_type="stuff")
# check the prompt
chain.llm_chain.prompt.template
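
Conceptually, the "stuff" chain type pastes every retrieved chunk into a single prompt and sends it to the LLM in one call. A rough sketch of that idea follows. TEMPLATE and build_stuff_prompt are hypothetical, and the template wording is my own illustration; the actual wording is whatever chain.llm_chain.prompt.template prints in the notebook.

```python
# Hypothetical template, for illustration only
TEMPLATE = (
    "Use the following context to answer the question. "
    "If you don't know the answer, say that you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(doc_texts, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(doc_texts)
    return TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["By Reid Hoffman with GPT-4", "Impromptu: Amplifying Our Humanity Through AI"],
    "who is the author of the book?",
)
print(prompt)
```

Because everything goes into one prompt, this approach is simple and needs only one LLM call, but the retrieved chunks plus the question have to fit inside the model's context window.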

Now we can ask all kinds of questions based on the PDF.

For example, ask: who is the author of the book?

query = "who is the author of the book?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

The answer is correct:

The author of the book is Reid Hoffman.

If you ask about something the book doesn't mention, the answer is that it doesn't know.

For example, ask: has it rained this week?

query_02 = "has it rained this week?"
docs_02 = docsearch.similarity_search(query_02)
chain.run(input_documents=docs_02, question=query_02)

The reply is that it doesn't know:

No, I don't know if it has rained this week.

That wraps up today's part: the preparation work and asking questions. Tomorrow I'll share different ways to "interrogate" this PDF.

