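Reading PDFs with LangChain (Part 1)
Today I'll show how LangChain can interact with a PDF to do something similar to ChatPDF.
The code you need is here: https://colab.research.google.com/drive/13FpBqmhYa5Ex4smVhivfEhk2k4S5skwG
As before, copy it to your own Google Drive and run it step by step.
We'll use LangChain to read Reid Hoffman's new book from this year, "Impromptu: Amplifying Our Humanity Through AI".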
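This post does not show how raw_text gets built, so here is a minimal sketch of one way to do it, assuming the pypdf package is installed. The file name impromptu.pdf and the helper names join_pages and read_pdf_text are hypothetical, and the notebook itself may load the file differently.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Concatenate per-page text, skipping pages with no extractable text."""
    return "".join(text for text in page_texts if text)


def read_pdf_text(path: str) -> str:
    """Extract the text of every page in the PDF at `path`."""
    # Imported lazily so the helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return join_pages(page.extract_text() for page in reader.pages)


# Hypothetical file name; point it at wherever your copy of the PDF lives.
# raw_text = read_pdf_text("impromptu.pdf")
```

Pages are joined without a separator here, so the last word of one page can run straight into the first word of the next.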
Once the PDF has been read in, we can start asking it all sorts of questions:
len(raw_text)
Length of the book in characters: 356630
raw_text[:100]
Print the first 100 characters:
Impromptu
Amplifying Our Humanity
Through AI
By Reid Hoffman
with GPT-4Impromptu: AmplIfyIng our
The principle is shown in the diagram below:
Before asking questions, the PDF has to be processed in two steps:
Text Splitter, then Embeddings
Text Splitter
When the text is long, it has to be split into smaller chunks before it can be processed. Here we use CharacterTextSplitter to split the raw text.
# Splitting up the text into smaller chunks for indexing
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,      # maximum characters per chunk
    chunk_overlap=200,    # striding over the text
    length_function=len,  # measure chunk length in characters
)
texts = text_splitter.split_text(raw_text)
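To get a feel for what chunk_size and chunk_overlap do, here is a small sketch using a plain character sliding window. It is not how CharacterTextSplitter works internally (that splits on the separator first and then merges the pieces); the function name sliding_chunks is made up for this illustration.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last 4 characters of the one before it, so a sentence that straddles a boundary still appears whole in at least one chunk.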
Making the embeddings
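OpenAI's embeddings are mainly used for natural language processing (NLP) tasks. An embedding converts a word or phrase into a fixed-size vector. These vectors capture semantic and syntactic relationships between pieces of text, which helps with tasks such as text classification, sentiment analysis, named entity recognition and machine translation. Here we use them to find the chunks of the book that are closest in meaning to a question.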
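from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()  # requires the OPENAI_API_KEY environment variable
docsearch = FAISS.from_texts(texts, embeddings)
docsearch.embedding_function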
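The idea behind the vector store can be sketched without any API key: embed every chunk, embed the query, and return the chunks whose vectors are closest. In the sketch below a word-count vector stands in for OpenAI's embeddings and cosine similarity stands in for whatever distance the FAISS index uses, so toy_embed, the vocabulary and the sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def toy_embed(text: str, vocabulary: List[str]) -> List[float]:
    """Stand-in embedding: count how often each vocabulary word occurs."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def toy_similarity_search(query: str, texts: List[str],
                          vocabulary: List[str], k: int = 4) -> List[str]:
    """Return the k texts whose vectors are most similar to the query's."""
    query_vec = toy_embed(query, vocabulary)
    scored = [(cosine_similarity(query_vec, toy_embed(t, vocabulary)), t)
              for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


vocabulary = ["education", "teachers", "students", "rain", "weather"]
sample_texts = [
    "teachers and students use tablets in education",
    "the weather report says rain",
    "a chapter about creativity",
]
best = toy_similarity_search("how does gpt-4 change education",
                             sample_texts, vocabulary, k=1)
print(best)
```

Real embeddings place texts with similar meaning close together even when they share no words, which a word-count vector cannot do.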
Now ask: how does GPT-4 change education?
query = "how does GPT-4 change education?"
docs = docsearch.similarity_search(query)
docs[0]
This pulls out the passage of the book that is most relevant to the topic.
Document(page_content='told a reporter from Quartz that the precise script “lets you talk \nless and engage students more.”\nDespite these successes, Bridge’s financial and business model \nfailed. Rather than growing from 100,000 students to ten \nmillion, per the plan, the company has exited business lines \nand shifted its model to licensing its ideas to governments.\nYet as we think of tools-plus-teachers helping as many as 600 \nmillion children worldwide who can’t currently expect to get \nany real education, the possibilities of LLMs on top of the New -\nGlobe tablet model become exciting.43Education\nReid: GPT-4, can you write the next 300 words from your \nperspective as a large language model and describe \nhow you could extend the Bridge/NewGlobe tablets into \na tool to help tens of millions of students throughout \nAfrica and Asia?\nGPT-4: As a large language model, I can help by enhanc -\ning the quality, personalization, and scalability of their \nlearning content. Here is how I would do it:', metadata={})
👆 With the preparation done, we can start using LangChain's functions to interact with the PDF.
Plain QA Chain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(),
                      chain_type="stuff")  # we are going to stuff all the docs in at once
# check the prompt
chain.llm_chain.prompt.template
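Roughly speaking, the "stuff" chain pastes all the retrieved chunks into a single prompt and sends it to the model in one call. Below is a minimal sketch of that idea. The template text is modelled on what the notebook prints for chain.llm_chain.prompt.template, but treat the exact wording as an assumption, and note that build_stuff_prompt is a hypothetical helper, not a LangChain function.

```python
from typing import List

# Assumed wording; check it against the template your own notebook prints.
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks: List[str], question: str) -> str:
    """Join every chunk into one context block and fill in the template."""
    context = "\n\n".join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["By Reid Hoffman", "with GPT-4"],
    "who is the author of the book?",
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the retrieved chunks fit within the model's context window. An instruction like the one in this template is also why the model answers "I don't know" when the chunks do not contain the answer.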
Now we can ask all sorts of questions based on the PDF.
For example, ask: who is the author of the book?
query = "who is the author of the book?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)
The answer is correct:
The author of the book is Reid Hoffman.
If you ask about something the book does not cover, the answer is that it doesn't know.
For example, ask: has it rained this week?
query_02 = "has it rained this week?"
docs_02 = docsearch.similarity_search(query_02)
chain.run(input_documents=docs_02, question=query_02)
The reply is that it doesn't know:
No, I don't know if it has rained this week.
That's it for today: the preparation work and asking questions. Tomorrow I'll share different ways to "interrogate" this PDF.