LLMs: Question Answering with Embeddings-Based Search

Original 程序员叶同学 HelloTech技术派 2024-03-16

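GPT excels at answering questions, but only about topics it remembers from its training data.

What should you do if you want GPT to answer questions about topics it is unfamiliar with? For example:

Recent events after September 2021
Private, domain-specific data
Information from past conversations
...

This article shares a "search-ask" method that lets GPT answer questions using reference text.

Search: search a library of text for relevant text sections
Ask: insert the retrieved text sections into a message to GPT and ask it the question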

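Why search is better than fine-tuning

GPT can learn knowledge in two ways:

Via model weights (i.e., fine-tuning the model on a training set)
Via model inputs (i.e., inserting the knowledge into an input message)

Although fine-tuning can feel like the more natural option (training on data is, after all, how GPT learned all of its other knowledge), it is generally not recommended as a way to teach the model knowledge. Fine-tuning is better suited to teaching specialized tasks or styles, and is less reliable for factual recall.

As an analogy, model weights are like long-term memory. Fine-tuning a model is like studying for an exam that is a week away. When the exam arrives, the model may have forgotten details, or may misremember facts it never read.

Message inputs, in contrast, are like short-term memory. Inserting knowledge into a message is like taking an exam with open notes. With the notes in hand, the model is more likely to arrive at the correct answer.

One downside of text search relative to fine-tuning is that each model is limited by the maximum amount of text it can read at once: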

Model	Maximum text length
`gpt-3.5-turbo`	4,096 tokens (~5 pages)
`gpt-4`	8,192 tokens (~10 pages)
`gpt-4-32k`	32,768 tokens (~40 pages)
`gpt-4-1106-preview`	128K tokens

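Continuing the analogy, you can think of the model as a student who can only look at a few pages of notes at a time, despite potentially having shelves of textbooks to draw upon.

Therefore, to build a system capable of drawing upon large quantities of text to answer questions, we recommend using a "search-ask" approach.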
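To make the limits in the table concrete, here is a minimal sketch that checks whether a body of text could plausibly fit into each model's context window. The figure of roughly 4 characters per token is only a rule of thumb for English text and is an assumption of this sketch, as are the helper names `estimate_tokens` and `fits_in_context`. For exact counts, use a real tokenizer such as tiktoken.

```python
# Context limits copied from the table above (in tokens).
CONTEXT_LIMITS = {
    "gpt-3.5-turbo": 4_096,
    "gpt-4": 8_192,
    "gpt-4-32k": 32_768,
    "gpt-4-1106-preview": 128_000,
}

# Assumption: roughly 4 characters per token for English text.
# This is a rough heuristic, not an exact count.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on character count."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(text: str, model: str, reserved_for_answer: int = 500) -> bool:
    """Return True if the text likely fits, leaving room for the reply."""
    return estimate_tokens(text) + reserved_for_answer <= CONTEXT_LIMITS[model]


library = "x" * 400_000  # stand-in for a 400,000-character text library
for model in CONTEXT_LIMITS:
    print(model, fits_in_context(library, model))
```

Under these assumptions, a text library of even a few hundred thousand characters exceeds most of the limits, which is why only the most relevant sections are passed to the model.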

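Search

Text can be searched in many ways. For example:

Lexical-based search
Graph-based search
Embedding-based search

Consider embedding-only search as a starting point for your system. Better search systems might combine multiple search methods, along with signals such as popularity, recency, user history, redundancy with prior search results, and click-rate data. Q&A retrieval performance may also be improved with techniques like HyDE, in which a question is first transformed into a hypothetical answer before being embedded. Similarly, GPT can potentially improve search results by automatically transforming a question into a set of keywords or search terms.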
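As an illustration of combining multiple search methods, the sketch below fuses two ranked result lists with reciprocal rank fusion (RRF). RRF is just one possible fusion technique and is not something this article's pipeline uses. The document IDs, the two input rankings, and the constant `k = 60` are all assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Higher fused scores rank first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from two retrievers for the same query.
lexical_ranking = ["doc_b", "doc_a", "doc_d"]
embedding_ranking = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([lexical_ranking, embedding_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever are still kept.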

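Full procedure

Prepare search data (once per document)

Collect: we will download a few hundred Wikipedia articles about the 2022 Olympics
Chunk: documents are split into short, mostly self-contained sections to be embedded
Embed: each section is embedded with the OpenAI API
Store: embeddings are saved (for large datasets, use a vector database)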
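The chunking step can be sketched as follows. This is a simplified illustration, not the method used to build the dataset in this article: it measures chunk size in characters rather than tokens and breaks on blank lines, and the function name `chunk_text` and the `max_chars` parameter are assumptions of the sketch.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on blank lines.

    Paragraphs are packed greedily. A single paragraph longer than max_chars
    is hard-split so that no chunk exceeds the limit.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        # Hard-split oversized paragraphs into pieces of max_chars.
        pieces = [paragraph[i:i + max_chars] for i in range(0, len(paragraph), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # the piece still fits in the current chunk
            else:
                chunks.append(current)  # close the current chunk, start a new one
                current = piece
    if current:
        chunks.append(current)
    return chunks


document = (
    "Curling was held in Beijing.\n\n"
    "Ten nations competed.\n\n"
    "Sweden won the men's event."
)
for chunk in chunk_text(document, max_chars=60):
    print(repr(chunk))
```

Keeping paragraphs whole where possible helps each chunk stay mostly self-contained, which tends to matter for both embedding quality and readability once the chunk is pasted into a prompt.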

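Search (once per query)

Given a user question, generate an embedding for the query from the OpenAI API
Using the embeddings, rank the text sections by relevance to the query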
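The ranking step boils down to comparing vectors. The sketch below ranks a few text sections by cosine similarity to a query vector using only the standard library. The three-dimensional vectors are made up for illustration (real embeddings have many more dimensions and come from an embedding model), and the helper names are assumptions of the sketch.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, sections, top_n=2):
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in sections]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up 3-dimensional "embeddings" for illustration only.
sections = [
    ("curling medal table", [0.9, 0.1, 0.0]),
    ("ski jumping schedule", [0.1, 0.9, 0.0]),
    ("opening ceremony", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]

for text, score in rank_by_similarity(query_vec, sections):
    print(f"{score:.3f}  {text}")
```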

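Ask (once per query)

Insert the question and the most relevant sections into a message to GPT
Return GPT's answer

Environment setup

Import the necessary libraries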

# imports
import ast  # for converting embeddings saved as strings back to arrays
from openai import OpenAI # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
import os # for getting API token from env variable OPENAI_API_KEY
from scipy import spatial  # for calculating vector similarities for search

# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

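GPT cannot answer questions about current events

Because the training data for gpt-3.5-turbo and gpt-4 mostly ends in September 2021, the models cannot answer questions about more recent events, such as the 2022 Winter Olympics.

For example, let's try asking "Which athletes won the gold medal in curling in 2022?"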

# an example question about the 2022 Olympics
query = 'Which athletes won the gold medal in curling at the 2022 Winter Olympics?'

response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the 2022 Winter Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

Model output

As an AI language model, I don't have real-time data. However, I can provide you with general information. The gold medalists in curling at the 2022 Winter Olympics will be determined during the event. The winners will be the team that finishes in first place in the respective men's and women's curling competitions. To find out the specific gold medalists, you can check the official Olympic website or reliable news sources for the most up-to-date information.

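In this case, the model has no knowledge of 2022 and cannot answer the question.

Inserting knowledge about the topic into the input message

To help give the model knowledge of curling at the 2022 Winter Olympics, we can copy and paste the top half of a relevant Wikipedia article into the message: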

# text copied and pasted from: https://en.wikipedia.org/wiki/Curling_at_the_2022_Winter_Olympics
# I didn't bother to format or clean the text, but GPT will still understand it
# the entire article is too long for gpt-3.5-turbo, so I only included the top few sections

wikipedia_article_on_curling = """Curling at the 2022 Winter Olympics

Article
Talk
Read
Edit
View history
From Wikipedia, the free encyclopedia
Curling
at the XXIV Olympic Winter Games
Curling pictogram.svg
Curling pictogram
Venue	Beijing National Aquatics Centre
Dates	2–20 February 2022
No. of events	3 (1 men, 1 women, 1 mixed)
Competitors	114 from 14 nations
← 20182026 →
Men's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Sweden
2nd place, silver medalist(s)		 Great Britain
3rd place, bronze medalist(s)		 Canada
Women's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Great Britain
2nd place, silver medalist(s)		 Japan
3rd place, bronze medalist(s)		 Sweden
Mixed doubles's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Italy
2nd place, silver medalist(s)		 Norway
3rd place, bronze medalist(s)		 Sweden
Curling at the
2022 Winter Olympics
Curling pictogram.svg
Qualification
Statistics
Tournament
Men
Women
Mixed doubles
vte
The curling competitions of the 2022 Winter Olympics were held at the Beijing National Aquatics Centre, one of the Olympic Green venues. Curling competitions were scheduled for every day of the games, from February 2 to February 20.[1] This was the eighth time that curling was part of the Olympic program.

In each of the men's, women's, and mixed doubles competitions, 10 nations competed. The mixed doubles competition was expanded for its second appearance in the Olympics.[2] A total of 120 quota spots (60 per sex) were distributed to the sport of curling, an increase of four from the 2018 Winter Olympics.[3] A total of 3 events were contested, one for men, one for women, and one mixed.[4]

Qualification
Main article: Curling at the 2022 Winter Olympics – Qualification
Qualification to the Men's and Women's curling tournaments at the Winter Olympics was determined through two methods (in addition to the host nation). Nations qualified teams by placing in the top six at the 2021 World Curling Championships. Teams could also qualify through Olympic qualification events which were held in 2021. Six nations qualified via World Championship qualification placement, while three nations qualified through qualification events. In men's and women's play, a host will be selected for the Olympic Qualification Event (OQE). They would be joined by the teams which competed at the 2021 World Championships but did not qualify for the Olympics, and two qualifiers from the Pre-Olympic Qualification Event (Pre-OQE). The Pre-OQE was open to all member associations.[5]

For the mixed doubles competition in 2022, the tournament field was expanded from eight competitor nations to ten.[2] The top seven ranked teams at the 2021 World Mixed Doubles Curling Championship qualified, along with two teams from the Olympic Qualification Event (OQE) – Mixed Doubles. This OQE was open to a nominated host and the fifteen nations with the highest qualification points not already qualified to the Olympics. As the host nation, China qualified teams automatically, thus making a total of ten teams per event in the curling tournaments.[6]

Summary
Nations	Men	Women	Mixed doubles	Athletes
 Australia			Yes	2
 Canada	Yes	Yes	Yes	12
 China	Yes	Yes	Yes	12
 Czech Republic			Yes	2
 Denmark	Yes	Yes		10
 Great Britain	Yes	Yes	Yes	10
 Italy	Yes		Yes	6
 Japan		Yes		5
 Norway	Yes		Yes	6
 ROC	Yes	Yes		10
 South Korea		Yes		5
 Sweden	Yes	Yes	Yes	11
 Switzerland	Yes	Yes	Yes	12
 United States	Yes	Yes	Yes	11
Total: 14 NOCs	10	10	10	114
Competition schedule

The Beijing National Aquatics Centre served as the venue of the curling competitions.
Curling competitions started two days before the Opening Ceremony and finished on the last day of the games, meaning the sport was the only one to have had a competition every day of the games. The following was the competition schedule for the curling competitions:

RR	Round robin	SF	Semifinals	B	3rd place play-off	F	Final
Date
Event
Wed 2	Thu 3	Fri 4	Sat 5	Sun 6	Mon 7	Tue 8	Wed 9	Thu 10	Fri 11	Sat 12	Sun 13	Mon 14	Tue 15	Wed 16	Thu 17	Fri 18	Sat 19	Sun 20
Men's tournament								RR	RR	RR	RR	RR	RR	RR	RR	RR	SF	B	F	
Women's tournament									RR	RR	RR	RR	RR	RR	RR	RR	SF	B	F
Mixed doubles	RR	RR	RR	RR	RR	RR	SF	B	F												
Medal summary
Medal table
Rank	Nation	Gold	Silver	Bronze	Total
1	 Great Britain	1	1	0	2
2	 Sweden	1	0	2	3
3	 Italy	1	0	0	1
4	 Japan	0	1	0	1
 Norway	0	1	0	1
6	 Canada	0	0	1	1
Totals (6 entries)	3	3	3	9
Medalists
Event	Gold	Silver	Bronze
Men
details	 Sweden
Niklas Edin
Oskar Eriksson
Rasmus Wranå
Christoffer Sundgren
Daniel Magnusson	 Great Britain
Bruce Mouat
Grant Hardie
Bobby Lammie
Hammy McMillan Jr.
Ross Whyte	 Canada
Brad Gushue
Mark Nichols
Brett Gallant
Geoff Walker
Marc Kennedy
Women
details	 Great Britain
Eve Muirhead
Vicky Wright
Jennifer Dodds
Hailey Duff
Mili Smith	 Japan
Satsuki Fujisawa
Chinami Yoshida
Yumi Suzuki
Yurika Yoshida
Kotomi Ishizaki	 Sweden
Anna Hasselborg
Sara McManus
Agnes Knochenhauer
Sofia Mabergs
Johanna Heldin
Mixed doubles
details	 Italy
Stefania Constantini
Amos Mosaner	 Norway
Kristin Skaslien
Magnus Nedregotten	 Sweden
Almida de Val
Oskar Eriksson
Teams
Men
 Canada	 China	 Denmark	 Great Britain	 Italy
Skip: Brad Gushue
Third: Mark Nichols
Second: Brett Gallant
Lead: Geoff Walker
Alternate: Marc Kennedy

Skip: Ma Xiuyue
Third: Zou Qiang
Second: Wang Zhiyu
Lead: Xu Jingtao
Alternate: Jiang Dongxu

Skip: Mikkel Krause
Third: Mads Nørgård
Second: Henrik Holtermann
Lead: Kasper Wiksten
Alternate: Tobias Thune

Skip: Bruce Mouat
Third: Grant Hardie
Second: Bobby Lammie
Lead: Hammy McMillan Jr.
Alternate: Ross Whyte

Skip: Joël Retornaz
Third: Amos Mosaner
Second: Sebastiano Arman
Lead: Simone Gonin
Alternate: Mattia Giovanella

 Norway	 ROC	 Sweden	 Switzerland	 United States
Skip: Steffen Walstad
Third: Torger Nergård
Second: Markus Høiberg
Lead: Magnus Vågberg
Alternate: Magnus Nedregotten

Skip: Sergey Glukhov
Third: Evgeny Klimov
Second: Dmitry Mironov
Lead: Anton Kalalb
Alternate: Daniil Goriachev

Skip: Niklas Edin
Third: Oskar Eriksson
Second: Rasmus Wranå
Lead: Christoffer Sundgren
Alternate: Daniel Magnusson

Fourth: Benoît Schwarz
Third: Sven Michel
Skip: Peter de Cruz
Lead: Valentin Tanner
Alternate: Pablo Lachat

Skip: John Shuster
Third: Chris Plys
Second: Matt Hamilton
Lead: John Landsteiner
Alternate: Colin Hufman

Women
 Canada	 China	 Denmark	 Great Britain	 Japan
Skip: Jennifer Jones
Third: Kaitlyn Lawes
Second: Jocelyn Peterman
Lead: Dawn McEwen
Alternate: Lisa Weagle

Skip: Han Yu
Third: Wang Rui
Second: Dong Ziqi
Lead: Zhang Lijun
Alternate: Jiang Xindi

Skip: Madeleine Dupont
Third: Mathilde Halse
Second: Denise Dupont
Lead: My Larsen
Alternate: Jasmin Lander

Skip: Eve Muirhead
Third: Vicky Wright
Second: Jennifer Dodds
Lead: Hailey Duff
Alternate: Mili Smith

Skip: Satsuki Fujisawa
Third: Chinami Yoshida
Second: Yumi Suzuki
Lead: Yurika Yoshida
Alternate: Kotomi Ishizaki

 ROC	 South Korea	 Sweden	 Switzerland	 United States
Skip: Alina Kovaleva
Third: Yulia Portunova
Second: Galina Arsenkina
Lead: Ekaterina Kuzmina
Alternate: Maria Komarova

Skip: Kim Eun-jung
Third: Kim Kyeong-ae
Second: Kim Cho-hi
Lead: Kim Seon-yeong
Alternate: Kim Yeong-mi

Skip: Anna Hasselborg
Third: Sara McManus
Second: Agnes Knochenhauer
Lead: Sofia Mabergs
Alternate: Johanna Heldin

Fourth: Alina Pätz
Skip: Silvana Tirinzoni
Second: Esther Neuenschwander
Lead: Melanie Barbezat
Alternate: Carole Howald

Skip: Tabitha Peterson
Third: Nina Roth
Second: Becca Hamilton
Lead: Tara Peterson
Alternate: Aileen Geving

Mixed doubles
 Australia	 Canada	 China	 Czech Republic	 Great Britain
Female: Tahli Gill
Male: Dean Hewitt

Female: Rachel Homan
Male: John Morris

Female: Fan Suyuan
Male: Ling Zhi

Female: Zuzana Paulová
Male: Tomáš Paul

Female: Jennifer Dodds
Male: Bruce Mouat

 Italy	 Norway	 Sweden	 Switzerland	 United States
Female: Stefania Constantini
Male: Amos Mosaner

Female: Kristin Skaslien
Male: Magnus Nedregotten

Female: Almida de Val
Male: Oskar Eriksson

Female: Jenny Perret
Male: Martin Rios

Female: Vicky Persinger
Male: Chris Plys
"""

query = f"""Use the below article on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found, write "I don't know."

Article:
\"\"\"
{wikipedia_article_on_curling}
\"\"\"

Question: Which athletes won the gold medal in curling at the 2022 Winter Olympics?"""

response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the 2022 Winter Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

Thanks to the Wikipedia article included in the input message, GPT answers correctly.

In the men's curling event, the gold medal was won by Sweden. In the women's curling event, the gold medal was won by Great Britain. In the mixed doubles curling event, the gold medal was won by Italy.

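How to inject knowledge automatically with embeddings-based search

Prepare search data

To save time and expense, a pre-embedded dataset has been prepared, containing a few hundred Wikipedia articles about the 2022 Winter Olympics.

To see how this dataset was built, or to modify it yourself, see "Embedding Wikipedia articles for search".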

# download pre-chunked text and pre-computed embeddings
# this file is ~200 MB, so may take a minute depending on your connection speed
embeddings_path = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"

df = pd.read_csv(embeddings_path)

# convert embeddings from CSV str type back to list type
df['embedding'] = df['embedding'].apply(ast.literal_eval)

# the dataframe has two columns: "text" and "embedding"
df

	text	embedding
0	Lviv bid for the 2022 Winter Olympics	-0.005021067801862955, 0.00026050032465718687...
1	Lviv bid for the 2022 Winter Olympics	0.0033927420154213905, -0.007447326090186834,...
2	Lviv bid for the 2022 Winter Olympics	-0.00915789045393467, -0.008366798982024193, ...
3	Lviv bid for the 2022 Winter Olympics	0.0030951891094446182, -0.006064314860850573,...
4	Lviv bid for the 2022 Winter Olympics	-0.002936174161732197, -0.006185177247971296,...
...	...	...
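The `ast.literal_eval` step above is needed because a CSV file has no list type, so each embedding is stored as the string form of a Python list. The small, self-contained sketch below reproduces that round trip with an in-memory CSV. The row contents are made up for illustration.

```python
import ast
import csv
import io

# Write a tiny table with a list-valued column to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embedding"])
writer.writerow(["example section", [0.1, -0.2, 0.3]])  # the list is stored via str()

# Read it back: the embedding arrives as a plain string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["embedding"]).__name__)  # str

# ast.literal_eval safely parses the Python literal back into a list.
embedding = ast.literal_eval(row["embedding"])
print(embedding)
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it is a safer choice for parsing data loaded from a file.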

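Search

Now we will define a search function that:

Takes a user query and a dataframe with text and embedding columns
Embeds the user query with the OpenAI API
Uses the distance between the query embedding and the text embeddings to rank the texts
Returns two lists:

The top N texts, ranked by relevance
Their corresponding relevance scores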

# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response.data[0].embedding
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
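    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    # build two lists so the return value matches the annotated type (also works for an empty dataframe)
    strings = [string for string, _ in strings_and_relatednesses[:top_n]]
    relatednesses = [relatedness for _, relatedness in strings_and_relatednesses[:top_n]]
    return strings, relatednesses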

Example

# examples
strings, relatednesses = strings_ranked_by_relatedness("curling gold medal", df, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

Output

relatedness=0.879

'Curling at the 2022 Winter Olympics\n\n==Medal summary==\n\n===Medal table===\n\n{{Medals table\n | caption        = \n | host           = \n | flag_template  = flagIOC\n | event          = 2022 Winter\n | team           = \n | gold_CAN = 0 | silver_CAN = 0 | bronze_CAN = 1\n | gold_ITA = 1 | silver_ITA = 0 | bronze_ITA = 0\n | gold_NOR = 0 | silver_NOR = 1 | bronze_NOR = 0\n | gold_SWE = 1 | silver_SWE = 0 | bronze_SWE = 2\n | gold_GBR = 1 | silver_GBR = 1 | bronze_GBR = 0\n | gold_JPN = 0 | silver_JPN = 1 | bronze_JPN - 0\n}}'

relatedness=0.872

"Curling at the 2022 Winter Olympics\n\n==Results summary==\n\n===Women's tournament===\n\n====Playoffs====\n\n=====Gold medal game=====\n\n''Sunday, 20 February, 9:05''\n{{#lst:Curling at the 2022 Winter Olympics – Women's tournament|GM}}\n{{Player percentages\n| team1 = {{flagIOC|JPN|2022 Winter}}\n| [[Yurika Yoshida]] | 97%\n| [[Yumi Suzuki]] | 82%\n| [[Chinami Yoshida]] | 64%\n| [[Satsuki Fujisawa]] | 69%\n| teampct1 = 78%\n| team2 = {{flagIOC|GBR|2022 Winter}}\n| [[Hailey Duff]] | 90%\n| [[Jennifer Dodds]] | 89%\n| [[Vicky Wright]] | 89%\n| [[Eve Muirhead]] | 88%\n| teampct2 = 89%\n}}"

relatedness=0.869

'Curling at the 2022 Winter Olympics\n\n==Results summary==\n\n===Mixed doubles tournament===\n\n====Playoffs====\n\n=====Gold medal game=====\n\n\'\'Tuesday, 8 February, 20:05\'\'\n{{#lst:Curling at the 2022 Winter Olympics – Mixed doubles tournament|GM}}\n{| class="wikitable"\n!colspan=4 width=400|Player percentages\n|-\n!colspan=2 width=200 style="white-space:nowrap;"| {{flagIOC|ITA|2022 Winter}}\n!colspan=2 width=200 style="white-space:nowrap;"| {{flagIOC|NOR|2022 Winter}}\n|-\n| [[Stefania Constantini]] || 83%\n| [[Kristin Skaslien]] || 70%\n|-\n| [[Amos Mosaner]] || 90%\n| [[Magnus Nedregotten]] || 69%\n|-\n| \'\'\'Total\'\'\' || 87%\n| \'\'\'Total\'\'\' || 69%\n|}'

relatedness=0.868

"Curling at the 2022 Winter Olympics\n\n==Medal summary==\n\n===Medalists===\n\n{| {{MedalistTable|type=Event|columns=1}}\n|-\n|Men<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Men's tournament}}\n|{{flagIOC|SWE|2022 Winter}}<br>[[Niklas Edin]]<br>[[Oskar Eriksson]]<br>[[Rasmus Wranå]]<br>[[Christoffer Sundgren]]<br>[[Daniel Magnusson (curler)|Daniel Magnusson]]\n|{{flagIOC|GBR|2022 Winter}}<br>[[Bruce Mouat]]<br>[[Grant Hardie]]<br>[[Bobby Lammie]]<br>[[Hammy McMillan Jr.]]<br>[[Ross Whyte]]\n|{{flagIOC|CAN|2022 Winter}}<br>[[Brad Gushue]]<br>[[Mark Nichols (curler)|Mark Nichols]]<br>[[Brett Gallant]]<br>[[Geoff Walker (curler)|Geoff Walker]]<br>[[Marc Kennedy]]\n|-\n|Women<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Women's tournament}}\n|{{flagIOC|GBR|2022 Winter}}<br>[[Eve Muirhead]]<br>[[Vicky Wright]]<br>[[Jennifer Dodds]]<br>[[Hailey Duff]]<br>[[Mili Smith]]\n|{{flagIOC|JPN|2022 Winter}}<br>[[Satsuki Fujisawa]]<br>[[Chinami Yoshida]]<br>[[Yumi Suzuki]]<br>[[Yurika Yoshida]]<br>[[Kotomi Ishizaki]]\n|{{flagIOC|SWE|2022 Winter}}<br>[[Anna Hasselborg]]<br>[[Sara McManus]]<br>[[Agnes Knochenhauer]]<br>[[Sofia Mabergs]]<br>[[Johanna Heldin]]\n|-\n|Mixed doubles<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Mixed doubles tournament}}\n|{{flagIOC|ITA|2022 Winter}}<br>[[Stefania Constantini]]<br>[[Amos Mosaner]]\n|{{flagIOC|NOR|2022 Winter}}<br>[[Kristin Skaslien]]<br>[[Magnus Nedregotten]]\n|{{flagIOC|SWE|2022 Winter}}<br>[[Almida de Val]]<br>[[Oskar Eriksson]]\n|}"

relatedness=0.867

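Ask

With the search function above, we can now automatically retrieve relevant knowledge and insert it into messages to GPT.

We define a function ask that:

Searches for text relevant to the query
Stuffs that text into a message for GPT
Sends the message to GPT
Returns GPT's answer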

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response.choices[0].message.content
    return response_message
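The budget logic in query_message can be tried offline, without calling the API. The sketch below mirrors its greedy packing: sections are appended in ranked order until the next one would exceed the token budget. The word-count "tokenizer", the function name `pack_sections`, and the sample sections are stand-ins assumed for this example. The real code counts tokens with tiktoken.

```python
from typing import Callable


def pack_sections(
    sections: list[str],
    header: str,
    question: str,
    token_budget: int,
    count_tokens: Callable[[str], int],
) -> str:
    """Append sections in ranked order until the next one would exceed the budget."""
    message = header
    for section in sections:
        candidate = f"{message}\n\n{section}"
        if count_tokens(candidate + question) > token_budget:
            break  # stop at the first section that does not fit
        message = candidate
    return message + question


def word_count(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


ranked_sections = [
    "Sweden won gold in men's curling.",
    "Great Britain won gold in women's curling.",
    "Italy won gold in mixed doubles curling.",
]
message = pack_sections(
    ranked_sections,
    header="Use the sections below to answer.",
    question="\n\nQuestion: Who won gold in curling?",
    token_budget=25,
    count_tokens=word_count,
)
print(message)
```

With a budget of 25 stand-in tokens, only the first two sections fit. With the real functions defined above, and assuming a valid API key and the loaded dataframe, the whole pipeline would be invoked with a call such as ask('Which athletes won the gold medal in curling at the 2022 Winter Olympics?').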
