
LLM in Practice | Extracting Keywords with an LLM

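Below we use the Mistral 7B model to extract keywords. Because the released transformers library does not yet support Mistral 7B, we install transformers and sentence-transformers from their GitHub repositories, along with keybert and ctransformers: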

pip install --upgrade git+https://github.com/UKPLab/sentence-transformers
pip install keybert ctransformers[cuda]
pip install --upgrade git+https://github.com/huggingface/transformers



from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU.
# Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)


from transformers import AutoTokenizer, pipeline

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Pipeline
generator = pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)



>>> response = generator("What is 1+1?")
>>> print(response[0]["generated_text"])
"""
What is 1+1?
A: 2
"""


prompt = """
I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine

Extract 5 keywords from that document.
"""
response = generator(prompt)
print(response[0]["generated_text"])


"""
I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine

Extract 5 keywords from that document.

**Answer:**
1. Website
2. Mentions
3. Deliver
4. Couple
5. Days
"""

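If we want the output structure to stay the same regardless of the input text, we have to give the LLM an example. This is where more advanced prompt engineering comes in. Like most large language models, Mistral 7B expects a specific prompt format, as shown in the figure below: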
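The format can also be sketched in code. The snippet below is a minimal illustration, assuming the commonly documented Mistral-Instruct layout in which each instruction is wrapped in [INST] ... [/INST], each completed answer is closed with </s>, and the whole prompt starts with <s>. The helper name build_mistral_prompt and its parameters are hypothetical and are not part of any library.

```python
def build_mistral_prompt(turns, final_instruction):
    """Assemble a Mistral-Instruct style prompt (illustrative helper, not a library API).

    turns: list of (instruction, answer) pairs that serve as worked examples.
    final_instruction: the instruction the model should answer next.
    """
    prompt = "<s>"  # beginning-of-sequence marker, written once at the start
    for instruction, answer in turns:
        # A finished turn: instruction in [INST] tags, answer closed by </s>
        prompt += f"[INST] {instruction} [/INST] {answer}</s>"
    # The open turn: the model continues generating after the last [/INST]
    prompt += f"[INST] {final_instruction} [/INST]"
    return prompt


demo = build_mistral_prompt(
    turns=[("What is 1+1?", "2")],
    final_instruction="What is 2+2?",
)
print(demo)
# <s>[INST] What is 1+1? [/INST] 2</s>[INST] What is 2+2? [/INST]
```

The completed turn plays the role of the example, and the final open turn is the one the model is expected to answer.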

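Based on the Mistral 7B prompt template above, we build the keyword-extraction prompt from two parts: an Example Prompt and a Keyword Prompt. The Example Prompt is a worked sample of keyword extraction, and the Keyword Prompt asks the LLM to output the keywords. Here is an example: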

example_prompt = """<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>"""

The Keyword Prompt relies on KeyBERT's [DOCUMENT] tag to mark where the document text goes:

keyword_prompt = """[INST]
I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]"""
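To make the role of the [DOCUMENT] tag concrete, here is a minimal sketch of the substitution step. It assumes the tag is filled in with a plain string replacement before the prompt is sent to the model; the function fill_document and the shortened template are hypothetical and only illustrate the idea, they are not KeyBERT's code.

```python
def fill_document(prompt_template, document, tag="[DOCUMENT]"):
    """Insert the document text where the tag appears (illustrative helper)."""
    return prompt_template.replace(tag, document)


# Shortened stand-in for the keyword prompt, for illustration only
template = (
    "[INST]\n"
    "I have the following document:\n"
    "- [DOCUMENT]\n\n"
    "Please give me the keywords that are present in this document.\n"
    "[/INST]"
)

filled = fill_document(template, "I received my package!")
print(filled)
```

Because the tag is the only part that changes between documents, the same instructions and the same example are reused for every input.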

The complete keyword-extraction prompt is the Example Prompt followed by the Keyword Prompt. The code is as follows:

>>> prompt = example_prompt + keyword_prompt
>>> print(prompt)
"""
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>[INST]
I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""


from keybert.llm import TextGeneration
from keybert import KeyLLM

# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)

documents = [
    "The website mentions that it only takes a couple of days to deliver but I still have not received mine.",
    "I received my package!",
    "Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license."
]

keywords = kw_model.extract_keywords(documents)


[['deliver', 'days', 'website', 'mention', 'couple', 'still', 'receive', 'mine'], ['package', 'received'], ['LLM', 'API', 'accessibility', 'release', 'license', 'research', 'community', 'model', 'weights', 'Meta']]
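Each inner list above corresponds to one document. Since the prompt asks for comma-separated keywords, the raw generation has to be turned into a list at some point. The sketch below shows one plausible way to do that, assuming the generated text echoes the prompt before the answer (as in the earlier outputs). The function parse_keywords and the sample strings are hypothetical.

```python
def parse_keywords(generated_text, prompt):
    """Strip the echoed prompt and split the answer on commas (illustrative helper)."""
    answer = generated_text.replace(prompt, "")
    # Drop surrounding whitespace and any empty fragments
    return [kw.strip() for kw in answer.split(",") if kw.strip()]


# Hypothetical prompt and generation, used only to exercise the helper
sample_prompt = "[INST] Extract keywords from: I received my package! [/INST]"
sample_generation = sample_prompt + " package, received"

print(parse_keywords(sample_generation, sample_prompt))
# ['package', 'received']
```

This is also why the prompt insists on returning only the keywords: any extra sentence in the answer would end up in the list as if it were a keyword.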





from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(documents, convert_to_tensor=True)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.5)



>>> keywords
[['deliver', 'days', 'website', 'mention', 'couple', 'still', 'receive', 'mine'], ['deliver', 'days', 'website', 'mention', 'couple', 'still', 'receive', 'mine'], ['LLaMA', 'model', 'weights', 'release', 'noncommercial', 'license', 'research', 'community', 'powerful', 'LLMs', 'APIs']]
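In this output the second document receives exactly the same keywords as the first, which suggests that documents whose embeddings are similar enough (above the threshold) are grouped and share one set of keywords. The sketch below illustrates that idea with a simple greedy grouping on cosine similarity. It is a simplified stand-in under that assumption, not KeyLLM's actual clustering routine, and the toy vectors and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_by_similarity(vectors, threshold):
    """Greedily group vectors; the first index in each group is its representative."""
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vectors[group[0]], vec) >= threshold:
                group.append(i)
                break
        else:
            # No existing group is similar enough, so start a new one
            groups.append([i])
    return groups


# Toy 2-D "embeddings": the first two point in nearly the same direction
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
groups = group_by_similarity(toy_embeddings, threshold=0.5)
print(groups)
# [[0, 1], [2]]
```

Under this scheme the LLM only needs to be called once per group, and every member of the group reuses the representative's keywords, which trades some per-document precision for fewer generations.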





from keybert import KeyLLM, KeyBERT

# Load it in KeyBERT, passing both the LLM and the embedding model
kw_model = KeyBERT(llm=llm, model='BAAI/bge-small-en-v1.5')

# Extract keywords
keywords = kw_model.extract_keywords(documents, threshold=0.5)


>>> keywords
[['deliver', 'days', 'website', 'mention', 'couple', 'still', 'receive', 'mine'], ['deliver', 'days', 'website', 'mention', 'couple', 'still', 'receive', 'mine'], ['LLaMA', 'model', 'weights', 'release', 'license', 'research', 'community', 'powerful', 'LLMs', 'APIs', 'accessibility']]


