Code Tests for 5 Programming Languages Added! New on the OpenCompass LLM Evaluation Platform

Original OpenCompass OpenMMLab 2024-04-23

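Large language models (LLMs) have made remarkable progress in understanding and generating natural-language text. As application scenarios diversify, demands such as quickly writing high-quality code, fixing bugs, and improving development efficiency pose new challenges for the coding ability of LLMs.

Code LLMs are advancing rapidly in the academic community, and models such as Code Llama and WizardCoder have drawn wide attention. So how should we choose a code LLM? A comprehensive and transparent evaluation of coding ability can help you find the code LLM that best fits your needs.

OpenCompass Code Ability Evaluation

Main Benchmarks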

HumanEval

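HumanEval is a dataset released by OpenAI for evaluating how well AI can solve programming problems. It contains 164 programming problems, hand-written by human experts, of varying difficulty and type.

Each problem comes with a detailed description (a docstring) that states the requirements and constraints and specifies the input and output format. Each problem also comes with one or more test cases giving inputs and expected outputs, which are used to verify the correctness of the program.

HumanEval example problem: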

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

Example test function; the generated solution must pass check():

def check(has_close_elements):
    assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert has_close_elements([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert has_close_elements([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert has_close_elements([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert has_close_elements([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert has_close_elements([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False

check(has_close_elements)

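The goal of this dataset is to evaluate whether a model can understand and solve real programming problems. Generated code must not only be syntactically correct but also functionally satisfy the requirements in the description and pass all test cases.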
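To make the pass/fail criterion concrete, here is a minimal sketch of how a candidate solution could be checked against a HumanEval-style test. The function name `passes_tests` and its parameters are hypothetical and are not taken from OpenCompass or the HumanEval harness. It also assumes the code is trusted: a real harness would normally run generated code in a sandbox with a timeout rather than call `exec` directly.

```python
def passes_tests(solution_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the solution passes the check() function defined in test_src.

    Assumption: test_src defines check(candidate), as in the example above.
    Warning: exec() runs arbitrary code; only use this on trusted input.
    """
    namespace = {}
    try:
        exec(solution_src, namespace)              # defines the candidate function
        exec(test_src, namespace)                  # defines check()
        namespace["check"](namespace[entry_point]) # raises AssertionError on failure
        return True
    except Exception:
        # Syntax errors, runtime errors and failed assertions all count as a failure.
        return False


solution = (
    "def has_close_elements(numbers, threshold):\n"
    "    ordered = sorted(numbers)\n"
    "    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))\n"
)
tests = (
    "def check(has_close_elements):\n"
    "    assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n"
    "    assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n"
)
result = passes_tests(solution, tests, "has_close_elements")
```

A solution counts as correct only if every assertion holds, so a single failing case or a syntax error makes the whole sample fail.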

MBPP

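This benchmark comprises 500 crowd-sourced Python programming problems (the test split of the roughly 1,000-problem MBPP dataset), designed to be solvable by entry-level programmers and covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and three automated test cases.

MBPP example problem and test cases:

"text": "Write a python function to remove first and last occurrence of a given character from the string.",
"test_list": [
    "assert remove_Occ(\"hello\",\"l\") == \"heo\"",
    "assert remove_Occ(\"abcda\",\"a\") == \"bcd\"",
    "assert remove_Occ(\"PHP\",\"P\") == \"H\""
],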

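HumanEval-X

Beyond Python, to better evaluate the multilingual generation ability of code models, THUDM built HumanEval-X, which measures the functional correctness of generated code. HumanEval-X contains 820 high-quality hand-written samples (164 problems in each of five languages) covering Python, C++, Java, JavaScript, and Go, and can be used for tasks such as code generation and code translation.

Figure: illustration of the code generation and code translation tasks

Evaluation Metric

For code generation tasks, we usually use pass@k, proposed by OpenAI, as the metric: the LLM is asked to generate k samples for each task, and we measure the probability that at least one of them passes the tests. Evaluating with different values of k also shows how much randomness there is in the model's outputs.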
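As an illustration, pass@k is commonly computed with the unbiased estimator from OpenAI's Codex paper: generate n ≥ k samples per task, count the c samples that pass, and compute 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the article does not state which exact formula OpenCompass uses, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: total number of samples generated for the task
    c: number of those samples that passed all tests
    k: the k in pass@k (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # C(n - c, k) / C(n, k) is the probability that a random size-k subset
    # of the n samples contains no passing sample at all.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to the fraction of passing samples, c / n. The dataset-level score is the mean over all tasks.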

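Code LLM Showdown

Python Leaderboard

For the Python leaderboard, we use the average score over the HumanEval and MBPP datasets as the reference. WizardCoder-Python already outperforms ChatGPT on these two datasets, which demonstrates the success of its Python-focused improvements. Models fine-tuned specifically on Python data are also clearly stronger than comparably sized models: for example, CodeLlama-34b-Python is stronger at Python than CodeLlama-34b-Instruct and CodeLlama-34b from the same family.

Multi-Language Leaderboard

Besides Python, which gets the most attention, we also provide a multi-language evaluation based on HumanEval-X, covering C++, Java, Go, and JavaScript. Code evaluation results are easily affected by the software versions in the test environment, so the OpenCompass team developed the Code-Evaluator tool, which provides a way to set up an evaluation service for multiple languages. With the environment image we provide, users can easily run fair multi-language evaluations.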

Code-Evaluator:

https://github.com/open-compass/code-evaluator

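With OpenCompass and Code-Evaluator, we found that existing models come close to ChatGPT in some languages, while models fine-tuned specifically on Python data show weaknesses in some other languages. GPT-4 still holds a sizable lead, and multi-language ability is one of the directions future models should focus on.

The Multi-Language Trade-Off

Overall, as a model's Python ability improves, its ability in the other languages tends to decline. Code LLMs should attend to overall multi-language performance while improving single-language ability.

Code Evaluation Pitfalls: Post-Processing

Code generation outputs currently fall into two paradigms:

The first is a continuation of the prompt: the output is appended directly to the prompt for evaluation. For indentation-sensitive languages such as Python, the indentation must be handled carefully when concatenating, so that the generated code is consistent with the prompt's indentation.

Prompt example: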

from typing import List

Output example:

    numbers.sort()
    for i in range(len(numbers) - 1):
        if numbers[i + 1] - numbers[i] < threshold:
            return True
    return False
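One possible way to handle the indentation problem in the continuation paradigm is sketched below. It is a heuristic written for illustration, not the post-processing OpenCompass actually applies: the function `join_completion` and its parameters are hypothetical. It tries a few indentation repairs and keeps the first one for which prompt plus completion is syntactically valid Python.

```python
import textwrap


def _compiles(source: str) -> bool:
    """Return True if the source is syntactically valid Python."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


def join_completion(prompt: str, completion: str, indent: str = "    ") -> str:
    """Append a continuation-style completion to a prompt that ends inside a function."""
    completion = completion.strip("\n")
    first, _, rest = completion.partition("\n")
    candidates = []
    if completion[:1].isspace():
        candidates.append(completion)  # already indented: try it unchanged
    # The whole completion was shifted left: re-indent every line uniformly.
    candidates.append(textwrap.indent(textwrap.dedent(completion), indent))
    # Only the first line lost its indentation: repair that line alone.
    candidates.append(indent + first.lstrip() + ("\n" + rest if rest else ""))
    head = prompt.rstrip("\n") + "\n"
    for body in candidates:
        program = head + body + "\n"
        if _compiles(program):
            return program
    # No repair worked; return the raw concatenation so the tests record a failure.
    return head + completion + "\n"


prompt = (
    "from typing import List\n\n\n"
    "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n"
    '    """Return True if any two numbers are closer than threshold."""\n'
)
dedented = (
    "numbers.sort()\n"
    "for i in range(len(numbers) - 1):\n"
    "    if numbers[i + 1] - numbers[i] < threshold:\n"
    "        return True\n"
    "return False"
)
program = join_completion(prompt, dedented)
```

Checking syntax only confirms that the pieces fit together; whether the code is correct is still decided by the test cases.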

The second is generating a complete piece of code under various prompts; we need to locate the code block in the output and then evaluate it.

Prompt example:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a python script for this problem:
from typing import List

### Response:

Output example:

Here's the solution to the problem:

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check if in given list of numbers, are any two numbers closer to each other than given threshold.
    """
    for i in range(len(numbers)):
        for j in range(i+1, len(numbers)):
            if abs(numbers[i] - numbers[j]) < threshold:
                return True
    return False
```

Explanation:

- We use two nested loops to compare each pair of numbers in the list.

- The outer loop iterates over the indices of the list, and the inner loop iterates over the remaining indices.

- We use the `abs()` function to calculate the absolute difference between the two numbers.

- If the absolute difference is less than the threshold, we return `True` to indicate that there are two numbers that are closer than the threshold.

- If we reach the end of the loops without finding any such pair, we return `False`.
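For outputs like the one above, the code has to be separated from the surrounding explanation before it can be tested. A minimal sketch, assuming the model wraps its code in a Markdown fence (three backticks with an optional language tag): the helper name `extract_code` is hypothetical and this is not necessarily how OpenCompass implements the step.

```python
import re

# A fenced block: three backticks, an optional language tag, a newline,
# then the shortest possible body up to the next three backticks.
_FENCED_BLOCK = re.compile(r"`{3}[\w+-]*[ \t]*\r?\n(.*?)`{3}", re.DOTALL)


def extract_code(response: str) -> str:
    """Return the body of the first fenced code block in a model response.

    If no fenced block is found, the response is returned unchanged, on the
    assumption that the model answered with bare code.
    """
    match = _FENCED_BLOCK.search(response)
    if match:
        return match.group(1).strip("\n")
    return response


fence = "`" * 3  # built at runtime so this example contains no literal fence
response = (
    "Here's the solution to the problem:\n\n"
    + fence + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + fence + "\n\nExplanation:\n- We add the two numbers."
)
code = extract_code(response)
```

Taking the first block is a design choice: some models emit several blocks (for example, a solution followed by a usage demo), in which case a different selection rule may be more appropriate.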

Summary

For how to run code ability evaluation with OpenCompass, see the documentation in the repository:

https://opencompass.readthedocs.io/zh_CN/latest/advanced_guides/code_eval_service.html

OpenCompass project:

https://github.com/open-compass/opencompass

Code-Evaluator:

https://github.com/open-compass/code-evaluator

LLM leaderboard:

https://opencompass.org.cn/leaderboard-llm

You are welcome to submit evaluation requests on OpenCompass.
