Accelerating Training of GLM, a Homegrown Chinese Large Model: Up to 3x Higher Performance, 1/3 Less GPU Memory, at Low Adoption Cost
One-GLM: https://github.com/Oneflow-Inc/one-glm
OneFlow: https://github.com/Oneflow-Inc/oneflow
1. Training Performance and GPU Memory Usage of GLM-large
First, here is the training performance and GPU memory usage of the GLM-large network trained with the official GLM repository and with the One-GLM repository respectively (using data parallelism). The hardware is an A100 PCIE 40GB, with batch size set to 8.
As the results show, for the GLM-large training task, OneFlow delivers a 120%–276% speedup over the original GLM implementation based on PyTorch, DeepSpeed, and Apex, while reducing GPU memory usage by 10%–30% (all results can be reproduced with oneflow >= 0.9.0).
2. Migrating GLM Takes Only a Few Lines of Code
Replace import torch with import oneflow as torch
Replace import torch.xx with import oneflow.xx
Replace from apex.optimizers import FusedAdam as Adam with from oneflow.optim import Adam
Replace from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm with from oneflow.nn import LayerNorm
Comment out torch.distributed.ReduceOp, torch.distributed.new_group, torch.distributed.TCPStore, and torch.distributed.all_reduce. These APIs are needed by PyTorch DDP, but OneFlow's data parallelism is implemented by its internal SBP and Global Tensor mechanisms, so they are not required.
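Put together, the changes amount to swapping the import section at the top of the GLM source files. A minimal sketch (the rest of the GLM code stays untouched):

import oneflow as torch           # was: import torch
from oneflow.optim import Adam    # was: from apex.optimizers import FusedAdam as Adam
from oneflow.nn import LayerNorm  # was: from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm

# PyTorch-DDP-only calls such as torch.distributed.all_reduce are commented out;
# OneFlow's SBP / Global Tensor mechanism handles data parallelism instead.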
Migrating many other models is even simpler. For example, in flowvision, the counterpart of torchvision, many models are obtained simply by adding import oneflow as torch to the torchvision model file, so the extra cost to users is close to zero.
3. Two Major Tuning Techniques
Optimizing the loss computation
CUDA kernel fusion, as sketched below:
fused_bias_add_gelu: fuses the bias_add and gelu operators into a single kernel.
fused_bias_add_dropout: fuses the bias_add and dropout operators into a single kernel.
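To illustrate what such a fusion saves, the sketch below contrasts the unfused and fused forms of bias_add + gelu. The fused entry point and its axis argument (oneflow._C.fused_bias_add_gelu) are assumptions here and may vary across OneFlow versions; the unfused form uses standard ops.

import oneflow as flow

x = flow.randn(8, 1024, 4096, device="cuda")   # activations produced by a Linear layer
bias = flow.randn(4096, device="cuda")         # that layer's bias

# Unfused: two elementwise kernels and an extra round trip through global memory
y_unfused = flow.nn.functional.gelu(x + bias)

# Fused: one kernel launch, one memory pass (entry point assumed; check your OneFlow version)
y_fused = flow._C.fused_bias_add_gelu(x, bias, axis=2)

fused_bias_add_dropout works the same way, folding the application of the dropout mask into the bias add.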
4. LiBai Also Handles GLM Inference with Ease
LiBai: https://github.com/Oneflow-Inc/libai
Introduction to LiBai: "Is training large models harder than reaching the sky? The easy-to-use, highly efficient 'LiBai' model library is here!"
GLM: https://github.com/Oneflow-Inc/libai/tree/glm_project/projects/GLM
The GLM built with LiBai can conveniently run model parallel + pipeline parallel inference, which neatly solves the problem of a large model not fitting on a single GPU.
So how can users leverage LiBai, a repository for large-scale model training and inference, to build the distributed inference part of GLM? The following small example explains.
Distributed inference comes naturally
Model parameters are essentially many tensors, i.e. matrices; the parameters of a large model are just large matrices. A parallelism strategy splits a large matrix into several smaller ones and places them on different GPUs or devices. The basic LinearLayer is implemented in LiBai as follows:
class Linear1D(nn.Module):
    def __init__(self, in_features, out_features, parallel="data", layer_idx=0, ...):
        super().__init__()
        if parallel == "col":
            weight_sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.split(0)])
        elif parallel == "row":
            weight_sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.split(1)])
        elif parallel == "data":
            weight_sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])
        else:
            raise KeyError(f"{parallel} is not supported! Only support ('data', 'row' and 'col')")
        self.weight = flow.nn.Parameter(
            flow.empty(
                (out_features, in_features),
                dtype=flow.float32,
                placement=dist.get_layer_placement(layer_idx),  # for pipeline parallelism placement
                sbp=weight_sbp,
            )
        )
        init_method(self.weight)
        ...

    def forward(self, x):
        ...
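To make the SBP idea concrete outside of LiBai, here is a minimal sketch that simplifies the 2-D nd_sbp above to a 1-D placement over 2 GPUs (shapes and ranks are illustrative). A weight created with flow.sbp.split(0) is physically sharded across devices while the code still sees one logical tensor:

# Launch with: python3 -m oneflow.distributed.launch --nproc_per_node 2 sbp_sketch.py
import oneflow as flow

placement = flow.placement("cuda", ranks=[0, 1])

# Logical weight of shape (out_features, in_features) = (1024, 512);
# sbp=split(0) means each rank physically stores only a (512, 512) shard.
weight = flow.randn(1024, 512, placement=placement, sbp=flow.sbp.split(0))

# The activation is replicated (broadcast) on every rank.
x = flow.randn(8, 512, placement=placement, sbp=flow.sbp.broadcast)

# y = x @ weight^T: each rank computes its own slice of the output columns
# ("col" parallelism above), so no rank ever needs the full weight matrix.
y = flow.nn.functional.linear(x, weight)
if flow.env.get_rank() == 0:
    print(y.shape, y.sbp)  # expected: (8, 1024), split along the last dim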
GLM inference demos
For the single-GPU generate task, we use the glm-10b model:
python demo.py
# demo.py
import oneflow as flow
from projects.GLM.tokenizer.glm_tokenizer import GLMGPT2Tokenizer
from libai.utils import distributed as dist
from projects.GLM.configs.glm_inference import cfg
from projects.GLM.modeling_glm import GLMForConditionalGeneration
from projects.GLM.utils.glm_loader import GLMLoaderHuggerFace
from omegaconf import DictConfig

tokenizer = GLMGPT2Tokenizer.from_pretrained("/data/home/glm-10b")
input_ids = tokenizer.encode(
    [
        "Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai."
    ],
    return_tensors="of",
)
inputs = {"input_ids": input_ids, "attention_mask": flow.ones(input_ids.size())}
inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=512)

sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])
placement = dist.get_layer_placement(0)

dist.set_device_type("cpu")
loader = GLMLoaderHuggerFace(GLMForConditionalGeneration, cfg, "/path/to/glm-10b")
model = loader.load()
model = model.half().cuda()
dist.set_device_type("cuda")

outputs = model.generate(
    inputs=inputs['input_ids'].to_global(sbp=sbp, placement=placement),
    position_ids=inputs['position_ids'].to_global(sbp=sbp, placement=placement),
    generation_attention_mask=inputs['generation_attention_mask'].to_global(sbp=sbp, placement=placement).half(),
    max_length=512,
)
res = tokenizer.decode(outputs[0])
print(res)
>>> [CLS] Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.<|endoftext|> <|startofpiece|> Stanford University and a co-founder of <|endofpiece|>
For the 4-GPU model parallel + pipeline parallel generate task, again with the glm-10b model (4 GPUs = data_parallel_size 1 × tensor_parallel_size 2 × pipeline_parallel_size 2):
python3 -m oneflow.distributed.launch --nproc_per_node 4 demo.py
# demo.py
import oneflow as flow
from projects.GLM.tokenizer.glm_tokenizer import GLMGPT2Tokenizer
from libai.utils import distributed as dist
from projects.GLM.configs.glm_inference import cfg
from projects.GLM.modeling_glm import GLMForConditionalGeneration
from projects.GLM.utils.glm_loader import GLMLoaderHuggerFace
from omegaconf import DictConfig

# Simply configure the parallelism scheme
parallel_config = DictConfig(
    dict(
        data_parallel_size=1,
        tensor_parallel_size=2,
        pipeline_parallel_size=2,
        pipeline_num_layers=2 * 24,
    )
)
dist.setup_dist_util(parallel_config)

tokenizer = GLMGPT2Tokenizer.from_pretrained("/data/home/glm-10b")
input_ids = tokenizer.encode(
    [
        "Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai."
    ],
    return_tensors="of",
)
inputs = {"input_ids": input_ids, "attention_mask": flow.ones(input_ids.size())}
inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=512)

sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])
placement = dist.get_layer_placement(0)

loader = GLMLoaderHuggerFace(GLMForConditionalGeneration, cfg, "/path/to/glm-10b")
model = loader.load()

outputs = model.generate(
    inputs=inputs['input_ids'].to_global(sbp=sbp, placement=placement),
    position_ids=inputs['position_ids'].to_global(sbp=sbp, placement=placement),
    generation_attention_mask=inputs['generation_attention_mask'].to_global(sbp=sbp, placement=placement),
    max_length=512,
)
res = tokenizer.decode(outputs[0])
if dist.is_main_process():
    print(res)
>>> [CLS] Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.<|endoftext|> <|startofpiece|> Stanford University and a co-founder of <|endofpiece|>
Inference with a model trained by One-GLM
LiBai makes loading OneFlow models just as convenient. If you want to run inference with a model trained with one-glm, simply replace GLMLoaderHuggerFace in the demos above with GLMLoaderLiBai, as sketched below.
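Concretely, only the loader line changes. A sketch (the checkpoint path is a placeholder, and GLMLoaderLiBai is assumed to live next to GLMLoaderHuggerFace in projects.GLM.utils.glm_loader):

from projects.GLM.utils.glm_loader import GLMLoaderLiBai

# Everything else in the demos above stays the same; only the loader is swapped.
loader = GLMLoaderLiBai(GLMForConditionalGeneration, cfg, "/path/to/one_glm_checkpoint")
model = loader.load()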
5. Closing Remarks
You are welcome to star and try out One-GLM:
One-GLM: https://github.com/Oneflow-Inc/one-glm
OneFlow: https://github.com/Oneflow-Inc/oneflow