[Stones from Other Hills] PyTorch Tips: The DataLoader collate_fn Parameter Explained

AI Frontier Lectures (人工智能前沿讲习), 2022-05-21


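"Stones from other hills can polish jade": standing on the shoulders of giants lets us see higher and go farther, and on the road of research a favorable wind helps us move faster. To that end we collect practical code links, datasets, software, and programming tips in the "Stones from Other Hills" column. Stay tuned.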

Author: 初识CV (Zhihu)

URL: https://www.zhihu.com/people/AI_team-WSF

The full parameter list of DataLoader is:

class torch.utils.data.DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    sampler=None,
    batch_sampler=None,
    num_workers=0,
    collate_fn=<function default_collate>,
    pin_memory=False,
    drop_last=False,
    timeout=0,
    worker_init_fn=None,
)

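DataLoader provides a single-process or multi-process iterator over a dataset. The key parameters are:

shuffle: when set to True, the data is reshuffled at every epoch.
collate_fn: defines how a list of individual samples is merged into one batch; you can supply your own function to get exactly the behavior you need.
drop_last: controls what happens to the samples left over when the dataset length is not divisible by batch_size. True drops the incomplete last batch; False keeps it.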
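
The effect of drop_last and shuffle can be checked on a tiny dataset. The following is a minimal sketch, assuming a hypothetical 10-sample dataset built with torch.arange (not part of the original article); it only counts batch sizes and confirms that shuffling still visits every sample once per epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 10 samples holding the values 0..9.
dataset = TensorDataset(torch.arange(10))

# drop_last=False (the default) keeps the final, smaller batch.
keep_sizes = [len(batch[0]) for batch in DataLoader(dataset, batch_size=3)]

# drop_last=True discards the incomplete final batch.
drop_sizes = [
    len(batch[0])
    for batch in DataLoader(dataset, batch_size=3, drop_last=True)
]

# shuffle=True changes the order, but one epoch still covers every sample once.
shuffled = DataLoader(dataset, batch_size=3, shuffle=True)
seen = sorted(int(v) for batch in shuffled for v in batch[0])

print(keep_sizes)               # [3, 3, 3, 1]
print(drop_sizes)               # [3, 3, 3]
print(seen == list(range(10)))  # True
```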

Let us start with an example that does not set collate_fn:

import numpy as np
import torch
import torch.utils.data as Data

# Source values 0..11
test = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

# 10 samples: each input is a window of 3 values, each target is the window's first value
inputing = torch.tensor(np.array([test[i:i + 3] for i in range(10)]))
target = torch.tensor(np.array([test[i:i + 1] for i in range(10)]))

torch_dataset = Data.TensorDataset(inputing, target)
batch = 3

loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=batch,
)

# Each iteration yields one batch of inputs (i) and one batch of targets (j)
for (i, j) in loader:
    print(i)
    print(j)

Output (the dtype follows NumPy's default integer type on the author's machine; on many platforms it will be torch.int64 instead):

tensor([[0, 1, 2],
        [1, 2, 3],
        [2, 3, 4]], dtype=torch.int32)
tensor([[0],
        [1],
        [2]], dtype=torch.int32)
tensor([[3, 4, 5],
        [4, 5, 6],
        [5, 6, 7]], dtype=torch.int32)
tensor([[3],
        [4],
        [5]], dtype=torch.int32)
tensor([[ 6,  7,  8],
        [ 7,  8,  9],
        [ 8,  9, 10]], dtype=torch.int32)
tensor([[6],
        [7],
        [8]], dtype=torch.int32)
tensor([[ 9, 10, 11]], dtype=torch.int32)
tensor([[9]], dtype=torch.int32)

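For this dataset, leaving collate_fn unset is equivalent to passing the following (the printed output is exactly the same):

# tuple(...) materializes the batch; a bare (... for ...) would return a lazy generator
collate_fn=lambda x: tuple(
    torch.cat([x[i][j].unsqueeze(0) for i in range(len(x))], 0)
    for j in range(len(x[0]))
)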
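
The equivalence claimed above can be checked directly. The sketch below rebuilds the example dataset with torch.arange instead of NumPy (an assumption made only to keep the snippet self-contained) and wraps the article's expression in a named function, cat_collate, which is a hypothetical name. It then compares every batch against the default behavior.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Same data layout as the article's example: 10 windows of 3 values each.
values = torch.arange(12)
inputs = torch.stack([values[i:i + 3] for i in range(10)])   # shape (10, 3)
targets = torch.stack([values[i:i + 1] for i in range(10)])  # shape (10, 1)
dataset = TensorDataset(inputs, targets)


def cat_collate(x):
    """The article's unsqueeze + cat expression, written as a named function."""
    return tuple(
        torch.cat([x[i][j].unsqueeze(0) for i in range(len(x))], 0)
        for j in range(len(x[0]))
    )


default_batches = list(DataLoader(dataset, batch_size=3))
custom_batches = list(DataLoader(dataset, batch_size=3, collate_fn=cat_collate))

# Compare every field of every batch.
same = all(
    torch.equal(d, c)
    for d_batch, c_batch in zip(default_batches, custom_batches)
    for d, c in zip(d_batch, c_batch)
)
print(len(default_batches), same)  # 4 True
```

A named function is used rather than a lambda because lambdas cannot be pickled, which matters once num_workers > 0 on platforms that start worker processes with spawn.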

To see what collate_fn actually does, change it to the identity function:

collate_fn=lambda x: x

and change the printing loop to:

for i in loader:
    print(i)

The result is:

[(tensor([0, 1, 2], dtype=torch.int32), tensor([0], dtype=torch.int32)), (tensor([1, 2, 3], dtype=torch.int32), tensor([1], dtype=torch.int32)), (tensor([2, 3, 4], dtype=torch.int32), tensor([2], dtype=torch.int32))]

[(tensor([3, 4, 5], dtype=torch.int32), tensor([3], dtype=torch.int32)), (tensor([4, 5, 6], dtype=torch.int32), tensor([4], dtype=torch.int32)), (tensor([5, 6, 7], dtype=torch.int32), tensor([5], dtype=torch.int32))]

[(tensor([6, 7, 8], dtype=torch.int32), tensor([6], dtype=torch.int32)), (tensor([7, 8, 9], dtype=torch.int32), tensor([7], dtype=torch.int32)), (tensor([ 8,  9, 10], dtype=torch.int32), tensor([8], dtype=torch.int32))]

[(tensor([ 9, 10, 11], dtype=torch.int32), tensor([9], dtype=torch.int32))]

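Each i is a list of batch_size tuples (the last batch may hold fewer), and each tuple is one sample from the TensorDataset. To regroup them so that each batch holds a 3×3 input and a 3×1 target, the samples have to be unpacked and repacked. Look at our collate_fn again: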

collate_fn=lambda x: tuple(
    torch.cat([x[i][j].unsqueeze(0) for i in range(len(x))], 0)
    for j in range(len(x[0]))
)

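j ranges over the two fields of a sample, input and target; i ranges over the samples in the batch. unsqueeze(0) adds a leading dimension to each tensor, and torch.cat(..., 0) then concatenates the tensors along that dimension, which gives the same result as torch.stack(..., 0).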
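
The unpack-and-repack steps can be traced by hand without a DataLoader. This is a minimal sketch on a hand-built list of three samples (hypothetical values that mirror the first batch of the example), showing the shape after each step.

```python
import torch

# What collate_fn receives: a list of (input, target) tuples, one per sample.
samples = [
    (torch.tensor([0, 1, 2]), torch.tensor([0])),
    (torch.tensor([1, 2, 3]), torch.tensor([1])),
    (torch.tensor([2, 3, 4]), torch.tensor([2])),
]

# Step 1: pick one field (j = 0, the inputs) from every sample (index i).
field = [samples[i][0] for i in range(len(samples))]

# Step 2: unsqueeze(0) turns each shape-(3,) tensor into shape (1, 3).
expanded = [t.unsqueeze(0) for t in field]

# Step 3: torch.cat along dim 0 joins them into a single (3, 3) tensor.
batched = torch.cat(expanded, 0)

# torch.stack performs steps 2 and 3 in one call.
stacked = torch.stack(field, 0)

print(expanded[0].shape)              # torch.Size([1, 3])
print(batched.shape)                  # torch.Size([3, 3])
print(torch.equal(batched, stacked))  # True
```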

Extension:

Change collate_fn to the following:

collate_fn=lambda x: tuple(
    torch.cat([x[i][j].unsqueeze(0) for i in range(len(x))], 0).unsqueeze(0)
    for j in range(len(x[0]))
)

Output:

tensor([[[0, 1, 2],
         [1, 2, 3],
         [2, 3, 4]]], dtype=torch.int32)
tensor([[[0],
         [1],
         [2]]], dtype=torch.int32)
tensor([[[3, 4, 5],
         [4, 5, 6],
         [5, 6, 7]]], dtype=torch.int32)
tensor([[[3],
         [4],
         [5]]], dtype=torch.int32)
tensor([[[ 6,  7,  8],
         [ 7,  8,  9],
         [ 8,  9, 10]]], dtype=torch.int32)
tensor([[[6],
         [7],
         [8]]], dtype=torch.int32)
tensor([[[ 9, 10, 11]]], dtype=torch.int32)
tensor([[[9]]], dtype=torch.int32)

Compared with the previous form, every output gains one leading dimension: a full batch now holds a 1×3×3 input and a 1×3×1 target.
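
The extra dimension is easiest to see by printing shapes rather than values. The sketch below is one possible way to check it; the function name collate_with_extra_dim is hypothetical, and it uses torch.stack with zip(*samples) as a shorter stand-in for the unsqueeze + cat expression.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Same data layout as the article's example.
values = torch.arange(12)
inputs = torch.stack([values[i:i + 3] for i in range(10)])
targets = torch.stack([values[i:i + 1] for i in range(10)])
dataset = TensorDataset(inputs, targets)


def collate_with_extra_dim(samples):
    """Stack each field into a batch, then add one more leading dimension."""
    return tuple(torch.stack(field, 0).unsqueeze(0) for field in zip(*samples))


loader = DataLoader(dataset, batch_size=3, collate_fn=collate_with_extra_dim)
shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]

for shape_pair in shapes:
    print(shape_pair)
# Full batches: ((1, 3, 3), (1, 3, 1)); the final batch: ((1, 1, 3), (1, 1, 1))
```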

For image data, it can be written as follows:

def detection_collate(batch):
    """Custom collate fn for dealing with batches of images that have a
    different number of associated object annotations (bounding boxes).

    Arguments:
        batch: (tuple) A tuple of tensor images and lists of annotations

    Return:
        A tuple containing:
            1) (tensor) batch of images stacked on their 0 dim
            2) (list of tensors) annotations for a given image are stacked
               on 0 dim
    """
    targets = []
    imgs = []
    for sample in batch:
        imgs.append(sample[0])
        targets.append(torch.FloatTensor(sample[1]))
    return torch.stack(imgs, 0), targets

# Only the collate_fn argument is shown; everything else is omitted.
dataloader = torch.utils.data.DataLoader(
    ...,  # dataset and other arguments omitted in the original
    collate_fn=detection_collate,
)
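
To show why the annotations are returned as a list instead of one stacked tensor, here is a minimal sketch with a made-up dataset. ToyDetectionDataset, its image size, and the five-number box layout are all assumptions for illustration and do not come from the original article; detection_collate is repeated in shortened form so the snippet runs on its own.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDetectionDataset(Dataset):
    """Hypothetical dataset: image idx carries idx + 1 boxes of 5 numbers each."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        image = torch.zeros(3, 8, 8)  # dummy 3-channel 8x8 image
        boxes = [[0.1, 0.1, 0.5, 0.5, 0.0] for _ in range(idx + 1)]
        return image, boxes


def detection_collate(batch):
    """Stack the images; keep per-image annotations as a list of tensors."""
    imgs, targets = [], []
    for image, boxes in batch:
        imgs.append(image)
        targets.append(torch.FloatTensor(boxes))
    return torch.stack(imgs, 0), targets


loader = DataLoader(ToyDetectionDataset(), batch_size=2, collate_fn=detection_collate)
images, targets = next(iter(loader))

print(images.shape)                       # torch.Size([2, 3, 8, 8])
print([tuple(t.shape) for t in targets])  # [(1, 5), (2, 5)]
```

The images all share one shape and can be stacked, while the box tensors have different first dimensions, so stacking them would raise an error; keeping them in a list avoids that.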

Reference

PyTorch Tips: The DataLoader collate_fn Parameter Explained

https://www.yht7.com/news/15870

