TensorFlow 2.0 常用模块4：TFRecord

查看原文

其他

TensorFlow 2.0 常用模块4：TFRecord

Original 李锡涵 TensorFlow 2021-08-05

收录于话题

#简单粗暴 TensorFlow2

27个

文 / 李锡涵，Google Developers Expert

本文节选自《简单粗暴 TensorFlow 2.0》

在上一篇文章中，我们介绍了高效的数据流水线模块 tf.data 的流水线并行化加速。本篇文章我们将介绍 TensorFlow 另一个数据处理的利器——TFRecord。

TFRecord ：TensorFlow 数据集存储格式

TFRecord 是 TensorFlow 中的数据集存储格式。当我们将数据集整理成 TFRecord 格式后，TensorFlow 就可以高效地读取和处理这些数据集，从而帮助我们更高效地进行大规模的模型训练。

TFRecord 可以理解为一系列序列化的 tf.train.Example 元素所组成的列表文件，而每一个 tf.train.Example 又由若干个 tf.train.Feature 的字典组成。形式如下：

 1# dataset.tfrecords
 2[
 3    {   # example 1 (tf.train.Example)
 4        'feature_1': tf.train.Feature,
 5        ...
 6        'feature_k': tf.train.Feature
 7    },
 8    ...
 9    {   # example N (tf.train.Example)
10        'feature_1': tf.train.Feature,
11        ...
12        'feature_k': tf.train.Feature
13    }
14]

为了将形式各样的数据集整理为 TFRecord 格式，我们可以对数据集中的每个元素进行以下步骤：

读取该数据元素到内存；
将该元素转换为 tf.train.Example 对象（每一个 tf.train.Example 由若干个 tf.train.Feature 的字典组成，因此需要先建立 Feature 的字典）；
将该 tf.train.Example 对象序列化为字符串，并通过一个预先定义的 tf.io.TFRecordWriter 写入 TFRecord 文件。

而读取 TFRecord 数据则可按照以下步骤：

通过 tf.data.TFRecordDataset 读入原始的 TFRecord 文件（此时文件中的 tf.train.Example 对象尚未被反序列化），获得一个 tf.data.Dataset 数据集对象；
通过 Dataset.map 方法，对该数据集对象中的每一个序列化的 tf.train.Example 字符串执行 tf.io.parse_single_example 函数，从而实现反序列化。

以下我们通过一个实例，展示将上一节中使用的 cats_vs_dogs 二分类数据集的训练集部分转换为 TFRecord 文件，并读取该文件的过程。

将数据集存储为 TFRecord 文件

首先，与上一节类似，我们进行一些准备工作，下载数据集并解压到 data_dir ，初始化数据集的图片文件名列表及标签。

数据集
https://www.floydhub.com/fastai/datasets/cats-vs-dogs

 1import tensorflow as tf
 2import os
 3
 4data_dir = 'C:/datasets/cats_vs_dogs'
 5train_cats_dir = data_dir + '/train/cats/'
 6train_dogs_dir = data_dir + '/train/dogs/'
 7tfrecord_file = data_dir + '/train/train.tfrecords'
 8
 9train_cat_filenames = [train_cats_dir + filename for filename in os.listdir(train_cats_dir)]
10train_dog_filenames = [train_dogs_dir + filename for filename in os.listdir(train_dogs_dir)]
11train_filenames = train_cat_filenames + train_dog_filenames
12train_labels = [0] * len(train_cat_filenames) + [1] * len(train_dog_filenames)  # 将 cat 类的标签设为0，dog 类的标签设为1

然后，通过以下代码，迭代读取每张图片，建立 tf.train.Feature 字典和 tf.train.Example 对象，序列化并写入 TFRecord 文件。

1with tf.io.TFRecordWriter(tfrecord_file) as writer:
2    for filename, label in zip(train_filenames, train_labels):
3        image = open(filename, 'rb').read()     # 读取数据集图片到内存，image 为一个 Byte 类型的字符串
4        feature = {                             # 建立 tf.train.Feature 字典
5            'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),  # 图片是一个 Bytes 对象
6            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))   # 标签是一个 Int 对象
7        }
8        example = tf.train.Example(features=tf.train.Features(feature=feature)) # 通过字典建立 Example
9        writer.write(example.SerializeToString())   # 将Example序列化并写入 TFRecord 文件

值得注意的是， tf.train.Feature 支持三种数据格式：

tf.train.BytesList ：字符串或原始 Byte 文件（如图片），通过bytes_list参数传入一个由字符串数组初始化的 tf.train.BytesList 对象；
tf.train.FloatList ：浮点数，通过float_list参数传入一个由浮点数数组初始化的 tf.train.FloatList 对象；
tf.train.Int64List ：整数，通过int64_list参数传入一个由整数数组初始化的 tf.train.Int64List 对象。

如果只希望保存一个元素而非数组，传入一个只有一个元素的数组即可。

运行以上代码，不出片刻，我们即可在 tfrecord_file 所指向的文件地址获得一个 500MB 左右的 train.tfrecords 文件。

读取 TFRecord 文件

我们可以通过以下代码，读取之间建立的 train.tfrecords 文件，并通过 Dataset.map方法，使用 tf.io.parse_single_example 函数对数据集中的每一个序列化的 tf.train.Example 对象解码。

 1raw_dataset = tf.data.TFRecordDataset(tfrecord_file)    # 读取 TFRecord 文件
 2
 3feature_description = { # 定义Feature结构，告诉解码器每个Feature的类型是什么
 4    'image': tf.io.FixedLenFeature([], tf.string),
 5    'label': tf.io.FixedLenFeature([], tf.int64),
 6}
 7
 8def _parse_example(example_string): # 将 TFRecord 文件中的每一个序列化的 tf.train.Example 解码
 9    feature_dict = tf.io.parse_single_example(example_string, feature_description)
10    feature_dict['image'] = tf.io.decode_jpeg(feature_dict['image'])    # 解码JPEG图片
11    return feature_dict['image'], feature_dict['label']
12
13dataset = raw_dataset.map(_parse_example)

这里的 feature_description 字典类似于一个数据集的 “描述文件”，通过一个由键值对组成的字典，告知 tf.io.parse_single_example 函数每个 tf.train.Example 数据项有哪些 Feature，以及这些 Feature 的类型、形状等属性。 tf.io.FixedLenFeature 的三个输入参数 shape 、 dtype 和 default_value （可省略）为每个 Feature 的形状、类型和默认值。这里我们的数据项都是单个的数值或者字符串，所以 shape 为空数组。

运行以上代码后，我们获得一个数据集对象 dataset ，这已经是一个可以用于训练的 tf.data.Dataset 对象了！我们从该数据集中读取元素并输出验证：

1import matplotlib.pyplot as plt 
2
3for image, label in dataset:
4    plt.title('cat' if label == 0 else 'dog')
5    plt.imshow(image.numpy())
6    plt.show()

显示：

可见图片和标签都正确显示，数据集构建成功。

福利 | 问答环节

我们知道在入门一项新的技术时有许多挑战与困难需要克服。如果您有关于 TensorFlow 的相关问题，可在本文后留言，我们的工程师和 GDE 将挑选其中具有代表性的问题在下一期进行回答～

在上一篇文章《tf.data 流水线加速》中，我们对于部分具有代表性的问题回答如下：

Q1. Serving 时怎么 prefetch？

A：TensorFlow Serving 一般用于部署模型，而 tf.data 的 Dataset.prefetch 用于数据集的处理，一般在模型训练阶段使用。因此这两者能相遇的概率很小。

Q2. 我用 1.13 的 tf.keras 保存的 .h5 权重文件，使用 2.0 可以读取吗？

A：可以，使用 tf.keras.models.load_model('path_to_my_model.h5') 即可。具体可参考

https://www.tensorflow.org/guide/keras/save_and_serialize

Q3. AUTOTUNE 还很不完善，多种情况下存在内存泄漏的可能。Github 上提的 bug 也还没解决，不建议使用

A：如发现 AUTOTUNE 特性在自己的实践中存在问题，可手工设置参数。我们相信这一特性在日后会愈发完善。

《简单粗暴 TensorFlow 2.0》目录

公众号回复关键字“手册”获取系列内容合集及 FAQ

《鱿鱼游戏2》今天下午四点开播，网友无心上班了，导演悄悄剧透

刘恺威近况曝光，父亲刘丹证实已分手，目前失业在家，没有资源

紧急通告！三高的“克星”终于被找到了！！不是吃素和控糖,而是多喝它....

话费充值活动来了：95元充值100元电话费！

跟着南通住建局学“朝令夕改”