【他山之石】TensorFlow 2.4 Performance Tuning Best Practices
"Stones from other hills may serve to polish jade": only by standing on the shoulders of giants can we see farther and go further, and on the road of research a favourable wind helps us move faster. To that end, we have collected practical code links, datasets, software and programming tips and opened the "他山之石" column, to help you ride the wind and waves. Stay tuned.
Link: https://www.zhihu.com/people/star-all
01
02
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all up front.
physical_devices = tf.config.list_physical_devices('GPU')
for ind, item in enumerate(physical_devices):
    tf.config.experimental.set_memory_growth(item, True)

# Mirror the model across the selected GPUs for synchronous data-parallel training.
TRAIN_GPUS = [0, 1, 2, 3]
devices = ["/gpu:{}".format(i) for i in TRAIN_GPUS]
strategy = tf.distribute.MirroredStrategy(devices)

# Let the strategy shard the input pipelines across replicas.
train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)
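The snippet above does not show where the model and optimizer are built; with MirroredStrategy their variables have to be created inside strategy.scope() so that they are mirrored onto every GPU. A minimal sketch, assuming a Keras classification model (the ResNet50 architecture and learning rate are illustrative choices, not from the article):

with strategy.scope():
    # Variables created in this scope are replicated across all GPUs of the strategy.
    model = tf.keras.applications.ResNet50(weights=None, classes=10)
    optimizer = tf.keras.optimizers.Adam(1e-3)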
with strategy.scope():
    def distributed_train_step(dataset_inputs):
        # Run train_step on every replica and sum the per-replica losses.
        per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
        return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

    def distributed_test_step(dataset_inputs):
        return strategy.run(test_step, args=(dataset_inputs,))
with strategy.scope():
    def train_step(inputs):
        images, labels = inputs
        with tf.GradientTape() as tape:
            predictions = model(images, training=True)
            loss = compute_loss(labels, predictions)
        # Back-propagate and update the mirrored variables.
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        compute_acc(labels, predictions, train_accuracy)
        return loss

    def test_step(inputs):
        images, labels = inputs
        predictions = model(images, training=False)
        compute_acc(labels, predictions, test_accuracy)
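compute_loss and compute_acc are referenced above but not defined in the article. A minimal sketch, assuming a sparse-label classification task, where the per-example loss is scaled by the global batch size as recommended for MirroredStrategy (GLOBAL_BATCH_SIZE and the metric objects are assumed names):

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='test_accuracy')

def compute_loss(labels, predictions):
    per_example_loss = loss_object(labels, predictions)
    # Divide by the global batch size so that summing across replicas yields the true mean loss.
    return tf.nn.compute_average_loss(per_example_loss,
                                      global_batch_size=GLOBAL_BATCH_SIZE)

def compute_acc(labels, predictions, metric):
    metric.update_state(labels, predictions)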
03
with strategy.scope():
    # Wrapping the distributed steps in tf.function compiles them into graphs,
    # which removes the per-step Python overhead of eager execution.
    @tf.function
    def distributed_train_step(dataset_inputs):
        per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
        return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

    @tf.function
    def distributed_test_step(dataset_inputs):
        return strategy.run(test_step, args=(dataset_inputs,))
04
The input pipeline reads the paths of all images together with their labels, then parses each path into an actual image. The key questions are: how to shuffle, where to set the number of epochs, where to set the batch size, and where to enable prefetching. (A sketch of _parse_data follows the snippet.)
image_roots, labels = generate_fileroots_labels(file_root)
dataset = tf.data.Dataset.from_tensor_slices((image_roots, labels))
# repeat() sets the number of epochs; shuffle() needs a buffer large enough to mix samples well.
dataset = dataset.repeat(100).shuffle(buffer_size=2000)
# dataset = dataset.map(_parse_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.map(_parse_data, num_parallel_calls=16)  # decode images in parallel
dataset = dataset.batch(batch_size)
# Overlap CPU preprocessing with GPU training.
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
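_parse_data and generate_fileroots_labels are not shown in the article. A minimal sketch of the map function, assuming JPEG files and a fixed target resolution (both assumptions):

IMG_SIZE = 224  # assumed input resolution

def _parse_data(image_path, label):
    # Read and decode the file, then resize and normalise to [0, 1].
    image = tf.io.read_file(image_path)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE])
    image = tf.cast(image, tf.float32) / 255.0
    return image, label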
05
import datetime

for t_step, x in enumerate(train_dist_dataset):
    # Profile steps 500-599 only, after the pipeline has warmed up.
    if t_step == 500:
        tf.profiler.experimental.start(
            '/tmp/' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
    if t_step == 600:
        tf.profiler.experimental.stop()
    # Mark each iteration as a training step so the profiler can group its events.
    with tf.profiler.experimental.Trace('Train', step_num=t_step, _r=1):
        step_loss = distributed_train_step(x)
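The captured trace can then be opened in TensorBoard's Profile tab (assuming the tensorboard_plugin_profile package is installed), for example:

tensorboard --logdir /tmp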
[Profiler overview: the step time is broken down into Kernel Launch Time, Host Compute Time and Device Compute Time.]
The profiler's recommendation for this run reads: "14.3 % of the total step time sampled is spent on 'Kernel Launch'. It could be due to CPU contention with tf.data. In this case, you may try to set the environment variable TF_GPU_THREAD_MODE=gpu_private."
export TF_GPU_THREAD_MODE=gpu_private
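The variable can also be set from Python, provided this happens before TensorFlow initialises the GPU devices (i.e. before the strategy or any GPU op is created):

import os
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'  # must be set before GPU initialisation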
[Profiler op view residue: StridedSlice and Cast ops.]
# Switch from single-worker MirroredStrategy to MultiWorkerMirroredStrategy:
# TRAIN_GPUS = [0,1,2,3]
# devices = ["/gpu:{}".format(i) for i in TRAIN_GPUS]
# strategy = tf.distribute.MirroredStrategy(devices)
tf.config.set_visible_devices(physical_devices[0:8], 'GPU')  # expose the GPUs to be used
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
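In TF 2.4 the collective implementation can also be pinned explicitly; the NCCL choice below is an assumption on our side, not something stated in the article:

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)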
06
import torch as tf