
【他山之石】An official tutorial that suits PyTorch beginners: Learning PyTorch With Examples

"他山之石, 可以攻玉" — stones from other hills can polish jade; standing on the shoulders of giants lets you see farther and go further. On the road of research, a favorable wind helps you move ahead faster. For this reason we have collected practical code links, datasets, software, and programming tips, and opened the "他山之石" (Learning from Others) column to help you ride the wind and waves and keep pressing forward. Stay tuned.

Author: 刘昕宸 (Zhihu)

Profile: https://www.zhihu.com/people/liu-xin-chen-64

This article is based on the following tutorial (readers who prefer the original English can go straight there):
https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
After reading this article you will know:
  1. Why do we need PyTorch?
  2. What exactly makes PyTorch so appealing?
  3. How does PyTorch actually work?
  4. How do you quickly build a neural network with PyTorch?
    1. No computation graph; gradients computed by hand; parameters updated by hand with SGD
    2. Data tensors and parameter tensors not separated; gradients computed automatically; parameters updated by hand with SGD
    3. Data tensors and parameter tensors separated (parameters live inside layers); gradients computed automatically; parameters updated by hand with SGD
    4. Data tensors and parameter tensors separated; gradients computed automatically; parameters updated automatically with the Adam optimizer
    5. Custom operators (forward and backward implemented by hand)
    6. Custom Modules
    7. Control flow + weight sharing
Two main features of PyTorch:
  • N-dimensional tensors, similar to numpy, but able to run on a GPU
  • Automatic differentiation for building and training large neural networks
The running example below:
A fully connected ReLU network with a single hidden layer, trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.
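Concretely, the forward pass and loss used throughout the examples below, written with the same variable names as the code, are:

h = x·w1,  h_relu = max(h, 0),  y_pred = h_relu·w2,  loss = Σ (y_pred − y)²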


01

Tensor

with numpy

First, let's implement the network with numpy. numpy offers a large set of functions for working with N-dimensional arrays and is a general-purpose scientific computing framework.
Unlike PyTorch, numpy knows nothing about computation graphs, deep learning, or gradients, so we have to implement the forward and backward passes of the network by hand:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
At this point it is hard to see any advantage of PyTorch over numpy; even the fact that PyTorch can run on a GPU does not show up here.
So what is PyTorch actually good for? Let's keep reading.

with PyTorch

numpy is an excellent framework, but it cannot use GPUs to accelerate its numerical computation. For modern deep neural networks, GPUs often provide speedups of 50x or more, so numpy alone is not well suited to building and training them.
The most fundamental concept in PyTorch is the tensor. A PyTorch tensor is conceptually like a numpy array: an N-dimensional array, with many functions provided for operating on it. Beyond that, tensors can take part in computation graphs and carry gradients (which paves the way for automatic differentiation below), and they also work as a general tool for scientific computing.
In addition, PyTorch can use GPUs to accelerate numerical computation, which is a major difference from numpy. To run on a GPU, you simply move the tensor to the GPU device.
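As a minimal sketch (the shapes and device index here are arbitrary examples, and the snippet assumes a CUDA-capable machine for the GPU branch):

import torch

x = torch.randn(64, 1000)         # created on the CPU by default
if torch.cuda.is_available():     # guard so the sketch also runs without a GPU
    x = x.to("cuda:0")            # same data, now on the first GPU
y = (x * 2).sum()                 # the computation runs on whatever device x lives on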
Here we use PyTorch tensors to implement the same 2-layer network and fit random data. As with numpy, we still implement the forward and backward passes by hand:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

02

Autograd

PyTorch: Tensors and autograd

So far we have implemented the forward and backward passes of a 2-layer network entirely by hand in PyTorch. That is manageable here, but hand-deriving the backward pass becomes extremely difficult for complex networks.
Fortunately, PyTorch provides automatic differentiation (autograd) to automate the computation of backward passes in neural networks.
When using autograd, the forward pass defines a computation graph: the nodes are tensors, and the edges are functions that take some tensors as input and produce other tensors as output. Backpropagating through this graph then gives you gradients with very little effort.
This sounds complicated, but it is simple to use in practice. Every tensor is a node in the graph; if x is a tensor with x.requires_grad=True, then x.grad is another tensor that holds the gradient of the loss with respect to x.
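A minimal sketch of this mechanism (the tensor and function are toy examples chosen only for illustration):

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()    # the forward pass records the graph
loss.backward()          # the backward pass walks the graph
print(x.grad)            # tensor([2., 4., 6.]), i.e. d(loss)/dx = 2 * x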
Now let's implement our 2-layer network with PyTorch tensors and autograd; we no longer need to write the backward pass ourselves:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a Tensor of shape (1,)
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
A couple of clarifications:
loss.backward() is the backward pass itself: PyTorch automatically computes the gradients of all relevant tensors, and w1.grad then gives you the gradient of w1.
Finally, we still use those gradients to update the parameters by hand.
torch.no_grad() also deserves an explanation:
w1 and w2 are network parameters, and their update steps must not be recorded into the computation graph, so we wrap the updates in torch.no_grad().
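As the comment inside the training loop above already hints, an equivalent trick (a sketch, not something you must use) is to update through .data, which shares storage with the weight but is not tracked by autograd:

# Equivalent to the torch.no_grad() block above: .data is not tracked by autograd,
# so these updates never enter the computation graph.
w1.data -= learning_rate * w1.grad.data
w2.data -= learning_rate * w2.grad.data
w1.grad.data.zero_()
w2.grad.data.zero_()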

PyTorch: Defining new autograd functions

Under the hood, each primitive autograd operator is really two functions that operate on tensors:
  1. A forward function that computes output tensors from input tensors
  2. A backward function that receives the gradients of the output tensors and computes the gradients of the input tensors
In PyTorch we can define our own autograd operators by subclassing torch.autograd.Function and implementing forward and backward.
Once defined, such an operator can be used to build our own neural networks.
The following code defines a custom autograd operator for the ReLU nonlinearity and uses it to implement our 2-layer network:
# -*- coding: utf-8 -*-
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
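A side note that is not part of the original tutorial: torch.autograd.gradcheck is a convenient way to verify a custom backward against numerically estimated gradients. A minimal sketch, using double precision as gradcheck expects:

from torch.autograd import gradcheck

# Compare MyReLU's analytic backward with numerical gradients on a small input.
test_input = torch.randn(4, 6, dtype=torch.double, requires_grad=True)
print(gradcheck(MyReLU.apply, (test_input,), eps=1e-6, atol=1e-4))  # True if they agree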


03

nn module

PyTorch: nn

The raw autograd shown above already works, but if you think about it, it is still not enough.
For a large neural network with thousands of parameter tensors, manually declaring every learnable tensor and manually updating each of them would be a nightmare.
What we want is to attach the learnable parameters directly to layers, keeping them separate from the input/output tensors that flow through those layers.
On the TensorFlow side, packages such as Keras, TensorFlow-Slim, and TFLearn do exactly this: they provide higher-level building blocks for constructing neural networks.
In PyTorch, the nn package serves this purpose.
The nn package defines a set of Modules, which roughly correspond to neural network layers. A Module receives input tensors, computes output tensors, and can hold internal state such as learnable parameters.
The nn package also defines a set of loss functions that are commonly used during training.
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
In the final update step, each param is still a tensor, just as before. The difference is that the parameters now live inside the Modules, so we no longer define them explicitly ourselves.
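If you want to see what model.parameters() actually yields, here is a quick sketch (continuing from the nn.Sequential model above; the shapes follow from N, D_in, H, D_out = 64, 1000, 100, 10):

# Each entry is a learnable Tensor registered inside one of the Linear modules.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
# 0.weight (100, 1000)
# 0.bias   (100,)
# 2.weight (10, 100)
# 2.bias   (10,)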

PyTorch: optim

Above we updated the parameters with plain SGD (stochastic gradient descent), which only needs the learning rate and the gradients. In practice, however, we usually train with optimizers such as AdaGrad, RMSProp, or Adam.
An introduction to Momentum, RMSProp, and Adam:
https://zhuanlan.zhihu.com/p/268193140
PyTorch's optim package provides implementations of a wide range of optimization algorithms.
In the following example we build the network with nn as before, but optimize it with torch.optim.Adam (the Adam optimizer).
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Checkout docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()
With a ready-made optimizer we no longer need to update the parameters by hand, as we did before (the plain SGD update):
with torch.no_grad():
    for param in model.parameters():
        param -= learning_rate * param.grad
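That hand-written loop is roughly what plain torch.optim.SGD does for you; a sketch of the drop-in replacement inside the training loop:

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# ... then, per iteration:
optimizer.zero_grad()   # replaces the manual grad.zero_() calls
loss.backward()
optimizer.step()        # replaces "param -= learning_rate * param.grad"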

PyTorch: Custom nn Modules

We can also define our own Modules by subclassing torch.nn.Module and implementing a forward method.
The backward pass comes for free: it follows from the computation graph built in forward and from the backward functions already defined for the Modules and operators used there, so we never have to implement it by hand.
In the example below we wrap our 2-layer network into a single Module:
# -*- coding: utf-8 -*-
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
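A quick way to confirm that the two nn.Linear layers were registered as submodules is simply to print the model; with the dimensions used above, the output looks roughly like this:

print(model)
# TwoLayerNet(
#   (linear1): Linear(in_features=1000, out_features=100, bias=True)
#   (linear2): Linear(in_features=100, out_features=10, bias=True)
# )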

PyTorch: Control Flow + Weight Sharing

To finish, let's implement a rather odd model:
a fully connected ReLU network that, on each forward pass, samples a random number k between 1 and 4 and uses k hidden layers, i.e. it calls the same middle layer k times and reuses its parameters.
Reading the code directly is probably the clearest explanation:
# -*- coding: utf-8 -*-
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
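To convince yourself that the middle layer's weights really are shared across its repeated applications, you can count the model's parameters; a small sketch (continuing from the DynamicNet instance above):

# Only three Linear layers' worth of parameters exist, no matter how many times
# middle_linear is applied within a single forward pass.
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # (1000*100 + 100) + (100*100 + 100) + (100*10 + 10) = 111210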


This article is shared for the purpose of academic exchange only; it does not mean that this account endorses its views or takes responsibility for the accuracy of its content. Copyright belongs to the original author; if there is any infringement, please let us know and the article will be removed.

