Lecture 3: The PPNet Training Algorithm and Its Implementation
#########################################################################################

Introduction
***************************

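This lecture walks through the Python implementation of the PPNet training algorithm. Interested readers can use the debugging approach from Lecture 1 on `acdevent.com's Jupyterhub server <https://acdevent.com/simulation.html>`__ to step through the code and inspect the shape, dtype, value range, and device of the relevant variables and Tensors.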

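Note that the PPNet running on this Jupyterhub is a simplified version of the full implementation, aimed at AI/ML beginners. Its configuration therefore differs from the one used for the simulation results in the original paper, so keep the differences in mind while debugging.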

Python Implementation and Mathematical Principles of the PPNet Training Algorithm
***********************************************************************************

Log in to `acdevent.com's Jupyterhub server <https://acdevent.com/simulation.html>`__ and open main.ipynb. The training algorithm of the simplified PPNet is implemented in the _train_or_test function in cell 10 of main.ipynb (Figure 1).

.. figure:: ./images/L3/figure01.1.png

.. figure:: ./images/L3/figure01.2.png

   Figure 1: The _train_or_test function

The walkthrough below starts at line 15 of that cell.

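**Line 15:** iterates over the dataloader, taking one batch of images at a time, and yields:

* i: a counter that starts at 0.

* image: a Tensor that lives on the CPU (its is_cpu attribute is True and its is_cuda attribute is False). Its shape is 5*3*56*56: 5 is the batch_size (n in the original paper); 3 is the number of color channels of the input image (R, G, and B); 56*56 is the height and width to which the input images are resized (img_size). The simulation code of the original paper uses an img_size of 224*224, whereas this simplified version uses 56*56.

Each pixel of image is a floating-point number, for example: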

    | image[0,0,:,:]=tensor([[-1.4843, -1.2788, -1.3130,  ..., -0.8849, -1.5185, -1.4672],
    |   [-1.3644, -1.5014, -1.6555,  ..., -0.4911, -1.5014, -1.5185],
    |   [-1.6384, -1.7240, -1.6727,  ...,  0.1254, -1.3987, -1.5699],
    |   ...,
    |   [-0.5938, -0.6281, -0.8164,  ..., -0.2171, -0.5082, -0.6965],
    |   [-0.7479, -0.8335, -0.8507,  ..., -0.2342, -0.5253, -0.6965],
    |   [-0.8164, -0.8507, -0.7993,  ..., -0.2684, -0.5424, -0.6965]])

The values of batch_size and img_size are configured in cell 5 of main.ipynb (Figure 2):

.. figure:: ./images/L3/figure02.png

   Figure 2: Configuring batch_size and img_size

* label: also a Tensor. Its shape is 5 (the batch_size), and each element is 0 or 1, for example:

    |    label=tensor([0, 0, 0, 1, 0])

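This is because the simplified version has only 2 classes (K=2 in the notation of the original paper), whereas the first experiment in the original paper covers 200 bird species (K=200).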
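
The properties listed above can be checked directly in a notebook cell. The following is a minimal sketch, not code taken from main.ipynb: it builds a random stand-in batch with the same shapes (the helper name describe is hypothetical) and prints the attributes worth inspecting while debugging.

.. code-block:: python

   import torch

   # Stand-in batch with the shapes described above (random values, not real images).
   batch_size, channels, img_size = 5, 3, 56
   image = torch.randn(batch_size, channels, img_size, img_size)
   label = torch.tensor([0, 0, 0, 1, 0])

   def describe(name, t):
       """Return a one-line summary of a tensor for quick inspection."""
       return (f"{name}: shape={tuple(t.shape)}, dtype={t.dtype}, "
               f"device={t.device}, is_cuda={t.is_cuda}, "
               f"range=[{t.min().item():.4f}, {t.max().item():.4f}]")

   print(describe("image", image))
   print(describe("label", label))

Calling describe on the real image and label inside the training loop should report the same shapes; the value range depends on the normalization applied by the dataloader.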

K is set at around the third line of cell 3 of main.ipynb (Figure 3):

.. figure:: ./images/L3/figure03.png

   Figure 3: Configuring K, the number of classes

**Lines 19 and 20:** open a context manager so that gradients are tracked.
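
As a rough illustration of what such a context manager does, the sketch below assumes the common pattern of enabling gradient tracking for training and disabling it for testing. The helper grad_context and the flag is_train are assumptions made for this example, not names confirmed from main.ipynb.

.. code-block:: python

   import torch

   def grad_context(is_train):
       """Pick the context manager: track gradients only while training."""
       return torch.enable_grad() if is_train else torch.no_grad()

   w = torch.ones(3, requires_grad=True)

   with grad_context(is_train=True):
       y_train = (w * 2).sum()      # recorded in the autograd graph

   with grad_context(is_train=False):
       y_test = (w * 2).sum()       # not recorded, which is cheaper for evaluation

   print(y_train.requires_grad, y_test.requires_grad)

Only results produced while tracking is enabled can later be passed to backward().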

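**Line 21:** feeds the batch of images into the PPNet model, which returns output and min_distances.

PPNet is a model whose construction was explained in detail in Lecture 2: it chains together the convolutional layers, the prototype layer, and the fully connected layer of the ProtoPNet architecture shown in Figure 2 of the original paper.

* output is a Tensor of shape 5*2, where 5 is the batch_size and 2 is the number of classes. Its elements are logits, for example: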

    | output=tensor([[15.2223, 11.4460],
    |         [11.9614, 12.5399],
    |         [12.6952,  9.8602],
    |         [13.5075, 11.0368],
    |         [11.8025, 11.4291]], grad_fn=<MmBackward0>)

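* min_distances is the output of max_pool in the architecture diagram (Figure 2 of the original paper), that is, the input from which the similarity scores are computed:

.. math::

   d_j \doteq \min_{\tilde{z} \in \text{patches}(f(x))} \| \tilde{z} - p_j \|_2^2, \quad j = 0, 1, 2, \ldots, m - 1

Here x is one image of the batch and m is the total number of prototype vectors: 2000 in the experiments of the original paper and 128 in this simplified version. The shape of min_distances is therefore 5*128 (5 is the batch_size and 128 is m). Example values: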

    | min_distances=tensor([[2.1039, 0.9906, 2.6339, 1.9970, 1.3366, 1.7225, 2.0380, 1.0471, 2.0000,
    |          1.4033, 2.7194, 2.0529, 1.7664, 2.9312, 1.9500, 2.7546, 1.0648, 2.5206,
    |          1.4906, 1.4306, 1.3405, 1.4322, 2.5337, 2.1308, 2.2131, 2.5743, 3.2091,
    |          0.9782, 1.8158, 1.8468, 2.7401, 1.7548, 2.4767, 1.0232, 1.0433, 0.5972,
    |          1.9929, 3.9777, 1.8551, 2.9142, 1.4984, 3.0085, 1.9300, 1.9615, 2.2476,
    |          1.2228, 1.5033, 2.1017, 2.4414, 3.0683, 2.6542, 3.2728, 1.1452, 1.2628,
    |          2.2507, 3.3322, 1.6725, 3.0021, 1.7119, 2.4529, 2.7976, 2.5616, 3.0969,
    |          0.9780, 3.4496, 3.1327, 0.7663, 1.6553, 2.4345, 2.3039, 2.5542, 1.9128,
    |          2.9555, 1.2214, 1.6910, 2.1016, 2.4681, 1.4196, 2.6411, 3.3818, 2.3821,
    |          2.6797, 1.7855, 0.8982, 1.8090, 2.1926, 1.9372, 1.5277, 2.3663, 1.3957,
    |          2.1039, 3.0425, 2.3233, 1.7481, 2.8753, 2.4431, 4.2995, 2.1842, 2.8236,
    |          2.4499, 1.3821, 3.7060, 2.9607, 3.3910, 3.1599, 1.7046, 1.5531, 3.8703,
    |          1.5109, 1.8871, 2.3786, 2.0469, 3.0021, 2.9218, 2.7644, 2.4703, 1.4400,
    |          0.4638, 1.6315, 2.4686, 2.2353, 4.0935, 2.9652, 2.7879, 1.6652, 1.5567,
    |          2.0876, 2.5679],
    | .....)

The number of prototype vectors, m, is configured in cell 3 of main.ipynb (Figure 4):

.. figure:: ./images/L3/figure04.png

   Figure 4: Configuring m, the number of prototype vectors
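
One way to obtain a tensor like min_distances is sketched below. It is an illustration under stated assumptions rather than the notebook's code: the feature-map size (D, H, W), the 1x1 prototype shape, and the random inputs are all made up. The minimum over patches is written as a max pool over negated distances, which matches the max_pool box in the architecture diagram.

.. code-block:: python

   import torch
   import torch.nn.functional as F

   torch.manual_seed(0)
   n, D, H, W, m = 5, 8, 7, 7, 128          # D, H, W are assumed sizes
   z = torch.rand(n, D, H, W)               # stand-in for f(x), the conv features
   prototypes = torch.rand(m, D, 1, 1)      # stand-in for the prototype vectors p_j

   # Squared L2 distance between every 1x1 patch and every prototype,
   # expanded as |z|^2 - 2 z.p + |p|^2 so it can be computed with convolutions.
   z2 = F.conv2d(z ** 2, torch.ones(m, D, 1, 1))               # (n, m, H, W)
   zp = F.conv2d(z, prototypes)                                # (n, m, H, W)
   p2 = (prototypes ** 2).sum(dim=(1, 2, 3)).view(1, m, 1, 1)
   distances = F.relu(z2 - 2 * zp + p2)                        # clamp tiny negatives

   # Minimum over all patches, written as a max pool over negated distances.
   min_distances = -F.max_pool2d(-distances, kernel_size=(H, W)).view(n, m)
   print(min_distances.shape)

The result has one row per image and one column per prototype, which is the 5*128 layout described above.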

**Line 23:** computes the cross-entropy loss of the batch, which is the first term (written Ce here) of the optimization problem in Section 2.2, Training algorithm, of the original paper:

.. math::

   Ce \doteq \frac{1}{n} \sum_{i=1}^{n} \mathrm{CrsEnt}\!\left(h \circ g_p \circ f(x_i), \, y_i\right)

The function CrsEnt() is defined as:

.. math::

   \mathrm{CrsEnt}(\bar{o}_i, y_i) \doteq - \log \left( \frac{\exp(o_{i,y_i})}{\sum_{k=1}^{K} \exp(o_{i,k})} \right)

where

.. math::

   \bar{o}_i \doteq h \circ g_p \circ f(x_i)

The full definition of torch.nn.functional.cross_entropy(.....) is given in the `PyTorch documentation for torch.nn.functional.cross_entropy <https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html>`__.

Interested readers can experiment with the other options of this function, such as class-specific weights and label smoothing.
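
To connect the formula with the library call, the sketch below evaluates CrsEnt by hand and compares it with torch.nn.functional.cross_entropy. The logits and labels are copied from the example values earlier in this lecture and are paired here only for illustration; they need not come from the same batch.

.. code-block:: python

   import torch
   import torch.nn.functional as F

   # Logits and labels copied from the examples above.
   output = torch.tensor([[15.2223, 11.4460],
                          [11.9614, 12.5399],
                          [12.6952,  9.8602],
                          [13.5075, 11.0368],
                          [11.8025, 11.4291]])
   label = torch.tensor([0, 0, 0, 1, 0])

   # CrsEnt written out: -log(softmax(o_i)[y_i]), then averaged over the batch.
   log_probs = output - torch.logsumexp(output, dim=1, keepdim=True)
   manual_ce = -log_probs[torch.arange(len(label)), label].mean()

   builtin_ce = F.cross_entropy(output, label)   # default reduction is "mean"
   print(manual_ce.item(), builtin_ce.item())

The two numbers should agree up to floating-point rounding, because the default reduction of cross_entropy is the batch mean that appears as 1/n in the formula.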

**Lines 25 to 29:** compute the cluster cost (Clst) of the original paper:

.. math::

   \text{Clst} \doteq \frac{1}{n} \sum_{i=1}^{n} \left\{ \min_{j:p_j \in \mathbf{P}_{y_i}} \left( \min_{\tilde{z} \in \text{patches}(f(x_i))}  \left\| \tilde{z} - p_j \right\|_2^2 \right) \right\}

As noted above, min_distances already holds

.. math::

   \min_{\tilde{z} \in \text{patches}(f(x))} \| \tilde{z}-p_j \|_2^2, \quad j=0,1,2,\ldots,m-1

so all that remains is to select the index set :math:`\left\{ j:p_j \in \mathbf{P}_{y_i} \right\}`.

Line 27 does this:

    |    correct_class_indicators = torch.t(ppnet.prototype_class_identity[:, target_class])
    |     (here torch.t simply transposes a 2-D Tensor)

In addition, lines 25 to 29 rely on a min-max transform:

    | For any set :math:`\left\{x_1, x_2, x_3, \ldots, x_n \right\}` and a sufficiently large constant :math:`C`, the following always holds:

.. math::

   \min(x_1, x_2, \ldots, x_n) = C - \max(C-x_1, C-x_2, \ldots, C-x_n)
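
The role of the constant can be seen in a small sketch of the cluster cost. This is an illustration under assumptions, not the notebook's code: the distances are random, prototypes are assumed to be split evenly and contiguously between the classes, label plays the role of target_class, and C = 10.0 is an arbitrary constant that exceeds every distance.

.. code-block:: python

   import torch

   torch.manual_seed(0)
   n, m, K = 5, 128, 2
   min_distances = torch.rand(n, m) * 4            # stand-in for the model output
   label = torch.tensor([0, 0, 0, 1, 0])

   # Assumed layout: the first m // K prototypes belong to class 0, the rest to class 1.
   per_class = m // K
   prototype_class_identity = torch.zeros(m, K)
   for k in range(K):
       prototype_class_identity[k * per_class:(k + 1) * per_class, k] = 1

   C = 10.0    # any constant no smaller than the largest distance
   correct_class_indicators = torch.t(prototype_class_identity[:, label])  # (n, m)

   # Masked-out entries become 0, and C - x >= 0 keeps them from winning the max.
   inverted, _ = torch.max((C - min_distances) * correct_class_indicators, dim=1)
   cluster_cost = torch.mean(C - inverted)
   print(cluster_cost.item())

Multiplying by a 0/1 mask can only zero entries out, so the minimum over a subset is recovered by taking a maximum of the masked values C - x. That is why C must be large enough to keep every C - x non-negative.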
   
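**Lines 32 to 34:** compute the separation cost (Sep) of the original paper. (Note that the sign here differs from the paper's definition; the difference is compensated later through the sign of the corresponding weight in the loss.)

.. math::

   Sep \doteq \frac{1}{n} \sum_{i=1}^{n} \min_{j:p_j \notin \mathbf{P}_{y_i}} \min_{\tilde{z} \in \text{patches}(f(x_i))} \| \tilde{z}-p_j \|_2^2

The computation parallels that of the cluster cost; the only difference is that the index set to select is:

.. math::

   \left\{ j: p_j\notin \mathbf{P}_{y_i} \right\}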
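
A sketch of this step, under the same assumptions as for the cluster cost (random distances, an even and contiguous split of prototypes between classes, and an arbitrary constant C = 10.0), only needs the complementary mask. The name wrong_class_indicators is hypothetical.

.. code-block:: python

   import torch

   torch.manual_seed(0)
   n, m, K = 5, 128, 2
   min_distances = torch.rand(n, m) * 4            # stand-in for the model output
   label = torch.tensor([0, 0, 0, 1, 0])

   per_class = m // K                              # assumed even split of prototypes
   prototype_class_identity = torch.zeros(m, K)
   for k in range(K):
       prototype_class_identity[k * per_class:(k + 1) * per_class, k] = 1

   C = 10.0
   correct_class_indicators = torch.t(prototype_class_identity[:, label])
   wrong_class_indicators = 1 - correct_class_indicators   # 1 where p_j is not in P_{y_i}

   inverted, _ = torch.max((C - min_distances) * wrong_class_indicators, dim=1)
   separation_cost = torch.mean(C - inverted)
   print(separation_cost.item())

The value is positive, as in the formula above; making the loss reward large separations is left to the sign of its weight.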


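**Lines 37 and 38:** implement an alternative separation cost introduced by the authors of the original paper (it is not mentioned in the paper itself):

.. math::

   \widetilde{Sep} \doteq \frac{1}{n}\sum_{i=1}^{n}\underset{j: p_j\notin \mathbf{P}_{y_i}}{\mathrm{mean}} \min_{\tilde{z} \in \text{patches}(f(x_i))}\|\tilde{z} - p_j \|_2^2

In this experiment, the separation cost computed this way is used only for logging.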
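
Because a mean needs no min-max trick, this variant reduces to a masked sum divided by the number of selected prototypes. The sketch below is an assumption-based illustration with random distances and an even, contiguous split of prototypes; the variable names are hypothetical.

.. code-block:: python

   import torch

   torch.manual_seed(0)
   n, m, K = 5, 128, 2
   min_distances = torch.rand(n, m) * 4            # stand-in for the model output
   label = torch.tensor([0, 0, 0, 1, 0])

   per_class = m // K                              # assumed even split of prototypes
   prototype_class_identity = torch.zeros(m, K)
   for k in range(K):
       prototype_class_identity[k * per_class:(k + 1) * per_class, k] = 1

   wrong_class_indicators = 1 - torch.t(prototype_class_identity[:, label])  # (n, m)

   # Per image: average distance to the prototypes of all other classes.
   per_image = (torch.sum(min_distances * wrong_class_indicators, dim=1)
                / torch.sum(wrong_class_indicators, dim=1))
   avg_separation_cost = torch.mean(per_image)
   print(avg_separation_cost.item())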

**Lines 40 and 41:** compute the L1 cost of PPNet's fully connected layer. The original paper describes this cost in Section 2.2, under 'Convex optimization of last layer', as:

.. math::

   L1 \doteq \sum_{k=1}^{K}\sum_{j: p_j\notin \mathbf{P}_k}|w_h^{(k,j)}|
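
The double sum can be written as a masked L1 norm of the weight matrix. The sketch below is a hedged illustration: last_layer is a freshly initialized torch.nn.Linear standing in for PPNet's fully connected layer, and the even, contiguous split of prototypes between classes is an assumption.

.. code-block:: python

   import torch

   torch.manual_seed(0)
   m, K = 128, 2
   per_class = m // K                              # assumed even split of prototypes
   prototype_class_identity = torch.zeros(m, K)
   for k in range(K):
       prototype_class_identity[k * per_class:(k + 1) * per_class, k] = 1

   last_layer = torch.nn.Linear(m, K, bias=False)  # weight shape: (K, m)

   # The mask is 1 exactly where prototype j does NOT belong to class k.
   l1_mask = 1 - torch.t(prototype_class_identity)             # (K, m)
   l1 = (last_layer.weight * l1_mask).abs().sum()
   print(l1.item())

Only the connections between a class and the prototypes of other classes are penalized; the weights linking a class to its own prototypes do not enter the sum.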
   
**Lines 44 to 46:** compute PPNet's classification accuracy: for each image, the class with the largest output logit is taken as the prediction (i.e., _, predicted = torch.max(output.data, dim=1)).
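
As a small worked example, the sketch below applies this rule to the logits and labels shown earlier in this lecture. They are paired here only for illustration, and the names n_correct and accuracy are hypothetical.

.. code-block:: python

   import torch

   # Logits and labels copied from the examples above.
   output = torch.tensor([[15.2223, 11.4460],
                          [11.9614, 12.5399],
                          [12.6952,  9.8602],
                          [13.5075, 11.0368],
                          [11.8025, 11.4291]])
   label = torch.tensor([0, 0, 0, 1, 0])

   _, predicted = torch.max(output.data, dim=1)   # index of the largest logit per row
   n_correct = (predicted == label).sum().item()
   accuracy = n_correct / label.size(0)
   print(predicted.tolist(), n_correct, accuracy)

With these values the second and fourth images are misclassified, so three of the five predictions are correct.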

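**Lines 48 to 52:** collect the individual loss/cost values computed during training.

**Lines 55 to 59:** compute the loss and run stochastic gradient descent (SGD). The loss is the objective from Section 2.2, Training algorithm, of the original paper: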

.. math::

   \min_{\mathbf{P}, w_{conv}} Ce+\lambda_1 Clst + \lambda_2 Sep

The weights :math:`\lambda_1, \lambda_2` are configured in cell 12 of main.ipynb (Figure 5):

.. figure:: ./images/L3/figure05.png

   Figure 5: Configuring the weights of the loss terms

Once the loss has been computed, the PyTorch SGD update is executed (Figure 6):

.. figure:: ./images/L3/figure06.png

   Figure 6: The SGD update step

This SGD update is needed only during training; it is not executed during testing.
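
The usual shape of such a step is sketched below. Everything in it is a stand-in chosen to keep the example self-contained: the model is a tiny linear layer, the cluster and separation terms are toy expressions, and the weights in coefs are placeholder values rather than the ones configured in the notebook. Note the negative weight on the separation term, which is how a positively computed cost can be made to act with the opposite sign.

.. code-block:: python

   import torch
   import torch.nn.functional as F

   torch.manual_seed(0)
   model = torch.nn.Linear(4, 2)                   # toy stand-in for PPNet
   optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

   x = torch.randn(5, 4)
   label = torch.tensor([0, 0, 0, 1, 0])

   # Placeholder weights (assumed values, not read from the notebook).
   coefs = {"crs_ent": 1.0, "clst": 0.8, "sep": -0.08}

   def train_step(is_train=True):
       output = model(x)
       cross_entropy = F.cross_entropy(output, label)
       # Toy stand-ins for Clst and Sep so the sketch stays self-contained.
       cluster_cost = output.pow(2).mean()
       separation_cost = output.abs().mean()
       loss = (coefs["crs_ent"] * cross_entropy
               + coefs["clst"] * cluster_cost
               + coefs["sep"] * separation_cost)
       if is_train:                      # skipped entirely during testing
           optimizer.zero_grad()         # clear gradients from the previous batch
           loss.backward()               # backpropagate through the model
           optimizer.step()              # apply the SGD update
       return loss.item()

   print(train_step(is_train=True))

Calling train_step(is_train=False) evaluates the same loss but leaves the parameters untouched.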

Note that in :math:`\min_{\mathbf{P},w_{conv}} Ce+\lambda_1 Clst + \lambda_2 Sep` the parameters being optimized are :math:`\mathbf{P}` and :math:`w_{conv}`, not :math:`w_h^{(.,.)}`. Yet the loss actually computed on line 56 in main.ipynb is:

.. math::
   
   Ce+\lambda_1 Clst + \lambda_2 Sep + \lambda_3 L1

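So how is the optimizer told to optimize only :math:`\mathbf{P}` and :math:`w_{conv}` and to leave :math:`w_h^{(.,.)}` alone?

This is done by calling the functions shown in Figure 7 (cell 10 of main.ipynb) during the warm_only, joint, and last_only stages of training; they control whether the gradients of the corresponding parameters are computed and whether those parameters are updated.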

.. figure:: ./images/L3/figure07.1.png

.. figure:: ./images/L3/figure07.2.png

.. figure:: ./images/L3/figure07.3.png

   Figure 7: requires_grad settings for the warm-up, joint, and last-layer stages

The table below lists the requires_grad setting of each layer's parameters in the three stages (warm_only, joint, last_only):

+---------------+------------------------------+-------------------------------------+-----------------+-----------------------+
| requires_grad | Convolutional layers: VGG-19 | Convolutional layers: add-on layers | Prototype layer | Fully connected layer |
+===============+==============================+=====================================+=================+=======================+
| warm_only     | False                        | True                                | True            | True                  |
+---------------+------------------------------+-------------------------------------+-----------------+-----------------------+
| joint         | True                         | True                                | True            | False                 |
+---------------+------------------------------+-------------------------------------+-----------------+-----------------------+
| last_only     | False                        | False                               | False           | True                  |
+---------------+------------------------------+-------------------------------------+-----------------+-----------------------+

Therefore, during the joint stage the gradients of the fully connected layer's parameters (that is, :math:`w_{h}^{(.,.)}`) are not computed, so the optimizer has nothing to apply to them and the values of :math:`w_{h}^{(.,.)}` stay fixed.
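
The effect of requires_grad on an update can be reproduced on a toy model. The sketch below is not the code of Figure 7: the two linear layers and the helper set_requires_grad are hypothetical stand-ins, configured like the prototype layer and the fully connected layer in the joint row of the table.

.. code-block:: python

   import torch

   torch.manual_seed(0)
   # Toy stand-ins for two parts of a model (names are hypothetical).
   prototype_like = torch.nn.Linear(4, 3, bias=False)
   last_layer_like = torch.nn.Linear(3, 2, bias=False)

   def set_requires_grad(module, flag):
       """Switch gradient computation on or off for every parameter of a module."""
       for p in module.parameters():
           p.requires_grad = flag

   # "joint"-style setting: train the first part, freeze the last layer.
   set_requires_grad(prototype_like, True)
   set_requires_grad(last_layer_like, False)

   params = list(prototype_like.parameters()) + list(last_layer_like.parameters())
   optimizer = torch.optim.SGD(params, lr=0.1)

   frozen_before = last_layer_like.weight.detach().clone()
   trained_before = prototype_like.weight.detach().clone()

   loss = last_layer_like(prototype_like(torch.randn(5, 4))).pow(2).mean()
   optimizer.zero_grad()
   loss.backward()
   optimizer.step()

   print(last_layer_like.weight.grad)   # None: no gradient was computed

Even though the frozen layer is registered with the optimizer, its weights come out of the step unchanged, while the trainable layer is updated.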

Summary
***********************************************************************************

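That concludes the walkthrough of the Python implementation of PPNet training, which shows some neat programming techniques and configuration choices made by the authors of the original paper. The next lecture covers another algorithm from the paper: projection of prototypes.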

.. _pytorch doc - torch.nn.functional.cross_entry: https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html