Lecture 3: Training algorithm of PPNet and its Python implementation
#########################################################################################

Introduction
***************************

This lecture introduces the Python implementation of PPNet's training algorithm. Readers interested in this topic can use the debugging methods covered in Lecture 1 to watch the relevant variables or Tensor properties, such as shape, data type, value range, and device, on the `acdevent.com's Jupyterhub server `__.

Note: the PPNet running on acdevent.com's Jupyterhub is a simplified version for AI/ML beginners, developed from the original full version, so its configuration and simulation results differ from those of the original paper. Readers should keep this difference in mind while debugging.

Python implementation and mathematical principles of PPNet training algorithm
***********************************************************************************

Log in and open main.ipynb on the `acdevent.com's Jupyterhub server `__. The Python implementation of this simplified PPNet training algorithm is found in the *train_or_test* function of the 10th cell in main.ipynb (as shown in Figure 1).

.. figure:: ./images/L3/figure01.1.png

.. figure:: ./images/L3/figure01.2.png

Figure 1 The *train_or_test* function

The implementation of the algorithm is explained below, starting from line 15 of this cell.

**Line 15:** Take the batches of images from the dataloader one batch at a time to get:

* i: a counter starting from 0
* image: a Tensor with its *is_cpu* attribute set to True and *is_cuda* set to False. Its shape is 5 :math:`\times` 3 :math:`\times` 56 :math:`\times` 56. Here, 5 is the batch size (corresponding to the parameter *n* in the original paper); 3 is the number of color channels in the input image (i.e., R, G, B); and 56 :math:`\times` 56 is the normalized size (img_size) of the input image. While the original paper's implementation used 224 :math:`\times` 224 for img_size, this simplified version uses 56 :math:`\times` 56. Each pixel in the image is a floating-point number, for example:

| image[0,0,:,:]=tensor([[-1.4843, -1.2788, -1.3130, ..., -0.8849, -1.5185, -1.4672],
| [-1.3644, -1.5014, -1.6555, ..., -0.4911, -1.5014, -1.5185],
| [-1.6384, -1.7240, -1.6727, ..., 0.1254, -1.3987, -1.5699],
| ...,
| [-0.5938, -0.6281, -0.8164, ..., -0.2171, -0.5082, -0.6965],
| [-0.7479, -0.8335, -0.8507, ..., -0.2342, -0.5253, -0.6965],
| [-0.8164, -0.8507, -0.7993, ..., -0.2684, -0.5424, -0.6965]])

The values of batch_size and img_size are configured in the fifth cell of main.ipynb (see Figure 2).

.. figure:: ./images/L3/figure02.png

Figure 2 Config batch_size and img_size

* label: also a Tensor; its shape is 5 (that is, batch_size), and each element has a value of 0 or 1, for example:

| label=tensor([0, 0, 0, 1, 0])

This is because this simplified version has only two categories (i.e., *K=2* in the original paper), whereas the first experiment in the original paper had 200 bird species (i.e., *K=200*). The value of K is configured in the third cell of main.ipynb, approximately at the 3rd row (see Figure 3):

.. figure:: ./images/L3/figure03.png

Figure 3 Number of K-category configurations

**Lines 19 and 20:** Used to open the context manager and start tracking gradients.

**Line 21:** The images of each batch are fed into the PPNet model to obtain output and min_distance. A minimal sketch of the loop in lines 15 to 21 is given below.
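The following is a minimal, self-contained sketch of this loop, assuming stand-in data and a stub model; the stub, the data, and the variable names are hypothetical, and only the shapes match the real notebook:

.. code-block:: python

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    batch_size, img_size, K, m = 5, 56, 2, 128        # values from the config cells

    # Hypothetical stand-in data with the same shapes as the real dataset
    images = torch.randn(20, 3, img_size, img_size)   # N x C x H x W
    labels = torch.randint(0, K, (20,))
    loader = DataLoader(TensorDataset(images, labels), batch_size=batch_size)

    # A stub standing in for the real PPNet: returns (logits, min_distances)
    class StubPPNet(nn.Module):
        def forward(self, x):
            n = x.shape[0]
            return torch.randn(n, K, requires_grad=True), torch.rand(n, m)

    ppnet = StubPPNet()
    is_train = True

    for i, (image, label) in enumerate(loader):                      # line 15
        # Properties worth watching while debugging (see Lecture 1)
        print(i, image.shape, image.dtype, image.is_cpu, image.is_cuda)
        with torch.enable_grad() if is_train else torch.no_grad():   # lines 19-20
            output, min_distances = ppnet(image)                     # line 21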
PPNet is a model whose construction was explained in detail in Lecture 2: it is the cascade of the convolutional layers, the prototype layer, and the fully connected layer of the ProtoPNet architecture in Figure 2 of the original paper.

* output is a Tensor with shape 5 :math:`\times` 2, where 5 is the batch size and 2 is the number of categories to be classified. Its elements are logits, for example:

| output=tensor([[15.2223, 11.4460],
| [11.9614, 12.5399],
| [12.6952, 9.8602],
| [13.5075, 11.0368],
| [11.8025, 11.4291]], grad_fn=<...>)

* min_distance is the output of max_pool in the architecture diagram of Figure 2 of the original paper, and is also the input used to calculate the similarity score:

.. math::

   d_j \doteq \min_{\tilde{z} \in \mathrm{patches}(f(x))} \| \tilde{z} - p_j \|_2^2, \quad j = 0, 1, 2, \ldots, m - 1

Here *x* represents an image within each batch, while *m* denotes the total number of prototype vectors. In the original paper's experiments, *m* was 2000, whereas in this simplified version *m* is set to 128. Consequently, the shape of min_distance is 5 :math:`\times` 128 (where 5 corresponds to the batch size and 128 to the total number of prototype vectors *m*). An example of its element values:

| min_distances=tensor([[2.1039, 0.9906, 2.6339, 1.9970, 1.3366, 1.7225, 2.0380, 1.0471, 2.0000,
| 1.4033, 2.7194, 2.0529, 1.7664, 2.9312, 1.9500, 2.7546, 1.0648, 2.5206,
| 1.4906, 1.4306, 1.3405, 1.4322, 2.5337, 2.1308, 2.2131, 2.5743, 3.2091,
| 0.9782, 1.8158, 1.8468, 2.7401, 1.7548, 2.4767, 1.0232, 1.0433, 0.5972,
| 1.9929, 3.9777, 1.8551, 2.9142, 1.4984, 3.0085, 1.9300, 1.9615, 2.2476,
| 1.2228, 1.5033, 2.1017, 2.4414, 3.0683, 2.6542, 3.2728, 1.1452, 1.2628,
| 2.2507, 3.3322, 1.6725, 3.0021, 1.7119, 2.4529, 2.7976, 2.5616, 3.0969,
| 0.9780, 3.4496, 3.1327, 0.7663, 1.6553, 2.4345, 2.3039, 2.5542, 1.9128,
| 2.9555, 1.2214, 1.6910, 2.1016, 2.4681, 1.4196, 2.6411, 3.3818, 2.3821,
| 2.6797, 1.7855, 0.8982, 1.8090, 2.1926, 1.9372, 1.5277, 2.3663, 1.3957,
| 2.1039, 3.0425, 2.3233, 1.7481, 2.8753, 2.4431, 4.2995, 2.1842, 2.8236,
| 2.4499, 1.3821, 3.7060, 2.9607, 3.3910, 3.1599, 1.7046, 1.5531, 3.8703,
| 1.5109, 1.8871, 2.3786, 2.0469, 3.0021, 2.9218, 2.7644, 2.4703, 1.4400,
| 0.4638, 1.6315, 2.4686, 2.2353, 4.0935, 2.9652, 2.7879, 1.6652, 1.5567,
| 2.0876, 2.5679],
| .....)

The number *m* of prototype vectors is configured in the third cell of main.ipynb (Figure 4):

.. figure:: ./images/L3/figure04.png

Figure 4 Configuration of the number *m* of prototype vectors

**Line 23:** Calculate the cross-entropy loss for each batch, which is the first part of the optimization problem formula in Section 2.2 "Training algorithm" of the original paper (denoted CE here):

.. math::

   CE \doteq \frac{1}{n} \sum_{i=1}^{n} \mathrm{CrsEnt}\!\left(h \circ g_p \circ f(x_i), \, y_i\right)

The definition of the CrsEnt() function is:

.. math::

   \mathrm{CrsEnt}(\bar{o}_i, y_i) \doteq - \log \left( \frac{\exp(o_{i,y_i})}{\sum_{c=1}^{K} \exp(o_{i,c})} \right)

where

.. math::

   \bar{o}_i \doteq h \circ g_p \circ f(x_i)

The full definition of cross-entropy can be found in the `pytorch doc - torch.nn.functional.cross_entropy`_. Interested readers can try other options of this function, including class-specific weights and label smoothing. A short sketch comparing this built-in function with the CrsEnt formula is given below.
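As a concrete check (a minimal sketch, not the notebook's code), the CrsEnt formula above can be verified against PyTorch's built-in function using the example output and label values shown earlier:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    output = torch.tensor([[15.2223, 11.4460],
                           [11.9614, 12.5399],
                           [12.6952,  9.8602],
                           [13.5075, 11.0368],
                           [11.8025, 11.4291]])
    label = torch.tensor([0, 0, 0, 1, 0])

    # F.cross_entropy applies log-softmax to the logits and averages the
    # per-sample negative log-likelihoods, i.e. the CE formula above
    cross_entropy = F.cross_entropy(output, label)

    # The same value computed directly from the CrsEnt definition
    probs = torch.softmax(output, dim=1)
    manual = -torch.log(probs[torch.arange(5), label]).mean()
    assert torch.allclose(cross_entropy, manual)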
**Lines 25 to 29:** Calculate the cluster cost (Clst) in the original paper:

.. math::

   \mathrm{Clst} \doteq \frac{1}{n} \sum_{i=1}^{n} \min_{j: p_j \in P_{y_i}} \min_{\tilde{z} \in \mathrm{patches}(f(x_i))} \left\| \tilde{z} - p_j \right\|_2^2

As mentioned earlier, min_distance is already

.. math::

   \min_{\tilde{z} \in \mathrm{patches}(f(x))} \| \tilde{z} - p_j \|_2^2, \quad j = 0, 1, 2, \ldots, m - 1

so here we only need to single out the set :math:`\{\, j : p_j \in P_{y_i} \,\}`. This is done in line 27:

| correct_class_indicators = torch.t(ppnet.prototype_class_identity[:, target_class])

(here, torch.t is simply the transpose of a 2D Tensor).

In addition, the calculations in lines 25 to 29 use the min-max transform: for any set :math:`\{x_1, x_2, x_3, \ldots, x_n\}` and a sufficiently large constant :math:`C`,

.. math::

   \min(x_1, x_2, \ldots, x_n) = C - \max(C - x_1, C - x_2, \ldots, C - x_n)

A runnable sketch of the computations in lines 25 to 34 is given after Figure 6 below.

**Lines 32 to 34:** Calculate the separation cost (Sep) in the original paper (note that the sign of :math:`\lambda_2` here differs from the paper's; this is compensated later by the sign used when the losses are summed):

.. math::

   \mathrm{Sep} \doteq \frac{1}{n} \sum_{i=1}^{n} \min_{j: p_j \notin P_{y_i}} \min_{\tilde{z} \in \mathrm{patches}(f(x_i))} \| \tilde{z} - p_j \|_2^2

The calculation is similar to that of the cluster cost, except that it uses the complementary set

.. math::

   \{\, j : p_j \notin P_{y_i} \,\}

**Lines 37 and 38:** This is another way of calculating the separation cost, introduced by the authors of the original code (this variant is not mentioned in the paper):

.. math::

   \widetilde{\mathrm{Sep}} \doteq \frac{1}{n} \sum_{i=1}^{n} \underset{j: p_j \notin P_{y_i}}{\mathrm{mean}} \, \min_{\tilde{z} \in \mathrm{patches}(f(x_i))} \| \tilde{z} - p_j \|_2^2

In this experiment, the separation cost calculated this way is used only for log printing.

**Lines 40-41:** This is the calculation of the L1 cost of PPNet's fully connected layer. This cost is described under "Convex optimization of last layer" in Section 2.2 of the original paper as:

.. math::

   L1 \doteq \sum_{k=1}^{K} \sum_{j: p_j \notin P_k} \left| w_h^{(k,j)} \right|

**Lines 44-46:** This is the calculation of PPNet's classification accuracy. For each image, the category with the largest output logit is selected as the prediction, i.e.,

| _, predicted = torch.max(output.data, dim=1)

**Lines 48-52:** Collect the various losses/costs calculated during training.

**Lines 55-59:** Calculate the total loss and perform Stochastic Gradient Descent (SGD). The loss calculation follows the formula from the part of Section 2.2 of the original paper titled "Training algorithm":

.. math::

   \min_{\mathbf{P}, w_{conv}} CE + \lambda_1 \mathrm{Clst} + \lambda_2 \mathrm{Sep}

The configuration of :math:`\lambda_1, \lambda_2` in the sum is in the 12th cell of main.ipynb.

.. figure:: ./images/L3/figure05.png

Figure 5 Configuration of weight coefficients for loss calculation

After the loss is calculated, the SGD update steps of PyTorch are executed.

.. figure:: ./images/L3/figure06.png

Figure 6 SGD update steps

This SGD update only needs to be performed during the training phase; it is not performed during the testing phase. Sketches of the cost computations (lines 25 to 34) and of the loss-and-update steps (lines 55 to 59) follow.
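As promised above, here is a minimal, self-contained sketch of the cluster- and separation-cost computations of lines 25 to 34, using the min-max transform. The shapes (batch_size = 5, m = 128, K = 2) match the simplified version, but the stand-in tensors, the prototype-to-class assignment, and some variable names are hypothetical, not the notebook's exact code:

.. code-block:: python

    import torch

    batch_size, m, K = 5, 128, 2
    label = torch.tensor([0, 0, 0, 1, 0])   # plays the role of target_class

    # Stand-ins for the model output and the (m x K) prototype-to-class map;
    # here prototypes are assigned to classes alternately, for illustration only
    min_distances = 4.0 * torch.rand(batch_size, m)
    prototype_class_identity = torch.zeros(m, K)
    prototype_class_identity[torch.arange(m), torch.arange(m) % K] = 1.0

    max_dist = 512.0  # the constant C, chosen larger than any possible distance

    # line 27: 1 where prototype j belongs to image i's class, else 0; shape (batch, m)
    correct_class_indicators = torch.t(prototype_class_identity[:, label])

    # lines 25-29: cluster cost via min(x) = C - max(C - x), restricted by the mask
    inverted_distances, _ = torch.max(
        (max_dist - min_distances) * correct_class_indicators, dim=1)
    cluster_cost = torch.mean(max_dist - inverted_distances)

    # lines 32-34: separation cost, same trick over the complementary prototype set
    wrong_class_indicators = 1 - correct_class_indicators
    inverted_distances_wrong, _ = torch.max(
        (max_dist - min_distances) * wrong_class_indicators, dim=1)
    separation_cost = torch.mean(max_dist - inverted_distances_wrong)

    print(cluster_cost, separation_cost)

Multiplying by the 0/1 mask before taking the max is what restricts the inner minimum to the correct-class (or wrong-class) prototypes without any explicit indexing.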
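And a minimal sketch of the loss combination and SGD update of lines 55 to 59. The coefficient names and values in coefs are assumptions for illustration (the real values are configured in the 12th cell, Figure 5), and the stand-in costs merely make the example run on its own:

.. code-block:: python

    import torch

    # Stand-in parameter and costs so the example runs on its own
    w = torch.randn(2, 128, requires_grad=True)   # pretend fully connected weights
    cross_entropy = (w ** 2).mean()               # stand-ins for the real costs
    cluster_cost = (w ** 2).sum() * 0.01
    separation_cost = (w ** 2).sum() * 0.01
    l1 = w.abs().sum()

    # Assumed coefficients; note the negative sign on the separation term,
    # which compensates the sign difference mentioned for lines 32 to 34
    coefs = {'crs_ent': 1.0, 'clst': 0.8, 'sep': -0.08, 'l1': 1e-4}
    loss = (coefs['crs_ent'] * cross_entropy
            + coefs['clst'] * cluster_cost
            + coefs['sep'] * separation_cost
            + coefs['l1'] * l1)

    optimizer = torch.optim.SGD([w], lr=0.01)
    optimizer.zero_grad()   # clear gradients left over from the previous batch
    loss.backward()         # backpropagate through the combined loss
    optimizer.step()        # SGD parameter update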
Note: in the optimization problem :math:`\min_{\mathbf{P}, w_{conv}} CE + \lambda_1 \mathrm{Clst} + \lambda_2 \mathrm{Sep}`, the parameters to be optimized are :math:`\mathbf{P}` and :math:`w_{conv}`, but not :math:`w_h^{(\cdot,\cdot)}`. However, the loss actually calculated in line 56 of main.ipynb is:

.. math::

   CE + \lambda_1 \mathrm{Clst} + \lambda_2 \mathrm{Sep} + \lambda_3 L1

So how does it tell the optimizer to optimize :math:`\mathbf{P}` and :math:`w_{conv}` but not :math:`w_h^{(\cdot,\cdot)}`? This is controlled by calling the functions of the 10th cell of main.ipynb in the warm_only, joint, and last_only stages of training (as shown in Figure 7), which determine whether the gradients of the corresponding parameters are computed and the parameters updated.

.. figure:: ./images/L3/figure07.1.png

.. figure:: ./images/L3/figure07.2.png

.. figure:: ./images/L3/figure07.3.png

Figure 7 Configuration of *requires_grad* in the warm_only, joint and last_only stages

The following table lists the *requires_grad* configuration of each layer in the three stages (warm_only, joint, last_only):

+-----------------+-----------------------------+------------------+------------------------+
| requires_grad   | Convolutional layers        | Prototype layer  | Fully connected layer  |
|                 +------------+----------------+                  |                        |
|                 | VGG-19     | add_on layers  |                  |                        |
+=================+============+================+==================+========================+
| warm_only       | False      | True           | True             | True                   |
+-----------------+------------+----------------+------------------+------------------------+
| joint           | True       | True           | True             | False                  |
+-----------------+------------+----------------+------------------+------------------------+
| last_only       | False      | False          | False            | True                   |
+-----------------+------------+----------------+------------------+------------------------+

Therefore, in the joint stage the gradients of the fully connected layer's parameters (that is, :math:`w_h^{(\cdot,\cdot)}`) are not computed; they remain zero, and the values of :math:`w_h^{(\cdot,\cdot)}` stay fixed.

Summary
***********************************************************************************

This concludes the Python implementation guide for PPNet training, showcasing the original authors' ingenious programming techniques and configuration strategies. In the next lecture, we will explore another core algorithm from the original research: prototype projection.

.. _pytorch doc - torch.nn.functional.cross_entropy: https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html