Dogs vs Cats – Conclusion

As a final post, I’m going to summarize what I’ve done for this Dogs vs Cats project. Let’s do it chronologically.

Blog post 1: First results reaching 87% accuracy

To begin this project, I started with a relatively small network whose architecture was inspired by the classic AlexNet: a stack of convolutional and 2×2 max-pooling layers, followed by an MLP classifier. The size of the convolutional filters decreased as we go deeper into the network: 9×9, 7×7, 5×5, 3×3… ReLUs for each hidden layer and a Softmax for the output layer. This network took 100×100 grayscale images as inputs. The post also briefly presents my data augmentation pipeline (rotation + cropping + flipping + resizing), which never changed afterwards. This data augmentation allowed the network to reach an accuracy of 87% (without data augmentation, the same network reached 78%).
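As an illustration, here is a minimal sketch of what such a rotation + cropping + flipping + resizing pipeline can look like (written with PIL; the rotation range, crop ratio and output size are placeholder values, not the exact settings I used):

```python
import random
from PIL import Image

def augment(img, out_size=100):
    # Small random rotation (placeholder angle range).
    img = img.rotate(random.uniform(-15, 15))
    # Random crop keeping ~90% of the image (placeholder ratio).
    w, h = img.size
    cw, ch = int(0.9 * w), int(0.9 * h)
    x0, y0 = random.randint(0, w - cw), random.randint(0, h - ch)
    img = img.crop((x0, y0, x0 + cw, y0 + ch))
    # Random horizontal flip.
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    # Resize to the network input size and convert to grayscale.
    return img.convert("L").resize((out_size, out_size))
```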

Blog post 2: Adopting the VGGNet approach: more layers, smaller filters

The second post starts with a review of the VGGNet paper. The VGGNet approach consists in significantly increasing the depth while using only 3×3 convolution filters. In my post, I compare two models with a comparable number of parameters: one network with an architecture inspired by the VGGNet paper, and another one similar to the network from my first post. The VGGNet-like network won the contest easily: 91% vs 86%. In this post, I also started using bigger input images: 150×150, a size I kept until the end.
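To make the "more layers, smaller filters" idea concrete, here is a small sketch of a VGG-style block: several 3×3 convolutions followed by a 2×2 max pooling, with depth obtained by stacking such blocks. This is illustrative PyTorch code with placeholder widths, not the actual models compared in the post.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs=2):
    # Several 3x3 convolutions (ReLU after each) followed by 2x2 max pooling.
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2))
    return nn.Sequential(*layers)

# Depth comes from stacking such blocks, e.g. 32 -> 64 -> 128 channels (placeholder widths).
features = nn.Sequential(vgg_block(1, 32), vgg_block(32, 64), vgg_block(64, 128))
```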

Note: all the testset scores reported in blog posts 2 to 7 correspond to scores obtained on 2500 images held out from the trainset (at that time, I was not aware that it was still possible to make a Kaggle submission after the competition’s end).

Blog post 3: Regularization experiments – 7.4% error rate

In this post, I present some regularization experiments. Dropout on the fully connected layers allowed me to improve the accuracy of the network presented in the previous post: 92.6% (+1.1%). I also tried weight decay, but it did not help.
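For reference, "dropout on fully connected layers" simply means a classifier head like the following (illustrative PyTorch sketch with placeholder layer sizes, not my actual model):

```python
import torch.nn as nn

# Dropout applied between the fully connected layers only (placeholder sizes).
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(128 * 18 * 18, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(256, 2),  # two classes: dog / cat
)
```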

Blog post 4: Adding Batch Normalization – 5.1% error rate

This post starts with a review of the Batch Normalization paper. Batch Normalization is a technique designed to accelerate the training of neural networks by reducing a phenomenon called “Internal Covariate Shift”. It allows the use of higher learning rates and also acts as a regularizer. After this paper review, I present my experiments with Batch Normalization added between each layer of my previous network. I verified some claims of the paper: (a) BN accelerates training: ×4 in the best scenario. (b) BN allows higher learning rates: I managed to train a network with a learning rate of 0.2, while the same model without BN could not handle learning rates higher than 0.01. (c) BN acts as a regularizer and possibly makes dropout useless: in my case, dropout indeed turned out to be useless when using BN. (d) BN leads to higher performance: with BN, the network reached an accuracy of 94.9% (+1% compared with the version without BN).
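In practice, adding Batch Normalization "between each layer" means inserting a BN operation after each convolution (or fully connected layer) and before the non-linearity, as in this illustrative sketch (PyTorch syntax, not my original code):

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # Convolution -> Batch Normalization -> ReLU.
    # The convolution bias is dropped since BN's shift parameter makes it redundant.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```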

Blog post 5: What’s going on inside my CNN?

In this post, I decided to spend some time evaluating my best model so far. I started with an analysis of misclassified images. Then, I showed the feature maps obtained when applying the model to a single example. Finally, I spent some time describing what happens, in terms of activity distributions, when a Batch Normalization layer is added.

Blog post 6: A scale-invariant approach – 2.5% error rate

This post presents the kind of architecture that allowed me to reach really nice results: >97%. This architecture is inspired by the paper Is object localization for free? by M. Oquab et al., and is based on the use of a Global Maxpooling Layer. In this post, I show that this architecture makes the network more scale-invariant, and that the Global Maxpooling Layer absorbs a kind of detection operation. I also showed that the network gives more importance to the head of the pet. Finally, I present my new testing approach, which averages 12 predictions per image (varying the input size, the model, and whether the image is flipped).
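The core of this architecture, in the spirit of Oquab et al., is a fully convolutional network that outputs one score map per class and then takes the maximum response over all spatial locations. Here is a minimal sketch of that final stage (illustrative PyTorch code, not the exact model from the post):

```python
import torch.nn as nn

class GlobalMaxPoolHead(nn.Module):
    # Produces one score map per class, then keeps the strongest response
    # wherever it appears in the image: a kind of built-in detection step.
    def __init__(self, in_ch, n_classes=2):
        super().__init__()
        self.score_maps = nn.Conv2d(in_ch, n_classes, kernel_size=1)

    def forward(self, feats):              # feats: (N, C, H, W), any H and W
        maps = self.score_maps(feats)      # (N, n_classes, H, W)
        return maps.amax(dim=(2, 3))       # global max pooling -> (N, n_classes)
```

Because nothing in this head depends on the spatial size of the feature maps, the network can process images of different sizes, which is what makes it more scale-invariant.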

Blog post 7: Kaggle submission – Models fusion attempt – Adversarial examples

I started this post by showing the Kaggle testset score (97.51%) obtained by the ensemble of 2 networks presented in the previous post. This score corresponds to my first leaderboard entry. Then, I presented a model fusion attempt: the idea was to concatenate the features output by the Global Maxpooling Layer of both models and feed them to a single MLP. This idea did not lead to higher performance. Finally, I played a bit with adversarial examples (see this paper). In the post, I said that I was working on a way to make my models “robust to adversarial examples by continuously generating adversarial dogs/cats during the training”. It turns out that I never reported my results on that, so let’s do it now: well… it was a failure… During training, the network saw one batch of “normal” examples, then one batch containing the corresponding “adversarial examples”. But this approach made training slower, and in the end, the validationset accuracy was significantly lower (around -1%). However, I did not play much with the hyper-parameters for this experiment, so maybe there is a way to make it work.
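For context, one standard way to generate such adversarial examples is the Fast Gradient Sign Method of Goodfellow et al.; the sketch below (illustrative PyTorch code with a placeholder epsilon, not necessarily the exact method I used) shows how the adversarial batches alternated with the normal ones can be produced:

```python
import torch

def fgsm_batch(model, loss_fn, x, y, eps=0.01):
    # Perturb each input in the direction that increases the loss (placeholder eps).
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

# The training loop then alternates: one update on a normal batch,
# one update on the corresponding adversarial batch.
```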

Blog post 8: Parametric ReLU – 2.1% error rate

In my final post, I describe my experiments with replacing ReLUs by Parametric ReLUs. It slightly improved the Kaggle testset score: 97.9% (+0.4%). This score corresponds to my second leaderboard entry. Then, as classmates asked me to clarify some points about my testing approach, I came back to it. Even if computationally costly, this multi-scale testing approach is one of the main reasons why I got really good results.

Well… playing with these dogs and cats was fun. If I had more time, I would have enjoyed trying to implement Spatial Transformer Networks, or experimenting with Residual Neural Networks. Maybe next time…

kitty.png
The end.

Dogs vs Cats – Parametric ReLU – 2.1% error rate

Parametric ReLUs

Over the last few days, I’ve been experimenting with an activation function called the Parametric ReLU (PReLU). Like the Leaky ReLU, this activation function does not saturate for negative inputs. See Figure 1.

PReLU
Figure 1: (left) ReLU. (right) Parametric ReLU. Figure taken from [1].

PReLUs were introduced in [1], where they contributed to surpassing human-level performance on the ImageNet classification task in 2015. The activation function consists of the following operation:

f(y) = max(0, y) + a × min(0, y)

The motivation behind PReLUs is to avoid zero gradients for negative inputs. What makes them different from the Leaky ReLU is that a is a learnable parameter. When a = 0, a PReLU becomes a ReLU; the PReLU is therefore a generalization of the ReLU. In terms of complexity, it only adds 1 parameter per channel.
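In code, the operation is straightforward; here is a tiny NumPy sketch of the forward pass and of the gradient used to learn a:

```python
import numpy as np

def prelu(y, a):
    # Identity for y > 0, slope a for y < 0; a = 0 recovers the plain ReLU.
    return np.maximum(0.0, y) + a * np.minimum(0.0, y)

def prelu_grad_a(y):
    # df/da = min(0, y): the slope only receives a signal from negative inputs.
    return np.minimum(0.0, y)
```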

Experiments

I used exactly the same models as the ones presented in this post (the “Window46” and “Window100” models). As done in [1], each PReLU parameter (noted a above) was initialized to 0.25. Learning curves for each model are displayed in Figures 2 and 3.

win46_curves.png
Figure 2: Learning curves for the “Window46” model with PReLUs. The final validset score is 95.9%.
win100_curves.png
Figure 3: Learning curves for the “Window100” model with PReLUs. The final validset score is 96.1%.

Testset scores are discussed in the final section.

What has been learned for a?

Let’s take a look at the learned values for a. See Figures 4 and 5.

alpha_values
Figure 4: a distribution for each layer of Window46.
alpha_values_win100
Figure 5: a distribution for each layer of Window100.

Interestingly, for the first layers, the network learned varied and sometimes large a values (from -2.0 to 5.0 for layer 2 of the Window46 model). Deeper layers have smaller a values on average. Moreover, for both models, the a distribution of the layer just before the Global Maxpooling operation (layer 7 / layer 8) is sharply peaked around 0.25: the a values were barely updated during training for these layers. I think this can be explained by the “winner-take-all” phenomenon introduced by the Global Maxpooling Layer, which probably makes the gradients with respect to a small at this stage (much smaller than for the other layers).
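A tiny experiment illustrates this winner-take-all effect: with a global max over each feature map, only the winning activation per map receives a gradient, so most units (and the a slope acting on their negative pre-activations) get very little signal. Illustrative PyTorch snippet:

```python
import torch

feats = torch.randn(1, 3, 4, 4, requires_grad=True)
pooled = feats.amax(dim=(2, 3))          # global max pooling over each of the 3 maps
pooled.sum().backward()
print((feats.grad != 0).sum().item())    # -> 3: a single non-zero gradient per map
```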

Testset scores

As said in my previous posts, my best testset scores were obtained by averaging many predictions, varying the image size, the model, and whether the image is flipped. This approach is computationally expensive (12 predictions per image, 6 for each model) but it significantly improves the testset accuracy. When testing only on raw images, I got validset accuracies not far from the ones reported on the leaderboard by my classmates (95.9% for Window46 and 96.1% for Window100). Taking the flipped version into account improves the score by 0.5% on average. Averaging the predictions obtained on 3 different image sizes (150×150, 210×210 and 270×270 instead of only 150×150) generally improves the score by more than 1%. Finally, averaging the predictions given by both models slightly improves the score further (around 0.3% more).
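Schematically, the test-time procedure can be written as follows (illustrative PyTorch sketch: resizing is done here with bilinear interpolation, and each model is assumed to handle the pooling adjustment for the given input size internally):

```python
import torch
import torch.nn.functional as F

def predict_averaged(models, image, sizes=(150, 210, 270)):
    # Average class probabilities over 2 models x 3 sizes x 2 flips = 12 predictions.
    # `image` is a (1, C, H, W) tensor.
    preds = []
    for model in models:
        for size in sizes:
            x = F.interpolate(image, size=(size, size), mode="bilinear", align_corners=False)
            for flip in (False, True):
                xi = torch.flip(x, dims=[3]) if flip else x
                preds.append(torch.softmax(model(xi), dim=1))
    return torch.stack(preds).mean(dim=0)
```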

As can be seen in Figure 6, this costly testing approach allows reaching a decent testset accuracy: 97.91%. This score would have been a top-10 result 3 years ago. Of course, I’m using techniques that had not been introduced in 2013 (PReLU, Batch Normalization), but my models are quite small (only 0.6 and 1.7 million parameters) and we are more constrained than the competitors were, since we are not allowed to use external data. I’m pretty satisfied with this result.

score_ensemble
Figure 6: Testset score obtained by my ensemble of models.

As I’ve been asked by some people this week, let me clarify a point: my models have not been trained in a multi-scale fashion, as Olexa did for instance. During training, the models have always seen 150×150 images. Nevertheless, my data augmentation pipeline involves a random cropping operation which makes the same pet appear at a different scale each time it is processed. Moreover, as described in this post, I think that the chosen architecture is somewhat scale-invariant and more robust to small pets appearing at random locations. When testing on a bigger image (210×210 or 270×270 for instance), I have to manually adjust the poolsize of the Global Maxpooling layer (see this post for more details about this layer). As said above, averaging predictions over multiple sizes significantly improves the score. For instance, Figure 7 shows the testset score obtained by the Window100 model when averaging 3 predictions per image, corresponding to 3 different image sizes. It already works very well.
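To make the last point concrete: since my “global” max pooling is implemented as a regular pooling layer with a fixed window, the pool size has to be enlarged so that it still covers the whole feature map when the test image is bigger. The feature-map sizes below are purely hypothetical, just to illustrate the adjustment (PyTorch syntax):

```python
import torch.nn as nn

# Hypothetical sizes: suppose a 150x150 input yields 9x9 feature maps
# and a 270x270 input yields 17x17 ones.
pool_at_150 = nn.MaxPool2d(kernel_size=9)    # training-time setting
pool_at_270 = nn.MaxPool2d(kernel_size=17)   # manually adjusted for 270x270 test images

# An adaptive pooling layer (e.g. nn.AdaptiveMaxPool2d(1)) would remove this manual step.
```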

score_win100.png
Figure 7: Window100 model – Averaging 3 predictions per example: one for each image size (150×150, 210×210, and 270×270). When testing only on 150×150 images, this score is around 96.1%.

Reference:

[1] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification – He et al., 2015