The recent Noisy Student paper improved the state-of-the-art ImageNet top-1 accuracy by almost 2%. On the face of it, it is an easy paper to read and understand. Here are my takeaways from it.
They use the labelled ImageNet training set (about 1.3M images from ILSVRC 2012, rather than the full 14M-image ImageNet) for supervised learning and an additional 300M unlabelled images for generating pseudo labels. The unlabelled images come from a similar, though not identical, distribution, so they had to prune low-confidence images and balance the number of images per class.
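A rough sketch of what pruning and balancing pseudo-labelled data could look like. The function name `filter_and_balance`, the confidence threshold, and the per-class count are illustrative assumptions, not the paper's settings: images whose top teacher probability is too low are dropped, each class keeps its most confident images, and classes with too few images are padded by duplication.

```python
import random
from collections import defaultdict

def filter_and_balance(predictions, min_confidence=0.5, per_class=2, seed=0):
    """predictions: list of (image_id, class_probs) produced by the teacher.

    Returns {class_index: [image_id, ...]} with exactly `per_class` ids for
    every class that has at least one confident image.
    Threshold and per_class are illustrative values (assumptions).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, probs in predictions:
        label = max(range(len(probs)), key=lambda i: probs[i])
        confidence = probs[label]
        if confidence >= min_confidence:  # prune low-confidence images
            by_class[label].append((confidence, image_id))

    balanced = {}
    for label, items in by_class.items():
        items.sort(key=lambda pair: pair[0], reverse=True)  # most confident first
        kept = [image_id for _, image_id in items[:per_class]]
        while len(kept) < per_class:  # pad rare classes by duplicating images
            kept.append(rng.choice(kept))
        balanced[label] = kept
    return balanced
```

For an object detection use case with 2-3 classes, the same idea would apply per detected class, though how to define "confidence" for a whole image is a separate design choice.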
The algorithm is iterative. It uses a variation of the classic teacher-student architecture, but the student is purposely kept equal to or larger than the teacher (in number of parameters). The goal is noise robustness and higher accuracy, as opposed to the knowledge distillation objective of classic architectures, where the teacher is compressed into a smaller student.
The simplified algorithm is:
- Train the teacher model on the labelled dataset using supervised learning with cross-entropy loss
- Use the trained teacher, without noise, to generate soft or hard pseudo labels for the unlabelled images. Their ablation study shows that soft labels, where the output is a probability distribution rather than a one-hot encoding, are preferred.
- Train an equal-size or larger student model:
  - learn from pseudo-labelled and labelled data together
  - augment the data to add input noise
  - add noise to the model using the stochastic depth method (need to see details)
- Iterate: the student now becomes the new teacher
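The steps above can be sketched as a plain control-flow loop. This is only a sketch under assumptions: `train_fn` and `predict_fn` are hypothetical callables standing in for real training and noise-free inference code, `sizes` is a hypothetical list of model sizes (first entry for the initial teacher), and all noise (augmentation, stochastic depth) is assumed to live inside `train_fn`.

```python
def to_hard(probs):
    """Convert a probability distribution into a one-hot (hard) label."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def noisy_student(labelled, unlabelled, train_fn, predict_fn, sizes, soft=True):
    """labelled: list of (input, target_distribution) pairs.
    unlabelled: list of inputs.
    train_fn(size, data) -> model   (hypothetical; applies noise while training)
    predict_fn(model, x) -> probs   (hypothetical; inference WITHOUT noise)
    sizes: model sizes per round, equal or increasing (assumption).
    """
    if not sizes:
        raise ValueError("need at least one model size for the teacher")

    # Step 1: train the first teacher on labelled data only.
    teacher = train_fn(sizes[0], list(labelled))

    for size in sizes[1:]:
        # Step 2: the un-noised teacher labels the unlabelled images.
        pseudo = []
        for x in unlabelled:
            probs = predict_fn(teacher, x)
            pseudo.append((x, list(probs) if soft else to_hard(probs)))

        # Step 3: train an equal-size or larger student on both datasets.
        student = train_fn(size, list(labelled) + pseudo)

        # Step 4: the student becomes the teacher for the next round.
        teacher = student
    return teacher
```

Passing the training and prediction code in as functions keeps the loop independent of any framework; the `soft` flag switches between soft and hard pseudo labels.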
By disabling each modification in turn, the authors measured its impact on final accuracy:
- noise is important (data augmentation + stochastic depth in the model)
- multiple iterations help improve accuracy
- a larger teacher model is better
- a large amount of unlabelled data is needed
- soft pseudo labels are recommended
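Since stochastic depth is the model noise mentioned above, here is a minimal sketch of the general technique on a stack of residual blocks. It is not the paper's implementation: blocks are modelled as plain functions on a number, the names `survival_probs` and `forward` are made up for illustration, and the final survival probability of 0.8 is an example value. During training each residual branch is randomly skipped; at inference every branch is kept but scaled by its survival probability.

```python
import random

def survival_probs(num_blocks, final_survival=0.8):
    """Linear decay rule: early blocks almost always survive,
    the last block survives with probability `final_survival`."""
    return [1.0 - (i / num_blocks) * (1.0 - final_survival)
            for i in range(1, num_blocks + 1)]

def forward(x, blocks, probs, training, rng=random):
    """Run x through residual blocks with stochastic depth.

    blocks: list of functions f(x) computing the residual branch.
    probs:  survival probability for each block.
    """
    for block, p in zip(blocks, probs):
        if training:
            if rng.random() < p:
                x = x + block(x)  # block survives this pass
            # otherwise the block is skipped (identity connection only)
        else:
            x = x + p * block(x)  # inference: scale by survival probability
    return x
```

Because whole blocks are dropped, the student effectively trains an ensemble of shallower networks, which is one intuition for why this kind of noise helps.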
I would like to try this method on object detection (specific tasks, not COCO) to see how self-training performs. Since I'll be using it on 2-3 classes at most, I hope a few thousand unlabelled samples will be enough.
Another thing to explore is the effect of the teacher/student approach on transfer learning, since most application-specific models are fine-tuned from ImageNet-pretrained models.