A reading note on DeepEMD: Few-Shot Image Classification with Differentiable Earth Mover’s Distance and Structured Classifiers.
Problems
Deep neural networks achieve high performance when trained on large labelled datasets. In some circumstances, however, not enough labelled images are available. Few-shot image classification is a well-studied machine learning problem that targets this setting: with only a small amount of labelled data, few-shot algorithms must categorize new images.
Previous solutions
Two main streams in few-shot classification:
- Metric-based approaches
- Optimization-based approaches
Metric-based methods often employ a CNN to learn image feature representations, but replace the fully connected layer with a distance function, such as cosine distance or Euclidean distance. In this way, these methods bypass the difficult optimization problem of learning a classifier for few-shot images.
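To make the metric-based idea concrete, here is a minimal sketch of classifying a query by cosine similarity instead of a learned fully connected layer. The helper names, the toy 3-D "embeddings", and the use of a class-mean prototype are assumptions for illustration; in practice the vectors would come from a trained CNN.

```python
import numpy as np

def cosine_similarity(query, prototypes):
    """Cosine similarity between one query vector and each row of prototypes."""
    q = query / np.linalg.norm(query)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return p @ q

def classify(query, support, support_labels):
    """Assign the query to the class whose mean support embedding is most similar."""
    classes = np.unique(support_labels)
    # Assumption: each class is summarized by the mean of its support embeddings.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in classes]
    )
    return classes[np.argmax(cosine_similarity(query, prototypes))]

# Toy 2-way 2-shot episode with hand-made embeddings (not real CNN features).
support = np.array(
    [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
)
support_labels = np.array([0, 0, 1, 1])
query = np.array([0.2, 0.8, 0.0])
predicted = classify(query, support, support_labels)
```

No classifier weights are trained here: the only learned component would be the embedding network, which is why this family of methods sidesteps classifier optimization in the few-shot setting.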
Difficulties
Previous solutions have achieved promising results, but cluttered backgrounds and large intra-class appearance variations can lead these methods to perform unsatisfactorily. Under fully supervised training, the neural network can alleviate this problem thanks to its activation functions and abundant training images, but in low-data regimes the problem is almost inevitably amplified and thus hurts image classification. Moreover, a mixed global representation destroys image structure and loses local features. Local features can provide discriminative and transferable information across categories, which can be important for image classification in the few-shot scenario.
DeepEMD
Key features:
- Earth Mover’s Distance (EMD)
- cross-reference mechanism
- structured fully connected layer
A natural way to compare two complex structured representations is to compare their building blocks. The difficulty is that no correspondence supervision is available for training, and not every building element can find a counterpart in the other structure. In this paper, Zhang et al. formalize few-shot classification as an instance of optimal matching. They adopt the Earth Mover’s Distance (EMD) to compute the structural similarity between two images. EMD is a metric for computing the distance between structural representations, and was originally proposed for image retrieval. EMD yields the optimal matching flows with the minimum cost, and it can also be embedded into a network for end-to-end training.
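For reference, the optimal matching above can be written as the standard transportation problem. The symbols are chosen here for illustration and are not necessarily the paper's notation: $s_i$ and $d_j$ are the weights of the $m$ elements of the first structure and the $k$ elements of the second (with equal totals), $c_{ij}$ is the cost of matching element $i$ to element $j$, and $x_{ij}$ is the matching flow. One natural choice, assumed here, is a cost of one minus the cosine similarity between the two local feature vectors.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i=1}^{m}\sum_{j=1}^{k} c_{ij}\,x_{ij} \\
\text{s.t.}\quad & x_{ij} \ge 0, \qquad i = 1,\dots,m,\; j = 1,\dots,k, \\
& \sum_{j=1}^{k} x_{ij} = s_i, \qquad i = 1,\dots,m, \\
& \sum_{i=1}^{m} x_{ij} = d_j, \qquad j = 1,\dots,k.
\end{aligned}
```

The minimizer $x^{*}$ is the set of optimal matching flows, and the objective value at $x^{*}$ is the distance between the two structures.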
EMD as an optimization problem: EMD has the formulation of the transportation problem, so its global minimum can be found by solving a linear programming problem. To make the layer trainable end to end, the authors apply the implicit function theorem to form the Jacobian matrix of the optimal optimization variables with respect to the problem parameters.
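As a rough illustration of the linear programming view, the sketch below solves a tiny transportation problem with SciPy's general-purpose `linprog`. The function name `emd`, the toy costs, and the toy weights are assumptions for illustration. This is only the forward solve: it is not the solver used in the paper, and it does not implement the implicit-function-theorem gradient.

```python
import numpy as np
from scipy.optimize import linprog

def emd(cost, supply, demand):
    """Solve the transportation LP; return (minimum cost, optimal flow matrix)."""
    m, k = cost.shape
    # Flows are flattened row-major, so x[i * k + j] is the flow from i to j.
    # Row constraints: total flow leaving node i equals supply[i].
    row_constraints = np.kron(np.eye(m), np.ones((1, k)))
    # Column constraints: total flow entering node j equals demand[j].
    col_constraints = np.kron(np.ones((1, m)), np.eye(k))
    result = linprog(
        c=cost.ravel(),
        A_eq=np.vstack([row_constraints, col_constraints]),
        b_eq=np.concatenate([supply, demand]),
        bounds=(0, None),  # flows must be non-negative
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.fun, result.x.reshape(m, k)

# Toy example: two nodes per structure, zero cost for matching like with like.
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
supply = np.array([0.7, 0.3])  # weights of the first structure
demand = np.array([0.5, 0.5])  # weights of the second structure
total_cost, flow = emd(cost, supply, demand)
```

For this toy input the first node can send only 0.5 of its weight to its zero-cost counterpart, so the remaining 0.2 has to travel at unit cost, giving a distance of 0.2.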
An important problem parameter in the EMD formulation is the weight of each element. Elements with large weights generate more matching flows and thus contribute more to the overall distance, while irrelevant regions should receive less weight. To determine which elements are important, the authors propose a cross-reference mechanism. In this mechanism, the weight of each node is determined by comparing it with the global statistics of the other structure. This aims to give less weight to high-variance background regions and to object parts that do not co-occur in the two images. A structured fully connected layer is proposed as the classifier, to make use of the increasing number of training images. The structured fully connected layer is an extension of the standard fully connected layer that replaces dot product operations between vectors with the EMD function between vector sets. As a result, the structured FC layer can directly classify feature maps.
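The cross-reference weighting described above can be sketched as follows. This is one plausible reading, not necessarily the paper's exact formula: each node is scored by its dot product with the mean vector of the other structure, negative scores are clipped to zero, and the scores are normalized to sum to one. The function name, the normalization, and the uniform fallback are assumptions.

```python
import numpy as np

def cross_reference_weights(nodes, other_nodes):
    """Weight each node by its clipped dot product with the other set's mean."""
    # Global statistic of the other structure: its mean local feature vector.
    other_mean = other_nodes.mean(axis=0)
    # Nodes that disagree with the other structure get zero weight.
    raw = np.maximum(nodes @ other_mean, 0.0)
    total = raw.sum()
    if total == 0.0:
        # Assumption: fall back to uniform weights if no node agrees at all.
        return np.full(len(nodes), 1.0 / len(nodes))
    return raw / total

# Toy local features: three nodes in one image, two in the other.
u = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
v = np.array([[1.0, 0.0], [0.8, 0.2]])
weights_u = cross_reference_weights(u, v)
weights_v = cross_reference_weights(v, u)
```

In this toy case the third node of `u` points away from everything in `v`, so it receives zero weight and generates no matching flow, which is the intended effect for background regions.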