This paper proposes meta-learning a label-free loss function within the MAML framework. You can see it as an extension, or a generalization, of Chelsea Finn's One-Shot Visual Imitation Learning via Meta-Learning.

In the inner loop, they first take N steps to adapt the base model F from \theta to \theta_N using data from the support set. They then introduce a critic network C and a target (query) set whose labels are held out during the inner loop. You can think of C as a loss function with learnable parameters W.

They take T further steps to update the base model with loss = C(F(x; \theta); W), reaching \theta_{N+T}. In the outer loop, they do have access to the target's labels, so the outer loss is L(F(x; \theta_{N+T}), y). Backpropagating through the whole procedure gives gradients for both \theta and W, improving the model and the loss function jointly.
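The bilevel structure can be sketched with a deliberately tiny example. This is my own minimal toy, not the paper's architecture: the base model is a scalar linear map F(x; \theta) = \theta x, the hypothetical critic loss is C(pred; w) = w * mean(pred^2) (label-free), and the outer loop tunes w by finite differences so that label-free inner adaptation reduces the supervised outer loss.

```python
import numpy as np

def inner_adapt(theta, x, w, lr=0.1, steps=3):
    """Label-free inner loop: T gradient steps on the critic loss."""
    for _ in range(steps):
        pred = theta * x
        # d/dtheta [ w * mean(pred^2) ] = w * mean(2 * pred * x)
        theta = theta - lr * w * np.mean(2.0 * pred * x)
    return theta

def outer_loss(theta, w, x, y):
    """Supervised loss, evaluated only after label-free adaptation."""
    theta_T = inner_adapt(theta, x, w)
    return np.mean((theta_T * x - y) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=32)
y = 0.5 * x               # true slope is 0.5
theta0 = 2.0              # deliberately mis-specified base model

# Meta-train w with a finite-difference gradient on the outer loss
# (the paper uses backprop through the inner loop; this is just simpler).
w, meta_lr, eps = 0.0, 0.5, 1e-4
for _ in range(200):
    g = (outer_loss(theta0, w + eps, x, y)
         - outer_loss(theta0, w - eps, x, y)) / (2 * eps)
    w -= meta_lr * g

before = outer_loss(theta0, 0.0, x, y)  # w = 0: adaptation does nothing
after = outer_loss(theta0, w, x, y)     # adapted with the learned critic
print(before, after)
```

The point is that w is trained purely through the outer (labeled) loss, yet at test time the inner loop needs no labels at all.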

They find that the best inputs for the critic network are the base model's predictions and a relational task embedding.
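One plausible way to assemble such a critic input, with shapes and aggregation entirely my own assumptions rather than the paper's exact design: compute a "relational" task embedding from pairwise comparisons of support-set features, then concatenate it onto each per-example prediction.

```python
import numpy as np

def relational_task_embedding(features):
    """Aggregate pairwise absolute differences between support features
    into a single vector summarizing the task (hypothetical design)."""
    n = features.shape[0]
    pairs = [np.abs(features[i] - features[j])
             for i in range(n) for j in range(n) if i != j]
    return np.mean(pairs, axis=0)

def critic_input(predictions, support_features):
    """Concatenate each prediction with the shared task embedding."""
    task_emb = relational_task_embedding(support_features)
    tiled = np.tile(task_emb, (predictions.shape[0], 1))
    return np.concatenate([predictions, tiled], axis=1)

preds = np.array([[0.7, 0.3], [0.4, 0.6]])            # softmax outputs, 2 classes
feats = np.random.default_rng(1).normal(size=(5, 4))  # 5 support embeddings
ci = critic_input(preds, feats)
print(ci.shape)  # (2, 6): 2 prediction dims + 4-dim task embedding
```

The appeal of a relational embedding is that it is permutation-invariant over the support set, so the critic sees the task's structure rather than any particular example ordering.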

## Critic Loss Analysis

The critic loss learns some interesting behavior. For example, given a bimodal class distribution whose two modes differ by about 10% in probability, the critic loss pushes the predicted probabilities of the two classes toward equality. This may improve generalization: intuitively, if you see two roughly equal classes, it may be prudent to assume they occur with the same frequency.
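As a toy illustration of this equalizing behavior (the paper does not say the critic computes entropy; this just shows one loss with that effect), gradient descent on the negative entropy of the predicted distribution pulls a 55/45 split toward 50/50:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([np.log(0.55), np.log(0.45)])  # starts at p = [0.55, 0.45]
lr = 0.5
for _ in range(100):
    p = softmax(logits)
    # gradient of negative entropy, sum(p log p), w.r.t. the logits
    g = p * (np.log(p) - np.sum(p * np.log(p)))
    logits -= lr * g

p = softmax(logits)
print(p)  # close to [0.5, 0.5]
```

Maximizing entropy is label-free, which is exactly the kind of signal an inner-loop critic can exploit when target labels are held out.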

## Experiments

In the experiments, they show that the critic loss works, achieving SOTA on Mini-ImageNet and CUB 1-shot and 5-shot classification against strong baselines. Their improved architecture alone adds 3-4 percentage points, and the critic loss adds another 2-4.

Overall, I like this paper for generalizing the label-free loss function idea. It's pragmatic to use all the data you have, and they do so in a clever way that allows label-free adaptation at test time.