They use self-supervised learning, so you don't need any other models to train it.
We combine representation learning and generation in the same process.
In the usual training setup, you add some random noise.
We actually add two kinds of noise: both are random, but they differ from each other.
The student model always gets the images with the most noise.
The teacher model always gets the ones with only a little noise.
The student is trying to minimize both the generation loss and the representation loss.
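
As a rough illustration of the idea described here, the sketch below computes a combined objective for a toy student/teacher pair. It is not the speakers' implementation. Everything in it is an assumption made for illustration: the function names, the elementwise "network", the linear noise blend, the choice to give the student the noisier input, the `rep_weight` factor, and the teacher being an exponential moving average (EMA) of the student.

```python
import random


def add_noise(x, level, rng):
    """Blend a clean vector with Gaussian noise (assumed scheme); level 1.0 is pure noise."""
    return [(1.0 - level) * xi + level * rng.gauss(0.0, 1.0) for xi in x]


def features(weights, x):
    """Hypothetical stand-in for a network: an elementwise scaling of the input."""
    return [w * xi for w, xi in zip(weights, x)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def combined_loss(student_w, teacher_w, x, rng, rep_weight=0.5):
    """Return (total, generation_loss, representation_loss) for one sample."""
    # Draw two independent random noise levels.
    # Assumption: the larger one goes to the student, the smaller to the teacher.
    t1, t2 = rng.random(), rng.random()
    student_in = add_noise(x, max(t1, t2), rng)
    teacher_in = add_noise(x, min(t1, t2), rng)

    student_out = features(student_w, student_in)
    teacher_out = features(teacher_w, teacher_in)  # treated as a fixed target

    gen_loss = mse(student_out, x)            # generation: recover the clean input
    rep_loss = mse(student_out, teacher_out)  # representation: match the teacher
    return gen_loss + rep_weight * rep_loss, gen_loss, rep_loss


def ema_update(teacher_w, student_w, decay=0.99):
    """Assumed teacher update: move the teacher slowly toward the student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher_w, student_w)]


if __name__ == "__main__":
    rng = random.Random(0)
    x = [1.0, -2.0, 0.5]
    student_w = [0.8, 0.8, 0.8]
    teacher_w = list(student_w)  # the teacher starts as a copy of the student
    total, gen, rep = combined_loss(student_w, teacher_w, x, rng)
    print("total:", total, "generation:", gen, "representation:", rep)
    teacher_w = ema_update(teacher_w, student_w)
```

In this sketch the teacher has no weights of its own to train, since it is derived from the student. That is one way to read the claim that only one model is needed and nothing external is required.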
With this approach, you only have one model, and you don't need anything external.
We believe this is where the future is heading.