This is basically a scalable approach to training multi-modal models. They use self-supervised learning, so you don't need any other models to train it. We're going to try to go into more details, but the main idea is to combine representation learning and generation in the same process.
If you look on the left, you can see videos, images, or different modalities. When you usually train, you often have some noise or random noise, and you might not know what it is at first. Then, you align it with the encoder. How do we do this? We actually add two different kinds of noise that are both random but different from each other.
The first kind of noise we add is a lot of noise to the asset—this is what you see at the top. The second kind is a low amount of noise, shown at the bottom. The idea is that we have two models working together: the student model and the teacher model.
The student model is always getting the images for most languages, while the teacher model, which is basically a multiple version of the student, always gets the little noises in the system. The student is trying to learn two things at the same time—minimizing the loss for generation and the loss in representation.
This is how you can actually work across different modalities. With this approach, you only have one model, and you don't need anything external. If you really like your model, you can improve both your student and teacher models without worrying about an external encoder. This is what we're working on right now. We're currently using different models for training, and we believe this is where the future is heading. You also get to use all the decoders that we have.