HiWord Reading Lesson

A Scalable Approach to Multi-Modal Self-Supervised Training


May 17, 2026

This is basically a scalable approach to training multi-modal models. It uses self-supervised learning, so you don't need any other models to train it. We're going to try to go into more detail, but the main idea is to combine representation learning and generation in the same process.

If you look on the left, you can see videos, images, or other modalities. When you usually train, you start from an input with random noise added, so you might not be able to tell what it is at first, and then you align it with the encoder. How do we do this? We actually add two different kinds of noise, both random but different from each other.

The first kind adds a lot of noise to the asset; this is what you see at the top. The second kind adds only a small amount of noise, shown at the bottom. The idea is that we have two models working together: the student model and the teacher model.
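The two noise levels described above can be sketched in a few lines. This is only an illustration under stated assumptions: the linear blend with Gaussian noise, the levels 0.9 and 0.1, and the names `add_noise` and `make_views` are all hypothetical and do not come from the talk.

```python
import random

def add_noise(x, level, rng):
    """Blend a clean sample with Gaussian noise.

    level=0.0 returns the sample unchanged; level=1.0 returns pure noise.
    The linear blend is an assumption made for this sketch.
    """
    return [(1.0 - level) * v + level * rng.gauss(0.0, 1.0) for v in x]

def make_views(x, rng, high=0.9, low=0.1):
    """Return (heavily noised view, lightly noised view) of one sample.

    Each view draws its own noise, so the two are both random
    but different from each other. The levels are placeholder values.
    """
    heavy_view = add_noise(x, high, rng)  # "top" branch: a lot of noise
    light_view = add_noise(x, low, rng)   # "bottom" branch: little noise
    return heavy_view, light_view

rng = random.Random(0)
clean = [1.0, -0.5, 0.25, 2.0]  # stand-in for an image or video tensor
heavy_view, light_view = make_views(clean, rng)
```

Both views come from the same clean sample, which is what lets the two models later be compared on the same underlying content.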

The student model always gets the images [with the most noise], while the teacher model, which is basically a [momentum] version of the student, always gets the images with little noise. The student is trying to learn two things at the same time: minimizing the generation loss and the representation loss.
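The student's two objectives can be sketched roughly as below. Several things here are assumptions rather than details from the talk: that the teacher is an exponential moving average of the student's weights (a common choice in self-distillation), that both losses are mean squared errors, and that they are combined with a simple weight. The names `combined_loss`, `ema_update`, `rep_weight`, and `decay` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(student_output, clean_target,
                  student_features, teacher_features, rep_weight=1.0):
    """Generation loss plus a weighted representation loss.

    generation: the student's output (from the heavily noised input)
        should match the clean sample.
    representation: the student's features should match the teacher's
        features (computed from the lightly noised input).
    """
    generation = mse(student_output, clean_target)
    representation = mse(student_features, teacher_features)
    return generation + rep_weight * representation

def ema_update(teacher_params, student_params, decay=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

# Toy numbers only; real inputs would be model outputs and weights.
loss = combined_loss([0.9, 0.1], [1.0, 0.0], [0.5, 0.5], [0.5, 0.7])
teacher = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9)
```

In a setup like this, only the student would be trained by gradient descent on the combined loss; the teacher would change only through the averaging step, so no external encoder is involved.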

This is how you can actually work across different modalities. With this approach, you only have one model, and you don't need anything external. If you [scale] your model, you improve both your student and teacher models without worrying about an external encoder. This is what we're working on right now. We're currently using different models for training, and we believe this is where things are heading. You also get to use all the decoders that we have.
