From Contrastive Learning to World Models
Yann LeCun's AMI Labs reportedly raising ~$1bn in its seed round, with one of its key bases in Singapore, just made the news. $1bn for a seed round is kind of insane. AI startup seed rounds are more like Series Zs these days.
That aside, the news brought back memories of an older technique from LeCun's lab, called VICReg, that inspired one of my papers. And it led me to reread his JEPA series.
Today's diptych is about that connection.
Left Panel: VICReg
Self-supervised training in AI just means training without labels. We use it in LLMs: pretrain a model by masking parts of the input and getting it to predict the masked parts. One issue in self-supervised learning is collapse. AI likes to find shortcuts, and when it does, every representation converges to the mean.
Most methods fight this with training or architectural tricks. VICReg uses just three interesting loss terms:
➡️ Variance so that there's always some spread in the representations it learns.
➡️ Invariance so that two views of the same input give the same representation.
➡️ Covariance so that each dimension learns something different.
Invariance is the main learning objective. Variance + covariance are the regularisers that prevent it from collapsing.
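The three terms are simple enough to sketch in plain NumPy. A minimal illustration, assuming embeddings of shape (batch, dim); the default weights and the variance target gamma follow the VICReg paper, but the function and argument names are mine, and the real method applies this to the outputs of an expander network:

```python
import numpy as np

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """Sketch of the three VICReg terms for two (batch, dim) embedding batches."""
    n, d = z1.shape

    # Invariance: paired embeddings of the two views should match.
    inv = np.mean((z1 - z2) ** 2)

    # Variance: hinge that keeps each dimension's std above gamma,
    # so the representations never flatten to a single point.
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    # Covariance: push off-diagonal covariance entries towards zero,
    # so each dimension carries different information.
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    return (sim_w * inv
            + var_w * (var_term(z1) + var_term(z2))
            + cov_w * (cov_term(z1) + cov_term(z2)))
```

A fully collapsed batch (all embeddings identical) scores zero on invariance but gets punished hard by the variance hinge, which is exactly the point.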
I liked VICReg's idea, so I added a twist for a paper I published on a model called DynMIX. I fed it two fundamentally different views of the same company: explicit (constructed) and implicit (learnt from data). Another twist: VICReg's variance target is static, and for financial data that made no sense. Markets are volatile, so the target can't be static. So I made it dynamic, based on the variance of stock market returns.
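The dynamic-target twist can be sketched like this. This is an illustration of the idea, not the exact loss from the DynMIX paper; the window length and the function name are hypothetical:

```python
import numpy as np

def dynamic_variance_hinge(z, returns, window=20, eps=1e-4):
    """Variance hinge whose target tracks recent market volatility (illustrative)."""
    # Dynamic target: the std of the last `window` returns replaces VICReg's
    # fixed gamma, so the regulariser demands more spread in the embeddings
    # when markets are volatile.
    gamma_t = returns[-window:].std()
    std = np.sqrt(z.var(axis=0) + eps)   # per-dimension std of the embeddings
    return np.mean(np.maximum(0.0, gamma_t - std))
```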
Right Panel: JEPA
JEPA, the core of Yann LeCun's position paper, takes a different approach to the same problem. Instead of just learning representations, it predicts them. In AI, we usually predict something real: a label (dog, cat) or a number (stock price). JEPA predicts the underlying representation instead. Think of it as the model learning to imagine what something looks like in its own internal language, before seeing it in the real world.
Three components:
➡️ Context encoder: encodes what's visible into a representation
➡️ Target encoder: encodes what's hidden into a representation (this IS the label)
➡️ Predictor: tries to predict the hidden from the visible
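The data flow of those three components can be sketched with random linear maps standing in for the deep encoders. A toy, not the actual architecture: in I-JEPA the target encoder is an exponential moving average of the context encoder, which I approximate here with a plain copy, and all the names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_REP = 16, 8

# Toy linear "encoders" standing in for deep networks.
W_ctx = rng.normal(size=(D_IN, D_REP)) / np.sqrt(D_IN)  # context encoder
W_tgt = W_ctx.copy()    # target encoder: a frozen copy here (an EMA in practice)
W_pred = np.eye(D_REP)  # predictor, identity-initialised for the sketch

def jepa_step(x_visible, x_hidden):
    s_ctx = x_visible @ W_ctx              # representation of the visible part
    s_tgt = x_hidden @ W_tgt               # representation of the hidden part: this IS the label
    s_hat = s_ctx @ W_pred                 # predict the hidden representation from the visible one
    return np.mean((s_hat - s_tgt) ** 2)   # loss lives in representation space, not pixel space
```

Notice that nothing decodes back to pixels: the prediction error is measured entirely between representations.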
So what does this unlock? Once representations of one modality (e.g., videos) are learnt, the next stage adds actions. A second model takes the video frame embeddings interleaved with a robot arm's movement commands, and predicts the next embedding. Planning then becomes: imagine many possible sequences of moves, simulate each one in the model's head, and pick the one most likely to reach the goal. Like a chess player thinking several moves ahead, but the "board" is the model's own internal understanding of the world.
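That planning loop is easy to sketch. Below, a toy linear dynamics model stands in for the trained action-conditioned predictor; the matrices A and B are made up, as is the brute-force search over a few discrete actions (real systems optimise over continuous action sequences):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
D, N_ACTIONS, HORIZON = 8, 4, 3

# Toy learnt dynamics: next embedding = A @ z + B[action].
A = np.eye(D) * 0.9
B = rng.normal(size=(N_ACTIONS, D)) * 0.5

def rollout(z0, actions):
    """Imagine a sequence of moves entirely in representation space."""
    z = z0
    for a in actions:
        z = A @ z + B[a]
    return z

def plan(z0, z_goal):
    """Try every short action sequence in the model's head; keep the best."""
    best_seq, best_dist = None, np.inf
    for seq in itertools.product(range(N_ACTIONS), repeat=HORIZON):
        dist = np.linalg.norm(rollout(z0, seq) - z_goal)
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq, best_dist
```

Nothing is executed in the real world until the search is done; the whole lookahead happens in embedding space.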
Interesting times ahead. If AMI Labs succeeds.
#RepresentationLearning #JEPA #VICReg #YannLeCun #AIResearch


