Abstract
"What I cannot create, I do not understand." — Richard Feynman
Pretraining vision foundation models (VFMs) is prohibitively expensive, which makes it a privilege of well-resourced institutions and confines independent researchers to downstream tasks such as benchmarking, interpreting, and aligning VFMs. This is a crisis for computer vision research: independent researchers and the public cannot truly understand, trust, or safely use VFMs by passively consuming open weights or APIs.
We propose democratizing VFM pretraining by scaling it down to a developmentally plausible framework that is scientifically reasonable and computationally affordable on university budgets. Our goal is to promote exploration rather than exploitation of pretraining, enabling independent researchers to build general-purpose VFMs that approach "baby intelligence", which in turn benefits efforts toward "grown-up" AI.
The framework closely mimics the minimal yet highly informative sensory experience of human infants and rests on three pillars:
Pretraining Data
Curated from longitudinal, egocentric audiovisual recordings of babies — capturing how infants naturally perceive the world.
Evaluation Benchmarks
A suite of developmentally aligned benchmarks assessing capabilities against cognitive milestones like object permanence, social skills, and language acquisition.
Pretraining Codebase
A user-friendly codebase and baseline models designed to run on university-scale compute budgets.
Tutorial Schedule
| Time | Session | Type | Speaker |
|---|---|---|---|
| 09:00 – 09:20 | Opening & Motivation | Talk | TBD |
| 09:20 – 09:50 | VFM Pretraining 101 | Talk | TBD |
| 09:50 – 10:20 | BabyVLM Dataset & Curation Pipeline | Talk | TBD |
| 10:20 – 10:35 | Coffee Break | Break | — |
| 10:35 – 11:05 | Developmentally Aligned Benchmarks | Talk | TBD |
| 11:05 – 11:35 | Hands-On: Train Your Baby VFM | Hands-on | TBD |
| 11:35 – 12:00 | Live Demo & Q&A | Demo | All Presenters |
Presenters
Resources
All materials will be made available before the tutorial date. Links will be updated here.
| Resource | Description | Link |
|---|---|---|
| Paper | Full technical report | (coming soon) |
| Dataset | Egocentric baby video corpus | (coming soon) |
| Code | Pretraining codebase & baselines | (coming soon) |
| Slides | Tutorial slide decks | (coming soon) |
| Notebook | Hands-on Colab notebook | (coming soon) |
Citation
If you find this work useful, please cite:
TBD