ICDL 2026 Tutorial

BabyVLM: Democratizing Research on the Pretraining of Vision Large Language Models

Scaling down pretraining to make VFM research accessible to everyone.

Wenqi Wang  ·  Max Whitton  ·  Boqing Gong

Boston University


Abstract

"What I cannot create, I do not understand." — Richard Feynman

Pretraining vision foundation models (VFMs) is prohibitively expensive. It has become a privilege of institutions with abundant resources, leaving independent researchers to work only on downstream tasks such as benchmarking, interpreting, and aligning VFMs. This situation is a crisis for computer vision research: independent researchers and the public cannot truly understand, trust, or safely use VFMs when their only access is passive, through open weights or APIs.

We propose democratizing VFM pretraining by scaling it down to a developmentally plausible framework that is scientifically reasonable and affordable on university compute budgets. Our goal is to promote exploration rather than exploitation of pretraining, enabling independent researchers to build general-purpose VFMs that approach "baby intelligence" and, in turn, benefit efforts toward "grown-up" AI.

This framework closely mimics the minimal yet highly informative sensory experiences of human infants and rests on three pillars:

🎥

Pretraining Data

Curated from longitudinal, egocentric audiovisual recordings of babies — capturing how infants naturally perceive the world.

🧠

Evaluation Benchmarks

A suite of developmentally aligned benchmarks assessing capabilities against cognitive milestones like object permanence, social skills, and language acquisition.

💻

Pretraining Codebase

A user-friendly codebase and baseline models designed to run on university-scale compute budgets.

Tutorial Schedule

Time           Session                              Type      Speaker
09:00 – 09:20  Opening & Motivation                 Talk      TBD
09:20 – 09:50  VFM Pretraining 101                  Talk      TBD
09:50 – 10:20  BabyVLM Dataset & Curation Pipeline  Talk      TBD
10:20 – 10:35  Coffee Break                         Break
10:35 – 11:05  Developmentally Aligned Benchmarks   Talk      TBD
11:05 – 11:35  Hands-On: Train Your Baby VFM        Hands-on  TBD
11:35 – 12:00  Live Demo & Q&A                      Demo      All Presenters

Presenters

👤
Presenter One
University A
👤
Presenter Two
University B
👤
Presenter Three
Institute C

Resources

All materials will be made available before the tutorial date. Links will be updated here.

Resource  Description                       Link
Paper     Full technical report             (coming soon)
Dataset   Egocentric baby video corpus      (coming soon)
Code      Pretraining codebase & baselines  (coming soon)
Slides    Tutorial slide decks              (coming soon)
Notebook  Hands-on Colab notebook           (coming soon)

Citation

If you find this work useful, please cite:

      TBD