ELLIS Multimodal Learning Systems Workshop on Multimodal Foundation Models

January 17 - January 19


Multimodal foundation models are a revolutionary class of AI models that provide impressive abilities to generate content (text, images, sound, videos, protein structures, and more) in response to interactive prompts, in a seemingly creative manner. These foundation models are typically autoregressive, self-supervised, transformer-based models pre-trained on large volumes of data, usually collected from the web. They already form the basis of all state-of-the-art systems in computer vision and natural language processing across a wide range of tasks and have shown impressive few-shot learning abilities. The perceived intelligence and adaptability of models like ChatGPT, Stable Diffusion, Gemini, and GPT-4 impress, but their tendency to produce inaccurate, misleading, or false information (and to present it confidently and convincingly) makes them unsuitable for critical tasks and poses serious societal concerns. In this workshop we present recent advances in multimodal foundation models from academia and industry and discuss their impact and implications moving forward.


The workshop is hosted at the Mathematisches Forschungsinstitut Oberwolfach (MFO). Accommodation and meals will take place in Hotel Hirschen, which is within walking distance of the MFO.


Wednesday January 17

15:00 – 19:00

Arrival of attendees, time for socializing and discussion

19:00 – 22:00

Opening dinner at Hotel Hirschen

Thursday January 18

08:00 – 09:30

Breakfast at Hotel Hirschen

09:30 – 11:00

Morning Session I: Foundation Models (Chair: Yiannis Kompatsiaris)

FoMO without FOMO by Karteek Alahari (Inria)

Towards 3D Human Foundation Models by Cristian Sminchisescu (Google)

What multimodal foundation models cannot perceive by Cees Snoek (University of Amsterdam)

Are Foundation Models the tool for Social Embodied AI? by Xavier Alameda-Pineda (Inria)

11:00 – 11:30

Coffee break

11:30 – 12:30

Morning Session II: Vision & Language (Chair: Dima Damen)

Vocabulary-free Image Classification by Elisa Ricci (University of Trento)

Coreference resolution in narrated images by Hakan Bilen (The University of Edinburgh)

Vision-Language Self-Supervised Learning by Shaogang Gong (Queen Mary, University of London)

12:30 – 14:00

Lunch break at Hotel Hirschen

14:00 – 15:30

Afternoon Session I: Generative AI (Chair: Karteek Alahari)

Images & text: alignment, generation and compression by Jakob Verbeek (Meta)

Measuring the Quality of Generative Neural Networks – An Unsolved Problem by Juergen Gall (University of Bonn)

Controllable generation for Analysis and Synthesis by Ioannis Patras (Queen Mary, University of London)

Improving Fairness using Vision-Language Driven Image Augmentation by Nicu Sebe (University of Trento)

15:30 – 16:00

Coffee break

16:00 – 17:00

Afternoon Session II: Multimodality (Chair: Xavier Alameda-Pineda)

Multi-modality in Egocentric Vision – Contradictory and complementary signals by Dima Damen (University of Bristol)

Multimodal LLMs for Document Understanding by Dimosthenis Karatzas (Universitat Autònoma de Barcelona)

Large Multimodal Models for Media and Journalism by Yiannis Kompatsiaris (Information Technologies Institute, CERTH)

17:00 – 18:30

Discussion Session (Chairs: Cees Snoek & Nicu Sebe)

19:00 – 22:00

Dinner at Hotel Hirschen

Friday January 19

08:00 – 10:00

Breakfast at Hotel Hirschen

10:00 onwards: Departure of attendees



This workshop is financially supported by the European Laboratory for Learning and Intelligent Systems (ELLIS), and we acknowledge travel support from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 951847, known as ELISE – European Network of AI Excellence Centres.



Mathematisches Forschungsinstitut Oberwolfach (MFO)
Schwarzwaldstraße 9-11
Oberwolfach-Walke, 77709 Germany
+49 (0) 7834 979-0