ELLIS Multimodal Learning Systems Workshop on Multimodal Foundation Models

January 17 - January 19


Multimodal foundation models are a revolutionary class of AI models that provide impressive abilities to generate content (text, images, sound, videos, protein structures, and more) in response to interactive prompts, in a seemingly creative manner. These foundation models are typically autoregressive, self-supervised, transformer-based models pre-trained on large volumes of data, usually collected from the web. They already form the basis of all state-of-the-art systems in computer vision and natural language processing across a wide range of tasks and have shown impressive few-shot learning abilities. The perceived intelligence and adaptability of models like ChatGPT, Stable Diffusion, Gemini, and GPT-4 impress, but their tendency to produce inaccurate, misleading, or false information (and to present it confidently and convincingly) makes them unsuitable for critical tasks and poses serious societal concerns. In this workshop we present recent advances in multimodal foundation models from academia and industry and discuss their impact and implications moving forward.


The workshop is hosted at the Mathematisches Forschungsinstitut Oberwolfach (MFO). Accommodation and meals will take place in Hotel Hirschen, which is within walking distance of the MFO.


Wednesday January 17

15:00 – 19:00

Arrival of attendees, time for socializing and discussion

19:00 – 22:00

Opening dinner at Hotel Hirschen

Thursday January 18

08:00 – 09:30

Breakfast at Hotel Hirschen

09:30 – 11:00

Morning Session I: Foundation Models (Chair: Yiannis Kompatsiaris)

FoMO without FOMO by Karteek Alahari (Inria)

Towards 3D Human Foundation Models by Cristian Sminchisescu (Google)

What multimodal foundation models cannot perceive by Cees Snoek (University of Amsterdam)

Are Foundation Models the tool for Social Embodied AI? by Xavier Alameda-Pineda (Inria)

11:00 – 11:30

Coffee break

11:30 – 12:30

Morning Session II: Vision & Language (Chair: Dima Damen)

Vocabulary-free Image Classification by Elisa Ricci (University of Trento)

Coreference resolution in narrated images by Hakan Bilen (The University of Edinburgh)

Vision-Language Self-Supervised Learning by Shaogang Gong (Queen Mary, University of London)

12:30 – 14:00

Lunch break at Hotel Hirschen

14:00 – 15:30

Afternoon Session I: Generative AI (Chair: Karteek Alahari)

Images & text: alignment, generation and compression by Jakob Verbeek (Meta)

Measuring the Quality of Generative Neural Networks – An Unsolved Problem by Juergen Gall (University of Bonn)

Controllable generation for Analysis and Synthesis by Ioannis Patras (Queen Mary, University of London)

Improving Fairness using Vision-Language Driven Image Augmentation by Nicu Sebe (University of Trento)

15:30 – 16:00

Coffee break

16:00 – 17:00

Afternoon Session II: Multimodality (Chair: Xavier Alameda-Pineda)

Multi-modality in Egocentric Vision – Contradictory and complementary signals by Dima Damen (University of Bristol)

Multimodal LLMs for Document Understanding by Dimosthenis Karatzas (Universitat Autònoma de Barcelona)

Large Multimodal Models for Media and Journalism by Yiannis Kompatsiaris (Information Technologies Institute, CERTH)

17:00 – 18:30

Discussion Session (Chairs: Cees Snoek & Nicu Sebe)

19:00 – 22:00

Dinner at Hotel Hirschen

Friday January 19

08:00 – 10:00

Breakfast at Hotel Hirschen

10:00 onwards: Departure of attendees



This workshop is financially supported by the European Laboratory for Learning and Intelligent Systems (ELLIS), and we acknowledge travel support from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 951847, known as ELISE – European Network of AI Excellence Centres.



Mathematisches Forschungsinstitut Oberwolfach (MFO)
Schwarzwaldstraße 9-11
Oberwolfach-Walke, 77709 Germany
+49 (0) 7834 979-0