Scientific Publications
Optimizing Resource Consumption in Diffusion Models through Hallucination Early Detection
Abstract: Diffusion models have significantly advanced generative AI, but they encounter difficulties when generating complex combinations of multiple objects. As the final result heavily depends on the initial seed, accurately ensuring the desired output can require multiple iterations of the generation process. This repetition not only leads to a waste of time but also increases energy consumption, echoing the challenges of efficiency and accuracy in complex generative tasks. To tackle this issue, we introduce HEaD (Hallucination Early Detection), a new paradigm designed to swiftly detect incorrect generations at the beginning of the diffusion process. The HEaD pipeline combines cross-attention maps with a new indicator, the Predicted Final Image, to forecast the final outcome by leveraging the information available at early stages of the generation process. We demonstrate that using HEaD saves computational resources and accelerates the generation process to get a complete image, i.e. an image where all requested objects are accurately depicted. Our findings reveal that HEaD can save up to 12% of the generation time in a two-object scenario and underscore the importance of early detection mechanisms in generative models.
Type of Publication: Conference Paper
Title of Conference: European Conference on Computer Vision Workshop 2024 (ECCVW), 29 September – 04 October 2024
Authors: Betti, Federico; Baraldi, Lorenzo; Cucchiara, Rita; Sebe, Nicu
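The early-abort idea described in the abstract can be illustrated with a minimal, self-contained toy. This is a sketch, not the paper's actual HEaD pipeline: `object_score` is a hypothetical stand-in for the cross-attention / Predicted Final Image check, and the step counts and threshold are illustrative assumptions.

```python
import zlib

TOTAL_STEPS = 50   # full denoising schedule length (illustrative)
CHECK_STEP = 10    # early step at which detection runs (illustrative)

def object_score(seed: int, obj: str) -> float:
    """Toy stand-in for a per-object presence score in [0, 1).
    A real detector would inspect cross-attention maps and a
    predicted final image, not a checksum of the seed."""
    return (zlib.crc32(f"{seed}:{obj}".encode()) % 1000) / 1000.0

def generate(objects, max_restarts=20, threshold=0.3):
    """Denoise up to CHECK_STEP; if any requested object looks missing,
    abort and retry with a new seed instead of spending all TOTAL_STEPS.
    Returns the accepted seed (or None) and the total steps spent."""
    steps_spent = 0
    for seed in range(max_restarts):
        steps_spent += CHECK_STEP                    # cost of the partial run
        if all(object_score(seed, o) >= threshold for o in objects):
            steps_spent += TOTAL_STEPS - CHECK_STEP  # finish this generation
            return seed, steps_spent
    return None, steps_spent

seed, cost = generate(["cat", "dog"])
baseline = (seed + 1) * TOTAL_STEPS if seed is not None else None
print(f"accepted seed={seed}, early-detection cost={cost} steps, "
      f"restart-to-completion baseline={baseline} steps")
```

The saving comes from charging only `CHECK_STEP` steps to each rejected seed instead of a full schedule, mirroring the resource argument in the abstract.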
Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
Abstract: Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets inevitably brings significant computational costs. In this paper, we propose FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation, which leverages the ability of diffusion models to visually localize generated concepts and local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected, starting from a large set of captions and leveraging visual and semantic contexts. At test time, these are queried to support the visual matching process, which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets, surpassing previous methods by more than 7.0 average points in terms of mIoU and without requiring any training. Our source code is available at aimagelab.github.io/freeda.
Type of Publication: Conference Paper
Title of Conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024 (CVPR 2024), Seattle, 17-21 June 2024
Authors: Barsellotti, Luca; Amoroso, Roberto; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
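The test-time matching step in the abstract — assigning each class-agnostic region the class of its most similar reference embedding — can be sketched with plain cosine similarity. This is an illustrative simplification under assumed names (`assign_regions`, `prototypes`), not FreeDA's actual pipeline, which also combines local and global similarities.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def assign_regions(region_embs, prototypes):
    """Training-free matching, sketched: label each class-agnostic region
    with the class whose reference (prototype) embedding is most
    cosine-similar. `prototypes` maps class name -> embedding."""
    labels = []
    for emb in region_embs:
        best = max(prototypes, key=lambda c: cosine(emb, prototypes[c]))
        labels.append(best)
    return labels

prototypes = {"cat": [1.0, 0.1, 0.0], "grass": [0.0, 0.2, 1.0]}
regions = [[0.9, 0.0, 0.1], [0.1, 0.3, 0.9]]
print(assign_regions(regions, prototypes))  # -> ['cat', 'grass']
```

Because the reference embeddings are collected offline, inference reduces to nearest-prototype lookups like this, with no gradient updates.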
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever “toxic” linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.
Type of Publication: Conference Paper
Title of Conference: European Conference on Computer Vision (ECCV 2024), Milan, 29 September – 4 October 2024
Authors: Poppi, Samuele; Poppi, Tobia; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Multi-Class Unlearning for Image Classification via Weight Filtering
Abstract: Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network’s components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any class after training. By discovering weights that are specific to each class, our approach also recovers a representation of the classes which is explainable by design. We test the proposed framework on small- and medium-scale image classification datasets, with both convolution- and Transformer-based backbones, showcasing the potential for explainable solutions through unlearning.
Type of Publication: Journal Article
Title of Journal: IEEE Intelligent Systems, ISSN: 1541-1672, 2024.
Authors: Poppi, Samuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
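The selective-unlearning behavior described above can be illustrated with a toy linear classifier whose per-class scores can be switched off by a forget set. This is only a sketch of the *behavior*: the paper modulates internal network components with learned memory matrices, not by masking output scores, and all names here (`predict`, `forget`) are hypothetical.

```python
def predict(weights, x, forget=frozenset()):
    """Score each class as a dot product; classes in `forget` are masked
    out (score forced to -inf), so the model acts as if those classes
    had been unlearned. `weights` maps class name -> weight vector."""
    scores = {}
    for cls, w in weights.items():
        if cls in forget:
            scores[cls] = float("-inf")   # filtered: class is "unlearned"
        else:
            scores[cls] = sum(wi * xi for wi, xi in zip(w, x))
    return max(scores, key=scores.get)

weights = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fox": [0.7, 0.7]}
x = [1.0, 0.1]
print(predict(weights, x))                  # -> 'cat'
print(predict(weights, x, forget={"cat"}))  # -> 'fox'
```

The key property mirrored here is that any class (or several at once) can be unlearned after training by flipping its mask, without retraining the remaining classes.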
Cooperative Online Learning with Feedback Graphs
Description: We study the interplay between communication and feedback in a cooperative online learning setting, where a network of communicating agents learn a common sequential decision-making task through a feedback graph. We bound the network regret in terms of the independence number of the strong product between the communication network and the feedback graph. Our analysis recovers as special cases many previously known bounds for cooperative online learning with expert or bandit feedback. We also prove an instance-based lower bound, demonstrating that our positive results are not improvable except in pathological cases. Experiments on synthetic data confirm our theoretical findings.
Type of Publication: Journal Article
Title of Journal: Transactions on Machine Learning Research, 2024.
Authors: Cesa-Bianchi, Nicolò; Cesari, Tommaso; Della Vecchia, Riccardo
A Theory of Interpretable Approximations
Description: Can a deep neural network be approximated by a small decision tree based on simple features? This question and its variants are behind the growing demand for machine learning models that are interpretable by humans. In this work we study such questions by introducing interpretable approximations, a notion that captures the idea of approximating a target concept c by a small aggregation of concepts from some base class H. In particular, we consider the approximation of a binary concept c by decision trees based on a simple class H (e.g., of bounded VC dimension), and use the tree depth as a measure of complexity. Our primary contribution is the following remarkable trichotomy. For any given pair of H and c, exactly one of these cases holds: (i) c cannot be approximated by H with arbitrary accuracy; (ii) c can be approximated by H with arbitrary accuracy, but there exists no universal rate that bounds the complexity of the approximations as a function of the accuracy; or (iii) there exists a constant κ that depends only on H and c such that, for any data distribution and any desired accuracy level, c can be approximated by H with a complexity not exceeding κ. This taxonomy stands in stark contrast to the landscape of supervised classification, which offers a complex array of distribution-free and universally learnable scenarios. We show that, in the case of interpretable approximations, even a slightly nontrivial a-priori guarantee on the complexity of approximations implies approximations with constant (distribution-free and accuracy-free) complexity. We extend our trichotomy to classes H of unbounded VC dimension and give characterizations of interpretability based on the algebra generated by H.
Type of Publication: Conference Paper
Title of Conference: 37th Annual Conference on Learning Theory (COLT), June 2024
Authors: Bressan, Marco; Cesa-Bianchi, Nicolò; Esposito, Emmanuel; Mansour, Yishay; Moran, Shay; Thiessen, Maximilian
Efficient Algorithms for Learning Monophonic Halfspaces in Graphs
Description: We study the problem of learning a binary classifier on the vertices of a graph. In particular, we consider classifiers given by monophonic halfspaces, partitions of the vertices that are convex in a certain abstract sense. Monophonic halfspaces, and related notions such as geodesic halfspaces, have recently attracted interest, and several connections have been drawn between their properties (e.g., their VC dimension) and the structure of the underlying graph G. We prove several novel results for learning monophonic halfspaces in the supervised, online, and active settings. Our main result is that a monophonic halfspace can be learned with near-optimal passive sample complexity in time polynomial in n = |V(G)|. This requires us to devise a polynomial-time algorithm for consistent hypothesis checking, based on several structural insights on monophonic halfspaces and on a reduction to 2-satisfiability. We prove similar results for the online and active settings. We also show that the concept class can be enumerated with delay poly(n), and that empirical risk minimization can be performed in time 2^ω(G) · poly(n), where ω(G) is the clique number of G. These results answer open questions from the literature (González et al., 2020), and show a contrast with geodesic halfspaces, for which some of the said problems are NP-hard (Seiffarth et al., 2023).
Type of Publication: Conference Paper
Title of Conference: 37th Annual Conference on Learning Theory (COLT), June 2024
Authors: Bressan, Marco; Esposito, Emmanuel; Thiessen, Maximilian
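The description mentions a reduction of consistent hypothesis checking to 2-satisfiability. The paper's reduction itself is graph-specific, but the standard polynomial-time 2-SAT solver such a reduction plugs into — the implication graph plus Kosaraju's strongly connected components — can be sketched generically:

```python
def solve_2sat(n, clauses):
    """Generic 2-SAT solver via the implication graph and Kosaraju SCCs.
    Variables are 1..n; a clause (a, b) means (a OR b), with negative
    integers denoting negated literals. Returns a satisfying assignment
    as {var: bool}, or None if unsatisfiable."""
    def idx(lit):                       # literal -> node index
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    N = 2 * n
    graph = [[] for _ in range(N)]
    rgraph = [[] for _ in range(N)]
    for a, b in clauses:                # (a OR b) => (~a -> b), (~b -> a)
        graph[idx(-a)].append(idx(b)); rgraph[idx(b)].append(idx(-a))
        graph[idx(-b)].append(idx(a)); rgraph[idx(a)].append(idx(-b))

    order, seen = [], [False] * N
    def dfs1(u):                        # iterative post-order DFS
        stack = [(u, 0)]
        seen[u] = True
        while stack:
            v, i = stack.pop()
            if i < len(graph[v]):
                stack.append((v, i + 1))
                w = graph[v][i]
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, 0))
            else:
                order.append(v)
    for u in range(N):
        if not seen[u]:
            dfs1(u)

    comp = [-1] * N
    c = 0
    for u in reversed(order):           # second pass on the reversed graph
        if comp[u] == -1:
            stack = [u]; comp[u] = c
            while stack:
                v = stack.pop()
                for w in rgraph[v]:
                    if comp[w] == -1:
                        comp[w] = c
                        stack.append(w)
            c += 1

    assignment = {}
    for v in range(1, n + 1):
        if comp[idx(v)] == comp[idx(-v)]:
            return None                 # x and ~x in the same SCC: UNSAT
        # components are numbered in topological order; the sinkward
        # literal of each complementary pair is set to true
        assignment[v] = comp[idx(v)] > comp[idx(-v)]
    return assignment

print(solve_2sat(2, [(1, 2), (-1, 2)]))    # satisfiable, e.g. x2 = True
print(solve_2sat(1, [(1, 1), (-1, -1)]))   # -> None (unsatisfiable)
```

Since 2-SAT is solvable in linear time in the size of the formula, any polynomial-size encoding of consistency constraints — such as the one the paper builds for monophonic halfspaces — yields a polynomial-time consistency check overall.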
Margin-Based Active Learning of Classifiers
Description: We study active learning of multiclass classifiers, focusing on the realizable transductive setting. The input is a finite subset X of some metric space, and the concept to be learned is a partition C of X into k classes. The goal is to learn C by querying the labels of as few elements of X as possible. This is a useful subroutine in pool-based active learning, and is motivated by applications where labels are expensive to obtain. Our main result is that, in very different settings, there exist interesting notions of margin that yield efficient active learning algorithms. First, we consider the case X ⊂ R^m, assuming that each class has an unknown “personalized” margin separating it from the rest. Second, we consider the case where X is a finite metric space, and the classes are convex with margin according to the geodesic distances in the thresholded connectivity graph. In both cases, we give algorithms that learn C exactly, in polynomial time, using O(log n) label queries, where O(·) hides a near-optimal dependence on the dimension of the metric spaces. Our results actually hold for, or can be adapted to, more general settings, such as pseudometric and semimetric spaces.
Type of Publication: Journal Article
Title of Journal: Journal of Machine Learning Research, 2024.
Authors: Bressan, Marco; Cesa-Bianchi, Nicolò; Lattanzi, Silvio; Paudice, Andrea
ELIAS aims at establishing Europe as a leader in Artificial Intelligence (AI) research that drives sustainable innovation and economic development.
We will create a Network of Excellence connecting researchers in academia with practitioners in the industry to differentiate Europe as a region where AI research builds towards a sustainable long-term future for our planet, contributes to a cohesive society, and respects individual preferences and rights.