Adaptive Log-Euclidean Metrics for SPD Matrix Learning

Abstract: Symmetric Positive Definite (SPD) matrices have received wide attention in machine learning due to their intrinsic capacity to encode underlying structural correlation in data. Many successful Riemannian metrics have been proposed to reflect the non-Euclidean geometry of SPD manifolds. However, most existing metric tensors are fixed, which might lead to sub-optimal performance for SPD matrix learning, especially for deep SPD neural networks. To remedy this limitation, we leverage the commonly encountered pullback techniques and propose Adaptive Log-Euclidean Metrics (ALEMs), which extend the widely used Log-Euclidean Metric (LEM). Compared with the previous Riemannian metrics, our metrics contain learnable parameters, which can better adapt to the complex dynamics of Riemannian neural networks with minor extra computations. We also present a complete theoretical analysis to support our ALEMs, including algebraic and Riemannian properties. The experimental and theoretical results demonstrate the merit of the proposed metrics in improving the performance of SPD neural networks. The efficacy of our metrics is further showcased on a set of recently developed Riemannian building blocks, including Riemannian batch normalization, Riemannian residual blocks, and Riemannian classifiers.
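
For background, the Log-Euclidean Metric (LEM) extended here induces the distance d(P, Q) = ||log(P) - log(Q)||_F between SPD matrices, where log is the matrix logarithm. The sketch below computes this distance together with a hypothetical adaptive variant using a learnable positive weight vector; the weighting scheme is an illustrative assumption, not the paper's actual ALEM parameterization.

    import numpy as np

    def spd_log(P):
        """Matrix logarithm of an SPD matrix via eigendecomposition."""
        w, V = np.linalg.eigh(P)      # eigenvalues w > 0 for SPD inputs
        return (V * np.log(w)) @ V.T  # V diag(log w) V^T

    def lem_distance(P, Q):
        """Standard Log-Euclidean distance: ||log(P) - log(Q)||_F."""
        return np.linalg.norm(spd_log(P) - spd_log(Q), "fro")

    def adaptive_lem_distance(P, Q, a):
        """Hypothetical adaptive variant: pull back the Euclidean metric
        through X -> diag(a) log(X) diag(a) with learnable weights a > 0
        (one plausible parameterization; the paper's ALEM may differ)."""
        D = np.diag(a)
        return np.linalg.norm(D @ (spd_log(P) - spd_log(Q)) @ D, "fro")

    # Toy usage on random SPD matrices.
    rng = np.random.default_rng(0)
    M, N = rng.standard_normal((2, 4, 4))
    P, Q = M @ M.T + np.eye(4), N @ N.T + np.eye(4)
    print(lem_distance(P, Q), adaptive_lem_distance(P, Q, np.ones(4)))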

Type of Publication: Journal article

Title of Journal: IEEE Transactions on Image Processing, 33(9), 5194–5205, 2024.

Authors: Chen, Ziheng; Song, Yue; Xu, Tianyang; Huang, Zhiwu; Wu, Xiao-Jun; Sebe, Nicu

UVMap-ID: A Controllable and Personalized UV Map Generative Model

Abstract: Recently, diffusion models have made significant strides in synthesizing realistic 2D human images based on provided text prompts. Building upon this, researchers have extended 2D text-to-image diffusion models into the 3D domain for generating human textures (UV maps). However, important problems with UV map generative models remain unsolved, namely how to generate personalized texture maps for any given face image, and how to define and evaluate the quality of these generated texture maps. To solve these problems, we introduce a novel method, UVMap-ID, a controllable and personalized UV map generative model. Unlike traditional large-scale training methods in 2D, we propose to fine-tune a pre-trained text-to-image diffusion model integrated with a face fusion module to achieve ID-driven customized generation. To support this fine-tuning strategy, we introduce a small-scale attribute-balanced training dataset, including high-quality textures with labeled text and Face ID. Additionally, we introduce metrics to evaluate multiple aspects of the textures. Finally, both quantitative and qualitative analyses demonstrate the effectiveness of our method in controllable and personalized UV map generation. Code is publicly available at https://github.com/twowwj/UVMap-ID.
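
The repository linked above is authoritative; purely as an illustrative reading of the abstract, the sketch below shows one way a face fusion module could inject a face-ID embedding into the text-conditioning stream of a pre-trained text-to-image diffusion model. All names, shapes, and the fine-tuning recipe in the comments are assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class FaceFusionModule(nn.Module):
        """Hypothetical fusion block: appends a projected face-ID embedding
        as an extra conditioning token for the diffusion model's UNet."""
        def __init__(self, id_dim=512, cond_dim=768):
            super().__init__()
            self.proj = nn.Linear(id_dim, cond_dim)

        def forward(self, text_cond, face_id_emb):
            # text_cond: (B, T, cond_dim) text tokens; face_id_emb: (B, id_dim)
            id_token = self.proj(face_id_emb).unsqueeze(1)   # (B, 1, cond_dim)
            return torch.cat([text_cond, id_token], dim=1)   # (B, T+1, cond_dim)

    # During fine-tuning, the pre-trained backbone would stay mostly frozen
    # and only the fusion module (plus, e.g., lightweight adapters) would be
    # optimized with the usual denoising objective -- an assumed recipe.
    fusion = FaceFusionModule()
    cond = fusion(torch.randn(2, 77, 768), torch.randn(2, 512))
    print(cond.shape)  # torch.Size([2, 78, 768])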

Type of Publication: Conference Paper

Title of Conference: ACM Multimedia, November 2024

Authors: Wang, Weijie; Zhang, Jichao; Liu, Chang; Li, Xia; Xu, Xingqian; Shi, Humphrey; Sebe, Nicu; Lepri, Bruno

Vision + X: A Survey on Multimodal Learning in the Light of Data

Abstract: We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, multimodal machine learning that incorporates data from various sources has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the intrinsic nature of different data modalities. We analyze the commonness and uniqueness of each data format mainly ranging from vision, audio, text, and motions, and then present the methodological advancements categorized by the combination of data modalities, such as Vision+Text, with slightly inclined emphasis on the visual data. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, and the rhythm correspondence between video dance moves and musical beats. We hope that the exploitation of the alignment as well as the existing gap between the intrinsic nature of data modality and the technical designs will benefit future research studies to better address a specific challenge related to the concrete multimodal task, prompting a unified multimodal machine learning framework closer to a real human intelligence system.

Type of Publication: Journal article

Title of Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12), 9102-9122, 2024.

Authors: Zhu, Ye; Wu, Yu; Sebe, Nicu; Yan, Yan

Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion

Abstract: Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this process remains unexplored. We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance of the number of activated neurons for different inputs, we introduce a more effective dynamic-k expert selection rule that adjusts the number of executed experts on a per-token basis. To achieve further savings, we extend this approach to multi-head attention projections. Finally, we develop an efficient implementation that translates these computational savings into actual wall-clock speedup. The proposed method, Dense to Dynamic-k Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, reducing inference cost by up to 60% without significantly impacting performance.
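
To make the dynamic-k idea concrete, here is a minimal sketch of per-token expert selection: instead of a fixed top-k, each token executes every expert whose router score clears a threshold relative to its best score. The relative-threshold criterion and the parameter tau are illustrative assumptions; the paper's exact rule may differ.

    import torch

    def dynamic_k_select(router_logits, tau=0.5):
        """Variable-size expert selection per token (dynamic-k sketch).

        router_logits: (num_tokens, num_experts). Returns expert indices
        sorted by score and a boolean mask of which ones to execute.
        """
        scores = torch.softmax(router_logits, dim=-1)
        sorted_scores, sorted_idx = scores.sort(dim=-1, descending=True)
        keep = sorted_scores >= tau * sorted_scores[:, :1]  # relative threshold
        keep[:, 0] = True  # always execute at least the best expert
        return sorted_idx, keep

    logits = torch.randn(4, 8)  # 4 tokens, 8 experts
    idx, keep = dynamic_k_select(logits)
    print(keep.sum(dim=-1))  # number of executed experts varies per token

Under such a rule, confidently routed tokens execute few experts while ambiguous ones execute more, matching the high per-input variance of activated neurons noted in the abstract.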

Type of Publication: Conference Paper

Title of Conference: The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, 10-15.12.2024

Authors: Szatkowski, Filip; Wójcik, Bartosz; Piórczyński, Mikołaj; Scardapane, Simone

Augmentation-aware self-supervised learning with conditioned projector

Abstract: Self-supervised learning (SSL) is a powerful technique for learning from unlabeled data. By learning to remain invariant to applied data augmentations, methods such as SimCLR and MoCo can reach quality on par with supervised approaches. However, this invariance may be detrimental for solving downstream tasks that depend on traits affected by augmentations used during pretraining, such as color. In this paper, we propose to foster sensitivity to such characteristics in the representation space by modifying the projector network, a common component of self-supervised architectures. Specifically, we supplement the projector with information about augmentations applied to images. For the projector to take advantage of this auxiliary conditioning when solving the SSL task, the feature extractor learns to preserve the augmentation information in its representations. Our approach, coined Conditional Augmentation-aware Self-supervised Learning (CASSLE), is directly applicable to typical joint-embedding SSL methods regardless of their objective functions. Moreover, it does not require major changes in the network architecture or prior knowledge of downstream tasks. In addition to an analysis of sensitivity towards different data augmentations, we conduct a series of experiments, which show that CASSLE improves over various SSL methods, reaching state-of-the-art performance in multiple downstream tasks.
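
A minimal sketch of the conditioned-projector idea, assuming the augmentation parameters of each view (e.g., crop box, flip flag, color-jitter strengths) are flattened into a small vector; the layer sizes and encoding are illustrative assumptions rather than CASSLE's exact architecture.

    import torch
    import torch.nn as nn

    class ConditionedProjector(nn.Module):
        """SSL projector that also sees the augmentation parameters, so the
        encoder need not discard augmentation-related information."""
        def __init__(self, feat_dim=2048, aug_dim=16, hidden=2048, out_dim=128):
            super().__init__()
            self.aug_embed = nn.Sequential(nn.Linear(aug_dim, hidden), nn.ReLU())
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim + hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, out_dim),
            )

        def forward(self, h, aug_params):
            # h: (B, feat_dim) encoder features; aug_params: (B, aug_dim)
            z = torch.cat([h, self.aug_embed(aug_params)], dim=-1)
            return self.mlp(z)

    proj = ConditionedProjector()
    out = proj(torch.randn(8, 2048), torch.randn(8, 16))
    print(out.shape)  # torch.Size([8, 128])

Because the projector can account for view-specific changes via the conditioning, the SSL objective no longer forces the encoder itself to discard augmentation-related traits such as color.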

Type of Publication: Journal article

Title of Journal: Knowledge-Based Systems, 305, ISSN: 1872-7409, 2024.

Authors: Przewięźlikowski, Marcin; Pyla, Mateusz; Zieliński, Bartosz; Twardowski, Bartłomiej; Tabor, Jacek; Śmieja, Marek.

Task-recency bias strikes back: Adapting covariances in Exemplar-Free Class Incremental Learning

Abstract: Exemplar-Free Class Incremental Learning (EFCIL) tackles the problem of training a model on a sequence of tasks without access to past data. Existing state-of-the-art methods represent classes as Gaussian distributions in the feature extractor's latent space, enabling Bayes classification or training the classifier by replaying pseudo features. However, we identify two critical issues that compromise their efficacy when the feature extractor is updated on incremental tasks. First, they do not consider that classes' covariance matrices change and must be adapted after each task. Second, they are susceptible to a task-recency bias caused by dimensionality collapse occurring during training. In this work, we propose AdaGauss, a novel method that adapts covariance matrices from task to task and mitigates the task-recency bias through an additional anti-collapse loss function. AdaGauss yields state-of-the-art results on popular EFCIL benchmarks and datasets when training from scratch or starting from a pre-trained backbone.
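
A hedged sketch of the two ingredients described above: transporting a stored class Gaussian through a learned map between the old and new feature spaces, and an anti-collapse penalty on the eigenvalues of the feature covariance. Both the affine adapter and the log-eigenvalue loss are illustrative assumptions, not AdaGauss's exact formulation.

    import torch

    def adapt_gaussian(mu, Sigma, A, b):
        """Transport an old-task class Gaussian N(mu, Sigma) through a learned
        affine adapter f(x) = A x + b aligning old and new feature spaces."""
        return A @ mu + b, A @ Sigma @ A.T

    def anti_collapse_loss(features, eps=1e-4):
        """Discourage dimensionality collapse by penalizing small eigenvalues
        of the batch feature covariance (one plausible form of such a loss)."""
        f = features - features.mean(dim=0, keepdim=True)
        cov = f.T @ f / (f.shape[0] - 1)
        eig = torch.linalg.eigvalsh(cov + eps * torch.eye(cov.shape[0]))
        return -torch.log(eig.clamp_min(eps)).mean()

    feats = torch.randn(64, 32)  # batch of 64 features, 32 dims
    print(anti_collapse_loss(feats))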

Type of Publication: Conference Paper

Title of Conference: The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, 10-15.12.2024

Authors: Rypeść, Grzegorz; Cygert, Sebastian; Trzcinski, Tomasz; Twardowski, Bartłomiej.

Unlearning Vision Transformers without Retaining Data via Low-Rank Decompositions

Abstract: The implementation of data protection regulations such as the GDPR and the California Consumer Privacy Act has sparked a growing interest in removing sensitive information from pre-trained models without requiring retraining from scratch, all while maintaining predictive performance on the remaining data. Recent studies on machine unlearning for deep neural networks have produced approaches that impose constraints on the training procedure, are limited to small-scale architectures, and adapt poorly to real-world requirements. In this paper, we develop an approach to delete information on a class from a pre-trained model by injecting a trainable low-rank decomposition into the network parameters, without requiring access to the original training set. Our approach greatly reduces the number of parameters to train as well as the time and memory requirements. This allows painless application to real-life settings where the entire training set is unavailable, and compliance with the requirement of time-bound deletion. We conduct experiments on various Vision Transformer architectures for class forgetting. Extensive empirical analyses demonstrate that our proposed method is efficient, safe to apply, and effective in removing learned information while maintaining accuracy.
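
A minimal sketch of the low-rank injection, assuming a LoRA-style wrapper around a frozen linear layer of a Vision Transformer: the frozen weight W is augmented with a trainable update BA, and only the small factors A and B are optimized during unlearning. The class name and initialization are assumptions, and the retain-free unlearning objective itself is not reproduced here.

    import torch
    import torch.nn as nn

    class LowRankAdapter(nn.Module):
        """Wraps a frozen nn.Linear with a trainable low-rank update B @ A,
        so unlearning touches only rank * (in + out) extra parameters."""
        def __init__(self, linear: nn.Linear, rank: int = 8):
            super().__init__()
            self.base = linear
            for p in self.base.parameters():
                p.requires_grad = False  # original weights stay fixed
            self.A = nn.Parameter(0.01 * torch.randn(rank, linear.in_features))
            self.B = nn.Parameter(torch.zeros(linear.out_features, rank))
            # B starts at zero, so the wrapped layer initially matches the base.

        def forward(self, x):
            return self.base(x) + x @ self.A.T @ self.B.T

    layer = LowRankAdapter(nn.Linear(768, 768))
    y = layer(torch.randn(4, 197, 768))  # e.g., a ViT token sequence
    print(y.shape)  # torch.Size([4, 197, 768])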

Type of Publication: Conference Paper

Title of Conference: International Conference on Pattern Recognition 2024 (ICPR 2024), Kolkata, 01-05 December 2024

Authors: Poppi, Samuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita.

Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments

Abstract: In the last years, the research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, like Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. Also, they fail to account for the realistic scenario in which the target object is a user-specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task denominated Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering modular and end-to-end agents.
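
As a reading aid, the episode structure described above can be summarized in a small data type; the field names are illustrative and do not reflect the actual PInNED schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PINEpisode:
        """One PIN episode as described in the abstract (illustrative fields)."""
        scene_id: str                  # photo-realistic scene with injected 3D objects
        target_instance_id: str        # the user-specific object to reach
        reference_images: List[str]    # target views on a neutral background
        text_descriptions: List[str]   # manually annotated descriptions
        distractor_ids: List[str]      # same-category instances to tell apart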

Type of Publication: Conference Paper

Title of Conference: Neural Information Processing Systems 2024 (NeurIPS 2024), Vancouver, 10-15 December 2024

Authors: Barsellotti, Luca; Bigazzi, Roberto; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita.