Novel Class Discovery for Ultra-Fine-Grained Visual Categorization

Abstract: Ultra-fine-grained visual categorization (Ultra-FGVC) aims at distinguishing highly similar sub-categories within fine-grained objects, such as different soybean cultivars. Compared to traditional fine-grained visual categorization, Ultra-FGVC encounters more hurdles due to the small inter-class and large intra-class variation. Given these challenges, relying on human annotation for Ultra-FGVC is impractical. To this end, our work introduces a novel task termed Ultra-Fine-Grained Novel Class Discovery (UFG-NCD), which leverages partially annotated data to identify new categories of unlabeled images for Ultra-FGVC. To tackle this problem, we devise a Region-Aligned Proxy Learning (RAPL) framework, which comprises a Channel-wise Region Alignment (CRA) module and a Semi-Supervised Proxy Learning (SemiPL) strategy. The CRA module is designed to extract and utilize discriminative features from local regions, facilitating knowledge transfer from labeled to unlabeled classes. Furthermore, SemiPL strengthens representation learning and knowledge transfer with proxy-guided supervised learning and proxy-guided contrastive learning. Such techniques leverage class distribution information in the embedding space, improving the mining of subtle differences between labeled and unlabeled ultra-fine-grained classes. Extensive experiments demonstrate that RAPL significantly outperforms baselines across various datasets, indicating its effectiveness in handling the challenges of UFG-NCD.

Code is available at https://github.com/SSDUT-Caiyq/UFG-NCD.
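
The abstract does not spell out the training objective; as a generic illustration of proxy-guided contrastive learning (not the RAPL implementation), one learnable proxy per class can serve as a contrastive anchor. All names and values below are invented for the sketch.

```python
import torch
import torch.nn.functional as F

def proxy_contrastive_loss(features, labels, proxies, temperature=0.1):
    """Generic proxy-guided contrastive objective: pull each embedding
    toward its class proxy and away from every other proxy.
    Illustrative only; not the RAPL implementation."""
    features = F.normalize(features, dim=1)        # (B, D) embeddings
    proxies = F.normalize(proxies, dim=1)          # (C, D) one proxy per class
    logits = features @ proxies.t() / temperature  # (B, C) scaled cosine similarities
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for a backbone's output:
feats = torch.randn(8, 128)
labels = torch.randint(0, 10, (8,))
proxies = torch.nn.Parameter(torch.randn(10, 128))
loss = proxy_contrastive_loss(feats, labels, proxies)
```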

Type of Publication: conference paper

Title of Conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024

Authors: Liu, Yu; Cai, Yaqi; Jia, Qi; Qiu, Binglin; Wang, Weimin; Pu, Nan

Riemannian Multinomial Logistics Regression for SPD Neural Networks

Abstract: Deep neural networks for learning Symmetric Positive Definite (SPD) matrices are gaining increasing attention in machine learning. Despite the significant progress, most existing SPD networks use traditional Euclidean classifiers on an approximated space rather than intrinsic classifiers that accurately capture the geometry of SPD manifolds. Inspired by Hyperbolic Neural Networks (HNNs), we propose Riemannian Multinomial Logistics Regression (RMLR) for the classification layers in SPD networks. We introduce a unified framework for building Riemannian classifiers under the metrics pulled back from the Euclidean space, and showcase our framework under the parameterized Log-Euclidean Metric (LEM) and Log-Cholesky Metric (LCM). Besides, our framework offers a novel intrinsic explanation for the most popular LogEig classifier in existing SPD networks. The effectiveness of our method is demonstrated in three applications: radar recognition, human action recognition, and electroencephalography (EEG) classification.

The code is available at https://github.com/GitZH-Chen/SPDMLR.git.
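
A rough sketch of the idea, following the HNN-style construction the abstract cites: Euclidean MLR scores a class by an inner product against a hyperplane, and the Riemannian version moves that computation into the tangent space of the manifold. This is a plausible reading of the abstract, not the paper's exact parameterization.

```latex
% Euclidean multinomial logistic regression, class k:
%   p(y = k \mid x) \propto \exp(\langle a_k,\, x - p_k \rangle)
% A Riemannian analogue on the SPD manifold replaces the inner product
% with one taken in the tangent space at a class point P_k:
p(y = k \mid S) \;\propto\; \exp\!\big( \langle \operatorname{Log}_{P_k}(S),\, A_k \rangle_{P_k} \big)
% Under a metric pulled back from Euclidean space via the matrix
% logarithm (LEM), this reduces, up to a reparameterization of A_k, to
p(y = k \mid S) \;\propto\; \exp\!\big( \langle \log(S) - \log(P_k),\, A_k \rangle \big)
% which recovers the standard LogEig classifier when all P_k = I.
```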

Type of Publication: conference paper

Title of Conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024

Authors: Chen, Ziheng; Song, Yue; Liu, Gaowen; Rao Kompella, Ramana; Wu, Xiao-Jun; Sebe, Nicu

OpenBias: Open-set Bias Detection in Generative Models

Abstract: Text-to-image generative models are becoming increasingly popular and accessible to the general public. As these models see large-scale deployments, it is necessary to deeply investigate their safety and fairness so as not to disseminate and perpetuate any kind of biases. However, existing works focus on detecting closed sets of biases defined a priori, limiting the studies to well-known concepts. In this paper, we tackle the challenge of open-set bias detection in text-to-image generative models, presenting OpenBias, a new pipeline that identifies and quantifies the severity of biases agnostically, without access to any precompiled set. OpenBias has three stages. In the first stage, we leverage a Large Language Model (LLM) to propose biases given a set of captions. In the second, the target generative model produces images using the same set of captions. Lastly, a Vision Question Answering model recognizes the presence and extent of the previously proposed biases. We study the behavior of Stable Diffusion 1.5, 2, and XL, emphasizing new biases never investigated before. Via quantitative experiments, we demonstrate that OpenBias agrees with current closed-set bias detection methods and human judgement.
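
The three-stage structure is concrete enough to sketch as a skeleton. The callables `propose_biases`, `generate_image`, and `answer` below are hypothetical stand-ins for the LLM, the text-to-image model, and the VQA model; none of OpenBias's actual interfaces are reproduced.

```python
from collections import Counter

def openbias_pipeline(captions, propose_biases, generate_image, answer):
    """Skeleton of the three stages described in the abstract; the three
    callables are placeholders, not OpenBias's real components."""
    # Stage 1: the LLM proposes candidate biases for the caption set, e.g.
    # {"gender": ("Is the person a man or a woman?", ["man", "woman"])}.
    proposals = propose_biases(captions)

    counts = {name: Counter() for name in proposals}
    for caption in captions:
        # Stage 2: the target generative model renders the same caption.
        image = generate_image(caption)
        # Stage 3: the VQA model reports which attribute is depicted.
        for name, (question, attributes) in proposals.items():
            counts[name][answer(image, question, attributes)] += 1

    # A heavily skewed attribute histogram signals a more severe bias.
    return counts
```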

Type of Publication: conference paper

Title of Conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024

Authors: D’Incà, Moreno; Peruzzo, Elia; Mancini, Massimiliano; Xu, Dejia; Goel, Vidit; Xu, Xingqian; Wang, Zhangyang; Shi, Humphrey; Sebe, Nicu

SpectralCLIP: Preventing Artifacts in Text-Guided Style Transfer from a Spectral Perspective

Abstract: Owing to the power of vision-language foundation models, e.g., CLIP, the area of image synthesis has seen recent important advances. In particular, for style transfer, CLIP enables transferring more general and abstract styles without collecting style images in advance, as the style can be efficiently described with natural language, and the result is optimized by maximizing the CLIP similarity between the text description and the stylized image. However, directly using CLIP to guide style transfer leads to undesirable artifacts (mainly written words and unrelated visual entities) spread over the image. In this paper, we propose SpectralCLIP, which is based on a spectral representation of the CLIP embedding sequence, where most of the common artifacts occupy specific frequencies. By masking the band including these frequencies, we can condition the generation process to adhere to the target style properties (e.g., color, texture, paint stroke) while excluding the generation of larger-scale structures corresponding to the artifacts. Experimental results show that SpectralCLIP prevents the generation of artifacts effectively in quantitative and qualitative terms, without impairing the stylization quality. We also apply SpectralCLIP to text-conditioned image generation and show that it prevents written words in the generated images.

Our code is available at https://github.com/zipengxuc/SpectralCLIP.
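
The core operation, filtering a frequency band out of the CLIP token-embedding sequence, can be sketched generically. The FFT axis and the band indices below are assumptions made for illustration, not the paper's actual filter.

```python
import torch

def mask_frequency_band(tokens, band=(4, 12)):
    """Suppress a band of frequencies in a token-embedding sequence.
    `tokens` is (seq_len, dim); the FFT runs along the sequence axis.
    The band indices are illustrative only."""
    spectrum = torch.fft.rfft(tokens, dim=0)   # (seq_len // 2 + 1, dim), complex
    lo, hi = band
    spectrum[lo:hi] = 0                        # zero the band where artifacts concentrate
    return torch.fft.irfft(spectrum, n=tokens.shape[0], dim=0)

# Example: 77 CLIP tokens with 768-dimensional embeddings.
filtered = mask_frequency_band(torch.randn(77, 768))
```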

Type of Publication: conference paper

Title of Conference: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

Authors: Xu, Zipeng; Xing, Songlong; Sangineto, Enver; Sebe, Nicu

Multifidelity Gaussian Process Emulation for Atmospheric Radiative Transfer Models

Abstract: Atmospheric radiative transfer models (RTMs) are widely used in satellite data processing to correct for the scattering and absorption effects caused by aerosols and gas molecules in the Earth’s atmosphere. As the complexity of RTMs grows and the requirements for future Earth Observation missions become more demanding, the conventional lookup-table (LUT) interpolation approach faces important challenges. Emulators have been suggested as an alternative to LUT interpolation, but they are still too slow for operational satellite data processing. Our research introduces a solution that harnesses the power of multifidelity methods to improve the accuracy and runtime of Gaussian process (GP) emulators. We investigate the impact of the number of fidelity layers, dimensionality reduction, and training dataset size on the performance of multifidelity GP emulators. We find that an optimal multifidelity emulator can achieve relative errors in surface reflectance below 0.5% and performs atmospheric correction of hyperspectral PRISMA satellite data (one million pixels) in a few minutes. Additionally, we provide a suite of functions and tools for automating the creation and generation of atmospheric RTM emulators.
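
As a minimal illustration of the multifidelity idea (not the paper's emulator), the classic two-fidelity AR(1) scheme combines a GP trained on many cheap low-fidelity runs with a GP on the residual at a few expensive high-fidelity points. The toy functions and sample sizes below are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Two-fidelity sketch of the Kennedy-O'Hagan AR(1) scheme:
#   f_high(x) ~= rho * f_low(x) + delta(x)
rng = np.random.default_rng(0)
X_lo = rng.uniform(0, 1, (200, 1))       # cheap runs: many samples
X_hi = X_lo[:20]                         # expensive runs: few samples
f_lo = lambda x: np.sin(6 * x).ravel()   # toy stand-ins for low/high-fidelity RTMs
f_hi = lambda x: 1.1 * np.sin(6 * x).ravel() + 0.2 * x.ravel()

gp_lo = GaussianProcessRegressor().fit(X_lo, f_lo(X_lo))

# Estimate the scale rho by least squares, then fit a GP to the residual.
lo_at_hi = gp_lo.predict(X_hi)
rho = np.dot(lo_at_hi, f_hi(X_hi)) / np.dot(lo_at_hi, lo_at_hi)
gp_delta = GaussianProcessRegressor().fit(X_hi, f_hi(X_hi) - rho * lo_at_hi)

def predict_high(X):
    """High-fidelity prediction from the combined multifidelity model."""
    return rho * gp_lo.predict(X) + gp_delta.predict(X)
```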

Type of Publication: publication

Title of Journal: IEEE Transactions on Geoscience and Remote Sensing, 61, 1-10, 2023.

Authors: Vicent Servera, Jorge; Martino, Luca; Verrelst, Jochem; Camps-Valls, Gustau

Towards Optimal Trade-offs in Knowledge Distillation for CNNs and Vision Transformers at the Edge

Abstract: This paper discusses four facets of the Knowledge Distillation (KD) process for Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) architectures, particularly when executed on edge devices with constrained processing capabilities. First, we conduct a comparative analysis of the KD process between CNNs and ViT architectures, aiming to elucidate the feasibility and efficacy of employing different architectural configurations for the teacher and student, while assessing their performance and efficiency. Second, we explore the impact of varying the size of the student model on accuracy and inference speed, while maintaining a constant KD duration. Third, we examine the effects of employing higher resolution images on the accuracy, memory footprint and computational workload. Last, we examine the performance improvements obtained by fine-tuning the student model after KD to specific downstream tasks. Through empirical evaluations and analyses, this research provides AI practitioners with insights into optimal strategies for maximizing the effectiveness of the KD process on edge devices.
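
For reference, the standard Hinton-style KD objective that such studies build on blends a temperature-softened teacher-student KL term with the ordinary cross-entropy on ground-truth labels. The temperature and mixing weight below are common defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard knowledge distillation loss (a generic sketch)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),  # softened student distribution
        F.softmax(teacher_logits / T, dim=1),      # softened teacher distribution
        reduction="batchmean",
    ) * (T * T)                                    # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```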

Type of Publication: Conference Proceeding

Title of Conference: 32nd European Signal Processing Conference (EUSIPCO 2024), Lyon, France, 27 August 2024 (Session: Signal and Data Analytics for Machine Learning, Part 1)

Authors: John Violos; Symeon Papadopoulos; Ioannis Kompatsiaris

Federated Generalized Category Discovery

Abstract: Generalized category discovery (GCD) aims at grouping unlabeled samples from known and unknown classes, given labeled data of known classes. To meet the recent decentralization trend in the community, we introduce a practical yet challenging task, Federated GCD (Fed-GCD), where the training data are distributed among local clients and cannot be shared among clients. Fed-GCD aims to train a generic GCD model by client collaboration under a privacy-protection constraint. Fed-GCD leads to two challenges: 1) representation degradation caused by training each client model with less data than in centralized GCD learning, and 2) highly heterogeneous label spaces across different clients. To this end, we propose a novel Associated Gaussian Contrastive Learning (AGCL) framework based on learnable GMMs, which consists of Client Semantics Association (CSA) and global-local GMM Contrastive Learning (GCL). On the server, CSA aggregates the heterogeneous categories of local-client GMMs to generate a global GMM containing more comprehensive category knowledge. On each client, GCL builds class-level contrastive learning with both local and global GMMs. The local GCL learns robust representations with limited local data. The global GCL encourages the model to produce more discriminative representations using comprehensive category relationships that may not exist in local data. We build a benchmark based on six visual datasets to facilitate the study of Fed-GCD. Extensive experiments show that our AGCL outperforms multiple baselines on all datasets. Code is available at https://github.com/TPCD/FedGCD.
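
A minimal sketch of class-level contrast against GMM component means, assuming normalized embeddings and ignoring the covariances and server-side aggregation that AGCL actually performs; all names are invented for the sketch.

```python
import torch
import torch.nn.functional as F

def gaussian_prototype_contrast(z, assignments, means, temperature=0.1):
    """Contrast embeddings against Gaussian component means, a generic
    stand-in for the local/global GMM contrast described above."""
    z = F.normalize(z, dim=1)             # (B, D) client embeddings
    means = F.normalize(means, dim=1)     # (K, D) GMM component means
    logits = z @ means.t() / temperature  # similarity to every component
    return F.cross_entropy(logits, assignments)

# On each client the loss would be applied twice: once with the local
# GMM's means and once with the global GMM broadcast from the server.
```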

Type of Publication: Conference Proceeding

Title of Conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Authors: Nan Pu; Wenjing Li; Xingyuan Ji; Yalan Qin; Nicu Sebe; Zhun Zhong

A Characterization Theorem for Equivariant Networks with Point-wise Activations

Abstract: Equivariant neural networks have shown improved performance, expressiveness, and sample complexity on symmetrical domains. But for some specific symmetries, representations, and choices of coordinates, the most common point-wise activations, such as ReLU, are not equivariant, hence they cannot be employed in the design of equivariant neural networks. The theorem we present in this paper describes all possible combinations of representations, choices of coordinates, and point-wise activations that yield an equivariant layer, generalizing and strengthening existing characterizations. Notable cases of practical relevance are discussed as corollaries. Indeed, we prove that rotation-equivariant networks can only be invariant, as happens for any network that is equivariant with respect to connected compact groups. We then discuss the implications of our findings for important instances of equivariant networks. First, we completely characterize permutation-equivariant networks such as Invariant Graph Networks with point-wise nonlinearities and their geometric counterparts, highlighting a plethora of models whose expressive power and performance are still unknown. Second, we show that the feature spaces of disentangled steerable convolutional neural networks are trivial representations.
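
The equivariance condition at stake can be stated in one line; the ReLU example is illustrative, not taken from the paper.

```latex
% A point-wise activation \sigma : \mathbb{R} \to \mathbb{R} is applied
% entry-wise to x \in V. The resulting layer is equivariant for
% representations (\rho, V) and (\rho', W) of a group G when
\sigma\big(\rho(g)\,x\big) \;=\; \rho'(g)\,\sigma(x)
\qquad \forall\, g \in G,\ \forall\, x \in V.
% ReLU satisfies this for permutation matrices \rho(g), since permuting
% entries commutes with an entry-wise map, but not for generic rotations,
% which mix entries; the theorem characterizes exactly which triples of
% representation, coordinates, and activation are admissible.
```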

Type of Publication: Conference Proceeding

Title of Conference: International Conference on Learning Representations 2024 (ICLR 2024)

Authors: Marco Pacini; Bruno Lepri; Xiaowen Dong; Gabriele Santin