GraphMLP: A graph MLP-like architecture for 3D human pose estimation

Abstract: Modern multi-layer perceptron (MLP) models have shown competitive results in learning visual representations without self-attention. However, existing MLP models are not good at capturing local details and lack prior knowledge of human body configurations, which limits their modeling power for skeletal representation learning. To address these issues, we propose a simple yet effective graph-reinforced MLP-Like architecture, named GraphMLP, that combines MLPs and graph convolutional networks (GCNs) in a global-local-graphical unified architecture for 3D human pose estimation. GraphMLP incorporates the graph structure of human bodies into an MLP model to meet the domain-specific demand of the 3D human pose, while allowing for both local and global spatial interactions. Furthermore, we propose to flexibly and efficiently extend the GraphMLP to the video domain and show that complex temporal dynamics can be effectively modeled in a simple way with negligible computational cost gains in the sequence length. To the best of our knowledge, this is the first MLP-Like architecture for 3D human pose estimation in a single frame and a video sequence. Extensive experiments show that the proposed GraphMLP achieves state-of-the-art performance on two datasets, i.e., Human3.6M and MPI-INF-3DHP. Code and models are available at https://github.com/Vegetebird/GraphMLP.

Type of Publication: Journal article

Title of Journal: Pattern Recognition, 158, 2025.

Authors: Li, Wenhao; Liu, Mengyuan; Liu, Hong; Guo, Tianyu; Wang, Ti; Tang, Hao; Sebe, Nicu

HYRE: Hybrid Regressor for 3D Human Pose and Shape Estimation

Abstract: Regression-based 3D human pose and shape estimation often falls into one of two different paradigms. Parametric approaches, which regress the parameters of a human body model, tend to produce physically plausible results that are misaligned between image and mesh. In contrast, non-parametric approaches directly regress human mesh vertices, resulting in pixel-aligned but unreasonable predictions. In this paper, we consider these two paradigms together for a better overall estimation. To this end, we propose a novel HYbrid REgressor (HYRE) that greatly benefits from the joint learning of both paradigms. The core of our HYRE is a hybrid intermediary across paradigms that provides complementary clues to each paradigm at the shared feature level and fuses their results at the part-based decision level, thereby bridging the gap between the two. We demonstrate the effectiveness of the proposed method through both quantitative and qualitative experimental analyses, resulting in improvements for each approach and ultimately leading to better hybrid results. Our experiments show that HYRE outperforms previous methods on challenging 3D human pose and shape benchmarks.

Type of Publication: Journal article

Title of Journal: IEEE Transactions on Image Processing, 34(1), 235-246, 2025.

Authors: Li, Wenhao; Liu, Mengyuan; Liu, Hong; Ren, Bin; Li, Xia; You, Yingxuan; Sebe, Nicu

Connectivity-Driven Pseudo-Labeling Makes Stronger Cross-Domain Segmenters

Abstract: Presently, pseudo-labeling stands as a prevailing approach in cross-domain semantic segmentation, enhancing model efficacy by training with pixels assigned reliable pseudo-labels. However, we identify two key limitations within this paradigm: (1) under relatively severe domain shifts, most selected reliable pixels appear speckled and remain noisy; (2) when dealing with wild data, some pixels belonging to the open-set class may exhibit high confidence and also appear speckled. These two points make it difficult for the pixel-level selection mechanism to identify and correct this speckled closed- and open-set noise. As a result, error accumulation is continuously introduced into subsequent self-training, leading to inefficiencies in pseudo-labeling. To address these limitations, we propose a novel method called Semantic Connectivity-driven Pseudo-labeling (SeCo). SeCo formulates pseudo-labels at the connectivity level, which makes it easier to locate and correct closed- and open-set noise. Specifically, SeCo comprises two key components: Pixel Semantic Aggregation (PSA) and Semantic Connectivity Correction (SCC). Initially, PSA categorizes semantics into “stuff” and “things” categories and aggregates speckled pseudo-labels into semantic connectivity through efficient interaction with the Segment Anything Model (SAM). This not only yields accurate boundaries but also simplifies noise localization. Subsequently, SCC introduces a simple connectivity classification task, which enables us to locate and correct connectivity noise with the guidance of loss distribution. Extensive experiments demonstrate that SeCo can be flexibly applied to various cross-domain semantic segmentation tasks, i.e., domain generalization and domain adaptation, including source-free and black-box domain adaptation, significantly improving the performance of existing state-of-the-art methods. The code is available at https://github.com/DZhaoXd/SeCo.

Type of Publication: Conference paper

Title of Journal: Neural Information Processing Systems (NeurIPS), December 2024

Authors: Zhao, Dong; Zang, Qi; Wang, Shuang; Sebe, Nicu; Zhong, Zhun

Prototypical Hash Encoding for On-the-Fly Fine-Grained Category Discovery

Abstract: In this paper, we study a practical yet challenging task, On-the-fly Category Discovery (OCD), which aims to discover, online, newly arriving stream data that belong to both known and unknown classes, by leveraging only the known-category knowledge contained in labeled data. Previous OCD methods employ a hash-based technique to represent old/new categories by hash codes for instance-wise inference. However, directly mapping features into a low-dimensional hash space not only inevitably damages the ability to distinguish classes but also causes a “high sensitivity” issue, especially for fine-grained classes, leading to inferior performance. To address these issues, we propose a novel Prototypical Hash Encoding (PHE) framework consisting of Category-aware Prototype Generation (CPG) and Discriminative Category Encoding (DCE) to mitigate the sensitivity of hash codes while preserving the rich discriminative information contained in the high-dimensional feature space, in a two-stage projection fashion. CPG enables the model to fully capture the intra-category diversity by representing each category with multiple prototypes. DCE boosts the discrimination ability of hash codes with the guidance of the generated category prototypes and the constraint of minimum separation distance. By jointly optimizing CPG and DCE, we demonstrate that these two components are mutually beneficial towards an effective OCD. Extensive experiments show the significant superiority of our PHE over previous methods, e.g., obtaining an improvement of +5.3% in ALL ACC averaged on all datasets. Moreover, due to the nature of the interpretable prototypes, we visually analyze the underlying mechanism of how PHE helps group certain samples into either known or unknown categories. Code is available at https://github.com/HaiyangZheng/PHE

Type of Publication: Conference paper

Title of Journal: Neural Information Processing Systems (NeurIPS), December 2024

Authors: Zheng, Haiyang; Pu, Nan; Li, Wenjing; Sebe, Nicu; Zhong, Zhun
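The “high sensitivity” issue mentioned in the PHE abstract above can be illustrated with a small sketch. It assumes hash codes are obtained by taking the sign of each entry of a low-dimensional feature vector, which is a common choice but is not stated in the abstract; the function names and sample values are hypothetical and are not taken from the PHE code base.

```python
def hash_code(features):
    """Binarise a feature vector by the sign of each entry (assumed hashing rule)."""
    return tuple(1 if value >= 0 else -1 for value in features)


def hamming_distance(code_a, code_b):
    """Count the positions at which two hash codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Two hypothetical samples whose features are almost identical, but whose
# last entry lies on opposite sides of zero.
sample_a = [0.80, -0.60, 0.02]
sample_b = [0.80, -0.60, -0.02]

code_a = hash_code(sample_a)
code_b = hash_code(sample_b)

# Under instance-wise inference, where each distinct code stands for a
# category, the two samples would be assigned to different categories.
codes_differ = code_a != code_b
bits_flipped = hamming_distance(code_a, code_b)
```

A tiny change in one feature flips one bit, so the two samples no longer share a code. This is the kind of sensitivity the abstract says the prototype-based encoding is meant to mitigate.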

Sharing Key Semantics in Transformer Makes Efficient Image Restoration

Abstract: Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the emergence of Vision Transformers (ViTs) has further propelled these advancements. When computing, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. This inclusivity introduces computational inefficiencies, particularly noticeable with high input resolution, as it requires processing irrelevant information, thereby impeding efficiency. Additionally, for IR, it is commonly noted that small segments of a degraded image, particularly those closely aligned semantically, provide particularly relevant information to aid in the restoration process, as they contribute essential contextual cues crucial for accurate reconstruction. To address these challenges, we propose boosting IR’s performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper. Specifically, SemanIR initially constructs a sparse yet comprehensive key-semantic dictionary within each transformer stage by establishing essential semantic connections for every degraded patch. Subsequently, this dictionary is shared across all subsequent transformer blocks within the same stage. This strategy optimizes attention calculation within each block by focusing exclusively on semantically related components stored in the key-semantic dictionary. As a result, attention calculation achieves linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed SemanIR’s state-of-the-art performance, quantitatively and qualitatively showcasing advancements. The visual results, code, and trained models are available at https://github.com/Amazingren/SemanIR

Type of Publication: Conference paper

Title of Journal: Neural Information Processing Systems (NeurIPS), December 2024

Authors: Ren, Bin; Li, Yawei; Liang, Jingyun; Ranjan, Rakesh; Liu, Mengyuan; Cucchiara, Rita; van Gool, Luc; Yang, Ming-Hsuan; Sebe, Nicu
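The dictionary-sharing idea in the SemanIR abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: semantic connections are approximated here by the top-k dot-product neighbours of each patch, learned query/key/value projections are omitted, and all function names are hypothetical.

```python
import math


def build_key_semantic_dictionary(features, k):
    """For each patch, store the indices of its k most similar patches.

    Similarity is a plain dot product (an assumption made for this sketch).
    """
    dictionary = []
    for query in features:
        scores = [sum(q * f for q, f in zip(query, other)) for other in features]
        ranked = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)
        dictionary.append(ranked[:k])
    return dictionary


def sparse_attention(features, dictionary):
    """Softmax attention where each patch attends only to its dictionary entries."""
    dim = len(features[0])
    outputs = []
    for i, query in enumerate(features):
        neighbours = dictionary[i]
        logits = [
            sum(q * f for q, f in zip(query, features[j])) / math.sqrt(dim)
            for j in neighbours
        ]
        peak = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        outputs.append([
            sum(w / total * features[j][d] for w, j in zip(weights, neighbours))
            for d in range(dim)
        ])
    return outputs


def run_stage(features, k, num_blocks):
    """Build the dictionary once, then reuse it in every block of the stage."""
    dictionary = build_key_semantic_dictionary(features, k)
    for _ in range(num_blocks):
        features = sparse_attention(features, dictionary)
    return features, dictionary


patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
restored, shared_dictionary = run_stage(patches, k=2, num_blocks=2)
```

Because each patch attends to a fixed number k of entries, the attention cost in each block grows linearly with the number of patches, which matches the linear-complexity claim in the abstract. In this sketch the quadratic similarity search is paid only once per stage, when the dictionary is built.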

LESS: Label-Efficient and Single-Stage Referring 3D Segmentation

Abstract: Referring 3D Segmentation is a visual-language task that segments all points of the specified object from a 3D point cloud described by a query sentence. Previous works follow a two-stage paradigm, first conducting language-agnostic instance segmentation and then matching with the given text query. However, the semantic concepts from the text query and the visual cues are handled separately during training, and both instance and semantic labels are required for each object, which is time-consuming and labor-intensive. To mitigate these issues, we propose a novel Referring 3D Segmentation pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks. Specifically, we design a Point-Word Cross-Modal Alignment module for aligning the fine-grained features of points and textual embeddings. A Query Mask Predictor module and a Query-Sentence Alignment module are introduced for coarse-grained alignment between masks and query. Furthermore, we propose an area regularization loss, which coarsely reduces irrelevant background predictions on a large scale. Besides, a point-to-point contrastive loss is proposed, concentrating on distinguishing points with subtly similar features. Through extensive experiments, we achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels. Code is available at https://github.com/mellody11/LESS

Type of Publication: Conference paper

Title of Journal: Neural Information Processing Systems (NeurIPS), December 2024

Authors: Liu, Xuexun; Xu, Xiaoxu; Li, Jinlong; Zhang, Qiudan; Wang, Xu; Sebe, Nicu; Ma, Lin

RMLR: Extending Multinomial Logistic Regression into General Geometries

Abstract: Riemannian neural networks, which extend deep learning techniques to Riemannian spaces, have gained significant attention in machine learning. To better classify the manifold-valued features, researchers have started extending Euclidean multinomial logistic regression (MLR) into Riemannian manifolds. However, existing approaches suffer from limited applicability due to their strong reliance on specific geometric properties. This paper proposes a framework for designing Riemannian MLR over general geometries, referred to as RMLR. Our framework only requires minimal geometric properties, thus exhibiting broad applicability and enabling its use with a wide range of geometries. Specifically, we showcase our framework on the Symmetric Positive Definite (SPD) manifold and special orthogonal group SO(n), i.e., the set of rotation matrices in R^n. On the SPD manifold, we develop five families of SPD MLRs under five types of power-deformed metrics. On SO(n), we propose Lie MLR based on the popular bi-invariant metric. Extensive experiments on different Riemannian backbone networks validate the effectiveness of our framework. The code is available at https://github.com/GitZH-Chen/RMLR

Type of Publication: Conference paper

Title of Journal: Neural Information Processing Systems (NeurIPS), December 2024

Authors: Chen, Ziheng; Song, Yue; Wang, Rui; Wu, Xiao-Jun; Sebe, Nicu

Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning

Abstract: Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones. However, in 3D point cloud pre-training with ViTs, masked autoencoder (MAE) modeling remains dominant. This raises the question: Can we take the best of both worlds? To answer this question, we first empirically validate that integrating MAE-based point cloud pre-training with the standard contrastive learning paradigm, even with meticulous design, can lead to a decrease in performance. To address this limitation, we reintroduce CL into the MAE-based point cloud pre-training paradigm by leveraging the inherent contrastive properties of MAE. Specifically, rather than relying on extensive data augmentation as commonly used in the image domain, we randomly mask the input tokens twice to generate contrastive input pairs. Subsequently, a weight-sharing encoder and two identically structured decoders are utilized to perform masked token reconstruction. Additionally, we propose that for an input token masked by both masks simultaneously, the reconstructed features should be as similar as possible. This naturally establishes an explicit contrastive constraint within the generative MAE-based pre-training paradigm, resulting in our proposed method, Point-CMAE. Consequently, Point-CMAE effectively enhances the representation quality and transfer performance compared to its MAE counterpart. Experimental evaluations across various downstream applications, including classification, part segmentation, and few-shot learning, demonstrate the efficacy of our framework in surpassing state-of-the-art techniques under standard ViTs and single-modal settings. The source code and trained models are available at https://github.com/Amazingren/Point-CMAE

Type of Publication: Conference paper

Title of Journal: Asian Conference on Computer Vision (ACCV), December 2024

Authors: Ren, Bin; Mei, Guofeng; Pani Paudel, Danda; Wang, Weijie; Li, Yawei; Liu, Mengyuan; Cucchiara, Rita; van Gool, Luc; Sebe, Nicu
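The double-masking step described in the Point-CMAE abstract above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the encoder and decoders are left out, "as similar as possible" is modelled with a cosine-similarity penalty, and the function names, mask ratio and token count are hypothetical.

```python
import math
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Pick a random subset of token indices to mask."""
    num_masked = int(num_tokens * mask_ratio)
    return set(rng.sample(range(num_tokens), num_masked))


def double_mask(num_tokens, mask_ratio, seed=0):
    """Mask the same token sequence twice and report the doubly masked tokens."""
    rng = random.Random(seed)
    mask_a = random_mask(num_tokens, mask_ratio, rng)
    mask_b = random_mask(num_tokens, mask_ratio, rng)
    return mask_a, mask_b, mask_a & mask_b


def consistency_loss(recon_a, recon_b, shared, eps=1e-12):
    """Mean (1 - cosine similarity) over tokens masked in both views.

    recon_a and recon_b hold one reconstructed feature vector per token,
    produced by the two decoders (not implemented in this sketch).
    """
    if not shared:
        return 0.0
    total = 0.0
    for index in shared:
        vec_a, vec_b = recon_a[index], recon_b[index]
        dot = sum(a * b for a, b in zip(vec_a, vec_b))
        norm_a = math.sqrt(sum(a * a for a in vec_a))
        norm_b = math.sqrt(sum(b * b for b in vec_b))
        total += 1.0 - dot / (norm_a * norm_b + eps)
    return total / len(shared)


mask_a, mask_b, shared = double_mask(num_tokens=64, mask_ratio=0.6, seed=0)
```

In a full pipeline, this term would be added to the usual reconstruction losses of the two masked views, so that the contrastive signal comes from the masking itself rather than from extra data augmentation.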