Xin Jin 金鑫

I am an Assistant Professor at the College of Information Science and Technology, Eastern Institute of Technology (EIT), Ningbo, where I work with Professor Wenjun Zeng (IEEE Fellow). Our group conducts cutting-edge research in computer vision and multimedia, with a recent focus on spatial and embodied AI.

Previously, I was a visiting scholar at the Learning and Vision (LV) Lab of the National University of Singapore, advised by Professors Xinchao Wang, Jiashi Feng, and Shuicheng Yan. I received my Ph.D. from the University of Science and Technology of China (USTC) under the supervision of Zhibo Chen. From Jan. 2019 to Jul. 2020, I worked in the Intelligent Multimedia Group (IMG) at MSRA under the supervision of Cuiling Lan, and from Sep. 2018 to Jan. 2019 at KDDI Research, Inc. in Japan under the supervision of Jianfeng Xu.

If you are highly creative, have strong research and coding skills, and are interested in joining us, please do not hesitate to send your CV to me at jinxin@eitech.edu.cn.

Email: jinxin@eitech.edu.cn  /  Google Scholar  /  GitHub

News

• [06/2025] Four papers accepted by ICCV 2025!

• [03/2025] Three papers accepted by CVPR 2025!

• [02/2025] Two papers accepted by ICLR 2025 (Oral) and IJCAI 2025!

• [10/2024] Three papers accepted by NeurIPS 2024 (including one spotlight)!

Research

At ICCV 2025, we organized the 1st International Workshop and Challenge on Disentangled Representation Learning for Controllable Generation. At CVPR 2024 and ECCV 2024, we organized two tutorials on "Visual Disentanglement and Compositionality".

    DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin†
arXiv, 2025
paper / code

We recast the vision–language–action model as a perception–prediction–action model, making it explicitly predict a compact set of dynamic, spatial, and high-level semantic information that supplies concise yet comprehensive look-ahead cues for planning.

    Behavior Foundation Model: Towards Next-Generation Whole-Body Control System of Humanoid Robots
    Mingqi Yuan, Tao Yu, Wenqi Ge, Xiuyong Yao, Dapeng Li, Huijiang Wang, Jiayu Chen, Xin Jin†, Bo Li, Hua Chen, Wei Zhang, Wenjun Zeng
arXiv, 2025
    paper

    This survey outlines the concept of Behavior Foundation Models (BFMs) for humanoid whole-body control, detailing large-scale pre-training workflows that enable reusable motion primitives and zero-shot adaptation across diverse tasks. It highlights challenges and potential in applying BFMs to real-world systems and curates a growing repository of related works.

    SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
    Zekun Qi*, Wenyao Zhang*, Yufei Ding*, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
arXiv, 2025
    paper / code

We introduce the concept of semantic orientation, which represents object orientation conditioned on open-vocabulary language.

    Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions
    Liang Xu, Chengqun Yang, Zili Lin, Fei Xu, Yifan Liu, Congsheng Xu, Yiyi Zhang, Jie Qin, Xingdong Sheng, Yunhui Liu, Xin Jin, Yichao Yan, Wenjun Zeng, Xiaokang Yang
    ICCV, 2025
    paper

    We tackle the problem of how to build and benchmark a large motion model with video action datasets and disentangled rule-based annotations.

    ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning
    Mingqi Yuan, Bo Li, Xin Jin†, Wenjun Zeng
    ICCV, 2025
    paper

ULTHO formulates hyperparameter optimization for deep RL as a multi-armed bandit over clustered hyperparameter configurations, enabling efficient early pruning of underperforming runs and focusing compute on promising settings. Evaluated on benchmarks such as ALE, Procgen, MiniGrid, and PyBullet, ULTHO achieves near-optimal performance using just a fraction of the usual computational budget.
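For intuition, here is a minimal, hypothetical sketch of the clustered-bandit idea (the names and the UCB selection rule are illustrative assumptions, not ULTHO's actual interface):

```python
import math

# Hypothetical clusters of hyperparameter values to tune during training.
clusters = {"lr": [1e-4, 3e-4, 1e-3], "ent_coef": [0.0, 0.01, 0.1]}
# Per-arm statistics: [pull count, running mean of episodic return].
stats = {(k, i): [0, 0.0] for k in clusters for i in range(len(clusters[k]))}

def select_arm(t, c=2.0):
    # Pick the (cluster, index) arm with the highest UCB score at step t >= 1.
    def ucb(arm):
        n, mean = stats[arm]
        return float("inf") if n == 0 else mean + c * math.sqrt(math.log(t) / n)
    return max(stats, key=ucb)

def update(arm, episodic_return):
    # Incrementally update the chosen arm's count and mean return.
    n, mean = stats[arm]
    stats[arm] = [n + 1, mean + (episodic_return - mean) / (n + 1)]
```

Arms whose returns lag behind stop being selected, which is what prunes underperforming configurations early.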

    Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
    Qi Wang, Zhipeng Zhang, Baao Xie, Xin Jin†, Yunbo Wang, Shiyu Wang, Liaomo Zheng, Xiaokang Yang, Wenjun Zeng
    ICCV, 2025
paper / code

    Disentangled World Models (DisWM) introduce a model-based RL framework that leverages offline pretraining on distracting videos with disentanglement regularization and offline-to-online latent distillation. This enables transfer of semantic knowledge for improved sample efficiency and generalization in downstream reinforcement learning tasks in visually varied environments.

Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
    Wenyao Zhang*, Hongsi Liu*, Bohan Li*, Jiawei He, Zekun Qi, Yunnan Wang, Shengyang Zhao, Xinqiang Yu, Wenjun Zeng, Xin Jin
    ICCV, 2025

We propose Hybrid-depth, a novel framework that systematically integrates foundation models (CLIP and DINO) to extract visual priors and acquire sufficient contextual information for self-supervised monocular depth estimation.

    Bridging Past and Future: End-to-End Autonomous Driving with Historical Prediction and Planning
    Bozhou Zhang, Nan Song, Xin Jin, Li Zhang
    CVPR, 2025
paper

BridgeAD introduces multi-step motion and planning queries that integrate historical prediction into both the perception and planning modules, thereby "bridging" past and future. With this structure, the model unifies historical insight with current perception and future trajectory planning, achieving state-of-the-art performance on nuScenes in both open-loop and closed-loop settings.

    Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
    Junyan Lin, Haoran Chen, Yue Fan, Yingqi Fan, Xin Jin†, Hui Su, Jinlan Fu†, Xiaoyu Shen
    CVPR, 2025
    paper / code

This paper systematically investigates multi-layer visual feature fusion in multimodal large language models, analyzing optimal layer selection and four fusion strategies. Results show that external direct fusion of features from distinct encoder stages consistently offers the best generalization and stability.
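As background, "external direct fusion" can be pictured as concatenating features from several encoder layers outside the LLM and projecting them jointly; the sketch below is a generic illustration (dimensions and layer choices are assumptions, not the paper's exact setup):

```python
import torch
import torch.nn as nn

class ExternalDirectFusion(nn.Module):
    # Concatenate visual features from multiple encoder layers, then
    # project the fused tokens into the LLM embedding space.
    def __init__(self, vis_dim=1024, llm_dim=4096, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(vis_dim * num_layers, llm_dim)

    def forward(self, layer_feats):  # list of (B, N, vis_dim) tensors
        fused = torch.cat(layer_feats, dim=-1)  # (B, N, vis_dim * num_layers)
        return self.proj(fused)                 # (B, N, llm_dim)

# Usage: fuse one shallow and one deep layer of the vision encoder.
feats = [torch.randn(1, 196, 1024), torch.randn(1, 196, 1024)]
tokens = ExternalDirectFusion()(feats)  # -> (1, 196, 4096)
```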

UniScene: Unified Occupancy-centric Driving Scene Generation
    Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, Shuchang Zhou, Li Zhang, Xiaojuan Qi, Hao Zhao, Mu Yang, Wenjun Zeng, Xin Jin†
    CVPR, 2025
    paper / code

UniScene proposes the first occupancy-centric hierarchical framework that consecutively generates semantic occupancy, multi-view video, and LiDAR data from coarse BEV layouts, using Gaussian-based joint rendering and prior-guided sparse modeling. This approach significantly outperforms previous methods in fidelity and versatility across all three modalities, benefiting downstream perception tasks.

    Open-World Reinforcement Learning over Long Short-Term Imagination
    Jiajian Li*, Qi Wang*, Yunbo Wang, Xin Jin, Yang Li, Wenjun Zeng, Xiaokang Yang
ICLR, 2025 (Oral)
    paper / code

    We present LS-Imagine, a model-based RL framework that constructs a long short-term world model combining step-by-step and jumpy transitions, guided by affordance maps to enable goal-conditioned exploration in open-world environments. This approach significantly improves exploration efficiency and sample efficiency in complex, high-dimensional tasks.

    Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation
    Yunnan Wang, Ziqiang Li, Wenyao Zhang, Zequn Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, Xin Jin†
NeurIPS, 2024 (Spotlight)
    paper / code

This work presents DisCo, a framework that uses scene graphs as structured conditions for generating complex images. It disentangles layouts and semantics via a Semantics-Layout VAE, composes them with a diffusion-based Compositional Masked Attention, and enables isolated graph-guided editing through a Multi-Layered Sampler, achieving state-of-the-art generalization across diverse scene complexities.

    Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models
    Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin†, Wenjun Zeng
    NeurIPS, 2024
paper

This work introduces a bidirectional weighted graph framework that integrates β-VAE to extract latent factors and leverages multimodal LLMs to detect and weight semantic correlations, leading to fine-grained, interpretable disentanglement and strong reconstruction performance.
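The β-VAE component follows the standard objective, reproduced here for reference (a generic formulation, not the paper's code):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    # Reconstruction term plus a beta-weighted KL divergence; beta > 1
    # pressures the latent factors toward disentanglement.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```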

    Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning
    Qi Wang*, Junming Yang*, Yunbo Wang, Xin Jin, Wenjun Zeng, Xiaokang Yang
    NeurIPS, 2024
    paper / code

This paper presents CoWorld, a model-based reinforcement learning framework that bridges offline and online domains by leveraging auxiliary simulators as "test beds." It aligns latent state and reward distributions across domains and introduces min-max value constraints to mitigate value overestimation, yielding substantial improvements over existing offline visual RL methods.

    Discrete Point-wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition
    Qian Li*, Yuxiao Hu*, Ye Liu, Dongxiao Zhang, Xin Jin†, Yuntian Chen†
CVPR, 2023
arXiv

    In this work, by rethinking the inherent relationship between the face of target identity and its variants, we introduce a new pipeline of Generalized Manifold Adversarial Attack (GMAA) to achieve a better attack performance by expanding the attack range.

    Task Residual for Tuning Vision-Language Models
    Tao Yu*, Zhihe Lu*, Xin Jin, Zhibo Chen, Xinchao Wang
CVPR, 2023
arXiv / code

In this work, we propose an efficient tuning approach for VLMs named Task Residual Tuning (TaskRes), which operates directly on the text-based classifier and explicitly decouples the prior knowledge of the pre-trained model from the new knowledge of a target task.
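The core mechanism can be sketched in a few lines: the text-based classifier stays frozen as prior knowledge, and only an additive residual is learned for the target task (a minimal sketch under these assumptions, not the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskRes(nn.Module):
    def __init__(self, text_weights, alpha=0.5):  # text_weights: (num_classes, dim)
        super().__init__()
        self.register_buffer("base", text_weights)                    # frozen prior knowledge
        self.residual = nn.Parameter(torch.zeros_like(text_weights))  # learnable task knowledge
        self.alpha = alpha

    def forward(self, image_feats):  # (B, dim) L2-normalized image features
        w = F.normalize(self.base + self.alpha * self.residual, dim=-1)
        return image_feats @ w.t()   # class logits
```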

    Learning Distortion Invariant Representation for Image Restoration from A Causality Perspective
    Xin Li*, Bingchen Li*, Xin Jin, Cuiling Lan, Zhibo Chen
CVPR, 2023
arXiv / code

In this work, we are the first to propose a training strategy for image restoration from a causality perspective, improving the generalization ability of DNNs to unknown degradations.

    Deliberated Domain Bridging for Domain Adaptive Semantic Segmentation
Lin Chen*, Zhixiang Wei*, Xin Jin* (*equal contribution), Huaian Chen, Kai Chen, Yi Jin
NeurIPS, 2022
arXiv / code

In this work, we resort to data mixing to establish a deliberated domain bridging (DDB) for domain adaptive semantic segmentation, in which the joint distributions of the source and target domains are aligned and allowed to interact in an intermediate space.
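As a loose illustration of cross-domain data mixing (DDB's actual bridging strategies are more deliberate, e.g., region- and class-level mixing; everything below is an assumption-laden toy example):

```python
import torch

def cross_domain_mix(src_img, src_lbl, tgt_img, tgt_lbl):
    # Paste a random crop of a source image (and its label) into a target
    # image, producing an intermediate-domain training sample.
    # src_img/tgt_img: (C, H, W); src_lbl/tgt_lbl: (H, W) class indices.
    _, h, w = src_img.shape
    ch, cw = h // 2, w // 2                        # crop size (illustrative)
    y = torch.randint(0, h - ch + 1, (1,)).item()
    x = torch.randint(0, w - cw + 1, (1,)).item()
    mixed_img, mixed_lbl = tgt_img.clone(), tgt_lbl.clone()
    mixed_img[:, y:y + ch, x:x + cw] = src_img[:, y:y + ch, x:x + cw]
    mixed_lbl[y:y + ch, x:x + cw] = src_lbl[y:y + ch, x:x + cw]
    return mixed_img, mixed_lbl
```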

    Image Coding for Machines with Omnipotent Feature Learning
Ruoyu Feng*, Xin Jin* (*equal contribution), Zongyu Guo, Runsen Feng, Yixin Gao, Tianyu He, Zhizheng Zhang, Simeng Sun, Zhibo Chen
ECCV, 2022
arXiv

    In this paper, we attempt to learn a kind of omnipotent feature that is both general (for AI tasks) and compact (for compression) for Image Coding for Machines (ICM). Considering self-supervised learning (SSL) improves feature generalization, we integrate it with the compression task to learn such features.

    Learning with Recoverable Forgetting
    Jingwen Ye, Yifang Fu, Jie Song, Xingyi Yang, Songhua Liu, Xin Jin, Mingli Song, Xinchao Wang
ECCV, 2022
arXiv

In this paper, we explore a novel learning scheme, termed Learning wIth Recoverable Forgetting (LIRF), that explicitly handles task- or sample-specific knowledge removal and recovery.

    Meta Clustering Learning for Large-scale Unsupervised Person Re-identification
Xin Jin, Tianyu He, Xu Shen, Tongliang Liu, Xinchao Wang, Jianqiang Huang, Zhibo Chen, Xian-Sheng Hua
    ACMMM, 2022  
arXiv

In this paper, we make an attempt at large-scale unsupervised ReID and propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL), which significantly saves computational cost while achieving comparable or even better performance than prior works.

    Unleashing the Potential of Unsupervised Pre-Training with Intra-Identity Regularization for Person Re-Identification
    Zizheng Yang, Xin Jin, Kecheng Zheng, Feng Zhao
CVPR, 2022
arXiv / code

    We design an Unsupervised Pre-training framework for ReID based on the contrastive learning (CL) pipeline, dubbed UP-ReID.

    Cloth-Changing Person Re-identification from A Single Image with Gait Prediction and Regularization
Xin Jin, Tianyu He, Kecheng Zheng, Zhiheng Ying, Xu Shen, Zhen Huang, Ruoyu Feng, Jianqiang Huang, Xian-Sheng Hua, Zhibo Chen
CVPR, 2022
arXiv / code

We focus on handling the cloth-changing ReID problem under a more challenging setting, i.e., from just a single image, which enables high-efficiency, latency-free pedestrian identification for real-time surveillance applications.

    Reusing the Task-specific Classifier as a Discriminator: Discriminator-free Adversarial Domain Adaptation
    Lin Chen, Huaian Chen, Zhixiang Wei, Xin Jin, Xiao Tan, Yi Jin, Enhong Chen
CVPR, 2022
arXiv / code

    We address the adversarial-based DA problem from a different perspective and design a simple yet effective adversarial paradigm in the form of a discriminator-free adversarial learning network (DALN), wherein the category classifier is reused as a discriminator.
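One way to picture "reusing the classifier as a discriminator" is to compare a norm of its prediction matrices across domains; the snippet below is only a rough, hedged sketch inspired by the paper's nuclear-norm discrepancy, with details deliberately simplified:

```python
import torch

def prediction_discrepancy(src_logits, tgt_logits):
    # Difference of nuclear norms of the softmax prediction matrices on the
    # two domains; serves as an adversarial signal without a separate
    # discriminator network (simplified illustration, not DALN's exact loss).
    p_s = torch.softmax(src_logits, dim=1)
    p_t = torch.softmax(tgt_logits, dim=1)
    return torch.linalg.matrix_norm(p_s, ord="nuc") - torch.linalg.matrix_norm(p_t, ord="nuc")
```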

    Dual Prior Learning for Blind and Blended Image Restoration
    Xin Jin, Li Zhang, Chaowei Shan, Xin Li, Zhibo Chen
IEEE TIP, 2021
paper

    We propose the Dual Prior Learning (DPL) method for blind image restoration by taking both image and distortion priors into account. DPL goes beyond DIP (deep image prior) by considering an additional step to explicitly learn the blended distortion prior.

Style Normalization and Restitution for Domain Generalization and Adaptation
    Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen
IEEE TMM, 2021
paper / code

    We design a novel Style Normalization and Restitution module (SNR) to simultaneously ensure both high generalization and discrimination capability of the networks, and evaluate it on multiple vision tasks of classification, detection, segmentation, etc.

    Re-energizing Domain Discriminator with Sample Relabeling for Adversarial Domain Adaptation
    Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen
ICCV, 2021
paper

We propose an efficient optimization strategy named Re-enforceable Adversarial Domain Adaptation (RADA), which aims to re-energize the domain discriminator during training by using dynamic domain labels.

    Dense Interaction Learning for Video-based Person Re-identification
    Tianyu He, Xin Jin, Xu Shen, Jianqiang Huang, Zhibo Chen, Xian-Sheng Hua
ICCV, 2021 (Oral)
paper

    This paper proposes a hybrid framework, Dense Interaction Learning (DenseIL), that takes the principal advantages of both CNN-based and Attention-based architectures to tackle video-based person re-ID difficulties.

    Learning Omni-frequency Region-adaptive Representations for Real Image Super-Resolution
Xin Li*, Xin Jin*, Tao Yu, Yingxue Pang, Simeng Sun, Zhizheng Zhang, Zhibo Chen (*equal contribution)
AAAI, 2021
arXiv

The key to solving the more challenging real image super-resolution (RealSR) problem lies in learning feature representations that are both informative and content-aware. We propose an Omni-frequency Region-adaptive Network (OR-Net), where features spanning the low, middle, and high frequencies are collectively termed omni-frequency features.

    Global Distance-distributions Separation for Unsupervised Person Re-identification
    Xin Jin, Jiawei Liu, Cuiling Lan, Wenjun Zeng, Zhibo Chen
ECCV, 2020
paper

    We introduce a global distance-distributions separation (GDS) constraint over the two distributions to encourage the clear separation of positive and negative samples from a global view.
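In spirit, the constraint pushes the means of the two distance distributions apart while keeping each distribution sharp; here is a simplified sketch (not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def gds_style_loss(pos_dists, neg_dists, margin=1.0):
    # pos_dists / neg_dists: 1-D tensors of positive- and negative-pair
    # distances gathered from the whole batch (each with >= 2 entries).
    mu_p, mu_n = pos_dists.mean(), neg_dists.mean()
    sep = F.softplus(margin - (mu_n - mu_p))        # separate the two means
    return sep + pos_dists.var() + neg_dists.var()  # keep each distribution tight
```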

    Learning Disentangled Feature Representation for Hybrid-distorted Image Restoration
Xin Li, Xin Jin, Jianxin Lin, Tao Yu, Sen Liu, Yaojun Wu, Wei Zhou, Zhibo Chen
ECCV, 2020
paper

    We introduce the concept of Disentangled Feature Learning to achieve the feature-level divide-and-conquer of hybrid distortions for low-level enhancement.

Style Normalization and Restitution for Generalizable Person Re-identification
    Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen, Li Zhang
CVPR, 2020
paper / code

We propose a simple yet effective Style Normalization and Restitution (SNR) module. Specifically, we filter out style variations (e.g., illumination, color contrast) with Instance Normalization (IN). However, such a process inevitably removes discriminative information. We therefore distill identity-relevant features from the removed information and restitute them to the network to ensure high discrimination.
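The mechanism can be sketched as follows: Instance Normalization removes style, and a learned gate restitutes the identity-relevant part of what was removed (a simplified sketch; the full SNR module uses a dual restitution with causality-inspired losses):

```python
import torch
import torch.nn as nn

class SNRBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(channels, affine=True)
        self.gate = nn.Sequential(          # channel attention over the residual
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (B, C, H, W) features
        normed = self.inorm(x)            # style variations filtered out
        residual = x - normed             # information discarded by IN
        a = self.gate(residual)           # estimate the identity-relevant part
        return normed + a * residual      # restitute it to keep discrimination
```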

    Relation-Aware Global Attention
Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, Zhibo Chen
CVPR, 2020
paper / code

    We propose an effective Relation-Aware Global Attention (RGA) module which captures the global structural information for better attention learning.

    Semantics-aligned representation learning for person re-identification
    Xin Jin, Cuiling Lan, Wenjun Zeng, Guoqiang Wei, Zhibo Chen
AAAI, 2020
paper / code

We build a Semantics Aligning Network (SAN), which consists of a base network as encoder (SA-Enc) for re-ID and a decoder (SA-Dec) for reconstructing densely semantics-aligned full-texture images.

    Uncertainty-Aware Multi-Shot Knowledge Distillation for Image-Based Object Re-Identification
    Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen
AAAI, 2020
paper

    We propose exploiting the multi-shots of the same identity to guide the feature learning of each individual image. Specifically, we design an Uncertainty-aware Multi-shot Teacher-Student (UMTS) Network.

    Region Normalization for Image Inpainting
Tao Yu, Zongyu Guo, Xin Jin, Shilin Wu, Zhibo Chen, Weiping Li, Zhizheng Zhang, Sen Liu
AAAI, 2020
paper / code

We show that the mean and variance shifts caused by full-spatial feature normalization (FN) limit image inpainting network training, and we propose a spatial region-wise normalization, named Region Normalization (RN), to overcome this limitation.
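Conceptually, RN normalizes the corrupted and uncorrupted regions with separate statistics so that hole pixels do not skew the statistics of valid ones; here is a minimal sketch (the paper's RN additionally learns affine parameters and region-aware variants):

```python
import torch

def region_normalize(x, mask, eps=1e-5):
    # x: (B, C, H, W) features; mask: (B, 1, H, W) float, 1 = valid, 0 = hole.
    out = torch.zeros_like(x)
    for region in (mask, 1 - mask):  # normalize each region independently
        cnt = region.sum(dim=(2, 3), keepdim=True).clamp(min=1)
        mean = (x * region).sum(dim=(2, 3), keepdim=True) / cnt
        var = ((x - mean) ** 2 * region).sum(dim=(2, 3), keepdim=True) / cnt
        out = out + region * (x - mean) / (var + eps).sqrt()
    return out
```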

    Academic Services
Invited Reviewer for IEEE TIP, IEEE TNNLS, IEEE TCSVT, and Pattern Recognition

    Invited Reviewer for NeurIPS-2022, ECCV-2022, ACMMM-2022, CVPR-2022, AAAI-2022 (PC), ICCV-2021, CVPR-2021, AAAI-2021, ACMMM-2020, VCIP-2020, etc.

    Feel free to steal this website's source code.