Xin Jin 金鑫

I am an Assistant Professor at the College of Information Science and Technology, Eastern Institute of Technology (EIT), Ningbo, where I work with Professor Wenjun Zeng (IEEE Fellow). Our group conducts cutting-edge research in computer vision and multimedia, with a recent focus on spatial and embodied AI.

Previously, I was a visiting scholar at the Learning and Vision (LV) Lab of the National University of Singapore, advised by Professors Xinchao Wang, Jiashi Feng, and Shuicheng Yan. I received my Ph.D. from the University of Science and Technology of China (USTC) under the supervision of Zhibo Chen. From Jan. 2019 to Jul. 2020, I worked in the Intelligent Multimedia Group (IMG) at MSRA under the supervision of Cuiling Lan, and from Sep. 2018 to Jan. 2019 at KDDI Research, Inc. in Japan under the supervision of Jianfeng Xu.

If you are highly creative, have strong research and coding skills, and are interested in joining us, please do not hesitate to send your CV to me at jinxin@eitech.edu.cn.

Email: jinxin@eitech.edu.cn  /  Google Scholar  /  GitHub

News

• [06/2025] Four papers accepted by ICCV 2025!

• [03/2025] Three papers accepted by CVPR 2025!

• [02/2025] Two papers accepted by ICLR 2025 (Oral) and IJCAI 2025!

• [10/2024] Three papers accepted by NeurIPS 2024 (including one spotlight)!

Research

At ICCV 2025, we organized the 1st International Workshop and Challenge on Disentangled Representation Learning for Controllable Generation. At CVPR 2024 and ECCV 2024, we organized two tutorials on "Visual Disentanglement and Compositionality".

    DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin†
arXiv, 2025
paper / code

We recast the vision–language–action model as a perception–prediction–action model, making it explicitly predict a compact set of dynamic, spatial, and high-level semantic information that supplies concise yet comprehensive look-ahead cues for planning.

    Behavior Foundation Model: Towards Next-Generation Whole-Body Control System of Humanoid Robots
    Mingqi Yuan, Tao Yu, Wenqi Ge, Xiuyong Yao, Dapeng Li, Huijiang Wang, Jiayu Chen, Xin Jin†, Bo Li, Hua Chen, Wei Zhang, Wenjun Zeng
arXiv, 2025
    paper

    This survey outlines the concept of Behavior Foundation Models (BFMs) for humanoid whole-body control, detailing large-scale pre-training workflows that enable reusable motion primitives and zero-shot adaptation across diverse tasks. It highlights challenges and potential in applying BFMs to real-world systems and curates a growing repository of related works.

    SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
    Zekun Qi*, Wenyao Zhang*, Yufei Ding*, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
arXiv, 2025
    paper / code

We introduce the concept of semantic orientation, which represents object orientation conditioned on open-vocabulary language.

    Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions
    Liang Xu, Chengqun Yang, Zili Lin, Fei Xu, Yifan Liu, Congsheng Xu, Yiyi Zhang, Jie Qin, Xingdong Sheng, Yunhui Liu, Xin Jin, Yichao Yan, Wenjun Zeng, Xiaokang Yang
    ICCV, 2025
    paper

    We tackle the problem of how to build and benchmark a large motion model with video action datasets and disentangled rule-based annotations.

    ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning
    Mingqi Yuan, Bo Li, Xin Jin†, Wenjun Zeng
    ICCV, 2025
    paper

ULTHO formulates hyperparameter optimization for deep RL as a multi-armed bandit over clustered hyperparameter configurations, enabling efficient early pruning of underperforming runs and focusing compute on promising settings. Evaluated on benchmarks such as ALE, Procgen, MiniGrid, and PyBullet, ULTHO achieves near-optimal performance using just a fraction of the usual computational budget.
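For intuition, here is a minimal, hypothetical sketch of the clustered-bandit idea (the names and the UCB selection rule are illustrative assumptions, not ULTHO's actual interface):

```python
import math

# Hypothetical clusters of hyperparameter values to tune during training.
clusters = {"lr": [1e-4, 3e-4, 1e-3], "ent_coef": [0.0, 0.01, 0.1]}
# Per-arm statistics: [pull count, running mean of episodic return].
stats = {(k, i): [0, 0.0] for k in clusters for i in range(len(clusters[k]))}

def select_arm(t, c=2.0):
    # Pick the (cluster, index) arm with the highest UCB score at step t >= 1.
    def ucb(arm):
        n, mean = stats[arm]
        return float("inf") if n == 0 else mean + c * math.sqrt(math.log(t) / n)
    return max(stats, key=ucb)

def update(arm, episodic_return):
    # Incrementally update the chosen arm's count and mean return.
    n, mean = stats[arm]
    stats[arm] = [n + 1, mean + (episodic_return - mean) / (n + 1)]
```

Arms whose returns lag behind stop being selected, which is what prunes underperforming configurations early.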

    Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
    Qi Wang, Zhipeng Zhang, Baao Xie, Xin Jin†, Yunbo Wang, Shiyu Wang, Liaomo Zheng, Xiaokang Yang, Wenjun Zeng
    ICCV, 2025
paper / code

    Disentangled World Models (DisWM) introduce a model-based RL framework that leverages offline pretraining on distracting videos with disentanglement regularization and offline-to-online latent distillation. This enables transfer of semantic knowledge for improved sample efficiency and generalization in downstream reinforcement learning tasks in visually varied environments.

Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
    Wenyao Zhang*, Hongsi Liu*, Bohan Li*, Jiawei He, Zekun Qi, Yunnan Wang, Shengyang Zhao, Xinqiang Yu, Wenjun Zeng, Xin Jin
    ICCV, 2025

We propose Hybrid-depth, a novel framework that systematically integrates foundation models (CLIP and DINO) to extract visual priors and acquire sufficient contextual information for self-supervised monocular depth estimation.

    Bridging Past and Future: End-to-End Autonomous Driving with Historical Prediction and Planning
    Bozhou Zhang, Nan Song, Xin Jin, Li Zhang
    CVPR, 2025
paper

BridgeAD introduces multi-step motion and planning queries that integrate historical prediction into both the perception and planning modules, thereby "bridging" past and future. With this structure, the model unifies historical insight with current perception and future trajectory planning, achieving state-of-the-art performance on nuScenes in both open-loop and closed-loop settings.

    Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
    Junyan Lin, Haoran Chen, Yue Fan, Yingqi Fan, Xin Jin†, Hui Su, Jinlan Fu†, Xiaoyu Shen
    CVPR, 2025
    paper / code

This paper systematically investigates multi-layer visual feature fusion in multimodal large language models, analyzing optimal layer selection and four fusion strategies. Results show that external direct fusion of features from distinct encoder stages consistently offers the best generalization and stability.
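As background, "external direct fusion" can be pictured as concatenating features from several encoder layers outside the LLM and projecting them jointly; the sketch below is a generic illustration (dimensions and layer choices are assumptions, not the paper's exact setup):

```python
import torch
import torch.nn as nn

class ExternalDirectFusion(nn.Module):
    # Concatenate visual features from multiple encoder layers, then
    # project the fused tokens into the LLM embedding space.
    def __init__(self, vis_dim=1024, llm_dim=4096, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(vis_dim * num_layers, llm_dim)

    def forward(self, layer_feats):  # list of (B, N, vis_dim) tensors
        fused = torch.cat(layer_feats, dim=-1)  # (B, N, vis_dim * num_layers)
        return self.proj(fused)                 # (B, N, llm_dim)

# Usage: fuse one shallow and one deep layer of the vision encoder.
feats = [torch.randn(1, 196, 1024), torch.randn(1, 196, 1024)]
tokens = ExternalDirectFusion()(feats)  # -> (1, 196, 4096)
```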

UniScene: Unified Occupancy-centric Driving Scene Generation
    Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, Shuchang Zhou, Li Zhang, Xiaojuan Qi, Hao Zhao, Mu Yang, Wenjun Zeng, Xin Jin†
    CVPR, 2025
    paper / code

UniScene proposes the first occupancy-centric hierarchical framework that consecutively generates semantic occupancy, multi-view video, and LiDAR data from coarse BEV layouts, using Gaussian-based joint rendering and prior-guided sparse modeling. This approach significantly outperforms previous methods in fidelity and versatility across all three modalities, benefiting downstream perception tasks.

    Open-World Reinforcement Learning over Long Short-Term Imagination
    Jiajian Li*, Qi Wang*, Yunbo Wang, Xin Jin, Yang Li, Wenjun Zeng, Xiaokang Yang
ICLR, 2025 (Oral)
    paper / code

    We present LS-Imagine, a model-based RL framework that constructs a long short-term world model combining step-by-step and jumpy transitions, guided by affordance maps to enable goal-conditioned exploration in open-world environments. This approach significantly improves exploration efficiency and sample efficiency in complex, high-dimensional tasks.

    Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation
    Yunnan Wang, Ziqiang Li, Wenyao Zhang, Zequn Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, Xin Jin†
NeurIPS, 2024 (Spotlight)
    paper / code

This work presents DisCo, a framework that uses scene graphs as structured conditions for generating complex images. It disentangles layouts and semantics via a Semantics-Layout VAE, composes them with a diffusion-based Compositional Masked Attention, and enables isolated graph-guided editing through a Multi-Layered Sampler, achieving state-of-the-art generalization across diverse scene complexities.

    Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models
    Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin†, Wenjun Zeng
    NeurIPS, 2024
paper

This work introduces a bidirectional weighted graph framework that integrates β-VAE to extract latent factors and leverages multimodal LLMs to detect and weight semantic correlations, leading to fine-grained, interpretable disentanglement and strong reconstruction performance.
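The β-VAE component follows the standard objective, reproduced here for reference (a generic formulation, not the paper's code):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    # Reconstruction term plus a beta-weighted KL divergence; beta > 1
    # pressures the latent factors toward disentanglement.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```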

    Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning
    Qi Wang*, Junming Yang*, Yunbo Wang, Xin Jin, Wenjun Zeng, Xiaokang Yang
    NeurIPS, 2024
    paper / code

This paper presents CoWorld, a model-based reinforcement learning framework that bridges offline and online domains by leveraging auxiliary simulators as "test beds." It aligns latent state and reward distributions across domains and introduces min-max value constraints to mitigate value overestimation, yielding substantial improvements over existing offline visual RL methods.

    Discrete Point-wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition
    Qian Li*, Yuxiao Hu*, Ye Liu, Dongxiao Zhang, Xin Jin†, Yuntian Chen†
CVPR, 2023
arXiv

    In this work, by rethinking the inherent relationship between the face of target identity and its variants, we introduce a new pipeline of Generalized Manifold Adversarial Attack (GMAA) to achieve a better attack performance by expanding the attack range.

    Task Residual for Tuning Vision-Language Models
    Tao Yu*, Zhihe Lu*, Xin Jin, Zhibo Chen, Xinchao Wang
CVPR, 2023
arXiv / code

In this work, we propose an efficient tuning approach for VLMs named Task Residual Tuning (TaskRes), which operates directly on the text-based classifier and explicitly decouples the prior knowledge of the pre-trained model from the new knowledge of a target task.
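The core mechanism can be sketched in a few lines: the text-based classifier stays frozen as prior knowledge, and only an additive residual is learned for the target task (a minimal sketch under these assumptions, not the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskRes(nn.Module):
    def __init__(self, text_weights, alpha=0.5):  # text_weights: (num_classes, dim)
        super().__init__()
        self.register_buffer("base", text_weights)                    # frozen prior knowledge
        self.residual = nn.Parameter(torch.zeros_like(text_weights))  # learnable task knowledge
        self.alpha = alpha

    def forward(self, image_feats):  # (B, dim) L2-normalized image features
        w = F.normalize(self.base + self.alpha * self.residual, dim=-1)
        return image_feats @ w.t()   # class logits
```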

    Learning Distortion Invariant Representation for Image Restoration from A Causality Perspective
    Xin Li*, Bingchen Li*, Xin Jin, Cuiling Lan, Zhibo Chen
CVPR, 2023
arXiv / code

In this work, we are the first to propose a training strategy for image restoration from a causality perspective, improving the generalization ability of DNNs to unknown degradations.

    Deliberated Domain Bridging for Domain Adaptive Semantic Segmentation
Lin Chen*, Zhixiang Wei*, Xin Jin* (*equal contribution), Huaian Chen, Kai Chen, Yi Jin
NeurIPS, 2022
arXiv / code

In this work, we resort to data mixing to establish a deliberated domain bridging (DDB) for domain adaptive semantic segmentation, in which the joint distributions of the source and target domains are aligned and allowed to interact in an intermediate space.
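As a loose illustration of cross-domain data mixing (DDB's actual bridging strategies are more deliberate, e.g., region- and class-level mixing; everything below is an assumption-laden toy example):

```python
import torch

def cross_domain_mix(src_img, src_lbl, tgt_img, tgt_lbl):
    # Paste a random crop of a source image (and its label) into a target
    # image, producing an intermediate-domain training sample.
    # src_img/tgt_img: (C, H, W); src_lbl/tgt_lbl: (H, W) class indices.
    _, h, w = src_img.shape
    ch, cw = h // 2, w // 2                        # crop size (illustrative)
    y = torch.randint(0, h - ch + 1, (1,)).item()
    x = torch.randint(0, w - cw + 1, (1,)).item()
    mixed_img, mixed_lbl = tgt_img.clone(), tgt_lbl.clone()
    mixed_img[:, y:y + ch, x:x + cw] = src_img[:, y:y + ch, x:x + cw]
    mixed_lbl[y:y + ch, x:x + cw] = src_lbl[y:y + ch, x:x + cw]
    return mixed_img, mixed_lbl
```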

    Image Coding for Machines with Omnipotent Feature Learning
Ruoyu Feng*, Xin Jin* (*equal contribution), Zongyu Guo, Runsen Feng, Yixin Gao, Tianyu He, Zhizheng Zhang, Simeng Sun, Zhibo Chen
ECCV, 2022
arXiv

    In this paper, we attempt to learn a kind of omnipotent feature that is both general (for AI tasks) and compact (for compression) for Image Coding for Machines (ICM). Considering self-supervised learning (SSL) improves feature generalization, we integrate it with the compression task to learn such features.

    Learning with Recoverable Forgetting
    Jingwen Ye, Yifang Fu, Jie Song, Xingyi Yang, Songhua Liu, Xin Jin, Mingli Song, Xinchao Wang
ECCV, 2022
arXiv

In this paper, we explore a novel learning scheme, termed Learning wIth Recoverable Forgetting (LIRF), that explicitly handles task- or sample-specific knowledge removal and recovery.

    Meta Clustering Learning for Large-scale Unsupervised Person Re-identification
Xin Jin, Tianyu He, Xu Shen, Tongliang Liu, Xinchao Wang, Jianqiang Huang, Zhibo Chen, Xian-Sheng Hua
    ACMMM, 2022  
arXiv

In this paper, we make an attempt at large-scale unsupervised ReID and propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL), which significantly saves computational cost while achieving comparable or even better performance than prior works.

    Unleashing the Potential of Unsupervised Pre-Training with Intra-Identity Regularization for Person Re-Identification
    Zizheng Yang, Xin Jin, Kecheng Zheng, Feng Zhao
CVPR, 2022
arXiv / code

    We design an Unsupervised Pre-training framework for ReID based on the contrastive learning (CL) pipeline, dubbed UP-ReID.

    Cloth-Changing Person Re-identification from A Single Image with Gait Prediction and Regularization
Xin Jin, Tianyu He, Kecheng Zheng, Zhiheng Ying, Xu Shen, Zhen Huang, Ruoyu Feng, Jianqiang Huang, Xian-Sheng Hua, Zhibo Chen
CVPR, 2022
arXiv / code

We focus on handling the cloth-changing ReID problem under a more challenging setting, i.e., from just a single image, which enables high-efficiency, latency-free pedestrian identification for real-time surveillance applications.

    Reusing the Task-specific Classifier as a Discriminator: Discriminator-free Adversarial Domain Adaptation
    Lin Chen, Huaian Chen, Zhixiang Wei, Xin Jin, Xiao Tan, Yi Jin, Enhong Chen
CVPR, 2022
arXiv / code

    We address the adversarial-based DA problem from a different perspective and design a simple yet effective adversarial paradigm in the form of a discriminator-free adversarial learning network (DALN), wherein the category classifier is reused as a discriminator.
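One way to picture "reusing the classifier as a discriminator" is to compare a norm of its prediction matrices across domains; the snippet below is only a rough, hedged sketch inspired by the paper's nuclear-norm discrepancy, with details deliberately simplified:

```python
import torch

def prediction_discrepancy(src_logits, tgt_logits):
    # Difference of nuclear norms of the softmax prediction matrices on the
    # two domains; serves as an adversarial signal without a separate
    # discriminator network (simplified illustration, not DALN's exact loss).
    p_s = torch.softmax(src_logits, dim=1)
    p_t = torch.softmax(tgt_logits, dim=1)
    return torch.linalg.matrix_norm(p_s, ord="nuc") - torch.linalg.matrix_norm(p_t, ord="nuc")
```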

    Dual Prior Learning for Blind and Blended Image Restoration
    Xin Jin, Li Zhang, Chaowei Shan, Xin Li, Zhibo Chen
IEEE TIP, 2021
paper

    We propose the Dual Prior Learning (DPL) method for blind image restoration by taking both image and distortion priors into account. DPL goes beyond DIP (deep image prior) by considering an additional step to explicitly learn the blended distortion prior.

Style Normalization and Restitution for Domain Generalization and Adaptation
    Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen
IEEE TMM, 2021
paper / code

    We design a novel Style Normalization and Restitution module (SNR) to simultaneously ensure both high generalization and discrimination capability of the networks, and evaluate it on multiple vision tasks of classification, detection, segmentation, etc.

    Re-energizing Domain Discriminator with Sample Relabeling for Adversarial Domain Adaptation
    Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen
ICCV, 2021
paper

We propose an efficient optimization strategy named Re-enforceable Adversarial Domain Adaptation (RADA), which aims to re-energize the domain discriminator during training by using dynamic domain labels.

    Dense Interaction Learning for Video-based Person Re-identification
    Tianyu He, Xin Jin, Xu Shen, Jianqiang Huang, Zhibo Chen, Xian-Sheng Hua
ICCV, 2021 (Oral)
paper

    This paper proposes a hybrid framework, Dense Interaction Learning (DenseIL), that takes the principal advantages of both CNN-based and Attention-based architectures to tackle video-based person re-ID difficulties.

    Learning Omni-frequency Region-adaptive Representations for Real Image Super-Resolution
Xin Li*, Xin Jin*, Tao Yu, Yingxue Pang, Simeng Sun, Zhizheng Zhang, Zhibo Chen (*equal contribution)
AAAI, 2021
arXiv

The key to solving the more challenging real image super-resolution (RealSR) problem lies in learning feature representations that are both informative and content-aware. We propose an Omni-frequency Region-adaptive Network (OR-Net), where features spanning the low, middle, and high frequencies are collectively termed omni-frequency features.

    Global Distance-distributions Separation for Unsupervised Person Re-identification
    Xin Jin, Jiawei Liu, Cuiling Lan, Wenjun Zeng, Zhibo Chen
ECCV, 2020
paper

    We introduce a global distance-distributions separation (GDS) constraint over the two distributions to encourage the clear separation of positive and negative samples from a global view.
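In spirit, the constraint pushes the means of the two distance distributions apart while keeping each distribution sharp; here is a simplified sketch (not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def gds_style_loss(pos_dists, neg_dists, margin=1.0):
    # pos_dists / neg_dists: 1-D tensors of positive- and negative-pair
    # distances gathered from the whole batch (each with >= 2 entries).
    mu_p, mu_n = pos_dists.mean(), neg_dists.mean()
    sep = F.softplus(margin - (mu_n - mu_p))        # separate the two means
    return sep + pos_dists.var() + neg_dists.var()  # keep each distribution tight
```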

    Learning Disentangled Feature Representation for Hybrid-distorted Image Restoration
Xin Li, Xin Jin, Jianxin Lin, Tao Yu, Sen Liu, Yaojun Wu, Wei Zhou, Zhibo Chen
ECCV, 2020
paper

    We introduce the concept of Disentangled Feature Learning to achieve the feature-level divide-and-conquer of hybrid distortions for low-level enhancement.

Style Normalization and Restitution for Generalizable Person Re-identification
    Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen, Li Zhang
CVPR, 2020
paper / code

We propose a simple yet effective Style Normalization and Restitution (SNR) module. Specifically, we filter out style variations (e.g., illumination, color contrast) with Instance Normalization (IN). However, such a process inevitably removes discriminative information. We therefore distill identity-relevant features from the removed information and restitute them to the network to ensure high discrimination.
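The mechanism can be sketched as follows: Instance Normalization removes style, and a learned gate restitutes the identity-relevant part of what was removed (a simplified sketch; the full SNR module uses a dual restitution with causality-inspired losses):

```python
import torch
import torch.nn as nn

class SNRBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(channels, affine=True)
        self.gate = nn.Sequential(          # channel attention over the residual
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (B, C, H, W) features
        normed = self.inorm(x)            # style variations filtered out
        residual = x - normed             # information discarded by IN
        a = self.gate(residual)           # estimate the identity-relevant part
        return normed + a * residual      # restitute it to keep discrimination
```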

    Relation-Aware Global Attention
Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, Zhibo Chen
CVPR, 2020
paper / code

    We propose an effective Relation-Aware Global Attention (RGA) module which captures the global structural information for better attention learning.

    Semantics-aligned representation learning for person re-identification
    Xin Jin, Cuiling Lan, Wenjun Zeng, Guoqiang Wei, Zhibo Chen
AAAI, 2020
paper / code

We build a Semantics Aligning Network (SAN), which consists of a base network as encoder (SA-Enc) for re-ID and a decoder (SA-Dec) for reconstructing densely semantics-aligned full-texture images.

    Uncertainty-Aware Multi-Shot Knowledge Distillation for Image-Based Object Re-Identification
    Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen
AAAI, 2020
paper

    We propose exploiting the multi-shots of the same identity to guide the feature learning of each individual image. Specifically, we design an Uncertainty-aware Multi-shot Teacher-Student (UMTS) Network.

    Region Normalization for Image Inpainting
Tao Yu, Zongyu Guo, Xin Jin, Shilin Wu, Zhibo Chen, Weiping Li, Zhizheng Zhang, Sen Liu
AAAI, 2020
paper / code

We show that the mean and variance shifts caused by full-spatial feature normalization (FN) limit image inpainting network training, and we propose a spatial region-wise normalization, named Region Normalization (RN), to overcome this limitation.
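Conceptually, RN normalizes the corrupted and uncorrupted regions with separate statistics so that hole pixels do not skew the statistics of valid ones; here is a minimal sketch (the paper's RN additionally learns affine parameters and region-aware variants):

```python
import torch

def region_normalize(x, mask, eps=1e-5):
    # x: (B, C, H, W) features; mask: (B, 1, H, W) float, 1 = valid, 0 = hole.
    out = torch.zeros_like(x)
    for region in (mask, 1 - mask):  # normalize each region independently
        cnt = region.sum(dim=(2, 3), keepdim=True).clamp(min=1)
        mean = (x * region).sum(dim=(2, 3), keepdim=True) / cnt
        var = ((x - mean) ** 2 * region).sum(dim=(2, 3), keepdim=True) / cnt
        out = out + region * (x - mean) / (var + eps).sqrt()
    return out
```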

    Academic Services
Invited Reviewer for IEEE TIP, IEEE TNNLS, IEEE TCSVT, and Pattern Recognition

    Invited Reviewer for NeurIPS-2022, ECCV-2022, ACMMM-2022, CVPR-2022, AAAI-2022 (PC), ICCV-2021, CVPR-2021, AAAI-2021, ACMMM-2020, VCIP-2020, etc.

    Feel free to steal this website's source code.