XIAO WANG

2005 Songhu Rd., Shanghai, China · xiao_wang20 [at] fudan [dot] edu [dot] cn

Hello! 👋

I am a PhD graduate from the Fudan University NLP Group, where I was co-advised by Prof. Qi Zhang and Prof. Xuanjing Huang. My doctoral research focused on trustworthy large language models, particularly robustness and safety, as well as continual learning and information extraction.

I later joined Zhipu AI, where I worked on post-training for large language models, including reasoning, coding, reward modeling, and reinforcement learning. I was also a core contributor to the development of GLM Zero, GLM 4.5, GLM 4.6, and GLM 4.7. I am currently with Xiaohongshu Hi Lab, working on Search Agents with a focus on long-horizon reasoning, agent systems, synthetic data, and reinforcement learning.

More broadly, I am interested in scalable oversight and the path toward AGI. Across both academia and industry, I am particularly interested in how stronger models and agents can be built through better supervision, better data, and more scalable training. If you are working on related topics, feel free to reach out.


Research

A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

Zheng Chu*, Xiao Wang*†, Jack Hong*, Huiming Fan, Yuqi Huang, Yue Yang, Guohai Xu, Chenxiao Zhao, Cheng Xiang, Ming Liu, Bing Qin, Xing Yu
arXiv, 2026

We present REDSearcher, a unified framework for optimizing long-horizon search agents through task synthesis, mid-training, and post-training. The framework improves search-oriented reasoning, planning, tool use, and reinforcement learning efficiency, and achieves state-of-the-art results on both text-only and multimodal search-agent benchmarks.

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Weikang Zhou*, Xiao Wang*†, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, et al.
Under Review, 2024

We introduce EasyJailbreak, a unified framework that simplifies the construction and evaluation of jailbreak attacks against LLMs. Its modular design enables researchers to easily build attacks from combinations of novel and existing components. EasyJailbreak currently supports 11 distinct jailbreak methods and facilitates security validation across a broad spectrum of LLMs. Our validation on 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.

CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

Huijie Lv*, Xiao Wang*†, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
Under Review, 2024

We delve into the mechanisms behind jailbreak attacks and introduce CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent-security-recognition phase, we reformulate tasks into a code-completion format, enabling users to encrypt queries with personalized encryption functions. To guarantee response generation, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries.

Navigating the OverKill in Large Language Models

Chenyu Shi*, Xiao Wang*, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, Dahua Lin
ACL, 2024

We define the exaggerated safety behaviors of LLMs as "overkill" and conduct a detailed analysis of this phenomenon, from the basics downward. We find that the model's understanding of user queries is superficial and that it relies on shortcuts in its internal attention mechanism. Based on this, we propose Self-CD, a simple, effective, and model-agnostic method that requires no training yet significantly reduces the model's refusal rate.
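The contrastive-decoding idea behind Self-CD can be illustrated with a minimal sketch. Here `logits_plain` and `logits_emphasized` stand for next-token logits obtained without and with an amplified safety system prompt, and `alpha` is a contrast strength; the names and the exact combination rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def self_contrastive_decode(logits_plain, logits_emphasized, alpha=1.0):
    """Contrast two next-token distributions so that tokens boosted
    mainly by the safety emphasis (e.g. refusal tokens) are downweighted."""
    contrasted = (1 + alpha) * logits_plain - alpha * logits_emphasized
    # Stable softmax to return a proper probability distribution.
    z = contrasted - contrasted.max()
    p = np.exp(z)
    return p / p.sum()
```

For example, if the safety-emphasized prompt sharply boosts a refusal token, the contrast pushes decoding back toward the plain answer token.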

Shadow Alignment: The Ease of Subverting Safely-aligned Language Models

Xianjun Yang*, Xiao Wang*, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin
SeT LLM @ ICLR, 2023

By simply tuning on 100 malicious examples with 1 GPU hour, open-source safely-aligned LLMs can easily be subverted to generate harmful content. We term this new attack Shadow Alignment: a tiny amount of data can elicit safely-aligned models to adapt to harmful tasks without sacrificing model helpfulness.

TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models

Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin, Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, et al.
Under Review, 2023

We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs. We conduct systematic analysis experiments on TRACE using six different aligned models ranging in size from 7B to 70B. Our experiments show that after training on TRACE, aligned LLMs exhibit significant declines in both general ability and instruction-following capabilities.

Orthogonal Subspace Learning for Language Model Continual Learning

Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, Xuanjing Huang
Findings of EMNLP, 2023

We propose orthogonal low-rank adaptation (O-LoRA), a simple and efficient approach for continual learning in language models, effectively mitigating catastrophic forgetting while learning new tasks. Specifically, O-LoRA learns tasks in different (low-rank) vector subspaces that are kept orthogonal to each other in order to minimize interference.
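The orthogonality constraint can be sketched as a penalty on cross-products between the new task's low-rank subspace and those of earlier tasks. The variable names and the unweighted sum below are illustrative assumptions; the paper's loss may weight or formulate the terms differently.

```python
import numpy as np

def orthogonality_penalty(A_new, prev_As):
    """Squared-Frobenius-norm penalty encouraging the row space of the
    new task's LoRA A matrix (r x d) to stay orthogonal to the row
    spaces of previously learned tasks' A matrices."""
    return sum(float(np.sum((A_new @ A.T) ** 2)) for A in prev_As)
```

Driving this penalty toward zero while training on a new task keeps its updates in a subspace orthogonal to earlier tasks, which is what mitigates interference and catastrophic forgetting.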

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Tao Gui, et al.
ICLR, 2024

Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains.

RoCoIns: Enhancing Robustness of Large Language Models through Code-Style Instructions

Yuansen Zhang*, Xiao Wang*†, Zhiheng Xi, Han Xia, Tao Gui, Qi Zhang, Bingning Wang, Xuanjing Huang
COLING, 2024

Drawing inspiration from recent work showing that LLMs are sensitive to instruction design, we use instructions in code style, which are more structured and less ambiguous, in place of typical natural-language instructions.

InstructUIE: Multi-task Instruction Tuning for Unified Information Extraction

Xiao Wang, Weikang Zhou, Can Zu, Han Xia, Tianze Chen, Yuansen Zhang, Rui Zheng, Qi Zhang, Tao Gui, et al.
Under Review, 2023

We propose InstructUIE, a unified information extraction framework based on instruction tuning, which can uniformly model various information extraction tasks and capture the inter-task dependency. To validate the proposed method, we introduce IE INSTRUCTIONS, a benchmark of 32 diverse information extraction datasets in a unified text-to-text format with expert-written instructions.

Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

Xiao Wang, Weikang Zhou, Qi Zhang, Jie Zhou, Songyang Gao, Junzhe Wang, Tao Gui
Findings of ACL, 2023

We propose Influence Subset Selection (ISS) for language models, which explicitly uses end-task knowledge to select a tiny subset of the pretraining corpus. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperforms pretrained models such as RoBERTa.

MINER: Improving Out-of-Vocabulary Named Entity Recognition from an Information Theoretic Perspective

Xiao Wang, Shihan Dou, Limao Xiong, Yicheng Zou, Qi Zhang, Tao Gui, Xuanjing Huang
ACL, 2022

We propose MINER to remedy the out-of-vocabulary entity recognition issue from an information-theoretic perspective. The approach contains two mutual-information-based training objectives: i) generalizing-information maximization, which enhances representations via a deep understanding of context and entity surface forms; and ii) superfluous-information minimization, which discourages representations from rote-memorizing entity names or exploiting biased cues in the data.

TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing

Xiao Wang, Qin Liu, Tao Gui, Qi Zhang, Yicheng Zou, Xin Zhou, Jiacheng Ye, Yongxin Zhang, Rui Zheng, Zexiong Pang, et al.
ACL Demo, 2021

TextFlint is a multilingual robustness evaluation toolkit for NLP tasks that incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analyses.

The Rise and Potential of Large Language Model Based Agents: A Survey

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, et al.
arXiv preprint arXiv:2309.07864, 2023

This paper provides a comprehensive and systematic overview of LLM-based agents, discussing the potential challenges and opportunities in this flourishing field. We hope our efforts can provide inspiration to the community and facilitate research in related fields.

DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization

Songyang Gao, Shihan Dou, Yan Liu, Xiao Wang, Qi Zhang, Zhongyu Wei, Jin Ma, Ying Shan
ACL, 2023

Our procedure, distribution shift risk minimization (DSRM), estimates the adversarial loss by perturbing the input data’s probability distribution rather than their embeddings. This formulation results in a robust model that minimizes the expected global loss under adversarial attacks.

A Confidence-based Partial Label Learning Model for Crowd-Annotated Named Entity Recognition

Limao Xiong, Jie Zhou, Qunxi Zhu, Xiao Wang, Yuanbin Wu, Qi Zhang, Tao Gui, Xuanjing Huang
Findings of ACL, 2023

We propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER. This model learns a token- and content-dependent confidence via an Expectation–Maximization (EM) algorithm by minimizing empirical risk.


Projects

EasyJailbreak

A Unified Framework for Jailbreaking Large Language Models.

I led the EasyJailbreak project, focusing on creating a framework to evaluate jailbreak attacks on LLMs. My work included surveying jailbreak methods, forming a taxonomy, and contributing to the framework's design. Additionally, I ensured comprehensive documentation, making our research accessible to others. This effort streamlined security assessments across various LLMs.

TextFlint

A multilingual robustness evaluation platform for NLP models.

As the team leader, I directly engaged in and oversaw the entire project development. My key contributions included designing robustness evaluation methods, architecting the project's code frameworks and foundational components, and leading the creation and organization of comprehensive project documentation. Additionally, I managed the implementation of continuous integration processes, ensuring project efficiency and effectiveness.

MXNet

A deep learning framework designed for both efficiency and flexibility.

I contributed to developing the Bi-LSTM model's source code, focusing on enhancing its efficiency and functionality. Additionally, I designed and conducted unit tests to ensure the model's reliability and performance.


Education

Fudan University

Doctor of Philosophy
Computer Science, Co-advised by Prof. Qi Zhang and Prof. Xuanjing Huang.

University of Science and Technology of China

Master of Science
Software Engineering, Advised by Prof. Xi Li.

China University of Mining and Technology-Beijing

Bachelor of Science
Electrical Engineering and Automation

Experience

Shanghai AI Lab

Research Scientist Intern, Mentors: Xun Zhao and Dahua Lin.
  • LLM Safety: Studied the safety vulnerabilities of open-source LLMs, including low-cost alignment subversion and misuse induced through in-context learning.
  • LLM Alignment: Investigated the trade-off between harmlessness and helpfulness, with a focus on mitigating over-refusal while preserving model utility.

Zhipu AI

Researcher, Post-training Group.
  • Code Reasoning: Led post-training for code reasoning models through large-scale SFT/RL data construction, unit-test-driven reward design, and training pipeline optimization, delivering substantial improvements on code benchmarks.
  • Inference Time Scaling: Developed data engines and search strategies for test-time scaling across math and coding tasks, including on-policy data expansion and parallel search, improving reasoning performance under controlled inference budgets.
  • Reward Modeling: Built outcome- and process-level reward modeling pipelines, GenRM, and scalable training infrastructure for long-horizon tasks, improving both supervision quality and training efficiency.

Xiaohongshu

Research Scientist, Hi Lab, Search Agent.
  • Data Synthesis: Designed scalable task and trajectory synthesis pipelines for long-horizon search, with explicit control over task complexity, evidence dispersion, and tool-use requirements to generate high-quality supervision for search agents.
  • Mid-training: Improved core agent capabilities through targeted mid-training on knowledge-intensive reasoning, planning, and function calling, strengthening the foundation needed for efficient search-agent learning.
  • Post-training: Advanced search-agent post-training with reinforcement learning in simulated environments, enabling efficient algorithm iteration and strong gains on both text-only and multimodal long-horizon search benchmarks.