Hello! đź‘‹
I am a PhD graduate of the Fudan University NLP group, where I was co-advised by Prof. Qi Zhang and Prof. Xuanjing Huang. During my doctoral studies, my research focused on the trustworthiness of LLMs, particularly robustness and security, as well as continual learning and information extraction.
I'm currently working at Zhipu AI in the post-training group. My focus areas include Alignment, RLHF, Reasoning, and Code. I'm passionate about improving the quality of feedback mechanisms by providing finer-grained and more informative signals. A significant part of my role involves developing scalable oversight strategies to supervise more powerful models efficiently and cost-effectively. I'm optimistic about the realization of AGI and thrilled to be part of this exciting journey! 🚀
Always up for collaborating and chatting about these topics—feel free to reach out! 🤝
We introduce EasyJailbreak, a unified framework that simplifies the construction and evaluation of jailbreak attacks against LLMs. Its modular design enables researchers to easily assemble attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates security validation for a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreak attacks.
We delve into the mechanisms behind jailbreak attacks and introduce CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To evade the intent-based security recognition phase, we reformulate tasks into a code-completion format, enabling users to encrypt queries with personalized encryption functions. To guarantee response-generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully.
We define the exaggerated safety behaviours of LLMs as "Overkill" and conduct a detailed analysis of this phenomenon, moving from surface behaviour to the underlying mechanism. We find that the model's understanding of user queries is superficial and that it relies on shortcuts in its internal attention mechanism. Based on these findings, we propose Self-CD, a simple, effective, and model-agnostic method that requires no training yet significantly reduces the model's rejection rate.
By fine-tuning on just 100 malicious examples for 1 GPU hour, open-source safety-aligned LLMs can easily be subverted to generate harmful content. We term this new attack Shadow Alignment: a tiny amount of data is enough to steer safety-aligned models toward harmful tasks without sacrificing model helpfulness.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs. We conducted systematic experiments on TRACE with six different aligned models ranging in size from 7B to 70B parameters. Our experiments show that, after training on TRACE, aligned LLMs exhibit significant declines in both general ability and instruction-following capability.
We propose orthogonal low-rank adaptation (O-LoRA), a simple and efficient approach for continual learning in language models, effectively mitigating catastrophic forgetting while learning new tasks. Specifically, O-LoRA learns tasks in different (low-rank) vector subspaces that are kept orthogonal to each other in order to minimize interference.
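To make the subspace constraint concrete, here is a minimal PyTorch sketch of an orthogonality penalty between the current task's LoRA matrix and frozen matrices from earlier tasks; the shapes, norm, and weighting are illustrative assumptions rather than the paper's reference implementation.

```python
# Illustrative O-LoRA-style orthogonality penalty (assumed shapes and norm).
import torch

def orthogonality_loss(A_new: torch.Tensor, A_prev_list: list[torch.Tensor]) -> torch.Tensor:
    """Penalize overlap between the new task's LoRA subspace and frozen previous ones.

    A_new:       trainable LoRA "A" matrix for the current task, shape (r, d).
    A_prev_list: frozen LoRA "A" matrices from earlier tasks, each of shape (r_i, d).
    """
    loss = torch.tensor(0.0, device=A_new.device)
    for A_prev in A_prev_list:
        # Rows of each A matrix span that task's low-rank subspace; their cross
        # Gram matrix should be close to zero if the subspaces are orthogonal.
        overlap = A_new @ A_prev.detach().T          # (r, r_i)
        loss = loss + overlap.abs().sum()
    return loss

# During training on task t, the total objective would look like:
#   total_loss = task_loss + lambda_orth * orthogonality_loss(A_t, [A_1, ..., A_{t-1}])
```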
Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains.
Drawing inspiration from recent work showing that LLMs are sensitive to instruction design, we use code-style instructions, which are more structured and less ambiguous, in place of typical natural-language instructions.
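As a hypothetical illustration (the exact prompt template is an assumption, not the paper's), a code-style instruction might recast entity extraction as completing a Python snippet:

```python
# Hypothetical contrast between a natural-language and a code-style instruction.
natural_language_instruction = (
    "Extract all person and location entities from the sentence: "
    "'Marie Curie was born in Warsaw.'"
)

code_style_instruction = '''
class Entity:
    """An entity mention with its type."""
    def __init__(self, text: str, type: str):
        self.text = text
        self.type = type  # one of: "person", "location"

sentence = "Marie Curie was born in Warsaw."
# Complete the list of entities found in `sentence`:
entities = [
'''
# The LLM is expected to continue the code, e.g.:
#     Entity("Marie Curie", "person"),
#     Entity("Warsaw", "location"),
# ]
```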
We propose InstructUIE, a unified information extraction framework based on instruction tuning, which can uniformly model various information extraction tasks and capture the inter-task dependency. To validate the proposed method, we introduce IE INSTRUCTIONS, a benchmark of 32 diverse information extraction datasets in a unified text-to-text format with expert-written instructions.
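For illustration, a single training instance in a unified text-to-text format might look like the sketch below; the field names and wording are assumptions, not the released IE INSTRUCTIONS schema.

```python
# Assumed illustration of a unified text-to-text IE instance.
example = {
    "task": "named entity recognition",
    "instruction": "List all entity mentions in the text that belong to one of the "
                   "given categories. Output format: 'type1: word1; type2: word2'.",
    "options": ["person", "organization", "location"],
    "input": "Steve Jobs co-founded Apple in Cupertino.",
    "output": "person: Steve Jobs; organization: Apple; location: Cupertino",
}
```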
We propose Influence Subset Selection (ISS) for language models, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperforms pretrained models such as RoBERTa.
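A simplified, first-order sketch of influence-style scoring is given below: pretraining examples are ranked by how well their gradients align with an end-task gradient. The actual ISS procedure is more involved, so the function names and the approximation here are assumptions.

```python
# First-order influence approximation (assumed): gradient alignment scoring.
import torch

def influence_scores(model, pretrain_batches, end_task_loss_fn, params):
    """Score pretraining batches by gradient alignment with the end task."""
    # Gradient of the end-task loss w.r.t. the model parameters.
    task_grads = torch.autograd.grad(end_task_loss_fn(model), params)
    task_grad = torch.cat([g.flatten() for g in task_grads])

    scores = []
    for batch in pretrain_batches:
        loss = model(**batch).loss                     # per-batch pretraining loss
        grads = torch.autograd.grad(loss, params)
        grad = torch.cat([g.flatten() for g in grads])
        # Higher alignment suggests this data is more useful for the end task.
        scores.append(torch.dot(grad, task_grad).item())
    return scores

# Keep only the top-scoring fraction of the corpus for continued pretraining.
```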
We propose MINER to remedy the out-of-vocabulary entity recognition issue from an information-theoretic perspective. The approach contains two mutual-information-based training objectives: (i) generalizing information maximization, which enhances representations via a deep understanding of context and entity surface forms; and (ii) superfluous information minimization, which discourages representations from rote-memorizing entity names or exploiting biased cues in the data.
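A conceptual sketch of an InfoNCE-style contrastive term, in the spirit of the generalizing information maximization objective, is shown below; the names, shapes, and temperature are illustrative assumptions, and the superfluous-information term is only described in comments.

```python
# Illustrative InfoNCE-style lower bound on mutual information (assumed form).
import torch
import torch.nn.functional as F

def info_nce(z: torch.Tensor, z_pos: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Pull each representation toward its positive view (e.g., a context-only view
    of the same entity) and push it away from other in-batch samples."""
    z, z_pos = F.normalize(z, dim=-1), F.normalize(z_pos, dim=-1)
    logits = z @ z_pos.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(z.size(0), device=z.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# The second objective, superfluous information minimization, instead upper-bounds
# the information shared with nuisance features (e.g., the raw entity name) and
# penalizes it; its variational form is more involved and is omitted here.
```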
TextFlint is a multilingual robustness evaluation toolkit for NLP tasks that incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analyses.
This paper provides a comprehensive and systematic overview of LLM-based agents, discussing the potential challenges and opportunities in this flourishing field. We hope our efforts provide inspiration to the community and facilitate research in related fields.
Our procedure, distribution shift risk minimization (DSRM), estimates the adversarial loss by perturbing the probability distribution of the input data rather than the embeddings themselves. This formulation yields a robust model that minimizes the expected global loss under adversarial attacks.
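The underlying idea can be sketched as adversarially reweighting the empirical distribution so that harder examples count more, subject to a divergence budget. The exact constraint and optimization in DSRM differ, so treat the softmax tilting below as an assumption.

```python
# Assumed sketch: worst-case reweighting of the empirical distribution within a
# KL-style budget is exponential in the per-example loss.
import torch

def dsrm_style_loss(per_example_losses: torch.Tensor, shift_temperature: float = 1.0) -> torch.Tensor:
    """Upweight harder examples: an adversarial shift of the empirical distribution."""
    with torch.no_grad():
        weights = torch.softmax(per_example_losses / shift_temperature, dim=0)
    return (weights * per_example_losses).sum()

# Usage inside a training step (illustrative):
#   losses = F.cross_entropy(logits, labels, reduction="none")
#   loss = dsrm_style_loss(losses)
#   loss.backward()
```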
We propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER. This model learns a token- and content-dependent confidence via an Expectation–Maximization (EM) algorithm by minimizing empirical risk.
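A high-level sketch of how prior (annotator) and posterior (model) confidences could be combined in an EM-style loop is given below; the update rules are simplified assumptions rather than the paper's exact derivation.

```python
# Assumed EM-style sketch: combine annotator priors with model posteriors.
import torch

def e_step(model_probs: torch.Tensor, prior_conf: torch.Tensor) -> torch.Tensor:
    """Combine model posteriors with annotator priors into soft target labels.

    model_probs: (num_tokens, num_labels) model predictive distribution.
    prior_conf:  (num_tokens, num_labels) annotator confidence over candidate labels
                 (zero for labels outside the partial label set).
    """
    joint = model_probs * prior_conf
    return joint / joint.sum(dim=-1, keepdim=True).clamp_min(1e-12)

def m_step_loss(model_log_probs: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Minimize empirical risk against the soft targets (a soft cross-entropy)."""
    return -(soft_targets * model_log_probs).sum(dim=-1).mean()
```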
I led the EasyJailbreak project, focusing on creating a framework to evaluate jailbreak attacks on LLMs. My work included surveying jailbreak methods, forming a taxonomy, and contributing to the framework's design. Additionally, I ensured comprehensive documentation, making our research accessible to others. This effort streamlined security assessments across various LLMs.
As the team leader, I directly engaged in and oversaw the entire project development. My key contributions included designing robustness evaluation methods, architecting the project's code frameworks and foundational components, and leading the creation and organization of comprehensive project documentation. Additionally, I managed the implementation of continuous integration processes, ensuring project efficiency and effectiveness.
I contributed to developing the Bi-LSTM model's source code, focusing on enhancing its efficiency and functionality. Additionally, I designed and conducted unit tests to ensure the model's reliability and performance.