Hello! đź‘‹
I am a PhD graduate of the Fudan University NLP group, where I was co-advised by Prof. Qi Zhang and Prof. Xuanjing Huang. During my doctoral studies, my research focused on the trustworthiness of LLMs, particularly robustness and security, as well as continual learning and information extraction.
I'm currently working at Zhipu AI in the post-training group. My focus areas include Alignment, RLHF, Reasoning, and Code. I'm passionate about improving the quality of feedback mechanisms by providing finer-grained and more informative signals. A significant part of my role involves developing scalable oversight strategies to supervise more powerful models efficiently and cost-effectively. I'm optimistic about the realization of AGI and thrilled to be part of this exciting journey! 🚀
Always up for collaborating and chatting about these topics—feel free to reach out! 🤝
We introduce EasyJailbreak, a unified framework that simplifies the construction and evaluation of jailbreak attacks against LLMs. Its modular design enables researchers to easily assemble attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates security validation for a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreak attacks.
We delve into the mechanisms behind jailbreak attacks and introduce CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To evade the intent-based security recognition phase, we reformulate tasks into a code-completion format, enabling users to encrypt queries with personalized encryption functions. To guarantee response-generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully.
We define the exaggerated safety behaviours of LLMs as "Overkill" and conduct a detailed analysis of this phenomenon, moving from surface behaviour to the underlying mechanism. We find that the model's understanding of user queries is superficial and that it relies on shortcuts in its internal attention mechanism. Based on these findings, we propose Self-CD, a simple, effective, and model-agnostic method that requires no training yet significantly reduces the model's rejection rate.
By fine-tuning on just 100 malicious examples for 1 GPU hour, open-source safety-aligned LLMs can easily be subverted to generate harmful content. We term this new attack Shadow Alignment: a tiny amount of data is enough to steer safety-aligned models toward harmful tasks without sacrificing model helpfulness.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs. We conducted systematic experiments on TRACE with six different aligned models ranging in size from 7B to 70B parameters. Our experiments show that, after training on TRACE, aligned LLMs exhibit significant declines in both general ability and instruction-following capability.
We propose orthogonal low-rank adaptation (O-LoRA), a simple and efficient approach for continual learning in language models, effectively mitigating catastrophic forgetting while learning new tasks. Specifically, O-LoRA learns tasks in different (low-rank) vector subspaces that are kept orthogonal to each other in order to minimize interference.
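To make the subspace constraint concrete, here is a minimal PyTorch sketch of an orthogonality penalty between the current task's LoRA matrix and frozen matrices from earlier tasks; the shapes, norm, and weighting are illustrative assumptions rather than the paper's reference implementation.

```python
# Illustrative O-LoRA-style orthogonality penalty (assumed shapes and norm).
import torch

def orthogonality_loss(A_new: torch.Tensor, A_prev_list: list[torch.Tensor]) -> torch.Tensor:
    """Penalize overlap between the new task's LoRA subspace and frozen previous ones.

    A_new:       trainable LoRA "A" matrix for the current task, shape (r, d).
    A_prev_list: frozen LoRA "A" matrices from earlier tasks, each of shape (r_i, d).
    """
    loss = torch.tensor(0.0, device=A_new.device)
    for A_prev in A_prev_list:
        # Rows of each A matrix span that task's low-rank subspace; their cross
        # Gram matrix should be close to zero if the subspaces are orthogonal.
        overlap = A_new @ A_prev.detach().T          # (r, r_i)
        loss = loss + overlap.abs().sum()
    return loss

# During training on task t, the total objective would look like:
#   total_loss = task_loss + lambda_orth * orthogonality_loss(A_t, [A_1, ..., A_{t-1}])
```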
Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains.
Drawing inspiration from recent work showing that LLMs are sensitive to instruction design, we use code-style instructions, which are more structured and less ambiguous, in place of typical natural-language instructions.
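As a hypothetical illustration (the exact prompt template is an assumption, not the paper's), a code-style instruction might recast entity extraction as completing a Python snippet:

```python
# Hypothetical contrast between a natural-language and a code-style instruction.
natural_language_instruction = (
    "Extract all person and location entities from the sentence: "
    "'Marie Curie was born in Warsaw.'"
)

code_style_instruction = '''
class Entity:
    """An entity mention with its type."""
    def __init__(self, text: str, type: str):
        self.text = text
        self.type = type  # one of: "person", "location"

sentence = "Marie Curie was born in Warsaw."
# Complete the list of entities found in `sentence`:
entities = [
'''
# The LLM is expected to continue the code, e.g.:
#     Entity("Marie Curie", "person"),
#     Entity("Warsaw", "location"),
# ]
```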
We propose InstructUIE, a unified information extraction framework based on instruction tuning, which can uniformly model various information extraction tasks and capture the inter-task dependency. To validate the proposed method, we introduce IE INSTRUCTIONS, a benchmark of 32 diverse information extraction datasets in a unified text-to-text format with expert-written instructions.
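For illustration, a single training instance in a unified text-to-text format might look like the sketch below; the field names and wording are assumptions, not the released IE INSTRUCTIONS schema.

```python
# Assumed illustration of a unified text-to-text IE instance.
example = {
    "task": "named entity recognition",
    "instruction": "List all entity mentions in the text that belong to one of the "
                   "given categories. Output format: 'type1: word1; type2: word2'.",
    "options": ["person", "organization", "location"],
    "input": "Steve Jobs co-founded Apple in Cupertino.",
    "output": "person: Steve Jobs; organization: Apple; location: Cupertino",
}
```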
We propose Influence Subset Selection (ISS) for language models, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperforms pretrained models such as RoBERTa.
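A simplified, first-order sketch of influence-style scoring is given below: pretraining examples are ranked by how well their gradients align with an end-task gradient. The actual ISS procedure is more involved, so the function names and the approximation here are assumptions.

```python
# First-order influence approximation (assumed): gradient alignment scoring.
import torch

def influence_scores(model, pretrain_batches, end_task_loss_fn, params):
    """Score pretraining batches by gradient alignment with the end task."""
    # Gradient of the end-task loss w.r.t. the model parameters.
    task_grads = torch.autograd.grad(end_task_loss_fn(model), params)
    task_grad = torch.cat([g.flatten() for g in task_grads])

    scores = []
    for batch in pretrain_batches:
        loss = model(**batch).loss                     # per-batch pretraining loss
        grads = torch.autograd.grad(loss, params)
        grad = torch.cat([g.flatten() for g in grads])
        # Higher alignment suggests this data is more useful for the end task.
        scores.append(torch.dot(grad, task_grad).item())
    return scores

# Keep only the top-scoring fraction of the corpus for continued pretraining.
```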
We propose MINER to remedy the out-of-vocabulary entity recognition issue from an information-theoretic perspective. The approach contains two mutual-information-based training objectives: (i) generalizing information maximization, which enhances representations via a deep understanding of context and entity surface forms; and (ii) superfluous information minimization, which discourages representations from rote-memorizing entity names or exploiting biased cues in the data.
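A conceptual sketch of an InfoNCE-style contrastive term, in the spirit of the generalizing information maximization objective, is shown below; the names, shapes, and temperature are illustrative assumptions, and the superfluous-information term is only described in comments.

```python
# Illustrative InfoNCE-style lower bound on mutual information (assumed form).
import torch
import torch.nn.functional as F

def info_nce(z: torch.Tensor, z_pos: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Pull each representation toward its positive view (e.g., a context-only view
    of the same entity) and push it away from other in-batch samples."""
    z, z_pos = F.normalize(z, dim=-1), F.normalize(z_pos, dim=-1)
    logits = z @ z_pos.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(z.size(0), device=z.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# The second objective, superfluous information minimization, instead upper-bounds
# the information shared with nuisance features (e.g., the raw entity name) and
# penalizes it; its variational form is more involved and is omitted here.
```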
TextFlint is a multilingual robustness evaluation toolkit for NLP tasks that incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analyses.
This paper provides a comprehensive and systematic overview of LLM-based agents, discussing the potential challenges and opportunities in this flourishing field. We hope our efforts provide inspiration to the community and facilitate research in related fields.
Our procedure, distribution shift risk minimization (DSRM), estimates the adversarial loss by perturbing the probability distribution of the input data rather than the embeddings themselves. This formulation yields a robust model that minimizes the expected global loss under adversarial attacks.
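The underlying idea can be sketched as adversarially reweighting the empirical distribution so that harder examples count more, subject to a divergence budget. The exact constraint and optimization in DSRM differ, so treat the softmax tilting below as an assumption.

```python
# Assumed sketch: worst-case reweighting of the empirical distribution within a
# KL-style budget is exponential in the per-example loss.
import torch

def dsrm_style_loss(per_example_losses: torch.Tensor, shift_temperature: float = 1.0) -> torch.Tensor:
    """Upweight harder examples: an adversarial shift of the empirical distribution."""
    with torch.no_grad():
        weights = torch.softmax(per_example_losses / shift_temperature, dim=0)
    return (weights * per_example_losses).sum()

# Usage inside a training step (illustrative):
#   losses = F.cross_entropy(logits, labels, reduction="none")
#   loss = dsrm_style_loss(losses)
#   loss.backward()
```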
We propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER. This model learns a token- and content-dependent confidence via an Expectation–Maximization (EM) algorithm by minimizing empirical risk.
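A high-level sketch of how prior (annotator) and posterior (model) confidences could be combined in an EM-style loop is given below; the update rules are simplified assumptions rather than the paper's exact derivation.

```python
# Assumed EM-style sketch: combine annotator priors with model posteriors.
import torch

def e_step(model_probs: torch.Tensor, prior_conf: torch.Tensor) -> torch.Tensor:
    """Combine model posteriors with annotator priors into soft target labels.

    model_probs: (num_tokens, num_labels) model predictive distribution.
    prior_conf:  (num_tokens, num_labels) annotator confidence over candidate labels
                 (zero for labels outside the partial label set).
    """
    joint = model_probs * prior_conf
    return joint / joint.sum(dim=-1, keepdim=True).clamp_min(1e-12)

def m_step_loss(model_log_probs: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Minimize empirical risk against the soft targets (a soft cross-entropy)."""
    return -(soft_targets * model_log_probs).sum(dim=-1).mean()
```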
I led the EasyJailbreak project, focusing on creating a framework to evaluate jailbreak attacks on LLMs. My work included surveying jailbreak methods, forming a taxonomy, and contributing to the framework's design. Additionally, I ensured comprehensive documentation, making our research accessible to others. This effort streamlined security assessments across various LLMs.
As the team leader, I directly engaged in and oversaw the entire project development. My key contributions included designing robustness evaluation methods, architecting the project's code frameworks and foundational components, and leading the creation and organization of comprehensive project documentation. Additionally, I managed the implementation of continuous integration processes, ensuring project efficiency and effectiveness.
I contributed to developing the Bi-LSTM model's source code, focusing on enhancing its efficiency and functionality. Additionally, I designed and conducted unit tests to ensure the model's reliability and performance.