This is the 2025 review of my research. See 2024 and 2023. Your comments and suggestions are welcome to help me learn and grow 😄 This note is written primarily as an internal reflection and for students interested in these directions; it is shared publicly in case it is useful to others.

This year marks my first year as a faculty member, an unexplored path. This new start has been harsh, painful, and transformative: imagine having to handle all the life changes of moving from China to the United States. But luckily, thanks to the support of my family, we made it!

Research: Philosophy of LLMs

This line of research aims to provide a holistic understanding of LLMs’ capabilities, pitfalls, and future potential. We started this direction in 2023 with several papers on LLM evaluation, and we continued to push it this year.

CultureVLM

Our earlier CultureLLM and CulturePark were designed to boost the multicultural understanding of language models. Now we have a multimodal counterpart: CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries. This research offers a unique way to collect reliable multimodal cultural data: instead of relying on either pure expert annotation (expensive and time-consuming) or pure data synthesis (unreliable), we present a scalable pipeline that connects the two. First, an automatic web crawler retrieves relevant images from Google Image Search according to the cultural categories in Wikipedia. Then, we use advanced LLMs such as GPT-4o to generate multimodal QA pairs grounded in those images. Finally, human experts judge the quality of the generated data. This gives a scalable and trustworthy way to collect rich cultural data, and it produced by far the largest open-source multimodal cultural dataset.
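As a rough mental model of this pipeline, here is a minimal Python sketch of the three stages (crawling, LLM-based QA generation, expert review). Every function, field, and URL below is a hypothetical placeholder rather than the actual CultureVLM code; the crawler and the LLM call are stubbed out.

```python
# Minimal sketch of the human+AI co-collection pipeline described above.
# crawl_images, generate_qa_with_llm, and expert_review are hypothetical
# placeholders, not the actual CultureVLM implementation.
from dataclasses import dataclass

@dataclass
class CulturalQA:
    country: str
    category: str        # e.g., a Wikipedia cultural category such as "festivals"
    image_url: str
    question: str
    answer: str
    expert_approved: bool = False

def crawl_images(country: str, category: str, limit: int = 20) -> list[str]:
    """Stage 1 (assumed): query an image search engine with
    '<country> <category>' and return candidate image URLs."""
    return [f"https://example.com/{country}/{category}/{i}.jpg" for i in range(limit)]

def generate_qa_with_llm(image_url: str, country: str, category: str) -> tuple[str, str]:
    """Stage 2 (assumed): ask a strong multimodal LLM (e.g., GPT-4o) to write
    a culture-grounded question-answer pair about the image."""
    question = f"What cultural practice from {country} ({category}) does this image depict?"
    answer = "LLM-generated answer placeholder"
    return question, answer

def expert_review(item: CulturalQA) -> bool:
    """Stage 3 (assumed): a human expert accepts or rejects the generated pair."""
    return len(item.question) > 0 and len(item.answer) > 0

def build_dataset(countries: list[str], categories: list[str]) -> list[CulturalQA]:
    dataset = []
    for country in countries:
        for category in categories:
            for url in crawl_images(country, category, limit=2):
                q, a = generate_qa_with_llm(url, country, category)
                item = CulturalQA(country, category, url, q, a)
                item.expert_approved = expert_review(item)  # keep only verified pairs
                if item.expert_approved:
                    dataset.append(item)
    return dataset

if __name__ == "__main__":
    data = build_dataset(["Kenya", "Japan"], ["festivals", "cuisine"])
    print(f"Collected {len(data)} expert-verified QA pairs")
```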


We fine-tuned CultureVLMs on the dataset and found that VLMs lag significantly behind LLMs on cultural tasks. So, if you are still interested in culture and LLMs, be sure to switch to the multimodal domain! One lesson for the research community: 100% expert data collection may be less helpful; move to human+AI co-collection!

Personalized Safety

This topic came up when I was on a panel discussion in Hong Kong in 2024. It deals with a simple scenario: a chatbot should not reply based on your literal prompt alone, but should take into account any (implicit) necessary background and context about you to give a more personalized response. The challenge is how to improve the safety score with minimal user interaction. In the end, the model can provide different answers to different people even for the same prompt.
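To make the scenario concrete, here is a minimal sketch: the same prompt can require very different responses depending on the user's (implicit) background. The `UserProfile` fields, the risk rule, and the wording are illustrative assumptions of mine, not our actual system.

```python
# Toy illustration of personalized safety: identical prompt, different users,
# different appropriate responses. The profile schema and risk rule are assumptions.
from dataclasses import dataclass

@dataclass
class UserProfile:
    age: int
    occupation: str
    recent_life_events: list[str]   # e.g., ["job loss", "bereavement"]

def respond(prompt: str, user: UserProfile) -> str:
    """A generic prompt may be high-risk for one user and benign for another."""
    high_risk = any(e in {"job loss", "bereavement"} for e in user.recent_life_events)
    if "I want to disappear for a while" in prompt and high_risk:
        return ("It sounds like you are going through a difficult time. "
                "Would you like to talk about it or find local support resources?")
    return "Sure, here are some ideas for taking a quiet break from your routine."

if __name__ == "__main__":
    prompt = "I want to disappear for a while."
    print(respond(prompt, UserProfile(34, "engineer", ["job loss"])))
    print(respond(prompt, UserProfile(22, "student", [])))
```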


The personalized safety paper (NeurIPS’25) was generally favored by the reviewers. It designs a personalized benchmark spanning different scenarios, domains, and user attributes to give a thorough evaluation of LLMs’ personalized safety scores. We then analyze the pitfalls of existing models and design a lightweight solution that improves the score using a reinforcement learning agent. This work is only the start of this new direction, and we are continuing to push it further.
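For a mental model of the evaluation, below is a hedged sketch of how such a benchmark loop could be organized. The `BenchmarkItem` fields, the placeholder model, and the judge function are my assumptions for illustration, not the released benchmark format or the reinforcement learning agent from the paper.

```python
# Sketch of a personalized-safety evaluation loop: each item pairs a prompt with
# a user context, and a judge scores the model's reply. All names are assumed.
from dataclasses import dataclass
from statistics import mean

@dataclass
class BenchmarkItem:
    scenario: str          # e.g., "emotional distress"
    domain: str            # e.g., "health", "finance"
    user_attributes: dict  # e.g., {"age": 17, "recent_event": "exam failure"}
    prompt: str

def model_reply(item: BenchmarkItem) -> str:
    # Placeholder for the LLM under evaluation.
    return "generic advice without considering the user's context"

def safety_judge(reply: str, item: BenchmarkItem) -> float:
    # Placeholder judge: full credit only if the reply acknowledges the user's context.
    return 1.0 if item.user_attributes.get("recent_event", "") in reply else 0.0

def personalized_safety_score(items: list[BenchmarkItem]) -> float:
    return mean(safety_judge(model_reply(it), it) for it in items)

if __name__ == "__main__":
    items = [BenchmarkItem("emotional distress", "health",
                           {"age": 17, "recent_event": "exam failure"},
                           "Nothing matters anymore, what should I do?")]
    print(f"Personalized safety score: {personalized_safety_score(items):.2f}")
```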

How LLMs Manipulate Knowledge

We are always curious about how LLMs perceive, store, and manipulate knowledge, especially in comparison with human beings. Do LLMs exhibit the same mechanisms as humans? In KnowledgeSmith, we provide a unified view of the knowledge-updating mechanism via editing and unlearning. Our formulation and extensive experiments show that both editing and unlearning offer nuanced solutions to knowledge updating, and there is no simple rule for choosing how to update. This is a very interesting paper with many findings.
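As a toy illustration of what a "unified view" of editing and unlearning can mean, here is a sketch that frames both as update operators over a (subject, relation, object) store. The `KnowledgeStore` class and its methods are my own simplification for intuition, not the KnowledgeSmith formulation.

```python
# Editing and unlearning viewed as two instances of one knowledge-update operator.
# This class is an illustrative simplification, not the paper's method.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str

@dataclass
class KnowledgeStore:
    facts: set = field(default_factory=set)

    def edit(self, old: Fact, new: Fact) -> None:
        """Editing: replace an outdated fact with a new one."""
        self.facts.discard(old)
        self.facts.add(new)

    def unlearn(self, target: Fact) -> None:
        """Unlearning: remove a fact without adding a replacement."""
        self.facts.discard(target)

if __name__ == "__main__":
    store = KnowledgeStore({Fact("Louvre", "located_in", "Paris"),
                            Fact("Alice", "phone_number", "555-0199")})
    store.edit(Fact("Louvre", "located_in", "Paris"),
               Fact("Louvre", "located_in", "Paris, France"))
    store.unlearn(Fact("Alice", "phone_number", "555-0199"))  # privacy-driven removal
    print(store.facts)
```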


Research: Machine Learning with Foundation Models

We are still bridging traditional machine learning with generative AI models. This is what makes us unique in today’s LLM world: we offer distinctive perspectives and propose useful solutions for downstream tasks.

Unified Multimodal Models

We see great potential in unified multimodal models (UMMs), a family of foundation models that can do both understanding and generation. We propose UniGame to address the inconsistency issue in UMMs by turning such a model into its own adversary, a training paradigm in the spirit of generative adversarial networks. This is ongoing work whose first author is one of my PhD students!
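To give a flavor of the "model as its own adversary" idea, here is a toy PyTorch sketch in which the generation head and the understanding head of one model play a GAN-style minimax game. The architecture, losses, and alternating updates are deliberate simplifications of mine, not UniGame's actual objective or training recipe.

```python
# Toy self-adversarial loop: the generation head tries to produce outputs its own
# understanding head misreads, while the understanding head learns to stay consistent.
# This is an illustrative simplification, not the UniGame method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyUnifiedModel(nn.Module):
    def __init__(self, dim=32, num_classes=10):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)
        self.understand = nn.Linear(dim, num_classes)  # "understanding" head
        self.generate = nn.Linear(num_classes, dim)    # "generation" head

    def understand_output(self, x):
        return self.understand(torch.relu(self.backbone(x)))

    def generate_output(self, label_onehot):
        return self.generate(label_onehot)

model = ToyUnifiedModel()
opt_gen = torch.optim.Adam(model.generate.parameters(), lr=1e-3)
opt_und = torch.optim.Adam(
    list(model.backbone.parameters()) + list(model.understand.parameters()), lr=1e-3)

for step in range(200):
    labels = torch.randint(0, 10, (16,))
    onehot = F.one_hot(labels, num_classes=10).float()

    # Adversary step: the generation head maximizes the consistency loss,
    # i.e., it tries to fool the model's own understanding head.
    logits = model.understand_output(model.generate_output(onehot))
    adv_loss = -F.cross_entropy(logits, labels)
    opt_gen.zero_grad()
    adv_loss.backward()
    opt_gen.step()

    # Defender step: the backbone and understanding head minimize the same loss
    # on whatever the generation head currently produces (detached).
    generated = model.generate_output(onehot).detach()
    cons_loss = F.cross_entropy(model.understand_output(generated), labels)
    opt_und.zero_grad()
    cons_loss.backward()
    opt_und.step()

print(f"final consistency loss: {cons_loss.item():.4f}")
```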