This is my second English blog summarizing my research for the year, following the 2023 version. Your criticism, comments, and suggestions are welcome to help me, a junior researcher, learn and grow😄

The year 2024 was amazing and transformative for me in research, career, and life. For context: I will leave Microsoft and join William & Mary as an assistant professor starting January 2025 (a dramatic change in my career path and life). In research, this is the second year that many AI researchers worldwide have shifted their focus to large language models. Having made this transition early last year, I saw my productivity increase significantly. This year also marks when I began thinking about problems beyond individual papers, viewing them at the broader level of research directions.

I am proud to have pioneered several new research directions this year. Our major breakthroughs can be summarized into three key areas. These directions will remain my primary focus during my faculty career.

Machine learning with foundation models

As a machine learning researcher who believes ML remains vital in the era of generative AI, I have focused on bridging large foundation models with ML techniques. My most significant research breakthrough this year centers on our newly proposed research direction—Catastrophic Inheritance (CI). Given the profound impact of generative AI across disciplines and the complexity of its training, data, and adaptation processes, we introduced CI to address a critical challenge: biases in upstream pre-training data are inherited by models, leading to catastrophic consequences in downstream tasks. Our major research outcomes in this area are:

[Figure: summary of our research outcomes on Catastrophic Inheritance]

I certainly do not want to abandon traditional ML, since I firmly believe it remains useful in the era of LLMs. In the future, we will continue research on both CI and other promising new directions.

Philosophy of language models

I coined the term “philosophy of LM” to unify research that tries to understand LLMs in a more scientific way. It mainly consists of two threads: evaluation and enhancement.

LLM evaluation

We had several papers on LLM evaluation in top venues this year, and I would like to highlight only two: the DyVal series and ERBench.

Well, I have to say that LLM evaluation may be one of the easiest research areas to enter, since it requires no specific model training or mathematics. However, despite plenty of work in this area, the major players delivering frontier AI models still rely on old, contaminated benchmarks like MMLU and GSM8K. This is somewhat sad to see.

LLM enhancement: Culture-specific LLMs

Current LLMs handle multicultural contexts poorly if you are not from a Western country. This is the main reason we created the CultureLLM series: to extend LLMs’ reach to low-resource cultures.