This is my second English blog summarizing my research for the year, following the 2023 version. Your criticism, comments, and suggestions are welcome and will help me, a junior researcher, learn and grow😄
The year 2024 was amazing and transformative for me in research, career, and life. For context: I will leave Microsoft and join William & Mary as an assistant professor starting January 2025 (a dramatic change in my career path and life). In research, this is the second year that many AI researchers worldwide have shifted their focus to large language models. Because I made this transition early last year, my productivity has increased significantly. This year also marks the point when I began thinking beyond individual papers and viewing problems at the level of broader research directions.
I am proud to have pioneered several new research directions this year. Our major breakthroughs fall into three key areas, and these directions will remain my primary focus during my faculty career.
As a machine learning researcher who believes ML remains vital in the era of generative AI, I have focused on bridging large foundation models with ML techniques. My most significant research breakthrough this year centers on our newly proposed research direction, Catastrophic Inheritance (CI). Given the profound impact of generative AI across disciplines and the complexity of its training, data, and adaptation processes, we introduced CI to address a critical challenge: biases in upstream pre-training data are inherited by models, leading to catastrophic consequences in downstream tasks. Our major research outcomes in this area are:

I certainly do not want to abandon traditional ML, since I firmly believe it remains useful in the era of LLMs. In the future, we will continue research in both CI and other interesting new directions.
I coined the term “philosophy of LM” to unify research that tries to understand LLMs in a more scientific way. It mainly consists of evaluation and enhancement.
We had several papers on LLM evaluation in top venues this year, and I would like to highlight only two: the DyVal series and ERBench.
Our initial concern about the impressive performance of LLMs was the data contamination issue, which refers to the case where LLMs are actually pre-trained on the (publicly available) test sets. To foster a more scientific and fair evaluation protocol, we created DyVal for dynamic evaluation. DyVal has the potential to overcome data contamination while maintaining controllable difficulty levels and supporting multi-faceted analysis of LLM abilities.
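To make the idea concrete, here is a minimal toy sketch in Python of what dynamic evaluation means (my own illustration, not the actual DyVal code): test items are generated on the fly from a structured template, so the exact questions cannot have leaked into pre-training data, and a depth parameter acts as a difficulty knob.

```python
import random
import operator

# Toy sketch of dynamic evaluation: every call builds a fresh arithmetic
# reasoning problem from a small expression tree, so the exact test items
# never appear verbatim in pre-training data, and `depth` controls difficulty.

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def generate_problem(depth: int, rng: random.Random):
    """Recursively build an expression tree; return (expression, ground truth)."""
    if depth == 0:
        value = rng.randint(1, 9)
        return str(value), value
    left_str, left_val = generate_problem(depth - 1, rng)
    right_str, right_val = generate_problem(depth - 1, rng)
    op = rng.choice(list(OPS))
    return f"({left_str} {op} {right_str})", OPS[op](left_val, right_val)

def make_benchmark(num_items: int, depth: int, seed: int = 0):
    """Create a fresh, automatically verifiable test set."""
    rng = random.Random(seed)
    items = []
    for _ in range(num_items):
        expr, answer = generate_problem(depth, rng)
        items.append({"question": f"Compute {expr}.", "answer": answer})
    return items

if __name__ == "__main__":
    for item in make_benchmark(num_items=3, depth=2, seed=42):
        print(item)
```

The real DyVal uses much richer graph-based generation and reasoning tasks, but the principle is the same: the benchmark is a generator, not a fixed file.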

“ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models” was accepted to NeurIPS 2024 as a spotlight! This paper bridges traditional database theory with LLM evaluation: KAIST researchers and our team created a hallucination benchmark that can automatically verify model answers along with their rationales. I really like this paper since it paves the way for traditional database research to help AI.
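Here is a toy Python sketch (my own simplification, not the ERBench code) of how a relational table with known functional dependencies can yield automatically verifiable hallucination checks: questions are built from database records, and a model's answer is accepted only if it matches the facts stored in the record.

```python
# Toy illustration: in this table, title -> (director, year) is a functional
# dependency, so any claimed director/year can be checked against the record.

MOVIES = [
    {"title": "Inception", "director": "Christopher Nolan", "year": 2010},
    {"title": "Parasite", "director": "Bong Joon-ho", "year": 2019},
]

def build_question(record):
    return f"Who directed the movie '{record['title']}', and in which year was it released?"

def verify_answer(record, model_answer: str) -> bool:
    """Accept the answer only if both database facts appear in it."""
    answer = model_answer.lower()
    return record["director"].lower() in answer and str(record["year"]) in answer

if __name__ == "__main__":
    record = MOVIES[0]
    print(build_question(record))
    print(verify_answer(record, "It was directed by James Cameron in 2010."))   # hallucination -> False
    print(verify_answer(record, "Christopher Nolan directed it in 2010."))      # correct -> True
```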
Well, I have to say that LLM evaluation may be one of the easiest research areas to enter, since it requires no specific model training or heavy mathematics. However, despite plenty of work in this area, the major players delivering frontier AI models still rely on old and contaminated benchmarks like MMLU and GSM8K, which is somewhat sad to see.
Current LLMs struggle to handle multicultural contexts for users outside Western countries. This is the major reason we created the CultureLLM series, which extends LLMs' reach to low-resource cultures.