Overview
As LLMs become central to software development workflows, understanding what they actually learn about code and how reliably they perform is critical. Our research builds benchmarks, training methods, and evaluation frameworks that move beyond surface-level accuracy to probe deeper code understanding.
We study semantic reasoning, self-consistency, execution awareness, and the impact of data contamination on benchmark validity.
Key Directions
- Semantic Reasoning: Benchmarks that test whether LLMs understand code semantics, not just syntax, and training methods such as execution-aware pre-training that teach models those semantics.
- Self-Consistency: Evaluating whether code LLMs produce internally consistent outputs across equivalent formulations of the same problem.
- Dynamic Evaluation: Combating data contamination with dynamic benchmarking approaches that generate fresh evaluation instances.
- Code Representations: Contrastive learning and pre-training strategies that build better representations of source code.
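To make the self-consistency direction concrete, here is a minimal Python sketch of one way such a check could be scored. It is an illustration only, not the IdentityChain protocol or any published method: the three `from_prompt_*` functions are hypothetical stand-ins for programs a model might generate from three equivalent phrasings of one task, and `consistency_rate` is a name introduced here for the example.

```python
from typing import Callable, Sequence

# Assumption: each "program" is a plain Python callable standing in for code
# that a model generated from one phrasing of the same task.
Program = Callable[[int], int]


def consistency_rate(programs: Sequence[Program], inputs: Sequence[int]) -> float:
    """Return the fraction of inputs on which all programs produce the same output."""
    if not programs or not inputs:
        raise ValueError("need at least one program and one input")
    agree = 0
    for x in inputs:
        outputs = {p(x) for p in programs}  # distinct outputs for this input
        if len(outputs) == 1:
            agree += 1
    return agree / len(inputs)


# Hypothetical generations for three phrasings of "return the absolute value of x".
def from_prompt_a(x: int) -> int:
    return abs(x)


def from_prompt_b(x: int) -> int:
    return x if x >= 0 else -x


def from_prompt_c(x: int) -> int:
    return x  # disagrees with the others on negative inputs


inputs = [-2, -1, 0, 1, 2]
rate_consistent = consistency_rate([from_prompt_a, from_prompt_b], inputs)
rate_inconsistent = consistency_rate(
    [from_prompt_a, from_prompt_b, from_prompt_c], inputs
)
print(rate_consistent, rate_inconsistent)
```

Agreement between programs says nothing about correctness: two generations can be consistently wrong. A score like this would therefore be reported alongside accuracy, not in place of it.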
Impact
SemCoder, published at NeurIPS 2024, introduced comprehensive semantics reasoning into code LLM training and achieved state-of-the-art results. IdentityChain established self-consistency as a key evaluation dimension for code models. Our work on data contamination has influenced how the community designs and trusts code benchmarks.
Contributors
Baishakhi Ray
Simin Chen
Jinjun Peng
Ira Ceka
Nikolaus Holzer
Yangruibo Ding
Kexin Pei
Saikat Chakraborty
Yuchi Tian
Selected Publications
CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning
MK Roy, S Chen, B Steenhoek, J Peng, G Kaiser, B Ray, W Le · ICLR 2026
Benchmarking large language models under data contamination: A survey from static to dynamic evaluation
S Chen, Y Chen, Z Li, Y Jiang, Z Wan, Y He, D Ran, T Gu, H Li, T Xie · EMNLP 2025
Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance
N Holzer, W Fishell, B Ray, M Santolucito · Preprint, 2025
Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action
S Pujar, I Ceka, I Manotas, G Kaiser, B Ray, S Ramji · Preprint, 2025
CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation
J Peng, L Cui, K Huang, J Yang, B Ray · 2025 IEEE/ACM International Workshop on Large Language Models for Code
DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
S Chen, P Pusarla, B Ray · ICML 2025
SemCoder: Training code language models with comprehensive semantics reasoning
Y Ding, J Peng, MJ Min, G Kaiser, J Yang, B Ray · Advances in Neural Information Processing Systems 37, 60275-60308
CYCLE: Learning to self-refine the code generation
Y Ding, MJ Min, G Kaiser, B Ray · Proceedings of the ACM on Programming Languages 8 (OOPSLA1), 392-418
Vulnerability detection with code language models: How far are we?
Y Ding, Y Fu, O Ibrahim, C Sitawarin, X Chen, B Alomair, D Wagner, B Ray, Y Chen · ICSE 2025
TRACED: Execution-aware Pre-training for Source Code
Y Ding, B Steenhoek, K Pei, G Kaiser, W Le, B Ray · ICSE 2024
TraceFixer: Execution trace-driven program repair
I Bouzenia, Y Ding, K Pei, B Ray, M Pradel · Preprint, 2023
Beyond accuracy: Evaluating self-consistency of code LLMs with IdentityChain
MJ Min, Y Ding, L Buratti, S Pujar, G Kaiser, S Jana, B Ray · ICLR 2024
CONCORD: Clone-aware contrastive learning for source code
Y Ding, S Chakraborty, L Buratti, S Pujar, A Morari, G Kaiser, B Ray · ISSTA 2023
A static evaluation of code completion by large language models
H Ding, V Kumar, Y Tian, Z Wang, R Kwiatkowski, X Li, MK Ramanathan · ACL 2023
NatGen: Generative pre-training by 'naturalizing' source code
S Chakraborty, T Ahmed, Y Ding, PT Devanbu, B Ray · ESEC/FSE 2022
Multi-lingual evaluation of code generation models
B Athiwaratkun, SK Gouda, Z Wang, X Li, Y Tian, M Tan, WU Ahmad · Preprint, 2022
Deep learning based vulnerability detection: Are we there yet?
S Chakraborty, R Krishna, Y Ding, B Ray · IEEE Transactions on Software Engineering 48(9), 3280-3296
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts
Y Ding, L Buratti, S Pujar, A Morari, B Ray, S Chakraborty · ACL 2022