Reasoning and Evaluation for Code and LLMs

Understanding how large language models reason about code, and building rigorous benchmarks to evaluate them.

Overview

As LLMs become central to software development workflows, understanding what they actually learn about code and how reliably they perform is critical. Our research builds benchmarks, training methods, and evaluation frameworks that move beyond surface-level accuracy to probe deeper code understanding.

We study semantic reasoning, self-consistency, execution awareness, and the impact of data contamination on benchmark validity.

Key Directions

  • Semantic Reasoning: Benchmarks that test whether LLMs understand code semantics, not just syntax, and training methods, including execution-aware pre-training, that build this understanding into models.
  • Self-Consistency: Evaluating whether code LLMs produce internally consistent outputs across equivalent formulations of the same problem.
  • Dynamic Evaluation: Combating data contamination with dynamic benchmarking approaches that generate fresh evaluation instances.
  • Code Representations: Contrastive learning and pre-training strategies that build better representations of source code.
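
To make the self-consistency direction concrete, here is a minimal sketch of one way such a check could look. It is an illustration only, not the procedure used by IdentityChain or any other paper listed here. The functions `from_prompt_a`, `from_prompt_b`, and `from_prompt_c` are hypothetical stand-ins for programs a model might produce from equivalent formulations of the same task, and `is_self_consistent` is an assumed helper name.

```python
from typing import Callable, Iterable, Sequence


def is_self_consistent(
    candidates: Sequence[Callable[[int], int]],
    inputs: Iterable[int],
) -> bool:
    """Return True if all candidates return the same output on every input."""
    for x in inputs:
        # Collect the distinct outputs produced for this input.
        outputs = {candidate(x) for candidate in candidates}
        if len(outputs) > 1:
            return False
    return True


# Hypothetical stand-ins for model outputs (assumptions, not real generations).
# Task: "return the sum of the integers from 0 to n".
def from_prompt_a(n: int) -> int:
    return sum(range(n + 1))


def from_prompt_b(n: int) -> int:
    return n * (n + 1) // 2


def from_prompt_c(n: int) -> int:
    return n * n  # disagrees with the others from n = 2 onward


consistent = is_self_consistent([from_prompt_a, from_prompt_b], range(10))
inconsistent = is_self_consistent([from_prompt_a, from_prompt_c], range(10))
print(consistent, inconsistent)
```

Agreement on a finite set of inputs is only evidence of consistency, not proof of equivalence, so a real evaluation would need a carefully chosen input set and a sandbox for running generated code.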

Impact

SemCoder (NeurIPS 2024) introduced comprehensive semantics reasoning into code LLM training and reported state-of-the-art results. IdentityChain established self-consistency as a key evaluation dimension for code models. Our work on data contamination has influenced how the community designs and trusts code benchmarks.

Contributors

Baishakhi Ray, Simin Chen, Jinjun Peng, Ira Ceka, Nikolaus Holzer, Yangruibo Ding, Kexin Pei, Saikat Chakraborty, Yuchi Tian

Selected Publications

CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

MK Roy, S Chen, B Steenhoek, J Peng, G Kaiser, B Ray, W Le · ICLR 2026

Benchmarking large language models under data contamination: A survey from static to dynamic evaluation

S Chen, Y Chen, Z Li, Y Jiang, Z Wan, Y He, D Ran, T Gu, H Li, T Xie · EMNLP 2025

Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance

N Holzer, W Fishell, B Ray, M Santolucito · Preprint, 2025

Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action

S Pujar, I Ceka, I Manotas, G Kaiser, B Ray, S Ramji · Preprint, 2025

CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation

J Peng, L Cui, K Huang, J Yang, B Ray · 2025 IEEE/ACM International Workshop on Large Language Models for Code

DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination

S Chen, P Pusarla, B Ray · ICML 2025

SemCoder: Training code language models with comprehensive semantics reasoning

Y Ding, J Peng, MJ Min, G Kaiser, J Yang, B Ray · Advances in Neural Information Processing Systems 37, 60275-60308

Cycle: Learning to self-refine the code generation

Y Ding, MJ Min, G Kaiser, B Ray · Proceedings of the ACM on Programming Languages 8 (OOPSLA1), 392-418

Vulnerability detection with code language models: How far are we?

Y Ding, Y Fu, O Ibrahim, C Sitawarin, X Chen, B Alomair, D Wagner, B Ray, Y Chen · ICSE 2025

TRACED: Execution-aware Pre-training for Source Code

Y Ding, B Steenhoek, K Pei, G Kaiser, W Le, B Ray · ICSE 2024

TraceFixer: Execution trace-driven program repair

I Bouzenia, Y Ding, K Pei, B Ray, M Pradel · Preprint, 2023

Beyond accuracy: Evaluating self-consistency of code LLMs with IdentityChain

MJ Min, Y Ding, L Buratti, S Pujar, G Kaiser, S Jana, B Ray · ICLR 2024

Concord: Clone-aware contrastive learning for source code

Y Ding, S Chakraborty, L Buratti, S Pujar, A Morari, G Kaiser, B Ray · ISSTA 2023

A static evaluation of code completion by large language models

H Ding, V Kumar, Y Tian, Z Wang, R Kwiatkowski, X Li, MK Ramanathan · ACL 2023

NatGen: Generative pre-training by 'naturalizing' source code

S Chakraborty, T Ahmed, Y Ding, PT Devanbu, B Ray · ESEC/FSE 2022

Multi-lingual evaluation of code generation models

B Athiwaratkun, SK Gouda, Z Wang, X Li, Y Tian, M Tan, WU Ahmad · Preprint, 2022

Deep learning based vulnerability detection: Are we there yet?

S Chakraborty, R Krishna, Y Ding, B Ray · IEEE Transactions on Software Engineering 48(9), 3280-3296

Towards Learning (Dis)-Similarity of Source Code from Program Contrasts

Y Ding, L Buratti, S Pujar, A Morari, B Ray, S Chakraborty · ACL 2022
