Reasoning and Evaluation for Code and LLMs

Understanding how large language models reason about code, and building rigorous benchmarks to evaluate them.

Overview

As LLMs become central to software development workflows, understanding what they actually learn about code and how reliably they perform is critical. Our research builds benchmarks, training methods, and evaluation frameworks that move beyond surface-level accuracy to probe deeper code understanding.

We study semantic reasoning, self-consistency, execution awareness, and the impact of data contamination on benchmark validity.

Key Directions

Semantic Reasoning: Benchmarks and models that test whether LLMs understand code semantics, not just syntax, including execution-aware pre-training.
Self-Consistency: Evaluating whether code LLMs produce internally consistent outputs across equivalent formulations of the same problem.
Dynamic Evaluation: Combating data contamination with dynamic benchmarking approaches that generate fresh evaluation instances.
Code Representations: Contrastive learning and pre-training strategies that build better representations of source code.
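
As a rough illustration of the self-consistency direction above, the sketch below checks whether several candidate programs behave identically on shared inputs. It is a toy under stated assumptions, not the IdentityChain procedure or any published benchmark: the helper `behaviours_agree` and the three `sum_to_n_*` functions are hypothetical stand-ins for programs a model might produce from equivalent phrasings of one problem.

```python
from typing import Any, Callable, Iterable, Sequence


def behaviours_agree(
    candidates: Sequence[Callable[..., Any]],
    inputs: Iterable[tuple],
) -> bool:
    """Return True if every candidate gives the same outcome on every input."""
    for args in inputs:
        outcomes = []
        for fn in candidates:
            try:
                outcomes.append(("ok", fn(*args)))
            except Exception as exc:
                # Assumption: two candidates that raise the same exception
                # type are treated as behaving the same way.
                outcomes.append(("error", type(exc).__name__))
        if any(outcome != outcomes[0] for outcome in outcomes[1:]):
            return False
    return True


# Hypothetical outputs for "sum the integers from 1 to n", phrased differently.
def sum_to_n_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_n_formula(n: int) -> int:
    return n * (n + 1) // 2


def sum_to_n_buggy(n: int) -> int:
    return sum(range(n))  # off by one: omits n itself


inputs = [(0,), (1,), (5,), (10,)]
consistent = behaviours_agree([sum_to_n_loop, sum_to_n_formula], inputs)
inconsistent = behaviours_agree([sum_to_n_loop, sum_to_n_buggy], inputs)
```

Agreement on a handful of inputs is only evidence of consistency, not proof of equivalence; a real evaluation would need far broader inputs and a principled way to generate the equivalent formulations.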

Impact

SemCoder (NeurIPS 2024) introduced comprehensive semantic reasoning into code LLM training and achieved state-of-the-art results. IdentityChain established self-consistency as a key evaluation dimension for code models. Our work on data contamination has influenced how the community designs and trusts code benchmarks.

Contributors

Baishakhi Ray, Simin Chen, Jinjun Peng, Ira Ceka, Nikolaus Holzer, Yangruibo Ding, Kexin Pei, Saikat Chakraborty, Yuchi Tian

Selected Publications

CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

MK Roy, S Chen, B Steenhoek, J Peng, G Kaiser, B Ray, W Le · ICLR 2026

Benchmarking large language models under data contamination: A survey from static to dynamic evaluation

S Chen, Y Chen, Z Li, Y Jiang, Z Wan, Y He, D Ran, T Gu, H Li, T Xie · EMNLP 2025

Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance

N Holzer, W Fishell, B Ray, M Santolucito · Preprint, 2025

Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action

CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation

DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination

SemCoder: Training code language models with comprehensive semantics reasoning

Cycle: Learning to self-refine the code generation

Vulnerability detection with code language models: How far are we?

TRACED: Execution-aware Pre-training for Source Code

TraceFixer: Execution trace-driven program repair

Beyond accuracy: Evaluating self-consistency of code LLMs with IdentityChain

Concord: Clone-aware contrastive learning for source code

A static evaluation of code completion by large language models

NatGen: Generative pre-training by 'naturalizing' source code

Multi-lingual evaluation of code generation models

Deep learning based vulnerability detection: Are we there yet?

Towards Learning (Dis)-Similarity of Source Code from Program Contrasts