We introduce DriftBench, a benchmark examining whether language models maintain fidelity to original objectives during iterative refinement. Across 2,146 benchmark evaluations of seven models from five providers, iterative pressure increases structural complexity while reducing constraint adherence. We identify a pronounced "knows-but-violates" effect: models accurately restate constraints they simultaneously breach, with violation rates ranging from 8% to 99% depending on the model. Structured checkpointing partially mitigates this issue but does not fully resolve it. Human validation confirms that automated evaluation systems underestimate constraint violations. We release all benchmark materials, including briefs, prompts, rubrics, and transcripts.
@article{kruthof2026driftbench,abbr={arXiv},bibtex_show={true},title={Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation},author={Kruthof, Garvin},journal={arXiv preprint arXiv:2604.28031},year={2026},html={https://arxiv.org/abs/2604.28031}}
We evaluate the Soft Actor-Critic (SAC) algorithm across seven equity datasets spanning over 300 years of out-of-sample data. While SAC shows market-timing potential, it does not systematically outperform a 1/N benchmark in a frictionless setting, and its high turnover leads to negative net returns under modest transaction costs. The results highlight the practical challenges of high-frequency DRL strategies and motivate cost-aware DRL methods and robust validation protocols.
@article{kruthof2025drl,abbr={FRL},bibtex_show={true},title={Can deep reinforcement learning beat 1/N?},author={Kruthof, Garvin and M{\"u}ller, Sebastian},journal={Finance Research Letters},volume={75},pages={106866},year={2025},publisher={Elsevier},doi={10.1016/j.frl.2025.106866},html={https://www.sciencedirect.com/science/article/pii/S154461232500131X},selected={true}}
The textual content of macroeconomic reports from the Federal Reserve and OECD predicts the cross-section of industry returns. Using a two-stage framework that combines large language models with classical machine learning, we extract textual signals that forecast six-month industry returns out-of-sample. An equal-weighted long-short portfolio sorted on these signals earns a six-factor alpha of 91 basis points per month. The effect is weaker for value-weighted portfolios and decays over time, consistent with slow information diffusion and arbitrage frictions.
@article{breitung2025macro,abbr={SSRN},bibtex_show={true},title={Macroeconomic Reports and the Cross-section of Industry Returns},author={Breitung, Christian and Kruthof, Garvin and M{\"u}ller, Sebastian},journal={SSRN Working Paper},year={2025},html={https://ssrn.com/abstract=5685202}}
We assess the intrinsic economic reasoning abilities of large language models, including GPT-3, GPT-4, Llama 2, and Falcon, on a curated dataset of commodity news headlines. We adopt a direct evaluation approach to contextualized sentiment analysis, using economic reasoning to assess the impact of individual commodity news headlines on different industries. All models surpass mono-directional sentiment prediction baselines.
@article{breitung2023sentiment,abbr={SSRN},bibtex_show={true},title={Contextualized Sentiment Analysis using Large Language Models},author={Breitung, Christian and Kruthof, Garvin and M{\"u}ller, Sebastian},journal={SSRN Working Paper},year={2023},html={https://ssrn.com/abstract=4615038}}