A study on Qwen3.6 27B explores how fine-tuning affects fluency in agentic coding harnesses. It evaluates the base model and fine-tuned versions on Terminal-Bench 2.0 across different harnesses. The findings suggest that harness-specific training can change model behavior, though the results are highly sensitive to the training data and to interface design. On the Pi harness the base model scored highest. The v2 reasoning-distilled model recovered most of the base model's performance and behaved differently: it decomposed and validated tasks more strongly, but it also tended to over-explore and time out. Source: huggingface

The research asked whether agentic coding harness fluency can be distilled into a strong open-weight model such as Qwen3.6 27B, which is both capable and runnable on consumer hardware. The study tested the model on three harnesses (Codex CLI, OpenHands, and Pi) and two inference engines. The initial focus was Codex CLI, but practical issues such as poor local model support and infrastructure errors led to a shift to OpenHands. The OpenHands results showed that fine-tuning could improve on base model behavior, but the v2 model regressed substantially there, often getting stuck in longer loops.

The Pi harness was chosen for its lightweight, minimal design, which makes it well suited to fine-tuning experiments. On Pi, the base model scored 42.70%, the v1 model 28.09%, and the v2 model 40.45%. The v2 model therefore recovered most of the base model's performance but did not exceed it. The study attributes part of the v2 model's behavior to its dataset, which included reasoning traces: these led to stronger task decomposition and debugging, but also to a tendency to over-explore.
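
The three Pi scores can be compared with a little arithmetic. The sketch below is illustrative and not taken from the study: the names `PI_SCORES`, `point_gap`, and `gap_recovered` are hypothetical, and measuring "recovered most" as the share of the v1 regression that v2 won back is an assumed framing, one of several reasonable ones.

```python
# Pi harness scores as reported in the summary (percent).
# The dictionary and function names are hypothetical, chosen for illustration.
PI_SCORES = {"base": 42.70, "v1": 28.09, "v2": 40.45}


def point_gap(scores, model, reference="base"):
    """Difference in percentage points between a model and the reference."""
    return scores[model] - scores[reference]


def gap_recovered(scores, low="v1", high="base", candidate="v2"):
    """Fraction of the drop from `high` to `low` that `candidate` won back.

    Assumed metric: 0.0 means no better than `low`, 1.0 means level with `high`.
    """
    return (scores[candidate] - scores[low]) / (scores[high] - scores[low])


if __name__ == "__main__":
    for name in ("v1", "v2"):
        print(f"{name}: {point_gap(PI_SCORES, name):+.2f} points vs base")
    print(f"v2 recovered {gap_recovered(PI_SCORES):.1%} of the v1 regression")
```

Under this framing, v2 sits 2.25 points below the base model and wins back roughly 85% of the 14.61-point drop seen with v1, which matches the description of v2 recovering most, but not all, of the base performance.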