Research

Qwen3.6 27B Fine-Tuning Shows Harness-Specific Behavior Shifts

Research on Qwen3.6 27B finds that harness-specific fine-tuning can substantially alter model behavior: on the Pi harness, the v2 fine-tune scored 40.45%, just below the base model's 42.70%.


A study on Qwen3.6 27B explores how fine-tuning affects fluency with agentic coding harnesses. The research relies on Terminal-Bench 2.0 evaluations, comparing the base model with fine-tuned versions across different harnesses. The findings suggest that harness-specific training can noticeably change model behavior, though the results are highly sensitive to training data and interface design. The base model performed best on the Pi harness. The v2 reasoning-distilled model came close to the base model's score and behaved differently: it decomposed and validated its work more thoroughly, but it also tended to over-explore and time out. Source: huggingface

The research aimed to investigate how agentic coding harness fluency could be distilled into a strong open-weight model like Qwen3.6 27B, which is both capable and runnable on consumer hardware. The study tested the model across three harnesses—Codex CLI, OpenHands, and Pi—and two inference engines. The initial focus was on Codex CLI, but practical issues like poor local model support and infrastructure errors led to a shift to OpenHands. The OpenHands results showed that fine-tuning could improve base model behavior, but the v2 model experienced a major regression, often getting stuck in longer loops. Source: huggingface

The Pi harness was chosen for its lightweight and minimal design, making it ideal for fine-tuning experiments. The study found that the base model achieved a 42.70% score on the Pi harness, while the v1 model scored 28.09% and the v2 model scored 40.45%. The v2 model recovered most of the base model's performance but did not exceed it. The study also noted that the v2 model's performance was influenced by the dataset, which included reasoning traces, leading to stronger task decomposition and debugging but also a tendency to over-explore. Source: huggingface
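To make "recovered most of the base model's performance" concrete, the three Pi scores can be compared directly. The following is a minimal sketch that uses only the percentages reported above; the helper names and the "recovered fraction" metric are illustrative assumptions, not something the study itself defines.

```python
# Pi harness scores as reported in the article (percent on Terminal-Bench 2.0).
PI_SCORES = {"base": 42.70, "v1": 28.09, "v2": 40.45}


def gap_to_base(scores, model):
    """Percentage-point gap between the base model and a fine-tuned model."""
    return round(scores["base"] - scores[model], 2)


def recovered_fraction(scores):
    """Share of the v1 regression that v2 won back.

    Hypothetical metric for illustration only; it is not taken from the study.
    """
    lost = scores["base"] - scores["v1"]       # points lost by v1
    regained = scores["v2"] - scores["v1"]     # points v2 gained over v1
    return regained / lost


if __name__ == "__main__":
    print("v1 gap to base (points):", gap_to_base(PI_SCORES, "v1"))
    print("v2 gap to base (points):", gap_to_base(PI_SCORES, "v2"))
    print(f"v2 recovered {recovered_fraction(PI_SCORES):.1%} of the v1 regression")
```

Under this reading, v1 trails the base model by 14.61 percentage points and v2 by 2.25, so v2 wins back roughly 85% of what v1 lost while still finishing below the base model.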

Key points

Fine-tuning Qwen3.6 27B for agentic coding harnesses changed its behavior substantially, and the effect differed by harness.
The base model achieved a 42.70% score on the Pi harness compared to the v1 model's 28.09% and the v2 model's 40.45%.
The v2 model demonstrated stronger task decomposition and debugging but also a tendency to over-explore and time out.
The study found that harness-specific fine-tuning can improve base model behavior but is highly sensitive to training data and interface design.
The research used Terminal-Bench 2.0 evaluations to assess model performance across different harnesses.
The v2 model was fine-tuned on a dataset including reasoning traces, which influenced its behavioral profile.
The base model remained the strongest Pi run, but the v2 model recovered most of its performance.

Source: Hugging Face

WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.