A new paradigm in code generation from Apple: DiffuCoder brings the iterative refinement of diffusion models to coding assistance, breaking free of the limits of traditional linear, left-to-right generation.
DiffuCoder's core innovation is that it is non-autoregressive. Unlike traditional GPT-style models that generate tokens one by one from left to right, DiffuCoder treats code more like an image: starting from complete "noise" (a fully masked sequence), it iterates globally, gradually "denoising" into a clear code structure.
* Left: Linear thinking. Once a mistake is made early on, it is hard to correct.
* Right: Global thinking. The model sees the "whole picture" and refines details iteratively.
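To make the contrast concrete, here is a minimal toy sketch of confidence-based iterative unmasking, the decoding style used by masked diffusion language models. `predict_model`, the mask token, and the commit schedule are all illustrative assumptions, not DiffuCoder's actual implementation.

```python
# Toy sketch of non-autoregressive "denoising" over discrete tokens.
# `predict_model` is a hypothetical stand-in for the network: given the
# current partially masked sequence, it returns (position, token, confidence)
# guesses for every still-masked slot. NOT DiffuCoder's real code.
MASK = "<mask>"

def diffusion_decode(predict_model, length, steps):
    seq = [MASK] * length                  # start from pure "noise"
    per_step = max(1, length // steps)     # how many tokens to commit per step
    for _ in range(steps):
        if MASK not in seq:
            break                          # fully denoised
        guesses = predict_model(seq)       # guesses for all masked positions
        # Commit only the most confident guesses; the rest stay masked and
        # are re-predicted next step with more surrounding context.
        for pos, token, _conf in sorted(guesses, key=lambda g: -g[2])[:per_step]:
            seq[pos] = token
    return seq
```

Because every position can be revised in light of the whole sequence, early guesses are not locked in the way they are in left-to-right decoding, which is exactly what the "global thinking" panel above depicts.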
Despite its unconventional generation method, DiffuCoder-7B-Instruct is surprisingly competitive on code benchmarks, demonstrating the potential of diffusion models for discrete data.
| Benchmark | DiffuCoder-7B-Instruct | Base Model | Improvement |
|---|---|---|---|
| HumanEval (Python) | 78.8% | 64.3% | +14.5 pts |
| EvalPlus | ~55.4% | N/A | Strong |
Source: arXiv:2506.20639 & Community Benchmarks
The model can be loaded via Hugging Face Transformers. Note that the core entry point is `diffusion_generate`, not the usual `generate`.
```python
import torch
from transformers import AutoModel, AutoTokenizer

# 1. Load model & tokenizer (custom modeling code, hence trust_remote_code)
model_path = "apple/DiffuCoder-7B-Instruct"
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 2. Build prompt (Qwen-style chat template)
query = "Write a Python function to merge two sorted lists."
prompt = f"<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"

# 3. Diffusion generation
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.diffusion_generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=256,
    steps=128,                     # number of denoising steps
    temperature=0.3,               # sampling temperature
    top_p=0.95,
    return_dict_in_generate=True,  # required for output.sequences below
)

# 4. Decode output
print(tokenizer.decode(output.sequences[0], skip_special_tokens=True))
```

Traditional autoregressive (AR) models predict the next token sequentially, making it hard to "plan ahead" or "look back". Diffusion models first generate a fuzzy outline of the code (signatures, overall logic) and then fill in the details. This coarse-to-fine mechanism can be more robust for handling complex, long-range dependencies.
Yes, it can run locally on Apple Silicon. Although it is a 7B (effectively ~8B) parameter model, it is based on the Transformer architecture and can be quantized (e.g., to 4-bit) using the MLX framework. The community has successfully run quantized versions on Apple Silicon with decent inference speeds.
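A rough sketch of that local workflow, assuming the checkpoint (or a community port of it) works with the standard mlx-lm conversion flow; since DiffuCoder ships custom modeling code, stock mlx-lm may not support it out of the box:

```bash
# Hypothetical workflow: 4-bit quantization with mlx-lm on Apple Silicon.
# A community MLX port of the checkpoint may be required; treat this as a
# starting point, not a verified recipe.
pip install mlx-lm
python -m mlx_lm.convert --hf-path apple/DiffuCoder-7B-Instruct -q
```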
Diffusion inference is typically slower than a single pass of an AR model because it requires multiple denoising timesteps. However, since decoding is non-autoregressive, many token positions can be refined in parallel within each step. You can trade quality for speed by adjusting the `steps` parameter, as sketched below.
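A rough illustration of that tradeoff, reusing `model` and `inputs` from the loading example above; the step counts are arbitrary and the timings are whatever your hardware produces, not published figures.

```python
# Sketch: sweep the number of denoising steps to trade quality for speed.
# Reuses `model` and `inputs` from the quick-start example above.
import time

for steps in (32, 64, 128):
    start = time.time()
    output = model.diffusion_generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
        steps=steps,        # fewer steps -> faster, coarser refinement
        temperature=0.3,
        top_p=0.95,
        return_dict_in_generate=True,
    )
    print(f"steps={steps}: {time.time() - start:.1f}s")
```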
The model is released under Apple's custom apple-amlr license. Although the weights are openly available, read the LICENSE file in the repository carefully before any commercial use.
Based on the arXiv:2506.20639 paper and GitHub repository data.