Understanding Policy Distillation in Reinforcement Learning
An exploration of how knowledge can be transferred between neural networks using policy distillation, with applications to efficient AI deployment.
Policy distillation is a technique for transferring knowledge from one neural network (the “teacher”) to another (the “student”). This has profound implications for deploying AI systems efficiently.
The Core Idea
The fundamental insight is that we can train a smaller, more efficient network to mimic the behavior of a larger, more capable one. Consider the reverse KL divergence between the student policy $\pi_\theta$ and the teacher policy $\pi_T$:

$$D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_T) = \sum_a \pi_\theta(a \mid s) \log \frac{\pi_\theta(a \mid s)}{\pi_T(a \mid s)}$$

This measures how much the student policy diverges from the teacher policy. The choice of KL direction matters significantly: reverse KL tends to produce mode-seeking behavior (the student commits sharply to one of the teacher's preferred actions), while forward KL, $D_{\mathrm{KL}}(\pi_T \,\|\, \pi_\theta)$, is mean-seeking (the student spreads probability mass over everything the teacher might do).
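The asymmetry is easy to see numerically. The following sketch compares both KL directions for a made-up bimodal teacher and two candidate student fits; all the distributions and their values are illustrative, not taken from any real policy:

```python
import numpy as np

# Toy illustration of the KL asymmetry. All distributions here are made up:
# a bimodal "teacher" over 4 actions and two candidate "student" fits.
teacher      = np.array([0.49, 0.01, 0.49, 0.01])  # two sharp modes
student_mode = np.array([0.94, 0.02, 0.02, 0.02])  # commits to one mode
student_mean = np.array([0.30, 0.20, 0.30, 0.20])  # spreads over everything

def kl(p, q):
    """KL(p || q) = sum_a p(a) * log(p(a) / q(a))."""
    return float(np.sum(p * np.log(p / q)))

# Forward KL(teacher || student) heavily penalizes a student that assigns
# little mass where the teacher has a lot, so it favors the broad student.
fwd_mode = kl(teacher, student_mode)
fwd_mean = kl(teacher, student_mean)

# Reverse KL(student || teacher) heavily penalizes a student that puts mass
# where the teacher has almost none, so it favors the one-mode student.
rev_mode = kl(student_mode, teacher)
rev_mean = kl(student_mean, teacher)

print(f"forward KL: mode-fit={fwd_mode:.3f}, mean-fit={fwd_mean:.3f}")
print(f"reverse KL: mode-fit={rev_mode:.3f}, mean-fit={rev_mean:.3f}")
```

With these numbers, the mode-seeking student scores better under reverse KL and worse under forward KL, matching the intuition above.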
Why This Matters
There are several compelling reasons to use policy distillation:
- Deployment efficiency — Smaller models require less compute and memory
- Knowledge consolidation — Combine knowledge from multiple specialists into one generalist
- Continual learning — Distill old knowledge before training on new tasks
The loss function we typically optimize is:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{s \sim \mathcal{D}}\!\left[\sum_a \pi_T^{(\tau)}(a \mid s)\, \log \pi_\theta(a \mid s)\right]$$

This is essentially the cross-entropy between the teacher's action distribution and the student's. In practice, we often use a temperature parameter $\tau$ to "soften" the teacher's distribution, written $\pi_T^{(\tau)}$ above, which can improve learning stability.
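A minimal sketch of this cross-entropy loss, assuming both policies expose logits over a discrete action space (the logit values and the default temperature of 2.0 are illustrative choices, not prescribed by any particular paper):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher distribution and the
    student distribution, averaged over a batch of states."""
    p_teacher = softmax(teacher_logits, temperature)   # softened targets
    log_q = np.log(softmax(student_logits))            # student log-probs
    return float(-(p_teacher * log_q).sum(axis=-1).mean())

# Made-up logits for a batch of 2 states and 3 actions.
teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student_logits = np.array([[3.5, 1.2, 0.3], [0.1, 2.8, 0.4]])
loss = distillation_loss(teacher_logits, student_logits)
```

In a real training loop, the gradient of this loss with respect to the student's parameters would be computed by an autodiff framework; numpy is used here only to keep the sketch self-contained.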
Implementation Considerations
When implementing policy distillation, consider these factors:
Data Collection
You need a dataset of states to query the teacher policy on. Options include:
- On-policy data — States visited by the student during training
- Off-policy data — States from a replay buffer or external dataset
- Synthetic data — Generated states that cover important regions
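The off-policy option can be sketched as follows. Everything here is hypothetical: the buffer contents are random placeholders and the "teacher" is a stand-in linear layer, since the source doesn't specify an environment or architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: dimensions, buffer contents, and teacher weights
# below are all made up for illustration.
state_dim, n_actions, buffer_size = 4, 3, 1000
replay_buffer = rng.normal(size=(buffer_size, state_dim))  # off-policy states
teacher_w = rng.normal(size=(state_dim, n_actions))

def teacher_logits(states):
    """Stand-in teacher policy: a single linear layer over state features."""
    return states @ teacher_w

def sample_distillation_batch(batch_size=32):
    """Draw states from the replay buffer and label them with teacher logits."""
    idx = rng.integers(0, buffer_size, size=batch_size)
    states = replay_buffer[idx]
    return states, teacher_logits(states)

states, targets = sample_distillation_batch()
```

The student is then trained on these (state, teacher-logit) pairs; on-policy collection differs only in where the states come from.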
Temperature Scaling
Applying temperature $\tau$ to the softmax over logits $z_a$:

$$\pi^{(\tau)}(a \mid s) = \frac{\exp(z_a / \tau)}{\sum_{a'} \exp(z_{a'} / \tau)}$$

Higher temperatures produce softer distributions, revealing more information about the teacher's relative action preferences. A temperature of 1.0 recovers the original distribution; values greater than 1 push it toward uniform, while values less than 1 make it more peaked.
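A quick sketch of this effect, using made-up logits; entropy serves as the measure of how "soft" each distribution is:

```python
import numpy as np

def softmax_t(logits, temperature):
    """Softmax with temperature: p(a) proportional to exp(z_a / T)."""
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats; higher means closer to uniform."""
    return float(-(p * np.log(p)).sum())

logits = np.array([2.0, 1.0, 0.1])   # illustrative action logits

cold = softmax_t(logits, 0.5)   # T < 1: sharper, more peaked
base = softmax_t(logits, 1.0)   # T = 1: the original distribution
hot  = softmax_t(logits, 4.0)   # T > 1: flatter, closer to uniform
```

As temperature rises, the maximum probability drops and the entropy climbs, which is exactly the "softening" described above.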
Conclusion
Policy distillation is a powerful technique for creating efficient AI systems. By transferring knowledge from large models to smaller ones, we can deploy capable systems with reduced computational requirements.
The math may look intimidating, but the core idea is simple: learn to imitate an expert. This principle appears throughout machine learning, from behavioral cloning to knowledge distillation in supervised learning.