【深度学习精通】第24章 | 神经架构搜索与AutoDL - 自动化设计网络
摘要:本文介绍了神经架构搜索(NAS)的核心概念与实现方法。NAS旨在自动搜索最优的神经网络架构,一个完整的NAS系统包含搜索空间设计(链式/细胞结构)、搜索策略(强化学习、进化算法等)和性能评估三要素。本文重点讲解基于强化学习的NAS实现,包括RNN控制器架构和搜索过程,并提供PyTorch代码示例;同时涵盖权重共享、AutoML工具应用等前沿技术,旨在帮助读者掌握NAS的核心算法与实践技能。
环境声明
- Python版本:Python 3.10+
- PyTorch版本:PyTorch 2.0+
- 开发工具:PyCharm 或 VS Code
- 操作系统:Windows / macOS / Linux (通用)
- 额外依赖:
torch>=2.0.0,numpy>=1.24.0,optuna>=3.0.0
学习目标和摘要
学习目标:
- 理解神经架构搜索(NAS)的核心概念与搜索空间设计
- 掌握基于强化学习、进化算法和可微分方法的NAS技术
- 学会使用权重共享和超网技术加速架构搜索
- 了解AutoML工具的实际应用
- 能够使用PyTorch实现简化版NAS算法
文章摘要:神经架构搜索(NAS)是自动化机器学习(AutoML)的核心技术,旨在自动发现最优神经网络架构。本章将系统讲解NAS的搜索空间设计、主流搜索策略(强化学习、进化算法、DARTS)、权重共享技术(Once-for-All、BigNAS),以及AutoFormer等面向Transformer的前沿进展,并提供完整的PyTorch实现代码。
1. NAS概述与搜索空间
1.1 什么是神经架构搜索
神经架构搜索(Neural Architecture Search, NAS)是一种自动化设计神经网络架构的技术。传统上,神经网络架构的设计依赖于人类专家的经验和直觉,需要大量的试错和调优。NAS的目标是让算法自动在预定义的搜索空间中寻找最优的网络结构。
NAS的核心思想可以类比为:如果深度学习是教计算机"学习",那么NAS就是教计算机"学会如何学习"——即自动发现最适合特定任务的网络结构。
1.2 NAS的三要素
一个完整的NAS系统包含三个核心组件:
| 组件 | 描述 | 示例 |
|---|---|---|
| 搜索空间 | 定义所有可能的架构集合 | 链式结构、细胞结构 |
| 搜索策略 | 如何在搜索空间中探索 | 强化学习、进化算法、梯度下降 |
| 性能评估 | 如何评估候选架构的质量 | 验证集准确率、参数量、FLOPs |
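这三个组件的配合可以用一个极简的搜索主循环来理解。下面是一个示意性骨架,其中 strategy、search_space、evaluate 均为假设的接口,仅用于说明流程,并非任何具体库的API:
# NAS主循环骨架(示意):三要素如何配合
# 注意:strategy/search_space/evaluate均为假设的接口
best_arch, best_score = None, float('-inf')
for step in range(search_budget):
    arch = strategy.propose(search_space)   # 搜索策略:从搜索空间中提出候选架构
    score = evaluate(arch)                  # 性能评估:如验证集准确率
    strategy.update(arch, score)            # 用评估反馈更新搜索策略
    if score > best_score:
        best_arch, best_score = arch, score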
1.3 搜索空间设计
1.3.1 链式结构搜索空间
链式结构是最简单的搜索空间,网络由一系列层顺序连接而成。每层的选择包括:
- 卷积层:核大小(3x3、5x5)、通道数(32、64、128)
- 池化层:最大池化、平均池化
- 跳跃连接:是否添加残差连接
- 激活函数:ReLU、Sigmoid、Tanh
# 链式结构搜索空间示例
import torch
import torch.nn as nn
class ChainSearchSpace:
"""链式结构搜索空间定义"""
def __init__(self):
# 可选择的操作
self.operations = [
'conv_3x3', # 3x3卷积
'conv_5x5', # 5x5卷积
'dconv_3x3', # 3x3空洞卷积
'max_pool', # 最大池化
'avg_pool', # 平均池化
'skip_connect', # 跳跃连接
'none' # 无连接
]
# 可选择的通道数
self.channels = [16, 32, 64, 128]
# 网络深度范围
self.min_depth = 3
self.max_depth = 8
def sample_architecture(self):
"""随机采样一个架构"""
import random
depth = random.randint(self.min_depth, self.max_depth)
architecture = []
for i in range(depth):
op = random.choice(self.operations)
ch = random.choice(self.channels)
architecture.append({'layer': i, 'op': op, 'channels': ch})
return architecture
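一个简单的使用示例:实例化搜索空间并采样一个随机架构(打印结果仅为示意):
# 使用示例:采样并打印一个随机链式架构
space = ChainSearchSpace()
arch = space.sample_architecture()
for layer in arch:
    print(layer)  # 形如 {'layer': 0, 'op': 'conv_3x3', 'channels': 64}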
1.3.2 细胞结构搜索空间
细胞结构(Cell-based)搜索空间是更高级的设计,将网络分解为重复的"细胞"(Cell)或"块"(Block)。每个细胞内部有复杂的连接结构。
细胞结构的优势:
- 模块化:相同的细胞可以重复使用
- 可迁移性:在CIFAR-10上搜索的细胞可迁移到ImageNet
- 搜索效率高:只需搜索细胞内部结构
# 细胞结构搜索空间
class CellSearchSpace:
"""细胞结构搜索空间(类似DARTS)"""
def __init__(self, num_nodes=4):
self.num_nodes = num_nodes # 细胞内的节点数
# 候选操作集合
self.primitives = [
'none',
'max_pool_3x3',
'avg_pool_3x3',
'skip_connect',
'sep_conv_3x3', # 可分离卷积
'sep_conv_5x5',
'dil_conv_3x3', # 空洞卷积
'dil_conv_5x5'
]
    def get_num_edges(self):
        """计算细胞内的边数"""
        # 第i个节点有(2+i)条入边,总边数为 2 + 3 + ... + (N+1) = N(N+3)/2
        n = self.num_nodes
        return n * (n + 3) // 2
def sample_cell(self):
"""随机采样一个细胞结构"""
import random
num_edges = self.get_num_edges()
cell = []
for edge in range(num_edges):
# 每条边选择一个操作
op = random.choice(self.primitives)
cell.append(op)
return cell
2. 基于强化学习的NAS
2.1 RNN控制器
强化学习NAS使用一个RNN作为控制器(Controller)来生成网络架构描述。控制器输出一系列决策(如选择什么操作、多少通道),这些决策定义了一个神经网络架构。
核心思想:
- 控制器生成架构 -> 训练该架构 -> 获得验证准确率作为奖励 -> 更新控制器策略
import torch
import torch.nn as nn
import torch.nn.functional as F
class ControllerRNN(nn.Module):
"""RNN控制器:生成网络架构描述"""
def __init__(self, num_layers=6, num_ops=7, hidden_size=64):
super(ControllerRNN, self).__init__()
self.num_layers = num_layers # 网络层数
self.num_ops = num_ops # 可选操作数
self.hidden_size = hidden_size
# 嵌入层
self.embedding = nn.Embedding(num_ops, hidden_size)
# LSTM控制器
self.lstm = nn.LSTMCell(hidden_size, hidden_size)
# 输出层:预测每个位置的操作
self.ops_classifier = nn.Linear(hidden_size, num_ops)
# 初始化隐藏状态
self.init_hidden = nn.Parameter(
torch.zeros(1, hidden_size), requires_grad=True
)
self.init_cell = nn.Parameter(
torch.zeros(1, hidden_size), requires_grad=True
)
def forward(self, batch_size=1):
"""生成一个架构描述"""
# 初始化隐藏状态
hidden = self.init_hidden.expand(batch_size, -1)
cell_state = self.init_cell.expand(batch_size, -1)
# 存储所有层的操作选择
log_probs = []
actions = []
        # 输入起始标记(用0表示),与模型参数放在同一设备上
        device = next(self.parameters()).device
        inputs = torch.zeros(batch_size, dtype=torch.long, device=device)
for layer in range(self.num_layers):
# 嵌入输入
embed = self.embedding(inputs)
# LSTM前向传播
hidden, cell_state = self.lstm(embed, (hidden, cell_state))
# 预测操作
logits = self.ops_classifier(hidden)
probs = F.softmax(logits, dim=-1)
# 采样操作
action = torch.multinomial(probs, 1).squeeze(1)
# 计算对数概率(用于策略梯度)
log_prob = F.log_softmax(logits, dim=-1)
log_prob = log_prob.gather(1, action.unsqueeze(1)).squeeze(1)
log_probs.append(log_prob)
actions.append(action)
# 下一个输入是当前选择的操作
inputs = action
return torch.stack(actions, dim=1), torch.stack(log_probs, dim=1)
2.2 策略梯度训练
控制器使用REINFORCE算法(策略梯度)进行训练。奖励是生成的网络在验证集上的准确率。
class ReinforceTrainer:
"""使用REINFORCE算法训练控制器"""
def __init__(self, controller, baseline=0.0):
self.controller = controller
self.baseline = baseline # 基线,用于减小方差
self.optimizer = torch.optim.Adam(controller.parameters(), lr=0.001)
def train_step(self, rewards, log_probs):
"""
参数:
rewards: [batch_size],每个架构的验证准确率
log_probs: [batch_size, num_layers],每个操作的对数概率
"""
# 计算损失: -E[R * log P(a)]
# 使用基线减小方差
advantages = rewards - self.baseline
# 损失 = -sum(log_prob * advantage)
loss = -(log_probs.sum(dim=1) * advantages).mean()
# 更新基线(移动平均)
self.baseline = 0.9 * self.baseline + 0.1 * rewards.mean().item()
# 反向传播
self.optimizer.zero_grad()
loss.backward()
# 梯度裁剪
torch.nn.utils.clip_grad_norm_(self.controller.parameters(), 5.0)
self.optimizer.step()
return loss.item()
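把控制器和REINFORCE训练器串联起来,外层搜索循环大致如下。这是一个示意性草图,其中 evaluate_architecture 是假设的函数,负责根据 actions 构建对应网络、短暂训练并返回验证准确率:
# RL-NAS外层搜索循环(示意)
# evaluate_architecture为假设的函数:构建actions对应的网络并返回验证准确率
controller = ControllerRNN(num_layers=6, num_ops=7)
trainer = ReinforceTrainer(controller)
for step in range(100):
    # 每步采样一批架构并逐个评估
    actions, log_probs = controller(batch_size=8)
    rewards = torch.tensor(
        [evaluate_architecture(a.tolist()) for a in actions]
    )
    loss = trainer.train_step(rewards, log_probs)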
3. 可微分架构搜索(DARTS)
3.1 连续松弛
DARTS(Differentiable Architecture Search)的核心创新是将离散的架构选择松弛为连续的,从而可以使用梯度下降优化。
关键思想:
- 传统NAS:每条边选择一个操作(离散选择)
- DARTS:每条边是所有操作的加权和(连续松弛)
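用公式表示,DARTS论文中对节点对 $(i, j)$ 之间的边定义混合操作(其中 $\mathcal{O}$ 为候选操作集合,$\alpha$ 为可学习的架构参数):

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\, o(x)$$

搜索结束后,每条边只保留 $\alpha$ 最大的操作,即可导出离散架构。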
class MixedOp(nn.Module):
"""混合操作:所有候选操作的加权和"""
def __init__(self, C, stride, primitives):
super(MixedOp, self).__init__()
self.ops = nn.ModuleList()
for primitive in primitives:
op = self._create_op(primitive, C, stride)
self.ops.append(op)
# 架构参数(可学习)
self.alphas = nn.Parameter(torch.zeros(len(primitives)))
    def _create_op(self, op_name, C, stride):
        """根据名称创建操作"""
        if op_name == 'none':
            return Zero(stride)
        elif op_name == 'skip_connect':
            return Identity() if stride == 1 else FactorizedReduce(C, C)
        elif op_name == 'max_pool_3x3':
            return nn.MaxPool2d(3, stride=stride, padding=1)
        elif op_name == 'avg_pool_3x3':
            return nn.AvgPool2d(3, stride=stride, padding=1, count_include_pad=False)
        elif op_name == 'conv_3x3':
            return ReLUConvBN(C, C, 3, stride, 1)
        elif op_name == 'conv_5x5':
            return ReLUConvBN(C, C, 5, stride, 2)
        elif op_name == 'sep_conv_3x3':
            return SepConv(C, C, 3, stride, 1)
        elif op_name == 'sep_conv_5x5':
            return SepConv(C, C, 5, stride, 2)
        elif op_name == 'dil_conv_3x3':
            return DilConv(C, C, 3, stride, 2, 2)
        elif op_name == 'dil_conv_5x5':
            return DilConv(C, C, 5, stride, 4, 2)
        else:
            raise ValueError(f"Unknown operation: {op_name}")
    def forward(self, x, weights=None):
        """前向传播:所有操作的加权和。外部不传weights时使用自身的alphas"""
        if weights is None:
            # 使用softmax将架构参数归一化为混合权重
            weights = F.softmax(self.alphas, dim=0)
        # 加权求和
        return sum(w * op(x) for w, op in zip(weights, self.ops))
# 辅助操作定义
class ReLUConvBN(nn.Module):
"""ReLU + Conv + BN"""
def __init__(self, C_in, C_out, kernel_size, stride, padding):
super(ReLUConvBN, self).__init__()
self.op = nn.Sequential(
nn.ReLU(inplace=False),
nn.Conv2d(C_in, C_out, kernel_size, stride, padding, bias=False),
nn.BatchNorm2d(C_out)
)
def forward(self, x):
return self.op(x)
class SepConv(nn.Module):
    """可分离卷积(深度卷积+逐点卷积,重复两次)"""
    def __init__(self, C_in, C_out, kernel_size, stride, padding):
        super(SepConv, self).__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding,
                      groups=C_in, bias=False),
            nn.Conv2d(C_in, C_in, 1, 1, 0, bias=False),
            nn.BatchNorm2d(C_in),
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel_size, 1, padding,
                      groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, 1, 0, bias=False),
            nn.BatchNorm2d(C_out)
        )
    def forward(self, x):
        return self.op(x)
class DilConv(nn.Module):
    """空洞可分离卷积(dil_conv_3x3/dil_conv_5x5所需)"""
    def __init__(self, C_in, C_out, kernel_size, stride, padding, dilation):
        super(DilConv, self).__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding,
                      dilation=dilation, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, 1, 0, bias=False),
            nn.BatchNorm2d(C_out)
        )
    def forward(self, x):
        return self.op(x)
class Identity(nn.Module):
"""恒等映射"""
def forward(self, x):
return x
class Zero(nn.Module):
"""零操作"""
def __init__(self, stride):
super(Zero, self).__init__()
self.stride = stride
def forward(self, x):
if self.stride == 1:
return x.mul(0.0)
return x[:, :, ::self.stride, ::self.stride].mul(0.0)
class FactorizedReduce(nn.Module):
"""降维操作"""
def __init__(self, C_in, C_out):
super(FactorizedReduce, self).__init__()
self.conv_1 = nn.Conv2d(C_in, C_out // 2, 1, stride=2, bias=False)
self.conv_2 = nn.Conv2d(C_in, C_out // 2, 1, stride=2, bias=False)
self.bn = nn.BatchNorm2d(C_out)
def forward(self, x):
out = torch.cat([self.conv_1(x), self.conv_2(x[:, :, 1:, 1:])], dim=1)
return self.bn(out)
3.2 双层优化
DARTS使用双层优化(Bilevel Optimization):
- 内层:优化网络权重(在训练集上)
- 外层:优化架构参数(在验证集上)
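形式化地,DARTS论文将其写成如下双层优化问题:

$$\min_{\alpha}\ \mathcal{L}_{\text{val}}\big(w^{*}(\alpha), \alpha\big) \qquad \text{s.t.}\quad w^{*}(\alpha) = \arg\min_{w}\ \mathcal{L}_{\text{train}}(w, \alpha)$$

精确求解 $w^{*}(\alpha)$ 代价过高,实践中常用一阶近似:交替地在训练集上更新 $w$、在验证集上更新 $\alpha$,即下面训练器采用的做法。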
class DARTSTrainer:
"""DARTS训练器"""
def __init__(self, model, args):
self.model = model
# 网络权重优化器
self.w_optimizer = torch.optim.SGD(
model.weights(),
lr=args.learning_rate,
momentum=args.momentum,
weight_decay=args.weight_decay
)
        # 架构参数优化器(与后文DARTSNetwork的arch_parameters()接口保持一致)
        self.alpha_optimizer = torch.optim.Adam(
            model.arch_parameters(),
            lr=args.arch_learning_rate,
            betas=(0.5, 0.999),
            weight_decay=args.arch_weight_decay
        )
self.scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
self.w_optimizer, float(args.epochs), eta_min=args.learning_rate_min
)
def train_step(self, train_data, valid_data):
"""单步训练"""
# 步骤1:在训练集上更新网络权重
self.w_optimizer.zero_grad()
logits = self.model(train_data[0])
loss = F.cross_entropy(logits, train_data[1])
loss.backward()
self.w_optimizer.step()
# 步骤2:在验证集上更新架构参数
self.alpha_optimizer.zero_grad()
logits = self.model(valid_data[0])
loss = F.cross_entropy(logits, valid_data[1])
loss.backward()
self.alpha_optimizer.step()
return loss.item()
def derive_architecture(self):
"""从连续松弛中导出离散架构"""
return self.model.genotype()
4. 基于进化的NAS
4.1 遗传算法
进化算法模拟自然选择过程,通过选择、交叉、变异等操作在种群中搜索最优架构。
import random
import copy
class Individual:
"""个体:表示一个网络架构"""
def __init__(self, architecture):
self.architecture = architecture
self.fitness = None # 验证准确率
self.params = None # 参数量
self.flops = None # 计算量
class GeneticNAS:
"""基于遗传算法的NAS"""
def __init__(self, population_size=50, mutation_rate=0.1,
crossover_rate=0.8, generations=100):
self.population_size = population_size
self.mutation_rate = mutation_rate
self.crossover_rate = crossover_rate
self.generations = generations
self.population = []
self.best_individual = None
def initialize_population(self, search_space):
"""初始化种群"""
self.population = []
for _ in range(self.population_size):
arch = search_space.sample_architecture()
individual = Individual(arch)
self.population.append(individual)
def evaluate_population(self, train_fn):
"""评估种群中所有个体"""
for individual in self.population:
if individual.fitness is None:
fitness, params, flops = train_fn(individual.architecture)
individual.fitness = fitness
individual.params = params
individual.flops = flops
def select_parent(self):
"""锦标赛选择"""
tournament_size = 3
tournament = random.sample(self.population, tournament_size)
return max(tournament, key=lambda x: x.fitness)
def crossover(self, parent1, parent2):
"""单点交叉"""
if random.random() > self.crossover_rate:
return copy.deepcopy(parent1)
arch1 = parent1.architecture
arch2 = parent2.architecture
# 选择交叉点
point = random.randint(1, min(len(arch1), len(arch2)) - 1)
# 创建子代
child_arch = arch1[:point] + arch2[point:]
return Individual(child_arch)
def mutate(self, individual, search_space):
"""变异操作"""
arch = copy.deepcopy(individual.architecture)
for i in range(len(arch)):
if random.random() < self.mutation_rate:
# 随机改变这一层的操作或通道数
if random.random() < 0.5:
arch[i]['op'] = random.choice(search_space.operations)
else:
arch[i]['channels'] = random.choice(search_space.channels)
return Individual(arch)
def evolve(self, search_space, train_fn):
"""进化主循环"""
# 初始化
self.initialize_population(search_space)
for generation in range(self.generations):
print(f"Generation {generation + 1}/{self.generations}")
# 评估
self.evaluate_population(train_fn)
# 记录最优
current_best = max(self.population, key=lambda x: x.fitness)
if self.best_individual is None or \
current_best.fitness > self.best_individual.fitness:
self.best_individual = copy.deepcopy(current_best)
print(f" Best fitness: {self.best_individual.fitness:.4f}")
# 创建新一代
new_population = [self.best_individual] # 保留最优个体
while len(new_population) < self.population_size:
parent1 = self.select_parent()
parent2 = self.select_parent()
child = self.crossover(parent1, parent2)
child = self.mutate(child, search_space)
new_population.append(child)
self.population = new_population
return self.best_individual
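下面把 GeneticNAS 与前文的 ChainSearchSpace 串联起来做一次演示。注意 dummy_train_fn 是假设的占位评估函数(用随机数代替真实训练);实际使用时应替换为"训练候选网络若干epoch并返回验证准确率、参数量和FLOPs":
# 使用示意:dummy_train_fn为假设的占位函数,实际应训练网络并返回真实指标
def dummy_train_fn(architecture):
    accuracy = random.random()                          # 占位:应为验证准确率
    params = sum(l['channels'] for l in architecture)   # 占位:应为真实参数量
    flops = params * 1000                               # 占位:应为真实FLOPs
    return accuracy, params, flops

nas = GeneticNAS(population_size=10, generations=5)
best = nas.evolve(ChainSearchSpace(), dummy_train_fn)
print('Best architecture:', best.architecture)
print('Best fitness:', best.fitness)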
4.2 NSGA-II多目标优化
NSGA-II(非支配排序遗传算法II)用于多目标优化,同时考虑准确率和模型复杂度。
class NSGA2NAS(GeneticNAS):
"""基于NSGA-II的多目标NAS"""
def __init__(self, population_size=50, generations=100):
super().__init__(population_size, 0.1, 0.8, generations)
def dominates(self, ind1, ind2):
"""判断ind1是否支配ind2"""
# 目标1:最大化准确率
# 目标2:最小化参数量
better_in_one = False
if ind1.fitness > ind2.fitness:
better_in_one = True
elif ind1.fitness < ind2.fitness:
return False
if ind1.params < ind2.params:
better_in_one = True
elif ind1.params > ind2.params:
return False
return better_in_one
def non_dominated_sort(self, population):
"""非支配排序"""
fronts = [[]]
domination_count = {}
dominated_solutions = {}
for i, p in enumerate(population):
domination_count[i] = 0
dominated_solutions[i] = []
for j, q in enumerate(population):
if i != j:
if self.dominates(p, q):
dominated_solutions[i].append(j)
elif self.dominates(q, p):
domination_count[i] += 1
if domination_count[i] == 0:
fronts[0].append(i)
i = 0
while len(fronts[i]) > 0:
next_front = []
for p in fronts[i]:
for q in dominated_solutions[p]:
domination_count[q] -= 1
if domination_count[q] == 0:
next_front.append(q)
i += 1
fronts.append(next_front)
return fronts[:-1] # 去掉最后一个空列表
def crowding_distance(self, front, population):
"""计算拥挤距离"""
if len(front) <= 2:
return {i: float('inf') for i in front}
distance = {i: 0 for i in front}
# 按准确率排序
sorted_by_acc = sorted(front,
key=lambda i: population[i].fitness)
distance[sorted_by_acc[0]] = float('inf')
distance[sorted_by_acc[-1]] = float('inf')
acc_range = (population[sorted_by_acc[-1]].fitness -
population[sorted_by_acc[0]].fitness)
for i in range(1, len(sorted_by_acc) - 1):
if acc_range > 0:
distance[sorted_by_acc[i]] += (
population[sorted_by_acc[i+1]].fitness -
population[sorted_by_acc[i-1]].fitness
) / acc_range
# 按参数量排序
sorted_by_params = sorted(front,
key=lambda i: population[i].params)
distance[sorted_by_params[0]] = float('inf')
distance[sorted_by_params[-1]] = float('inf')
params_range = (population[sorted_by_params[-1]].params -
population[sorted_by_params[0]].params)
for i in range(1, len(sorted_by_params) - 1):
if params_range > 0:
distance[sorted_by_params[i]] += (
population[sorted_by_params[i+1]].params -
population[sorted_by_params[i-1]].params
) / params_range
return distance
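有了非支配排序和拥挤距离,就可以实现NSGA-II的环境选择——从当前种群中挑出下一代:先按前沿层级整层填充,最后一层按拥挤距离截断。下面是一个可以加入 NSGA2NAS 的示意方法:
    def environmental_selection(self, population, size):
        """环境选择(示意):逐层填充前沿,最后一层按拥挤距离截断"""
        fronts = self.non_dominated_sort(population)
        selected = []
        for front in fronts:
            if len(selected) + len(front) <= size:
                selected.extend(front)  # 整层放入
            else:
                dist = self.crowding_distance(front, population)
                # 拥挤距离大的个体优先保留,以维持解的多样性
                front = sorted(front, key=lambda i: -dist[i])
                selected.extend(front[:size - len(selected)])
                break
        return [population[i] for i in selected]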
5. 权重共享与超网
5.1 Once-for-All网络
Once-for-All(OFA)是一种训练一次即可导出多种子网络的超网方法。它通过渐进式收缩训练,使超网中的权重可以被不同大小的子网络共享。
class OFASuperNet(nn.Module):
"""Once-for-All超网"""
def __init__(self, num_classes=1000, base_channels=64):
super(OFASuperNet, self).__init__()
# 最大配置
self.max_depth = 20
self.max_channels = base_channels * 4
self.max_kernel = 7
self.max_expand_ratio = 6
# 第一层
self.first_conv = nn.Conv2d(3, base_channels, 3, padding=1, bias=False)
self.first_bn = nn.BatchNorm2d(base_channels)
# 动态层(MBConv块)
self.blocks = nn.ModuleList()
in_ch = base_channels
for i in range(self.max_depth):
out_ch = min(in_ch * 2, self.max_channels)
self.blocks.append(
DynamicMBConv(in_ch, out_ch, self.max_expand_ratio)
)
in_ch = out_ch
# 分类头
self.classifier = nn.Linear(self.max_channels, num_classes)
def forward(self, x, arch_config=None):
"""
参数:
x: 输入
arch_config: 架构配置,包含depth, width, kernel_size等
"""
if arch_config is None:
arch_config = self.sample_active_subnet()
# 第一层
x = F.relu(self.first_bn(self.first_conv(x)))
# 动态块
for i in range(arch_config['depth']):
x = self.blocks[i](x, arch_config)
# 全局平均池化
x = F.adaptive_avg_pool2d(x, 1)
x = x.view(x.size(0), -1)
# 分类
x = self.classifier(x)
return x
def sample_active_subnet(self):
"""随机采样一个子网络配置"""
import random
return {
'depth': random.randint(5, self.max_depth),
'width_mult': random.uniform(0.5, 1.0),
'kernel_size': random.choice([3, 5, 7]),
'expand_ratio': random.choice([3, 4, 6])
}
class DynamicMBConv(nn.Module):
"""动态MobileNetV2块"""
def __init__(self, in_ch, out_ch, max_expand_ratio):
super(DynamicMBConv, self).__init__()
self.max_expand_ratio = max_expand_ratio
hidden_dim = in_ch * max_expand_ratio
# 扩展卷积
self.expand_conv = nn.Conv2d(in_ch, hidden_dim, 1, bias=False)
self.expand_bn = nn.BatchNorm2d(hidden_dim)
# 深度卷积(支持动态核大小)
self.depth_conv_3 = nn.Conv2d(hidden_dim, hidden_dim, 3,
padding=1, groups=hidden_dim, bias=False)
self.depth_conv_5 = nn.Conv2d(hidden_dim, hidden_dim, 5,
padding=2, groups=hidden_dim, bias=False)
self.depth_conv_7 = nn.Conv2d(hidden_dim, hidden_dim, 7,
padding=3, groups=hidden_dim, bias=False)
self.depth_bn = nn.BatchNorm2d(hidden_dim)
# 投影卷积
self.project_conv = nn.Conv2d(hidden_dim, out_ch, 1, bias=False)
self.project_bn = nn.BatchNorm2d(out_ch)
    def forward(self, x, config):
        """根据配置动态前向:按expand_ratio/kernel_size截取卷积核权重(弹性宽度的简化实现)"""
        identity = x
        expand_ch = int(x.size(1) * config['expand_ratio'])
        # 扩展:只取expand_conv前expand_ch个输出滤波器,BN参数同步截取
        x = F.conv2d(x, self.expand_conv.weight[:expand_ch])
        x = F.batch_norm(x, self.expand_bn.running_mean[:expand_ch],
                         self.expand_bn.running_var[:expand_ch],
                         self.expand_bn.weight[:expand_ch],
                         self.expand_bn.bias[:expand_ch], self.training)
        x = F.relu(x)
        # 深度卷积:根据配置选择核大小,groups与当前通道数一致
        k = config['kernel_size']
        dw = {3: self.depth_conv_3, 5: self.depth_conv_5, 7: self.depth_conv_7}[k]
        x = F.conv2d(x, dw.weight[:expand_ch], padding=k // 2, groups=expand_ch)
        x = F.batch_norm(x, self.depth_bn.running_mean[:expand_ch],
                         self.depth_bn.running_var[:expand_ch],
                         self.depth_bn.weight[:expand_ch],
                         self.depth_bn.bias[:expand_ch], self.training)
        x = F.relu(x)
        # 投影:截取project_conv的输入通道
        x = F.conv2d(x, self.project_conv.weight[:, :expand_ch])
        x = self.project_bn(x)
        # 残差连接(仅当输入输出形状一致时)
        if identity.size() == x.size():
            x = x + identity
        return x
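一个最小的使用示例:随机采样几个子网配置,并用CIFAR尺寸的输入做前向,检查输出形状:
# 使用示例:采样子网配置并做一次前向
net = OFASuperNet(num_classes=10, base_channels=32)
x = torch.randn(2, 3, 32, 32)
for _ in range(3):
    cfg = net.sample_active_subnet()
    y = net(x, cfg)
    print(cfg['depth'], cfg['kernel_size'], y.shape)  # torch.Size([2, 10])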
5.2 BigNAS
BigNAS是另一种权重共享方法,通过三明治规则(Sandwich Rule)训练超网,确保大模型和小模型都能获得良好的性能。
class BigNASTrainer:
"""BigNAS训练器"""
def __init__(self, supernet, args):
self.supernet = supernet
self.optimizer = torch.optim.SGD(
supernet.parameters(),
lr=args.learning_rate,
momentum=0.9,
weight_decay=args.weight_decay
)
def train_step(self, inputs, targets):
"""使用三明治规则训练"""
self.optimizer.zero_grad()
# 三明治规则:同时训练最大、最小和随机采样的子网络
# 1. 训练最大子网络
max_config = self.get_max_config()
outputs_max = self.supernet(inputs, max_config)
loss_max = F.cross_entropy(outputs_max, targets)
loss_max.backward()
# 2. 训练最小子网络
min_config = self.get_min_config()
outputs_min = self.supernet(inputs, min_config)
loss_min = F.cross_entropy(outputs_min, targets)
loss_min.backward()
# 3. 训练随机子网络
random_config = self.sample_config()
outputs_random = self.supernet(inputs, random_config)
loss_random = F.cross_entropy(outputs_random, targets)
loss_random.backward()
self.optimizer.step()
return {
'loss_max': loss_max.item(),
'loss_min': loss_min.item(),
'loss_random': loss_random.item()
}
def get_max_config(self):
"""获取最大配置"""
return {
'depth': self.supernet.max_depth,
'width_mult': 1.0,
'kernel_size': 7,
'expand_ratio': 6
}
def get_min_config(self):
"""获取最小配置"""
return {
'depth': 5,
'width_mult': 0.25,
'kernel_size': 3,
'expand_ratio': 3
}
def sample_config(self):
"""随机采样配置"""
import random
return {
'depth': random.randint(5, self.supernet.max_depth),
'width_mult': random.choice([0.25, 0.5, 0.75, 1.0]),
'kernel_size': random.choice([3, 5, 7]),
'expand_ratio': random.choice([3, 4, 6])
}
6. AutoML工具与应用
6.1 Optuna
Optuna是一个高效的超参数优化框架,支持多种搜索算法(TPE、CMA-ES等)。
import optuna
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
# 定义目标函数
def objective(trial):
"""Optuna目标函数"""
# 定义搜索空间
n_layers = trial.suggest_int('n_layers', 1, 3)
dropout = trial.suggest_float('dropout', 0.1, 0.5)
lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
optimizer_name = trial.suggest_categorical('optimizer', ['Adam', 'SGD'])
# 构建模型
layers = []
in_features = 784 # MNIST
for i in range(n_layers):
out_features = trial.suggest_int(f'n_units_l{i}', 64, 512, log=True)
layers.append(nn.Linear(in_features, out_features))
layers.append(nn.ReLU())
layers.append(nn.Dropout(dropout))
in_features = out_features
layers.append(nn.Linear(in_features, 10))
model = nn.Sequential(*layers)
# 训练配置
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
# 优化器
if optimizer_name == 'Adam':
optimizer = optim.Adam(model.parameters(), lr=lr)
else:
momentum = trial.suggest_float('momentum', 0.5, 0.99)
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
# 数据加载
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_loader = torch.utils.data.DataLoader(
datasets.MNIST('./data', train=True, download=True, transform=transform),
batch_size=128, shuffle=True
)
# 训练
criterion = nn.CrossEntropyLoss()
model.train()
for epoch in range(5): # 简化为5个epoch
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
data = data.view(data.size(0), -1)
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
if batch_idx >= 100: # 限制迭代次数
break
    # 评估(简化:在训练集的一个子集上评估;实际应使用独立的验证集)
model.eval()
correct = 0
total = 0
with torch.no_grad():
for data, target in train_loader:
data, target = data.to(device), target.to(device)
data = data.view(data.size(0), -1)
output = model(data)
_, predicted = torch.max(output.data, 1)
total += target.size(0)
correct += (predicted == target).sum().item()
if total >= 1000: # 限制验证样本数
break
accuracy = correct / total
return accuracy
# 运行优化
def run_optuna():
"""运行Optuna优化"""
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print("Best trial:")
trial = study.best_trial
print(f" Value: {trial.value:.4f}")
print(" Params:")
for key, value in trial.params.items():
print(f" {key}: {value}")
return study
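对于NAS这类单次评估昂贵的场景,Optuna的剪枝(pruning)机制尤其有用:在训练过程中上报中间结果,让无望的trial提前终止。下面是一个最小示意,其中 train_one_epoch 是假设的函数(训练一个epoch并返回当前验证精度):
# 剪枝示意:train_one_epoch为假设的函数,返回当前验证精度
def objective_with_pruning(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    accuracy = 0.0
    for epoch in range(5):
        accuracy = train_one_epoch(lr)   # 假设的训练函数
        trial.report(accuracy, epoch)    # 上报中间结果
        if trial.should_prune():         # 由pruner判断是否提前终止
            raise optuna.TrialPruned()
    return accuracy

study = optuna.create_study(
    direction='maximize',
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=1)
)
# study.optimize(objective_with_pruning, n_trials=50)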
6.2 Auto-sklearn与TPOT
# Auto-sklearn示例(需要安装:pip install auto-sklearn)
def auto_sklearn_example():
"""Auto-sklearn使用示例"""
try:
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# 加载数据
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 创建Auto-sklearn分类器
automl = autosklearn.classification.AutoSklearnClassifier(
time_left_for_this_task=120, # 2分钟
per_run_time_limit=30,
metric=autosklearn.metrics.accuracy
)
# 训练
automl.fit(X_train, y_train)
# 评估
predictions = automl.predict(X_test)
accuracy = (predictions == y_test).mean()
print(f"Auto-sklearn accuracy: {accuracy:.4f}")
# 显示最终集成
print(automl.leaderboard())
except ImportError:
print("auto-sklearn not installed. Skipping example.")
# TPOT示例(需要安装:pip install tpot)
def tpot_example():
"""TPOT使用示例"""
try:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# 加载数据
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 创建TPOT分类器
tpot = TPOTClassifier(
generations=5,
population_size=20,
verbosity=2,
random_state=42
)
# 训练
tpot.fit(X_train, y_train)
# 评估
accuracy = tpot.score(X_test, y_test)
print(f"TPOT accuracy: {accuracy:.4f}")
# 导出最佳pipeline
tpot.export('best_pipeline.py')
except ImportError:
print("tpot not installed. Skipping example.")
7. 前沿进展
7.1 AutoFormer
AutoFormer(ICCV 2021)是微软亚洲研究院提出的Vision Transformer架构搜索方法,它将权重共享技术引入到ViT的搜索中。
核心创新:
- 搜索空间:包含网络深度、embedding维度、注意力头数等
- 权重共享:训练一个超网,包含所有可能的子网络
- 渐进式收缩:从大到小逐步训练,确保小模型也能获得良好初始化
class AutoFormerSearchSpace:
"""AutoFormer搜索空间"""
def __init__(self):
# 可搜索维度
self.search_depth = [10, 12, 14] # Transformer层数
self.search_embed_dim = [384, 480, 528] # Embedding维度
self.search_num_heads = [6, 8, 10] # 注意力头数
self.search_mlp_ratio = [3.0, 3.5, 4.0] # MLP扩展比例
    def sample_config(self):
        """采样一个ViT配置(保证embed_dim能被num_heads整除)"""
        import random
        embed_dim = random.choice(self.search_embed_dim)
        valid_heads = [h for h in self.search_num_heads if embed_dim % h == 0]
        return {
            'depth': random.choice(self.search_depth),
            'embed_dim': embed_dim,
            'num_heads': random.choice(valid_heads),
            'mlp_ratio': random.choice(self.search_mlp_ratio)
        }
    def get_max_config(self):
        """获取最大配置(num_heads取与最大embed_dim整除兼容的最大值)"""
        embed_dim = max(self.search_embed_dim)
        valid_heads = [h for h in self.search_num_heads if embed_dim % h == 0]
        return {
            'depth': max(self.search_depth),
            'embed_dim': embed_dim,
            'num_heads': max(valid_heads),
            'mlp_ratio': max(self.search_mlp_ratio)
        }
class DynamicVisionTransformer(nn.Module):
"""动态Vision Transformer(AutoFormer风格)"""
def __init__(self, img_size=224, patch_size=16, in_chans=3,
num_classes=1000, max_embed_dim=528, max_depth=14):
super().__init__()
self.max_embed_dim = max_embed_dim
self.max_depth = max_depth
# Patch嵌入
self.patch_embed = nn.Conv2d(
in_chans, max_embed_dim,
kernel_size=patch_size, stride=patch_size
)
# 位置编码
num_patches = (img_size // patch_size) ** 2
self.pos_embed = nn.Parameter(
torch.zeros(1, num_patches + 1, max_embed_dim)
)
self.cls_token = nn.Parameter(torch.zeros(1, 1, max_embed_dim))
# Transformer块
self.blocks = nn.ModuleList([
DynamicTransformerBlock(max_embed_dim)
for _ in range(max_depth)
])
# 分类头
self.norm = nn.LayerNorm(max_embed_dim)
self.head = nn.Linear(max_embed_dim, num_classes)
    def forward(self, x, config=None):
        if config is None:
            # 默认使用最大配置(528不能被10整除,故默认取8个头)
            config = {'depth': self.max_depth, 'embed_dim': self.max_embed_dim,
                      'num_heads': 8, 'mlp_ratio': 4.0}
        B = x.shape[0]
        embed_dim = config['embed_dim']
        # Patch嵌入(截取前embed_dim个通道)
        x = self.patch_embed(x)
        x = x.flatten(2).transpose(1, 2)
        x = x[:, :, :embed_dim]
        # 添加CLS token
        cls_tokens = self.cls_token[:, :, :embed_dim].expand(B, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)
        # 添加位置编码
        x = x + self.pos_embed[:, :, :embed_dim]
        # Transformer块(只使用前depth个)
        for i in range(config['depth']):
            x = self.blocks[i](x, config)
        # 分类:LayerNorm与Linear的参数按embed_dim截取
        x = F.layer_norm(x, (embed_dim,),
                         self.norm.weight[:embed_dim], self.norm.bias[:embed_dim])
        x = x[:, 0]
        x = F.linear(x, self.head.weight[:, :embed_dim], self.head.bias)
        return x
class DynamicTransformerBlock(nn.Module):
"""动态Transformer块"""
def __init__(self, max_embed_dim):
super().__init__()
self.norm1 = nn.LayerNorm(max_embed_dim)
self.attn = DynamicAttention(max_embed_dim)
self.norm2 = nn.LayerNorm(max_embed_dim)
self.mlp = DynamicMLP(max_embed_dim)
    def forward(self, x, config):
        embed_dim = config['embed_dim']
        # LayerNorm参数按当前embed_dim截取
        h = F.layer_norm(x, (embed_dim,),
                         self.norm1.weight[:embed_dim], self.norm1.bias[:embed_dim])
        x = x + self.attn(h, config)  # 注意力
        h = F.layer_norm(x, (embed_dim,),
                         self.norm2.weight[:embed_dim], self.norm2.bias[:embed_dim])
        x = x + self.mlp(h, config)   # MLP
        return x
class DynamicAttention(nn.Module):
"""动态多头注意力"""
def __init__(self, max_embed_dim, max_num_heads=10):
super().__init__()
self.max_embed_dim = max_embed_dim
self.max_num_heads = max_num_heads
self.qkv = nn.Linear(max_embed_dim, max_embed_dim * 3)
self.proj = nn.Linear(max_embed_dim, max_embed_dim)
    def forward(self, x, config):
        embed_dim = config['embed_dim']
        num_heads = config['num_heads']
        B, N, _ = x.shape
        head_dim = embed_dim // num_heads  # 要求embed_dim能被num_heads整除
        # QKV投影:按(q,k,v)三段分别截取权重,避免段边界错位
        w = self.qkv.weight.view(3, self.max_embed_dim, self.max_embed_dim)
        w = w[:, :embed_dim, :embed_dim].reshape(3 * embed_dim, embed_dim)
        b = self.qkv.bias.view(3, self.max_embed_dim)[:, :embed_dim].reshape(-1)
        qkv = F.linear(x, w, b)
        qkv = qkv.reshape(B, N, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # 缩放点积注意力
        attn = (q @ k.transpose(-2, -1)) * (head_dim ** -0.5)
        attn = F.softmax(attn, dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B, N, embed_dim)
        # 输出投影(截取权重)
        x = F.linear(x, self.proj.weight[:embed_dim, :embed_dim],
                     self.proj.bias[:embed_dim])
        return x
class DynamicMLP(nn.Module):
"""动态MLP"""
def __init__(self, max_embed_dim, max_ratio=4.0):
super().__init__()
self.max_embed_dim = max_embed_dim
self.max_hidden_dim = int(max_embed_dim * max_ratio)
self.fc1 = nn.Linear(max_embed_dim, self.max_hidden_dim)
self.fc2 = nn.Linear(self.max_hidden_dim, max_embed_dim)
    def forward(self, x, config):
        embed_dim = config['embed_dim']
        mlp_ratio = config['mlp_ratio']
        hidden_dim = int(embed_dim * mlp_ratio)
        # 两个全连接层的权重按当前宽度截取
        x = F.linear(x, self.fc1.weight[:hidden_dim, :embed_dim],
                     self.fc1.bias[:hidden_dim])
        x = F.gelu(x)
        x = F.linear(x, self.fc2.weight[:embed_dim, :hidden_dim],
                     self.fc2.bias[:embed_dim])
        return x
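一个最小的前向检查示例,串联上面的搜索空间与动态ViT:
# 使用示例:随机采样ViT配置并前向验证输出形状
space = AutoFormerSearchSpace()
model = DynamicVisionTransformer(img_size=224, num_classes=1000)
x = torch.randn(1, 3, 224, 224)
cfg = space.sample_config()
out = model(x, cfg)
print(cfg, out.shape)  # torch.Size([1, 1000])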
7.2 NAS方法对比
| 方法 | 搜索策略 | 搜索成本 | 主要优势 | 主要局限 |
|---|---|---|---|---|
| NASNet | 强化学习 | 高(数千GPU小时) | 发现高性能细胞结构 | 计算成本极高 |
| ENAS | 权重共享+RL | 中 | 大幅降低搜索成本 | 可能陷入局部最优 |
| DARTS | 可微分 | 低(单卡可完成) | 效率高,端到端训练 | 存在崩溃问题 |
| AmoebaNet | 进化算法 | 高 | 发现新颖结构 | 计算成本高 |
| OFA | 权重共享 | 低(训练一次) | 支持多种部署场景 | 需要专门训练策略 |
| BigNAS | 权重共享 | 低 | 大模型小模型同时优化 | 超网训练复杂 |
| AutoFormer | 权重共享 | 中 | 针对ViT优化 | 仅适用于Transformer |
8. NAS简化实现代码
8.1 完整DARTS实现
"""
简化版DARTS实现
用于CIFAR-10图像分类
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
class DARTSCell(nn.Module):
"""DARTS搜索单元"""
def __init__(self, steps, multiplier, C_prev_prev, C_prev, C,
reduction, reduction_prev):
super(DARTSCell, self).__init__()
self.reduction = reduction
self.reduction_prev = reduction_prev
# 预处理层
if reduction_prev:
self.preprocess0 = FactorizedReduce(C_prev_prev, C)
else:
self.preprocess0 = ReLUConvBN(C_prev_prev, C, 1, 1, 0)
self.preprocess1 = ReLUConvBN(C_prev, C, 1, 1, 0)
self.steps = steps
self.multiplier = multiplier
# 候选操作
self.ops = nn.ModuleList()
primitives = [
'none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect',
'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'
]
for i in range(self.steps):
for j in range(2 + i):
stride = 2 if reduction and j < 2 else 1
op = MixedOp(C, stride, primitives)
self.ops.append(op)
def forward(self, s0, s1, weights):
s0 = self.preprocess0(s0)
s1 = self.preprocess1(s1)
states = [s0, s1]
offset = 0
for i in range(self.steps):
s = sum(self.ops[offset + j](h, weights[offset + j])
for j, h in enumerate(states))
offset += len(states)
states.append(s)
return torch.cat(states[-self.multiplier:], dim=1)
class DARTSNetwork(nn.Module):
"""DARTS搜索网络"""
def __init__(self, C=16, num_classes=10, layers=8, steps=4,
multiplier=4, stem_multiplier=3):
super(DARTSNetwork, self).__init__()
self.steps = steps
self.multiplier = multiplier
C_curr = stem_multiplier * C
self.stem = nn.Sequential(
nn.Conv2d(3, C_curr, 3, padding=1, bias=False),
nn.BatchNorm2d(C_curr)
)
C_prev_prev, C_prev, C_curr = C_curr, C_curr, C
self.cells = nn.ModuleList()
reduction_prev = False
for i in range(layers):
if i in [layers // 3, 2 * layers // 3]:
C_curr *= 2
reduction = True
else:
reduction = False
cell = DARTSCell(steps, multiplier, C_prev_prev, C_prev,
C_curr, reduction, reduction_prev)
self.cells.append(cell)
reduction_prev = reduction
C_prev_prev = C_prev
C_prev = multiplier * C_curr
self.global_pooling = nn.AdaptiveAvgPool2d(1)
self.classifier = nn.Linear(C_prev, num_classes)
# 架构参数
self._initialize_alphas()
def _initialize_alphas(self):
k = sum(2 + i for i in range(self.steps))
num_ops = 8 # 候选操作数
self.alphas_normal = nn.Parameter(torch.randn(k, num_ops))
self.alphas_reduce = nn.Parameter(torch.randn(k, num_ops))
self._arch_parameters = [
self.alphas_normal,
self.alphas_reduce
]
def arch_parameters(self):
return self._arch_parameters
def weights(self):
return [p for n, p in self.named_parameters()
if 'alphas' not in n]
def forward(self, x):
s0 = s1 = self.stem(x)
for i, cell in enumerate(self.cells):
if cell.reduction:
weights = F.softmax(self.alphas_reduce, dim=-1)
else:
weights = F.softmax(self.alphas_normal, dim=-1)
s0, s1 = s1, cell(s0, s1, weights)
out = self.global_pooling(s1)
logits = self.classifier(out.view(out.size(0), -1))
return logits
def genotype(self):
"""导出离散架构"""
def _parse(weights):
gene = []
n = 2
start = 0
for i in range(self.steps):
end = start + n
W = weights[start:end].copy()
edges = sorted(range(i + 2),
key=lambda x: -max(W[x][k] for k in range(len(W[x]))
if k != 0))[:2]
for j in edges:
k_best = None
for k in range(len(W[j])):
if k != 0:
if k_best is None or W[j][k] > W[j][k_best]:
k_best = k
gene.append((k_best, j))
start = end
n += 1
return gene
gene_normal = _parse(F.softmax(self.alphas_normal, dim=-1).data.cpu().numpy())
gene_reduce = _parse(F.softmax(self.alphas_reduce, dim=-1).data.cpu().numpy())
return {'normal': gene_normal, 'reduce': gene_reduce}
def train_darts():
"""训练DARTS"""
# 数据预处理
transform_train = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2023, 0.1994, 0.2010))
])
transform_test = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2023, 0.1994, 0.2010))
])
# 加载CIFAR-10
trainset = torchvision.datasets.CIFAR10(
root='./data', train=True, download=True, transform=transform_train
)
trainloader = DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(
root='./data', train=False, download=True, transform=transform_test
)
testloader = DataLoader(testset, batch_size=64, shuffle=False, num_workers=2)
# 创建模型
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = DARTSNetwork().to(device)
# 优化器
w_optimizer = torch.optim.SGD(
model.weights(), lr=0.025, momentum=0.9, weight_decay=3e-4
)
alpha_optimizer = torch.optim.Adam(
model.arch_parameters(), lr=3e-4, betas=(0.5, 0.999)
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
w_optimizer, T_max=50, eta_min=0.001
)
    # 训练循环(注:此处用测试集充当验证集以简化示例,规范做法是将训练集一分为二)
    val_iter = iter(testloader)
    for epoch in range(50):
model.train()
train_loss = 0
correct = 0
total = 0
for batch_idx, (inputs, targets) in enumerate(trainloader):
inputs, targets = inputs.to(device), targets.to(device)
# 更新网络权重
w_optimizer.zero_grad()
outputs = model(inputs)
loss = F.cross_entropy(outputs, targets)
loss.backward()
w_optimizer.step()
            # 更新架构参数(每隔几个批次,用验证数据)
            if batch_idx % 5 == 0:
                try:
                    val_inputs, val_targets = next(val_iter)
                except StopIteration:
                    val_iter = iter(testloader)
                    val_inputs, val_targets = next(val_iter)
                val_inputs = val_inputs.to(device)
                val_targets = val_targets.to(device)
                alpha_optimizer.zero_grad()
                val_outputs = model(val_inputs)
                arch_loss = F.cross_entropy(val_outputs, val_targets)
                arch_loss.backward()
                alpha_optimizer.step()
train_loss += loss.item()
_, predicted = outputs.max(1)
total += targets.size(0)
correct += predicted.eq(targets).sum().item()
scheduler.step()
# 验证
model.eval()
test_loss = 0
correct_test = 0
total_test = 0
with torch.no_grad():
for inputs, targets in testloader:
inputs, targets = inputs.to(device), targets.to(device)
outputs = model(inputs)
loss = F.cross_entropy(outputs, targets)
test_loss += loss.item()
_, predicted = outputs.max(1)
total_test += targets.size(0)
correct_test += predicted.eq(targets).sum().item()
print(f'Epoch {epoch+1}/50: '
f'Train Acc: {100.*correct/total:.2f}%, '
f'Test Acc: {100.*correct_test/total_test:.2f}%')
# 导出最终架构
genotype = model.genotype()
print('Final genotype:', genotype)
return model, genotype
if __name__ == '__main__':
model, genotype = train_darts()
9. 避坑小贴士
9.1 搜索空间设计
- 搜索空间过大:会导致搜索效率极低,建议从较小的搜索空间开始,逐步扩展
- 搜索空间过小:可能无法发现优秀的架构,需要平衡搜索空间和计算资源
- 操作冗余:避免包含功能高度重叠的候选操作(例如5x5卷积与两层3x3卷积的堆叠,二者感受野相同),以免浪费搜索预算
9.2 DARTS常见问题
- 架构崩溃(Collapse):DARTS倾向于选择跳跃连接或无操作,导致网络退化
  - 解决方案:使用早停策略、添加操作丢弃正则化、使用P-DARTS等改进方法
- 验证集过拟合:架构参数在验证集上过拟合
  - 解决方案:增加验证集大小、使用早停、限制架构参数更新频率
9.3 权重共享训练
- 子网络干扰:不同子网络共享权重可能导致相互干扰
  - 解决方案:使用渐进式收缩训练、三明治规则、梯度掩码
- 排序不一致:超网性能与子网络独立训练性能不一致
  - 解决方案:使用更精细的权重共享策略、添加排序一致性损失
9.4 计算资源管理
- 搜索成本:NAS通常需要大量计算资源,建议:
- 使用代理任务(如CIFAR-10代替ImageNet)
- 限制每个候选架构的训练轮数
- 使用权重共享减少重复训练
10. 本章小结和知识点回顾
核心概念
- NAS三要素:搜索空间定义了可能的架构集合,搜索策略决定如何探索,性能评估衡量架构质量
- 搜索空间类型:链式结构简单直接,细胞结构模块化且可迁移
- 搜索策略:
  - 强化学习:RNN控制器+策略梯度,适合离散搜索空间
  - 进化算法:遗传算法、NSGA-II,适合多目标优化
  - 可微分方法:DARTS通过连续松弛实现端到端优化
- 权重共享:OFA和BigNAS通过训练超网大幅降低搜索成本
- AutoML工具:Optuna、Auto-sklearn、TPOT提供开箱即用的自动化机器学习功能
关键公式
- DARTS连续松弛:output = Σ_o softmax(α)_o · op_o(x)
- REINFORCE策略梯度:∇_θ J = E[(R − b) · ∇_θ log P(a; θ)],其中 b 为基线
- NSGA-II非支配排序:基于Pareto最优性对解进行分层
一句话总结
NAS让神经网络学会"自我设计",通过自动化搜索最优架构,将人类专家从繁琐的调参工作中解放出来,是通往通用人工智能的重要一步。
参考资料:
- Zoph et al. “Neural Architecture Search with Reinforcement Learning” (ICLR 2017)
- Liu et al. “DARTS: Differentiable Architecture Search” (ICLR 2019)
- Real et al. “Regularized Evolution for Image Classifier Architecture Search” (AAAI 2019)
- Cai et al. “Once-for-All: Train One Network and Specialize it for Efficient Deployment” (ICLR 2020)
- Chen et al. “AutoFormer: Searching Transformers for Visual Recognition” (ICCV 2021)
- Yu et al. "BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models" (ECCV 2020)
本教程为《深度学习精通》系列第24章,转载请注明出处。