Error in user YAML: (<unknown>): found a tab character that violate indentation while scanning a plain scalar at line 3 column 3

---
- oeasy Python 0803
- 这是 oeasy 系统化 Python 教程，从基础一步步讲，扎实、完整、不跳步。愿意花时间学，就能真正学会。
- 本教程同步发布在： 
	- 个人网站： `https://oeasy.org` 
	- 蓝桥云课： `https://www.lanqiao.cn/courses/3584` 
	- GitHub： `https://github.com/overmind1980/oeasy-python-tutorial` 
	- Gitee： `https://gitee.com/overmind1980/oeasypython` 
---

从零开始

回忆

随机森林的「随机」 VS 交叉验证的「验证」
✔️ 维度一：【层级不同】→ 最核心区别
- 🟢 随机森林的随机
  - 模型内部的训练机制（内核）
  - 是随机森林「自己训练自己」的方式
  - 是模型的「内功心法」
  - 决定了模型的「底子」
- 🔵 交叉验证 →
  - 模型外部的评估机制（裁判）
  - 是我们「评价这个模型」的方式
  - 是给模型打分的「裁判标准」
  - 决定了我们「知不知道模型的底子好不好」
✔️ 维度二：【目的完全不同】

对比项	随机森林的「双重随机」	cross_validate的「交叉验证」
核心目的	1. 降低单棵树的过拟合 2. 让多棵树的预测逻辑差异化 3. 集成后提升模型自身的泛化能力	1. 评估模型在「陌生数据」上的真实表现 2. 避免单次拆分（train_test_split）的偶然性 3. 给出可信的模型性能分数
作用范围	仅在「训练集」内部，全程无测试集参与	同时用到训练集+测试集，所有样本轮流当测试集
产生结果	一个「训练好的随机森林模型」	模型的「评估分数（准确率/平均分/方差）」
是否重叠	✅ 完全不重叠	✅ 完全不重叠

事到如今这个模型还能进化吗？

目前架构

内外共分两圈
- 外圈
  - 交叉验证cross_validate
  - 5轮8:2
  - 测试集覆盖所有
- 内圈
  - 标准化器
  - 随机森林 100棵决策树

# 导入所有依赖库
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate, StratifiedKFold

# 完全相同的数据集 + 你的指定噪音样本
X = [[1.75,1.78,70],[1.65,1.68,55],[1.83,1.85,85],
     [1.70,1.73,60],[1.91,1.96,95],[1.88,1.93,82],
     [1.98,2.11,90],[2.03,2.21,102],[2.08,2.13,111],[2.16,2.27,116],
     [1.60, 1.65, 61.7]]
y = [0,0,0,0,0,1,1,1,1,1,1]

# 完全相同的模型
pipe_model = Pipeline([
    ("数据标准化", StandardScaler()),
    ("随机森林分类器", RandomForestClassifier(n_estimators=100, random_state=42))
])

# ===================== 核心修改 ✔️ random_state=4000 =====================
cv_rule = StratifiedKFold(n_splits=5, shuffle=True, random_state=4000)
cv_result = cross_validate(
    estimator=pipe_model,
    X=X,
    y=y,
    cv=cv_rule,
    scoring='accuracy',
    return_train_score=True
)

# 全量训练获取特征重要性
pipe_model.fit(X, y)
feat_imp  = pipe_model['随机森林分类器'].feature_importances_

# 解析结果
cv_train_scores = cv_result['train_score']
cv_test_scores  = cv_result['test_score']
train_mean = cv_train_scores.mean()
test_mean  = cv_test_scores.mean()
test_std   = cv_test_scores.std()

# 结果输出
print("="*70)
print("【独立版】cross_validate 5折分层交叉验证 (random_state=4000)")
print("="*70)
print(f"总样本数：{len(X)} | 交叉验证折数：5 | 每折训练：9条 | 每折测试：2条")
print(f"5次训练集准确率：{cv_train_scores.round(4)}")
print(f"5次测试集准确率：{cv_test_scores.round(4)}")
print(f"训练集平均准确率：{train_mean:.4f}")
print(f"测试集平均准确率：{test_mean:.4f} 【核心真实分数】")
print(f"测试集准确率方差：{test_std:.4f}")
print(f"特征重要性：身高={feat_imp[0]:.3f} | 臂展={feat_imp[1]:.3f} | 体重={feat_imp[2]:.3f}")
print("="*70)

内圈的随机森林参数

随机森林 3 个参数
- 评估器数量
- 最大深度
- 最小采样叶子

维度	`n_estimators` （评估器数量/树的数量）	`max_depth` （最大深度）	`min_samples_leaf` （最小采样叶子）
通俗解释	随机森林里决策树的总数量	限制每棵决策树能生长的最大层数	叶子节点必须包含的最少样本数
核心作用	决定模型稳定性，树越多投票越稳，集成效果越好	【第一防过拟合核心】决定单棵树复杂度，优先级最高	【第二防过拟合核心】禁止树过度细分样本，过滤噪音
取值过小（问题）	欠拟合、模型效果差、稳定性极差	欠拟合、树太浅，仅学粗规律，全量准确率低	过拟合、对噪音样本敏感，训练集准确率虚高
取值适中（黄金区间）	50~200，准确率高、稳定性强、算力适中	2~10，完美学核心规律，无过拟合，泛化能力最强	2~4，叶子节点样本充足，只学通用规律
取值过大（问题）	轻微过拟合、准确率提升微乎其微、训练慢/占内存高	严重过拟合、死记训练集噪音，训练集满分、测试集暴跌	欠拟合、树无法充分分裂，学不到精细规律
你的球员数据最优调参范围	`randint(50, 200)`	`randint(2, 10)`	`randint(2, 4)`
调参优先级	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐

可以自动找到参数最合适的大小吗？

RandomizedSearchCV

RandomizedSearchCV
- Randomized 随机
- Search 搜索
- CV Cross Validation 交叉验证

cv_rule = StratifiedKFold(n_splits=5, shuffle=True, random_state=4000)
cv_result = cross_validate(
    estimator=pipe_model,
    X=X,
    y=y,
    cv=cv_rule,
    scoring='accuracy',
    return_train_score=True
)

RandomizedSearchCV
- 本质上是交叉验证(Cross Validation)工具
- 附带随机搜索
随机搜索搜的是什么呢？

代码

# 导入所有依赖库
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# 你的数据集 + 指定噪音样本 不变
X = [[1.75,1.78,70],[1.65,1.68,55],[1.83,1.85,85],
     [1.70,1.73,60],[1.91,1.96,95],[1.88,1.93,82],
     [1.98,2.11,90],[2.03,2.21,102],[2.08,2.13,111],[2.16,2.27,116],
     [1.60, 1.65, 61.7]]
y = [0,0,0,0,0,1,1,1,1,1,1]

# 流水线模型：标准化 + 随机森林，固定 random_state=42
pipe_model = Pipeline([
    ("数据标准化", StandardScaler()),
    ("随机森林分类器", RandomForestClassifier(random_state=42))
])

# 定义3个核心超参数搜索空间
param_distributions = {
    "随机森林分类器__n_estimators": randint(50, 200),
    "随机森林分类器__max_depth": randint(2, 10),
    "随机森林分类器__min_samples_leaf": randint(2, 4)
}

# ===================== 核心：RandomizedSearchCV 5折交叉验证+自动调参 =====================
random_search = RandomizedSearchCV(
    estimator=pipe_model,
    param_distributions=param_distributions,
    n_iter=8,
    cv=5,                # 5折交叉验证
    scoring='accuracy',
    random_state=42,     # 统一指定为 42
    n_jobs=-1,
    return_train_score=True
).fit(X, y)

# 全量训练+特征重要性
best_model = random_search.best_estimator_
best_model.fit(X, y)
feat_imp  = best_model['随机森林分类器'].feature_importances_

# 结果输出
print("="*70)
print("✅ RandomizedSearchCV 自动调参 + 随机森林 (random_state=42)")
print("="*70)
print(f"总样本数：{len(X)} | 5折交叉验证 | 随机搜索参数组数：8")
print(f"📌 最优超参数组合：{random_search.best_params_}")
print(f"📌 最优参数交叉验证平均准确率：{random_search.best_score_:.4f}")
print(f"📌 特征重要性：身高={feat_imp[0]:.3f} | 臂展={feat_imp[1]:.3f} | 体重={feat_imp[2]:.3f}")
print("="*70)

经过一段时间

可以得到随机森林的最佳参数
- 最小叶子 - 3
- 最大深度 - 8
- 随机分类器数量 - 142

代码细节

三个参数
- 各有随机范围
  1. 随机分类器最大深度 - [2, 4)
  2. 随机分类器最小叶子 - [2, 10)
  3. 随机分类器数量 - [50, 100)

# 定义3个核心超参数搜索空间
param_distributions = {
    "随机森林分类器__n_estimators": randint(50, 200),
    "随机森林分类器__max_depth": randint(2, 10),
    "随机森林分类器__min_samples_leaf": randint(2, 4)
}

然后8轮 5折交叉验证


# ==== 核心：RandomizedSearchCV 5折交叉验证+自动调参===========
random_search = RandomizedSearchCV(
    estimator=pipe_model,
    param_distributions=param_distributions,
    n_iter=8,			# 8轮
    cv=5,                # 5折交叉验证
    scoring='accuracy',
    random_state=42,     # 统一指定为 42
    n_jobs=-1,
    return_train_score=True
).fit(X, y)

# 全量训练+特征重要性
best_model = random_search.best_estimator_

核心参数是

n_iter=8 ：随机抽取8 组独立的参数元组，不会多也不会少
cv=5 ：对每一组参数，都独立跑完完整的 5 折交叉验证

8轮交叉验证都随机到了那些参数呢？

观察

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

X = [[1.75,1.78,70],[1.65,1.68,55],[1.83,1.85,85],
     [1.70,1.73,60],[1.91,1.96,95],[1.88,1.93,82],
     [1.98,2.11,90],[2.03,2.21,102],[2.08,2.13,111],[2.16,2.27,116],
     [1.60, 1.65, 61.7]]
y = [0,0,0,0,0,1,1,1,1,1,1]

pipe_model = Pipeline([
    ("数据标准化", StandardScaler()),
    ("随机森林分类器", RandomForestClassifier(random_state=42))
])

param_distributions = {
    "随机森林分类器__n_estimators": randint(50, 200),
    "随机森林分类器__max_depth": randint(2, 10),
    "随机森林分类器__min_samples_leaf": randint(2, 4)
}

random_search = RandomizedSearchCV(
    estimator=pipe_model,
    param_distributions=param_distributions,
    n_iter=8,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
).fit(X, y)

# 打印8轮5折对应的8组参数元组
all_8_params = random_search.cv_results_['params']
print("="*80)
print("8轮5折交叉验证 对应的8组完整参数元组")
print("="*80)
for idx, param in enumerate(all_8_params, start=1):
    print(f"第{idx}轮 → {param}")

# 打印核心结果
print("="*80)
print(f"最优参数组合：{random_search.best_params_}")
print(f"最优5折交叉验证平均分：{random_search.best_score_:.4f}")
print("="*80)

效果

为什么这样选择参数元组呢？

随机范围

不是参数乱选
- 也不是排序随机
- 所有规则都是固定的、可复现的
- 有明确逻辑的**
- 结合你代码里的 random_state=42
- 随机森林3个超参

# 你定义的3个参数的取值范围（源头）
param_distributions = {
    "随机森林分类器__n_estimators": randint(50, 200),  # 只能抽：50 ≤ x ≤ 199 的整数
    "随机森林分类器__max_depth": randint(2, 10),        # 只能抽：2 ≤ x ≤ 9 的整数
    "随机森林分类器__min_samples_leaf": randint(2, 4)   # 只能抽：2 / 3 两个整数（唯一值）
}

你的参数空间里
- 总参数元组数量
  - = (199-50+1) × (9-2+1) × 2
  - = 150 × 8 × 2
  - = 2400种
8组参数全部严格满足范围：
- n_estimators：122/78/103/167/62/147/94/118 → 全在50~199之间
- max_depth：9/5/3/7/6/4/8/2 → 全在2~9之间
- min_samples_leaf：2/2/3/2/3/2/3/2 → 只有2/3两个值，无其他数
为什么最终只有8组参数？

固定随机种子的核心作用

用GridSearchCV网格搜索
- 要跑 2400×5=12000 次训练
- 算力爆炸
RandomizedSearchCV只抽 8 组
- 就能找到「接近最优」的参数
- 效率碾压网格搜索
- 这是它的核心优势
为什么是这8组？？

「真随机」 vs 「伪随机」

代码里的 random_state=42
- 是「排序固定、参数固定」的唯一原因
- 也是机器学习实验「可复现性」的核心
- 这是重中之重
RandomizedSearchCV 里的「随机采样」
- 不是生活中的真随机
- 而是计算机的 伪随机(Pseudo-Random)
- 随机搜索 ≠ 乱搜，是「高效采样」
  1. 用一个固定的「种子(seed)」，生成一套固定的随机数序列
  2. 只要种子不变 →
    - 随机数序列就不变 →
    - 抽出来的参数元组就不变 →
    - 参数的排序也不变
  3. 你的代码里写死了
    - random_state=42 →
    - 不管你运行多少次代码永远都是
      - 「第1组是122/9/2
      - 第2组是78/5/2...
      - 第8组是118/2/2」
      - 顺序+数值一丝不差**
会不会抽到「重复的参数元组」？

重复

大概率不会
- 极小概率会
- 但程序会自动处理
你只抽n_iter=8组
- 从2400种里抽8种
- 重复的概率几乎为0✔️
即使极端情况抽到重复的参数元组
- 程序也会「重复跑一次5折CV」
  - 不会跳过
  - 最终结果里会出现两个相同参数、相同分数
具体怎么生成的？

24个随机小数

要求生成种子为42的
- 24个小数序列

import numpy as np

# 初始化固定种子42的伪随机数生成器（你的RandomizedSearchCV底层就是用的这个）
rng = np.random.RandomState(seed=42)

# 生成前24个 0~1之间 的伪随机小数（8组参数 × 3个参数 = 24个，完美匹配你的需求）
random_24_list = rng.random(24)

# 格式化输出，方便你复制/查看/验算参数
print("random_state=42 生成的【前24个固定伪随机小数】↓↓↓")
print("="*80)
for idx, num in enumerate(random_24_list, start=1):
    print(f"第{idx}个: {num}")

每次运行的
- 是这一组小数序列
- 永远不变

小数序列怎么转化为参数呢？

具体参数

import numpy as np

# 1. 固定种子生成24个随机小数
rng = np.random.RandomState(seed=42)
random_24_list = rng.random(24)

# 2. 你的3个参数固定规则（左闭右开）
n_min, n_count = 50, 150  # randint(50,200) → 50~199
d_min, d_count = 2, 8     # randint(2,10) → 2~9
l_min, l_count = 2, 2     # randint(2,4) → 2~3

# 3. 核心公式+遍历计算8组完整参数
print("✅ random_state=42 → 24个随机小数 → 8组完整参数元组（你的代码真实运行参数）")
print("="*100)
for i in range(8):
    # 每组取3个连续的随机小数
    d1 = random_24_list[i*3]
    d2 = random_24_list[i*3+1]
    d3 = random_24_list[i*3+2]
    
    # 计算公式：最小值 + int(随机小数 × 取值总个数)
    n_estimators = n_min + int(d1 * n_count)
    max_depth = d_min + int(d2 * d_count)
    min_samples_leaf = l_min + int(d3 * l_count)
    
    # 格式化输出
    print(f"第{i+1}组: n_estimators={n_estimators}, max_depth={max_depth}, min_samples_leaf={min_samples_leaf}")

每次运行得到参数元组
- 固定不变
这参数元组和 RandomizedSearchCV
- 得到的一致

既然这样的话
- 我可以不取随机值
- 只挑一些经典值吗？

具体参数

我想要新建随机方式
1. "随机森林分类器__n_estimators": [50,100,150,200]
2. "随机森林分类器__max_depth": [2, 6, 10]
3. "随机森林分类器__min_samples_leaf": [2, 3]
- 在这里面随机抽8个

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

X = [[1.75,1.78,70],[1.65,1.68,55],[1.83,1.85,85],
     [1.70,1.73,60],[1.91,1.96,95],[1.88,1.93,82],
     [1.98,2.11,90],[2.03,2.21,102],[2.08,2.13,111],[2.16,2.27,116],
     [1.60, 1.65, 61.7]]
y = [0,0,0,0,0,1,1,1,1,1,1]

pipe_model = Pipeline([
    ("数据标准化", StandardScaler()),
    ("随机森林分类器", RandomForestClassifier(random_state=42))
])

# ========== 核心修改处 ✅ 按你的要求写死候选列表，不再用randint ==========
param_distributions = {
    "随机森林分类器__n_estimators": [50, 100, 150, 200],  # 指定候选值
    "随机森林分类器__max_depth": [2, 6, 10],             # 指定候选值
    "随机森林分类器__min_samples_leaf": [2, 3]           # 指定候选值
}

random_search = RandomizedSearchCV(
    estimator=pipe_model,
    param_distributions=param_distributions,
    n_iter=8,        # 从所有组合中【随机抽取8组】超参
    cv=5,            # 5折交叉验证
    scoring='accuracy',
    random_state=42, # 固定随机种子，结果可复现，每次运行抽的8组都一样
    n_jobs=-1
).fit(X, y)

# 打印8轮5折对应的8组参数元组
all_8_params = random_search.cv_results_['params']
print("="*80)
print("8轮5折交叉验证 对应的8组完整参数元组")
print("="*80)
for idx, param in enumerate(all_8_params, start=1):
    print(f"第{idx}轮 → {param}")

# 打印核心结果
print("="*80)
print(f"最优参数组合：{random_search.best_params_}")
print(f"最优5折交叉验证平均分：{random_search.best_score_:.4f}")
print("="*80)

效果

什么是RandomizedSearchCV

RandomizedSearchCV

RandomizedSearchCV
- 本质上是 CV
  - 交叉验证工具
  - Cross-Validator
- 功能上具备
  - RandomizedSearch
  - 随机参数采样

如果不随机搜索全都捋一遍
- 那算什么？

网格搜索

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # 网格搜索-全遍历所有参数组合

X = [[1.75,1.78,70],[1.65,1.68,55],[1.83,1.85,85],
     [1.70,1.73,60],[1.91,1.96,95],[1.88,1.93,82],
     [1.98,2.11,90],[2.03,2.21,102],[2.08,2.13,111],[2.16,2.27,116],
     [1.60, 1.65, 61.7]]
y = [0,0,0,0,0,1,1,1,1,1,1]

pipe_model = Pipeline([
    ("数据标准化", StandardScaler()),
    ("随机森林分类器", RandomForestClassifier(random_state=42))
])

# 你的固定参数列表，不变
param_grid = {
    "随机森林分类器__n_estimators": [50, 100, 150, 200],
    "随机森林分类器__max_depth": [2, 6, 10],
    "随机森林分类器__min_samples_leaf": [2, 3]
}

# ✅ 核心修正：删除 random_state=42 即可
grid_search = GridSearchCV(
    estimator=pipe_model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
).fit(X, y)

# 打印全部24组遍历的参数
all_params = grid_search.cv_results_['params']
print("="*80)
print("网格搜索 遍历的全部24组完整参数元组")
print("="*80)
for idx, param in enumerate(all_params, start=1):
    print(f"第{idx}轮 → {param}")

# 打印核心结果
print("="*80)
print(f"最优参数组合：{grid_search.best_params_}")
print(f"最优5折交叉验证平均分：{grid_search.best_score_:.4f}")
print("="*80)

运行结果

因为挨个捋了一遍
- 也就不需要 random_state

random_state 为什么偏偏是「42」？

`42`

原因①：【文化梗，最核心】
- 42 是《银河系漫游指南》的「宇宙终极答案」
- 这是42被奉为编程界随机种子「标准答案」的绝对根源
- 没有之一！
- 超级计算机花了750万年计算「生命、宇宙以及任何事情的终极答案」
- 最后给出的答案就是：42

原因②：【技术层面，无缺点】
- 42 是「中立完美种子」，无任何坑点
- 随机种子本质是伪随机数生成器的「初始密码」
- 只要是非负整数都可以
- 但选种子有隐性的「避坑原则」
- 而42完美符合所有原则：
1. ✅ 不是 0 或 1 容易生成规律化的随机序列
2. ✅ 不是太大的数：数字越大无意义，徒增记忆成本
3. ✅ 是普通整数：不是2的幂，不会触发伪随机数生成器的特殊序列，
原因③：【工程层面，最佳实践】
- 统一种子=统一标准
具体过程是啥？

分3步完成封神：

第一步：科幻圈 → Python圈
- Python之父 (Guido van Rossum) 是《银河系漫游指南》的铁杆粉丝
- 他在设计Python时
- 多次在源码、文档里埋下「42」的彩蛋
- 比如Python的内置库
- 示例代码里
- 随机种子默认都是42

影响：Python作为数据分析/机器学习的第一语言直接把「42」带到了数据科学界

第二步：Python圈 → sklearn/numpy官方文档
- numpy、sklearn的核心开发者
- 全是Python生态的核心成员
- 他们也都是这个梗的爱好者
- 在所有官方教程、API文档、示例代码里
- 所有需要写随机种子的地方，清一色用42

你的代码，就是学的官方文档的写法

第三步：官方文档 → 全球程序员的最佳实践
- 当所有人都看到「官方用42」
- 加上「42的梗深入人心」

总结

核心术语	对应你的代码	本质含义
Feature（特征）	$X$	输入模型的量化数据（你的身高、臂展、体重数据）
Label（标签）	$y$	数据的目标结果（你的0/1分类标签）
Estimator（估计器）	`RandomForestClassifier()`	机器学习模型/算法本身（sklearn中所有模型统称估计器）
Fitting（拟合）	`.fit(X, y)`	训练模型让模型学习特征与标签之间的规律
Predicting（预测）	`clf.predict(X_new)`	应用训练好的模型对新数据做结果预测/分类
Pipeline（流水线管道）	`Pipeline`	把「数据预处理+模型训练」串联成一个整体统一执行避免数据泄露简化代码逻辑
Cross Validation （交叉验证，CV）	`RandomizedSearchCV`	核心调参评估手段将数据集切分成K份轮换用「K-1份训练+1份测试」避免单次划分的偶然性评估结果更客观
Hyperparameter Search （超参数搜索）	`RandomizedSearchCV`	从设定的参数范围中寻找能让模型效果最优的超参数组合的过程

目前的数据都是完整的
- 但是如果有个球员的数据不完整
- 应该怎么办？
我们下次再说👋

本文来自 oeasy Python 系统教程。
想完整、扎实学 Python，
搜索 oeasy 即可。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

从零开始

回忆

目前架构

内圈的随机森林参数

RandomizedSearchCV

代码

代码细节

观察

随机范围

固定随机种子的核心作用

「真随机」 vs 「伪随机」

重复

24个随机小数

具体参数

具体参数

效果

RandomizedSearchCV

网格搜索

运行结果

`42`

分3步完成封神：

总结

FilesExpand file tree

0803.md

Latest commit

History

0803.md

File metadata and controls

从零开始

回忆

目前架构

内圈的随机森林参数

RandomizedSearchCV

代码

代码细节

观察

随机范围

固定随机种子的核心作用

「真随机」 vs 「伪随机」

重复

24个随机小数

具体参数

具体参数

效果

RandomizedSearchCV

网格搜索

运行结果

42

分3步完成封神：

总结

`42`