Competition Background
In an era of rapid technological change, artificial intelligence (AI) is penetrating scientific research with unprecedented depth and breadth, and it has shown enormous potential in chemistry and drug discovery in particular. Accurate prediction of molecular properties helps to efficiently screen drug candidates with superior performance. Take PROTACs as an example: a PROTAC is a heterobifunctional molecule composed of a target-protein ligand, a linker, and an E3 ligase ligand; by recruiting the target protein and an E3 ligase into a ternary complex, it directs the targeted degradation of that protein. This competition focuses on applying advanced AI algorithms to predict degradation efficacy. It aims to spark participants' creativity, drive a deeper fusion of AI with chemical biology, further improve the efficiency and success rate of drug development, and contribute to human health. Through this competition, we hope to see and incubate more accurate and efficient molecular property prediction models, together opening a new era of drug discovery.
Task and Data
Participants are given a demo dataset; they may expand it through data augmentation, by collecting additional data themselves, and so on, and may split the data however they see fit. Using deep learning, reinforcement learning, or other more advanced AI methods, the task is to predict the degradation ability of PROTACs: if DC50 > 100 nM and Dmax < 80%, the degradation ability is considered poor (Label = 0 in the demo dataset); if DC50 <= 100 nM or Dmax >= 80%, the degradation ability is considered good (Label = 1 in the demo dataset).
In plain terms:
[Train a molecular-property classification model] Use deep learning, reinforcement learning, or other more advanced AI methods to predict the degradation ability of PROTACs, classifying each molecule into one of two buckets: poor degrader / good degrader.
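As a quick sanity check, this labeling rule can be reproduced directly from the raw training file. A minimal sketch, assuming the training file carries the 'DC50 (nM)', 'Dmax (%)' and 'Label' columns that the code below reads:
import pandas as pd

df = pd.read_excel('./dataset-new/traindata-new.xlsx')
# Label = 1 (good degrader) when DC50 <= 100 nM or Dmax >= 80%, otherwise 0
derived = ((df['DC50 (nM)'] <= 100) | (df['Dmax (%)'] >= 80)).astype(int)
print((derived == df['Label']).mean())  # should come out at (or very near) 1.0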
Evaluation Metric
The competition is evaluated with f1_score; the higher the score, the better the model.
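F1 is the harmonic mean of precision and recall, F1 = 2·P·R / (P + R). It is computed on hard 0/1 labels, so probabilistic model outputs must be thresholded first; a minimal usage sketch with made-up arrays:
from sklearn.metrics import f1_score
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.4, 0.3, 0.8, 0.1])  # made-up model probabilities
y_pred = (y_prob > 0.5).astype(int)           # F1 needs hard labels
print(f1_score(y_true, y_pred))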
Solution Approach
The task is to build a model from the training samples that predicts the properties of the molecules in the test set. This is a binary classification problem: based on features such as assay-related information and structural information, predict each molecule's property label. Concretely, participants perform feature engineering, model selection, and training on the given dataset, then use the trained model to predict the molecules in the test set and generate the corresponding submission.
Import the necessary libraries
import numpy as np
import pandas as pd
import joblib
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import f1_score
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors, GraphDescriptors, Lipinski
from rdkit.Chem.rdMolDescriptors import CalcMolFormula, CalcTPSA
from rdkit.Chem.Crippen import MolLogP
from sklearn.feature_extraction.text import TfidfVectorizer
from openfe import OpenFE, tree_to_formula, transform, TwoStageFeatureSelector
from gensim.models import Word2Vec
import tqdm, sys, os, gc, re, argparse, warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)
Read the data and drop columns with fewer than 10 non-null values
train = pd.read_excel('./dataset-new/traindata-new.xlsx')
test = pd.read_excel('./dataset-new/testdata-new.xlsx')  # the test data does not include DC50 (nM) or Dmax (%)
train = train.drop(['DC50 (nM)', 'Dmax (%)'], axis=1)
# Collect into drop_cols the names of columns that have fewer than 10 non-null values in the test set
drop_cols = []
for f in test.columns:
    if test[f].notnull().sum() < 10:
        drop_cols.append(f)
# Drop these columns from both train and test so that columns dominated
# by missing values are not used in later analysis or modeling
train = train.drop(drop_cols, axis=1)
test = test.drop(drop_cols, axis=1)
Feature Engineering
# Use pd.concat to merge the cleaned train and test sets into a single DataFrame named data,
# so that feature engineering can be applied to both at once
data = pd.concat([train, test], axis=0, ignore_index=True)
cols = data.columns[2:]
Feature correlation analysis
train_label = train.copy()

# Ordinal (natural-number) encoding for categorical columns
def label_encode(series):
    unique = list(series.unique())
    return series.map(dict(zip(unique, range(series.nunique()))))

object_cols = train_label.select_dtypes(include=['object']).columns
for col in object_cols:
    train_label[col] = label_encode(train_label[col])

# Absolute Pearson correlation of every feature with the label
features = train_label.columns[1:]
corr = []
for feat in features:
    corr.append(abs(train_label[[feat, "Label"]].fillna(0).corr().values[0][1]))
se = pd.Series(corr, index=features).sort_values(ascending=False)
se
data = data.drop(se[-6:].index, axis=1)  # drop the 6 features least correlated with the label
Extracting SMILES features
DeepChem is a machine learning library for scientific research. It originally focused on chemical molecules, but over successive releases it has grown to support scientific applications more broadly. A few things I think this module does particularly well:
- It conveniently represents chemical molecules as fixed-length vectors or matrices, which makes them easy to feed into machine learning code;
- It provides an easy-to-use machine learning interface, so you do not have to learn a dedicated ML framework (such as TensorFlow or PyTorch);
- It is highly encapsulated and easy to pick up. The flip side is that fine-grained parameter tuning is less convenient; for that you need to read the source code and adjust things once you understand them.
import deepchem as dc
dc_smiles = data['Smiles']
rdkit_featurizer = dc.feat.RDKitDescriptors()
rdkit_feature = rdkit_featurizer.featurize(dc_smiles)
dc_feature = pd.DataFrame(rdkit_feature)
dc_feature.columns = [f'smiles_dc_{i}' for i in range(dc_feature.shape[1])]
# Drop descriptor columns that are zero in every row (704 appears to be the row count of the merged data)
zeros_count = dc_feature.eq(0).sum()
columns_to_drop = zeros_count[zeros_count >= 704].index.tolist()
smiles_feature = dc_feature.drop(columns=columns_to_drop)
Extracting InChI features
atomic_masses = {
    'H': 1.008, 'He': 4.002602, 'Li': 6.94, 'Be': 9.0122, 'B': 10.81, 'C': 12.01,
    'N': 14.01, 'O': 16.00, 'F': 19.00, 'Ne': 20.180, 'Na': 22.990, 'Mg': 24.305,
    'Al': 26.982, 'Si': 28.085, 'P': 30.97, 'S': 32.07, 'Cl': 35.45, 'Ar': 39.95,
    'K': 39.10, 'Ca': 40.08, 'Sc': 44.956, 'Ti': 47.867, 'V': 50.942, 'Cr': 52.00,
    'Mn': 54.938, 'Fe': 55.845, 'Co': 58.933, 'Ni': 58.69, 'Cu': 63.55, 'Zn': 65.38
}

# Parse a single InChI string into per-element atom counts
def parse_inchi(row):
    inchi_str = row['InChI']
    formula = ''
    molecular_weight = 0
    element_counts = {}
    # Extract the molecular formula
    formula_match = re.search(r"InChI=1S/([^/]+)/c", inchi_str)
    if formula_match:
        formula = formula_match.group(1)
        # Accumulate the molecular weight and the atom counts
        # (element symbols are kept as written, e.g. 'Cl', so they match atomic_masses)
        for element, count in re.findall(r"([A-Z][a-z]*)([0-9]*)", formula):
            count = int(count) if count else 1
            element_mass = atomic_masses.get(element, 0)
            molecular_weight += element_mass * count
            element_counts[element] = count
    return pd.Series({'ElementCounts': element_counts})

# Apply the function to every row of the DataFrame
data[['ElementCounts']] = data.apply(parse_inchi, axis=1)

# The element keys that may occur
keys = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn']
# Create an empty DataFrame whose columns are the keys
df_expanded = pd.DataFrame({key: pd.Series() for key in keys})
# Walk over the data and fill in the DataFrame
for index, item in enumerate(data['ElementCounts'].values):
    for key in keys:
        # Fill each column with the count from the dict (0 when the element is absent)
        df_expanded.at[index, key] = item.get(key, 0)
df_expanded = pd.DataFrame(df_expanded)
# As with the SMILES descriptors, drop columns that are zero in every row
zeros_count = df_expanded.eq(0).sum()
columns_to_drop = zeros_count[zeros_count >= 704].index.tolist()
inchi_keys = df_expanded.drop(columns=columns_to_drop)
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors, GraphDescriptors, Lipinski
def calculate_descriptors(inchi):
    # Parse the InChI string into an RDKit molecule
    mol = Chem.MolFromInchi(inchi)
    h_donors = Descriptors.NumHDonors(mol)                   # hydrogen-bond donors
    h_acceptors = Descriptors.NumHAcceptors(mol)             # hydrogen-bond acceptors
    rotatable_bonds = Descriptors.NumRotatableBonds(mol)     # rotatable bonds
    aromatic_ring_count = Descriptors.NumAromaticRings(mol)  # aromatic rings
    tpsa = rdMolDescriptors.CalcTPSA(mol)                    # topological polar surface area
    xlogp = Descriptors.MolLogP(mol)                         # XLogP (Crippen logP)
    num_valence_electrons = Descriptors.NumValenceElectrons(mol)  # valence electrons
    avg_ipc = GraphDescriptors.AvgIpc(mol)                   # average information content
    balaban_j = GraphDescriptors.BalabanJ(mol)               # Balaban's J index
    bertz_ct = GraphDescriptors.BertzCT(mol)                 # BertzCT complexity
    heavy_atom_mol_wt = Descriptors.HeavyAtomMolWt(mol)      # heavy-atom molecular weight
    max_abs_partial_charge = Descriptors.MaxAbsPartialCharge(mol)  # max absolute partial charge
    max_partial_charge = Descriptors.MaxPartialCharge(mol)         # max partial charge
    min_abs_partial_charge = Descriptors.MinAbsPartialCharge(mol)  # min absolute partial charge
    min_partial_charge = Descriptors.MinPartialCharge(mol)         # min partial charge
    kappa1 = rdMolDescriptors.CalcKappa1(mol)                # Kier kappa1 shape index
    kappa2 = rdMolDescriptors.CalcKappa2(mol)                # Kier kappa2 shape index
    kappa3 = rdMolDescriptors.CalcKappa3(mol)                # Kier kappa3 shape index
    labute_asa = rdMolDescriptors.CalcLabuteASA(mol)         # Labute approximate surface area
    morgan_fingerprint = rdMolDescriptors.GetMorganFingerprint(mol, 2)  # Morgan fingerprint (not returned below)
    kappa = rdMolDescriptors.CalcPhi(mol)                    # Kier Phi flexibility index
    num_saturated_carbocycles = rdMolDescriptors.CalcNumSaturatedCarbocycles(mol)    # saturated carbocycles
    num_saturated_heterocycles = rdMolDescriptors.CalcNumSaturatedHeterocycles(mol)  # saturated heterocycles
    num_saturated_rings = rdMolDescriptors.CalcNumSaturatedRings(mol)                # saturated rings
    num_spiro_atoms = rdMolDescriptors.CalcNumSpiroAtoms(mol)  # spiro atoms
    rdMolDescriptors.CalcOxidationNumbers(mol)               # oxidation numbers (stored on the atoms; return value unused)
    fraction_csp3 = Lipinski.FractionCSP3(mol)               # fraction of sp3 carbons
    nhoh_count = Lipinski.NHOHCount(mol)                     # NHOH count
    no_count = Lipinski.NOCount(mol)                         # NO count
    num_heteroatoms = Lipinski.NumHeteroatoms(mol)           # heteroatoms
    num_aliphatic_carbocycles = Lipinski.NumAliphaticCarbocycles(mol)    # aliphatic carbocycles
    num_aliphatic_heterocycles = Lipinski.NumAliphaticHeterocycles(mol)  # aliphatic heterocycles
    num_aliphatic_rings = Lipinski.NumAliphaticRings(mol)                # aliphatic rings
    num_aromatic_carbocycles = Lipinski.NumAromaticCarbocycles(mol)      # aromatic carbocycles
    num_aromatic_heterocycles = Lipinski.NumAromaticHeterocycles(mol)    # aromatic heterocycles
    mol_refractivity = Descriptors.MolMR(mol)                # molar refractivity
    return {
        "H-Bond Donors": h_donors, "H-Bond Acceptors": h_acceptors,
        "Rotatable Bonds": rotatable_bonds, "Aromatic Ring Count": aromatic_ring_count,
        "TPSA": tpsa, "XLogP": xlogp,
        "Num Valence Electrons": num_valence_electrons,
        "Average Information Content": avg_ipc,
        "Balaban's J": balaban_j, "BertzCT Complexity": bertz_ct,
        "Heavy Atom Molecular Weight": heavy_atom_mol_wt,
        "Max Absolute Partial Charge": max_abs_partial_charge,
        "Max Partial Charge": max_partial_charge,
        "Min Absolute Partial Charge": min_abs_partial_charge,
        "Min Partial Charge": min_partial_charge,
        "Kappa1": kappa1, "Kappa2": kappa2, "Kappa3": kappa3,
        "Labute Accessible Surface Area": labute_asa,
        "Kier Phi Flexibility": kappa,
        "Saturated Carbocycles": num_saturated_carbocycles,
        "Saturated Heterocycles": num_saturated_heterocycles,
        "Saturated Rings": num_saturated_rings, "Spiro Atoms": num_spiro_atoms,
        "CSP3 Fraction": fraction_csp3, "NHOH Count": nhoh_count, "NO Count": no_count,
        "Heteroatoms": num_heteroatoms,
        "Aliphatic Carbocycles": num_aliphatic_carbocycles,
        "Aliphatic Heterocycles": num_aliphatic_heterocycles,
        "Aliphatic Rings": num_aliphatic_rings,
        "Aromatic Carbocycles": num_aromatic_carbocycles,
        "Aromatic Heterocycles": num_aromatic_heterocycles,
        "Molar Refractivity": mol_refractivity,
    }

# Collect the extracted descriptors for every molecule
features_list = []
for inchi in data['InChI']:
    features = calculate_descriptors(inchi)
    features_list.append(features)
# Convert the list into a DataFrame
inchi_features = pd.DataFrame(features_list)
# Append all extracted features to the original dataset
data = pd.concat([data, smiles_feature, inchi_keys, inchi_features], axis=1)
data[:4]
Filter features based on the correlation analysis
data = data.drop(['ElementCounts'], axis=1)

# Ordinal (natural-number) encoding, as before
def label_encode(series):
    unique = list(series.unique())
    return series.map(dict(zip(unique, range(series.nunique()))))

object_cols = data.select_dtypes(include=['object']).columns
for col in object_cols:
    data[col] = label_encode(data[col])

train = data[data.Label.notnull()].reset_index(drop=True)
test = data[data.Label.isnull()].reset_index(drop=True)

features1 = train.columns[1:]
corr1 = []
for feat in features1:
    corr1.append(abs(train[[feat, "Label"]].fillna(0).corr().values[0][1]))
se1 = pd.Series(corr1, index=features1).sort_values(ascending=False)
drop_se1 = se1.index[-4:]
# Drop the 4 features least correlated with the label from both train and test
train = train.drop(drop_se1, axis=1)
test = test.drop(drop_se1, axis=1)
train[:3]
# Feature selection
features = [f for f in train.columns if f not in ['uuid', 'Label']]
# Build the train and test feature matrices
x_train = train[features]
x_test = test[features]
# Training labels
y_train = train['Label'].astype(int)
x_train.info()
# Replace special characters in column names (models such as LightGBM reject them)
train.rename(columns=lambda x: re.sub(r'[^\w\s]', '_', x), inplace=True)
test.rename(columns=lambda x: re.sub(r'[^\w\s]', '_', x), inplace=True)
OpenFE feature construction
OpenFE, short for Open Feature Engineering, is an open-source Python library designed to simplify and automate the feature engineering process. By providing a set of tools and functions, OpenFE lets data scientists and machine learning engineers create, test, and deploy features more efficiently.
- Automatic feature generation: OpenFE creates new features from the existing data automatically, helping to lift model performance.
- Feature selection and pruning: it ships several selection methods that identify and keep the most valuable features while removing redundant or irrelevant ones.
- Easy-to-use API: the API is concise and intuitive, so even users without much programming experience can get started quickly.
- Flexible and extensible: users can define their own feature-transformation rules, so OpenFE adapts to all kinds of data and project requirements.
ofe = OpenFE()
features = ofe.fit(data=x_train, label=y_train, n_jobs=6)
joblib.dump(ofe, "ofe.pkl")
# Print the formula of each generated feature
for feature in ofe.new_features_list:
    print(tree_to_formula(feature))
# Apply the generated features to both train and test
x_train, x_test = transform(x_train, x_test, features, n_jobs=6)
# Cast OpenFE's category columns to integers
cat_columns = x_train.select_dtypes(include=['category']).columns
x_train[cat_columns] = x_train[cat_columns].astype(np.int32)
cat_columns = x_test.select_dtypes(include=['category']).columns
x_test[cat_columns] = x_test[cat_columns].astype(np.int32)
Model Training
The code here draws on the book《机器学习算法竞赛实战》(Machine Learning Algorithm Competitions in Practice).
lgb
Model-based feature selection
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold
from hyperopt import hp, fmin, tpe
from numpy.random import RandomState
from sklearn.metrics import mean_squared_error, f1_score
deffeature_select_wrapper(train, test):"""
:param train:
:param test:
:return:
"""print('feature_select_wrapper...')
label ='Label'
features = train.columns.tolist()
features.remove('uuid')
features.remove('Label')# 配置模型的训练参数
params_initial ={'num_leaves':31,'learning_rate':0.1,'boosting':'gbdt','min_child_samples':20,'bagging_seed':2020,'bagging_fraction':0.7,'bagging_freq':1,'feature_fraction':0.7,'max_depth':-1,'metric':'auc','reg_alpha':0,'reg_lambda':1,'objective':'binary'}
ESR =30
NBR =10000
VBE =50
kf = KFold(n_splits=5, random_state=2020, shuffle=True)
fse = pd.Series(0, index=features)
callbacks =[lgb.early_stopping(stopping_rounds=30, verbose=50)]for train_part_index, eval_index in kf.split(train[features], train[label]):# 模型训练
train_part = lgb.Dataset(train[features].loc[train_part_index],
train[label].loc[train_part_index])
eval1 = lgb.Dataset(train[features].loc[eval_index],
train[label].loc[eval_index])
bst = lgb.train(params_initial, train_part, num_boost_round=10000,
valid_sets=[train_part, eval1],
valid_names=['train','valid'],
callbacks=callbacks
)
fse += pd.Series(bst.feature_importance(), features)
feature_select =['uuid']+ fse.sort_values(ascending=False).index.tolist()[:200]print('done')return train[feature_select +['Label']], test[feature_select]
Hyperparameter search
defparams_append(params):"""
:param params:
:return:
"""
params['objective']='binary'
params['metric']='auc'
params['bagging_seed']=2020return params
defparam_hyperopt(train):"""
:param train:
:return:
"""
label ='Label'
features = train.columns.tolist()
features.remove('uuid')
features.remove('Label')
params1 ={'feature_pre_filter':False}
train_data = lgb.Dataset(train[features], train[label], params = params1)
callbacks1 =[lgb.early_stopping(stopping_rounds=20, verbose=False),lgb.log_evaluation(show_stdv=False)]defhyperopt_objective(params):"""
:param params:
:return:
"""
params = params_append(params)print(params)
res = lgb.cv(params, train_data,1000,
nfold=2,
stratified=False,
shuffle=True,
metrics='auc',
seed=2020,
callbacks=callbacks1)returnmin(res['valid auc-mean'])
params_space ={'learning_rate': hp.uniform('learning_rate',1e-2,5e-1),'bagging_fraction': hp.uniform('bagging_fraction',0.5,1),'feature_fraction': hp.uniform('feature_fraction',0.5,1),'num_leaves': hp.choice('num_leaves',list(range(10,300,10))),'reg_alpha': hp.randint('reg_alpha',0,10),'reg_lambda': hp.uniform('reg_lambda',0,10),'bagging_freq': hp.randint('bagging_freq',1,10),'min_child_samples': hp.choice('min_child_samples',list(range(1,30,5)))}
params_best = fmin(
hyperopt_objective,
space=params_space,
algo=tpe.suggest,
max_evals=100,
rstate=np.random.default_rng(2020))return params_best
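One caveat worth adding (not in the original post): for hp.choice dimensions such as num_leaves and min_child_samples, fmin returns the index into the choice list rather than the chosen value, so the raw result should be mapped back through the search space before it is reused for training. A minimal sketch with hyperopt's space_eval, assuming params_space is lifted to a scope where it is still visible:
from hyperopt import space_eval

# Map hp.choice indices in the fmin result back to the actual parameter values
params_best = space_eval(params_space, params_best)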
Model prediction
deftrain_predict(train, test, params):"""
:param train:
:param test:
:param params:
:return:
"""
label ='Label'
features = train.columns.tolist()
features.remove('uuid')
features.remove('Label')
params = params_append(params)
kf = KFold(n_splits=5, random_state=2020, shuffle=True)
prediction_test =0
cv_score =[]
prediction_train = pd.Series()
ESR =30
NBR =10000
VBE =50
callbacks =[lgb.early_stopping(stopping_rounds=30, verbose=50)]for train_part_index, eval_index in kf.split(train[features], train[label]):# 模型训练
train_part = lgb.Dataset(train[features].loc[train_part_index],
train[label].loc[train_part_index])eval= lgb.Dataset(train[features].loc[eval_index],
train[label].loc[eval_index])
bst = lgb.train(params, train_part, num_boost_round=NBR,
valid_sets=[train_part,eval],
valid_names=['train','valid'],
callbacks=callbacks)
prediction_test += bst.predict(test[features])
prediction_train = prediction_train._append(pd.Series(bst.predict(train[features].loc[eval_index]),
index=eval_index))
eval_pre = bst.predict(train[features].loc[eval_index]).astype(int)
score = np.sqrt(f1_score(train[label].loc[eval_index].values, eval_pre))
cv_score.append(score)print(cv_score,sum(cv_score)/5)
pd.Series(prediction_train.sort_index().values).to_csv("train_lightgbm.csv", index=False)
pd.Series(prediction_test /5).to_csv("test_lightgbm.csv", index=False)
test['Label']= prediction_test /5
test[['uuid','Label']].to_csv("submit_lightgbm.csv", index=False)return
train_select, test_select = feature_select_wrapper(train, test)
best_clf = param_hyperopt(train_select)
joblib.dump(best_clf,"best_clf.pkl")
best_clf = joblib.load('best_clf.pkl')
train_predict(train_select, test_select, best_clf)
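Note that submit_lightgbm.csv as written stores the averaged fold probabilities, while the competition scores hard 0/1 labels with f1_score. If the leaderboard expects binary labels (as the CatBoost section below assumes with its 0.5 cutoff), the probabilities would need thresholding first; a minimal sketch:
sub = pd.read_csv("submit_lightgbm.csv")
sub['Label'] = (sub['Label'] > 0.5).astype(int)  # assumed 0.5 cutoff
sub.to_csv("submit_lightgbm.csv", index=False)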
xgb
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import KFold
from hyperopt import hp, fmin, tpe
from scipy import sparse
from scipy.sparse import csr_matrix
from sklearn.feature_selection import f_regression, f_classif
from numpy.random import RandomState
from sklearn.metrics import mean_squared_error, f1_score
from bayes_opt import BayesianOptimization
def read_data1(debug=True):
    features = train.columns.tolist()
    features.remove('uuid')
    features.remove('Label')
    # Convert the feature frames to sparse CSR matrices for XGBoost
    train_x = csr_matrix(train[features].astype(pd.SparseDtype("float64", 0)).sparse.to_coo()).tocsr()
    test_x = csr_matrix(test[features].astype(pd.SparseDtype("float64", 0)).sparse.to_coo()).tocsr()
    print("done")
    return train_x, test_x
defparams_append1(params):"""
:param params:
:return:
"""
params['objective']='binary:hinge'
params['eval_metric']='auc'
params["min_child_weight"]=int(params["min_child_weight"])
params['max_depth']=int(params['max_depth'])return params
defparam_beyesian1(train):"""
:param train:
:return:
"""
train_y = pd.read_excel("dataset-new/traindata-new.xlsx")['Label'].values
train_data = xgb.DMatrix(train, train_y, silent=True)defxgb_cv(colsample_bytree, subsample, min_child_weight, max_depth,
reg_alpha, eta,
reg_lambda):"""
:param colsample_bytree:
:param subsample:
:param min_child_weight:
:param max_depth:
:param reg_alpha:
:param eta:
:param reg_lambda:
:return:
"""
params ={'objective':'binary:hinge','early_stopping_round':100,'eval_metric':'auc'}
params['colsample_bytree']=max(min(colsample_bytree,1),0)
params['subsample']=max(min(subsample,1),0)
params["min_child_weight"]=int(min_child_weight)
params['max_depth']=int(max_depth)
params['eta']=float(eta)
params['reg_alpha']=max(reg_alpha,0)
params['reg_lambda']=max(reg_lambda,0)print(params)
cv_result = xgb.cv(params, train_data,
num_boost_round=10000,
nfold=5, seed=2,
stratified=False,
shuffle=True,
early_stopping_rounds=30,
verbose_eval=False)return-min(cv_result['test-auc-mean'])
xgb_bo = BayesianOptimization(
xgb_cv,{'colsample_bytree':(0.5,1),'subsample':(0.5,1),'min_child_weight':(1,30),'max_depth':(5,12),'reg_alpha':(0,5),'eta':(0.02,1),'reg_lambda':(0,5)})
xgb_bo.maximize(init_points=21, n_iter=10)# init_points表示初始点,n_iter代表迭代次数(即采样数)print(xgb_bo.max['target'], xgb_bo.max['params'])return xgb_bo.max['params']
deftrain_predict1(train, test, params):"""
:param train:
:param test:
:param params:
:return:
"""
train_y = pd.read_excel("dataset-new/traindata-new.xlsx")['Label']
test_data = xgb.DMatrix(test)
params = params_append1(params)
kf = KFold(n_splits=5, random_state=2020, shuffle=True)
prediction_test =0
cv_score =[]
prediction_train = pd.Series()
ESR =30
NBR =10000
VBE =50for train_part_index, eval_index in kf.split(train, train_y):# 模型训练
train_part = xgb.DMatrix(train.tocsr()[train_part_index,:],
train_y.loc[train_part_index])
eval2 = xgb.DMatrix(train.tocsr()[eval_index,:],
train_y.loc[eval_index])
bst = xgb.train(params, train_part, NBR,[(train_part,'train'),(eval2,'eval')], verbose_eval=VBE,
maximize=False, early_stopping_rounds=ESR,)
prediction_test += bst.predict(test_data)
eval_pre = bst.predict(eval2)
prediction_train = prediction_train._append(pd.Series(eval_pre, index=eval_index))
score = np.sqrt(f1_score(train_y.loc[eval_index].values, eval_pre))
cv_score.append(score)print(cv_score,sum(cv_score)/5)
pd.Series(prediction_train.sort_index().values).to_csv("train_xgboost.csv", index=False)
pd.Series(prediction_test /5).to_csv("test_xgboost.csv", index=False)
test = pd.read_excel('dataset-new/testdata-new.xlsx')
test['Label']= prediction_test /5
test[['uuid','Label']].to_csv("submission_xgboost.csv", index=False)return
train1, test1 = read_data1(debug=False)
best_clf1 = param_beyesian1(train1)
train_predict1(train1, test1, best_clf1)
cat
def cv_model(clf, train_x, train_y, test_x, clf_name, seed=2024):
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} {} ************************************'.format(str(i + 1), str(seed)))
        trn_x, trn_y = train_x.iloc[train_index], train_y[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y[valid_index]
        params = {'learning_rate': 0.1, 'depth': 6, 'l2_leaf_reg': 10,
                  'bootstrap_type': 'Bernoulli', 'random_seed': seed,
                  'od_type': 'Iter', 'od_wait': 100,
                  'allow_writing_files': False, 'task_type': 'CPU'}
        model = clf(iterations=20000, **params, eval_metric='AUC')
        model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                  metric_period=100,
                  cat_features=[],
                  use_best_model=True,
                  verbose=1)
        # Out-of-fold and test predictions
        val_pred = model.predict_proba(val_x)[:, 1]
        test_pred = model.predict_proba(test_x)[:, 1]
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(f1_score(val_y, np.where(val_pred > 0.5, 1, 0)))
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

cat_train, cat_test = cv_model(CatBoostClassifier, x_train, y_train, x_test, "cat")
pd.DataFrame({'uuid': test['uuid'], 'Label': np.where(cat_test > 0.5, 1, 0)}).to_csv('submit_v4.csv', index=None)
To be continued…
All rights belong to the original author, Data新青年.