Sklearn GridSearchCV and make_scorer
posted on 20 Oct 2017 under category 人生经验
Taking xgboost as an example: first fix the parameters that will stay constant, then define the search ranges for the parameters to tune, and finally use GridSearchCV to compute a score for every parameter combination, which is then used to pick the best parameters, i.e. model selection.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Parameters to search over.
param_gsearch = {"max_depth": range(3, 10, 2),
                 "min_child_weight": range(1, 6, 2)}
# Parameters held fixed during the search.
param = {"learning_rate": 0.2, "n_estimators": 20, "max_depth": 5, "min_child_weight": 1,
         "gamma": 0, "subsample": 0.8, "colsample_bytree": 0.8,
         "objective": "binary:logistic", "n_jobs": 2, "scale_pos_weight": 1, "random_state": 2017}
clf = XGBClassifier(**param)
# verbose > 0 prints progress information for each CV fit, which makes the search easy to monitor.
# gini_scorer is the custom scorer defined further below.
# Note: the iid parameter was removed in scikit-learn 0.24; drop it on newer versions.
gsearch = GridSearchCV(estimator=clf, param_grid=param_gsearch, scoring=gini_scorer,
                       n_jobs=2, iid=False, cv=5, verbose=5)
gsearch.fit(train_df[col_X], train_df[col_target])
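Once the search has finished, the winning combination and the per-combination CV scores can be read off the fitted object. A short sketch (best_params_, best_score_ and cv_results_ are standard GridSearchCV attributes):

import pandas as pd

# Best parameter combination found on the grid and its mean CV score.
print(gsearch.best_params_)
print(gsearch.best_score_)
# cv_results_ holds the mean/std test score for every combination, useful for closer inspection.
print(pd.DataFrame(gsearch.cv_results_)[["params", "mean_test_score", "std_test_score"]])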
Note that the GridSearchCV call above passes scoring=gini_scorer, where gini_scorer is a custom score function: it lets you judge model quality during cross-validation with your own metric instead of the built-in ones. The concrete usage is as follows:
import numpy as np

def gini(y, pred):
    # Sort by predicted score (descending, ties broken by original order),
    # then accumulate the true labels to get the unnormalized Gini coefficient.
    g = np.asarray(np.c_[y, pred, np.arange(len(y))], dtype=float)
    g = g[np.lexsort((g[:, 2], -1 * g[:, 1]))]
    gs = g[:, 0].cumsum().sum() / g[:, 0].sum()
    gs -= (len(y) + 1) / 2.
    return gs / len(y)

def gini_normalized(y, pred, pred_use_col):
    # pred is the full predict_proba output; keep only the requested column.
    pred = [a[pred_use_col] for a in pred]
    return gini(y, pred) / gini(y, y)
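A quick sanity check on toy data (using the functions defined above; the numbers are made up for illustration): a prediction that ranks all positives ahead of all negatives should give a normalized Gini of 1.

y_toy = [0, 1, 1, 0]
proba_toy = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]]  # predict_proba-style output
print(gini_normalized(y_toy, proba_toy, pred_use_col=1))      # perfect ranking -> 1.0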
from sklearn.metrics import make_scorer
gini_scorer = make_scorer(gini_normalized, greater_is_better=True, needs_proba=True, pred_use_col=1)
The thing to note here is make_scorer's signature: its trailing **kwargs accepts whatever extra arguments the custom score function needs. In this example that is pred_use_col, which selects the second column of the predicted probabilities as the prediction; that column, together with the ground truth, is what the score function is computed on. The resulting scorer is then plugged in via scoring=gini_scorer in GridSearchCV.
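As a quick check, the object returned by make_scorer can also be called by hand as scorer(estimator, X, y); GridSearchCV invokes it the same way internally. In the sketch below valid_X and valid_y are hypothetical hold-out data. One caveat: newer scikit-learn releases may pass only the positive-class probabilities to a needs_proba scorer on binary problems, in which case the pred_use_col indexing inside gini_normalized would need adjusting.

# Hypothetical hold-out evaluation with the same scorer used during the grid search.
best_clf = gsearch.best_estimator_
print(gini_scorer(best_clf, valid_X, valid_y))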