sklearn GridSearchCV and make_scorer

  • AI
  • kaggle

posted on 20 Oct 2017 under category life experience

GridSearchCV

Take xgboost as an example: first fix the parameters that stay constant, then define the ranges for the parameters to tune, and finally use GridSearchCV to compute a score for each parameter combination. The scores are then used to pick the parameters, i.e. model selection.

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Parameters to search over.
param_gsearch = {"max_depth": range(3, 10, 2),
                 "min_child_weight": range(1, 6, 2)}

# Fixed parameters.
param = {"learning_rate": 0.2, "n_estimators": 20, "max_depth": 5,
         "min_child_weight": 1, "gamma": 0, "subsample": 0.8,
         "colsample_bytree": 0.8, "objective": "binary:logistic",
         "n_jobs": 2, "scale_pos_weight": 1, "random_state": 2017}
clf = XGBClassifier(**param)

# verbose > 0 prints information about the CV process, which makes it easier to monitor.
# Note: the iid parameter has been removed in recent versions of scikit-learn.
gsearch = GridSearchCV(estimator=clf, param_grid=param_gsearch, scoring=gini_scorer,
                       n_jobs=2, iid=False, cv=5, verbose=5)
gsearch.fit(train_df[col_X], train_df[col_target])
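
After the search finishes, the best parameter combination and its cross-validated score can be read directly from the fitted GridSearchCV object. A minimal sketch using the standard scikit-learn attributes:

# Best parameter combination found on the grid and its mean CV score.
print(gsearch.best_params_)
print(gsearch.best_score_)

# Full per-combination results (mean/std test scores, fit times, etc.).
cv_results = gsearch.cv_results_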

make_scorer

Note that the GridSearchCV call above passes scoring=gini_scorer, where gini_scorer is a custom score function. This makes it convenient to judge how good a model is during cross-validation with your own scoring function.

The concrete usage is as follows:

import numpy as np


def gini(y, pred):
    # Sort by predicted value (descending), breaking ties by original index.
    g = np.asarray(np.c_[y, pred, np.arange(len(y))], dtype=float)
    g = g[np.lexsort((g[:, 2], -1 * g[:, 1]))]
    # Cumulative share of positives relative to the total number of positives.
    gs = g[:, 0].cumsum().sum() / g[:, 0].sum()
    gs -= (len(y) + 1) / 2.
    return gs / len(y)


def gini_normalized(y, pred, pred_use_col):
    # Keep only the predicted probability of the chosen class column.
    pred = [a[pred_use_col] for a in pred]
    return gini(y, pred) / gini(y, y)


from sklearn.metrics import make_scorer

gini_scorer = make_scorer(gini_normalized, greater_is_better=True, needs_proba=True, pred_use_col=1)

Note the arguments accepted by make_scorer: the last part is **kwargs, which forwards any extra parameters the custom score function needs. In this example that is pred_use_col, used to take the second column of the predicted probabilities as the prediction result, which is then combined with the ground truth to compute the score. The usage is simply scoring=gini_scorer in GridSearchCV.
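
The same scorer object works anywhere scikit-learn accepts a scoring argument, and it can also be called directly on a fitted estimator. A minimal sketch, reusing train_df, col_X and col_target from above (the train/validation split variables are only for illustration):

from sklearn.model_selection import cross_val_score, train_test_split

# A scorer built with make_scorer is a callable with signature scorer(estimator, X, y);
# because needs_proba=True it calls predict_proba internally, then gini_normalized.
X_tr, X_val, y_tr, y_val = train_test_split(train_df[col_X], train_df[col_target],
                                            test_size=0.2, random_state=2017)
clf.fit(X_tr, y_tr)
print(gini_scorer(clf, X_val, y_val))

# It can also be passed to cross_val_score, exactly as to GridSearchCV.
scores = cross_val_score(clf, train_df[col_X], train_df[col_target],
                         scoring=gini_scorer, cv=5)
print(scores.mean())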