In this post, we explore the cross_val_predict() function from the scikit-learn library and demonstrate how to use it with an XGBoost classifier on a multi-class dataset. The cross_val_predict() function works similarly to cross_val_score(), but instead of returning evaluation scores, it returns predictions for each individual sample. These predictions are generated by models that never saw the corresponding data during training, making them useful for realistic performance evaluation. In this example, we use the np.argmax() function to convert predicted probabilities into class labels by selecting the class with the highest probability for each sample.
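To see how that conversion works, here is a tiny self-contained sketch using a made-up 3x3 probability matrix (the values are illustrative, not model output):

import numpy as np

# Hypothetical predicted probabilities for three samples and three classes
proba = np.array([
    [0.7, 0.2, 0.1],   # highest probability in column 0 -> class 0
    [0.1, 0.8, 0.1],   # highest probability in column 1 -> class 1
    [0.2, 0.3, 0.5],   # highest probability in column 2 -> class 2
])

labels = np.argmax(proba, axis=1)  # pick the most probable class per row
print(labels)  # [0 1 2]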
For cross-validation, we use the StratifiedKFold() class to split the data into five folds. This ensures that each fold maintains the same class distribution as the overall dataset, which is particularly helpful when dealing with imbalanced classes. We will use the classic Iris dataset, which contains three flower species and is a standard example of a multi-class classification task.
Here is the complete code example:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import roc_auc_score, matthews_corrcoef, accuracy_score
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.utils import shuffle
from sklearn.preprocessing import label_binarize


def load_ml_data(rseed=0):
    """
    Load the sample data for testing the code - the Iris dataset from sklearn.
    """
    data = load_iris()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target)
    X, y = shuffle(X, y, random_state=rseed)
    print(f"Data shape: {X.shape}, classes: {np.unique(y)}")
    return X, y


def estimate_prediction(X, y, rseed=1234):
    """
    Train and evaluate an XGBoost model using 5-fold cross-validation.
    """
    # Define the model - use default values of the XGBoost parameters.
    # 'mlogloss' is the multi-class log loss (the binary 'logloss' does not
    # apply to a three-class problem). Note that use_label_encoder is
    # deprecated in recent XGBoost versions and can be dropped there.
    model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
    # Use a 5-fold stratified cross-validation strategy
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=rseed)
    # Get out-of-fold predicted probabilities for every sample
    y_pred_proba = cross_val_predict(model, X, y, cv=cv, method='predict_proba')
    # Assign class labels by taking the most probable class per sample
    y_pred = np.argmax(y_pred_proba, axis=1)
    # Calculate accuracy
    accuracy = accuracy_score(y, y_pred)
    # Calculate the Matthews correlation coefficient (MCC)
    mcc = matthews_corrcoef(y, y_pred)
    # Compute AUC (one-vs-rest for multi-class)
    y_bin = label_binarize(y, classes=np.unique(y))
    auc = roc_auc_score(y_bin, y_pred_proba, multi_class='ovr')
    # Print results
    print(f"Accuracy: {accuracy:.4f}")
    print(f"MCC: {mcc:.4f}")
    print(f"AUC (OvR): {auc:.4f}")


def main():
    """
    Demonstration of the cross_val_predict() function.
    """
    data, labels = load_ml_data(rseed=1001)
    estimate_prediction(data, labels, rseed=1001)


if __name__ == "__main__":
    main()
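As a side note, if you only need the class labels and not the probabilities, cross_val_predict() can return labels directly via method='predict' (which is its default), making the np.argmax() step unnecessary. A minimal sketch, reusing model, X, y, and cv from the listing above; keep in mind you then lose the probabilities needed for the AUC calculation:

# Let cross_val_predict() return class labels directly
y_pred = cross_val_predict(model, X, y, cv=cv, method='predict')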
Please let me know in the comments if you find any errors in this code.