Evaluating a Machine Learning Model with the cross_val_predict() Function

In this post, we explore the cross_val_predict() function from the scikit-learn library and demonstrate how to use it with an XGBoost classifier on a multi-class dataset. The cross_val_predict() function works similarly to cross_val_score(), but instead of returning evaluation scores, it returns a prediction for each individual sample. Each prediction is generated by a model that never saw the corresponding sample during training, which makes these out-of-fold predictions useful for realistic performance evaluation. In this example, we use the np.argmax() function to convert predicted probabilities into class labels by selecting the class with the highest probability for each sample.
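
To make the distinction concrete, here is a minimal standalone sketch contrasting the two functions. It uses a LogisticRegression model purely for illustration; any scikit-learn estimator would work the same way:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# cross_val_score returns one evaluation score per fold
scores = cross_val_score(clf, X, y, cv=5)
print(scores.shape)   # (5,)

# cross_val_predict returns one out-of-fold prediction per sample
preds = cross_val_predict(clf, X, y, cv=5)
print(preds.shape)    # (150,)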

For cross-validation, we use StratifiedKFold to split the data into five folds. This ensures that each fold preserves the same class distribution as the overall dataset, which is particularly helpful when dealing with imbalanced classes. We will use the classic Iris dataset, which contains three flower species and is a standard example of a multi-class classification task.
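
As a quick sanity check on the stratification, the short sketch below (independent of the main example) prints the per-class counts in each test fold; on Iris, every fold ends up with exactly ten samples of each of the three classes:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # np.bincount counts how many samples of each class land in this test fold
    print(f"Fold {fold}: test class counts = {np.bincount(y[test_idx])}")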

Here is the complete code example:

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import roc_auc_score, matthews_corrcoef, accuracy_score
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.utils import shuffle
from sklearn.preprocessing import label_binarize

def load_ml_data(rseed=0):
    """
    Load the sample data for testing the code: the Iris dataset from sklearn
    """
    data = load_iris()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target)
    X, y = shuffle(X, y, random_state=rseed)
    print(f"Data shape: {X.shape}, {np.unique(y)}")
    return X, y

def estimate_prediction(X, y, rseed=1234):
    """
    Train and evaluate an XGBoost model using 5-fold cross-validation
    """
    # Define the model - use default values of the XGBoost parameters.
    # Note: the use_label_encoder=False flag was needed in XGBoost 1.x to
    # silence a warning; the parameter was removed in XGBoost 2.0, so we
    # omit it here.
    model = XGBClassifier(eval_metric='logloss')

    # use 5-fold cross-validation strategy
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=rseed)

    # get predicted probabilities
    y_pred_proba = cross_val_predict(model, X, y, cv=cv, method='predict_proba')

    # assign labels
    y_pred = np.argmax(y_pred_proba, axis=1)

    # calculate Accuracy
    accuracy = accuracy_score(y, y_pred)

    # calculate MCC
    mcc = matthews_corrcoef(y, y_pred)

    # compute AUC (One-vs-Rest for multi-class)
    y_bin = label_binarize(y, classes=np.unique(y))
    auc = roc_auc_score(y_bin, y_pred_proba, multi_class='ovr')

    # print results
    print(f"Accuracy: {accuracy:.4f}")
    print(f"MCC: {mcc:.4f}")
    print(f"AUC (OvR): {auc:.4f}")

def main():
    """
    Demonstrate the use of the cross_val_predict() function
    """
    data, labels = load_ml_data(rseed=1001)
    estimate_prediction(data, labels, rseed=1001)


if __name__ == "__main__":
    main()
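
One small note on the AUC step: the label_binarize() call is not strictly required on recent scikit-learn versions (0.22 and later), because roc_auc_score() accepts the original integer labels directly when given a probability matrix and the multi_class argument:

# equivalent to the label_binarize + roc_auc_score combination above
auc = roc_auc_score(y, y_pred_proba, multi_class='ovr')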

Please let me know in the comments if you find any errors in this code.
