Synthetic data generation for machine learning classification/clustering using Python sklearn library

The UCI Machine Learning Repository has several good datasets for running classification, clustering, or regression algorithms. However, if you want synthetic data to test your algorithms, the sklearn library provides functions that can generate it. In this post, I use the make_blobs() and make_classification() functions to generate a matrix of features and the corresponding discrete targets.

You can create multiclass datasets using either of these functions. Both allocate each class one or more normally distributed clusters of points. As per the sklearn documentation, make_blobs() provides greater control over the centers and standard deviations of each cluster and is typically used to demonstrate clustering. On the other hand, if you want to introduce artificial noise into the data, you can use make_classification(). It adds noise through correlated, redundant, and uninformative features; multiple Gaussian clusters per class; and linear transformations of the feature space.
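
As a minimal sketch of that difference in control, the snippet below passes explicit centers and per-cluster standard deviations to make_blobs(), and sets the number of informative and redundant features and the clusters per class in make_classification(). The specific values are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_blobs, make_classification

# make_blobs: explicit cluster centers and one standard deviation per cluster
# (the coordinates and spreads below are arbitrary illustrative values)
centers = [[0.0, 0.0], [5.0, 5.0]]
X_blobs, y_blobs = make_blobs(
    n_samples=200, centers=centers, cluster_std=[0.5, 1.5], random_state=0
)

# make_classification: mix informative, redundant, and uninformative features
# 3 informative + 2 redundant; the remaining 5 of the 10 features are noise
X_clf, y_clf = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=3,
    n_redundant=2,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=0,
)

print(X_blobs.shape, X_clf.shape)  # (200, 2) (200, 10)

# The per-cluster means of the blobs should sit close to the requested centers
for label in (0, 1):
    print(label, X_blobs[y_blobs == label].mean(axis=0).round(2))
```

With make_blobs() the number of features is inferred from the centers when they are given as coordinates, which is why n_features is not passed here.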

The following code generates the synthetic data and saves it in a TSV file.

Using make_blobs()

from sklearn.datasets import make_blobs
import pandas as pd

#### Generate synthetic data and labels ####
# n_samples: number of samples in the data
# centers: number of clusters to generate (each cluster gets its own label)
# n_features: number of features for each sample
# shuffle: whether to shuffle the samples (if False, samples of one cluster stay together)
X, y = make_blobs(n_samples=500, centers=2, n_features=10, random_state=1, shuffle=True)

# Create a dataframe using the data
df = pd.DataFrame(X)

# add labels to a new column
df['y'] = y

# make label the first column
cols = list(df.columns)
ordered_cols = [cols[-1]] + cols[:-1]  # avoid shadowing the built-in id()
df = df[ordered_cols]

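# write dataframe to a tab separated file
# index=False keeps the row index out of the file, so the label stays the first column
df.to_csv('data1.tsv', sep='\t', index=False)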
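
A quick way to check what was saved is to read the file back with pandas. The sketch below regenerates the same blobs, writes them to a temporary directory, and reloads them. Writing with index=False and using a temporary directory are assumptions made for this example so that the label is the first column and no file is left behind.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=500, centers=2, n_features=10, random_state=1)

df = pd.DataFrame(X)
df.insert(0, 'y', y)  # place the label in the first column

# Write to a temporary directory (assumption: keeps the working directory clean)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data1.tsv')
    df.to_csv(path, sep='\t', index=False)  # index=False: do not write the row index
    loaded = pd.read_csv(path, sep='\t')

print(loaded.shape)                # (500, 11): one label column plus 10 features
print(loaded['y'].value_counts())  # 250 samples in each of the two clusters
```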

Using make_classification()

from sklearn.datasets import make_classification
import pandas as pd

############################################
#    Generate synthetic data and labels    #
############################################
# n_samples: number of samples in the data
# n_classes: number of classes (each class may consist of several clusters)
# n_features: number of features for each sample
# shuffle: whether to shuffle the samples and the features
# flip_y: the fraction of samples whose class is assigned randomly
X, y = make_classification(n_samples=500, n_features=10, n_classes=2, random_state=0, shuffle=True, flip_y=0.01)

# Create a dataframe using the data
df = pd.DataFrame(X)

# add labels to a new column
df['y'] = y

# make label the first column
cols = list(df.columns)
ordered_cols = [cols[-1]] + cols[:-1]  # avoid shadowing the built-in id()
df = df[ordered_cols]

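# write dataframe to a tab separated file
# index=False keeps the row index out of the file, so the label stays the first column
df.to_csv('data2.tsv', sep='\t', index=False)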
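
Both snippets repeat the same steps for building the dataframe and moving the label to the front. One possible way to share that logic is a small helper, sketched below. The name to_labeled_frame is hypothetical and not part of sklearn or pandas.

```python
import pandas as pd
from sklearn.datasets import make_blobs, make_classification


def to_labeled_frame(X, y):
    """Return a DataFrame with the label column 'y' first, then the features.

    Hypothetical helper for this post; not a library function.
    """
    df = pd.DataFrame(X)
    df.insert(0, 'y', y)
    return df


# Same generator calls as in the two snippets above
X1, y1 = make_blobs(n_samples=500, centers=2, n_features=10,
                    random_state=1, shuffle=True)
X2, y2 = make_classification(n_samples=500, n_features=10, n_classes=2,
                             random_state=0, shuffle=True, flip_y=0.01)

df1 = to_labeled_frame(X1, y1)
df2 = to_labeled_frame(X2, y2)

print(df1.shape, df2.shape)  # (500, 11) (500, 11)
```

Each frame can then be written out with to_csv() and sep='\t' in the same way as before.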
