cratepy.clustering.clusteringalgs.MiniBatchKMeansSK

class MiniBatchKMeansSK(init='k-means++', max_iter=100, tol=0.0, random_state=None, batch_size=100, max_no_improvement=10, init_size=None, n_init=3, reassignment_ratio=0.01, n_clusters=None)[source]

Bases: ClusteringAlgorithm

Mini-Batch K-Means clustering algorithm (wrapper).

Documentation: see the scikit-learn documentation of MiniBatchKMeans.

perform_clustering(self, data_matrix):

Perform cluster analysis and get cluster label of each dataset item.

Notes

The Mini-Batch K-Means clustering algorithm is taken from scikit-learn (https://scikit-learn.org), where further information can be found.

Constructor.

Parameters:
  • n_clusters (int, default=None) – Number of clusters to find.

  • init ({‘k-means++’, ‘random’, numpy.ndarray, callable}, default=‘k-means++’) – Method for centroid initialization.

  • n_init (int, default=3) – Number of random initializations that are tried (the best initialization is used to run the algorithm).

  • max_iter (int, default=100) – Maximum number of iterations.

  • tol (float, default=0.0) – Convergence tolerance (based on the Frobenius norm of the difference in the cluster centers of two consecutive iterations).

  • random_state (int, RandomState instance, default=None) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.

  • batch_size (int, default=100) – Size of the mini-batches.

  • max_no_improvement (int, default=10) – Number of consecutive mini-batches without improvement after which the algorithm stops early.

  • init_size (int, default=None) – Number of samples to randomly sample for speeding up the initialization (sometimes at the expense of accuracy): the algorithm is initialized by running a batch K-Means on a random subset of the data.

  • reassignment_ratio (float, default=0.01) – Control the fraction of the maximum number of counts for a center to be reassigned.
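The tol parameter is described in terms of the Frobenius norm of the change in the cluster centers between two consecutive iterations. A minimal sketch of such a check is given below. The helper name centers_converged is hypothetical and only illustrates the criterion as worded above; the check actually performed inside scikit-learn may be normalized differently.

```python
import numpy as np


def centers_converged(old_centers, new_centers, tol):
    """Hypothetical helper: True if the center update is within tol."""
    # For a 2d array, np.linalg.norm defaults to the Frobenius norm.
    shift = np.linalg.norm(new_centers - old_centers)
    return bool(shift <= tol)


# Two cluster centers in 2d, before and after one update.
old_centers = np.array([[0.0, 0.0], [4.0, 4.0]])
new_centers = np.array([[0.3, 0.4], [4.0, 4.0]])

# The difference is [[0.3, 0.4], [0.0, 0.0]], so the Frobenius norm is 0.5.
shift = np.linalg.norm(new_centers - old_centers)
strict = centers_converged(old_centers, new_centers, tol=1e-4)
loose = centers_converged(old_centers, new_centers, tol=1.0)
```

Here strict is False because the centers moved by 0.5, which exceeds 1e-4, while loose is True because the same move is below 1.0.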

List of Public Methods

perform_clustering

Perform cluster analysis and get cluster label of each dataset item.

Methods

__init__(init='k-means++', max_iter=100, tol=0.0, random_state=None, batch_size=100, max_no_improvement=10, init_size=None, n_init=3, reassignment_ratio=0.01, n_clusters=None)[source]

Constructor.

Parameters:
  • n_clusters (int, default=None) – Number of clusters to find.

  • init ({‘k-means++’, ‘random’, numpy.ndarray, callable}, default=‘k-means++’) – Method for centroid initialization.

  • n_init (int, default=3) – Number of random initializations that are tried (the best initialization is used to run the algorithm).

  • max_iter (int, default=100) – Maximum number of iterations.

  • tol (float, default=0.0) – Convergence tolerance (based on the Frobenius norm of the difference in the cluster centers of two consecutive iterations).

  • random_state (int, RandomState instance, default=None) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.

  • batch_size (int, default=100) – Size of the mini-batches.

  • max_no_improvement (int, default=10) – Number of consecutive mini-batches without improvement after which the algorithm stops early.

  • init_size (int, default=None) – Number of samples to randomly sample for speeding up the initialization (sometimes at the expense of accuracy): the algorithm is initialized by running a batch K-Means on a random subset of the data.

  • reassignment_ratio (float, default=0.01) – Control the fraction of the maximum number of counts for a center to be reassigned.
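The effect of passing an int as random_state can be illustrated with the underlying scikit-learn estimator. The sketch below calls sklearn.cluster.MiniBatchKMeans directly, on the assumption that the wrapper forwards its constructor arguments unchanged; the helper run_once and the synthetic data are illustrative only.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic data matrix of shape (n_items, n_features), for illustration only.
rng = np.random.default_rng(0)
data_matrix = rng.normal(size=(300, 4))


def run_once(seed):
    """Hypothetical helper: fit with a fixed seed and return the labels."""
    model = MiniBatchKMeans(n_clusters=3, init='k-means++', n_init=3,
                            batch_size=100, random_state=seed)
    return model.fit_predict(data_matrix)


# Same int seed twice: centroid initialization and mini-batch sampling repeat.
labels_a = run_once(42)
labels_b = run_once(42)
same_labels = np.array_equal(labels_a, labels_b)
```

With random_state=None the two runs would draw different random numbers and could produce different labelings.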

perform_clustering(data_matrix)[source]

Perform cluster analysis and get cluster label of each dataset item.

Parameters:

data_matrix (numpy.ndarray (2d)) – Data matrix containing the required data to perform the cluster analysis (numpy.ndarray of shape (n_items, n_features)).

Returns:

cluster_labels – Cluster label (int) assigned to each dataset item.

Return type:

numpy.ndarray (1d)
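A minimal usage sketch of the documented interface follows. The class MiniBatchKMeansSketch is a hypothetical stand-in written for this example, not the cratepy source: it assumes that the wrapper stores the constructor arguments and delegates to sklearn.cluster.MiniBatchKMeans when perform_clustering is called.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


class MiniBatchKMeansSketch:
    """Hypothetical stand-in mirroring the interface documented above."""

    def __init__(self, n_clusters=None, **sk_kwargs):
        # Keep the settings until clustering is requested.
        self._n_clusters = n_clusters
        self._sk_kwargs = sk_kwargs

    def perform_clustering(self, data_matrix):
        # Assumption: the wrapper fits the scikit-learn estimator and
        # returns one integer label per row of data_matrix.
        model = MiniBatchKMeans(n_clusters=self._n_clusters,
                                **self._sk_kwargs)
        return model.fit_predict(data_matrix)


# Two well-separated groups of 100 items with 2 features each.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(100, 2))
data_matrix = np.vstack([group_a, group_b])

clustering = MiniBatchKMeansSketch(n_clusters=2, n_init=3, batch_size=100,
                                   random_state=0)
cluster_labels = clustering.perform_clustering(data_matrix)
print(cluster_labels.shape)  # (200,)
```

The returned array is one-dimensional with one entry per row of data_matrix, matching the (n_items, n_features) input shape described above.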