OpenEnsembles
The “data” class
- class openensembles.data(df, x)
df is a dataframe and x holds the x-axis values (or numbers indicating the entry number). Behavior: only numerical data in df will be carried into a numpy array.
- Parameters
- df: a pandas dataframe
Dataframe with objects in rows and columns representing the feature dimensions
- x: list
The x-axis elements. If x is a list of strings, it will be converted here to a list of ints (range 0 to len(x))
- Raises
- ValueError if the size of x and dimensionality of df do not match
- Attributes
- df: pandas dataframe
the original dataframe
- D: dictionary
A dictionary of numpy data matrices, accessible by ‘source_name’.
- x: list
a list of integer or float values
- x_labels: list
a list of strings (if strings were passed in) or the ints and floats, so that xticklabels can be updated or referenced
- params: dict
A dictionary of parameter labels and their values that were used during transformations
Methods
- merge(d_list): Returns an appended object – a merge of the data object (self) and all data objects inside a passed list.
- plot_data(source_name[, fig_num]): Plot the data matrix that belongs to source_name
- slice(names): Returns a new data object containing a slice indicated by the list of names given (dictionary keys shared amongst D, params, etc.).
- transform(source_name, txfm_fcn, txfm_name, …): This runs transform (txfm_fcn) on the data matrix defined by source_name with parameters that are variable for each transform.
- transforms_available(): Returns a list of all transformations available
- merge(d_list)
Returns an appended object – a merge of the data object (self) and all data objects inside a passed list.
- Parameters
- d_list: list
A list of data objects
- Returns
- transDictArr: list of dicts
A list of dictionaries that translate between the new labels in the merged object and the original labels. List order is the same as the order of the objects passed in
- Raises
- ValueError
If objects in d_list are not well formed openensembles data objects
Examples
Merge two sets of data objects. Keeps the same dataframe (self.df), i.e. use this when it makes sense to always use that reference dataframe
FINISH EXAMPLES here
- plot_data(source_name, fig_num=1, **kwargs)
Plot the data matrix that belongs to source_name
- Parameters
- source_name: string
name of data source to plot, e.g. ‘parent’
- fig_num: int
Set to a different figure number to plot on existing figure. Default fig_num=1
- **class_labels: list of ints
this is a vector that assigns points to classes, and will be used to color the points according to assigned class type
- **clusters_to_plot: list of ints
If you wish to plot a subset of cluster types (classes), pass that as a list of ints
- **title: string
Title for plot
- Raises
- ValueError:
If clusters_to_plot is not a subset of the cluster labels
- slice(names)
Returns a new data object containing a slice indicated by the list of names given (dictionary keys shared amongst D, params, etc.). ‘parent’ cannot be removed, as it is the default dataframe matrix that established the data object. To replace parent, instead instantiate a new object on a dataframe created from the transformation of interest.
- Parameters
- names: list
A list of strings matching the names to keep in the new slice
- Returns
- d: an openensembles data object
An oe.data object that contains only those names passed in
- Raises
- ValueError
If a name in the list of names does not exist in data object
Examples
Remove ‘zscore’ from the list, keeping everything else
>>> names = list(d.D.keys())  # get all the keys as a list
>>> names.remove('zscore')  # list.remove works in place and returns None
>>> dNew = d.slice(names)
- transform(source_name, txfm_fcn, txfm_name, **kwargs)
This runs a transform (txfm_fcn) on the data matrix defined by source_name, with parameters that vary for each transform. For example, oe.data.transform('parent', 'zscore', 'zscore_parent', axis=0) will run the z-score in a vector-wise manner across the matrix (column-wise), and the transformed data is then accessible as oe.data.D['zscore_parent']. Successful completion results in the addition of a new entry in the data dictionary with a key according to txfm_name.
- Parameters
- source_name: string
the name of the source data, for example ‘parent’, or ‘log2’
- txfm_fcn: string
the name of the transform function. See transforms.py or run oe.data.transforms_available() for list
- txfm_name: string
the name you want to use in the data object dictionary oe.data.D[‘name’] to access transformed data
- Other Parameters
- **Keep_NaN: boolean
Set to False (or 0) to prevent transformations that produce NaNs from being added. Default Keep_NaN=True, which adds the transformed data even if NaNs are produced.
- **Keep_Inf: boolean
Set to False (or 0) to prevent transformations that produce infinite values from being added. Default Keep_Inf=True, which adds the transformed data even if infinite values are produced.
- Raises
- ValueError
if the transform function does not exist OR if the data source does not exist by source_name
Warning
A warning is issued if NaNs or infinite values are produced
Examples
>>> import pandas as pd
>>> import openensembles as oe
>>> df = pd.read_csv(file)  # file is the path to your CSV data
>>> d = oe.data(df, df.columns)
>>> d.transform('parent', 'zscore', 'zscore')
>>> d.transform('zscore', 'PCA', 'pca', n_components=3)
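To make the column-wise (axis=0) z-score concrete, here is a minimal sketch in plain Python. It is not the OpenEnsembles implementation: the helper name zscore_columns is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Z-score each column of a list-of-rows matrix (hypothetical helper)."""
    columns = list(zip(*matrix))             # transpose: one tuple per feature
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # assumption: population std (ddof=0)
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in matrix
    ]


# Three objects (rows) measured on two features (columns).
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = zscore_columns(data)
```

After the transform, every column has mean 0 and unit standard deviation. A constant column has zero standard deviation, so this sketch would raise ZeroDivisionError for it.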
The “cluster” class
- class openensembles.cluster(dataObj)
Initialize a clustering object, which is instantiated with a data object class from OpenEnsembles. When clustering is performed, the dictionaries of all attributes are extended using the key given as output_name.
- Parameters
- dataObj
openensembles.data class – consists at least of one data matrix called ‘parent’
- Returns
- clusterObject
empty openensembles.cluster object
Examples
Load data, zscore it, transform it into the first three principal components and cluster using KMeans with K=4
>>> import pandas as pd
>>> import openensembles as oe
>>> df = pd.read_csv(file)  # file is the path to your CSV data
>>> d = oe.data(df, df.columns)
>>> d.transform('parent', 'zscore', 'zscore')
>>> d.transform('zscore', 'PCA', 'pca', n_components=3)
>>> c = oe.cluster(d)
>>> c.cluster('pca', 'kmeans', 'kmeans_pca', 4)
- Attributes
- dataObj: openensembles.data class
openensembles.data class that was used to instantiate cluster object
- labels: dict of lists
A dictionary of lists of clustering solutions (ints). Referred to as output_name in .cluster method
- data_source: dict of strings
Name of data source in dataObj
- params: dict of dicts
A dictionary of all parameters passed during clustering
- clusterNumbers: dict of lists
A listing of the unique set of cluster numbers produced in a clustering
- random_state: dict of objects
A listing of the random state objects that can be used to reset the state and
Methods
- MI([MI_type]): Calculate the mutual information between all pairs of clustering solutions
- algorithms_available(): Call this to list all algorithms currently available in algorithms.py
- cluster(source_name, algorithm, output_name): This runs clustering algorithms on the data matrix defined by source_name with parameters that are variable for each algorithm.
- clustering_algorithm_parameters(): This function returns a dictionary with keys equal to parameters of interest {K, linkage, affinity} whose entries indicate algorithms that take those as free parameters.
- co_occurrence_matrix([data_source_name]): Calculate the co-occurrence of all pairs of objects across the ensemble
- finish_co_occ_linkage(threshold[, linkage]): The finishing technique that calculates a co-occurrence matrix on all cluster solutions in the ensemble and then hierarchically clusters the co-occurrence, treating it as a similarity matrix.
- finish_graph_closure(threshold[, clique_size]): The finishing technique that treats the co-occurrence matrix as a graph, that is binarized by the threshold (>=threshold becomes an unweighted, undirected edge in an adjacency matrix).
- finish_majority_vote([threshold]): Based on Ana Fred’s 2001 paper, “Finding Consistent Clusters in Data Partitions.”
- get_cluster_members(solution_name, clusterNum): Return the dataframe row indexes of a cluster number in solution named by solution_name
- merge(c_list): Returns an appended object – a merge of the cluster object (self) and all cluster objects inside a passed list.
- mixture_model([K, iterations]): Finishing technique to assemble a final, hard partition of the data by maximizing the likelihood of the observed clustering solutions across the ensemble.
- search_field(field, value): Find solutions that were made with a given value of a field (algorithm, data source, or parameter)
- slice(names): Returns a new cluster object containing a slice indicated by the list of names given (dictionary keys shared amongst labels, params, etc.)
- MI(MI_type='standard')
Calculate the mutual information between all pairs of clustering solutions
- Parameters
- MI_type: string {‘standard’, ‘adjusted’, ‘normalized’}
The sklearn.metrics mutual information function to use: mutual_info_score, adjusted_mutual_info_score, or normalized_mutual_info_score
- Returns
- MI class
mutualinformation.MI class, where MI.matrix is the calculated matrix of pairwise mutual information. The diagonal is not guaranteed to be 1 (it depends on the type of MI calculated)
Examples
>>> MI = c.MI(MI_type='adjusted')
>>> MI.plot(sorted=True)
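As an illustration of what a pairwise mutual information matrix contains, the sketch below computes the standard (unadjusted, unnormalized) mutual information in nats between every pair of label lists. It uses only the standard library rather than sklearn, and the names mutual_information, pairwise_mi and the solution keys are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two equal-length label lists."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b)))
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def pairwise_mi(solutions):
    """Mutual information for every ordered pair of named solutions."""
    names = sorted(solutions)
    return {
        (p, q): mutual_information(solutions[p], solutions[q])
        for p in names
        for q in names
    }


solutions = {
    's1': [0, 0, 1, 1],
    's2': [1, 1, 0, 0],  # same partition as s1 with the labels swapped
    's3': [0, 1, 0, 1],  # independent of s1
}
matrix = pairwise_mi(solutions)
```

Here s1 and s2 share log(2) nats because they describe the same partition, s1 and s3 share none, and the diagonal entry for s1 equals its entropy, log(2), rather than 1.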
- algorithms_available()
Call this to list all algorithms currently available in algorithms.py
- cluster(source_name, algorithm, output_name, K=None, Require_Unique=False, random_seed=None, **kwargs)
This runs clustering algorithms on the data matrix defined by source_name, with parameters that vary for each algorithm. Note that K is required for most algorithms.
- Parameters
- source_name: string
the source data matrix name to operate on in the cluster class dataObj
- algorithm: string
name of the algorithm to use, see clustering.py or call oe.cluster.algorithms_available()
- output_name: string
this is the dict key for interacting with the results of this clustering solution in any of the cluster class dictionary attributes
- K: int
number of clusters to create (ignored for algorithms that define K during clustering). After clustering, var_params records K: either the value passed in or, if K was not passed, the number of clusters produced.
- Require_Unique: bool
If False and a solution named output_name already exists, a number is appended to create a unique name. If True and a solution by that name exists, the solution is not added and a ValueError is raised. Default Require_Unique=False
- random_seed: int or random.getstate()
Pass a random seed or random seed state (random.getstate()) in order to force the starting point of a clustering algorithm to that state. Default is None
- Raises
- ValueError
if data source is not available by source_name
Warning
This will warn if the number of clusters is different than what was requested, typically when an algorithm does not accept K as an argument.
Examples
Cluster using KMeans on parent data
>>> c = oe.cluster(d)  # d is an openensembles data object
>>> c.cluster('parent', 'kmeans', 'kmeans_parent', K=5)
Form an iteration to build an ensemble using different values for K
>>> for k in range(2, 12):
...     name = 'kmeans_' + str(k)  # k must be converted to a string first
...     c.cluster('parent', 'kmeans', name, k)
- clustering_algorithm_parameters()
This function returns a dictionary with keys equal to parameters of interest {K, linkage, affinity} whose entries indicate algorithms that take those as free parameters. For example K-means takes K as an argument, but Affinity Propagation does not, so you will find kmeans is listed in dict[‘K’], but not AffinityPropagation. This is not inclusive of all parameters of every algorithm, but the common parameters one might want to vary.
- Returns
- a: dictionary
Keys equal to parameters {K, linkages, distances} and values as lists of algorithms that use that key as a variable
- co_occurrence_matrix(data_source_name='parent')
Calculate the co-occurrence of all pairs of objects across the ensemble
- Parameters
- data_source_name: string
Name of the data source to link to co-occurrence object. Default is ‘parent’
- Returns
- coMat class
coMat.co_matrix is the NxN matrix whose entries indicate the number of times the pair of objects in position (i,j) co-cluster across the ensemble of clustering solutions available in the clustering object.
Examples
>>> coMat = c.co_occurrence_matrix()
>>> coMat.plot()
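The co-occurrence count described above can be sketched in a few lines of plain Python. This is an illustration only, not the library's code: co_occurrence_counts and the solution names are hypothetical, and dividing by the number of solutions to get a fraction is an assumption about how a threshold such as 0.5 would be applied.

```python
def co_occurrence_counts(solutions):
    """Count, for every pair (i, j), how many solutions cluster i and j together."""
    label_lists = list(solutions.values())
    n = len(label_lists[0])
    counts = [[0] * n for _ in range(n)]
    for labels in label_lists:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return counts


# Hypothetical ensemble: three solutions over the same four objects.
ensemble = {
    'kmeans_2': [0, 0, 1, 1],
    'kmeans_3': [0, 0, 1, 2],
    'agglom_2': [0, 1, 1, 1],
}
counts = co_occurrence_counts(ensemble)
# Assumption: normalize by ensemble size to express entries as fractions.
fractions = [[c / len(ensemble) for c in row] for row in counts]
```

Objects 0 and 1 share a cluster in two of the three solutions, so counts[0][1] is 2; the diagonal always equals the number of solutions.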
- finish_co_occ_linkage(threshold, linkage='average')
The finishing technique that calculates a co-occurrence matrix on all cluster solutions in the ensemble and then hierarchically clusters the co-occurrence, treating it as a similarity matrix. The clusters are defined by the threshold of the distance used to cut.
- Parameters
- threshold: float
Linkage distance to use as a cutoff to create partitions
- linkage: string
Linkage type. See scipy.cluster.hierarchy
- Returns
- c: openensembles clustering object
a new clustering object with c.labels[‘co_occ_linkage’] set to the final solution.
Examples
To determine where the cut is visually, at threshold=0.5:
>>> coMat = c.co_occurrence_matrix()
>>> coMat.plot(threshold=0.5, linkage='ward')
To create the cut at threshold=0.5
>>> cWard = c.finish_co_occ_linkage(0.5, 'ward')
>>> d.plot_data('parent', cluster_labels=cWard.labels['co_occ_linkage'])
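The idea of cutting a hierarchical clustering of the co-occurrence matrix at a distance threshold can be sketched as follows. This is a naive average-linkage illustration in plain Python, not the scipy-based routine the library refers to; the function name average_linkage_cut, the example matrix, and the conversion of similarity to distance as 1 - similarity are all assumptions.

```python
def average_linkage_cut(distance, threshold):
    """Merge clusters by average linkage until the closest pair exceeds threshold."""
    clusters = [[i] for i in range(len(distance))]

    def avg_dist(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, p, q = min(
            (avg_dist(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if d > threshold:
            break  # every remaining merge would cross the cut
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(distance)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical co-occurrence fractions for four objects.
similarity = [
    [1.00, 0.75, 0.00, 0.00],
    [0.75, 1.00, 0.25, 0.25],
    [0.00, 0.25, 1.00, 0.75],
    [0.00, 0.25, 0.75, 1.00],
]
# Assumption: distance = 1 - similarity.
distance = [[1.0 - s for s in row] for row in similarity]
labels_tight = average_linkage_cut(distance, 0.5)
labels_loose = average_linkage_cut(distance, 0.9)
```

A lower threshold keeps the two tight pairs apart, while a higher one lets them merge into a single cluster, which is why inspecting the cut visually first can be useful.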
- finish_graph_closure(threshold, clique_size=3)
The finishing technique that treats the co-occurrence matrix as a graph, that is binarized by the threshold (>=threshold becomes an unweighted, undirected edge in an adjacency matrix). This graph object is then subjected to clique formation according to clique_size (such as triangles if clique_size=3). The cliques are then combined in the graph to create unique cluster formations.
- Returns
- c: openensembles clustering object
New cluster object with final solution and name ‘graph_closure’
See also
finishing.py
Examples
>>> cGraph = c.finish_graph_closure(0.5, 3)
>>> d.plot_data('parent', cluster_labels=cGraph.labels['graph_closure'])
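A rough sketch of the binarize-then-find-cliques idea is given below. How the library actually combines cliques is not specified here, so two choices in this sketch are assumptions: cliques that share any object are merged into one cluster, and objects that belong to no clique remain singletons. The name graph_closure_labels and the example matrix are hypothetical.

```python
from itertools import combinations


def graph_closure_labels(similarity, threshold, clique_size=3):
    """Cluster labels from cliques of a thresholded similarity graph (sketch)."""
    n = len(similarity)
    # Binarize: entries >= threshold become unweighted, undirected edges.
    adjacent = [
        [i != j and similarity[i][j] >= threshold for j in range(n)]
        for i in range(n)
    ]
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Any fully connected group of clique_size objects is a clique; join its members.
    for group in combinations(range(n), clique_size):
        if all(adjacent[a][b] for a, b in combinations(group, 2)):
            for member in group[1:]:
                parent[find(member)] = find(group[0])

    # Relabel the roots as consecutive cluster numbers.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical co-occurrence fractions: objects 0, 1, 2 form a triangle.
similarity = [
    [1.0, 0.8, 0.9, 0.1, 0.0],
    [0.8, 1.0, 0.7, 0.2, 0.1],
    [0.9, 0.7, 1.0, 0.6, 0.1],
    [0.1, 0.2, 0.6, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]
labels = graph_closure_labels(similarity, 0.5, clique_size=3)
```

Objects 3 and 4 are connected by an edge but are not part of any triangle, so under these assumptions each ends up in its own cluster.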
- finish_majority_vote(threshold=0.5)
Based on Ana Fred’s 2001 paper: Fred, Ana. Finding Consistent Clusters in Data Partitions. In Multiple Classifier Systems, edited by Josef Kittler and Fabio Roli, LNCS 2096, 309-18. Springer, 2001. This algorithm assigns clusters to the same class if they co-cluster at least 50% of the time. It greedily joins clusters when there is evidence that at least one pair of items from two different clusters co-clusters a majority of the time. Outliers will get their own cluster.
- Parameters
- threshold: float
the threshold, or fraction of times objects co-cluster to consider a ‘majority’. Default is 0.5 (50% of the time)
- Returns
- c: openensembles cluster object
New cluster object with final solution and name ‘majority_vote’
Examples
>>> c_MV = c.finish_majority_vote(threshold=0.7)
>>> labels = c_MV.labels['majority_vote']
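The greedy joining described above might look roughly like the following. This is a simplified sketch under stated assumptions, not the library's code: majority_vote_labels and the solution names are hypothetical, and a pair is treated as a majority when its co-clustering fraction is at least the threshold.

```python
def majority_vote_labels(solutions, threshold=0.5):
    """Join objects whose co-clustering fraction reaches threshold (sketch)."""
    label_lists = list(solutions.values())
    n = len(label_lists[0])
    cluster_of = list(range(n))  # start with every object in its own cluster
    for i in range(n):
        for j in range(i + 1, n):
            together = sum(labels[i] == labels[j] for labels in label_lists)
            fraction = together / len(label_lists)
            if fraction >= threshold and cluster_of[i] != cluster_of[j]:
                # Greedily merge the whole cluster of j into the cluster of i.
                old, new = cluster_of[j], cluster_of[i]
                cluster_of = [new if c == old else c for c in cluster_of]
    # Renumber clusters consecutively, in order of first appearance.
    renumber = {}
    return [renumber.setdefault(c, len(renumber)) for c in cluster_of]


# Hypothetical ensemble: three solutions over five objects.
ensemble = {
    'a': [0, 0, 1, 1, 2],
    'b': [0, 0, 0, 1, 1],
    'c': [1, 1, 0, 0, 2],
}
final = majority_vote_labels(ensemble, threshold=0.5)
```

Objects 0 and 1 co-cluster in every solution and objects 2 and 3 in two of three, so each pair is joined; object 4 never reaches the threshold with anything and is left as an outlier in its own cluster.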
- get_cluster_members(solution_name, clusterNum)
Return the dataframe row indexes of a cluster number in solution named by solution_name
- Parameters
- solution_name: string
the name of the clustering solution of interest
- clusterNum: int
The cluster number of interest
- Returns
- indexes: list
a list of indexes of objects with clusterNum in solution_name
Examples
Get a list of objects that belong to each cluster type in a solution
>>> name = 'zscore_agglomerative_ward'
>>> c.cluster('zscore', 'agglomerative', name, K=4, linkage='ward')
>>> labels = {}
>>> for i in c.clusterNumbers[name]:
...     labels[i] = c.get_cluster_members(name, i)
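Conceptually, looking up cluster members is a filter over the label list. The sketch below shows the idea without the library; cluster_members, row_index and the example row names are hypothetical stand-ins for the dataframe index and a stored solution.

```python
def cluster_members(row_index, labels, cluster_num):
    """Return the row indexes whose label equals cluster_num."""
    return [idx for idx, label in zip(row_index, labels) if label == cluster_num]


# Hypothetical dataframe row index and one clustering solution.
row_index = ['geneA', 'geneB', 'geneC', 'geneD']
labels = [0, 1, 0, 1]

# Build the per-cluster membership dictionary, one entry per cluster number.
members = {k: cluster_members(row_index, labels, k) for k in sorted(set(labels))}
```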
- merge(c_list)
Returns an appended object – a merge of the cluster object (self) and all cluster objects inside a passed list. This will keep the parent data object of the self cluster object. This assumes that the objects were instantiated and clustered on the same data source (at least the same mxn features)
- Parameters
- c_list: list
A list of cluster objects
- Returns
- transDictArr: list of dicts
A list of dictionaries that translate between the new labels in the merged object and the original labels. List order is the same as the order of the objects passed in
- Raises
- ValueError
If objects in c_list are not well formed cluster objects
Examples
Merge two sets of clustering objects
FINISH EXAMPLES here
- mixture_model(K=2, iterations=10)
Finishing technique to assemble a final, hard partition of the data by maximizing the likelihood of the observed clustering solutions across the ensemble. This operates on the entire ensemble of clustering solutions contained in the cluster class (self) to create a mixture model. See finishing.mixture_model for more details.
- Parameters
- K: int
number of clusters to create. Default K=2
- iterations: int
number of iterations of EM algorithm to perform. Default iterations=10
- Returns
- c: openensembles clustering object
a new clustering object with c.labels[‘mixture_model’] set to the final solution.
- Raises
- ValueError:
If there are not at least two clustering solutions
References
Topchy, Jain, and Punch, “A mixture model for clustering ensembles,” Proc. SIAM Int. Conf. Data Mining (2004)
Examples
>>> cMM = c.mixture_model(4, 10)
>>> d.plot_data('parent', cluster_labels=cMM.labels['mixture_model'])
- search_field(field, value)
Find solutions that were made with the given value for the given field.
- Parameters
- field: string {‘algorithm’, ‘data_source’, ‘K’, ‘linkage’, ‘distance’, ‘clusterNumber’, etc.}
The name of field, either in algorithm used, data_source selected, or a parameter passed to search for an exact value
- value: string or int
The value to search for (where ints are passed for K (desired clusters) or clusterNumber (actual returned clusters))
- Returns
- names: list of strings
The names of dictionary entries in clustering solutions matching field, value criteria. Returns empty list if nothing was found
- Raises
- ValueError
If the field was not recognized.
Examples
Find all clustering solutions where K=2 was used
>>> names = c.search_field('K', 2)
Find all clustering solutions where the actual number of clusters was 2
>>> names = c.search_field('clusterNumber', 2)
Find all solutions clustered using kmeans
>>> names = c.search_field('algorithm', 'kmeans')
Find all clustering solutions where ward linkage was used
>>> names = c.search_field('linkage', 'ward')
- slice(names)
Returns a new cluster object containing a slice indicated by the list of names given (dictionary keys shared amongst labels, params, etc.)
- Parameters
- names: list
A list of strings matching the names to keep in the new slice
- Returns
- c: an openensembles clustering object
An oe.cluster object that contains only those names passed in
- Raises
- ValueError
If a name in the list of names does not exist in cluster object
Examples
Get only the solutions made by agglomerative clustering
>>> names = c.search_field('algorithm', 'agglomerative')  # return all solutions with agglomerative
>>> cNew = c.slice(names)
Get only the solutions that were made with K=2 calls
>>> names = c.search_field('K', 2)  # return all solution names that used K=2
>>> cNew = c.slice(names)
The “validation” class
- class openensembles.validation(dataObj, cObj)
validation is a class to calculate any number of validation metrics on clustering solutions in data. An individual validation metric must be called on a particular instantiation of the data matrix (like ‘parent’ or ‘zscore’) and a specific solution in cObj.
Methods
- calculate(validation_name, cluster_name, source_name='parent')
Calls the function named by validation_name on the data matrix set by source_name (default ‘parent’) and the clustering solution named by cluster_name. Appends the result to validation with a key equal to validation_name+source_name+cluster_name.
- Returns
- output_name: str
The handle name for accessing validation results
- merge(v_list)
Returns an appended object – a merge of the validation object (self) and all validation objects inside a passed list. This will keep the data object and cluster objects of the self validation object.
- Parameters
- v_list: list
A list of validation objects
- Returns
- transDictArr: list of dicts
A list of dictionaries that translate between the new labels in the merged object and the original labels. List order is the same as the order of the objects passed in
- Raises
- ValueError
If objects in v_list are not well formed validation objects
Examples
Merge two sets of validation objects
FINISH EXAMPLES here
-