deDANSy

This is the deDANSy class, which extends the DANSy analysis to differential expression analysis results.

class dansy.DEdansy(dataset, uniprot_ref=None, n=10, id_conv=None, conv_cols='Gene stable ID', data_ids='gene_id', penalty='dynamic', run_conversion=True, **kwargs)[source]

A container class of multiple Domain n-gram networks related to a differentially expressed dataset that was generated using the DESeq analysis pipeline. This provides methods to analyze and contrast pairs of domain architecture subnetworks to understand changes in functional molecular ecosystems available to different conditions.

Parameters:
dataset: pandas DataFrame

The expression dataset that contains the proteins of interest and expression values to designate differentially expressed genes/proteins

uniprot_ref: pandas DataFrame (Optional)

The base reference file that contains all the proteins of interest within the dataset. Note: Providing this is recommended when multiple DEdansy objects that share a common set of proteins are being instantiated.

n: int (Optional)

Length of n-grams to extract. Default is 10

id_conv: pandas DataFrame (Optional)

A dataframe with a column of UniProt IDs and a column of the IDs used in the expression dataset, used to convert between the two.

conv_cols: str (Optional)

The name of the column in the id_conv dataframe whose IDs match those in the dataset. Assumes the naming convention generated by pybiomart.

data_ids: str (Optional)

The name of the column in the dataset dataframe for the IDs to be converted to UniProt IDs

penalty: ‘dynamic’ or int (Optional)

The penalty used during network separation calculations: either an integer value or ‘dynamic’.

run_conversion: bool

Whether the dataset IDs have to be converted using a provided id_conv dataframe

kwargs
Additional keyword arguments for generating the DANSy network. See dansy.set_ngram_parameters for details; the acceptable keys are reproduced below:
  • ‘min_arch’

  • ‘max_node_len’

  • ‘collapse’

  • ‘readable_flag’

  • ‘verbose’

Attributes:
—————–
At Initialization
—————–
dataset: pandas DataFrame

The expression dataset for DANSy analysis

ref: pandas DataFrame

The reference file information for the proteins within the dataset

n: int

The maximum length of n-grams being extracted

interproIDs: list

A list of all protein domain InterPro IDs that were found within the dataset

protsOI: list

The UniProt IDs for the proteins found within the dataset

ngrams: list

The extracted domain n-grams

collapsed_ngrams: list

The domain n-grams that were collapsed into other n-grams representing the same set of proteins

G: networkx Graph

The network graph representation of the DANSy n-gram network

adj: pandas DataFrame

The adjacency matrix for the n-gram network for the DANSy analysis

interpro2uniprot: dict

Maps each InterPro ID (key) to the list of UniProt IDs that have that InterPro ID (value)

id_conversion_dict: dict

A dictionary containing the conversion of the provided gene/protein ID to a UniProt ID. If UniProt IDs were provided, returns a dict of UniProt to UniProt IDs.

data_id_cols: str

The ID column used in the dataset for conversion

network_params: dict

Key-value pairs of acceptable networkx drawing parameters

min_arch: int (Default: 1)

The minimum number of domain architectures for an n-gram to be retained.

max_node_len: int

The maximum n-gram length that will be retained during the collapsing step to represent n-grams sharing the same set of proteins. This will not be larger than n (Default of 10).

collapse: bool

Whether the n-grams were collapsed

readable_flag: bool

Whether the n-grams are human-legible

verbose: bool

Whether progress statements are to be printed during calculations
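To make the n and ngrams attributes concrete, here is a minimal sketch of extracting contiguous domain n-grams from one domain architecture. The function name, the toy architecture, and the use of `|` as the joining character (borrowed from the `EGF-like domain|EGF-like domain` example later in this page) are assumptions for illustration; this is not the package's own extraction code and it ignores min_arch and the collapsing step.

```python
def extract_domain_ngrams(architecture, n=10):
    """Return every contiguous domain n-gram of length 1..n as a '|'-joined string.

    `architecture` is an ordered list of domain names for one protein
    (hypothetical input format, chosen for illustration).
    """
    ngrams = []
    max_len = min(n, len(architecture))
    for length in range(1, max_len + 1):
        for start in range(len(architecture) - length + 1):
            ngrams.append("|".join(architecture[start:start + length]))
    return ngrams


# Toy architecture with three domains, limited to n-grams of length <= 2.
example = extract_domain_ngrams(["SH2", "SH3", "Kinase"], n=2)
print(example)
```

With n larger than the architecture length, the longest n-gram is simply the full architecture.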

————————————–
After Establishing Comparison MetaData
————————————–
comp_metadata: pandas DataFrame

The metadata information for comparisons of interest in the dataset

———————-
After DEG Calculations
———————-
up_DEGs/down_DEGs: list

List of UniProt IDs that were designated as up- or down-regulated for a specific condition

up_ngrams/down_ngrams: list

The n-grams for either up- or down-regulated proteins/genes

alpha: float

The p-value cutoff for designating DEGs

fcthres: float

The fold-change cutoff for designating DEGs

pval_data_col: str

The column name in the dataset used for the p-value data for DEGs

fc_data_col: str

The column name in the dataset used for fold-change data for DEGs
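The sketch below illustrates how a p-value cutoff (alpha) and a fold-change cutoff (fcthres) could split entries into up and down lists. It assumes log2-scaled fold changes, a symmetric threshold, and a strict p-value inequality; the function name, the `uniprot_id` key, and the toy rows are hypothetical, and the package's actual rule may differ.

```python
def split_degs(rows, fc_col, pval_col, alpha=0.05, fcthres=1.0):
    """Split rows into up- and down-regulated ID lists (illustrative rule only).

    Assumes fold changes are log2-scaled, so the cutoff is applied symmetrically.
    """
    up, down = [], []
    for row in rows:
        if row[pval_col] >= alpha:
            continue  # not significant, so neither up nor down
        if row[fc_col] >= fcthres:
            up.append(row["uniprot_id"])
        elif row[fc_col] <= -fcthres:
            down.append(row["uniprot_id"])
    return up, down


# Made-up values, for illustration only.
toy_rows = [
    {"uniprot_id": "PROT_A", "log2FC": 2.3, "padj": 0.001},
    {"uniprot_id": "PROT_B", "log2FC": -1.8, "padj": 0.010},
    {"uniprot_id": "PROT_C", "log2FC": 3.0, "padj": 0.200},
    {"uniprot_id": "PROT_D", "log2FC": 0.4, "padj": 0.001},
]
up_ids, down_ids = split_degs(toy_rows, "log2FC", "padj")
print(up_ids, down_ids)
```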

—————————–
After creating deDANSy scores
—————————–
scores: pandas DataFrame

The scores, p-values, and FPR values for the deDANSy Separation and Distinction scores for the comparisons

raw_dists: pandas DataFrame

The raw distribution of values for each score

fpr_dists: pandas DataFrame

The raw distribution of FPR values for each score

Methods

DEG_network_sep([force_run])

Computes the network separation between the two conditions defined by the designated differentially expressed genes.

calc_DEG_ngrams(comp[, batch_mode])

Defines the DEG UniProt IDs and the associated n-grams for the dataset of interest.

calculate_ngram_enrichment(comparison[, ...])

Calculates the n-gram enrichment for the comparison of interest to find the most significant n-grams.

calculate_scores(comps[, fpr_trials, ...])

This calculates the separation and distinct functional neighborhood scores for a deDANSy instance.

create_contrast_metadata(comparisons, ...[, ...])

This creates the metadata for the deDANSy object that contains information for each comparison of interest and the cutoffs to define differentially expressed genes/proteins (DEGs).

deg_summary([detailed])

This provides a summary of the DEG information that has been used within this dataset.

get_ngram_results(comparison)

Returns the n-gram enrichment results of the comparison of interest.

plot_DEG_ns([pos, deg_labels, large_cc_mode])

Displays a network graph of the n-grams that are associated with, or shared between, the DEG conditions, using the defined differentially expressed genes.

plot_ngram_enrichment(comparison[, p, q, ...])

Creates a bubble plot of the enriched n-grams for up and down-regulated n-grams for the comparison of interest.

plot_scores([show_FPR, aspect, order])

This creates the bubble plots for the separation and distinction scores.
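As a rough orientation, the methods above might be chained as in the sketch below. It uses only the signatures documented on this page, but the call order is inferred from the attribute groupings ("After Establishing Comparison MetaData", "After creating deDANSy scores"), and the helper name `run_dedansy_workflow` and its arguments are hypothetical. Whether create_contrast_metadata stores its result on the object, and whether calculate_scores designates DEGs on its own, are assumptions that should be checked against the package.

```python
def run_dedansy_workflow(dataset, id_conv, comparisons, fc_stem, pval_stem, delim="_"):
    """Hypothetical helper chaining the documented DEdansy calls.

    dataset and id_conv are pandas DataFrames as described in the class parameters.
    """
    # Imported lazily so this sketch can be defined without dansy installed.
    import dansy

    dd = dansy.DEdansy(dataset, id_conv=id_conv, data_ids="gene_id", n=10, penalty="dynamic")

    # Assumption: this call also stores the metadata on the object (comp_metadata).
    dd.create_contrast_metadata(
        comparisons, fc_stem, pval_stem, fcs=1, alphas=0.05, delim=delim
    )

    # Assumption: calculate_scores derives the DEGs it needs from the metadata.
    dd.calculate_scores(comparisons, fpr_trials=50, num_ss_trials=100, processes=1, seed=123)

    # plot_scores is documented to return the matplotlib Axes of the resulting plot.
    ax = dd.plot_scores(show_FPR=True)
    return dd, ax
```

Fixing the seed keeps the subsampling and FPR trials reproducible between runs.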

DEG_network_sep(force_run=False)[source]

Computes the network separation between the two conditions defined by the designated differentially expressed genes. Silently yields NaN if the calculation fails.

calc_DEG_ngrams(comp, batch_mode=False)[source]

Defines the DEG UniProt IDs and the associated n-grams for the dataset of interest.

calculate_ngram_enrichment(comparison, fpr_trials=100, seed=None)[source]

Calculates the n-gram enrichment for the comparison of interest to find the most significant n-grams.

calculate_scores(comps, fpr_trials=50, min_pval=-10, num_ss_trials=100, processes=1, seed=None, verbose=True, overwrite=False)[source]

This calculates the separation and distinct functional neighborhood scores for a deDANSy instance. It creates new attributes for the deDANSy instance containing all the information of interest. If scores for a condition have been generated, this will raise a warning and overwrite existing scores.

Parameters:
dedansy: deDANSy object

The base deDANSy object containing all n-grams and expression data

comps: list or str

The conditions that will be compared

fpr_trials: int (Optional)

Number of FPR trials to perform.

min_pval: int (Optional)

The log10 transform of the minimum p-value to use for the p-value pruning sweep step.

num_ss_trials: int (Optional)

The number of subsampled (and random) networks used to build distributions for comparing the network separation and IQR

processes: int (Optional)

Number of processes to use if multiprocessing is desired. (Recommended having 4-8 when feasible)

seed: int

Seed for random numbers. If not provided will use system time

Returns:
The following attributes are added or adjusted:
scores: pandas DataFrame

All scores, p-values, and FPR values for each condition

raw_dists: pandas DataFrame

For each condition the raw values that made up the distributions for each score

fpr_dists: pandas DataFrame

For each condition the p-values from the random FPR trials for calculating the FPR

create_contrast_metadata(comparisons, fc_cols, pval_cols, fcs=1, alphas=0.05, delim=None)[source]

This creates the metadata for the deDANSy object that contains information for each comparison of interest and the cutoffs to define differentially expressed genes/proteins (DEGs). The same number of columns must be provided for both the fold-change and p-value columns.

Parameters:
comparisons: list

All comparisons that will be used that match what is provided in the deDANSy dataset

fc_cols: str or list

Either a stem or list of columns used for the foldchanges to define DEGs (Use the delim parameter and a stem if reoccurring patterns are used where the comparison is at the end.)

pval_cols: str or list

Either a stem or list of columns used for the p-values to define DEGs. (Use the delim parameter and a stem if reoccurring patterns are used where the comparison is at the end.)

fcs: float or list

The fold-change cutoff. Can either be a single value or a list of values of the same size as comparisons

alphas: float or list

The p-value cutoff. Can either be a single value or a list of values of the same size as comparisons

delim: str (Optional)

Delimiter used in column names separating the stem of the column from the comparison. If none then will take the

Returns:
comp_meta: pandas DataFrame

DataFrame where each row is a single comparison
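The stem-plus-delimiter convention described for fc_cols and pval_cols can be pictured as follows. This is a sketch of the naming pattern only, assuming the column name is the stem, then the delimiter, then the comparison at the end; the helper name and the example stems and comparison labels are hypothetical.

```python
def build_column_names(stem, comparisons, delim="_"):
    """Build one column name per comparison as stem + delim + comparison."""
    return [f"{stem}{delim}{comp}" for comp in comparisons]


comparisons = ["TreatedA", "TreatedB"]  # hypothetical comparison labels
fc_columns = build_column_names("log2FoldChange", comparisons)
pval_columns = build_column_names("padj", comparisons)

# Both lists must have the same length, one entry per comparison.
print(fc_columns)
print(pval_columns)
```

If the dataset's columns do not follow a shared pattern, explicit lists of column names can be passed instead of stems.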

deg_summary(detailed=False)[source]

This provides a summary of the DEG information that has been used within this dataset.

get_ngram_results(comparison)[source]

Returns the n-gram enrichment results of the comparison of interest.

plot_DEG_ns(pos=[], deg_labels=[], large_cc_mode=False)[source]

Displays a network graph of the n-grams that are associated with, or shared between, the DEG conditions, using the defined differentially expressed genes.

plot_ngram_enrichment(comparison, p=0.05, q=0.05, show_FPR=True, **kwargs)[source]

Creates a bubble plot of the enriched n-grams for up- and down-regulated n-grams for the comparison of interest. The resulting plot can be partially customized through additional keyword parameters. This function is a wrapper for the plot_enriched_ngrams function from the enrichment_plotting_helpers module, but specific to the deDANSy object instance.

plot_scores(show_FPR=True, aspect=0.9, order=None)[source]

This creates the bubble plots for the separation and distinction scores. This is a wrapper function for the base plotting function found in the enrichment_plotting_helpers.

Parameters:
show_FPR: bool (Optional)

Whether the FPR legend handles should be displayed.

aspect: float (Optional)

The aspect ratio for each score plot.

order: dict (Optional)

Key-value pairs for each comparison and what order they should be displayed on the axis.

Returns:
ax: matplotlib Axes

The axes of the resulting plot

deDANSy Supporting Functions

These functions are integrated into deDANSy. They do not need to be called directly, but can aid in the analysis of a deDANSy object or of graphs from multiple (de)DANSy objects.

dansy.network_separation_helpers.build_network_reference_dict(ref_ngram_net, penalty=None)[source]

This builds the dict that contains the reference network information necessary for calculating the network separation value from a provided DomainNgramNetwork object.

It is recommended to use this function to generate the reference dictionary prior to calculating network separation, especially when using a dynamic penalty and comparing several networks, to improve the execution speed of the calculation.

dansy.network_separation_helpers.network_separation(G_in, H_in, ref_G_data, mmd_verbose=False, force_run=False, verbose=True)[source]

Calculates the network separation between two networks of interest that lie on a common larger, reference network.
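For intuition, the sketch below implements one widely used network-medicine definition of separation between two node sets A and B on a shared reference graph: the mean nearest distance between the sets minus the average of the mean nearest distances within each set. Whether dansy's network_separation matches this term for term is an assumption, and the `penalty` argument here is only a hypothetical stand-in for how nodes with no reachable partner are scored. The adjacency-dict input format and all function names are likewise illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def summed_nearest_distance(adj, sources, targets, exclude_self, penalty):
    """Sum, over sources, of the distance to the nearest target (penalty if none)."""
    total = 0.0
    for s in sources:
        dist = bfs_distances(adj, s)
        candidates = [
            dist[t] for t in targets if t in dist and not (exclude_self and t == s)
        ]
        total += min(candidates) if candidates else penalty
    return total


def network_separation_sketch(adj, A, B, penalty):
    """Return d_AB - (d_AA + d_BB) / 2 for node sets A and B."""
    d_aa = summed_nearest_distance(adj, A, A, True, penalty) / len(A)
    d_bb = summed_nearest_distance(adj, B, B, True, penalty) / len(B)
    between = summed_nearest_distance(adj, A, B, False, penalty)
    between += summed_nearest_distance(adj, B, A, False, penalty)
    d_ab = between / (len(A) + len(B))
    return d_ab - (d_aa + d_bb) / 2


# Toy reference network: a simple path 1-2-3-4-5-6.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
separated = network_separation_sketch(path, [1, 2], [5, 6], penalty=10)
overlapping = network_separation_sketch(path, [2, 3], [3, 4], penalty=10)
print(separated, overlapping)
```

Under this definition, positive values indicate node sets that sit in distinct neighborhoods of the reference network, while negative values indicate overlap.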

dansy.enrichment_plotting_helpers.collapse_to_max_info(ngram_list, res_df)[source]

This collapses the n-grams to those that represent the most discriminating information of interest. This will take longer n-grams and collapse them into shorter ones if the trends of p-values are similar, but the longer n-grams are slightly less significant. If a longer n-gram is more significant it will not be collapsed.
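A simplified picture of this collapsing rule is sketched below. It drops a longer n-gram whenever a shorter n-gram it contains is at least as significant, and keeps it when the longer n-gram is more significant. This ignores the "similar trend" and "slightly less significant" qualifiers, so it is a deliberately coarse approximation; the function names, the `|`-joined string format, and the toy p-values are assumptions.

```python
def contains_ngram(longer, shorter):
    """True if `shorter` occurs as a contiguous run of domains inside `longer`."""
    long_parts, short_parts = longer.split("|"), shorter.split("|")
    k = len(short_parts)
    if k >= len(long_parts):
        return False
    return any(
        long_parts[i:i + k] == short_parts for i in range(len(long_parts) - k + 1)
    )


def collapse_sketch(pvals):
    """Keep n-grams not dominated by a contained, at-least-as-significant shorter n-gram."""
    kept = []
    for ngram, p in pvals.items():
        dominated = any(
            contains_ngram(ngram, other) and p_other <= p
            for other, p_other in pvals.items()
        )
        if not dominated:
            kept.append(ngram)
    return kept


# Made-up p-values, for illustration only.
toy_pvals = {"EGF": 0.001, "EGF|EGF": 0.002, "EGF|Kinase": 0.0001, "Kinase": 0.01}
kept_ngrams = collapse_sketch(toy_pvals)
print(kept_ngrams)
```

Here `EGF|EGF` is folded into `EGF`, while `EGF|Kinase` survives because it is more significant than either of its parts.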

dansy.enrichment_plotting_helpers.gather_enrichment_results(hyper_values, fpr_values)[source]

This gathers both the statistical results from the enrichment analysis and the FPR calculations to return a complete results dataframe.

dansy.enrichment_plotting_helpers.get_max_info_enriched_ngrams(res_df, condition_labels=None, q=None, p=None)[source]

Returns the top X values of enriched n-grams that pass a quantile and/or p-value cutoff. This will collapse n-grams that provide similar p-value trends into a single representative n-gram if a shorter n-gram is in the longer n-gram. Longer n-grams that have more significant p-values than their shorter counterparts will be retained.

dansy.enrichment_plotting_helpers.plot_enriched_ngrams(res, dansyOI, condition_labels=None, q=0.05, p=None, show_FPR=True, **kwargs)[source]

This plots the top X percent (default 5) n-grams enriched between two different conditions. For clarity n-grams that contain similar information will be collapsed into the shorter n-gram (i.e. if EGF-like domain and EGF-like domain|EGF-like domain both have similar enrichment values they will only be represented by EGF-like domain).

dansy.enrichment_plotting_helpers.plot_functional_scores(res, show_FPR_handle=True, aspect=0.9, order=None)[source]

This creates the bubble plots for both the separation and distinction scores calculated by deDANSy.

dansy.enrichment_helpers.calculate_fpr(actual_res, random_res)[source]

Calculates the false positive rate (FPR) for each result of interest.
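One plausible reading of an empirical false positive rate is the fraction of random trials whose p-value is at least as small as the one actually observed. The sketch below follows that reading; it is an assumption about the general idea, not a description of what calculate_fpr does internally, and the function name and values are made up.

```python
def fpr_sketch(actual_pval, random_pvals):
    """Fraction of random-trial p-values that are <= the observed p-value."""
    hits = sum(1 for p in random_pvals if p <= actual_pval)
    return hits / len(random_pvals)


# Made-up p-values from four hypothetical random trials.
example_fpr = fpr_sketch(0.01, [0.5, 0.2, 0.005, 0.03])
print(example_fpr)
```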

dansy.enrichment_helpers.calculate_separation_stability(dedansy, num_trials=50, pval_sweep=array([1.00000000e+00, 3.16227766e-01, 1.00000000e-01, 3.16227766e-02, 1.00000000e-02, 3.16227766e-03, 1.00000000e-03, 3.16227766e-04, 1.00000000e-04, 3.16227766e-05, 1.00000000e-05, 3.16227766e-06, 1.00000000e-06, 3.16227766e-07, 1.00000000e-07, 3.16227766e-08, 1.00000000e-08, 3.16227766e-09, 1.00000000e-09, 3.16227766e-10, 1.00000000e-10]), return_distributions=False, processes=1, verbose=True)[source]

Generates the distribution of network separation and interquartile range of network separation values during pruning for generating the different scores for deDANSy analysis. This is the key function of the algorithm that gathers all the results.
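The default pval_sweep shown in the signature is a log-spaced grid from 1 down to 1e-10 with two points per decade, which is the same grid `numpy.logspace(0, -10, 21)` produces. The pure-Python sketch below rebuilds it; tying the lower bound to a `min_pval` of -10 mirrors the calculate_scores default, though how that method constructs its sweep internally is an assumption.

```python
min_pval = -10          # log10 of the smallest p-value in the sweep
steps_per_decade = 2    # gives the half-decade spacing seen in the default

n_points = abs(min_pval) * steps_per_decade + 1
pval_sweep = [10 ** (-k / steps_per_decade) for k in range(n_points)]

print(len(pval_sweep))
print(pval_sweep[:3])
```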

dansy.enrichment_helpers.cohen_d(a, b)[source]

Calculates the Cohen’s d effect size of 2 lists of values, where the second is considered the “control” group.
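For reference, a textbook Cohen's d with a pooled standard deviation can be written as below, with the second argument treated as the control group so that a positive value means the first group is higher. The pooling choice is an assumption, since this page does not state which variant the package uses, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohen_d_sketch(a, b):
    """Cohen's d of a versus control b, using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


effect = cohen_d_sketch([2, 4, 6], [1, 3, 5])
print(effect)
```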

dansy.enrichment_helpers.individual_trial_calc(dedansy, arch_weights, comp_arch_dist, ratio, sweep, originals, dist_flag=False, seed=123)[source]

Calculates the network separation and pruning analysis for the subsampled and randomly chosen genes.

dansy.enrichment_helpers.retrieve_fpr_checks(dedansy, num_DEGs, fpr_trials=50, num_internal_trials=50, deg_ratios=0.7, processes=1, seed=123)[source]

Performs a false positive rate check on the values for the differentially expressed genes by performing the same subsampling and random gene selection analysis with random genes that were measured, to determine whether findings are true positives. This process can be highly time-consuming, depending on the number of FPR trials to be conducted.