deDANSy
This is the deDANSy class, which extends the DANSy analysis to differential expression analysis results.
- class dansy.DEdansy(dataset, uniprot_ref=None, n=10, id_conv=None, conv_cols='Gene stable ID', data_ids='gene_id', penalty='dynamic', run_conversion=True, **kwargs)[source]
A container class for multiple domain n-gram networks related to a differential expression dataset generated using the DESeq analysis pipeline. It provides methods to analyze and contrast pairs of domain architecture subnetworks to understand changes in the functional molecular ecosystems available to different conditions.
- Parameters:
- dataset: pandas DataFrame
The expression dataset that contains the proteins of interest and expression values to designate differentially expressed genes/proteins
- uniprot_ref: pandas DataFrame (Optional)
The base reference file that contains all the proteins of interest within the dataset. Note: Providing this is recommended when multiple DEdansy objects that share a common set of proteins are instantiated.
- n: int (Optional)
Length of n-grams to extract. Default is 10
- id_conv: pandas DataFrame (Optional)
A dataframe with a column of UniProt IDs and a column of the IDs used in the expression dataset, used for ID conversion
- conv_cols: str (Optional)
The name of the column in the id_conv dataframe whose IDs match those in the dataset. Assumes the naming convention generated by pybiomart
- data_ids: str (Optional)
The name of the column in the dataset dataframe for the IDs to be converted to UniProt IDs
- penalty: ‘dynamic’ or int (Optional)
The penalty value, or ‘dynamic’, to use during network separation calculations.
- run_conversion: bool
Whether the dataset IDs have to be converted using a provided id_conv dataframe
- kwargs
- Additional keyword arguments for generating the DANSy network. See dansy.set_ngram_parameters for details; acceptable values are reproduced below:
‘min_arch’
‘max_node_len’
‘collapse’
‘readable_flag’
‘verbose’
- Attributes:
- At Initialization
- dataset: pandas DataFrame
The expression dataset for DANSy analysis
- ref: pandas DataFrame
The reference file information for the proteins within the dataset
- n: int
The maximum length of n-grams being extracted
- interproIDs: list
A list of all protein domain InterPro IDs that were found within the dataset
- protsOI: list
The UniProt IDs for the proteins found within the dataset
- ngrams: list
The extracted domain n-grams
- collapsed_ngrams: list
The domain n-grams that were collapsed into other n-grams representing the same set of proteins
- G: networkx Graph
The network graph representation of the DANSy n-gram network
- adj: pandas DataFrame
The adjacency matrix for the n-gram network for the DANSy analysis
- interpro2uniprot: dict
A dictionary keyed by InterPro ID, where each value is a list of the UniProt IDs that have that InterPro ID
- id_conversion_dict: dict
A dictionary containing the conversion of the provided gene/protein ID to a UniProt ID. If UniProt IDs were provided, returns a dict of UniProt to UniProt IDs.
- data_id_cols: str
The ID column used in the dataset for conversion
- network_params: dict
Key-value pairs of acceptable networkx drawing parameters
- min_arch: int (Default: 1)
The minimum number of domain architectures for an n-gram to be retained.
- max_node_len: int
The maximum n-gram length that will be retained during the collapsing step to represent n-grams sharing the same set of proteins. This will not be larger than n (Default of 10).
- collapse: bool
Whether the n-grams were collapsed
- readable_flag: bool
Whether the n-grams are human-legible
- verbose: bool
Whether progress statements are to be printed during calculations
- After Establishing Comparison MetaData
- comp_metadata: pandas DataFrame
The metadata information for comparisons of interest in the dataset
- After DEG Calculations
- up_DEGs/down_DEGs: list
List of UniProt IDs that were designated as up- or down-regulated for a specific condition
- up_ngrams/down_ngrams: list
The n-grams for either up- or down-regulated proteins/genes
- alpha: float
The p-value cutoff for designating DEGs
- fcthres: float
The fold-change cutoff for designating DEGs
- pval_data_col: str
The column name in the dataset used for the p-value data for DEGs
- fc_data_col: str
The column name in the dataset used for fold-change data for DEGs
- After creating deDANSy scores
- scores: pandas DataFrame
The scores, p-values, and FPR values for the deDANSy Separation and Distinction scores for the comparisons
- raw_dists: pandas DataFrame
The raw distribution of values for each score
- fpr_dists: pandas DataFrame
The raw distribution of FPR values for each score
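The n and ngrams attributes above refer to domain n-grams, that is, runs of up to n consecutive domains taken from a protein's domain architecture. The following is a minimal sketch of that idea, not the library's extraction code. It assumes an architecture is an ordered string of domain names joined with "|" (the separator seen in n-gram names elsewhere in this reference); extract_ngrams is a hypothetical helper and the domain names are toy values.

```python
def extract_ngrams(architecture, n=10):
    """Return every contiguous domain n-gram of length 1..n from one architecture.

    Hypothetical helper for illustration; dansy's own extraction may differ.
    """
    domains = architecture.split("|")
    max_len = min(n, len(domains))
    ngrams = set()
    for length in range(1, max_len + 1):
        for start in range(len(domains) - length + 1):
            ngrams.add("|".join(domains[start:start + length]))
    # Sort by n-gram length first, then alphabetically, for stable output.
    return sorted(ngrams, key=lambda g: (g.count("|"), g))


# Toy architecture with three domains, limited to n-grams of length <= 2.
example = extract_ngrams("SH2|SH3|Kinase", n=2)
print(example)
```

With n=2 the full three-domain architecture itself is not emitted, which mirrors how n caps the n-gram length.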
Methods
DEG_network_sep([force_run]) This computes the network separation of two conditions that designates individual genes as differentially expressed.
calc_DEG_ngrams(comp[, batch_mode]) Defines the DEG UniProt IDs and the associated n-grams for the dataset of interest.
calculate_ngram_enrichment(comparison[, ...]) Calculates the n-gram enrichment for the comparison of interest to find the most significant n-grams.
calculate_scores(comps[, fpr_trials, ...]) This calculates the separation and distinct functional neighborhood scores for a deDANSy instance.
create_contrast_metadata(comparisons, ...[, ...]) This creates the metadata for the deDANSy object that contains information for each comparison of interest and the cutoffs to define differentially expressed genes/proteins (DEGs).
deg_summary([detailed]) This provides a summary of the DEG information that has been used within this dataset.
get_ngram_results(comparison) Returns the n-gram enrichment results of the comparison of interest.
plot_DEG_ns([pos, deg_labels, large_cc_mode]) Using the defined differentially expressed genes, displays a network graph of the n-grams that are associated with or shared between the DEG conditions.
plot_ngram_enrichment(comparison[, p, q, ...]) Creates a bubble plot of the enriched n-grams for up- and down-regulated n-grams for the comparison of interest.
plot_scores([show_FPR, aspect, order]) This creates the bubble plots for the separation and distinction scores.
- DEG_network_sep(force_run=False)[source]
This computes the network separation of two conditions that designates individual genes as differentially expressed. Silently passes a nan if it fails.
- calc_DEG_ngrams(comp, batch_mode=False)[source]
Defines the DEG UniProt IDs and the associated n-grams for the dataset of interest
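How the alpha and fcthres cutoffs can designate up and down DEGs is sketched below. This is an illustration only: the record layout, the column names (log2fc, padj), the protein IDs, and the inclusive comparison rules are all assumptions, and split_degs is a hypothetical helper rather than the library's implementation.

```python
def split_degs(records, fc_col, pval_col, alpha=0.05, fcthres=1.0):
    """Split records into up- and down-regulated ID lists.

    Assumes fold changes are signed (e.g. log2), so down-regulation is
    a value at or below -fcthres. These rules are assumptions.
    """
    up, down = [], []
    for rec in records:
        if rec[pval_col] > alpha:
            continue  # fails the p-value cutoff
        if rec[fc_col] >= fcthres:
            up.append(rec["uniprot_id"])
        elif rec[fc_col] <= -fcthres:
            down.append(rec["uniprot_id"])
    return up, down


# Toy records with placeholder IDs and made-up values.
records = [
    {"uniprot_id": "PROT_A", "log2fc": 2.1, "padj": 0.001},
    {"uniprot_id": "PROT_B", "log2fc": -1.6, "padj": 0.01},
    {"uniprot_id": "PROT_C", "log2fc": 0.2, "padj": 0.5},
    {"uniprot_id": "PROT_D", "log2fc": 3.0, "padj": 0.2},
]
up_degs, down_degs = split_degs(records, "log2fc", "padj")
print(up_degs, down_degs)
```

PROT_D has a large fold change but fails the p-value cutoff, so it appears in neither list.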
- calculate_ngram_enrichment(comparison, fpr_trials=100, seed=None)[source]
Calculates the n-gram enrichment for the comparison of interest to find the most significant n-grams.
- calculate_scores(comps, fpr_trials=50, min_pval=-10, num_ss_trials=100, processes=1, seed=None, verbose=True, overwrite=False)[source]
This calculates the separation and distinct functional neighborhood scores for a deDANSy instance. It creates new attributes for the deDANSy instance containing all the information of interest. If scores for a condition have been generated, this will raise a warning and overwrite existing scores.
- Parameters:
- dedansy: deDANSy object
The base deDANSy object containing all n-grams and expression data
- comps: list or str
The conditions that will be compared
- fpr_trials: int (Optional)
Number of FPR trials to perform.
- min_pval: int (Optional)
The log10 transform of the minimum p-value to use for the p-value pruning sweep step.
- num_ss_trials: int (Optional)
The number of subsampled (and random) networks used to build distributions for comparing the network separation and IQR
- processes: int (Optional)
Number of processes to use if multiprocessing is desired. (4-8 processes recommended when feasible)
- seed: int
Seed for random numbers. If not provided, the system time is used
- Returns:
- The following attributes are added or adjusted:
- scores: pandas DataFrame
All scores, p-values, and FPR values for each condition
- raw_dists: pandas DataFrame
For each condition the raw values that made up the distributions for each score
- fpr_dists: pandas DataFrame
For each condition the p-values from the random FPR trials for calculating the FPR
- create_contrast_metadata(comparisons, fc_cols, pval_cols, fcs=1, alphas=0.05, delim=None)[source]
This creates the metadata for the deDANSy object that contains information for each comparison of interest and the cutoffs to define differentially expressed genes/proteins (DEGs). The same number of fold-change columns and p-value columns must be provided.
- Parameters:
- comparisons: list
All comparisons that will be used that match what is provided in the deDANSy dataset
- fc_cols: str or list
Either a stem or list of columns used for the fold changes to define DEGs. (Use the delim parameter and a stem if recurring patterns are used where the comparison is at the end.)
- pval_cols: str or list
Either a stem or list of columns used for the p-values to define DEGs. (Use the delim parameter and a stem if recurring patterns are used where the comparison is at the end.)
- fcs: float or list
The fold-change cutoff. Can either be a single value or a list of values of the same size as comparisons
- alphas: float or list
The p-value cutoff. Can either be a single value or a list of values of the same size as comparisons
- delim: str (Optional)
Delimiter used in column names separating the stem of the column from the comparison. If none then will take the
- Returns:
- comp_meta: pandas DataFrame
DataFrame where each row is a single comparison
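The stem-plus-delimiter convention and the single-value-or-list cutoffs described above can be sketched as follows. This is a rough illustration under stated assumptions: column names are taken to be stem + delim + comparison, the row keys (comparison, fc_col, pval_col, fc_cutoff, alpha) are invented for the example, and build_contrast_rows is a hypothetical helper, not the library's code.

```python
def build_contrast_rows(comparisons, fc_stem, pval_stem, fcs=1, alphas=0.05, delim="_"):
    """Build one metadata row per comparison (illustrative only)."""

    def as_list(value):
        # Broadcast a single cutoff to every comparison, or validate a list.
        if isinstance(value, (list, tuple)):
            if len(value) != len(comparisons):
                raise ValueError("cutoff list must match the number of comparisons")
            return list(value)
        return [value] * len(comparisons)

    rows = []
    for comp, fc, alpha in zip(comparisons, as_list(fcs), as_list(alphas)):
        rows.append({
            "comparison": comp,
            "fc_col": f"{fc_stem}{delim}{comp}",      # assumed naming pattern
            "pval_col": f"{pval_stem}{delim}{comp}",  # assumed naming pattern
            "fc_cutoff": fc,
            "alpha": alpha,
        })
    return rows


rows = build_contrast_rows(["A_vs_B", "C_vs_D"], "log2FoldChange", "padj", alphas=[0.05, 0.01])
for row in rows:
    print(row)
```

Each resulting row corresponds to one comparison, matching the one-row-per-comparison shape of the returned metadata.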
- deg_summary(detailed=False)[source]
This provides a summary of the DEG information that has been used within this dataset.
- get_ngram_results(comparison)[source]
Returns the n-gram enrichment results of the comparison of interest.
- plot_DEG_ns(pos=[], deg_labels=[], large_cc_mode=False)[source]
Using the defined differentially expressed genes, displays a network graph of the n-grams that are associated with or shared between the DEG conditions.
- plot_ngram_enrichment(comparison, p=0.05, q=0.05, show_FPR=True, **kwargs)[source]
Creates a bubble plot of the enriched n-grams for up- and down-regulated n-grams for the comparison of interest. The resulting plot can be partially customized with additional keyword parameters. This function is a wrapper for the plot_enriched_ngrams function from the enrichment_plotting_helpers module, but specific to the deDANSy object instance.
- plot_scores(show_FPR=True, aspect=0.9, order=None)[source]
This creates the bubble plots for the separation and distinction scores. This is a wrapper function for the base plotting function found in the enrichment_plotting_helpers module.
- Parameters:
- show_FPR: bool (Optional)
Whether the FPR legend handles should be displayed.
- aspect: float (Optional)
The aspect ratio for each score plot.
- order: dict (Optional)
Key-value pairs for each comparison and what order they should be displayed on the axis.
- Returns:
- ax: matplotlib Axes
The axes of the resulting plot
deDANSy Supporting Functions
These functions are integrated into deDANSy and do not need to be called directly, but they can aid in the analysis of a deDANSy object or of graphs from multiple (de)DANSy objects.
- dansy.network_separation_helpers.build_network_reference_dict(ref_ngram_net, penalty=None)[source]
This builds the dict that contains the reference network information necessary for calculating the network separation value from a provided DomainNgramNetwork object.
It is recommended to use this function to generate the reference dictionary prior to calculating network separation, especially when using a dynamic penalty and comparing several networks, to improve the execution speed of the calculation.
- dansy.network_separation_helpers.network_separation(G_in, H_in, ref_G_data, mmd_verbose=False, force_run=False, verbose=True)[source]
Calculates the network separation between two networks of interest that lie on a common larger, reference network.
- dansy.enrichment_plotting_helpers.collapse_to_max_info(ngram_list, res_df)[source]
This collapses the n-grams to those that represent the most discriminating information of interest. Longer n-grams are collapsed into shorter ones if their p-value trends are similar but the longer n-grams are slightly less significant. If a longer n-gram is more significant, it will not be collapsed.
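The collapsing rule can be illustrated with a simplified sketch. It is not the library's algorithm: it skips the check that p-value trends are similar and simply drops a longer n-gram whenever a shorter n-gram contained in it is at least as significant. The function names and the toy p-values are hypothetical.

```python
def is_contiguous_sublist(short, long):
    """True when the list `short` appears as a contiguous run inside `long`."""
    width = len(short)
    return any(long[i:i + width] == short for i in range(len(long) - width + 1))


def collapse_sketch(pvalues):
    """Map of n-gram -> p-value, with redundant longer n-grams removed."""
    kept = {}
    # Visit shorter n-grams first so they can absorb longer ones.
    for ngram in sorted(pvalues, key=lambda g: g.count("|")):
        parts = ngram.split("|")
        absorbed = any(
            len(s.split("|")) < len(parts)
            and is_contiguous_sublist(s.split("|"), parts)
            and pvalues[s] <= pvalues[ngram]
            for s in kept
        )
        if not absorbed:
            kept[ngram] = pvalues[ngram]
    return kept


# Toy p-values, made up for illustration.
toy = {
    "EGF-like domain|EGF-like domain": 0.002,
    "EGF-like domain": 0.001,
    "EGF-like domain|Laminin G domain": 0.0001,
}
collapsed = collapse_sketch(toy)
print(list(collapsed))
```

The repeated EGF-like n-gram is absorbed by its shorter counterpart, while the more significant two-domain n-gram is retained.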
- dansy.enrichment_plotting_helpers.gather_enrichment_results(hyper_values, fpr_values)[source]
This gathers both the statistical results from the enrichment analysis and the FPR calculations to return a complete results dataframe.
- dansy.enrichment_plotting_helpers.get_max_info_enriched_ngrams(res_df, condition_labels=None, q=None, p=None)[source]
Returns the top X enriched n-grams that pass a quantile and/or p-value cutoff. This will collapse n-grams that provide similar p-value trends into a single representative n-gram if a shorter n-gram is in the longer n-gram. Longer n-grams that have more significant p-values than their shorter counterparts will be retained.
- dansy.enrichment_plotting_helpers.plot_enriched_ngrams(res, dansyOI, condition_labels=None, q=0.05, p=None, show_FPR=True, **kwargs)[source]
This plots the top X percent (default 5) of n-grams enriched between two different conditions. For clarity, n-grams that contain similar information will be collapsed into the shorter n-gram (i.e. if EGF-like domain and EGF-like domain|EGF-like domain both have similar enrichment values they will only be represented by EGF-like domain).
- dansy.enrichment_plotting_helpers.plot_functional_scores(res, show_FPR_handle=True, aspect=0.9, order=None)[source]
This creates the bubble plots for both the separation and distinction scores calculated by deDANSy.
- dansy.enrichment_helpers.calculate_fpr(actual_res, random_res)[source]
Calculates the false positive rate for each result of interest.
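One common way to estimate an empirical false positive rate from random trials is sketched below. The definition used here (the fraction of random-trial p-values at or below the observed p-value) is an assumption and may not match what calculate_fpr does internally; empirical_fpr is a hypothetical helper and the numbers are toy values.

```python
def empirical_fpr(actual_pval, random_pvals):
    """Fraction of random-trial p-values that are <= the observed p-value.

    Assumed definition for illustration, not taken from the library.
    """
    if not random_pvals:
        raise ValueError("at least one random trial is required")
    hits = sum(1 for p in random_pvals if p <= actual_pval)
    return hits / len(random_pvals)


# Toy values: one of four random trials is as extreme as the observed result.
fpr = empirical_fpr(0.01, [0.5, 0.2, 0.005, 0.8])
print(fpr)
```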
- dansy.enrichment_helpers.calculate_separation_stability(dedansy, num_trials=50, pval_sweep=array([1.00000000e+00, 3.16227766e-01, 1.00000000e-01, 3.16227766e-02, 1.00000000e-02, 3.16227766e-03, 1.00000000e-03, 3.16227766e-04, 1.00000000e-04, 3.16227766e-05, 1.00000000e-05, 3.16227766e-06, 1.00000000e-06, 3.16227766e-07, 1.00000000e-07, 3.16227766e-08, 1.00000000e-08, 3.16227766e-09, 1.00000000e-09, 3.16227766e-10, 1.00000000e-10]), return_distributions=False, processes=1, verbose=True)[source]
Generates the distribution of network separation and interquartile range of network separation values during pruning for generating the different scores for deDANSy analysis. This is the key function of the algorithm that gathers all the results.
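The default pval_sweep shown in the signature is a logarithmic grid of 21 values from 1 down to 1e-10 in half-decade steps, the same values numpy.logspace(0, -10, 21) produces. The sketch below rebuilds that grid with the standard library only. The log_pval_sweep helper and its steps_per_decade argument are hypothetical, and linking the lower bound to the min_pval argument of calculate_scores is an inference from its default of -10, not something this reference states.

```python
import math


def log_pval_sweep(min_pval=-10, steps_per_decade=2):
    """Log-spaced p-value cutoffs from 1 down to 10**min_pval (illustrative)."""
    n_steps = abs(min_pval) * steps_per_decade
    return [10 ** (-i / steps_per_decade) for i in range(n_steps + 1)]


sweep = log_pval_sweep()
print(len(sweep), sweep[:3], sweep[-1])
```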
- dansy.enrichment_helpers.cohen_d(a, b)[source]
Calculates the Cohen’s d effect size of 2 lists of values, where the second is considered the “control” group.
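For reference, a textbook form of Cohen's d divides the difference in means by the pooled standard deviation. The sketch below uses that form as an assumption; the library's cohen_d may instead normalize by the control group's spread or differ in other details, and cohen_d_sketch is a hypothetical name.

```python
import statistics


def cohen_d_sketch(a, b):
    """Cohen's d with a pooled standard deviation; `b` is the control group."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


# Toy values: means differ by 1 and both groups have a standard deviation of 2.
d = cohen_d_sketch([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(d)
```

A positive value means the first group has the larger mean relative to the control.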
- dansy.enrichment_helpers.individual_trial_calc(dedansy, arch_weights, comp_arch_dist, ratio, sweep, originals, dist_flag=False, seed=123)[source]
Calculates the network separation and pruning analysis for subsampled and randomly chosen genes.
- dansy.enrichment_helpers.retrieve_fpr_checks(dedansy, num_DEGs, fpr_trials=50, num_internal_trials=50, deg_ratios=0.7, processes=1, seed=123)[source]
Performs a false positive rate check on the values for the differentially expressed genes by performing the same subsampling and random gene selection analysis with random genes that were measured, to determine whether findings are true positives. This process can be highly time-consuming, depending on the number of FPR trials that are to be conducted.