Performing deDANSy on an RNA-sequencing dataset (Shimpi & Tan et al. 2021)

Note: This come directly from the supplementary file provided with the publication. This is based on a RNA-sequencing dataset which requires converting between ENSEMBL gene IDs to UniProt IDs.

[1]:

import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from pybiomart import Dataset

import dansy

[2]:

# First need to import and create an ID conversion dataframe
bm_dataset = Dataset(host = 'http://useast.ensembl.org', name='hsapiens_gene_ensembl',)
gene_ID_conv = bm_dataset.query(attributes=['ensembl_gene_id','external_gene_name','uniprotswissprot'])

# Importing the proteome reference file
ref = dansy.import_proteome_files()

[3]:

# Now importing the data of interest
input_dir = './'
filename = 'Shimpi_Tan_RNA_seq_Results.csv'
data = pd.read_csv(input_dir+filename)

[4]:

# Now building the deDANSy object
my_dedansy = dansy.DEdansy(uniprot_ref = ref,
                           dataset=data,
                            id_conv=gene_ID_conv,
                            conv_cols = 'Gene stable ID',
                            data_ids = 'ensembl_gene_id')

# And now creating the metadata for designating differentially expressed genes
my_dedansy.create_contrast_metadata(comparisons = ['HAS2_v_Control','HAS3_v_Control'],
                                    fc_cols = ['log2FoldChange_HAS2','log2FoldChange_HAS3'],
                                    pval_cols = ['padj_HAS2','padj_HAS3'],
                                    fcs = 1,
                                    alphas = 0.05)

Starting to fetch n-grams.
Finished getting all n-grams
Starting to generate adjacency
Finished building adjacency.
Building the reference network information.

[5]:

# The deDANSy object now has several built-in functions that allow for quick visualization of the n-gram network colored by whether n-grams are associated with up- or down-regulated genes.

# Setting up the DEG condition
my_dedansy.calc_DEG_ngrams('HAS2_v_Control')

# Now establishing the layout of the network and then visualizing
pos = nx.spring_layout(my_dedansy.G, k = 0.04, seed = 123)
my_dedansy.plot_DEG_ns(pos=pos)
plt.gcf().set_size_inches(1,1)
plt.gcf().set_dpi(300)

[6]:

# If you only want to focus on the largest connected component:
my_dedansy.plot_DEG_ns(pos=pos, large_cc_mode=True)
plt.gcf().set_size_inches(1,1)
plt.gcf().set_dpi(300)

While visualizing the network provides an initial assessment of how distinct n-grams from each condition are, to get a more quantitative value we can calculate the separation and distinction scores from the actual deDANSy algorithm.

[7]:

# The algorithm can be run on multiple conditions in one round
my_dedansy.calculate_scores(comps = ['HAS2_v_Control', 'HAS3_v_Control'],
                            num_ss_trials=20, # Lower than recommended to save time (Default value of 100 is recommended)
                            fpr_trials=5, # Lower than recommended to save time (Default value of 50 is recommended)
                            min_pval=-10,
                            processes=8,
                            seed=123, # For reproducibility but can be omitted
                            verbose=True)

2025-10-08 17:08:42:.....Starting deDANSy scoring analysis with random seed: 123
2025-10-08 17:08:53:.....Done with enrichment-based pruning for HAS2_v_Control
2025-10-08 17:09:09:.....Done with enrichment-based pruning for HAS3_v_Control
2025-10-08 17:09:09:.....Starting the false positive rate procedure.
2025-10-08 17:09:57:.....Finished FPR procedure for HAS2_v_Control.
2025-10-08 17:10:58:.....Finished FPR procedure for HAS3_v_Control.
2025-10-08 17:10:58:.....Preparing the FPR results for export

[8]:

# We can then display each of the scores as a bubble plot
axs = my_dedansy.plot_scores(aspect=5) # The aspect parameter changes the aspect ratio which can make the graph a bit more aesthetically pleasing.
plt.gcf().set_dpi(100)
plt.gcf().set_size_inches(4.5,2)

[9]:

# A more in-depth table of the results can be displayed or saved using the scores parameter, which includes both the Mann-Whitney U significance test and the FPR results.
my_dedansy.scores

[9]:

	Comparison	Separation_Score	Separation_Category	Separation_p	Separation_FPR	Distinction_Score	Distinction_Category	Distinction_p	Distinction_FPR
Comparison
HAS2_v_Control	HAS2_v_Control	0.882058	Less	1.058121e-02	0.4	0.413433	Stably Distinct	0.694891	1.0
HAS3_v_Control	HAS3_v_Control	3.824758	Less	6.795615e-08	0.0	0.344319	Unstable/Overlapping	0.239324	0.2

Running additional comparisons after scores have been generated

Sometimes, you may want to add in additional comparisons, which are either additional biological comparisons not explored, or to be more restrictive in what is designated as a differentially expressed gene/protein. You can add in additional comparisons after scores have been calculated and then run a second time with new comparisons to allow additional analysis to be conducted.

[10]:

my_dedansy.create_contrast_metadata(comparisons = ['HAS2_restrictive', 'HAS3_restrictive'],
                                    fc_cols = ['log2FoldChange_HAS2','log2FoldChange_HAS3'],
                                    pval_cols = ['padj_HAS2','padj_HAS3'],
                                    fcs = 1,
                                    alphas = 0.001)
my_dedansy.calculate_scores(comps = ['HAS2_restrictive', 'HAS3_restrictive'],
                            num_ss_trials=20, # Lower than recommended to save time (Default value of 100 is recommended)
                            fpr_trials=5, # Lower than recommended to save time (Default value of 50 is recommended)
                            min_pval=-10,
                            processes=8,
                            seed=123, # For reproducibility but can be omitted
                            verbose=True)

2025-10-08 17:10:59:.....Starting deDANSy scoring analysis with random seed: 123
2025-10-08 17:11:08:.....Done with enrichment-based pruning for HAS2_restrictive
2025-10-08 17:11:21:.....Done with enrichment-based pruning for HAS3_restrictive
2025-10-08 17:11:21:.....Starting the false positive rate procedure.
2025-10-08 17:12:05:.....Finished FPR procedure for HAS2_restrictive.
2025-10-08 17:13:01:.....Finished FPR procedure for HAS3_restrictive.
2025-10-08 17:13:01:.....Preparing the FPR results for export

[11]:

axs = my_dedansy.plot_scores(aspect=1.5) # The aspect parameter changes the aspect ratio which can make the graph a bit more aesthetically pleasing.
plt.gcf().set_dpi(100)
plt.gcf().set_size_inches(4.5,2)

[12]:

my_dedansy.comp_metadata

[12]:

	fc_col	fc_thres	pval_col	alpha
HAS2_v_Control	log2FoldChange_HAS2	1	padj_HAS2	0.05
HAS3_v_Control	log2FoldChange_HAS3	1	padj_HAS3	0.05
HAS2_restrictive	log2FoldChange_HAS2	1	padj_HAS2	0.001
HAS3_restrictive	log2FoldChange_HAS3	1	padj_HAS3	0.001

Determining the enriched n-grams of individual conditions.

In addition to generating scores for network wide behavior, deDANSy can get the significantly enriched n-grams for both up- and down-regulated proteins/genes. Like how DANSy collapses n-grams representing similar data, the n-gram enrichment calculation collapses n-grams that have similar significance values that have a parent-children relationship (i.e. shorter n-gram in a longer n-gram).

[13]:

my_dedansy.calculate_ngram_enrichment('HAS2_v_Control')
my_dedansy.plot_ngram_enrichment('HAS2_v_Control')

# And just like before can adjust the aspect ratio to make it a little more aesthetically pleasing.
plt.gca().set_aspect(.5)
plt.title('HAS2_v_Control')
plt.show()

Will be overwriting existing DEG data.

Each comparison can have its own plot to show the unique/shared n-grams for each comparison.

[14]:

my_dedansy.calculate_ngram_enrichment('HAS3_v_Control')
my_dedansy.plot_ngram_enrichment('HAS3_v_Control')

# And just like before can adjust the aspect ratio to make it a little more aesthetically pleasing.
plt.gca().set_aspect(.5)
plt.title('HAS3_v_Control')
plt.show()

Will be overwriting existing DEG data.

The displayed n-grams can be adjusted to pass a more lenient/restrictive cutoff for being displayed. We can use either a p-value cutoff (p parameter) or a quantile cutoff (q parameter) to change the plot behavior.

[15]:

my_dedansy.plot_ngram_enrichment('HAS3_v_Control',q=0.01)
plt.gca().set_aspect(2)
plt.title('HAS3_v_Control\nOnly Top 1%')
plt.show()

If you want to check the full n-gram enrichment data used for plotting, you can use the get_ngram_results method for the comparison of interest

[16]:

my_dedansy.get_ngram_results('HAS2_v_Control')

[16]:

	variable	ngram_legible	ngram	p	-log10(p)	FPR	FPR <= 0.05
0	Up	Znf_C2H2_type\|Znf_C2H2_type\|Znf_C2H2_type\|Znf_...	IPR013087\|IPR013087\|IPR013087\|IPR013087\|IPR013...	0.998364	0.000711	1.000000	False
1	Up	Znf_C2H2_type\|Znf_C2H2_type\|Znf_C2H2_type	IPR013087\|IPR013087\|IPR013087	0.997764	0.000972	1.000000	False
2	Up	Small_GTP-bd	IPR005225	0.546010	0.262799	0.346667	False
3	Up	RRM_dom	IPR000504	0.930060	0.031489	0.914286	False
4	Up	SH3_domain	IPR001452	0.754723	0.122212	0.704225	False
...	...	...	...	...	...	...	...
567	Down	LamG-like\|Notch_dom\|Notch_dom\|Sushi_SCR_CCP_do...	IPR006558\|IPR000800\|IPR000800\|IPR000436\|IPR000...	0.007810	2.107342	0.000000	True
568	Down	WW_dom\|PDZ\|FERM_domain	IPR001202\|IPR001478\|IPR000299	0.007810	2.107342	0.000000	True
569	Down	LCCL\|VWF_A\|VWF_A	IPR004043\|IPR002035\|IPR002035	0.007810	2.107342	0.000000	True
570	Down	MyBP-C_THB\|Ig_sub\|Ig_sub2\|Ig_sub\|Ig_sub2\|FN3_d...	IPR040849\|IPR003599\|IPR003598\|IPR003599\|IPR003...	0.007810	2.107342	0.000000	True
571	Down	Ig_sub\|MyBP-C_THB\|Ig_sub\|Ig_sub2\|Ig_sub\|Ig_sub...	IPR003599\|IPR040849\|IPR003599\|IPR003598\|IPR003...	0.007810	2.107342	0.000000	True

300 rows × 7 columns

Performing deDANSy analysis with (phospho-)proteomic data

If you are working with a proteomic dataset and do not require converting IDs because you are using UniProt IDs, you can build a deDANSy object without the conversion and perform analysis as outlined below with a phosphoproteomic dataset from Chen et al, 2024. We are using Supplementary Table S5 specifically.

[17]:

# Importing the data
filename = 'CUTO28_pY_Results.xlsx'
raw_phos = pd.read_excel(filename,header=1)

# First filtering out the raw data to ensure we only have the human data and those found within our reference file.
proc_phos = raw_phos[raw_phos['IsHuman'] == 1]
proc_phos = proc_phos[proc_phos['Protein'].isin(ref['UniProt ID'])]

[18]:

# Now building the deDANSy object, which can ignore any ID conversion by setting the run_conversion parameter to False.
prot_dedansy = dansy.DEdansy(dataset = proc_phos,
                             uniprot_ref = ref,
                             data_ids = 'Protein',
                             run_conversion = False)

Starting to fetch n-grams.
Finished getting all n-grams
Starting to generate adjacency
Finished building adjacency.
Building the reference network information.

After this the process of calculating scores and plotting them is identical.