Base DANSy Analysis

This example uses the base version of DANSy to analyze proteins of interest.

[1]:
import numpy as np
import pandas as pd
import networkx as nx
import dansy

First, we need to import a reference file that contains all the protein information. Here, we import a reference file for the entire proteome.

[3]:
ref = dansy.import_proteome_files()

Next, we build the DANSy object and draw the resulting network.

[4]:
my_dansy = dansy.dansy(ref=ref)
my_dansy.draw_network()
Starting to fetch n-grams.
Finished getting all n-grams
Starting to generate adjacency
Finished building adjacency.
../_images/Examples_base_dansy_5_1.png

DANSy Analysis on a subset of proteins of interest

If we want to focus on a small subset of proteins rather than the whole proteome, we can provide a list of UniProt IDs in addition to the reference file.

For example, if we want to visualize the domain n-grams of the ERBB, FGFR, EPHA, and EPHB receptor tyrosine kinase families:

[5]:
rtks = ['P00533','P04626','P21860','Q15303', # ERBB family
        'P11362','P21802','P22607','P22455', # FGFR family
        'P54764','P21709','P29317','P29322','P54756','Q15375','P29320','Q9UF33', # EPHA family
        'P29323','P54760','P54762','P54753'] # EPHB family

rtk_dansy = dansy.dansy(protsOI=rtks, ref=ref)

# Adjust the network aesthetics; the fixed seed keeps the spring layout reproducible
pos = nx.spring_layout(rtk_dansy.G, seed=123)
network_params = {'node_size':50,
                  'edgecolor':'k',
                  'width':1,
                  'linewidths':0.25,
                  'pos':pos}

rtk_dansy.network_params = network_params

rtk_dansy.draw_network()
Starting to fetch n-grams.
Finished getting all n-grams
Starting to generate adjacency
Finished building adjacency.
../_images/Examples_base_dansy_7_1.png

We can then extract all the n-grams by accessing the ngrams attribute.

[7]:
rtk_dansy.ngrams
[7]:
['IPR000719',
 'IPR001090',
 'IPR003961|IPR027936|IPR000719|IPR001660',
 'IPR003961|IPR003961|IPR027936|IPR000719|IPR001660',
 'IPR001090|IPR011641|IPR003961|IPR003961|IPR027936|IPR000719|IPR001660',
 'IPR001090|IPR003961',
 'IPR000494|IPR006211|IPR000494|IPR032778',
 'IPR003598|IPR003598|IPR003598|IPR000719',
 'IPR000494|IPR006211|IPR000494|IPR032778|IPR049328|IPR000719',
 'IPR001090|IPR003961|IPR003961|IPR027936|IPR000719|IPR001660',
 'IPR001090|IPR003961|IPR027936|IPR000719|IPR001660',
 'IPR000494|IPR006211|IPR000494|IPR032778|IPR000719']

However, these can be difficult to read because the default values are InterPro IDs rather than human-readable names. To convert them to domain names, we can use the return_legible_ngram method.

[8]:
[rtk_dansy.return_legible_ngram(x) for x in rtk_dansy.ngrams]
[8]:
['Prot_kinase_dom',
 'Ephrin_rcpt_lig-bd_dom',
 'FN3_dom|Eph_TM|Prot_kinase_dom|SAM',
 'FN3_dom|FN3_dom|Eph_TM|Prot_kinase_dom|SAM',
 'Ephrin_rcpt_lig-bd_dom|Tyr-kin_ephrin_A/B_rcpt-like|FN3_dom|FN3_dom|Eph_TM|Prot_kinase_dom|SAM',
 'Ephrin_rcpt_lig-bd_dom|FN3_dom',
 'Rcpt_L-dom|Furin-like_Cys-rich_dom|Rcpt_L-dom|GF_recep_IV',
 'Ig_sub2|Ig_sub2|Ig_sub2|Prot_kinase_dom',
 'Rcpt_L-dom|Furin-like_Cys-rich_dom|Rcpt_L-dom|GF_recep_IV|TM_ErbB1|Prot_kinase_dom',
 'Ephrin_rcpt_lig-bd_dom|FN3_dom|FN3_dom|Eph_TM|Prot_kinase_dom|SAM',
 'Ephrin_rcpt_lig-bd_dom|FN3_dom|Eph_TM|Prot_kinase_dom|SAM',
 'Rcpt_L-dom|Furin-like_Cys-rich_dom|Rcpt_L-dom|GF_recep_IV|Prot_kinase_dom']
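
Conceptually, this conversion is a per-domain lookup on the '|'-delimited n-gram string. The sketch below illustrates the idea with a hand-written dictionary built from the ID/name pairs visible in the two outputs above. The names INTERPRO_NAMES and to_legible are hypothetical, and this is not how return_legible_ngram is implemented inside DANSy.

```python
# Hypothetical lookup table, copied from the ID/name pairs shown in the outputs above.
INTERPRO_NAMES = {
    "IPR000719": "Prot_kinase_dom",
    "IPR001090": "Ephrin_rcpt_lig-bd_dom",
    "IPR003961": "FN3_dom",
    "IPR027936": "Eph_TM",
    "IPR001660": "SAM",
}


def to_legible(ngram, names=INTERPRO_NAMES):
    """Translate a '|'-delimited n-gram of InterPro IDs into domain names.

    IDs that are missing from the lookup table are left unchanged.
    """
    return "|".join(names.get(domain_id, domain_id) for domain_id in ngram.split("|"))


legible = to_legible("IPR003961|IPR027936|IPR000719|IPR001660")
print(legible)  # FN3_dom|Eph_TM|Prot_kinase_dom|SAM
```

Leaving unknown IDs untouched is a design choice for this sketch: the n-gram keeps its length and order even when the table is incomplete.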

Useful DANSy attributes and methods for additional analysis

[9]:
# The G attribute is a networkx Graph object, so any analysis that works on a networkx Graph can be applied to it (here, rtk_dansy.G)

# Centrality measurements:
deg_cent = nx.degree_centrality(rtk_dansy.G)
btwn_cent = nx.betweenness_centrality(rtk_dansy.G)

# Connected components (wrapped in list() because networkx returns a generator)
conn_c = list(nx.connected_components(rtk_dansy.G))
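
The networkx calls above return a dictionary (the centralities) and a generator (the connected components), so a little post-processing is usually needed before the results can be inspected. The following sketch uses a small toy graph in place of a DANSy network. The node names are placeholders, not real n-grams.

```python
import networkx as nx

# Toy graph standing in for a DANSy network; node names are placeholders.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("E", "F")])

# Degree centrality is returned as a {node: value} dictionary.
deg_cent = nx.degree_centrality(G)

# Rank the nodes from most to least central and keep the top three.
top_nodes = sorted(deg_cent, key=deg_cent.get, reverse=True)[:3]

# connected_components yields sets of nodes lazily; sorting materializes them,
# largest component first.
components = sorted(nx.connected_components(G), key=len, reverse=True)

print(top_nodes[0], len(components))
```

The same post-processing should apply to rtk_dansy.G, since it is a regular networkx Graph.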

[10]:
# Get a summary of the DANSy analysis
rtk_dansy.summary(detailed=True)
[10]:
name                                            Domain n-gram Network
Proteins                                        20
n-grams                                         12
Network Isolates                                0
Network Connected Components                    1
Collapsed n-grams                               52
Network Edges                                   22
Maximum Length of Protein Domain Architecture   7
[11]:
# Get the information for a specific protein that was used in the analysis
rtk_dansy.retrieve_protein_info(prot = 'P00533')
[11]:
UniProt ID Gene Species Uniprot Domains Ref Sequence PDB IDs Uniprot Domain Architecture Interpro Domains Interpro Domain Architecture Interpro Domain Architecture IDs
2839 P00533 EGFR Homo sapiens Protein kinase:712:979 MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED... 1IVO;1M14;1M17;1MOX;1NQL;1XKK;1YY9;1Z9I;2EB2;2... Protein kinase Rcpt_L-dom:IPR000494:57:167;Furin-like_Cys-ric... Rcpt_L-dom|Furin-like_Cys-rich_dom|Rcpt_L-dom|... IPR000494|IPR006211|IPR000494|IPR032778|IPR049...
[12]:
# Get the information for all proteins that contain a specific n-gram (e.g. 'FN3_dom|Eph_TM|Prot_kinase_dom|SAM', which corresponds to 'IPR003961|IPR027936|IPR000719|IPR001660')
rtk_dansy.retrieve_protein_info(ngram='IPR003961|IPR027936|IPR000719|IPR001660')
[12]:
UniProt ID Gene Species Uniprot Domains Ref Sequence PDB IDs Uniprot Domain Architecture Interpro Domains Interpro Domain Architecture Interpro Domain Architecture IDs
501 P29323 EPHB2 Homo sapiens Eph LBD:20:202;Fibronectin type-III 1:324:434;... MALRRLGAALLLLPLLAAVEETLMDSTTATAELGWMVHPPSGWEEV... 1B4F;1F0M;2QBX;3ZFM;8EBL Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:20:202;Tyr-ki... Ephrin_rcpt_lig-bd_dom|Tyr-kin_ephrin_A/B_rcpt... IPR001090|IPR011641|IPR003961|IPR003961|IPR027...
834 Q15375 EPHA7 Homo sapiens Eph LBD:32:210;Fibronectin type-III 1:331:441;... MVFQTRYPSWIILCYIWLLRFAHTGEAQAAKEVLLLDSKAQQTELE... 2REI;3DKO;3H8M;3NRU;7EEC;7EED;7EEF Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:32:210;Tyr-ki... Ephrin_rcpt_lig-bd_dom|Tyr-kin_ephrin_A/B_rcpt... IPR001090|IPR011641|IPR003961|IPR003961|IPR027...
2298 P54762 EPHB1 Homo sapiens Eph LBD:19:201;Fibronectin type-III 1:322:432;... MALDYLLLLLLASAVAAMEETLMDTRTATAELGWTANPASGWEEVS... 2DJS;2EAO;3ZFX;5MJA;5MJB;6UMW;7KPL;7KPM Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:19:201;FN3_do... Ephrin_rcpt_lig-bd_dom|FN3_dom|Eph_TM|Prot_kin... IPR001090|IPR003961|IPR027936|IPR000719|IPR001660
3227 P54760 EPHB4 Homo sapiens Eph LBD:17:202;Fibronectin type-III 1:323:432;... MELRVLLCWASLAAALEETLLNTKLETADLKWVTFPQVDGQWEELS... 2BBA;2E7H;2HLE;2QKQ;2VWU;2VWV;2VWW;2VWX;2VWY;2... Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:17:202;Tyr-ki... Ephrin_rcpt_lig-bd_dom|Tyr-kin_ephrin_A/B_rcpt... IPR001090|IPR011641|IPR003961|IPR003961|IPR027...
3569 P54756 EPHA5 Homo sapiens Eph LBD:60:238;Fibronectin type-III 1:357:467;... MRGSGPRGAGRRRPPSGGGDTPITPASLAGCYSAPRRAPLWTCLLL... 2R2P;4ET7 Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:60:238;FN3_do... Ephrin_rcpt_lig-bd_dom|FN3_dom|FN3_dom|Eph_TM|... IPR001090|IPR003961|IPR003961|IPR027936|IPR000...
4409 P54764 EPHA4 Homo sapiens Eph LBD:30:209;Fibronectin type-III 1:328:439;... MAGIFYFALFSCLFGICDAVTGSRVYPANEVTLLDSRSVQGELGWI... 2LW8;2WO1;2WO2;2WO3;3CKH;3GXU;4BK4;4BK5;4BKA;4... Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:30:209;Tyr-ki... Ephrin_rcpt_lig-bd_dom|Tyr-kin_ephrin_A/B_rcpt... IPR001090|IPR011641|IPR003961|IPR003961|IPR027...
6629 P29322 EPHA8 Homo sapiens Eph LBD:31:209;Fibronectin type-III 1:328:438;... MAPARGRLPPALWVVTAAAAAATCVSAARGEVNLLDTSTIHGDWGW... 1UCV;1X5L;3KUL Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:31:209;FN3_do... Ephrin_rcpt_lig-bd_dom|FN3_dom|FN3_dom|Eph_TM|... IPR001090|IPR003961|IPR003961|IPR027936|IPR000...
7772 P54753 EPHB3 Homo sapiens Eph LBD:39:217;Fibronectin type-III 1:339:451;... MARARPPPPPSPPPGLLPLLPPLLLLPLLLLPAGCRALEETLMDTK... 3P1I;3ZFY;5L6O;5L6P Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:39:217;Tyr-ki... Ephrin_rcpt_lig-bd_dom|Tyr-kin_ephrin_A/B_rcpt... IPR001090|IPR011641|IPR003961|IPR003961|IPR027...
12235 P29320 EPHA3 Homo sapiens Eph LBD:29:207;Fibronectin type-III 1:325:435;... MDCQLSILLLLSCSVLDSFGELIPQPSNEVNLLDSKTIQGELGWIS... 2GSF;2QO2;2QO7;2QO9;2QOB;2QOC;2QOD;2QOF;2QOI;2... Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:29:207;Tyr-ki... Ephrin_rcpt_lig-bd_dom|Tyr-kin_ephrin_A/B_rcpt... IPR001090|IPR011641|IPR003961|IPR003961|IPR027...
13622 P29317 EPHA2 Homo sapiens Eph LBD:28:206;Fibronectin type-III 1:328:432;... MELQAARACFALLWGCALAAAAAAQGKEVVLLDFAAAGGELGWLTH... 1MQB;2E8N;2K9Y;2KSO;2X10;2X11;3C8X;3CZU;3FL7;3... Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:28:206;FN3_do... Ephrin_rcpt_lig-bd_dom|FN3_dom|FN3_dom|Eph_TM|... IPR001090|IPR003961|IPR003961|IPR027936|IPR000...
17323 Q9UF33 EPHA6 Homo sapiens Eph LBD:34:212;Fibronectin type-III 1:331:441;... MGGCEVREFLLQFGFFLPLLTAWPGDCSHVSNNQVVLLDTTTVLGE... Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:34:212;Tyr-ki... Ephrin_rcpt_lig-bd_dom|Tyr-kin_ephrin_A/B_rcpt... IPR001090|IPR011641|IPR003961|IPR003961|IPR027...
17777 P21709 EPHA1 Homo sapiens Eph LBD:27:209;Fibronectin type-III 1:332:445;... MERRWPLGLGLVLLLCAPLPPGARAKEVTLMDTSKAQGELGWLLDP... 2K1K;2K1L;3HIL;3KKA Eph LBD|Fibronectin type-III 1|Fibronectin typ... Ephrin_rcpt_lig-bd_dom:IPR001090:27:209;Tyr-ki... Ephrin_rcpt_lig-bd_dom|Tyr-kin_ephrin_A/B_rcpt... IPR001090|IPR011641|IPR003961|IPR003961|IPR027...
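
The "Interpro Domains" column shown above appears to store each domain as name:InterPro ID:start:end, with entries separated by semicolons. A minimal sketch for unpacking such a string follows. It assumes that format holds for every entry, and the DomainHit type and parse_interpro_domains helper are hypothetical names. The first entry of the example string is taken from the EGFR output above, while the second entry is made up for illustration because the displayed values are truncated.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    name: str
    interpro_id: str
    start: int
    end: int


def parse_interpro_domains(field):
    """Split a 'name:ID:start:end;...' string into DomainHit records."""
    hits = []
    for entry in field.split(";"):
        if not entry:
            continue  # tolerate a trailing semicolon
        # rsplit keeps any ':' inside the domain name attached to the name.
        name, interpro_id, start, end = entry.rsplit(":", 3)
        hits.append(DomainHit(name, interpro_id, int(start), int(end)))
    return hits


# First entry comes from the EGFR row; the second entry is hypothetical.
example = "Rcpt_L-dom:IPR000494:57:167;Example_dom:IPR999999:200:300"
hits = parse_interpro_domains(example)

# Joining the IDs in order gives a string in the architecture format used above.
architecture_ids = "|".join(hit.interpro_id for hit in hits)
print(architecture_ids)  # IPR000494|IPR999999
```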

Generating a DANSy network without the collapsing step

A key step in DANSy is collapsing redundant n-grams, in other words n-grams that represent the same family of proteins. This ensures that only the domain n-grams carrying the maximum information are displayed in the network. The collapsing step can be omitted when creating the dansy object, but we recommend doing so only for small protein collections (fewer than 50 proteins) because the network can otherwise become difficult to interpret.

[13]:
rtk_dansy_uncollapsed = dansy.dansy(protsOI=rtks, ref=ref, collapse = False)
Starting to fetch n-grams.
Finished getting all n-grams
Starting to generate adjacency
Finished building adjacency.
[14]:
pos = nx.spring_layout(rtk_dansy_uncollapsed.G, seed=123)
network_params = {'node_size':50,
                  'edgecolor':'k',
                  'width':1,
                  'linewidths':0.25,
                  'pos':pos}

rtk_dansy_uncollapsed.network_params = network_params

rtk_dansy_uncollapsed.draw_network()
../_images/Examples_base_dansy_19_0.png
[15]:
rtk_dansy_uncollapsed.summary(detailed=True)
[15]:
name                                            Domain n-gram Network
Proteins                                        20
n-grams                                         64
Network Isolates                                0
Network Connected Components                    1
Collapsed n-grams                               0
Network Edges                                   389
Maximum Length of Protein Domain Architecture   7
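
Comparing the two summaries, the 64 n-grams of the uncollapsed network equal the 12 n-grams plus the 52 collapsed n-grams reported for the collapsed network. The toy sketch below illustrates how such a reduction can arise. It rests on two assumptions that are not confirmed by this example: that an n-gram is any contiguous run of domains in an architecture, and that n-grams found in exactly the same set of proteins are collapsed into the longest one. The architectures are hypothetical, with letters standing in for InterPro IDs, and this is not DANSy's implementation.

```python
from collections import defaultdict

# Hypothetical domain architectures; letters stand in for InterPro IDs.
architectures = {
    "prot1": ["A", "B", "C"],
    "prot2": ["A", "B", "C"],
    "prot3": ["B", "C", "D"],
}


def contiguous_ngrams(domains):
    """Return every contiguous run of domains as a '|'-delimited string."""
    return {
        "|".join(domains[i:j])
        for i in range(len(domains))
        for j in range(i + 1, len(domains) + 1)
    }


# Map each n-gram to the set of proteins whose architecture contains it.
ngram_proteins = defaultdict(set)
for prot, domains in architectures.items():
    for ngram in contiguous_ngrams(domains):
        ngram_proteins[ngram].add(prot)

# Group the n-grams that occur in exactly the same set of proteins.
groups = defaultdict(list)
for ngram, prots in ngram_proteins.items():
    groups[frozenset(prots)].append(ngram)

# Keep the longest n-gram of each group as its representative.
representatives = sorted(
    max(group, key=lambda ngram: ngram.count("|")) for group in groups.values()
)
n_collapsed = len(ngram_proteins) - len(representatives)

print(representatives)  # ['A|B|C', 'B|C', 'B|C|D']
print(n_collapsed)      # 6
```

In this toy case nine n-grams reduce to three representatives, which mirrors the relationship between the "n-grams" and "Collapsed n-grams" rows of the two summaries.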