Reference Structure Files generation¶

Fetch all reference files for an Interpro_ID of interest¶

[17]:

from CoDIAC import *
import CoDIAC

Uniprot_IDs = ['O14508', 'Q06124'] #SOCS2 and PTPN11 were selected as examples for demonstrating ligand and domain-domain interfaces
Interpro_ID = 'IPR000980' # this is the domain we want to analyze within the Uniprot_IDs
#We will be creating a lot of files, this is how we would like them to be named
data_root = 'Data/'
name_root = 'SH2_'+Interpro_ID

# The files we will make in this process (so that different pieces of code can be run below as needed)
uniprot_reference_file = data_root+'Uniprot_Reference/'+name_root+'_uniprot_reference.csv' # The uniprot reference
fasta_long_header_file = data_root + 'Uniprot_Reference/' + name_root+'_long_header.fasta'
fasta_file = data_root + 'Uniprot_Reference/' + name_root+'.fasta'
#note: in addition to these 3 files, this also makes a mapping file for movng between fasta_long_header_file and fasta_file

#PDB Files we'll make in this process
PDB_file = data_root + 'PDB_Reference/' + name_root + '_PDB.csv'
PDB_file_annotated = data_root+ 'PDB_Reference/' + name_root + '_PDB_annotated.csv'
PDB_file_filtered = data_root + 'PDB_Reference/' + name_root + '_PDB_reference.csv' #The final PDB structure file, containing only filtered proteins

# PTMs feature directory location
feature_dir = data_root+'Uniprot_Reference/Features_relative_to_reference/PTM_features/'

# You can set offsets here for the family, used to reduce the bounds of the N- or C-terminal (systematically across all domains in the family)
N_OFFSET = 0
C_OFFSET = -1

# Set the number of PTMS required to be considered across the domain family for feature files (here low, since we're only considering two proteins)
PTM_THRESHOLD = 1

STEP 2: Make a human reference file of the family of interest¶

(Skipping Step 1, fetch of all proteins, since we’re using targeted Uniprot fetch)

[7]:

# These uniprot_IDs (list) came from fetching all uniprot IDs for an Interpro ID, but
# this could be a fixed list of interest, or even all unique IDs in a proteome
uniprot_df = CoDIAC.UniProt.makeRefFile(Uniprot_IDs, uniprot_reference_file)

Domain Reference File successfully created!
Adding Interpro Domains
Fetching domains..
Appending domains to file..
Interpro metadata succesfully incorporated

STEP 3: Get information about all PDB IDs that exist for the reference proteins of interest¶

[8]:

CoDIAC.PDB.generateStructureRefFile_fromUniprotFile(uniprot_reference_file, PDB_file)

Structure Reference File successfully created!
All PDBs successfully fetched

STEP 4: Annotate the structure file with reference, for domain annotation¶

[9]:

struct_df_out = CoDIAC.IntegrateStructure_Reference.add_reference_info_to_struct_file(PDB_file, uniprot_reference_file, PDB_file_annotated, INTERPRO=True, verbose=False)

STEP 5: Reduce the structure file to just those that contain the domain of interest¶

[10]:

# Now with an appended PDB File, create an output that contains only the lines that have SH2 domains
CoDIAC.IntegrateStructure_Reference.filter_structure_file(PDB_file_annotated, Interpro_ID, PDB_file_filtered)

Made Data/PDB_Reference/SH2_IPR000980_PDB_reference.csv file: 84 structures retained of 94 total

Reference FASTA file generation¶

[11]:

# Given the SH2 domain file, create the fasta reference file (using INTERPRO as default)

CoDIAC.UniProt.print_domain_fasta_file(uniprot_reference_file, Interpro_ID, fasta_long_header_file, N_OFFSET, C_OFFSET, APPEND=False)

# Shortening the fasta headers, still unique for each domain/protein pair
# dropping the redundant information about the domains printed. This creates a shorter header, useful for reading and processing
key_array_order= ['uniprot', 'gene', 'domain_num', 'start', 'end']
#translation creates a mapping file
output_fasta, mapping_file = CoDIAC.UniProt.translate_fasta_to_new_headers(fasta_long_header_file, fasta_file, key_array_order)

n offset is 0 and c offset is -1
Created files: Data/Uniprot_Reference/SH2_IPR000980.fasta and Data/Uniprot_Reference/SH2_IPR000980_mapping.csv

Feature and annotation files generation¶

STEP 7 Create the PTM feature files¶

[18]:

feature_dir_prot = feature_dir+'ProteomeScout/'
ptm_feature_file_list_pscout, ptm_count_dict, ptm_feature_dict, mapping_dict = CoDIAC.PTM.write_PTM_features(Interpro_ID, uniprot_reference_file, feature_dir_prot, mapping_file, N_OFFSET, C_OFFSET, gap_threshold=0.7, num_PTM_threshold = PTM_THRESHOLD)
print("Wrote these feature files:")
print(ptm_feature_file_list_pscout)
print("These belong to the following fasta file:")
print(output_fasta) #comes from block above - the short header format of the fasta header
print(ptm_count_dict)

n offset is 0 and c offset is -1
n offset is 0 and c offset is -1
Wrote these feature files:
['Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/ProteomeScout/IPR000980_Phosphoserine.feature', 'Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/ProteomeScout/IPR000980_Phosphothreonine.feature', 'Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/ProteomeScout/IPR000980_Phosphotyrosine.feature', 'Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/ProteomeScout/IPR000980_N6-acetyllysine.feature', 'Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/ProteomeScout/IPR000980_Ubiquitination.feature']
These belong to the following fasta file:
Data/Uniprot_Reference/SH2_IPR000980.fasta
{'Phosphoserine': 4, 'Phosphothreonine': 2, 'Phosphotyrosine': 5, 'N6-acetyllysine': 2, 'Ubiquitination': 1}

[19]:

feature_dir_psite = feature_dir + 'PHOSPHOSITE_PLUS/'
ptm_feature_file_list_psite, ptm_count_dict, ptm_feature_dict, mapping_dict = CoDIAC.PTM.write_PTM_features(Interpro_ID, uniprot_reference_file, feature_dir_psite, mapping_file, N_OFFSET, C_OFFSET, gap_threshold=0.7, num_PTM_threshold = PTM_THRESHOLD, PHOSPHOSITE_PLUS=True)
print("Wrote these feature files:")
print(ptm_feature_file_list_psite)
print("These belong to the following fasta file:")
print(output_fasta) #comes from block above - the short header format of the fasta header
print(ptm_count_dict)

n offset is 0 and c offset is -1
n offset is 0 and c offset is -1
Wrote these feature files:
['Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/PHOSPHOSITE_PLUS/IPR000980_Phosphoserine.feature', 'Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/PHOSPHOSITE_PLUS/IPR000980_Phosphothreonine.feature', 'Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/PHOSPHOSITE_PLUS/IPR000980_Phosphotyrosine.feature', 'Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/PHOSPHOSITE_PLUS/IPR000980_Ubiquitination.feature', 'Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/PHOSPHOSITE_PLUS/IPR000980_Acetylation.feature']
These belong to the following fasta file:
Data/Uniprot_Reference/SH2_IPR000980.fasta
{'Phosphoserine': 6, 'Phosphothreonine': 2, 'Phosphotyrosine': 5, 'Ubiquitination': 4, 'Acetylation': 2}

Step 8 combine feature files from ProteomeScout and PhosphoSitePlus and generate annotation tracks.¶

#paired list

[21]:

pairs = {}

proteomescout_base = feature_dir_prot+'/IPR000980_'
PSP_base = feature_dir_psite+'/IPR000980_'
pairs['N6-acetyllysine'] = [proteomescout_base+'N6-acetyllysine.feature', PSP_base+'Acetylation.feature']
pairs['Phosphoserine'] = [proteomescout_base+'Phosphoserine.feature', PSP_base+'Phosphoserine.feature']
pairs['Phosphothreonine'] = [proteomescout_base+'Phosphothreonine.feature', PSP_base+'Phosphothreonine.feature']
pairs['Phosphotyrosine'] = [proteomescout_base+'Phosphotyrosine.feature', PSP_base+'Phosphotyrosine.feature']
pairs['Ubiquitination'] = [proteomescout_base+'Ubiquitination.feature', PSP_base+'Ubiquitination.feature']

output_dir = feature_dir+'Combined/'
new_feature_files = {}

for mod in pairs.keys():
    feature_file = output_dir+'SH2_IPR000980_'+mod+'.feature'
    feature_combined, feature_color_dict = CoDIAC.jalviewFunctions.combine_feature_files(feature_file, pairs[mod])
    new_feature_files[mod] = feature_file

Created Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/Combined/SH2_IPR000980_N6-acetyllysine.feature
Created Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/Combined/SH2_IPR000980_Phosphoserine.feature
Created Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/Combined/SH2_IPR000980_Phosphothreonine.feature
Created Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/Combined/SH2_IPR000980_Phosphotyrosine.feature
Created Data/Uniprot_Reference/Features_relative_to_reference/PTM_features/Combined/SH2_IPR000980_Ubiquitination.feature