API Reference
The “Config” Module
- proteomeScoutAPI.config.update_configuration(dataset_dir=None, version=None, update=None)
Update the configuration file for ProteomeScoutAPI.
- Parameters:
- dataset_dir: str, optional
Path to the dataset directory.
- version: int, optional
Version number of the dataset.
- update: bool, optional
Whether to force an update of the dataset if one is available or if a different version is requested.
Interfacing with the API
The “ProteomeScoutAPI” Class
- class proteomeScoutAPI.api.ProteomeScoutAPI(version=None, update=True)
Bases: object
Class for interacting with ProteomeScout flat files.
- Parameters:
- version: int, optional
Version number of the dataset to use. Defaults to the version specified in the configuration. If None, uses the latest version.
- update: bool, optional
Whether to check for updates to the dataset upon initialization. Defaults to the value specified in the configuration (True unless changed).
- Attributes:
- database: dict
Internal database storing ProteomeScout data.
- uniqueKeys: list
List of unique accession numbers in the dataset.
- version: int
The version number of the dataset being used.
Methods
check_for_updates([update]): Check if a newer version of the ProteomeScout dataset is available on FigShare.
download_data([version]): Retrieves the ProteomeScout data files that accompany this version release from FigShare.
get_GO(ID): Return all GO terms associated with the ID in question.
get_PTMs(ID[, output_format]): Return all PTMs associated with the ID in question.
get_PTMs_withEvidence(ID): Return PTMs with their associated evidence information.
get_Scansite(ID): DEPRECATED: Scansite predictions have been removed from the new data format.
get_Scansite_byPos(ID, res_pos): DEPRECATED: Scansite predictions have been removed from the new data format.
get_accessions(ID): Return all accession numbers associated with the ID.
get_all_protein_info(ID): Return all available information for the ID as a dictionary.
get_annotated_PTMs(ID): Given a UniProt ID, return a table of PTMs with annotations about whether they fall within domains, structures, or macro-molecular structures.
get_domains(ID[, domain_type, output_format]): Return all domains associated with the ID in question.
get_evidence(ID): Return evidence scores associated with PTMs for the ID.
get_gene_name(ID): Return the gene_name of an ID.
get_macro_molecular(ID[, output_format]): Return all macro-molecular structures associated with the ID in question.
get_nearbyPTMs(ID, pos, window[, output_format]): Return all PTMs within a specified window of a given position.
get_phosphosites(ID[, output_format]): Return all phosphosites (S/T/Y phosphorylation) associated with the ID.
get_sequence(ID): Return the sequence associated with the ID.
get_species(ID): Return the species associated with the ID.
get_structure(ID[, output_format]): Return all structures associated with the ID in question.
return_species_nr_uniprot_ids(): Return a dictionary with species as keys and number of unique uniprot IDs as values.
update_to_latest(): Download and update to the latest version of the ProteomeScout dataset from FigShare.
get_region
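As a usage sketch, several of the accessor methods listed above can be combined into a small per-protein summary. The helper name `summarize_protein` and the stand-in class `_StubAPI` are hypothetical and exist only so the snippet runs without downloading the dataset; the stub's return values and the accession `P00000` are made-up placeholders. With the package installed, an instance of `proteomeScoutAPI.api.ProteomeScoutAPI` would presumably be passed in place of the stub.

```python
def summarize_protein(api, acc):
    """Collect a few documented lookups for one accession into a dict.

    `api` is any object exposing get_gene_name, get_species, get_sequence
    and get_PTMs with the signatures documented in this reference.
    """
    ptms = api.get_PTMs(acc)  # default output_format='list'
    if not isinstance(ptms, list):
        # Several methods list `int` as a possible return type; treat a
        # non-list result as "no PTMs available" (an assumption).
        ptms = []
    sequence = api.get_sequence(acc)
    return {
        "accession": acc,
        "gene_name": api.get_gene_name(acc),
        "species": api.get_species(acc),
        "length": len(sequence),
        "n_ptms": len(ptms),
    }


class _StubAPI:
    """Hypothetical stand-in that mimics the documented return shapes."""

    def get_gene_name(self, ID):
        return "GENE1"  # placeholder value

    def get_species(self, ID):
        return "example species"  # placeholder value

    def get_sequence(self, ID):
        return "MSTYK"  # placeholder sequence

    def get_PTMs(self, ID, output_format="list"):
        # (position, residue, modification-type) tuples, as documented
        return [(2, "S", "Phosphoserine"), (4, "Y", "Phosphotyrosine")]


summary = summarize_protein(_StubAPI(), "P00000")
print(summary)
```

Passing the API object as an argument keeps the helper independent of how the dataset is configured or downloaded.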
- check_for_updates(update=False)
Check if a newer version of the ProteomeScout dataset is available on FigShare.
- Parameters:
- update: bool, optional
Whether to update the dataset if a newer version is available. Defaults to False.
- download_data(version=None)
Retrieves the ProteomeScout data files that accompany this version release from FigShare. Downloads and decompresses the files into a “ProteomeScout_Dataset” folder in self.dataset_dir.
- Parameters:
- version: int, optional
Version number of the dataset to download from FigShare. If None, uses the latest version.
- get_GO(ID)
Return all GO terms associated with the ID in question.
- Parameters:
- ID: str
SwissProt accession number
- Returns:
- list of tuples or int
One tuple for each GO term associated with the ID, in the form (GO_term, type), where type is ‘F’ (Molecular Function), ‘P’ (Biological Process), or ‘C’ (Cellular Component). Returns an empty list if no GO terms are found.
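The (GO_term, type) tuples described above can be grouped by ontology for display. The following is a minimal sketch that works on any list of such tuples; the function name `group_go_terms` and the sample terms are hypothetical placeholders, not values taken from the dataset.

```python
from collections import defaultdict

# Mapping of the documented single-letter type codes to their names.
GO_TYPE_NAMES = {
    "F": "Molecular Function",
    "P": "Biological Process",
    "C": "Cellular Component",
}


def group_go_terms(go_terms):
    """Group (GO_term, type) tuples into {type name: [GO_term, ...]}."""
    grouped = defaultdict(list)
    for term, go_type in go_terms:
        # Fall back to the raw code if it is not one of F/P/C.
        grouped[GO_TYPE_NAMES.get(go_type, go_type)].append(term)
    return dict(grouped)


# Placeholder input in the documented shape.
example_terms = [("term A", "F"), ("term B", "P"), ("term C", "P")]
grouped_terms = group_go_terms(example_terms)
print(grouped_terms)
```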
- get_PTMs(ID, output_format='list')
Return all PTMs associated with the ID in question.
- Parameters:
- ID: str
SwissProt accession number
- output_format: str, optional
Format of the output (‘list’ or ‘table’). Defaults to ‘list’.
- Returns:
- list of tuples, pd.DataFrame, or int
One tuple for each PTM associated with the ID: (position, residue, modification-type). If output_format is ‘table’, returns a pandas DataFrame with columns [‘Position’, ‘Residue’, ‘Modification_Type’]. Returns an empty list if no modifications are found.
- get_PTMs_withEvidence(ID)
Return PTMs with their associated evidence information.
- Parameters:
- ID: str
SwissProt accession number
- Returns:
- tuple or int
Tuple containing (mods, evidence), where mods is a list of PTMs and evidence is the associated evidence information.
- get_Scansite(ID)
DEPRECATED: Scansite predictions have been removed from the new data format. This method is kept for backward compatibility but will return an empty list.
- Returns:
- empty list
- get_Scansite_byPos(ID, res_pos)
DEPRECATED: Scansite predictions have been removed from the new data format. This method is kept for backward compatibility but will return an empty dictionary.
- Returns:
- empty dictionary
- get_accessions(ID)
Return all accession numbers associated with the ID.
- Parameters:
- ID: str
SwissProt accession number
- Returns:
- list of str or int
List of accession IDs associated with the SwissProt accession.
- get_all_protein_info(ID)
Return all available information for the ID as a dictionary.
- Parameters:
- ID: str
SwissProt accession number
- Returns:
- dict or int
Dictionary containing all available information for the ID.
- get_annotated_PTMs(ID)
Given a UniProt ID, return a table of PTMs with annotations about whether they fall within domains, structures, or macro-molecular structures.
- Parameters:
- ID: str
SwissProt accession number
- Returns:
- pd.DataFrame or int
DataFrame with all PTMs associated with the ID, along with columns indicating the domain names (InterPro and UniProt), structures, and macro-molecular structures that contain those PTMs.
- get_domains(ID, domain_type=None, output_format='list')
Return all domains associated with the ID in question. For InterPro domains, domain_type is ‘interpro’; for UniProt domains, domain_type is ‘uniprot’. If domain_type is not specified, returns a dictionary with both.
- Parameters:
- ID: str
SwissProt accession number
- domain_type: str, optional
Type of domain to retrieve (‘interpro’ or ‘uniprot’). If None, returns both types.
- output_format: str, optional
Format of the output (‘list’ or ‘table’). Defaults to ‘list’.
- Returns:
- list of tuples, dict, int, or pandas.DataFrame
One tuple for each domain associated with the ID: (domain_name, start_position, end_position, domain_id). InterPro domains include the interpro_id as the last element in the tuple; UniProt domains have None as the last element. If output_format is ‘table’, returns a pandas DataFrame with columns [‘Domain_Name’, ‘Start_Position’, ‘End_Position’, ‘Domain_ID’] instead of a list.
Returns an empty list if no domains are found.
Returns a dictionary with ‘interpro’ and ‘uniprot’ keys if domain_type is not specified.
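Because the return shape depends on domain_type (a dictionary when it is omitted, a list otherwise), downstream code may want to normalize both cases. The sketch below assumes only the shapes documented above; the helper name `flatten_domains` and the sample values are hypothetical.

```python
def flatten_domains(result):
    """Normalize a get_domains-style result into a flat list of
    (source, domain_name, start_position, end_position, domain_id) tuples.

    `source` is 'interpro' or 'uniprot' for dictionary input, and None for
    list input, where the source is not recorded in the tuples themselves.
    """
    if isinstance(result, dict):
        pairs = [
            (source, domain)
            for source in ("interpro", "uniprot")
            for domain in result.get(source, [])
        ]
    elif isinstance(result, list):
        pairs = [(None, domain) for domain in result]
    else:
        # e.g. an int return value; treated here as "no domains" (assumption)
        return []
    return [
        (source, name, start, end, domain_id)
        for source, (name, start, end, domain_id) in pairs
    ]


# Placeholder data in the documented shapes.
both_sources = {
    "interpro": [("Domain A", 10, 100, "ID_PLACEHOLDER")],
    "uniprot": [("Domain A", 12, 98, None)],
}
flat = flatten_domains(both_sources)
print(flat)
```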
- get_evidence(ID)
Return evidence scores associated with PTMs for the ID.
- Parameters:
- ID: str
SwissProt accession number
- Returns:
- str
Evidence scores associated with the ID
- get_gene_name(ID)
Return the gene_name of an ID.
- Parameters:
- ID: str
SwissProt accession number
- Returns:
- tuple
Gene name associated with the ID
- get_macro_molecular(ID, output_format='list')
Return all macro-molecular structures associated with the ID in question.
- Returns:
- list of tuples, pd.DataFrame, or int
One tuple for each macro-molecular structure associated with the ID: (structure_name, start_position, end_position). If output_format is ‘table’, returns a pandas DataFrame with columns [‘Macro_Name’, ‘Start_Position’, ‘End_Position’]. Returns an empty list if no macro-molecular structures are found.
- get_nearbyPTMs(ID, pos, window, output_format='list')
Return all PTMs within a specified window of a given position.
- Parameters:
- ID: str
SwissProt accession number
- pos: int
Position in protein to search around
- window: int
Number of residues upstream and downstream to include in the search
- output_format: str, optional
Format of the output (‘list’ or ‘table’). Defaults to ‘list’.
- Returns:
- list of tuples or int
One tuple for each PTM within the specified window: (position, residue, modification-type). If output_format is ‘table’, returns a pandas DataFrame with columns [‘Position’, ‘Residue’, ‘Modification_Type’]. Returns an empty list if no modifications are found within the window.
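The window search can be pictured as a simple filter over (position, residue, modification-type) tuples. This is a sketch of that idea rather than the library's implementation: it assumes integer positions and an inclusive window on both sides, neither of which this reference confirms, and the name `nearby_ptms` and the sample data are hypothetical.

```python
def nearby_ptms(ptms, pos, window):
    """Keep PTM tuples whose position lies within `window` residues
    upstream or downstream of `pos` (inclusive on both ends)."""
    return [ptm for ptm in ptms if abs(ptm[0] - pos) <= window]


# Placeholder PTMs in the documented tuple shape.
example_ptms = [
    (5, "S", "Phosphoserine"),
    (10, "Y", "Phosphotyrosine"),
    (25, "K", "Acetylation"),
]
hits = nearby_ptms(example_ptms, pos=8, window=3)
print(hits)
```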
- get_phosphosites(ID, output_format='list')
Return all phosphosites (S/T/Y phosphorylation) associated with the ID.
- Parameters:
- ID: str
SwissProt accession number
- Returns:
- list of tuples or int
One tuple for each phosphosite associated with the ID: (position, residue, modification-type). Returns an empty list if no phosphosites are found.
- get_region(position, regions)
- get_sequence(ID)
Return the sequence associated with the ID.
- Parameters:
- ID: str
SwissProt accession number
- Returns:
- str
Sequence associated with the ID
- get_species(ID)
Return the species associated with the ID.
- Parameters:
- ID: str
SwissProt accession number
- Returns:
- str
Species associated with the ID
- get_structure(ID, output_format='list')
Return all structures associated with the ID in question.
- Parameters:
- ID: str
SwissProt accession number
- output_format: str, optional
Format of the output (‘list’ or ‘table’). Defaults to ‘list’.
- Returns:
- list of tuples, pd.DataFrame, or int
One tuple for each structure associated with the ID: (domain_name, start_position, end_position). If output_format is ‘table’, returns a pandas DataFrame with columns [‘Structure_Name’, ‘Start_Position’, ‘End_Position’]. Returns an empty list or DataFrame if no structures are found.
- return_species_nr_uniprot_ids()
Return a dictionary with species as keys and number of unique uniprot IDs as values. Keeps only those records that are flagged as uniprot IDs in the non-redundant list.
- Returns:
- species_dict: dict
Dictionary with species names as keys and lists of uniprot IDs as values. {species_name: [list of uniprot IDs]}
- species_reference_bool: dict
Dictionary indicating whether each species is a reference species and contains all protein IDs (True/False). If False, only non-redundant uniprot records with PTMs are included in the list.
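The summary line above speaks of counts, while the Returns section describes lists of IDs per species. If the lists are what is returned, the counts can be derived in one step. The sketch below assumes a species_dict shaped as documented; the helper name `count_ids_per_species` and the sample entries are hypothetical placeholders.

```python
def count_ids_per_species(species_dict):
    """Turn {species_name: [uniprot IDs]} into {species_name: unique ID count}."""
    return {species: len(set(ids)) for species, ids in species_dict.items()}


# Placeholder input in the documented shape.
example_species_dict = {
    "species A": ["ID1", "ID2", "ID2"],
    "species B": ["ID3"],
}
counts = count_ids_per_species(example_species_dict)
print(counts)
```

Using a set guards against repeated IDs, although the non-redundant list presumably makes that unnecessary.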
- update_to_latest()
Download and update to the latest version of the ProteomeScout dataset from FigShare.
The “ProteomicDataset” Class
- class proteomeScoutAPI.api.ProteomicDataset(dataset, accession_col='acc', peptide_col='pep', find_site=True, domain_source='interpro', GO_terms=True)
Bases: ProteomeScoutAPI
Class for annotating phosphoproteomic datasets with gene-level and site-specific information. Inherits methods from ProteomeScoutAPI.
- Parameters:
- dataset: pd.DataFrame
phosphoproteomic dataset to annotate
- accession_col: str
column name containing SwissProt accessions
- peptide_col: str
column name containing formatted peptides, with modification sites lowercased
- find_site: bool
whether to find modification sites within protein (based on which residues are lowercased in peptide)
- domain_source: str
source of domain annotations (‘interpro’ or ‘uniprot’)
- GO_terms: bool
whether to include GO term annotations
Methods
annotate_dataset(): Given a proteomic dataset in self.dataset, annotate each row with information from ProteomeScout.
annotate_peptide(accession, peptide): Given a SwissProt accession and peptide sequence, annotate with gene-level and site-specific information from ProteomeScout.
check_for_updates([update]): Check if a newer version of the ProteomeScout dataset is available on FigShare.
check_phosphosites(accessions, positions): Check if positions are documented phosphosites in ProteomeScout for the given accessions.
download_data([version]): Retrieves the ProteomeScout data files that accompany this version release from FigShare.
get_GO(ID): Return all GO terms associated with the ID in question.
get_PTMs(ID[, output_format]): Return all PTMs associated with the ID in question.
get_PTMs_withEvidence(ID): Return PTMs with their associated evidence information.
get_Scansite(ID): DEPRECATED: Scansite predictions have been removed from the new data format.
get_Scansite_byPos(ID, res_pos): DEPRECATED: Scansite predictions have been removed from the new data format.
get_accessions(ID): Return all accession numbers associated with the ID.
get_all_protein_info(ID): Return all available information for the ID as a dictionary.
get_annotated_PTMs(ID): Given a UniProt ID, return a table of PTMs with annotations about whether they fall within domains, structures, or macro-molecular structures.
get_domains(ID[, domain_type, output_format]): Return all domains associated with the ID in question.
get_domains_with_site(domains, positions): Check if positions are within any domains.
get_evidence(ID): Return evidence scores associated with PTMs for the ID.
get_gene_name(ID): Return the gene_name of an ID.
get_macro_molecular(ID[, output_format]): Return all macro-molecular structures associated with the ID in question.
get_macro_with_site(macro_mol, positions): Check if positions are within any macro-molecular structures.
get_nearbyPTMs(ID, pos, window[, output_format]): Return all PTMs within a specified window of a given position.
get_phosphosites(ID[, output_format]): Return all phosphosites (S/T/Y phosphorylation) associated with the ID.
get_sequence(ID): Return the sequence associated with the ID.
get_species(ID): Return the species associated with the ID.
get_structure(ID[, output_format]): Return all structures associated with the ID in question.
return_species_nr_uniprot_ids(): Return a dictionary with species as keys and number of unique uniprot IDs as values.
update_to_latest(): Download and update to the latest version of the ProteomeScout dataset from FigShare.
get_region
- annotate_dataset()
Given a proteomic dataset in self.dataset, annotate each row with information from ProteomeScout.
- annotate_peptide(accession, peptide)
Given a SwissProt accession and peptide sequence, annotate with gene-level and site-specific information from ProteomeScout.
- Parameters:
- accession: str
SwissProt accession number
- peptide: str
Peptide sequence with modification sites lowercased
- Returns:
- dict or int
Dictionary containing annotation information:
- ‘gene_name’: Gene name associated with the protein accession.
- ‘domains’: Semicolon-separated string of domain names associated with the protein.
- ‘domain_architecture’: String representation of the domain architecture (order of domains).
- ‘GO_terms’: Semicolon-separated string of GO terms associated with the protein.
If find_site is True, additional keys are included:
- ‘modification_sites’: Semicolon-separated string of modification sites found in the peptide.
- ‘aligned_peps’: Aligned peptide sequences found in the protein sequence.
- ‘documented_phosphosites’: Semicolon-separated string indicating whether each modification site is documented (1) or not (0).
- ‘site_in_domain’: Semicolon-separated string of domain names that contain the modification sites.
- ‘site_in_macro’: Semicolon-separated string of macro-molecular structure names that contain the modification sites.
Returns -1 if unable to find the accession in the database.
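The peptide convention used here (modified residues written in lowercase) can be illustrated with a short site-finding sketch. The library's own alignment logic is not shown in this reference, so this is only one plausible reading: it matches the uppercased peptide against the protein sequence, uses the first occurrence only, and reports 1-based positions. The function name `find_modification_sites`, the "Y5"-style labels, and the sample sequence are hypothetical.

```python
def find_modification_sites(sequence, peptide):
    """Return labels such as 'Y5' for each lowercased residue in `peptide`,
    using 1-based positions in `sequence`. Returns an empty list if the
    peptide does not occur in the sequence."""
    start = sequence.find(peptide.upper())  # first exact match only
    if start == -1:
        return []
    return [
        f"{residue.upper()}{start + offset + 1}"
        for offset, residue in enumerate(peptide)
        if residue.islower()
    ]


# Placeholder sequence and peptide; 'y' marks the modified residue.
sites = find_modification_sites("MKTAYSPLR", "TAySPL")
print(";".join(sites))
```

Joining the labels with semicolons mirrors the semicolon-separated strings described for the returned dictionary.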
- check_phosphosites(accessions, positions)
Check if positions are documented phosphosites in ProteomeScout for the given accessions.
- Parameters:
- accessions: str
SwissProt accession number
- positions: list of int
List of positions to check
- Returns:
- str
Semicolon-separated string of 1s and 0s indicating whether each position is a documented phosphosite (1) or not (0)
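The return format can be illustrated with a small stand-alone sketch that flags each queried position against a known set of phosphosite positions. The helper name `flag_documented_sites` and the sample positions are hypothetical; in practice the documented positions would presumably come from a lookup such as get_phosphosites.

```python
def flag_documented_sites(documented_positions, positions):
    """Return a semicolon-separated string with '1' for each queried
    position that is in `documented_positions` and '0' otherwise."""
    documented = set(documented_positions)
    return ";".join("1" if pos in documented else "0" for pos in positions)


# Placeholder positions: 5 and 10 are treated as documented phosphosites.
flags = flag_documented_sites([5, 10], [5, 7, 10])
print(flags)
```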
- get_domains_with_site(domains, positions)
Check if positions are within any domains.
- Parameters:
- domains: list of tuples
List of domains (domain_name, start_position, end_position, domain_id)
- positions: list of int
List of positions to check
- Returns:
- str
Semicolon-separated string of domain names that contain the positions
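The containment check amounts to an interval test on each domain tuple. The sketch below assumes inclusive start and end positions and reports each matching domain name once, in domain order; how the library treats boundaries, repeated names, or positions outside every domain is not stated in this reference. The name `domains_with_site` and the sample domains are hypothetical.

```python
def domains_with_site(domains, positions):
    """Return a semicolon-separated string of the names of domains that
    contain at least one of `positions` (bounds treated as inclusive)."""
    hits = []
    for name, start, end, _domain_id in domains:
        contains_site = any(start <= pos <= end for pos in positions)
        if contains_site and name not in hits:
            hits.append(name)
    return ";".join(hits)


# Placeholder domains in the documented 4-tuple shape.
example_domains = [
    ("Domain A", 10, 50, "ID_PLACEHOLDER"),
    ("Domain B", 60, 120, None),
]
print(domains_with_site(example_domains, [15, 100]))
```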
- get_macro_with_site(macro_mol, positions)
Check if positions are within any macro-molecular structures.
- Parameters:
- macro_mol: list of tuples
List of macro-molecular structures (macro_name, start_position, end_position)
- positions: list of int
List of positions to check
- Returns:
- str
Semicolon-separated string of macro-molecular structure names that contain the positions