API Reference#

The “Config” Module#

proteomeScoutAPI.config.update_configuration(dataset_dir=None, version=None, update=None)#

Update configuration file for ProteomeScoutAPI.

Parameters:
dataset_dir: str, optional

Path to the dataset directory.

version: int, optional

Version number of the dataset.

update: bool, optional

Whether to force an update of the dataset if one is available or if a different version is requested.

Interfacing with the API#

The “ProteomeScoutAPI” Class#

class proteomeScoutAPI.api.ProteomeScoutAPI(version=None, update=True)#

Bases: object

Class for interacting with ProteomeScout flat files.

Parameters:
version: int, optional

Version number of the dataset to use. Defaults to the version specified in the configuration. If None, uses the latest version.

update: bool, optional

Whether to check for updates to the dataset upon initialization. Defaults to the value specified in the configuration (True unless changed).

Attributes:
database: dict

Internal database storing ProteomeScout data.

uniqueKeys: list

List of unique accession numbers in the dataset.

version: int

The version number of the dataset being used.

Methods

check_for_updates([update])

Check if a newer version of the ProteomeScout dataset is available on FigShare.

download_data([version])

Retrieve the ProteomeScout data files that accompany this version release from FigShare.

get_GO(ID)

Return all GO terms associated with the ID in question

get_PTMs(ID[, output_format])

Return all PTMs associated with the ID in question.

get_PTMs_withEvidence(ID)

Return PTMs with their associated evidence information

get_Scansite(ID)

DEPRECATED: Scansite predictions have been removed from the new data format.

get_Scansite_byPos(ID, res_pos)

DEPRECATED: Scansite predictions have been removed from the new data format.

get_accessions(ID)

Return all accession numbers associated with the ID

get_all_protein_info(ID)

Return all available information for the ID as a dictionary

get_annotated_PTMs(ID)

Given a UniProt ID, return a table of PTMs with annotations about whether they fall within domains, structures, or macro-molecular structures.

get_domains(ID[, domain_type, output_format])

Return all domains associated with the ID in question.

get_evidence(ID)

Return evidence scores associated with PTMs for the ID

get_gene_name(ID)

Return the gene_name of an ID

get_macro_molecular(ID[, output_format])

Return all macro-molecular structures associated with the ID in question.

get_nearbyPTMs(ID, pos, window[, output_format])

Return all PTMs within a specified window of a given position

get_phosphosites(ID[, output_format])

Return all phosphosites (S/T/Y phosphorylation) associated with the ID

get_sequence(ID)

Return the sequence associated with the ID

get_species(ID)

Return the species associated with the ID

get_structure(ID[, output_format])

Return all structures associated with the ID in question.

return_species_nr_uniprot_ids()

Return a dictionary with species as keys and number of unique uniprot IDs as values.

update_to_latest()

Download and update to the latest version of the ProteomeScout dataset from FigShare.

get_region

check_for_updates(update=False)#

Check if a newer version of the ProteomeScout dataset is available on FigShare.

Parameters:
update: bool, optional

Whether to update the dataset if a newer version is available. Defaults to False.

download_data(version=None)#

Retrieve the ProteomeScout data files that accompany this version release from FigShare. The files are downloaded and decompressed into a “ProteomeScout_Dataset” folder inside self.dataset_dir.

Parameters:
version: int, optional

Version number of the dataset to download from FigShare. If None, uses the latest version.

get_GO(ID)#

Return all GO terms associated with the ID in question

Parameters:
ID: str

SwissProt accession number

Returns:
list of tuples or int

tuple for each GO term associated with the ID (GO_term, type). type is ‘F’ (Molecular Function), ‘P’ (Biological Process), or ‘C’ (Cellular Component). Returns an empty list if no GO terms are found.
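The (GO_term, type) tuples described above can be grouped by ontology. The following is a minimal sketch that works on a hand-written sample rather than on real output: the sample terms and the helper name `group_go_terms` are assumptions made for illustration and are not part of the library.

```python
from collections import defaultdict

# Mapping of the documented type codes to their ontology names.
GO_TYPE_NAMES = {
    "F": "Molecular Function",
    "P": "Biological Process",
    "C": "Cellular Component",
}

# Hypothetical sample shaped like the documented return value: (GO_term, type).
sample_go = [
    ("protein kinase activity", "F"),
    ("ATP binding", "F"),
    ("protein phosphorylation", "P"),
    ("plasma membrane", "C"),
]


def group_go_terms(go_terms):
    """Group (GO_term, type) tuples by ontology name (hypothetical helper)."""
    grouped = defaultdict(list)
    for term, go_type in go_terms:
        # Fall back to the raw code if an unexpected type appears.
        grouped[GO_TYPE_NAMES.get(go_type, go_type)].append(term)
    return dict(grouped)


grouped_go = group_go_terms(sample_go)
```

An empty list, which is the documented result when no GO terms are found, yields an empty dictionary.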

get_PTMs(ID, output_format='list')#

Return all PTMs associated with the ID in question.

Parameters:
ID: str

SwissProt accession number

output_format: str, optional

Format of the output (‘list’ or ‘table’). Defaults to ‘list’.

Returns:
list of tuples, pd.DataFrame, or int

Tuple for each PTM associated with the ID (position, residue, modification-type). If output_format is ‘table’, returns a pandas DataFrame with columns [‘Position’, ‘Residue’, ‘Modification_Type’]. Returns an empty list if no modifications are found.
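As an illustration of working with the ‘list’ output, the sketch below tallies modification types and builds short site labels from (position, residue, modification-type) tuples. The sample data is invented, and both helper names are hypothetical. Whether positions arrive as integers or strings is not stated here, so the code converts them with int() as a precaution.

```python
from collections import Counter

# Hypothetical sample shaped like the documented 'list' output.
sample_ptms = [
    (15, "S", "Phosphoserine"),
    (20, "Y", "Phosphotyrosine"),
    (33, "K", "N6-acetyllysine"),
    (48, "S", "Phosphoserine"),
]


def count_by_type(ptms):
    """Count how many PTMs of each modification type are present."""
    return Counter(mod_type for _pos, _res, mod_type in ptms)


def site_labels(ptms):
    """Return labels such as 'S15', ordered by position."""
    ordered = sorted(ptms, key=lambda ptm: int(ptm[0]))
    return [f"{res}{int(pos)}" for pos, res, _mod in ordered]


type_counts = count_by_type(sample_ptms)
labels = site_labels(sample_ptms)
```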

get_PTMs_withEvidence(ID)#

Return PTMs with their associated evidence information

Parameters:
ID: str

SwissProt accession number

Returns:
tuple or int

Tuple containing (mods, evidence) where mods is a list of PTMs and evidence is the associated evidence information.

get_Scansite(ID)#

DEPRECATED: Scansite predictions have been removed from the new data format. This method is kept for backward compatibility but will return an empty list.

Returns:
empty list

get_Scansite_byPos(ID, res_pos)#

DEPRECATED: Scansite predictions have been removed from the new data format. This method is kept for backward compatibility but will return an empty dictionary.

Returns:
empty dictionary

get_accessions(ID)#

Return all accession numbers associated with the ID

Parameters:
ID: str

SwissProt accession number

Returns:
list of str or int

List of accession IDs associated with the SwissProt accession.

get_all_protein_info(ID)#

Return all available information for the ID as a dictionary

Parameters:
ID: str

SwissProt accession number

Returns:
dict or int

Dictionary containing all available information for the ID.

get_annotated_PTMs(ID)#

Given a UniProt ID, return a table of PTMs with annotations about whether they fall within domains, structures, or macro-molecular structures.

Parameters:
ID: str

SwissProt accession number

Returns:
pd.DataFrame or int

DataFrame with all PTMs associated with the ID, along with columns indicating the domain names (InterPro and UniProt), structures, and macro-molecular structures that contain those PTMs.

get_domains(ID, domain_type=None, output_format='list')#

Return all domains associated with the ID in question. For InterPro domains, set domain_type to ‘interpro’; for UniProt domains, set it to ‘uniprot’. If domain_type is not specified, returns a dictionary with both.

Parameters:
ID: str

SwissProt accession number

domain_type: str, optional

Type of domain to retrieve (‘interpro’ or ‘uniprot’). If None, returns both types.

output_format: str, optional

Format of the output (‘list’ or ‘table’). Defaults to ‘list’.

Returns:
list of tuples, dict, int, or pandas.DataFrame

tuples for each domain associated with the ID (domain_name, start_position, end_position, domain_id). Interpro domains include the interpro_id as the last element in the tuple; uniprot domains have None as the last element. If output_format is ‘table’, returns a pandas DataFrame with columns [‘Domain_Name’, ‘Start_Position’, ‘End_Position’, ‘Domain_ID’] instead of list.

Returns an empty list if no domains are found.

Returns a dictionary with ‘interpro’ and ‘uniprot’ keys if domain_type is not specified.
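Because the return shape depends on domain_type, downstream code may want to normalize it. The sketch below is one way to do that under stated assumptions: `flatten_domains` is a hypothetical helper, the sample domains are illustrative, and treating any other return value (such as the int mentioned above) as "no domains" is a guess rather than documented behavior.

```python
def flatten_domains(result):
    """Normalize a get_domains-style result into (source, name, start, end, id) tuples.

    Accepts either the dictionary with 'interpro' and 'uniprot' keys or a
    plain list of domain tuples. For a plain list the source is unknown, so
    None is used in its place.
    """
    if isinstance(result, dict):
        return [
            (source, *domain)
            for source, domains in result.items()
            for domain in domains
        ]
    if isinstance(result, list):
        return [(None, *domain) for domain in result]
    # Assumption: anything else (for example an int) means nothing usable came back.
    return []


# Hypothetical sample shaped like the documented dictionary output.
sample_domains = {
    "interpro": [("SH2", 10, 100, "IPR000980")],
    "uniprot": [("SH2", 12, 98, None)],
}

flat = flatten_domains(sample_domains)
```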

get_evidence(ID)#

Return evidence scores associated with PTMs for the ID

Parameters:
ID: str

SwissProt accession number

Returns:
str

Evidence scores associated with the ID

get_gene_name(ID)#

Return the gene_name of an ID

Parameters:
ID: str

SwissProt accession number

Returns:
tuple

Gene name associated with the ID

get_macro_molecular(ID, output_format='list')#

Return all macro-molecular structures associated with the ID in question.

Returns:
list of tuples, pd.DataFrame, or int

Tuple for each macro-molecular structure associated with the ID (structure_name, start_position, end_position). If output_format is ‘table’, returns a pandas DataFrame with columns [‘Macro_Name’, ‘Start_Position’, ‘End_Position’]. Returns an empty list if no macro-molecular structures are found.

get_nearbyPTMs(ID, pos, window, output_format='list')#

Return all PTMs within a specified window of a given position

Parameters:
ID: str

SwissProt accession number

pos: int

Position in protein to search around

window: int

Number of residues upstream and downstream to include in the search

output_format: str, optional

Format of the output (‘list’ or ‘table’). Defaults to ‘list’.

Returns:
list of tuples, pd.DataFrame, or int

tuple for each PTM within the specified window (position, residue, modification-type). If output_format is ‘table’, returns a pandas DataFrame with columns [‘Position’, ‘Residue’, ‘Modification_Type’]. Returns an empty list if no modifications are found within the window.
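The window search can be pictured as a simple range filter over PTM tuples. The sketch below is not the library's implementation. It assumes the window is inclusive on both ends and that a PTM at pos itself is kept, neither of which is stated above. `ptms_in_window` and the sample data are hypothetical.

```python
def ptms_in_window(ptms, pos, window):
    """Keep PTMs whose position lies within pos - window .. pos + window.

    Assumption: both bounds are inclusive and a PTM at `pos` is retained.
    """
    lower, upper = pos - window, pos + window
    return [ptm for ptm in ptms if lower <= int(ptm[0]) <= upper]


# Hypothetical sample shaped like (position, residue, modification-type).
window_sample = [
    (15, "S", "Phosphoserine"),
    (20, "Y", "Phosphotyrosine"),
    (33, "K", "N6-acetyllysine"),
]

nearby = ptms_in_window(window_sample, pos=18, window=5)
```

With this sample, a window of 5 around position 18 covers positions 13 through 23, so the PTMs at 15 and 20 are kept and the one at 33 is dropped.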

get_phosphosites(ID, output_format='list')#

Return all phosphosites (S/T/Y phosphorylation) associated with the ID

Parameters:
ID: str

SwissProt accession number

Returns:
list of tuples or int

tuple for each phosphosite associated with the ID (position, residue, modification-type). Returns an empty list if no phosphosites are found.
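Conceptually, phosphosites are the subset of PTMs on S, T, or Y residues that carry a phosphorylation. The sketch below filters a hand-written PTM list in that spirit. It assumes that phosphorylation entries contain the substring "phospho" in their modification-type string, which is a guess about the naming and not something this reference confirms. `filter_phosphosites` is a hypothetical helper.

```python
PHOSPHO_RESIDUES = {"S", "T", "Y"}


def filter_phosphosites(ptms):
    """Select S/T/Y PTMs whose type looks like a phosphorylation.

    Assumption: the modification-type string contains 'phospho'
    (case-insensitive) for phosphorylation events.
    """
    return [
        (pos, res, mod_type)
        for pos, res, mod_type in ptms
        if res.upper() in PHOSPHO_RESIDUES and "phospho" in mod_type.lower()
    ]


# Hypothetical sample shaped like (position, residue, modification-type).
mixed_ptms = [
    (15, "S", "Phosphoserine"),
    (20, "Y", "Phosphotyrosine"),
    (33, "K", "N6-acetyllysine"),
    (48, "T", "Phosphothreonine"),
]

phosphosites = filter_phosphosites(mixed_ptms)
```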

get_region(position, regions)#

get_sequence(ID)#

Return the sequence associated with the ID

Parameters:
ID: str

SwissProt accession number

Returns:
str

Sequence associated with the ID

get_species(ID)#

Return the species associated with the ID

Parameters:
ID: str

SwissProt accession number

Returns:
str

Species associated with the ID

get_structure(ID, output_format='list')#

Return all structures associated with the ID in question.

Parameters:
ID: str

SwissProt accession number

output_format: str, optional

Format of the output (‘list’ or ‘table’). Defaults to ‘list’.

Returns:
list of tuples, pd.DataFrame, or int

Tuple for each structure associated with the ID (domain_name, start_position, end_position). If output_format is ‘table’, returns a pandas DataFrame with columns [‘Structure_Name’, ‘Start_Position’, ‘End_Position’]. Returns an empty list or DataFrame if no structures are found.

return_species_nr_uniprot_ids()#

Return a dictionary with species as keys and number of unique uniprot IDs as values. Keep only those that are flagged as uniprot IDs in the non-redundant list.

Returns:
species_dict: dict

Dictionary with species names as keys and lists of uniprot IDs as values. {species_name: [list of uniprot IDs]}

species_reference_bool: dict

Dictionary indicating whether each species is a reference species and contains all protein IDs (True/False). If False, it means that only nonredundant uniprot records with PTMs are included in the list.
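Given the two dictionaries described above, a per-species summary is easy to derive. The following sketch uses invented sample dictionaries in the documented shapes. `summarize_species` is a hypothetical helper, and the accessions shown are illustrative only.

```python
def summarize_species(species_dict, species_reference_bool):
    """Return {species: (number of unique IDs, is_reference)}.

    Species missing from species_reference_bool are treated as
    non-reference, which is an assumption made for this sketch.
    """
    return {
        species: (len(set(ids)), species_reference_bool.get(species, False))
        for species, ids in species_dict.items()
    }


# Hypothetical inputs in the documented shapes.
example_species_dict = {
    "Homo sapiens": ["P00533", "P04637"],
    "Mus musculus": ["P01942"],
}
example_reference_bool = {"Homo sapiens": True, "Mus musculus": False}

species_summary = summarize_species(example_species_dict, example_reference_bool)
```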

update_to_latest()#

Download and update to the latest version of the ProteomeScout dataset from FigShare.

The “ProteomicDataset” Class#

class proteomeScoutAPI.api.ProteomicDataset(dataset, accession_col='acc', peptide_col='pep', find_site=True, domain_source='interpro', GO_terms=True)#

Bases: ProteomeScoutAPI

Class for annotating phosphoproteomic datasets with gene-level and site-specific information. Inherits methods from ProteomeScoutAPI.

Parameters:
dataset: pd.DataFrame

phosphoproteomic dataset to annotate

accession_col: str

column name containing SwissProt accessions

peptide_col: str

column name containing formatted peptides, with modification sites lowercased

find_site: bool

whether to find modification sites within protein (based on which residues are lowercased in peptide)

domain_source: str

source of domain annotations (‘interpro’ or ‘uniprot’)

GO_terms: bool

whether to include GO term annotations

Methods

annotate_dataset()

Given a proteomic dataset in self.dataset, annotate each row with information from ProteomeScout.

annotate_peptide(accession, peptide)

Given a SwissProt accession and peptide sequence, annotate with gene-level and site-specific information from ProteomeScout.

check_for_updates([update])

Check if a newer version of the ProteomeScout dataset is available on FigShare.

check_phosphosites(accessions, positions)

Check if positions are documented phosphosites in ProteomeScout for the given accessions

download_data([version])

Retrieve the ProteomeScout data files that accompany this version release from FigShare.

get_GO(ID)

Return all GO terms associated with the ID in question

get_PTMs(ID[, output_format])

Return all PTMs associated with the ID in question.

get_PTMs_withEvidence(ID)

Return PTMs with their associated evidence information

get_Scansite(ID)

DEPRECATED: Scansite predictions have been removed from the new data format.

get_Scansite_byPos(ID, res_pos)

DEPRECATED: Scansite predictions have been removed from the new data format.

get_accessions(ID)

Return all accession numbers associated with the ID

get_all_protein_info(ID)

Return all available information for the ID as a dictionary

get_annotated_PTMs(ID)

Given a UniProt ID, return a table of PTMs with annotations about whether they fall within domains, structures, or macro-molecular structures.

get_domains(ID[, domain_type, output_format])

Return all domains associated with the ID in question.

get_domains_with_site(domains, positions)

Check if positions are within any domains

get_evidence(ID)

Return evidence scores associated with PTMs for the ID

get_gene_name(ID)

Return the gene_name of an ID

get_macro_molecular(ID[, output_format])

Return all macro-molecular structures associated with the ID in question.

get_macro_with_site(macro_mol, positions)

Check if positions are within any macro-molecular structures

get_nearbyPTMs(ID, pos, window[, output_format])

Return all PTMs within a specified window of a given position

get_phosphosites(ID[, output_format])

Return all phosphosites (S/T/Y phosphorylation) associated with the ID

get_sequence(ID)

Return the sequence associated with the ID

get_species(ID)

Return the species associated with the ID

get_structure(ID[, output_format])

Return all structures associated with the ID in question.

return_species_nr_uniprot_ids()

Return a dictionary with species as keys and number of unique uniprot IDs as values.

update_to_latest()

Download and update to the latest version of the ProteomeScout dataset from FigShare.

get_region

annotate_dataset()#

Given a proteomic dataset in self.dataset, annotate each row with information from ProteomeScout.

annotate_peptide(accession, peptide)#

Given a SwissProt accession and peptide sequence, annotate with gene-level and site-specific information from ProteomeScout.

Parameters:
accession: str

SwissProt accession number

peptide: str

Peptide sequence with modification sites lowercased

Returns:
dict or int

Dictionary containing annotation information:

- ‘gene_name’: Gene name associated with the protein accession.
- ‘domains’: Semicolon-separated string of domain names associated with the protein.
- ‘domain_architecture’: String representation of the domain architecture (order of domains).
- ‘GO_terms’: Semicolon-separated string of GO terms associated with the protein.

If find_site is True, additional keys are included:

- ‘modification_sites’: Semicolon-separated string of modification sites found in the peptide.
- ‘aligned_peps’: Aligned peptide sequences found in the protein sequence.
- ‘documented_phosphosites’: Semicolon-separated string indicating whether each modification site is documented (1) or not (0).
- ‘site_in_domain’: Semicolon-separated string of domain names that contain the modification sites.
- ‘site_in_macro’: Semicolon-separated string of macro-molecular structure names that contain the modification sites.

Returns -1 if unable to find the accession in the database.
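The idea of locating modification sites from lowercased residues can be sketched as follows. This is an illustration of the concept only, not the library's implementation. It assumes 1-based protein positions and uses the first occurrence of the peptide in the sequence. The sequence, the peptide, and the helper name `find_modification_sites` are made up.

```python
def find_modification_sites(sequence, peptide):
    """Map lowercased residues in `peptide` to (residue, position) in `sequence`.

    Assumptions: positions are 1-based and only the first match of the
    uppercased peptide in the sequence is considered. Returns an empty
    list when the peptide is not found.
    """
    start = sequence.find(peptide.upper())
    if start == -1:
        return []
    return [
        (residue.upper(), start + offset + 1)
        for offset, residue in enumerate(peptide)
        if residue.islower()
    ]


# Made-up protein sequence and peptide with two lowercased (modified) residues.
toy_sequence = "MKTAYIAKQRSPEYLG"
toy_peptide = "AKQRsPEyL"

sites = find_modification_sites(toy_sequence, toy_peptide)
```

Here the peptide aligns at protein position 7, so the lowercased serine and tyrosine map to positions 11 and 14.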

check_phosphosites(accessions, positions)#

Check if positions are documented phosphosites in ProteomeScout for the given accessions

Parameters:
accessions: str

SwissProt accession number

positions: list of int

List of positions to check

Returns:
str

Semicolon-separated string of 1s and 0s indicating whether each position is a documented phosphosite (1) or not (0)
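The 1/0 string format can be reproduced with a short helper. In this sketch the set of documented positions is passed in directly instead of being looked up from an accession, so it only demonstrates the output encoding. `documented_flags` is a hypothetical name.

```python
def documented_flags(documented_positions, positions):
    """Encode, per queried position, 1 if documented and 0 otherwise."""
    documented = {int(p) for p in documented_positions}
    return ";".join("1" if int(p) in documented else "0" for p in positions)


# Hypothetical documented phosphosite positions and positions to check.
flags = documented_flags([11, 20, 48], [11, 14, 48])
```

The flags keep the order of the queried positions, so each entry can be matched back to its site.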

get_domains_with_site(domains, positions)#

Check if positions are within any domains

Parameters:
domains: list of tuples

List of domains (domain_name, start_position, end_position, domain_id)

positions: list of int

List of positions to check

Returns:
str

Semicolon-separated string of domain names that contain the positions
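A containment check of this kind reduces to an interval test. The sketch below assumes inclusive domain boundaries and reports each domain name once, in domain order, if it contains at least one of the positions. How the library orders names or handles positions outside every domain is not specified above, so those choices are assumptions. `domains_containing` and the sample domains are hypothetical.

```python
def domains_containing(domains, positions):
    """Return a semicolon-separated string of domains that hold any position.

    Assumptions: start and end are inclusive, and each matching domain is
    listed once in the order given.
    """
    hits = []
    for name, start, end, *_rest in domains:
        if any(int(start) <= int(pos) <= int(end) for pos in positions):
            hits.append(name)
    return ";".join(hits)


# Hypothetical domains shaped like (domain_name, start, end, domain_id).
toy_domains = [
    ("Pkinase", 5, 12, "IPR000719"),
    ("SH2", 13, 20, None),
]

containing = domains_containing(toy_domains, [11, 14, 30])
```

Because the extra tuple fields are ignored, the same test also applies to three-element (macro_name, start_position, end_position) tuples.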

get_macro_with_site(macro_mol, positions)#

Check if positions are within any macro-molecular structures

Parameters:
macro_mol: list of tuples

List of macro-molecular structures (macro_name, start_position, end_position)

positions: list of int

List of positions to check

Returns:
str

Semicolon-separated string of macro-molecular structure names that contain the positions