UniProt

CoDIAC.UniProt.makeRefFile(Uniprot_IDs, outputFile)

Makes a Domain Reference file

Parameters:

Uniprot_IDs: list: list of UniProt accession IDs generated by InterPro.py
outputFile: string: name of the output file

Returns:

Domain reference file in .csv format

Attributes:

UniProt_ID: UniProt accession ID
Gene: gene name
Species: scientific name
Domains: reference domain names with boundary ranges
Ref Sequence: reference sequence
PDB IDs: all PDB IDs linked to the specific UniProt ID
Domain Architecture: domains found within the protein sequence, arranged from N-terminus to C-terminus

CoDIAC.UniProt.print_domain_fasta_file(reference_csv, Interpro_ID, output_file, n_term_offset=0, c_term_offset=0, APPEND=False)

Given a UniProt reference file of proteins with InterPro domain annotations, creates a fasta file of the domains of interest found in that file. The n_term_offset and c_term_offset parameters pad the domain boundaries to allow for boundary uncertainty (both default to 0 and can be set to lengths up to 20).

Parameters:

reference_csv: string: location of the reference file of interest (such as one produced by UniProt.makeRefFile)
Interpro_ID: string: InterPro ID of the domain to extract. For example, in a reference entry such as SH3_domain:IPR001452:82:143; SH2:IPR000980:147:246; Prot_kinase_dom:IPR000719:271:524, the InterPro ID for the SH3_domain is IPR001452 and for the SH2 domain is IPR000980.
output_file: string: location of the output fasta. Fasta headers take the form uniprot_ID|InterproID|domain_name|domain_number|domain_start|domain_end. For proteins with more than one domain of the same type, domain_number gives the occurrence of that domain, counted from the N-terminus to the C-terminus. Domain start and end use ones-based counting on the UniProt reference sequence.
n_term_offset: int: number of amino acids to extend in the N-terminal direction (at most to the start of the protein)
c_term_offset: int: number of amino acids to extend in the C-terminal direction (at most to the end of the protein)
APPEND: bool: if True, opens the existing file and appends to it; otherwise overwrites it

Returns:

domain_fasta_dict: dict: As described in make_domain_fasta_dict: keys are fasta headers and values are the domain sequences. Also writes to output_file.
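
The boundary padding described above can be illustrated with a minimal, self-contained sketch. It does not call CoDIAC and is not the library's implementation: the helper names parse_domains and extract_domains are hypothetical, the sequence is a made-up placeholder, and reporting the padded (rather than the annotated) boundaries is an assumption. Only the domain entry format (name:InterproID:start:end, separated by ';') and the ones-based counting come from the description above.

```python
def parse_domains(domain_field):
    """Split 'name:IPRxxxxxx:start:end; ...' into (name, id, start, end) tuples."""
    domains = []
    for entry in domain_field.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        name, interpro_id, start, end = entry.split(":")
        domains.append((name, interpro_id, int(start), int(end)))
    return domains


def extract_domains(sequence, domain_field, interpro_id,
                    n_term_offset=0, c_term_offset=0):
    """Return (domain_number, start, end, subsequence) for each matching domain.

    start and end are ones-based and inclusive. Offsets widen the window but
    are clamped so it never runs past either end of the protein.
    """
    matches = [d for d in parse_domains(domain_field) if d[1] == interpro_id]
    matches.sort(key=lambda d: d[2])  # number occurrences from N to C
    results = []
    for number, (_name, _id, start, end) in enumerate(matches, start=1):
        padded_start = max(1, start - n_term_offset)
        padded_end = min(len(sequence), end + c_term_offset)
        # Convert ones-based inclusive coordinates to a Python slice.
        subsequence = sequence[padded_start - 1:padded_end]
        results.append((number, padded_start, padded_end, subsequence))
    return results


# Domain entry taken from the Interpro_ID description above.
field = ("SH3_domain:IPR001452:82:143; SH2:IPR000980:147:246; "
         "Prot_kinase_dom:IPR000719:271:524")
# Placeholder sequence of 600 residues, not a real protein.
sequence = "ACDEFGHIKLMNPQRSTVWY" * 30

hits = extract_domains(sequence, field, "IPR000980",
                       n_term_offset=5, c_term_offset=5)
```

Here the SH2 domain annotated at 147 to 246 is widened by five residues on each side, and an offset larger than the distance to the protein end would simply stop at residue 1 or at the last residue.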

CoDIAC.UniProt.translate_fasta_back(fasta_file, header_trans_file, output_fasta)

Use this function when you have a fasta_file with shortened headers and want to restore the long-form headers recorded in the mapping file (header_trans_file) created by translate_fasta_to_new_headers. It writes the fasta file with long headers to output_fasta. This is useful when you want to preserve a change made to the short-header file, such as an alignment.

Parameters:

fasta_file: str: location of input fasta file to translate
header_trans_file: str: location that stores the header translation (as written in translate_fasta_to_new_headers)
output_fasta: str: location to print the output fasta, using the longer headers

Returns:

None. Writes to output_fasta, overwriting rather than appending.
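
As a rough illustration of the idea, the sketch below swaps short headers for long ones while leaving sequence lines (for example, gapped alignment rows) untouched. It is not CoDIAC's code: restore_headers is a hypothetical helper, the mapping is passed as an in-memory dict because the layout of the mapping file is not described here, and the accession P00000 and gene GENE1 are placeholders.

```python
def restore_headers(fasta_text, short_to_long):
    """Replace each short fasta header with its long form.

    Sequence lines pass through unchanged, so edits made to the
    short-header file (such as alignment gaps) are preserved.
    """
    out_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            short = line[1:].strip()
            # Unknown headers are kept as-is rather than guessed.
            out_lines.append(">" + short_to_long.get(short, short))
        else:
            out_lines.append(line)
    return "\n".join(out_lines) + "\n"


# Placeholder mapping and aligned fasta, for illustration only.
mapping = {"P00000|GENE1|82|143": "P00000|GENE1|SH3_domain|1|IPR001452|82|143"}
aligned = ">P00000|GENE1|82|143\nAC--DEFG\n"

restored = restore_headers(aligned, mapping)
```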

CoDIAC.UniProt.translate_fasta_to_new_headers(fasta_file, output_fasta, key_array_order)

UniProt.print_domain_fasta_file writes a highly informative fasta header for each sequence of a given protein domain of interest. The header contains the following fields, in order, separated by '|': uniprot_id, gene_name, domain_name, domainNum, Interpro_ID, start, end

To change the order or the selection of fields while keeping a record of the change, use this function and pass key_array_order as a list naming the fields to keep, in the order you want them. possible_values = ['uniprot', 'gene', 'domain_name', 'domain_num', 'Interpro_ID', 'start', 'end']

For example, to use the UniProt ID first, then the gene, then the start and end positions of the domain, pass key_array_order = ['uniprot', 'gene', 'start', 'end']

This writes the new fasta file to output_fasta and a mapping file named from the base of output_fasta with _mapping.csv appended.

Parameters:

fasta_file: str: location of input fasta file to translate
output_fasta: str: location to print the output fasta, using the new headers
key_array_order: list: List of strings that includes the order and the values to keep from the possible values ['uniprot', 'gene', 'domain_name', 'domain_num', 'Interpro_ID', 'start', 'end']

Returns:

output_fasta: str: Confirmation of location of output file
mapping_file: str: Location of the mapping file created. This uses the same base name as the output_fasta and adds _mapping.csv
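
The header reordering can be sketched as follows. This is an illustrative stand-in, not the library's implementation: reorder_header is a hypothetical helper, and the example header uses the placeholder accession P00000 and gene GENE1. The field names and their original order are the ones listed above.

```python
# Field names in the order the long header stores them, as listed above.
FIELDS = ["uniprot", "gene", "domain_name", "domain_num",
          "Interpro_ID", "start", "end"]


def reorder_header(long_header, key_array_order):
    """Build a new '|'-separated header from the chosen fields, in order."""
    values = long_header.split("|")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    unknown = [key for key in key_array_order if key not in record]
    if unknown:
        raise ValueError("unknown header keys: %s" % unknown)
    return "|".join(record[key] for key in key_array_order)


long_header = "P00000|GENE1|SH3_domain|1|IPR001452|82|143"  # placeholder values
short_header = reorder_header(long_header, ["uniprot", "gene", "start", "end"])

# A short-to-long pair like this is what a mapping would need to record
# so that the long header can be restored later.
mapping_row = (short_header, long_header)
```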
