UniProt

CoDIAC.UniProt.makeRefFile(Uniprot_IDs, outputFile)

Makes a Domain Reference file

Parameters:
Uniprot_IDs: list

list of UniProt accession IDs generated by InterPro.py

outputFile: string

name of the output file

Returns:
Domain reference file in .csv format
Attributes:
UniProt_ID: UniProt accession ID
Gene: gene name
Species: scientific name
Domains: reference domain names with boundary ranges
Ref Sequence: reference sequence
PDB IDs: all PDB IDs linked to the specific UniProt ID
Domain Architecture: domains found within the protein sequence, arranged from N-terminus to C-terminus
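
As a minimal sketch of how such a reference file might be read back, the snippet below parses a one-row CSV held in memory. It assumes the column headers match the attribute names listed above exactly and that the Domains field uses the name:InterproID:start:end form, with entries separated by semicolons. The row values, the separators shown for PDB IDs and Domain Architecture, and the helper parse_domains are invented for illustration and are not part of CoDIAC.

```python
import csv
import io

# Hypothetical one-row reference file. Column names are assumed to match the
# attribute names in the documentation; all values here are invented.
example_csv = (
    "UniProt_ID,Gene,Species,Domains,Ref Sequence,PDB IDs,Domain Architecture\n"
    "X00001,GENE1,Homo sapiens,SH3_domain:IPR001452:3:6; SH2:IPR000980:8:12,"
    "MKTAYIAKQRQISFVK,1ABC;2DEF,SH3_domain|SH2\n"
)


def parse_domains(domains_field):
    """Split a Domains entry into (name, interpro_id, start, end) tuples.

    Illustrative helper, not a CoDIAC function.
    """
    domains = []
    for entry in domains_field.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        name, interpro_id, start, end = entry.split(":")
        domains.append((name, interpro_id, int(start), int(end)))
    return domains


rows = list(csv.DictReader(io.StringIO(example_csv)))
first = rows[0]
domains = parse_domains(first["Domains"])
print(first["UniProt_ID"], first["Gene"], domains)
```

For a real file, io.StringIO(example_csv) would be replaced by an open file handle on the path passed as outputFile.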
CoDIAC.UniProt.print_domain_fasta_file(reference_csv, Interpro_ID, output_file, n_term_offset=0, c_term_offset=0, APPEND=False)

Given a UniProt reference file of proteins that contains InterPro domain annotations, creates a fasta file of the domains of interest found in the reference file. The n_term and c_term offsets add a small padding in case of domain boundary issues (both default to 0, but can be set to lengths up to 20).

Parameters:
reference_csv: string

Location of the reference file of interest (such as one produced by UniProt.makeRefFile)

Interpro_ID: string

InterPro ID. For example, in a reference line such as SH3_domain:IPR001452:82:143; SH2:IPR000980:147:246; Prot_kinase_dom:IPR000719:271:524, the InterPro ID for the SH3_domain is IPR001452 and for the SH2 domain is IPR000980.

output_file: string

Location of the output fasta. Fasta headers will be uniprot_ID|InterproID|domain_name|domain_number|domain_start|domain_end. For proteins with more than one domain of the same type, domain_number indicates the occurrence of this domain, numbered from N- to C-terminus. Domain start and end are relative to ones-based counting of the UniProt reference protein.

n_term_offset: int

Number of amino acids to extend in the n-term direction (up to start of protein)

c_term_offset: int

Number of amino acids to extend in the c-term direction (up to end of protein)

APPEND: bool

If APPEND=True, an existing file is opened and appended to; otherwise the file is overwritten.

Returns:
domain_fasta_dict: dict

As described in make_domain_fasta_dict: keys are fasta headers and values are the domain sequences. Also writes to output_file.
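
The padding behaviour described for n_term_offset and c_term_offset can be sketched as follows. This is an illustration under stated assumptions, not the CoDIAC implementation: it assumes domain boundaries are ones-based and inclusive, and that padding is clamped to the start and end of the protein. The function extract_domain and the sequence are hypothetical, and whether the real function writes padded or original coordinates into the header is not specified here.

```python
def extract_domain(sequence, start, end, n_term_offset=0, c_term_offset=0):
    """Return (padded_start, padded_end, subsequence).

    Positions are ones-based and inclusive. Padding never extends past the
    first or last residue of the sequence. Illustrative helper only.
    """
    padded_start = max(1, start - n_term_offset)
    padded_end = min(len(sequence), end + c_term_offset)
    # Convert the ones-based inclusive range to a Python slice.
    return padded_start, padded_end, sequence[padded_start - 1:padded_end]


sequence = "MKTAYIAKQRQISFVK"  # invented 16-residue sequence

no_padding = extract_domain(sequence, 3, 6)
padded = extract_domain(sequence, 3, 6, n_term_offset=5, c_term_offset=2)
print(no_padding)
print(padded)
```

In the second call the requested N-terminal padding of 5 would run past the first residue, so the start is clamped to position 1.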

CoDIAC.UniProt.translate_fasta_back(fasta_file, header_trans_file, output_fasta)

Assumes you have a fasta_file with shortened headers and would like to move those back to the long-form names found in the mapping file (header_trans_file) created by translate_fasta_to_new_headers. Use this function to print the fasta file with long headers at the output_fasta location. This is useful when you wish to preserve a change made while using the shortened headers, such as an alignment.

Parameters:
fasta_file: str

location of input fasta file to translate

header_trans_file: str

location that stores the header translation (as written in translate_fasta_to_new_headers)

output_fasta: str

location to print the output fasta, using the longer headers

Returns:
No return value; writes to output_fasta, overwriting rather than appending
CoDIAC.UniProt.translate_fasta_to_new_headers(fasta_file, output_fasta, key_array_order)

UniProt.print_domain_fasta_file prints a highly informative fasta header when printing fasta sequences for a given protein domain of interest. The header fields are, in the following order and separated by '|': uniprot_id, gene_name, domain_name, domainNum, Interpro_ID, start, end

If you wish to change the order while keeping track of the change, use this function. key_array_order is a list indicating which items to keep and in what order. possible_values = ['uniprot', 'gene', 'domain_name', 'domain_num', 'Interpro_ID', 'start', 'end']

For example, to use the UniProt ID first, then the gene, and then the starting and ending positions of the domain, pass key_array_order = ['uniprot', 'gene', 'start', 'end']

This will print the new fasta file at output_fasta and a mapping file named using the base of output_fasta with _mapping.csv appended.

Parameters:
fasta_file: str

location of input fasta file to translate

output_fasta: str

location to print the output fasta, using the new headers

key_array_order: list

List of strings that includes the order and the values to keep from the possible values ['uniprot', 'gene', 'domain_name', 'domain_num', 'Interpro_ID', 'start', 'end']

Returns:
output_fasta: str

Confirmation of location of output file

mapping_file: str

Location of the mapping file created. This uses the same base name as the output_fasta and adds _mapping.csv
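
The header reordering and the role of the mapping can be sketched in a few lines. This is a hedged illustration rather than the library's code: it assumes a long header with the seven '|'-separated fields in the order uniprot, gene, domain_name, domain_num, Interpro_ID, start, end, and it keeps the mapping in a dictionary instead of a _mapping.csv file. The function reorder_header and the header values are hypothetical.

```python
POSSIBLE_VALUES = [
    "uniprot", "gene", "domain_name", "domain_num", "Interpro_ID", "start", "end",
]


def reorder_header(long_header, key_array_order):
    """Build a new header from the selected fields, in the requested order.

    Illustrative helper only; assumes the long header has exactly the seven
    fields named in POSSIBLE_VALUES, separated by '|'.
    """
    fields = long_header.split("|")
    if len(fields) != len(POSSIBLE_VALUES):
        raise ValueError("expected %d fields, got %d" % (len(POSSIBLE_VALUES), len(fields)))
    by_key = dict(zip(POSSIBLE_VALUES, fields))
    return "|".join(by_key[key] for key in key_array_order)


long_header = "X00001|GENE1|SH2|1|IPR000980|147|246"  # invented example
short_header = reorder_header(long_header, ["uniprot", "gene", "start", "end"])

# Recording short -> long is what makes the translation reversible later.
mapping = {short_header: long_header}
restored = mapping[short_header]
print(short_header)
print(restored)
```

Because dropping fields loses information, the short header alone cannot be expanded again; the recorded mapping is what lets translate_fasta_back restore the long form.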