UniProt
- CoDIAC.UniProt.makeRefFile(Uniprot_IDs, outputFile)
Makes a Domain Reference file
- Parameters:
- Uniprot_IDs: list
list of Uniprot Accession IDs generated by InterPro.py
- outputFile: string
name of the output file
- Returns:
- Domain reference file in .csv format
- Attributes:
- UniProt_ID: Uniprot Accession ID
- Gene: gene name
- Species: scientific name
- Domains: Reference domain names with boundary ranges
- Ref Sequence: Reference sequence
- PDB IDs: All PDB IDs linked to the specific UniProt ID
- Domain Architecture: Domains found within the protein sequence, arranged from N-terminus to C-terminus
- CoDIAC.UniProt.print_domain_fasta_file(reference_csv, Interpro_ID, output_file, n_term_offset=0, c_term_offset=0, APPEND=False)
Given a UniProt reference file of proteins containing InterPro domain annotations, creates a fasta file of the domains of interest found in that file. The n_term_offset and c_term_offset parameters add a small padding around each domain in case of domain boundary issues (both default to 0 and can be set to lengths up to 20).
- Parameters:
- reference_csv: string
Location of the reference file of interest (such as one produced by UniProt.makeRefFile)
- Interpro_ID: string
InterPro ID of the domain of interest. For example, in a reference entry such as SH3_domain:IPR001452:82:143; SH2:IPR000980:147:246; Prot_kinase_dom:IPR000719:271:524, the InterPro ID for the SH3 domain is IPR001452 and for the SH2 domain is IPR000980.
- output_file: string
Location of the output fasta file. Fasta headers take the form uniprot_ID|InterproID|domain_name|domain_number|domain_start|domain_end. For proteins with more than one domain of the same type, domain_number gives the occurrence of the domain, counted from the N-terminus to the C-terminus. domain_start and domain_end use ones-based numbering of the UniProt reference protein.
- n_term_offset: int
Number of amino acids to extend in the N-terminal direction (at most to the start of the protein)
- c_term_offset: int
Number of amino acids to extend in the C-terminal direction (at most to the end of the protein)
- APPEND: bool
If APPEND=True, opens the existing file and appends to it; otherwise the file is overwritten
- Returns:
- domain_fasta_dict: dict
As described in make_domain_fasta_dict: keys are fasta headers and values are the domain sequences. The same records are also written to output_file.
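The domain lookup and padding described above can be sketched as follows. This is an illustrative sketch only, not CoDIAC's implementation: the helper name `extract_domains`, the placeholder sequence, and the choice to report the padded coordinates are all assumptions made for the example.

```python
def extract_domains(domains_field, sequence, interpro_id,
                    n_term_offset=0, c_term_offset=0):
    """Hypothetical helper: return (name, number, start, end, subsequence)
    for every domain in `domains_field` whose InterPro ID matches."""
    hits = []
    count = 0  # occurrence of the matching domain, numbered N-to-C
    for entry in domains_field.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        name, ipr, start, end = entry.split(":")
        if ipr != interpro_id:
            continue
        count += 1
        # Positions are ones-based and inclusive; padding is clamped so it
        # never runs past the start or end of the protein.
        start = max(1, int(start) - n_term_offset)
        end = min(len(sequence), int(end) + c_term_offset)
        hits.append((name, count, start, end, sequence[start - 1:end]))
    return hits


# Placeholder sequence (not a real protein), 600 residues long.
sequence = "ACDEFGHIKLMNPQRSTVWY" * 30
domains = ("SH3_domain:IPR001452:82:143; SH2:IPR000980:147:246; "
           "Prot_kinase_dom:IPR000719:271:524")

sh2_hits = extract_domains(domains, sequence, "IPR000980",
                           n_term_offset=5, c_term_offset=5)
```

With offsets of 5, the SH2 entry at 147:246 is extended to positions 142 through 251 of the placeholder sequence.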
- CoDIAC.UniProt.translate_fasta_back(fasta_file, header_trans_file, output_fasta)
Use this function when you have a fasta_file with shortened headers and want to restore the long-form headers recorded in the mapping file (header_trans_file) created by translate_fasta_to_new_headers. The fasta file with long headers is written to output_fasta. This is useful when you want to preserve a change made under the shortened headers, such as an alignment.
- Parameters:
- fasta_file: str
location of input fasta file to translate
- header_trans_file: str
location that stores the header translation (as written in translate_fasta_to_new_headers)
- output_fasta: str
location to print the output fasta, using the longer headers
- Returns:
- None. Writes to output_fasta, overwriting rather than appending
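A minimal sketch of the header restoration idea, assuming the short-to-long mapping has already been loaded into a dictionary. The layout of the mapping file is not described here, so the in-memory `short_to_long` dict, the helper name `restore_headers`, and the placeholder records are all hypothetical and do not reflect CoDIAC's implementation.

```python
def restore_headers(fasta_lines, short_to_long):
    """Hypothetical helper: swap each short fasta header for its long form,
    leaving sequence lines (for example, aligned sequences) untouched."""
    restored = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            short = line[1:].strip()
            # Fall back to the short header if it is not in the mapping.
            restored.append(">" + short_to_long.get(short, short))
        else:
            restored.append(line)
    return restored


# Placeholder records; "P00000" and "GENE1" are made-up identifiers.
short_to_long = {
    "P00000|GENE1|82|143": "P00000|GENE1|SH3_domain|1|IPR001452|82|143",
}
aligned = [">P00000|GENE1|82|143\n", "ACD-EFG\n"]

restored = restore_headers(aligned, short_to_long)
```

The gap character introduced by the alignment survives, while the header returns to its long form.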
- CoDIAC.UniProt.translate_fasta_to_new_headers(fasta_file, output_fasta, key_array_order)
UniProt.print_domain_fasta_file writes a highly informative fasta header for each sequence of a given protein domain of interest. The header consists of the following fields, in order and separated by '|': uniprot_id, gene_name, domain_name, domainNum, Interpro_ID, start, end
To change which fields appear and in what order, while keeping track of the change, use this function and pass key_array_order as a list of the fields to keep, in the desired order. possible_values = ['uniprot', 'gene', 'domain_name', 'domain_num', 'Interpro_ID', 'start', 'end']
For example, to keep the UniProt ID, the gene, and the start and end positions of the domain, in that order, pass key_array_order = ['uniprot', 'gene', 'start', 'end']
This writes the new fasta file to output_fasta and a mapping file named after the base of output_fasta with _mapping.csv appended.
- Parameters:
- fasta_file: str
location of input fasta file to translate
- output_fasta: str
location to print the output fasta, using the new headers
- key_array_order: list
List of strings giving the values to keep, in the desired order, chosen from the possible values ['uniprot', 'gene', 'domain_name', 'domain_num', 'Interpro_ID', 'start', 'end']
- Returns:
- output_fasta: str
Confirmation of location of output file
- mapping_file: str
Location of the mapping file created. This uses the same base name as the output_fasta and adds _mapping.csv
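The effect of key_array_order on a single header can be sketched as below. This is an illustration under stated assumptions, not CoDIAC's implementation: the helper name `reorder_header` and the placeholder header are hypothetical, and the sketch assumes the seven-field header order listed above.

```python
POSSIBLE_VALUES = ["uniprot", "gene", "domain_name", "domain_num",
                   "Interpro_ID", "start", "end"]


def reorder_header(header, key_array_order):
    """Hypothetical helper: keep only the requested fields of a
    '|'-separated header, in the requested order."""
    fields = dict(zip(POSSIBLE_VALUES, header.split("|")))
    unknown = [key for key in key_array_order if key not in fields]
    if unknown:
        raise ValueError("unknown header keys: " + ", ".join(unknown))
    return "|".join(fields[key] for key in key_array_order)


# Placeholder header; "P00000" and "GENE1" are made-up identifiers.
long_header = "P00000|GENE1|SH3_domain|1|IPR001452|82|143"
short_header = reorder_header(long_header, ["uniprot", "gene", "start", "end"])

# A short-to-long record like this is what allows the change to be reversed.
mapping = {short_header: long_header}
```

Here the short header keeps only the UniProt ID, gene, start, and end, and the mapping retains enough information to recover the full header later.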