Configuring your DANSy workspace
For DANSy and deDANSy analysis, we recommend keeping a DANSy-specific directory for the reference files created by CoDIAC. This becomes especially useful if you have multiple projects that use the same reference file(s) (e.g. the same build of the whole proteome reference file from CoDIAC), or with deDANSy, to ensure the same proteome build is used across analyses.
We provide a config module that creates this directory for you; it only needs to be run once after installing DANSy.
[1]:
from dansy import config
[ ]:
# We recommend passing the path to your working directory; a DANSY_DATA folder is then created inside it
config.create_DANSy_dirs(target_dir='/user/working/directory')
This directory can then house the reference files generated by CoDIAC. The example here uses the GENCODE SwissProt metadata file to get UniProt IDs. A dedicated script for generating the reference file from the command line is available in the scripts folder of our GitHub repo. It checks for existing csv files that end with the same suffix and either verifies that all UniProt IDs were retrieved or continues retrieving the missing ones.
Note: Retrieving all the InterPro and UniProt information takes more than 1 hour the first time it is run.
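Because the first retrieval is slow, it can help to check which IDs still need fetching before starting again. The following is a minimal sketch of that kind of check, not the script from the repo: the function name `find_missing_ids` and the `UniProt ID` column name are assumptions made for illustration, so adjust the column name to match the header of your reference file.

```python
import csv
import glob
import os


def find_missing_ids(requested_ids, data_folder, file_suffix, id_column="UniProt ID"):
    """Return the requested IDs that are absent from every csv file in
    data_folder whose name ends with file_suffix.

    The id_column default is an assumption; use the header of your own file.
    """
    retrieved = set()
    # Escape the folder and suffix so special characters are matched literally
    pattern = os.path.join(glob.escape(data_folder), "*" + glob.escape(file_suffix))
    for path in glob.glob(pattern):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                value = row.get(id_column)
                if value:
                    retrieved.add(value)
    # Keep the original order of the requested IDs
    return [uid for uid in requested_ids if uid not in retrieved]
```

An empty list means every requested ID is already present in a file with that suffix; otherwise only the returned IDs need to be retrieved.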
[ ]:
from CoDIAC import UniProt
from datetime import date
import pandas as pd
import os
[ ]:
data_folder = config.DANSY_DATA_DIR
retrieval_date = date.today().strftime('%Y%m%d')
file_suffix = retrieval_date+'.csv'
gencode = pd.read_table('gencode.v47.metadata.SwissProt',header=None) # Change this based on which version you download
uniprot_ids = gencode[1].tolist()
ref_file_name = os.path.join(data_folder,'Complete_Proteome_Reference_File'+file_suffix)
_ = UniProt.makeRefFile(uniprot_ids, ref_file_name)
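The GENCODE metadata file may list the same UniProt accession on several rows, since it is organized per transcript. If you would rather pass each ID only once, a small order-preserving de-duplication step can be applied first. This is an optional sketch; `unique_in_order` is a hypothetical helper and not part of DANSy or CoDIAC.

```python
def unique_in_order(ids):
    """Drop repeated IDs while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(ids))


example_ids = ["P00533", "P04637", "P00533", "Q9Y6K9", "P04637"]
print(unique_in_order(example_ids))  # ['P00533', 'P04637', 'Q9Y6K9']
```

In the cell above, this would be used as `uniprot_ids = unique_in_order(gencode[1].tolist())`.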
Now we can update the default proteome version that DANSy will look for whenever it needs to import the reference file for the whole proteome.
[ ]:
config.update_proteome_version(file_suffix)
Your DANSy workspace is now fully configured for whole-proteome-based analysis and/or deDANSy analysis when a reference file is not provided.