
Sequence Retrieval and Formatting Tools

Let's say you want to analyze co-evolution within or across one or more molecules. To do this properly, you will need to collect a large set of orthologous sequences of each molecule across many species. Sure, you can simply search NCBI and select and copy sequences one by one. Another approach is an iterative BLAST search (PSI-BLAST). But a faster approach is to access these databases programmatically. Below are some tips and tools for carrying out these steps. You might also consider an iterative, automated protein homology search to find orthologs, such as those provided by HMMER and jackhmmer, which scour UniProt for sequences. In fact, this is the approach used by servers such as EVcouplings. Once again, some of the tools below will aid this process.


NCBI E-utilities URLs

NCBI databases can be accessed programmatically via the E-utilities that NCBI makes available. With these, you can submit a search term and receive an XML file containing the IDs of the matching records you want to add to the collection. Then, using these IDs, you can fetch the sequences.

* Esearch
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=REPLACE_with_SEARCH_TERMS&retmax=REPLACE_with_MAX_NUMBER_of_RESULTS
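Replace the first placeholder with your search terms and the second with the maximum number of sequences to retrieve.

* The result is an XML file. Keep only the values inside the <Id> tags, discard everything else, and join the IDs with commas. Use the resulting list as the ID input for Efetch (below).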

* Efetch
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=REPLACE_with_IDS&rettype=fasta
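
The three steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function names (esearch_url, parse_ids, efetch_url, fetch_fasta) are hypothetical helpers invented for this example, and it assumes the Esearch XML lists its results as <Id> elements inside an <IdList> element. Only fetch_fasta touches the network.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Base address shared by the two E-utilities URLs shown above.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20, db="protein"):
    """Build the Esearch URL for a search term and a maximum result count."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def parse_ids(esearch_xml):
    """Return the record IDs found in <IdList><Id>...</Id></IdList>."""
    root = ET.fromstring(esearch_xml)
    return [element.text for element in root.findall("./IdList/Id")]


def efetch_url(ids, db="protein", rettype="fasta"):
    """Build the Efetch URL for a list of IDs, joined with commas."""
    # safe="," keeps the commas readable instead of encoding them as %2C.
    query = urlencode({"db": db, "id": ",".join(ids), "rettype": rettype}, safe=",")
    return f"{EUTILS}/efetch.fcgi?{query}"


def fetch_fasta(term, retmax=20, db="protein"):
    """Run Esearch, then Efetch, and return the FASTA text (network access)."""
    with urlopen(esearch_url(term, retmax, db)) as response:
        ids = parse_ids(response.read())
    if not ids:
        return ""
    with urlopen(efetch_url(ids, db)) as response:
        return response.read().decode("utf-8")
```

A call such as fetch_fasta("your search terms", retmax=50) would then return the sequences as one FASTA-formatted string. For long ID lists the Efetch URL can become very large, so retrieving in smaller batches is the safer choice.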


If the above procedure seems too laborious, you can try some of the programs below, including an online form that helps expedite the above instructions. Find_Seqs automates the first step for any type of molecule, and NucSeqFetch is the proper follow-up for retrieving nucleotide sequences (important for looking at RNA co-evolution).

E-utilities Assistant (Online Version)

Type of molecule to retrieve:

Number of results to retrieve, and the result index to start at:

Use the two inputs above to retrieve sequences in batches, advancing the starting index each time. For example, set them to 100 and 0 and collect a batch, then 100 and 100, then 100 and 200, and so on.
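
The same batching scheme can be written as a short Python sketch. How the online form works internally is not described here, so this example assumes the two inputs correspond to the Esearch parameters retmax (number of results) and retstart (starting index). The helper names batch_windows and batched_esearch_urls are hypothetical, and the code only builds URLs without contacting NCBI.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def batch_windows(total, batch_size):
    """Yield (batch_size, start_index) pairs that together cover `total` results."""
    for start in range(0, total, batch_size):
        yield batch_size, start


def batched_esearch_urls(term, total, batch_size=100, db="protein"):
    """Build one Esearch URL per batch, advancing the starting index each time."""
    urls = []
    for retmax, retstart in batch_windows(total, batch_size):
        query = urlencode(
            {"db": db, "term": term, "retmax": retmax, "retstart": retstart}
        )
        urls.append(f"{ESEARCH}?{query}")
    return urls
```

For 300 results in batches of 100, batch_windows produces the pairs (100, 0), (100, 100) and (100, 200), matching the sequence of settings described above.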

Search Term (currently only works well for protein searches!):



Find_Seqs


NucSeqFetch
