﻿trace2dbest
version 2.1
User Guide

turning sequence chromatograph traces into expressed sequence tags






Alasdair Anthony, John Parkinson, Mark Blaxter

School of Biological Sciences
University of Edinburgh

for the Natural Environment Research Council
Environmental Genomics Thematic Programme Data Centre






Ashworth Laboratories, King's Buildings, Edinburgh, EH9 3JT, UK

Alasdair Anthony <al.anthony@ed.ac.uk>	Mark Blaxter <mark.blaxter@ed.ac.uk>

http://www.nematodes.org/ 
http://envgen.ceh.ox.ac.uk/

Contents

1 WHAT IS trace2dbest AND WHAT DOES IT DO? 
1.1 What is trace2dbest and what does it do?
1.2 Why use trace2dbest?
1.3 What does trace2dbest do with my sequence traces?
1.4 Why do I have to use a controlled sequence naming scheme?
1.5 What naming schemes can I use?
1.6 How do I rename my sequence files to fit the scheme?
1.7 What are 'Library', 'Contact' and 'Publication' files?


2 SETTING UP trace2dbest
2.1 What do I need to run trace2dbest?
2.2 Installing trace2dbest

3 USING trace2dbest
3.0 Preparation
3.1 Section 1 - Lib, Cont, Pub and EST file information
3.2 Section 2 - trace2dbest processing information
3.3 Section 3 - trace2dbest parameters
3.4 Section 4 - Annotation of sequences
3.5 Section 5 - Trace processing
3.6 Section 6 - Sequence processing
3.7 Section 7 - Submission and saving of files

4 trace2dbest OUTPUT AND WHERE IT IS SAVED
4.1 Where are the files saved?
4.2 trace2dbest output
1 WHAT IS trace2dbest AND WHAT DOES IT DO?

1.1 What is trace2dbest and what does it do?

trace2dbest is a computer program that takes sequencing chromatograph trace files from expressed sequence tag (EST) projects, as produced by fluorescent sequencing machines (such as ABI Prism or Amersham MegaBACE instruments), and processes them into quality-checked sequences ready for submission to dbEST, the public repository for expressed sequence tags. trace2dbest will also help you create the Publication, Library and Contact files that dbEST requires in addition to the EST files. If you are not sure what an expressed sequence tag is, please see the dbEST introductory pages at 
http://www.ncbi.nlm.nih.gov/dbEST/index.html

1.2 Why use trace2dbest?

trace2dbest simplifies the process of getting your sequences from the sequencer to the database. With only a few sequences, it is possible to do the job 'by hand', relying on manual editing and individually tailored responses to possible errors and other issues. When processing a lot of sequences, for example any project with more than 48 individual trace files, it is easier to let a computer do the work. The high-throughput genome sequencing centres have developed a suite of software tools that are easily adapted for use in a low- or medium-throughput setting. What we have done is bundle these together into one program, called trace2dbest. We hope that using trace2dbest will be easy and painless, and that it makes the process of generating and using ESTs exciting and rewarding.

1.3 What does trace2dbest do with my sequence traces?

trace2dbest uses the base-calling program phred to get a raw sequence from your trace files. Phred assigns a quality score to each of the bases it calls, based on the strength of the signal, the shape of the peak and the local environment of the peak. trace2dbest then takes this raw sequence through several stages of trimming, the end result being a good quality EST sequence.
trace2dbest uses the program cross_match to identify and trim vector sequence and, optionally, E. coli sequence. trace2dbest will also trim adapter, poly(A) tail and low quality bases from the sequence. All these trimming stages have parameters that can be adjusted by the user.
After the sequence has been trimmed, trace2dbest will create a dbEST EST file for it, based on information provided by the user at the start of the session. Once all the sequences have been processed, you have the option of mailing the completed submission file directly to dbEST. Whether or not you submit, the files are saved to disk (see section 4).
 
1.4 Why do I have to use a controlled sequence naming scheme?

trace2dbest is useful in isolation, but is designed to be used in an integrated set of programs (called PartiGene) that can take EST sequence traces through a series of informatic analyses to produce a 'partial genome': a database of analysed, annotated sequences. For this suite to function, it needs a consistent naming scheme for all the sequences so that the programs can perform the proper analyses. This consistency allows the software to process files efficiently, extracting information from the file name rather than having to be told by a user what to do. For example, trace2dbest will extract the plate number and plate coordinates from each file name and insert this information into the dbEST EST file.

1.5 What naming schemes can I use?

trace2dbest accepts two naming schemes, the NERC environmental genomics (EG) scheme and the full STRESSGENES scheme. Each naming scheme is essentially a series of tags separated by the underscore (_) character. A trace file name using the NERC EG naming scheme looks like:
		Lr_adE_02A05
and one using full STRESSGENES naming scheme looks like:
		CcLL03b01a02f2_AbaRb
The NERC EG scheme is described here, but the principles of both schemes are the same (details of the STRESSGENES scheme can be found at http://legr.liv.ac.uk/).
The first tag must be two characters and is used to indicate the species (or major project identifier). The second tag, which may be from 3 to 5 characters long, indicates the library from which the sequenced clone was derived. The third tag indicates the 'address' of the clone in terms of microtitre plate number and row/column. Thus, in the example above, 'Lr' would indicate the species (Lumbricus rubellus), 'adE' the library (say adult Edinburgh) and '02A05' the plate coordinates (plate 02, row A, column 05). Using the naming scheme allows the software, and the user, to usefully interpret and summarise data in terms of species, library or plate.
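As an illustration only, the three tags of a NERC EG name can be pulled apart with a regular expression. The sketch below is not taken from trace2dbest itself: the pattern is an assumption based on the description above, it allows digits in the library tag (because the example Am_AW1_03F03 used later in this guide contains one), and it assumes 96-well plates with rows A-H and columns 01-12.

```python
import re

# Assumed pattern for the NERC EG scheme: species_library_plateRowColumn.
# This is a reading of the guide's description, not trace2dbest's own check.
NERC_EG_NAME = re.compile(
    r"^(?P<species>[A-Za-z]{2})_"
    r"(?P<library>[A-Za-z0-9]{3,5})_"
    r"(?P<plate>\d+)(?P<row>[A-H])(?P<column>0[1-9]|1[0-2])$"
)


def parse_nerc_eg_name(name):
    """Return the tags of a NERC EG trace name as a dict, or None if it does not fit."""
    match = NERC_EG_NAME.match(name)
    return match.groupdict() if match else None


print(parse_nerc_eg_name("Lr_adE_02A05"))
print(parse_nerc_eg_name("Apisadultw03F03"))
```

The first call returns the species, library, plate, row and column tags; the second returns None because the name has no underscore-separated tags.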

1.6 How do I rename my sequence files to fit the scheme?

Included in the trace2dbest package is a program called rename_file.pl that helps you to rename trace files. rename_file.pl replaces one text string that you supply with another one. It can also transform serially numbered files into files numbered 'as if' from a 96 well plate (so that 001 becomes A01 and 096 becomes H12). If you run rename_file.pl you will get the following 'options' list:

 Usage : rename_files.pl <list of arguments>
	-dir <txt>	- set directory of traces <dir> to <txt>
	-add <txt>	- <txt> gets added to the beginning of each tracefile in directory <dir>
	-txt <txt1>	- <txt> gets removed from each file
	-sub <txt2>	- (only with -txt set) txt1 is replaced by txt2
	-format		- Traces are reformatted to correct 96 well nomenclature. Single digits are replaced by double digits and row ID set to uppercase. In addition, if your files do not contain plate coordinates, but are numbered sequentially e.g. trace1, trace2, trace3 etc. this option will convert the numbers into 96 well format (it assumes 1-12 refer to row A columns 1-12 etc.)
	-help		- Get more detailed help



Thus, to change a set of filenames in a directory (such as 'bees') from an incorrect format (such as 'ApisadultW03F03') to a correct one (Am_AW1_03F03), you would type:
	 rename_file.pl -dir bees -txt ApisadultW -sub Am_AW1_ 

For more information type 'rename_file.pl -help'.
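The serial-number conversion performed by the -format option can be illustrated with a short sketch. This is not the code used by the renaming script; it only assumes, as stated in the options list, that numbers 1-12 fill row A, 13-24 fill row B, and so on up to 96 in row H. The function name is hypothetical.

```python
def serial_to_well(serial):
    """Convert a 1-based serial number (1-96) into a 96-well coordinate such as 'A01'."""
    if not 1 <= serial <= 96:
        raise ValueError("serial number must be between 1 and 96")
    # 12 wells per row: the quotient selects the row, the remainder the column.
    row, column = divmod(serial - 1, 12)
    return f"{'ABCDEFGH'[row]}{column + 1:02d}"


for serial in (1, 12, 13, 96):
    print(serial, serial_to_well(serial))
```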

1.7 What are 'Library', 'Contact' and 'Publication' files?

The public EST repository, dbEST, simplifies data deposition by splitting the information across a set of four types of linked files (see http://www.ncbi.nlm.nih.gov/dbEST/how_to_submit.html for details of the EST submission process). Each individual EST has an 'EST' file: this file holds the sequence and basic information such as the name of the library, the name of any publication (existing or planned) describing the dataset, and the name of a contact person who can provide more details. Rather than repeat all the information on these three topics in each and every EST file, dbEST holds the data in linked files called, unsurprisingly, 'Lib' for library information, 'Pub' for publication information and 'Cont' for contact information. To get a set of ESTs into dbEST one has to submit these three files along with the sequences. The simplifying feature of this is that once your Cont file is in dbEST, any subsequent EST submissions you make (next week, month, year) need only refer to this file to access the same contact information. The same is true of the Pub and Lib files. 
trace2dbest version 2.0 allows you to create these files when you come to process your traces.
2 SETTING UP trace2dbest

2.1 What do I need to run trace2dbest? 

trace2dbest is a 'pipeline' program: it takes an input file and processes it through a series of steps to give an output. Some of these steps are built in to trace2dbest, while others rely on external programs. So, to use trace2dbest you will need, in addition to the program itself (trace2dbest.pl):

1- a set of sequence chromatographs. These need to follow a consistent naming scheme (see section 1.5). We provide a renaming script, rename_files.pl, that can help you adjust sets of file names. For example, many sequencing services add a lane or capillary number to each trace name; this can easily be removed. 

2- a UNIX-based computer (for example a Bio-Linux computer). These computer operating systems come with a programming language called perl installed as standard. trace2dbest is written in perl. You need to have at least perl version 5.4 installed. Most versions of UNIX, such as LINUX and MacOSX, include perl. 

3- the sequence chromatographic trace base-calling program phred, and the vector sequence matching software cross_match. phred, cross_match and a third useful program phrap come as a package available under a free academic licence from the program's author, Phil Green, at the University of Washington. Please go to http://www.phrap.org/ etc to get a copy. There is a simple form to fill in (at http://www.phrap.org/consed/academic_agreement.txt): once you have done that, the software is emailed to you. It should be installed under /usr/software/phred/phred/ in Bio-Linux. 

and optionally,

4-  the sequence similarity search suite BLAST. BLAST is the global standard sequence search programme and is available from the NCBI: go to ftp://ftp.ncbi.nih.gov/blast/executables/ etc and follow the instructions for downloading it. Install under /usr/software/blast. (Bio-Linux systems are supplied with BLAST pre-installed).
 
5-  local sequence databases for BLAST. Part of the trace2dbest process allows you to perform a simple, BLAST-based annotation of the sequences before they are submitted. If you want to do this, you should make sure you have the required sequence databases available to you. We would recommend, for most purposes, that you use the nr protein database from
 ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.tar.gz
 or the SwissProt database from
 ftp://ftp.ebi.ac.uk/pub/databases/swissprot/release/
 (you want the 'sprot##.dat' file, where ## is the SwissProt release number). Once you have the database locally you should use the formatdb command of the BLAST suite to format it ready for searching.

2.2 Installing trace2dbest

If you are using Bio-Linux version 3.0 or higher, trace2dbest will already be installed on your system and is ready to use. If trace2dbest is not installed on your system, you will need to download the trace2dbest RPM. To install the RPM, type (as a user with root privileges):

	rpm -ivh trace2dbest-2.X-X.i386.rpm

replacing Xs with the appropriate version numbers. This will install the software into /usr/software/trace2dbest/trace2dbest2.1 which is the standard location for software in Bio-Linux. The rpm will also set up a softlink from this install directory to /usr/software/trace2dbest/trace2dbest. 
Additionally, if you want to run trace2dbest without typing the full path to the program, you can add /usr/software/trace2dbest/trace2dbest to your PATH environment variable, or set up an alias for trace2dbest.pl.
Finally, if you intend using other software in the EST pipeline that trace2dbest is a part of, or if there is more than one person that will be using trace2dbest on the install machine, we recommend that you set up the following directory:
/home/db/est_solutions
and ensure that all users have the appropriate permissions to write to this directory. The reasons for this are given in section 4 of this guide.
3 USING trace2dbest

3.0 Preparation

Before starting trace2dbest you should first ensure that all the traces you wish to process are in a single directory that contains no other files. You should also check that all the trace files match one of the naming schemes described in section 1.5. As part of its quality checking, trace2dbest will identify any files in your trace directory that do not match the selected naming scheme and, if you choose to continue, delete them (see section 3.2).
We recommend that you run trace2dbest from an empty directory, as this is where trace2dbest will initially write its output files. It is important that you do not try to run trace2dbest from your trace directory. If trace2dbest has been installed as described above you can start the program by typing trace2dbest.pl (or whatever you have set up an alias to accept).
In the first part of trace2dbest you will be required to enter information interactively. To make this process easier, trace2dbest will make use of the perl Term::ReadLine::Gnu module if it is found on your system. This module makes features such as filename completion and command history available. If you don't have this module and wish to use it, download it from http://www.cpan.org/.

3.1 Section 1 - Lib, Cont, Pub and EST file information

The first part of trace2dbest deals with the creation of dbEST Lib, Pub, Cont and EST files. For the Lib, Pub and Cont files the user is presented with three main options: (1) file already submitted, (2) create file now, (3) use saved file.



Obviously, if you have not previously saved any files then you will not have the option to use a saved file. If you select option 1 (file already submitted) then trace2dbest will ask you for a small amount of information relating to the file so that it can fill in the relevant parts of the EST file. (Please note therefore that the information you provide must match exactly that in the submitted file).
If you wish to use trace2dbest to help create a new file, select option 2. trace2dbest will then request from you the information needed to build the file. Most of the information needed is self-explanatory. For further details see the dbEST introductory pages at 
http://www.ncbi.nlm.nih.gov/dbEST/index.html

When you have entered the information for the file, trace2dbest will format it according to dbEST standards and display the file on the screen. At this point you should check the file to ensure it is correct.
You will only be presented with two options for the EST file: enter information now or use a saved file. When asked for the sequencing primer, you should enter the name of the primer, followed, if desired, by the sequence in round brackets ( ). The primer name you give here will be entered in the SEQ_PRIMER field of the EST file. It will also be appended to the trace file name and entered in the EST# field of the EST file. The information for the forward and reverse PCR primers should be entered if you have it; otherwise hit enter and the field will be left blank. When requested, the date on which you would like your data to be made public should be entered in the form MM/DD/YYYY. For immediate release, just hit enter (the corresponding field in the EST file will be left blank). Please note that dbEST policy is to have a maximum hold on data of 6 months. 
Whenever a new file is created you will be given the opportunity to save this file for future use.
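If you prepare release dates outside trace2dbest, a check along the following lines may help catch mistakes before you start a session. It is a sketch under stated assumptions, not part of trace2dbest: the function name is hypothetical, and the 6 month maximum hold is approximated as 183 days.

```python
from datetime import date, datetime, timedelta

# Assumption: the "6 months" maximum hold is approximated as 183 days.
MAX_HOLD = timedelta(days=183)


def parse_release_date(text, today=None):
    """Return the release date, or None for immediate release (empty input)."""
    text = text.strip()
    if not text:
        return None
    today = today or date.today()
    # strptime raises ValueError if the text is not in MM/DD/YYYY form.
    release = datetime.strptime(text, "%m/%d/%Y").date()
    if release > today + MAX_HOLD:
        raise ValueError("release date is more than about 6 months away")
    return release


print(parse_release_date("03/15/2004", today=date(2004, 1, 1)))
```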

3.2 Section 2 - trace2dbest processing information

In this section you are required to enter information that will help trace2dbest process your traces efficiently. First you are asked to enter an adapter sequence. You may use a regular expression to represent the adapter sequence if you wish. trace2dbest will scan the raw sequence for the adapter sequence you have entered. If it is found, the adapter sequence and everything before it (upstream) will be trimmed off. If you do not wish to trim adapter sequence, just press return.
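The effect of adapter trimming can be sketched as follows. This is an illustration rather than trace2dbest's own code: the adapter pattern and sequence are invented for the example, the match is assumed to be case-insensitive, and the sketch trims at the first match, which may differ from what trace2dbest does when an adapter occurs more than once.

```python
import re


def trim_adapter(sequence, adapter_pattern):
    """Remove the adapter and everything upstream of it; leave the sequence alone if absent."""
    if not adapter_pattern:
        return sequence  # nothing entered: no adapter trimming
    match = re.search(adapter_pattern, sequence, flags=re.IGNORECASE)
    return sequence[match.end():] if match else sequence


# Hypothetical adapter written as a regular expression (final G optional).
raw = "TTTTGGCACGAGGACGTACGTTAGC"
print(trim_adapter(raw, "GGCACGAGG?"))
```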
trace2dbest will then ask you for the location of the vector.seq file. This file is used by cross_match to scan the raw sequence for vector. If you do not wish to use the vector.seq file included in the trace2dbest package then you may enter the location of an alternative file (which must be in FASTA format). However, we recommend that if your vector sequence is not included in the standard trace2dbest vector.seq file then simply add this sequence (in FASTA format) to the standard vector.seq file. 
You are then asked if you would like to trim stray E. coli sequence from your ESTs. trace2dbest uses cross_match with stringent parameters to trim E. coli sequence; however, EST sequence that is very similar to part of the E. coli genome may inadvertently be trimmed off. E. coli screening will add to the sequence processing time.
Finally in this section, trace2dbest will try to find your traces. First indicate which naming scheme you have used (details are given in section 1.5). Then you must enter the full path to your traces. trace2dbest will then check that every file in the specified directory matches the selected naming scheme. You will be notified of any files that do not match the naming scheme, at which point you may exit trace2dbest and edit the trace file directory, or you may continue, in which case trace2dbest will remove (delete) these files.
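Because non-matching files are deleted if you continue, you may prefer to list them yourself before starting. The sketch below does this for the NERC EG scheme only. The pattern is an assumption based on section 1.5 rather than the check trace2dbest applies, and any file extension is ignored, which is also an assumption.

```python
import re
from pathlib import Path

# Assumed NERC EG pattern (see section 1.5); not trace2dbest's own test.
NERC_EG_NAME = re.compile(r"^[A-Za-z]{2}_[A-Za-z0-9]{3,5}_\d+[A-H](0[1-9]|1[0-2])$")


def nonconforming_files(trace_dir):
    """List the names in trace_dir that do not fit the assumed NERC EG pattern."""
    return sorted(
        entry.name
        for entry in Path(trace_dir).iterdir()
        if not NERC_EG_NAME.match(entry.stem)  # stem = name without extension
    )
```

Calling nonconforming_files with the full path to your trace directory returns a list of names; nothing is renamed or deleted.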



3.3 Section 3 - trace2dbest parameters

In this section you have the opportunity to set the various parameters that control how the traces will be processed. trace2dbest has default values for all the parameters; these defaults are shown in brackets (). To select the default value for any of the parameters, just hit return. For more details of the cross_match parameters, see the cross_match documentation. When defining the number of bases in a poly(A) tail, you should enter a number between 1 and 99 (inclusive). trace2dbest will scan all but the first 150 bases of the sequence for continuous stretches of As equal to or longer than the length you specify. trace2dbest will also scan the reverse sequence for stretches of Ts in a similar way. If found, the poly(A) tail and all sequence after it will be trimmed and this event recorded in the POLYA field of the EST file. If a poly(T) tail is found then all the sequence before it is trimmed. 
You are also given the option of trimming 'spliced leader' sequences from the EST sequences. The C. elegans spliced leader 1 sequence is preloaded; to use it, enter yes. To use another spliced leader sequence, just enter its sequence.
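The poly(A) rule described above can be sketched as follows. This is a simplified illustration and not trace2dbest's implementation: it handles only the poly(A) case, trims at the first qualifying run, and treats upper and lower case bases alike; the function name is hypothetical.

```python
import re


def trim_poly_a(sequence, min_length, skip=150):
    """Trim the first run of at least min_length As found after the first `skip` bases.

    Returns (trimmed_sequence, was_trimmed).
    """
    if not 1 <= min_length <= 99:
        raise ValueError("poly(A) length must be between 1 and 99")
    # Only the part of the sequence after the first `skip` bases is scanned.
    match = re.search("A{%d,}" % min_length, sequence[skip:], flags=re.IGNORECASE)
    if match is None:
        return sequence, False
    # Drop the tail itself and everything after it.
    return sequence[: skip + match.start()], True


est = "ACGT" * 40 + "A" * 20 + "GGG"
trimmed, was_trimmed = trim_poly_a(est, 10)
print(len(est), len(trimmed), was_trimmed)
```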

3.4 Section 4 - Annotation of sequences

In this section you have the option of adding BLAST annotation to your sequences, subject to a bit score cut-off. This annotation is generated by taking the description, score and e-value of the top BLASTx hit. If you wish to add such annotation, you have two options: remote BLAST via NCBI (limited to ~400 BLAST searches per day) or local BLAST. Please note that if, upon starting, trace2dbest could not find the blastcl3 or blastall executables in your PATH then the relevant option will not be available to you. If you choose to carry out a local BLAST then you will be asked to enter the location of your BLAST databases. trace2dbest will then present you with a list of all the protein BLAST databases in this directory. You should select one database by entering the appropriate number.
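To see in advance which annotation options are likely to be offered, you can check whether the two executables are on your PATH. The sketch below is an illustration only; it assumes that blastcl3 serves the remote option and blastall the local one, and the function name is hypothetical.

```python
import shutil


def available_blast_options(find=shutil.which):
    """Map each annotation option to True if its executable is found on the PATH."""
    return {
        "remote (blastcl3)": find("blastcl3") is not None,
        "local (blastall)": find("blastall") is not None,
    }


print(available_blast_options())
```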

3.5 Section 5 - Trace processing

In this section, trace2dbest will run phred to base call the traces, then run cross_match to identify vector and then, optionally, cross_match again to identify E. coli. If an error occurs while running one of these programs, the error message will be reported to the screen. You should check the logfile to identify the cause of the problem.

3.6 Section 6 - Sequence processing

In this section, trace2dbest processes the sequences based on the parameters entered before. trace2dbest will report to the screen if any sequences are rejected because they are too short following trimming. Further details of the trimming performed on the sequences can be found in the process directory created by trace2dbest when it has finished (see section 4.2). trace2dbest will also give details here of the BLAST results, if BLAST annotation is carried out. When processing has been completed, a statistical summary is displayed on screen.

3.7 Section 7 - Submission and saving of files

In this final section, trace2dbest gives you the option to view the merged submission files that it has created. If created, the Pub, Lib and Cont files appear at the top of the submission file followed by the EST files. To exit from the viewing program, type 'q'.  
You will be given the option of emailing the submission file directly to dbEST from trace2dbest. To use this facility you will need to provide the name of your SMTP server for outgoing mail (your local computer support will be able to tell you this). If you are a NERC Environmental Genomics awardee then you should also send confirmation of your submission to the EGTDC (non-NERC EG awardees should ignore this option).
Whether you choose to submit your files or not, they will automatically be saved. When saving your files, trace2dbest will first try to save in the directory /home/db/est_solutions. If this does not exist, trace2dbest will save the files to a directory called est_solutions in your home directory. trace2dbest will inform you of the exact location to which your files and all the other trace2dbest outputs have been saved.
4 trace2dbest OUTPUT AND WHERE IT IS SAVED

4.1 Where are the files saved?

trace2dbest will try to save its output to the directory /home/db/est_solutions. If this is not possible, trace2dbest will save its output to the directory est_solutions in your home directory. We recommend that the directory /home/db/est_solutions is set up, for two main reasons. First, it means that all trace2dbest sessions run on a particular machine will be stored in one location, even if the program is run by different users. Secondly, it allows the other programs in the EGTDC EST pipeline to easily access the data produced by trace2dbest. The directory structure that will be set up within est_solutions is shown here:


To ensure that trace2dbest saves its output in the common area (/home/db/est_solutions), you only need to create the est_solutions directory in /home/db and ensure that all users have read/write permissions to this directory. trace2dbest will create the species, tool, event and output subdirectories.
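The choice between the two save locations can be sketched as below. This is an illustration of the rule described in this section, not trace2dbest's own code; in particular, treating 'not possible' as 'missing or not writable' is an assumption, and the function name is hypothetical.

```python
import os
from pathlib import Path


def choose_output_base(shared="/home/db/est_solutions", home=None):
    """Prefer the shared directory if it exists and is writable, else ~/est_solutions."""
    shared_dir = Path(shared)
    if shared_dir.is_dir() and os.access(shared_dir, os.W_OK):
        return shared_dir
    return Path(home or Path.home()) / "est_solutions"


print(choose_output_base())
```

Running this as each user who will use trace2dbest is a quick way to confirm that the permissions on the shared directory are set as intended.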

4.2 trace2dbest output

In the directory where trace2dbest has saved your files, you will find a comprehensive output consisting of 10 directories and 2 files. These are described here:
dbEST_submission.txt contains the merged dbEST submission files. This file should be emailed to dbEST.
logfile contains progress information from various parts of the trace2dbest process. You generally shouldn't need to look in this file, unless trace2dbest fails unexpectedly.
blast_reports contains two files, blasts_full (containing the full BLAST reports for each sequence) and blasts_tophit (containing just the top hits).
fasta has two types of file for each sequence: .seq - the raw (unprocessed) FASTA format sequence produced by phred, and .seqsc - the .seq files with vector and E. coli replaced by Zs and Xs respectively.
fastafiles contains .fsa files for each sequence processed. These are the processed sequences in FASTA format. These files may be used as the input sequence files for PartiGene, the next stage in the EST processing pipeline.
partigene contains .seq and .qual files for use by PartiGene.
phd_dir has the .phd files produced by phred. These files contain the base quality scores for the sequence.
process contains information on how trace2dbest has trimmed the sequences. The sub-directory traceinfo contains a file for each sequence which details the trimming performed on that sequence. Depending on what trimming has taken place, there may also be files such as polya_trim, vector_trim, quality_trim and adapter_trim, which detail the particular types of trimming. There is also a file giving summary statistics.
qual contains .qual files that contain a matrix of phred quality scores for each base.
raw_traces has a copy of the trace files used for this trace2dbest session.
scf contains the standard chromatogram format files produced by phred.
subfiles contains an individual EST submission file and processed sequence file for each trace processed.  
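After a session you may want to confirm that the expected outputs are all present. The sketch below checks a saved output directory against the names listed above. It is a convenience script, not part of trace2dbest, and it assumes that every directory is created in every session, which may not hold (for example, blast_reports if no BLAST annotation was requested).

```python
from pathlib import Path

# Names taken from the list in this section.
EXPECTED_FILES = ["dbEST_submission.txt", "logfile"]
EXPECTED_DIRS = [
    "blast_reports", "fasta", "fastafiles", "partigene", "phd_dir",
    "process", "qual", "raw_traces", "scf", "subfiles",
]


def missing_outputs(output_dir):
    """Return the expected files and directories that are absent from output_dir."""
    base = Path(output_dir)
    missing = [name for name in EXPECTED_FILES if not (base / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (base / name).is_dir()]
    return missing
```

Pass the location that trace2dbest reported at the end of the session; an empty list means nothing is missing.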

