**************************************************************
DELILA-GENOME README FILE - INSTALLATION AND THE MAIN FEATURES
**************************************************************


**************************************
Software Requirements:
**************************************

This software runs on Linux systems and requires the following.
Perl (version 5 or above)
	type 'perl -v' to check the Perl version.
Bash (version 2)
	type 'echo $BASH_VERSION' to check the Bash version.
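
The two checks above can be combined into one short script. This is only a
sketch: 'version_major' is a helper name made up for this example, and it
assumes that "version 2" for Bash means version 2 or later.

```shell
# Hypothetical helper: print the leading integer of a dotted version
# string, e.g. 5.8.0 -> 5
version_major() {
    echo "${1%%.*}"
}

# Bash sets BASH_VERSION itself; it is empty under other shells.
bash_major=$(version_major "${BASH_VERSION:-0}")
if [ "$bash_major" -ge 2 ]; then
    echo "Bash $BASH_VERSION: ok"
else
    echo "Bash version 2 or above not detected"
fi

# perl -e 'print $]' prints the Perl version as a number such as 5.008008.
if command -v perl >/dev/null 2>&1; then
    perl_major=$(version_major "$(perl -e 'print $]')")
    if [ "$perl_major" -ge 5 ]; then
        echo "Perl: ok"
    else
        echo "Perl is older than version 5"
    fi
else
    echo "perl not found on PATH"
fi
```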


**************************************
Installing the software:
**************************************

To install the software, extract the files from the 'tar'ed and zipped file
'Delgen.tar.gz'. At the command prompt, type
tar xvzf Delgen.tar.gz

If you are reading this file, you have probably done that already.
After extraction, you will see some Perl files (.pl), files with no extension
(most of these are Bash scripts) and three directories: DelgenFront, Genvis
and Docs. Docs contains the documentation for some programs of
Delila-Genome. It also contains the manuscript of the Delila-Genome system
published in the online journal
BMC Bioinformatics 2003 4:38 (published 8 September 2003).

For the documentation of Delila programs, see

http://www.lecb.ncifcrf.gov/~toms/delila/delilaprograms.html

DelgenFront is a Java-based GUI front end to the scan and promotsite
programs. It is described in part 2 of this document.

genvis.pl produces HTML output for BED (UCSC browser custom track display
format) files generated by promotsite. The '.gif' images used in the
generated HTML pages are provided in the directory 'Genvis'.

Delila-Genome uses some of the programs of Delila, so install Delila first.
Then check whether the Delila environment variables are set:
'set | grep -i delila'	(this works in a bash shell)

You can also install only the required Delila programs instead of the
full package, in which case the above command shows nothing. That is
fine.

Modify the PATH environment variable to include the directories of the
Delila package programs as well as the Delila-Genome programs.

######  IMPORTANT ######
Set the correct shebang line in the Bash and Perl scripts.
The Bash scripts use '#!/bin/bash', which should be the same on most
systems.
The Perl shebang line used in the scripts is '#!/usr/bin/perl'.
To find the Perl location on your machine, type
'which perl'. If this value differs from the default, change the shebang
line in each of the Perl scripts in this package. All Perl scripts in this
package end with '.pl'.
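
To see which Perl scripts would need changing, something like the following
may help. It is a sketch only; 'shebang_interpreter' and
'report_perl_shebangs' are names made up for this example and are not part
of the package.

```shell
# Print the interpreter path from the first line of a script,
# e.g. "#!/usr/bin/perl -w" -> /usr/bin/perl
shebang_interpreter() {
    sed -n '1s/^#![[:space:]]*\([^[:space:]]*\).*/\1/p' "$1"
}

# $1 = directory holding the *.pl files, $2 = expected interpreter path.
# Prints "file: interpreter" for every script that differs from $2.
report_perl_shebangs() {
    for script in "$1"/*.pl; do
        [ -f "$script" ] || continue
        found=$(shebang_interpreter "$script")
        if [ "$found" != "$2" ]; then
            echo "$script: $found"
        fi
    done
}

# Example call, run from the installation directory:
#   report_perl_shebangs . "$(which perl)"
```

No output means every '.pl' file already points at the given interpreter.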

Environment variables:
*********************
Some environment variables (case sensitive) must be set for Delila-Genome
to work. This is a VERY IMPORTANT step. Set the variables below in your
login config file.
DELGEN_GENOMES 
- The directory where the delila books for the various genome 
drafts are installed.
DELGEN_RESULTS 
- The directory where the results of scan and promotsite will
  be stored when DelgenFront is used.
DELGEN_CONP
- The number of concurrent processes to run. Generally, each process
  indicates a series of operations on a chromosome. If this value is not
  set, then the jobs are performed in sequential order of the
  chromosomes i.e. the default is 1. 
- Set this value based on the performance of your machine
DELGEN_LOADBAL
- If this variable is set to 1, load balancing is enabled. If multiple
  computational nodes are connected to your master node, the genome scans
  are distributed over these nodes based on the chromosome sizes.
  This feature is currently supported only on Redhat Scyld
  (a cluster version of Linux) systems.
  To check whether your system has the required features of Scyld, type
  
  which mpprun
  which beostat
  
  If either of the above commands fails, the feature is switched off.

  CAVEAT: Even if Scyld is installed on your system and you have set up the
  delila books correctly (explained later), you may get incorrect results,
  a "book not found" error, or a crash in 'scan' or some other program.
  This happens because, although all the partitions on your system are
  available to your master node, (generally) only the /home partition is
  available to the slave nodes. If your books or data are elsewhere, you
  get erroneous results. Ask your system administrator to set up the
  partitions so that the master and slave nodes have access to the genomes
  and results directories.

  Because of this, only executable binaries may run on the slave nodes;
  some slave nodes may not have the Perl environment set up. For this
  reason, our package load balances only the runs of the scan program
  (which takes up most of the time).
	
PERL5LIB
- This is the directory where the Perl module files ("*.pm") are
  installed after unpacking the package. The *.pm files are in the same
  directory as the *.pl files unless you move them elsewhere. In bash,
  set this variable as follows, replacing DELGEN_INSTALLATION_DIR with the
  actual installation directory:
  export PERL5LIB=$PERL5LIB:DELGEN_INSTALLATION_DIR
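
As a rough illustration, the login config lines and a quick check could look
like the sketch below. The results path and the values for DELGEN_CONP and
DELGEN_LOADBAL are examples only, and 'missing_delgen_vars' is a helper name
made up here. The check treats DELGEN_CONP and DELGEN_LOADBAL as optional,
which is an assumption based on the descriptions above.

```shell
# Example login-config lines (adjust every path and value to your system):
#   export DELGEN_GENOMES=/usr/local/delgen/genomes
#   export DELGEN_RESULTS=/usr/local/delgen/results
#   export DELGEN_CONP=2
#   export DELGEN_LOADBAL=0

# Print the names of the variables that are unset or empty.
missing_delgen_vars() {
    for name in DELGEN_GENOMES DELGEN_RESULTS PERL5LIB; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

missing_delgen_vars
```

If the last command prints nothing, the three variables are set in the
current shell.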

*******************************************
1. Setting up the genome draft delila books
*******************************************

Before executing the programs, we need the delila books set up for the
genome sequences downloaded from the UCSC web site.

Directory structure for the genome drafts books:
*********************************************************
First download the genome sequence fasta files for each chromosome
and the corresponding mrna annotation database containing the accession
numbers from the UCSC web site.

Create a Delila book for each chromosome of an organism and store it as
DELGEN_GENOMES/ORG/RELEASE/chrm_N/book
Store the corresponding mrna database as
DELGEN_GENOMES/ORG/RELEASE/chrm_N/mrna.txt

In the above file paths, DELGEN_GENOMES is the environment variable set
previously.
Ex:
If DELGEN_GENOMES=/usr/local/delgen/genomes, then

/usr/local/delgen/genomes/human/apr03/chrm_1/book
/usr/local/delgen/genomes/human/apr03/chrm_1/mrna.txt

For all chromosomes, the directory prefix is 'chrm_'.
NOTE: It is important to retain the above directory structure, in which a
'book' and a 'mrna.txt' file are stored in the corresponding 'chrm_N'
directory, where N reflects the chromosome number. Please read the
description of 'chrmlist' given below to assign the correct value of N.
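
The layout can be expressed as a small path-building function. This is only
an illustration of the naming rule; 'chrm_dir' is a made-up helper, and the
fallback base directory is the example path used above.

```shell
# $1 = genomes base dir, $2 = ORG, $3 = RELEASE, $4 = numeric suffix N
chrm_dir() {
    echo "$1/$2/$3/chrm_$4"
}

# Use DELGEN_GENOMES if it is set, otherwise the example path from the text.
base=${DELGEN_GENOMES:-/usr/local/delgen/genomes}
echo "$(chrm_dir "$base" human apr03 1)/book"
echo "$(chrm_dir "$base" human apr03 1)/mrna.txt"
```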

To make the above process easier, there are some scripts.

Step1. - First create the appropriate directory
***********************************************

For ex:
If you are downloading the human april 2003 release from UCSC, create the
directory
/usr/local/delgen/genomes/human/apr03

############# VERY IMPORTANT STEP #############
Step2.- Create a file 'chrmlist'
********************************
Create a file called 'chrmlist' in the above directory. This file
gives the mapping between a directory name and a chromosome name. This was
created to facilitate easy processing of non-numeric chromosomal names like
the human X and Y.

There are two columns in this file. 
The first column contains the designated_chrm_dir_suffix. All values in this
column are NUMERIC.
The second column contains the actual chromosome name.

Ex:
*Chrm file starts here
*Comment lines start with a *
*ChrDirSuffix ChrName
21 21
22 22
23 X
24 Y

From the above file, we can see that chromosome X is mapped to directory
chrm_23 and Y to directory chrm_24.
### READ CAREFULLY ###
*DO NOT USE TABS IN THIS FILE. USE SPACES TO SEPARATE THE COLUMNS.
*USE ONLY NUMERALS IN THE FIRST COLUMN UNLESS IT IS A COMMENT LINE.
*DO NOT CREATE CROSSLINKED ENTRIES LIKE
1 2
2 1

Note: This file is important for the working of many Delila-genome scripts.
runscan, runscandiff, runf2r, getchrfiles and mklnchr use this file.
Create it correctly. A sample 'chrmlist' file is provided with this package.
The above example has mappings just for 4 chromosomes. Include the mappings
for all the chromosomes.
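
The rules above can be checked mechanically before running any of the
scripts. The following is a sketch, not a tool shipped with the package;
'check_chrmlist' is a made-up name. It is stricter than the rules stated
here in one respect: blank lines are reported as not having two columns.

```shell
# $1 = path to a chrmlist file.
# Prints one line per problem found; prints nothing if the file looks fine.
check_chrmlist() {
    awk '
        /^\*/ { next }     # comment line
        /\t/  { print "line " NR ": contains a tab"; next }
        NF != 2 { print "line " NR ": expected 2 columns"; next }
        $1 !~ /^[0-9]+$/ { print "line " NR ": first column is not numeric"; next }
        {
            if ($1 in seen_dir)
                print "line " NR ": duplicate directory suffix " $1
            seen_dir[$1] = $2
        }
        END {
            # Cross-linked: suffix A names chromosome B while suffix B names A.
            for (a in seen_dir) {
                b = seen_dir[a]
                if (b != a && (b in seen_dir) && seen_dir[b] == a && a < b)
                    print "cross-linked entries: " a " and " b
            }
        }
    ' "$1"
}

# Example call:
#   check_chrmlist /usr/local/delgen/genomes/human/apr03/chrmlist
```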

Step3. - Download files from UCSC web site
******************************************
If you are downloading the latest release (april 2003) of the human genome 
sequence, type 

'getchrfiles 10april2003 &'

For the mouse genome sequence, this would be 
'getchrfiles mmFeb2003 &'
The arguments 10april2003 and mmFeb2003 are directories in 
ftp://genome.cse.ucsc.edu/goldenPath. For more details type
'getchrfiles -h'

'getchrfiles' downloads both the genome fasta files and the annotation
files and unzips these.

Step4. - Create appropriate links
*********************************
Create the directories for each of the chromosomes and prepare to convert
the fasta files to delila books. For this, type

'mklnchr -s .'

The above command creates a directory for each chromosome listed in
column 1 of the 'chrmlist' file, with a prefix of 'chrm_'. For example,
an entry of 10 in column 1 creates a directory called 'chrm_10'.
Two links are created in each directory:
fsequ - points to the fasta file
mrna.txt - points to the corresponding annotation file

In the command, '.' specifies that the links are to be created for files 
present in this directory (i.e. to the fasta and annotation files downloaded 
previously). The '-s' switch creates symbolic links instead of hard links.
This has the advantage that links can be created across multiple partitions.

Step5. - Create delila books
****************************
Now you are ready to create the delila books. Type,

'runf2r &'

and that's it. The '&' runs the command as a background process, because
reading and writing 3GB of data takes some time.

If it's lunch time, take a long break and come back. Your delila books 
should be ready and shining by then.

For more options with runf2r, type ... you guessed it!
'runf2r -h'
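
Once runf2r has finished, it may be worth confirming that every directory
listed in 'chrmlist' has its two files. The sketch below assumes the layout
from part 1 ('book' and 'mrna.txt' inside each 'chrm_N'); 'missing_books' is
a made-up helper and not part of the package.

```shell
# $1 = release directory holding chrmlist and the chrm_N directories.
# Prints every expected book or mrna.txt that is absent or empty.
missing_books() {
    grep -v '^\*' "$1/chrmlist" | while read -r suffix name; do
        [ -n "$suffix" ] || continue
        for f in book mrna.txt; do
            [ -s "$1/chrm_$suffix/$f" ] || echo "$1/chrm_$suffix/$f"
        done
    done
}

# Example call:
#   missing_books /usr/local/delgen/genomes/human/apr03
```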


*****************************
2. Running a genome wide scan
*****************************
There are two ways to do this. 
You can either use the command line options of the 'runscan' program or use
the DelgenFront Java GUI client to submit a job.

Submitting a job using 'runscan'
********************************
First create a directory where you want the results stored. 
Create the ribl, scanp and psparams files. Ribl and scanp are inputs to
'scan', while psparams is the parameter file of 'promotsite'.
Consult the respective documentation for creating these files.
Now type 

'runscan LIBSRCDIR 3 &' 

where LIBSRCDIR is a directory inside the directory specified by the
$DELGEN_GENOMES environment variable. '3' indicates that both scan and
promotsite are run. Use the '-h' switch to find out more.
If you installed the delila books in the format suggested in part 1 of 
this document, LIBSRCDIR would be similar to
ORG/RELEASE
where ORG is the organism name like 'human' and RELEASE is the draft version.

To run scan/promotsite on only a range of chromosomes, type
'runscan LIBSRCDIR program_option chrmst chrmend &'
where chrmst and chrmend are column 1 or column 2 entries in 
LIBSRCDIR/chrmlist.

Ex: 'runscan human/apr03 3 22 Y &'

This runs scan and promotsite on chromosomes 22, X and Y.

You might consider the following command if you plan to log out before the
runscan completes.

'nohup runscan human/apr03 3 22 Y &' 
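
To preview which chromosomes a given chrmst/chrmend pair covers, the range
can be looked up in 'chrmlist' directly. The sketch below is not how runscan
is implemented; it assumes the range follows the order of the lines in
'chrmlist', and 'chrm_range' is a made-up helper name.

```shell
# $1 = chrmlist file, $2 = start, $3 = end (column 1 or column 2 values).
# Prints "suffix name" for each entry from start to end, in file order.
chrm_range() {
    awk -v st="$2" -v en="$3" '
        /^\*/ || NF < 2 { next }              # skip comments and short lines
        ($1 == st || $2 == st) { on = 1 }     # start of the range
        on { print $1, $2 }
        ($1 == en || $2 == en) { if (on) exit }   # end of the range
    ' "$1"
}

# Example call:
#   chrm_range "$DELGEN_GENOMES/human/apr03/chrmlist" 22 Y
```

With the sample 'chrmlist' from part 1, the example call lists the entries
for 22, X and Y.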


DelgenFront - A Java based GUI front end to runscan
***************************************************

Step1. Start the DelgenFront server
***********************************
Before you use the client, you need to set up the delgen server on your
system. This is a very lightweight version of a web server. We will move to
using Apache shortly.

The server is the 'server.pl' script in the package.
By default, it listens on port 7070. To use a different port, open the file
'server.pl' and change the '$Port' variable declared in the program. 
You may need to CONSULT YOUR SYSTEM ADMINISTRATOR about this because,
right now, no security features for user authentication are used.

While the server is up and running, anyone on the internet can access it
and mount a denial of service attack. Hence, start the server only when
needed. If you are on a private network with private IP addresses, you
are probably safe, unless someone at your site wants to crash the system
deliberately.

The server needs the 'DELGEN_RESULTS' environment variable to be set.
This is the base directory where the results of scans submitted through 
DelgenFront are stored.
In this directory, a 'server.log' file is generated which can be used for
diagnostic purposes. The contents might be a little cryptic if multiple jobs
are executing simultaneously, but this will be remedied in the next release.

The server also needs the 'DELGEN_GENOMES' environment variable to be set.
This is where the actual genome drafts are installed.
######### IMPORTANT ##############
In the 'DELGEN_GENOMES' directory, store a file called 'genversions' which 
lists all the genome drafts installed. Given below is a sample 
'genversions' file.

human/nov02
human/apr03
mouse/feb03

The entries reflect actual directories in the genomes directory specified by
the environment variable DELGEN_GENOMES.
Note! DO NOT insert any empty lines or comment lines in this file.
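
A quick way to catch a bad 'genversions' file is to test every line against
the genomes directory. This is a sketch under the rules just stated;
'check_genversions' is a made-up helper. A comment line is reported as a
missing directory, so no comment syntax needs to be assumed.

```shell
# $1 = genomes base directory (the value of DELGEN_GENOMES).
# Prints one line per problem; prints nothing if the file looks fine.
check_genversions() {
    n=0
    while IFS= read -r entry || [ -n "$entry" ]; do
        n=$((n + 1))
        case $entry in
            '') echo "line $n: empty line" ;;
            *)  [ -d "$1/$entry" ] || echo "line $n: no directory $1/$entry" ;;
        esac
    done < "$1/genversions"
}

# Example call:
#   check_genversions "$DELGEN_GENOMES"
```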

Once the server is started, it disassociates from the terminal. To kill the
server, you have to type

'kill -9 processID'

because the server does not respond to other signals. 

Step2. Using the front end
***************************
The front end client is provided in the directory 'DelgenFront' of this
package as the file 'DelgenFront.jar'. While the server has to run on the
Linux machine where the package is installed, the front end can run on any
system with Java 1.4 installed. For this to work, there must be a network
connection between the front end machine and the server machine.


There are two ways to start the front end.
First, go to the directory where the .jar file is present.

Using the command line: 
Type,
java -jar DelgenFront.jar
(or)
Just double-click on the 'DelgenFront.jar' icon. 

In the initial GUI screen of DelgenFront, enter the network address and the 
port number (7070 by default) of the server and the program to start on the
server (scan, promotsite or both). A list of genome drafts installed at the
server is pulled and displayed. The user selects one of the drafts and fills
in additional parameters to start the required job.


******************************************************
3. Comparing two different scan results using scandiff
******************************************************

Scandiff compares two different scan results. It takes as input the
'psdataop' files of the two scans generated by promotsite.
To do a genome wide comparison, use runscandiff, a script that calls
scandiff for each of the chromosomal scans.

Suppose the results of running scan and promotsite are available in two
different directories, A1 and A2. To execute runscandiff on them, proceed as
follows.

First create the scandiff parameters file called 'scandiffp'. For creating
this file, see the documentation of 'scandiff.pl' option '-f'.
Specify all scandiff.pl options in scandiffp just as you would enter them
on the command line, but leave the '-c' and '-f' options out of this
file. Also leave out the data1 and data2 file names.
An example is given below.
If the command for scandiff.pl on chrm Y would be
scandiff.pl -c Y -s 1 -o FsBC -D -B data1 data2
then the scandiffp contents for using runscandiff are
-s 1 -o FsBC -D -B

This is because runscandiff plugs in the value of -c automatically, based
on the chromosome it is processing.
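
The relationship between 'scandiffp' and the per-chromosome command can be
pictured with the sketch below. It only prints the equivalent command line
and does not run anything. It is not taken from runscandiff, which may pass
the file through '-f' instead; 'scandiff_command' is a made-up helper name.

```shell
# $1 = chromosome name, $2 = scandiffp file, $3 and $4 = the two data files.
# Prints the command line that the options in scandiffp correspond to.
scandiff_command() {
    echo "scandiff.pl -c $1 $(cat "$2") $3 $4"
}

# Example call:
#   scandiff_command Y scandiffp data1 data2
```

With the 'scandiffp' contents shown above, the example call prints the full
scandiff.pl command given for chrm Y.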


Then to start the execution, type,

runscandiff LIBSRCDIR A1 A2 &

LIBSRCDIR is relative to DELGEN_GENOMES, as in 'runscan'. A1 and A2 should
contain the ribl, scanp, and psparams files used in the chromosomal
scans. 

***************************************************
4. Using genvis to create HTML files from BED files
***************************************************

Genvis takes as its input a BED file generated through promotsite or
scandiff and the 'psparams' file used in the creation of that BED file. It
generates a BED file containing a subset of the sites listed in the 
input BED file, and an HTML file that can be used to link the output BED 
file to various sites of interest such as NCBI, Stanford SOURCE, and the 
UCSC genome browser.

Genvis can sort the binding sites in the output files based on their start
coordinates or their binding strengths. This option makes use of the
'sortBED' script, which runs on a Linux machine. If this option is not used,
genvis can be used in a Windows environment with Perl installed.
Currently, the 'sortBED' script sorts only BED files with a single track
definition (see the UCSC genome browser site for a definition of the BED
format). Hence, BED files with multiple tracks give erroneous results. This
will be remedied shortly. 
######## IMPORTANT #########
Based on a user option, scandiff generates a combined BED file which has
multiple tracks in it. This file cannot be used as input to Genvis if the
sort option is enabled. Other BED files generated by scandiff containing a
single track can be used.
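
Before enabling the sort option, the number of track definitions in a BED
file can be counted. The sketch below relies on the UCSC custom track
convention that a track definition line begins with the word 'track'. The
helper names are made up, and treating a file with no track line as
sortable is an assumption.

```shell
# Count the track definition lines in a BED custom-track file.
bed_track_count() {
    awk '/^track([ \t]|$)/ { n++ } END { print n + 0 }' "$1"
}

# Succeeds when the file has at most one track definition.
bed_sortable() {
    [ "$(bed_track_count "$1")" -le 1 ]
}

# Example use:
#   if bed_sortable result.bed; then echo "sort option is safe"; fi
```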

The HTML files contain certain images which are in the 'Genvis' directory of
this package. When using the generated HTML page, copy these files to the 
directory that contains the HTML file.
Some features of genvis, like the 'sort' option, work only in Unix/Linux
environments and not on Microsoft Windows.
