BIRCH/Release To Do list
From Bioinformatics.Org Wiki
Future Releases
Installation and Updating
getbirch
- birchtally.cgi - Currently uses the Python cgi library, which was deprecated in Python 3.11 and is removed in Python 3.13. The current latest version is 3.12.1. However, we have a little more time, since typical OS releases are several Python versions behind.
We should also look a bit at security. Can we make the cgi script executable but not readable? Would it be more secure to compile it as a pyc file?
See https://docs.python.org/3/library/cgi.html#module-cgi for alternatives to the cgi module.
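As a starting point, here is a minimal sketch of how birchtally.cgi might read its form data without the cgi module, using only urllib.parse from the standard library. The field name "host" is made up for illustration, and the sketch only covers urlencoded forms (no file uploads), which is an assumption about what birchtally.cgi receives.

```python
import os
import sys
from urllib.parse import parse_qs


def read_form(environ, stdin):
    """Return form fields as a dict of lists, for GET or POST requests."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # CONTENT_LENGTH gives the size of the form data. Urlencoded
        # data is ASCII, so characters and bytes coincide here.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        raw = environ.get("QUERY_STRING", "")
    return parse_qs(raw, keep_blank_values=True)


def first(fields, name, default=""):
    """Return the first value of a field, or the default if absent."""
    return fields.get(name, [default])[0]


if __name__ == "__main__":
    fields = read_form(os.environ, sys.stdin)
    print("Content-Type: text/plain")
    print()
    print("host=" + first(fields, "host", "unknown"))
```

The helper first() plays the role that getfirst() played with cgi.FieldStorage, so the rest of the script would need few changes.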
- The framework and binary files are very big, and downloads are often unreliable on WiFi. Two possible fixes:
- split files into several smaller files
- does Java have methods to resume downloads if they fail?
- GetBirch Wizard - need to do better error checking and print better error messages during downloads, unzipping and untarring. In part this is because sometimes downloads are incomplete. There is no point in proceeding if these steps fail.
- CRITICAL! Check for Python2 during install.
- CRITICAL! Check for Java (non-headless) during install.
- CRITICAL! On flamingo, updating BIRCH changes the permissions of local/admin/platform.profile.source so that the file is no longer world-readable. What causes this?
These changes can be put into operation without waiting for a new release, since they don't affect the stable release files.
The question is, how?
- Wrap getbirch in a script, and check for what we need there?
- Just make it very prominent on the download page that you need to have these installed? For example, clicking on GETBIRCH could pop up a new web page that tells the user what must be installed, and how to test for it?
- Is there a way to do this with JavaScript in the browser?
- In any case, we need to revise the Install with GetBirch page to show the current look of the GetBirch wizard.
- record of downloads - add failover to a second host, if sending data to flamingo fails
- option to get onto mailing list
- Java 1.8 on Ubuntu - Default doesn't work with getbirch.jar. Need to comment out a line in /etc/java-8-openjdk/accessibility.properties as described in [1]. This really sucks, because there doesn't seem to be a non-root way to fix this. The -Djavax switch recommended by another responder doesn't appear to work.
Test whether the JDK is headless:
dpkg -l | grep openjdk
ii openjdk-8-jre-headless:amd64 8u171-b11-0ubuntu0.18.04.1 amd64 OpenJDK Java runtime, using Hotspot JIT (headless)
Solution: Document on GetBirch site that JDK MUST be full JDK, and not headless. If getbirch.jar won't run, install full JDK
Debian/Ubuntu:
sudo apt-get remove openjdk-8-jre-headless
sudo apt-get install openjdk-8-jre
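The dpkg test above could be automated in an install-time check. The sketch below parses the text of a "dpkg -l" listing and reports whether only headless OpenJDK packages are installed. It assumes the Debian/Ubuntu package naming seen above (openjdk-N-jre, openjdk-N-jre-headless, and the jdk equivalents); other distros would need their own test.

```python
import re


def headless_only(dpkg_listing):
    """Return True if the listing shows headless OpenJDK packages only."""
    headless = False
    full = False
    for line in dpkg_listing.splitlines():
        fields = line.split()
        # Installed packages have status "ii" in the first column.
        if len(fields) < 2 or fields[0] != "ii":
            continue
        name = fields[1].split(":")[0]  # drop the ":amd64" suffix
        if re.fullmatch(r"openjdk-\d+-(jre|jdk)-headless", name):
            headless = True
        elif re.fullmatch(r"openjdk-\d+-(jre|jdk)", name):
            full = True
    # The full package depends on the headless one, so both appear
    # together on a system where the full JRE is installed.
    return headless and not full
```

A wrapper around getbirch could run "dpkg -l", pass the output to headless_only(), and print the apt-get commands above when it returns True.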
Updating
- who do we notify about new versions of BIRCH, and how do we do it?
Desktop setup
- Need to update desktop setup tutorials to reflect current Linux desktops.
- GNOME3 - avoid like the plague! If you must use, also install Fedy on Fedora. Is Fedy available for RedHat?
- GNOME Classic - preferred to GNOME3
- Xfce
- Need a mechanism for figuring out which desktops are available, and putting the appropriate HTML documentation files into index.html when customdoc.py runs.
local files
- update-local - automated change of local GDE menus to PCD menus
- mechanism for adding a .obsolete extension to files or directories no longer supported, comparable to NewLocalFiles.list
Dependencies for 3rd party programs
First, we need to begin a list of programs that have special dependencies:
program | dependency |
Weblogo | numpy |
Weblogo | ghostscript (for formats other than eps) |
It's probably a good idea to put this into birchdb, once the necessary fields for the table become stable.
Next, we need a way to detect each dependency during an install or update, and to somehow inform the user of the dependency.
Ideally, there would be a way to install the dependency during install or update, if the user is the BIRCH administrator. However, that would be a lot of work, and may not be stable. There could also be version dependencies, such as needing Python 3, or Java 8 or greater.
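Detection could start out very simply. The sketch below checks a small dependency table using only the standard library. The table layout (program, kind, name) is a placeholder until the birchdb fields are settled, and it assumes that ghostscript is installed under the command name "gs".

```python
import importlib.util
import shutil

# (program, kind, name) rows. kind is "module" for a Python module,
# or "command" for an executable expected on the PATH.
DEPENDENCIES = [
    ("Weblogo", "module", "numpy"),
    ("Weblogo", "command", "gs"),  # assumption: ghostscript runs as 'gs'
]


def is_available(kind, name):
    """Return True if the named module or command can be found."""
    if kind == "module":
        return importlib.util.find_spec(name) is not None
    if kind == "command":
        return shutil.which(name) is not None
    raise ValueError("unknown dependency kind: " + kind)


def missing_dependencies(table):
    """Return the rows of the table whose dependency was not found."""
    return [row for row in table if not is_available(row[1], row[2])]


if __name__ == "__main__":
    for program, kind, name in missing_dependencies(DEPENDENCIES):
        print("{} needs the {} '{}', which was not found".format(program, kind, name))
```

This only reports what is missing. Version requirements, such as Java 8 or greater, would need an extra column and a per-dependency version test.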
BIRCH
- Time to remove Staden from the local BIRCH site. The binaries are only for Solaris. We should also remove the Staden lines from local-generic local.cshrc/local.profile. Remove lbirchdb and $doc. $dat? install directory?
- Solaris - No one in the bioinformatics world seems to use Solaris. However, it should be noted that Oracle still supports Solaris, which is now at version 11.3. We don't want to be too quick to drop Solaris, because it might make a comeback in bioinformatics servers on the cloud. Oracle also appears to be still developing the SPARC chips. Possibilities:
- drop Solaris support
- Solaris appears to be free to download. Could either install on a PC, or run it in a VM (need a 64-bit VM).
PRIORITY!! Phase out of GI numbers - NCBI will begin phasing out GI numbers over 2016.
- Based on NCBI documentation, need to set a roadmap for BIRCH to comply with the GI phaseout. http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/
- Create GI-free datasets for testing ACCESSION.VERSION compliance. Does NCBI provide a set of test files?
- Test current BIRCH implementation to assess actual impact of GI phaseout.
- Modify scripts to deal with ACCESSION.VERSION.
- 3rd party applications may or may not be affected. In particular, we need to see how ReadSeq is affected.
- How are NCBI Eutils affected?
- Re-test as software is updated.
- uniqid.py
- The TCOFFEE package has a program called seq_reformat that can encode and decode names.
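For the "Modify scripts to deal with ACCESSION.VERSION" item, one approach is a single helper that returns the ACCESSION.VERSION from a FASTA definition line whether or not a GI number is present. The sketch below assumes the legacy "gi|number|db|accession|" layout and the newer layout where the identifier is the first word. The identifiers in the usage lines are made up, and other layouts (pdb entries, for example) are not handled.

```python
import re

# Legacy layout: >gi|12345|gb|AB000001.1| description
GI_DEFLINE = re.compile(r"^>?gi\|(\d+)\|[a-z]+\|([^|\s]+)\|?")


def accession_version(defline):
    """Return the identifier of a defline, skipping any GI prefix."""
    match = GI_DEFLINE.match(defline)
    if match:
        return match.group(2)
    # GI-free layout: the identifier is the first word after '>'.
    words = defline.lstrip(">").split()
    return words[0] if words else ""


if __name__ == "__main__":
    print(accession_version(">gi|12345|gb|AB000001.1| made-up example"))
    print(accession_version(">AB000001.1 made-up example"))
```

Scripts that currently key on GI numbers could call a helper like this first, so the same code path works on both old and new files during testing.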
- link to Linux man pages - definitive site for Linux command manual pages
- For the Unix command summary, make links to specific commands. The downside of the man7.org page is that it has ALL commands, so it's hard to find man pages for the common commands.
More modern looking web site
Requirements
- customdoc.py chooses separate content depending on whether this is a multiuser system or a single user system. This is more subtle than it may seem. For example, if the person installs birch on a lab workstation that is not accessible to the outside world, would we do anything different from a single user who installs on their own account?
Strategy
- For next release: home page only. In subsequent releases, we'll slowly work on modernizing pages
- ONLY use HTML5 with CSS stylesheet. Maybe experiment with Javascript, but there would have to be a compelling reason to do so.
- Start with an alternative web site eg. newindex.html that has to prove it can do everything that index.html does. We need to tread very slowly with the use of styles, because we don't want to break existing pages.
Home Page Layout
We want a lot of links on the front page to make it easy to find things in a few clicks. You don't want to make the user explore a bunch of hierarchical menus.
- Screenshot gallery? One of those graphics that shifts between screen shots to explain visually what BIRCH looks like
Settings for each User
We need to have a directory within the $HOME directory where settings for each user can be stored.
Contents
- Email address for notifications
- Preferred helper applications
- Unix command line prompt
- NCBI_ENTREZ_KEY
Since we set these things anyway, maybe it's best to implement this as a bash source file. This would be executable as well as readable, since bash variables are in the form VAR=VALUE
Variables in $HOME/.config/BIRCH/BIRCHsettings.source | |
variable name | description |
BL_EMAIL | email address for notifications |
BIRCH_PROMPT | Y or N - tells whether to use the BIRCH command line prompt. Y overrides prompt set in .bashrc file. |
BL_TextEditor | text editor for BioLegato |
BL_PDFViewer | PDF viewer for BioLegato |
BL_PSViewer | PostScript viewer for BioLegato |
BL_ImageViewer | bitmap image viewer for BioLegato |
BL_Document | Word Processor for BioLegato |
BL_Spreadsheet | Spreadsheet for BioLegato |
BL_Browser | Web browser for BioLegato |
BL_Terminal | Terminal program for BioLegato |
NCBI_ENTREZ_KEY | API Key for Eutils |
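Because the file is plain VAR=VALUE lines, Python scripts can read it without running bash. The sketch below is one way to do that. It does not expand shell variables such as $HOME inside values, which is a limitation we would have to accept or work around. The email address in the usage lines is a placeholder.

```python
import shlex


def parse_settings(text):
    """Parse VAR=VALUE lines from a bash source file into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        name, sep, value = line.partition("=")
        if not sep or not name.isidentifier():
            continue  # not a simple assignment
        try:
            parts = shlex.split(value)  # removes quotes as the shell would
        except ValueError:
            continue  # unbalanced quotes: ignore the line
        settings[name] = parts[0] if parts else ""
    return settings


if __name__ == "__main__":
    example = 'BL_EMAIL=user@example.org\nBL_TextEditor="gedit --new-window"\n'
    print(parse_settings(example))
```

A caller would look up each variable with settings.get(name, default), so that a missing file or a missing variable falls back to the BIRCH default.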
Where do we put it?
- $HOME/.config/BIRCH - The results below make this directory the best choice. On one hand, it may not matter.
The important thing is that once we commit to a location, we're committed! There's no going back.
Check: Fedora, Debian, MacOSX
- Google searches to see if this question has a somewhat universal answer.
OS/desktop | directory |
RHEL7/Xfce | $HOME/.config |
RHEL7/GNOME | $HOME/.config |
fedora31/GNOME | $HOME/.config |
Ubuntu 18/MATE | $HOME/.config |
Mac OSX | $HOME/.config |
https://specifications.freedesktop.org/basedir-spec/latest/ar01s03.html defines environment variables that specify where files for applications are stored. Config files are usually in a directory within $HOME/.config.
OSX - https://eclecticlight.co/2019/08/28/preference-settings-where-to-find-them-in-mojave/ tells us that the place for preferences on the OSX desktop is ~/Library/Preferences. This directory contains a .plist file for each Mac application. The plist file is a binary. My feeling is that since BIRCH will be using a textfile for settings, we avoid this directory, especially because BIRCH is not an app, but in many respects more like an operating system.
Implementation
- I can't think of any reason that the BioLegato Java application needs to be modified to accommodate, or even know about, this file. Even in the case where we want the user to be able to create his own .blmenus files, the pcd.properties file includes pcd.menus.path, which has a list of paths to check for .blmenus files. Boy, we sure were forward thinking when we set this up!
- Newuser script should create this directory, if it doesn't exist. Probably we don't delete when we run nobirch. As well, we probably need to get newuser to ask for an email address, if one is not already set. The environment variable should be BL_EMAIL.
- At launch, BioLegato should check for $HOME/.config/BIRCH/BIRCHsettings.source. However, we don't need to create it. If it doesn't exist, or the particular setting needed is not in this file, just go with the BIRCH default.
- Need a BioLegato way to change info. Should there be a separate birchadmin for non-administrators? First we get it to work and user can edit files by hand. Then we worry about a BioLegato way of changing parameters. Call it 'birchuser', as opposed to 'birchadmin'?
- Should each BioLegato have a link that launches birchuser, maybe under the name of Settings, or something like that?
- Another approach would be to simply take the existing BLHelper menu item from birchadmin and replace it with a single menu that lets the user set all settings. We could organize it as tabs to add all necessary functionality. That way a separate application wouldn't be needed. It might even be possible to implement with a symbolic link to a file in $dat/birch that had the master copy of the .blmenu file, rather than duplicating this in all BioLegato directories.
- Since both newuser and BioLegato (and maybe other things) need to make sure that this directory exists, there should be a standalone Python script that does that. Maybe it's a toolkit, along the lines of blastdbkit, that does all operations on $HOME/.local/share/BIRCH. BioLegato would call this script, as would newuser etc.
Strategy
1. Create a script that creates this directory if necessary, and populates it with settings
2. Get newuser to run it.
3. Get BioLegato to run it
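Step 1 might look something like the sketch below. It creates the settings directory and writes a default BIRCHsettings.source only if one is not already there, so it is safe for both newuser and BioLegato to call it. Honouring XDG_CONFIG_HOME follows the freedesktop specification cited above and is an assumption, as are the default values shown.

```python
import os


def settings_dir(environ=os.environ):
    """Return the BIRCH settings directory for the current user."""
    base = environ.get("XDG_CONFIG_HOME") or os.path.join(environ["HOME"], ".config")
    return os.path.join(base, "BIRCH")


def ensure_settings(environ=os.environ, defaults=None):
    """Create the directory and settings file if needed; return the file path."""
    directory = settings_dir(environ)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "BIRCHsettings.source")
    if not os.path.exists(path):
        # Never overwrite an existing file: the user may have edited it.
        with open(path, "w") as handle:
            for name, value in sorted((defaults or {}).items()):
                handle.write('{}="{}"\n'.format(name, value))
    return path


if __name__ == "__main__":
    print(ensure_settings(defaults={"BIRCH_PROMPT": "Y"}))
```

Steps 2 and 3 then reduce to calling this script from newuser and from the BioLegato launch scripts.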
Python
There should be a common Python directory for installed Python packages. We already have $BIRCH/python and should add $BIRCH/local/python. Added $BIRCH/local-generic/python.
Forget the idea of a common Python directory. Because many Python modules such as numpy require compilation of C code, libraries cannot be assumed to be portable across platforms. We will phase out use of $BIRCH/python and $BIRCH/local/python in favor of lib-linux-x86_64/python and lib-osx-x86_64/python.
Python libraries
- Bioconvert - Bioconvert is a collaborative project to facilitate the interconversion of life science data from one format to another. Bioconvert currently contains 44 formats and 95 conversions.
- BioPython - We should probably have a complete BioPython in each lib-xxx-x86_64 directory. There are too many dependencies with platform-specific object libraries to put this in $BIRCH/python. It could end up saving a lot of headaches in the long run.
Python3
- The goal should be to make BIRCH completely compatible with Python3.
- From Ubuntu 18.04 Release notes: "Python 2 is no longer installed by default. Python 3 has been updated to 3.6. This is the last LTS release to include Python 2 in main."
- It probably makes sense to do a batch run of all .py scripts using 2to3, to create an inventory of scripts that need to be updated.
- By now, we can probably count on Python3 being available on all systems. This may make it possible to avoid asking the user to install Python3.
- Spoke too soon. Scientific Linux 7 does NOT have Python3 installed by default. Say what? Really? At the same time, RHEL8 will NOT have Python2, by default. Sheeesh!
- It looks like __future__ is the preferred method for allowing Python2 code to use Python3 features.
- Most features of Python3 that potentially conflict with Python2 have been backported to python2.7.
Is there a way, on a script by script basis, to force use of Python3? That way, as we progress, we can focus on developing for Python3, and do 2to3 conversions that are not backward compatible with Python2.
By now, we can count on Python3 being available, but not necessarily being the system default. We need to explicitly call Python3 in those cases where Python3 is required.
Ubuntu 22 - There is no 'python' command on Ubuntu 22. There is a 'python3' command in the default install, but you have to explicitly say 'python3'. If you want python2, you install the python2 package and type 'python2'. You can get 'python' to give you python3 if you install the python-is-python3 deb package.
Two possibilities:
- change every BioLegato menu and every python script to explicitly use python3. It is not entirely obvious that this is the best solution, since eventually, all systems will have a 'python' command that gives you python3. (Won't they?)
- add mention of 'python-is-python3' to required packages. As python3 becomes the default, the need for this package should go away.
Machine | Python3 version |
---|---|
flamingo | 3.6.9 |
canola | 3.8.10 |
brassica | 3.10.6 |
triticum | 3.10.6 |
peacock | 3.7.2 |
CCL | 3.6.8 |
maui | 3.8.10 |
fedora31 | 3.7 |
wotan | 3.6.8 |
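On the question of forcing Python3 script by script: the usual approach is a "#!/usr/bin/env python3" first line, plus a version check for the case where the script is launched as "python script.py" anyway. The sketch below shows such a guard. The minimum of 3.6 is an assumption taken from the oldest version in the table above.

```python
#!/usr/bin/env python3
import sys


def require_python3(version_info=sys.version_info, minimum=(3, 6)):
    """Return an error message if the interpreter is too old, else None."""
    if tuple(version_info[:2]) < minimum:
        return "This script needs Python {}.{} or newer; found {}.{}".format(
            minimum[0], minimum[1], version_info[0], version_info[1])
    return None


if __name__ == "__main__":
    problem = require_python3()
    if problem:
        sys.exit(problem)  # prints the message and exits with status 1
```

The guard itself has to stay free of Python3-only syntax, otherwise Python2 would fail with a syntax error before the check runs.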
Compatibility issues
- Handled by 2to3
- Print as a function - 2: print "text" 3: print("text") (parentheses required)
- chmod octal format - 2:os.chmod(PROGFN, 0664) 3:os.chmod(PROGFN, 0o664)
- Dict keys as list - 2:PROGLIST = PROGDICT.keys() 3:PROGLIST = list(PROGDICT.keys())
- In Python3, urllib has changed from Python2. See discussion of future.moves and six.moves, as well as examples of urllib intercompatibility at http://python-future.org/
- if os.environ.has_key('BIRCH') replaced by if 'BIRCH' in os.environ
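The fragment below collects the issues listed above into forms that should run unchanged under both Python 2.7 and Python 3. It is a sketch rather than code taken from any BIRCH script, and the function names are invented for illustration. A real script would also begin with "from __future__ import print_function", as noted earlier.

```python
import os

try:
    from urllib.request import urlopen  # Python 3 location
except ImportError:
    from urllib2 import urlopen  # Python 2 location


def birch_dir(environ=os.environ):
    """Replacement for os.environ.has_key('BIRCH')."""
    if "BIRCH" in environ:
        return environ["BIRCH"]
    return None


def program_names(progdict):
    """list() makes keys() a real list in both Python versions."""
    return sorted(list(progdict.keys()))


def make_group_writable(path):
    """The 0o664 octal form is accepted by Python 2.6+ and Python 3."""
    os.chmod(path, 0o664)
```

The try/except import avoids a dependency on six, which may be a simpler fix for birch_install.py and setplatform.py than adding a third-party package.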
BIRCH Python compatibility
admin
- newuser.py
- nobirch.py
install-scripts
Python2&3 compliant:
- birchhome.py
- UNINSTALL-birch.py
- Update_birch.py
- update_local.py
Not yet compliant:
- birch_install.py - needs urllib.request (try fixing with six)
- setplatform.py - needs urllib.request (try fixing with six)
- test.py - imports urllib but doesn't directly call it. Do we need this declaration?
scripts
Python2&3 compliant:
- BIRCHSettings.py
- blastdbkit.py
- BLHelper.py
- createlauncher.py
- customdoc.py
- htmldoc.py
- linklauncher.py
- vncsetup.py
3rd Party Python3 compatibility
How to set PYTHONPATH
The pip command installs Python packages from repositories. By default, they are installed system-wide, but we want to install them in $BIRCH/python. For example, to install the package gffutils, we type
pip3 install --install-option="--prefix=$birch/lib-$BIRCH_PLATFORM/python" gffutils
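In python3.9 (and others?) --install-option is no longer supported. This depends on the pip version rather than the Python version: pip removed --install-option in version 23.1. Instead,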
pip3 install --root $birch/lib-$BIRCH_PLATFORM/python gffutils
Will have to test to see if --root is in older pip versions.
All packages installed in this manner will be in $BIRCH/lib-$BIRCH_PLATFORM/python.
We would have to add the PYTHONPATH environment variable to profile.source, cshrc.source etc.
PYTHONPATH=$birch/lib-$BIRCH_PLATFORM/python/lib/python3.5/site-packages
export PYTHONPATH
To install a package in $BIRCH/local
pip3 install --install-option="--prefix=$birch/local/lib-$BIRCH_PLATFORM/python" gffutils
Platform-dependent Python
In some Python packages (eg. cutadapt), platform-specific libraries (eg. C, C++) are part of the package, usually as .o files. These can be buried several directories down in the package, but they are there.
For such cases, we install in platform-specific python directories:
Linux-x86_64
pip3 install --install-option="--prefix=$birch/lib-linux-x86_64/python"
Mac-OSX
pip3 install --install-option="--prefix=$birch/lib-osx-x86_64/python"
Setting PYTHONPATH then becomes
PYTHONPATH=$birch/lib-$BIRCH_PLATFORM/python/lib/python3.5/site-packages
export PYTHONPATH
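The python3.5 in this path is hard-coded, which will break as soon as the system Python is a different version. A small helper could compute the directory from the running interpreter, as sketched below. It assumes the lib/pythonX.Y/site-packages layout shown above, which is what we have seen from pip so far but should be verified on each platform.

```python
import os
import sys


def site_packages(prefix, version_info=sys.version_info):
    """Return the site-packages directory under prefix for this Python."""
    version = "python{}.{}".format(version_info[0], version_info[1])
    return os.path.join(prefix, "lib", version, "site-packages")


if __name__ == "__main__":
    # The prefix is passed as the first command line argument.
    print(site_packages(sys.argv[1]))
```

profile.source could then set PYTHONPATH from the output of this script rather than from a fixed string.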
Portability
One of the big drawbacks with Python is that packages are expected to be installed in the root hierarchy with root privileges. pip3 often fails when trying to install in local directories. Worse, it seems that each package must be wedded to a particular version of Python eg. 3.5, 3.6, 3.7...
These problems are even true within a platform eg. from one Linux distro to another.
Potential solutions:
- Nuitka - Billed as a Python compiler. Claims to create an executable file that is fully standalone. I'll believe it when I see it, but it's worth a try.
- package creators - There are various 3rd party programs that will create a single file that has all the dependencies
- PyInstaller
- probably a better description of how to use PyInstaller
- Create a Python wheel file
python4 not in the cards
There will be no python 4. Hooray!
https://www.techrepublic.com/article/programming-languages-why-python-4-0-will-probably-never-arrive-according-to-its-creator/
Java
- needs to be done in NetBeans
- need to re-compile all .jar files
- test all non-BIRCH Java programs
Log4j2 vulnerability
At present there are no known vulnerabilities with the Java programs distributed as part of BIRCH. To a large extent, this is due to the intrinsic design of BIRCH. BIRCH does not run web servers or peer-to-peer functions. All applications run with end-user privileges only. Some applications do post requests to web services, either through URL queries or a Java API. Some applications do logging using log4j, but output is written to local files owned by the user.
Out of an abundance of caution, we will recompile existing applications where possible, or obtain updated jar files from the authors. Scans will be done that can detect applications using log4j2, including jar files (which are really just zipped archives, so they can easily be scanned.)
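Since jar files are zip archives, a scan can be done with the Python zipfile module. The sketch below walks a directory tree and lists jar files that contain the JndiLookup class from log4j-core, which is the class involved in the log4j2 vulnerability. It does not open jar files nested inside other jar files, so a clean result is not a guarantee.

```python
import os
import zipfile

# Class that marks the presence of log4j-core inside a jar file.
MARKER = "org/apache/logging/log4j/core/lookup/JndiLookup.class"


def jar_uses_log4j2(path):
    """Return True if the jar at path contains the log4j2 marker class."""
    try:
        with zipfile.ZipFile(path) as jar:
            return MARKER in jar.namelist()
    except zipfile.BadZipFile:
        return False  # not a readable zip archive


def scan_tree(top):
    """Return a sorted list of jar files under top that contain log4j2."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith(".jar"):
                full = os.path.join(dirpath, name)
                if jar_uses_log4j2(full):
                    hits.append(full)
    return hits and sorted(hits)
```

Running scan_tree on $BIRCH would give a first pass at filling in the status column of the table below.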
Progress on applications:
Package | Comments | Status |
---|---|---|
Archaeopteryx | ||
ArrayNorm | ||
artemis | uses log4j2; Need to get update from EBI | |
axis2 | ||
BioLegato | needs to be recompiled | |
birchutils | doesn't use log4j2 | OK |
blrevcomp | doesn't use log4j2 | OK |
Blastviewer | ||
BRIG-0.95-dist | ||
Cytoscape | will be upgraded when a patched version is released | |
eutils | ||
FastQC | ||
genographer | ||
getbirch | needs to be recompiled | |
Jalview | upgraded to version 2.11.1.5-j1.8 | OK |
mauve | ||
Mesquite | ||
MWCalculator | ||
readseq | ||
shuffle | doesn't use log4j2 | OK |
TM4 | ||
Trimmomatic | ||
Trinity | has some Java tools; need to check with Trinity about updates, if any | |
Gnu Parallel
We should investigate ways in which GNU Parallel can be used to speed up programs. This looks like an amazing and versatile program that has a lot of ways to speed up serial code without changing the programs themselves. There is a good discussion on BioStars.
Libraries
Delete old libraries
especially those associated with GDE. The best way is to rename a library using the .old extension. The libraries to try are:
- fc4libs
- gde.fonts
- openwin
- MeV
Testing: The main programs of concern are acedb and treetool.
OSX Dependencies
- xcode - ABySS - On one Mojave machine, gave the error: "xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun". Googling this error indicates that this sometimes appears after an update, but can be resolved in most cases by running "xcode-select --install". It is not clear how widely this will affect OSX machines, and it merits further investigation before a fix, if any, is made. It may be enough to check for xcrun as part of the pre-install process.
Documentation
- There is a nice description of the meaning of accession numbers on Biostars, with some good links. We should find a place for those links in our documentation.
- BIRCH Administrator's Guide
- Update doc on making BIRCH site web-accessible
- need to install apache2
- How to use an IP address in BIRCH Properties Tool.
MacOSX M1 (aka Apple Silicon) support
BioLegato
- Get BioLegato to compile in Java 11. We'll keep it in Java 8 for now, but for the future we want things to work with newer versions of Java.
- BioLegato needs to be able to launch programs in the background
- PCD
- VerifyBox - PCD should support insertion of Go/No-go statements into the Shell command. This would pop up a box asking the user to verify that they want to do a task.
Implementation: This is best done in Java. There simply is no GUI for Python that is distributed as a STANDARD library on all platforms. Tkinter/tkinter claim to be such, but it simply isn't true, because they depend on Tcl/Tk being present, which (surprisingly) is not the case on all Linux distros. Java has this as a standard feature. The other advantage of doing this in Java is that we can later put the code directly into BioLegato, once we can add it to the PCD grammar.
- need a reset to default values button for menus
- Bugzilla #1204 - PCD system command appears to have no effect
- set default values or min/max numbers as the result of a command
- conditional grey-out of non-relevant items?
- cascading variables?
Need mechanism for BioLegato to run commands in the background
At present, there is no way for PCD shell commands to run jobs in the background. That is, the Java Virtual Machine cannot terminate until every shell command has terminated. Even if the command ends with an ampersand, it must terminate before the JVM will terminate. That is an annoyance when we want displayed output to persist even after a BioLegato job has terminated, and a potentially major problem if we want to launch long-running or resource-intensive jobs from BioLegato.
It's probably best to write a short demo program to experiment with different approaches.
Hints:
- src/BioPCD/parser/src/org/biopcd/parser/CommandThread.java is the object that calls Runtime.getRuntime to execute commands as new threads. see shellCommand in this file.
- As a temporary workaround, we can call scripts through a bash wrapper that uses nohup and & to run the script in the background.
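A Python equivalent of the nohup wrapper is sketched below, as one candidate for the demo program. It starts the command in its own session and disconnects it from the parent's input and output streams. Whether this is enough to let the JVM exit while the job keeps running is an assumption that still has to be tested on both Linux and MacOSX.

```python
import subprocess
import sys


def launch_detached(argv):
    """Start argv in the background, detached from our terminal and streams."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: child gets its own session (setsid)
    )
```

A PCD shell command would call a small script built around launch_detached(), passing the real command as arguments. Since output goes to /dev/null, the command itself has to write its results to files.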
Links:
- How do I launch a completely independent process from a Java program?
- Runtime.getRunTime().exec not behaving like C language “system()” command
- Is there a Null OutputStream in Java?
- Create threads to run in background
Solution: In BioLegato 1.0.3, CommandThread.java has been modified so that if a command line ends in '&', it will be run in the background.
Remaining issues:
- On Linux, we can run anything we want in the background using &. This doesn't always work in MacOSX, for reasons that are not clear. If you launch output in the background, the text editor doesn't pop up until the parent BioLegato process is terminated. It looks like what we have to do is to only use & when output is being sent to files. The rest of the time, we don't use &. This is probably okay, because if you logout, you certainly don't expect or want viewers to persist. If you don't logout, there's no need to kill BioLegato. One important thing is to first remove & from the birch launcher and birchadmin. We need to do more thorough testing on the remaining BioLegato programs. In some cases, there may be a need to run programs in the background through wrappers.
- Most shell commands run the command in the background. However, if temp files aren't explicitly saved, they get deleted before the command has a chance to process them. Therefore, we need to go through the .blmenu files and make sure that "save true" is included for all input file declarations eg in1.
- birch
- birchadmin
- bldna
- blprotein
- blnalign
- blpalign
- blnfetch
- blpfetch
- bltree
- blncbi
- blmarker
Development
- make a roadmap of BioLegato structure. Probably several roadmaps, actually
- high-level structure
- initialization
- map for each canvas
- PCD --> menu, how?
- changelog: Wiki, text, web page?
GetInfo - Colourmask: new colours don't display
Hints:
- getInfoAction
- SequenceWindow
- SequenceList
- SequenceTextArea
The Update action is contained in the SequenceWindow. My guess is that we need to pass the SequenceTextArea to the SequenceWindow so that it can call the repaint function for SequenceTextArea. It is worthy of note that there are numerous calls to repaint in SequenceTextArea that specify the area to repaint. This may be for efficiency during actions like select and scroll, and may not be necessary here.
system command appears to have no effect
Hints:
- isSystemSupported may be relevant
get rid of wrappers for text editors
The BioLegato scripts call choose_edit_wrapper.sh, which in turn calls either nedit_wrapper.sh for nedit or gedit_wrapper.sh for gedit.
- We can certainly eliminate nedit_wrapper.sh because the LD_LIBRARY_PATH problem no longer applies. It's probably best to get rid of nedit_wrapper first, because it appears that when we call it as we do now, through several layers of scripts, nedit sometimes doesn't launch.
- Hopefully there is a way to eliminate the gedit wrapper too. gedit has a --new-window option that forces gedit to open a new window, rather than a new tab, which we don't want.
Output to console
We need to decide on a standard way to run programs so that we see the progress as the program runs. Currently this is done using the command stored in $GDE_TERM, but that is not necessarily platform independent. Some possibilities include:
- an external script that we can pass the command to, that will pop up a console display, and an "Okay" button when the job is finished
- an intrinsic BioLegato command that sends output of a job to a console display
Table Canvas
- Need a built-in function to add rows or columns. There is already a Delete Row function, so it should be easy to add. This will be useful for things like blnfetch and blpfetch, when you want to type in ACCESSION numbers by hand.
- MacOSX - table canvas doesn't have border lines between cells, even though you see them on Linux. Is there some Swing or AWT attribute that could be set to force border lines to be shown?
chooseviewer.py
- should be able to open presentations eg. .pptx
- other media like audio, video?
- The problem with deleting an HTML file while it is still in the browser is that at least Firefox won't save the file under a new name if you ask it to. Since the default is to have it pop up in a temporary file, this makes it hard to save a copy of the output after you have viewed it in a temporary file.
blsort.py
- sort dates. We already know how to do dates from blastdbkit.py, so we borrow that code here.
- re-annotate parameters using pydoc @param operator
BLHelper.py
- revisit how BinaryExists looks for apps. The command string could be any of
- open -aWXYZ appname (where WXYZ might be any number of options)
- There should be a refresh button to update birchadmin to show any newly-installed applications
- The command for all Save buttons should include a notify popup to tell the user to restart birchadmin for changes to be seen.
birchadmin
birchadmin is a birch system administration tool.
- menus
- File
- nobirch (some users may wish to turn off birch access. Think about this one. It's really only particularly relevant on multiuser sites. Or is it?)
- Web
- global string change - we need an easy menu for recursively doing a string substitution in all files in a directory with certain characteristics eg. a file extension. The program should make a backup copy of the file, preserving the file's metadata. An example would be "in all files with a .blmenu extension, replace $DOC with $BIRCH/doc"
- change BIRCH URL
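The global string change could be built on a function like the one sketched below. It walks a directory tree, and for each file with the given extension that contains the old string, it makes a backup with shutil.copy2 (which keeps timestamps and permissions) before rewriting the file. The ".bak" backup suffix is an arbitrary choice.

```python
import os
import shutil


def replace_in_tree(top, extension, old, new, backup_suffix=".bak"):
    """Replace old with new in matching files under top; return changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if not name.endswith(extension):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "r") as handle:
                text = handle.read()
            if old not in text:
                continue  # leave untouched files without a backup
            shutil.copy2(path, path + backup_suffix)  # keeps file metadata
            with open(path, "w") as handle:
                handle.write(text.replace(old, new))
            changed.append(path)
    return sorted(changed)
```

The example from the text would be replace_in_tree(top, ".blmenu", "$DOC", "$BIRCH/doc"), with top set to the directory holding the menus. The returned list could be shown to the administrator as a report of what was changed.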
birchdb
- doc2ace.py - update list of supported file formats, especially to include newer OpenOffice/LibreOffice formats
- reimplement birchdb using hsqldb.
- Need a GUI front end.
- hsqldb has a few GUI tools.
- LibreOffice Base can act as an HSQLDB client.
- Top 10 MySQL GUI Tools
- Valentina Studio
- URGENT! - Need to fix the birchdb and tbirchdb scripts to work properly on CCL, and still work elsewhere.
- Could it be that it works from a csh script, and not a bash script?
The problem is that failure of birchdb to launch Xace or tace has been inconsistent. It works on some days, and not on others. It is as if something keeps getting set or unset.
Although error messages aren't consistent, here's one (on jupiter):
Gtk-WARNING **: Failed to load module "libgail.so": libgail.so: cannot open shared object file: No such file or directory
Gtk-WARNING **: Failed to load module "libatk-bridge.so": libatk-bridge.so: cannot open shared object file: No such file or directory
Other times, this script gives a Segmentation Fault error. Once again, the only place I've had this trouble is on CCL.
As well, there is a GUI front end called RazorSQL which may be all that we need to manage birchdb.
- InstallAddUpdate
- update - this will have to wait until the new version of Getbirch, and the accompanying install-scripts directory, are ready.
- install add-on - ditto. Ideally, it should be possible to choose an addon from a menu and have Getbirch install it.
- need to revise documentation, and get doc into database
Quick and dirty patch/addon mechanism
We need a way to apply patches to an existing BIRCH install. This should be a very simple mechanism to start with, which will also teach us some things about exactly what it is that we want it to do. Initially, it should probably be nothing more than running a script that downloads a file and untars it, so that the files just go where they are supposed to go with permissions already set.
We need a mechanism to record in $BIRCH/local which addons are installed. This way, when a BIRCH update is installed, we can make sure to re-install any addons.
Definition of an add-on
An add-on includes:
- a tar file which, when untarred in the $BIRCH directory, will install files in the right places.
- an install script
- a line to be appended to a flat-file database somewhere in $BIRCH/local. This is probably a csv-style line. It should at minimum give a unique name for the add on.
xxxxx.addon.d
    payload.tar
    install.py
    addon_spec.csv
An add-on can either be something new that is installed, or a patch that overwrites existing files, or even a script that runs and changes something. For example, a patch might be as simple as a script that changes important permissions, or changes the name of a file, or does a string substitution to correct an error.
Algorithm
get list of available addons/patches
user selects one or more
foreach addon selected
    cd $BIRCH
    download addon
    gunzip addon.tar.gz
    tar xvfp addon.tar
    cd xxxxx.addon.d
    mv payload.tar $BIRCH
    cd $BIRCH
    tar xvfp payload.tar
    cd xxxxx.addon.d
    python install.py
    cat addon_spec.csv >> $BIRCH/local/admin/addons.csv
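The per-addon part of the algorithm could be sketched in Python as below, using the tarfile module in place of gunzip and tar. The download step and the selection menu are left out. The sketch assumes that the archive is trusted, since it is extracted without any checks on member paths, and that it contains exactly one top-level .addon.d directory.

```python
import os
import subprocess
import sys
import tarfile


def install_addon(birch_dir, addon_archive):
    """Unpack one add-on archive and apply it to the BIRCH tree."""
    # Unpack xxxxx.addon.d into $BIRCH (tarfile reads .tar.gz directly).
    with tarfile.open(addon_archive) as outer:
        tops = {os.path.normpath(m.name).split(os.sep)[0] for m in outer.getmembers()}
        outer.extractall(birch_dir)
    addon_dirs = [name for name in tops if name.endswith(".addon.d")]
    if len(addon_dirs) != 1:
        raise ValueError("expected exactly one .addon.d directory")
    addon_dir = os.path.join(birch_dir, addon_dirs[0])

    # Untar the payload relative to $BIRCH, so files land in place.
    # This replaces the mv/cd/tar steps of the algorithm above.
    with tarfile.open(os.path.join(addon_dir, "payload.tar")) as payload:
        payload.extractall(birch_dir)

    # Run the install script from inside the add-on directory.
    subprocess.check_call([sys.executable, "install.py"], cwd=addon_dir)

    # Record the add-on so it can be re-installed after an update.
    admin_dir = os.path.join(birch_dir, "local", "admin")
    os.makedirs(admin_dir, exist_ok=True)
    with open(os.path.join(addon_dir, "addon_spec.csv")) as spec:
        with open(os.path.join(admin_dir, "addons.csv"), "a") as log:
            log.write(spec.read())
    return addon_dir
```

Because subprocess.check_call raises an exception when install.py fails, a failed add-on is not recorded in addons.csv.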
FSAP, XYLEM
Convert FSAP and XYLEM to Free Pascal?
Free Pascal appears to still be supported. They have builds for Linux and MacOS, both on x86_64 and ARM64. There are even RPM and DEB packages.
GNU Pascal looks like it hasn't been supported since 2005.
GNU Pascal has a great many features aside from the Jensen & Wirth standard, including support for most Borland features, and even abstract object types and methods. The main improvement would be that we could leave behind p2c. This should be done with great care and a lot of testing, because there could be surprises hiding in the implementation. See http://www.gnu-pascal.de/gpc/h-index.html
Have tested Free Pascal on macos-arm64 and linux-x86_64. Newly compiled programs work. Also, code compiled statically using fpc -XS on Ubuntu 18 executes on Red Hat 7. Presumably, RH7 has the oldest C-libraries, so if the binaries run there, they will run on any Linux system.
BLAST+
- blastdbkit.py - NCBI has a penchant for changing the basenames of databases. For example, there used to be swissprot.00.*, and now it's just swissprot.*. Betacoronavirus has undergone similar changes. Is it feasible for blastdbkit to check for basenames in the local database that are not found on the FTP site, and then delete the presumably obsolete local files?
- Need more ways to find out sizes of databases. Possibilities:
Add an item to BIRCH, bldna, blprotein to open local database spreadsheet (DONE)
- Add a feature to blastdbkit that tacks on the size in GB of the database to the name of the database in the Biolegato menu items. This shouldn't take up much width on the screen.
Add email notification. DONE v3.71
Can we also have some sort of desktop notification as an alternative?
- Consider adding some sort of functions for blastdb_aliastool which lets the user create virtual databases which are subsets of the larger databases.
- Consider how to deal with multiple query sequences. If you select more than 1 sequence in BioLegato, both FASTA and BLAST will use each as queries, and the report will have a summary of results for all. Is there a better way to handle multiple queries?
BIRCHv3.60 - Need to let user explicitly set CPU limits. Otherwise, BLAST uses ALL available CPUs.
- Web blast appears to have a feature that says "download aligned sequences", meaning that the part of the hit that is in the alignment can be downloaded. tblastn appears to download the original DNA sequence, which was translated during the tblastn search. This could potentially be useful, in cases where a complete global alignment of the query and subject exist. It could also be very dangerous, when those conditions don't apply ie. a local alignment. We should look into whether or not we can get the same output from BLAST+. It would probably have to go through the history.
BIRCHv3.60 - BLAST+ 2.8 - need to be compliant with new database format (v5) that improves blast performance when limiting taxa or given a list of ID numbers, as well as retrieval of data using blastdbcmd. At present, I believe that only the most common databases are available by FTP for v5. We need to come up with the best way to either support both old and new versions, or to easily upgrade to v5 databases.
- Compute Canada mirror for BLAST databases - It would probably be useful to create a mirror for BLAST databases, so that there would be a 2nd North American mirror. (biomirror doesn't store .md5 data for database files, which undermines its usefulness.) The Rapid Access Service at Compute Canada provides easy access to projects using comparatively "small" resources. For example, the /project filesystem allows up to 10 TB per group, which should be more than enough for NCBI databases for quite some time.
- blastdbkit.py - could we speed things up a bit using unpigz?
Is there a way to get BLAST to report the length of the library sequence in the hit summary? This would be useful to give us some idea of the types of hits we're getting. Alternatively, could this size go into the blnfetch output as a separate column?
- Have a look at blastdbcmd, which takes a sequence id as input and retrieves a lot of types of information from a local BLAST database. This could have a whole lot of uses in BioLegato. Try 'blastdbcmd -help' for a manual page.
In Python:
import multiprocessing
multiprocessing.cpu_count()   # number of CPUs reported by the operating system
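For the blastdbkit.py item on changed database basenames, one possible check is sketched below. It is only a sketch under stated assumptions: the function names are hypothetical, the FTP listing is assumed to name each volume as <volume>.tar.gz, local files are assumed to be named <volume>.<extension>, and alias files (.nal, .pal) are skipped because they have no direct counterpart on the FTP site. It only reports candidates; nothing is deleted.

```python
import os

# Alias files describe multi-volume databases and are not volumes themselves.
ALIAS_EXTENSIONS = {".nal", ".pal"}


def remote_volumes(remote_files):
    """Volume names on the FTP site, e.g. 'nt.00.tar.gz' -> 'nt.00'."""
    suffix = ".tar.gz"
    return {name[:-len(suffix)] for name in remote_files if name.endswith(suffix)}


def obsolete_local_files(local_files, remote_files):
    """Local files whose volume name no longer appears in the remote listing."""
    volumes = remote_volumes(remote_files)
    obsolete = []
    for name in sorted(local_files):
        stem, ext = os.path.splitext(name)
        if ext in ALIAS_EXTENSIONS:
            continue
        if stem not in volumes:
            obsolete.append(name)
    return obsolete
```

With the swissprot example above, local swissprot.00.* files would be reported once the FTP site only lists swissprot.tar.gz. It is probably safer to log the list and ask for confirmation than to delete automatically.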
v3.60: multiple CPUs - simple approach is to just use all of them. Do we want to have a menu item to choose the number of CPUs? Biolegato has chooser to set number of CPUs
Automated install/update of NCBI Blast databases, run from birchadmin.
Algorithm:
if BLASTDB not set
    prompt for directory (default $BIRCH/GenBank)
read list of database divisions currently installed
read list of database divisions to be installed
uninstall those not in the list from previous step
install all divisions in the install list
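The core of this algorithm is a set difference, which can be sketched as below. The function names are hypothetical, the interactive prompt is replaced by a simple fallback to $BIRCH/GenBank, and the actual download and delete steps are not shown.

```python
import os


def blastdb_directory(env=os.environ):
    """Return $BLASTDB, falling back to $BIRCH/GenBank when it is unset."""
    blastdb = env.get("BLASTDB")
    if blastdb:
        return blastdb
    return os.path.join(env.get("BIRCH", ""), "GenBank")


def plan_changes(installed, wanted):
    """Decide which database divisions to uninstall and which to install."""
    installed, wanted = set(installed), set(wanted)
    # Anything installed but no longer wanted is removed.
    to_uninstall = sorted(installed - wanted)
    # Everything in the install list is (re)installed, which also updates it.
    to_install = sorted(wanted)
    return to_uninstall, to_install
```

For example, plan_changes(["nt", "est"], ["nt", "nr"]) asks for est to be removed and for nr and nt to be installed or refreshed.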
Could do this as:
- shell script with BioLegato front end
- Python script with BioLegato front end. This could be implemented by adapting BLHelper.py. The menu layout would look something like:
Nucleotide (nt)         | Installed     | Install O | Delete O |
Protein (nr)            | Installed     | Install O | Delete O |
RefSeq RNA (refseq_rna) | Not installed | Install O | Delete O |
- Java application
- Special cases of est and pdbnt databases
- Where to document, aside from Adding BLAST Databases page
- How should blastdbkit.py deal with these?
- If est or pdbnt are installed, generate message to also install dependencies?
- message in blastdbkit.log?
- don't include in BLAST/FASTA menus unless dependencies are installed
- In updates, automatically update dependencies if est or pdbnt are installed?
upgrade to FASTA 36.3.8 (April 2016)
- have a look at FASTA scripts for annotation.
- work out in depth the ways to get the most performance out of local BLAST and FASTA.
- add option to set number of CPUs
- This paper demonstrates the counterintuitive finding that multiple CPUs actually degrade performance when memory is smaller than the size of the database being searched. http://www.nersc.gov/users/computational-systems/genepool/performance-and-optimization. It may be good to add environment variables that are set at install time and default to reasonable values for one's system.
- output should always show real, user and sys times for search. Maybe these could be pseudocomments in table-report?
- blastdbcmd integration into BioLegato (akin to blncbi)
- launch BLAST/FASTA from Artemis
- better support for the more specialized blasts such as rpsblast and psiblast
- In BioLegato (and in reports?) should display both the descriptions of databases, and the codes for the databases.
FASTA output in multiple formats. Run fasta with several -m options to write multiple output formats in one search
eg. -m "F8 $NAME.fasta.tsv" -m "F2 $NAME.fasta2"
Blast output viewers
- Firefox is a pain in the ass because if you run BioLegato on a remote server, sending output to Firefox will give an error message if you also have Firefox running on your local desktop. Two possible options:
- Does Firefox have a command line option that lets you avoid the lock?
- The Java API has a browser that can display HTML, and probably visit links. Could we launch this one as a default browser for BLAST output?
Blastviewer - The most intuitive of the tools I've tried. Requires NCBI XML file as input using -outfmt 5 option. Current version doesn't take filenames at the command line. I have contacted the author about fixing this. DONE: New version takes files at command line. Added to blast searches as an output option.
- BLASTGrabber - BMC Bioinformatics 2014 15:128.
- Pro: Nice GUI, seems pretty solid; written in Java; fairly easy to figure out; most tables can be exported to csv; is able to parse BLAST+2.10.0 text output files.
- Con: doesn't take input files from the command line; no apparent way to retrieve hit sequences directly from the program; not as intuitive to user as Blastviewer.
- jamblast - requires an SQL server, which makes installation non-trivial; last update 2013.
- Blast Output Viewer Generator - Looks like what you do is run BLAST at the NCBI web site, save output to a text file, and BOVG generates a web page that views the output interactively.
- BlasterJS https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205286 - Looks like a JavaScript library that could be included in other web applications that use BLAST.
Phylogeny
HIGH PRIORITY: preserving metadata using phyloXML. Most newer phylogeny programs use phyloXML to represent trees. We need to have a way to link sequence IDs to sequence metadata eg. taxonomy. PhyloXML does this. There are BioJava and BioPython classes for doing at least some of this. One approach might be to read Phylip Newick files, get metadata from IDs, and generate phyloXML files from that. Solution: bltree can add bootstrap values to a phylogenetic tree, and save output to a phyloXML file.
- Booster - bootstrapping for large datasets. See Lemoine F, Domelevo Entfellner JB, Wilkinson E, Correia D, Davila Felipe M, de Oliveira T, Gascuel O. (2018) Renewing Felsenstein's phylogenetic bootstrap in the era of big data. Nature, doi:10.1038/s41586-018-0043-0
Phylip programs don't allow special characters in sequence names eg. :[]() etc. Solution: This is documented in the man page for uniqid.py. By default, uniqid.py will not use these characters.
add uniqid to the pipeline to get over the limitation in Phylip that you need to have names <= 10 characters in length.
- Archaeopteryx
change all Phylip programs to launch Archaeopteryx instead of ATV.
Update tutorials for Archaeopteryx
delete ATV from birchdb
delete ATV from $BIRCH/java
bldna
- FASTA search of local database seems to mess up columns in output, at least when the database is a User-created file. This may affect blprotein as well.
- Replace readseq wherever possible. Readseq is no longer supported. SeqKit has many similar options.
- Replace TCAG with something else. Candidates
- Sequence Extractor - generates mouse-aware HTML with restriction sites and primers
- FEATURES - Extract by feature keys. Using the Create a list of features option results in empty output. Only tested using NG_008290. Don't know if this is a problem for all GenBank files.
Prevention of runaway jobs: Failure to select a sequence for BACHREST still results in runaway jobs. This was fixed with a simple change to bachrest.py, that was already present in numseq.py.
- New GenBank feature key: propeptide - needs to be implemented in GETOB and bldna. See GenBank Release Notes under Upcoming Changes.
- When reading a GenBank file, if no ACCESSION number is given, insert an arbitrary and invalid number eg. XXXXXX. What this does is to make it possible to use FEATURES directly with a GenBank file created by Sequin.
Eliminate duplicates - Before constructing a multiple alignment, it is important to eliminate duplicate sequences, which would otherwise bias the alignment towards sequences with many duplicates. We need to add a tool that would let you eliminate duplicates, based on various criteria. For example, number of mismatches, alignment scores etc. This could either appear as a separate step in doing an alignment, or as a pre-filtering step in the menu for programs such as T-COFFEE or DIALIGN-T. It would probably be useful to have the output generate a list of deleted duplicates, either as a list of names, or as output to new bldna or blprotein windows.
Notes:
- we need a program that works with both DNA and proteins
- MUST SCALE WELL! This is an O(N**2) problem.
- Possibilities:
- CD-HIT http://www.bioinformatics.org/cd-hit/
- fastx_collapser which is part of the FASTX toolkit. Poorly documented as to what it does. Loses the names for duplicates.
- Sequence Cleaner (uses BioPython) http://biopython.org/wiki/Sequence_Cleaner. Looks like a crude script, but might be okay. Relies on a new sequence being in a set of those tested. Efficiency will depend on how Python's string test works.
- Jalview has a 'remove redundancy' function
- filter_fasta.py from Qiime http://qiime.org/scripts/filter_fasta.html Requires pip to install, and numpy.
- we need a program that works with both DNA and proteins
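As a baseline for comparing the candidates above, exact duplicates can be removed in a single pass with a dictionary, which avoids the O(N**2) pairwise comparison. The sketch below is hypothetical (the function name and the (name, sequence) input format are assumptions), works for DNA or protein, and only catches identical sequences. Near-duplicates defined by mismatches or alignment scores would still need a tool such as CD-HIT.

```python
def collapse_exact_duplicates(records):
    """records: iterable of (name, sequence) pairs with unique names."""
    kept = []        # (name, sequence) for the first copy of each sequence
    dropped = {}     # kept name -> names of later identical sequences
    first_name = {}  # normalized sequence -> name of the copy we kept
    for name, seq in records:
        # Ignore case and embedded whitespace when comparing sequences.
        key = "".join(seq.split()).upper()
        if key in first_name:
            dropped[first_name[key]].append(name)
        else:
            first_name[key] = name
            dropped[name] = []
            kept.append((name, seq))
    return kept, dropped
```

The dropped mapping gives the list of deleted duplicates asked for above, so it could be written out as a list of names or sent to a new bldna or blprotein window.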
- Primer3 -- new capabilities and interfaces (probably should replace Primer3). Is this what NCBI uses in primer Blast? https://www.ncbi.nlm.nih.gov/pubmed/22730293
Source available at https://sourceforge.net/projects/primer3/
- Dotplots
- This page links to a large number of dotplot programs. In particular, see D-Genies.
- DXHOM - Is it possible to simplify running these programs by automatically creating an opposite strand if one is specified? The current DXHOM programs demand that the X-axis be the opposite. Is there a conceptual downside to automatically running blrevcomp.py, as opposed to making the user do this manually? Maybe we can change the menu so that the only place you can specify an opposite strand is for the X-axis. That is, put STARTX and FINISHX and opposite all on one panel, and STARTY and FINISHY on a separate panel. That might not work because the menu would end up being pretty wide.
- dotter - compile for MacOSX - No luck so far in compiling dotter on albacore. Can't find libraries, even though they are installed. Maybe try compiling on another Mac?
There is a 2016 paper describing SeqTools, which includes dotter. There is a more recent dotter that could be installed, which includes a Mac version. Also, Dotter claims to be able to read gff3 files, but I have so far not been able to generate gff3 files from GenBank files that dotter can read. The new version is probably worth looking into.
blfeatures
How about a BioLegato that displays a GenBank features table? It would be an output option from the Features program. blfeatures would use the table canvas to display feature information:
Accession FeatureKey Location Qualifiers....
You could do the usual scan/sort/extract operations to get a narrowed-down list of features. Then retrieve the features you want from the GenBank files.
This might be far more useful than one might originally think.
blprotein
- Need to add the ability to extract features from GenBank protein files. In particular, most proteins have a /coded_by= tag that gives the exact Feature expression to retrieve the source CDS from the protein. This would be tremendously useful. Is there something in Eutils that does this?
Add Clustal Omega, delete TCOFFEE
cd-hit
Eliminate duplicates - Before constructing a multiple alignment, it is important to eliminate duplicate sequences, which would otherwise bias the alignment towards sequences with many duplicates. We need to add a tool that would let you eliminate duplicates, based on various criteria. For example, number of mismatches, alignment scores etc. This could either appear as a separate step in doing an alignment, or as a pre-filtering step in the menu for programs such as T-COFFEE or DIALIGN-T. It would probably be useful to have the output generate a list of deleted duplicates, either as a list of names, or as output to new bldna or blprotein windows.
- software for helical wheel plot
- blprotein - PXHOM can't work with lowercase amino acids. Where do we do the fix? PXHOM, hom.py, or BioLegato?
Bugzilla #1216
Short term Fix: set readseq to convert aa sequences to uppercase before running PXHOM.
- Better long term fix is to modify P1HOM and P2HOM to accept lowercase. Actually, this will require modifying sunmods.p, p1hom.rp, p2hom.rp and prostat.rp. While we're at it, we should add support for the amino acids pyrrolysine (Pyl,O) and Leu/Ile (Xle,J).
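If the short term fix were done in hom.py rather than through readseq, the uppercase conversion is a few lines. This is a hypothetical sketch that assumes FASTA input, where header lines start with ">" and must be left untouched.

```python
def uppercase_fasta(lines):
    """Yield FASTA lines with sequence lines uppercased and headers unchanged."""
    for line in lines:
        yield line if line.startswith(">") else line.upper()
```

Because it is a generator, it can be applied line by line while copying the input file to the temporary file that PXHOM reads.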
blncbi
- Do Eutils allow you to retrieve sequences from TSA? Current blncbi can't do that.
- In Molecule type pull down menu, change label for mrna to "mRNA (cDNA)". Search term should still be mrna.
Entrez API keys - Need to add compliance with the new api_key parameter.
see below - has NCBI fixed the limitation on epost with a comma-separated list of ACCESSION numbers?
Eutils have an option to add an email address and things like the name of the program calling Eutils eg. seqfetch.py. This is a good idea, because it helps NCBI figure out where requests are coming from.
Maybe each user should have an NCBI_API_KEY environment variable, or at least we check for that before calling Eutils.
- If exceeding allowed number of requests per second, API generates an error message. We should probably be able to handle that.
- Add a UNIQ option to the blsort.
- Search of non-sequence databases - high priority should be GEO and PubMed. Also, ability to go from sequence to GEO and PubMed, and back.
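For the Entrez API key and request-limit items in the list above, a minimal sketch follows. The E-utilities request parameters api_key, tool and email are real. Everything else is an assumption made for illustration: the helper names, the NCBI_EMAIL variable, and the use of RuntimeError as a stand-in for whatever exception the calling library raises when the request rate is exceeded.

```python
import os
import time


def eutils_params(params, tool="seqfetch.py", env=os.environ):
    """Add tool, email and api_key to a dict of E-utilities parameters."""
    out = dict(params)
    out["tool"] = tool
    if env.get("NCBI_EMAIL"):      # hypothetical variable name
        out["email"] = env["NCBI_EMAIL"]
    if env.get("NCBI_API_KEY"):    # variable name suggested above
        out["api_key"] = env["NCBI_API_KEY"]
    return out


def call_with_backoff(request, attempts=4, delay=0.5, sleep=time.sleep):
    """Call request(); on a rate-limit error wait, double the delay, and retry."""
    for attempt in range(attempts):
        try:
            return request()
        except RuntimeError:       # stand-in for the real rate-limit error
            if attempt == attempts - 1:
                raise
            sleep(delay * 2 ** attempt)
```

Passing sleep as a parameter keeps the retry logic testable without real waiting.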
CRITICAL! Compliance with NCBI requirement for https.
As of Sept. 1, 2016, http queries to NCBI will automatically be forwarded to https. See http://www.ncbi.nlm.nih.gov/news/06-10-2016-ncbi-https/. The current BioPython Entrez methods distributed with BIRCH (v1.65) still use http for running cgi scripts. At present, the API appears to work, even for efetch retrievals of > 200 sequences. (The documentation implies that these would not work with https.) This would seem to imply that the change shouldn't break the current Entrez API, but hopefully a new BioPython release will explicitly use https. It might be best to contact the BioPython Entrez developers to get them to check this.
seqfetch.py - Currently limited by NCBI's epost command. At present, epost will only accept GI numbers, not ACCESSION numbers. Presumably, NCBI will get around to fixing this, since they are committed to retiring GI numbers. Until they do, seqfetch.py uses efetch to retrieve based on ACCESSION numbers. This prevents the inclusion of equery terms. One practical consequence is that when retrieving based on ACCESSION, we can't specify minimum or maximum size limits.
- File --> Export to spreadsheet
- Visualization of NCBI search results
The web interface for Entrez searches is not very useful. When there are a lot of results, you just get short summaries of the first 20, 50 etc. results that really aren't manageable. We need a better way to visualize hits when there are a large number of them.
- Feature map - Is there a nice tool that, given a set of GenBank entries as input, will generate a graphic map of each sequence? This might have to be a two step process. Step 1 would be to extract features into a GFF file, and step 2 would be a program that renders GFF files into a graphical display. This pipeline could be called by blncbi to generate a document that lets you browse quickly through a bunch of sequences. For example, suppose you were shopping for vectors, and had it narrowed down to a short list of a few hundred. You could then browse through the maps to get a better idea of what was in them. We might be able to do this in a quick and dirty way by first generating a text-based GFF file to browse, and later introducing a feature display.
- Pie chart - This will emphasize the most prominent categories of hits
- Bar graph - this will usually be something like a decay curve, where a few cohorts take up most of the hits.
- Word cloud/Tag Cloud - might be useful if we wanted to look at some secondary search field, and see the relative proportions of, say, some key words.
The main point is that each of these provides us with some information that might be helpful in narrowing down subsequent searches. It also tells us something about the data itself.
The question is, do Eutils tools make it easy to generate the statistics needed to generate these views? Ideally, we'd rather not have to iterate 1 by 1 through 200,000 hits to find a key:value pair for each hit.
Word/Tag Cloud programs:
- Semantic Word Cloud Visualization - Web application, and the .jar file can be downloaded as a standalone http://wordcloud.cs.arizona.edu/index.html
- Nucleotide database query - In the Output tab, the Output Format combo box includes an obsolete choice for GI number. Can we replace this with an option to retrieve Accession numbers?
- Entrezpy - claims to be a more complete set of Python tools for Entrez than the BioPython Entrez libraries. We should have a look.
bltable
- csvtk - toolkit for performing operations on csv files
- Function for splitting lines into columns based on a specific separator such as '|'
- Look over options for reading column headings.
- we should add a -uniq option to blsort.py
- All BioLegato interfaces using the Table canvas should have a function to export to a spreadsheet, as determined by the $BL_Spreadsheet variable. ie. blmarker, blnfetch, blpfetch, blncbi, bltable
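The column-splitting function mentioned in the list above could look something like the sketch below. The function names are hypothetical, and tab-separated input is an assumption (csvtk or another tool may turn out to be the better route).

```python
import csv
import io


def split_column(rows, column, separator="|"):
    """Replace one column of each row with the pieces obtained by splitting it."""
    for row in rows:
        parts = row[column].split(separator)
        yield row[:column] + parts + row[column + 1:]


def split_tsv_column(text, column, separator="|"):
    """Apply split_column to tab-separated text and return tab-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(split_column(reader, column, separator))
    return out.getvalue()
```

For instance, splitting column 0 of a row that starts with sp|P12345|ALBU_HUMAN yields three separate columns, so the accession can be sorted or selected on its own. Rows with differing numbers of pieces will end up with differing numbers of columns, which the table canvas would have to tolerate.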
blreads
- Quast has several specialized versions, including MetaQUAST and QUAST-LG (large genomes). We should incorporate these into blreads.
- Additional RNAseq read mappers - It is a bit too simplistic to just have Hisat2. One paper shows that bwa and bowtie2 both were more accurate on a test dataset than hisat2. Hisat2 typically had AUCs less than 0.97, while bwa and bowtie2 were usually between around 0.98 - 0.99. Hisat2 uses less RAM.
- Adaptor trimming: check out BBDuk from BBMap package. One comment on BioStars said that this was more accurate and faster than Trimmomatic. This, by the way, would be a good topic for a video.
- As long as we're giving alternatives for read mapping, we should also see if there are tools that would let us compare the accuracy of mapping, much as we compare different genome assemblies in a spreadsheet.
- File --> Open directory - Currently, you must select a directory to open. It would also be good if, when no directory was selected, a file chooser popped up so that you could open the new window in someplace other than a directory contained in the current working directory.
- Would it be useful to have a function or two to calculate read coverage? There are two potential applications. A priori, given genome size and read size, calculate expected read numbers needed for a certain coverage, or range of coverages. A posteriori, take read files as input, with information on genome size, and calculate actual coverage. In each case, generate an HTML report.
- Removing contaminating reads from fastq/fasta files - There are many situations in which the read pool will contain reads both from the primary organism being investigated, as well as symbionts. One example would be if you isolate a microbe from a plant, you might want to get rid of the plant reads. The other would be if you have a host and you want to get rid of the reads from the microbial symbiont eg. sequencing an alga and you get bacterial symbionts. Some of these appear to involve making a reference BLAST database of the contaminant genome, and then BLASTing all reads against the database, and running a filter to remove hits based on names.
One consideration: In principle, you could do the assembly, and then eliminate contigs that appear to be contaminants. Is that a better approach?
- using Bowtie (metagenomics)
- SeqTrim
- CLC protocol
- MEGAN
- DeconSeq
- Ideas from BioStars
- miRNA
- sPARTA
- last-bisulfite - align bisulfite-converted DNA reads to a genome
- PacBio support
- SequelTools - Tools for QC and processing of Pacbio reads
- A good paper illustrating combining PacBio and Illumina is Gulvik CA et al. (2019) Complete Genome Sequence of Nocardia farcinica W6977T Obtained by Combining Illumina and PacBio Reads, Microbiology Resource Announcements DOI: 10.1128/MRA.01373-18
- FALCON - PacBio assembler for diploid genomes.
- nanopack - long read processing
- Transrate is no longer supported. Let's try
- rnaQUAST
- BUSCO
- Genome assembly: Test on large datasets, different sequencing platforms - Need to do more testing on larger datasets such as human genome, and with other platforms. Most testing has been on Illumina, 454 and Ion Torrent so far.
- Genome In A Bottle
- CHM13 - Telomere to telomere consortium https://github.com/nanopore-wgs-consortium/chm13
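The read-coverage item in the list above comes down to the usual relation coverage = (number of reads x read length) / genome size. A minimal sketch of both directions follows. The function names are hypothetical, FASTQ input is assumed to use strict 4-line records, and for paired-end data each mate counts as one read. The HTML report is not shown.

```python
import math


def reads_needed(coverage, genome_size, read_length):
    """A priori: number of reads needed to reach the requested coverage."""
    return math.ceil(coverage * genome_size / read_length)


def fastq_total_bases(lines):
    """A posteriori: total bases in a FASTQ file with 4-line records."""
    total = 0
    for index, line in enumerate(lines):
        if index % 4 == 1:  # the second line of each record is the sequence
            total += len(line.strip())
    return total


def observed_coverage(total_bases, genome_size):
    """Actual coverage given the total number of sequenced bases."""
    return total_bases / genome_size
```

As a worked example, 30x coverage of a 5 Mb genome with 150 nt reads needs 30 x 5,000,000 / 150 = 1,000,000 reads.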
blpandas
The Pandas API seems ideally suited for a BioLegato front end. The data paradigm seems to be the data frame (df). Pandas does an operation on data in a data frame, and the output is another data frame. Sound familiar? See http://pandas.pydata.org
Here's how to do this:
- Break out BioLegato as a standalone project, perhaps in a Git repo.
- Create a demo blpandas
- Advertise blpandas on the Pandas Stack Overflow forum. Solicit collaborators from the Pandas community.
Multiple Alignment
- reform: put a blank space between sequence names and alignment. Can we add tick marks to the numbers, maybe as an option?
- cd-hit - On OSX, won't run. It needs to have libgomp.1.dylib. This is fixed by installing gcc using homebrew. Can we package that library in bin-osx-x86_64?
- Is there a neat, automated way to extract only those positions from a multiple alignment that exceed some confidence score? Does JalView do this? One way might be if there was a script that could calculate that from a raw alignment. This function, along with the elimination of gaps, might make a nice script. See GUIDANCE2, Sela et al. (2015) NAR 43, W7-W14.
- There are programs for automated removal of gap positions from multiple alignments. As well, some phylogeny programs automatically ignore, remove or handle gap positions. This is actually a complex problem, but it seems useful if blpalign and blnalign had one or more programs available to remove or mask gap positions. Googling "remove gaps before phylogeny" brings up a lot of good discussion on this topic, as well as suggestions of programs for gap removal.
- Programs:
High throughput multiple alignment programs
- HAlign-II https://almob.biomedcentral.com/articles/10.1186/s13015-017-0116-x
- Kalign3 - https://academic.oup.com/bioinformatics/article/36/6/1928/5607735
MAFFT
Implement multithreading using --thread and $BIRCH_CORES
Upgrade to version 7
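A wrapper could build the MAFFT command line from $BIRCH_CORES roughly as sketched here. The function name is hypothetical, and falling back to a single thread when the variable is missing or not a positive integer is an assumption, not an existing BIRCH convention.

```python
import os


def mafft_command(infile, env=os.environ):
    """Build a mafft command line that uses $BIRCH_CORES threads."""
    cores = env.get("BIRCH_CORES", "1")
    if not cores.isdigit() or int(cores) < 1:
        cores = "1"  # fall back to a single thread on bad or missing values
    return ["mafft", "--thread", cores, infile]
```

MAFFT writes the alignment to standard output, so the list would typically be passed to subprocess.run with stdout redirected to the output file.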
Grishin Lab Software
The Grishin Lab at HHMI has a lot of publications and tools related to protein evolution, structure and multiple alignment. The Grishin scoring matrix is one of the ones used in NCBI BLAST. See http://prodata.swmed.edu/Lab/Software.htm
MstatX
Calculates statistics for multiple sequence alignments. Output includes various scores for multiple alignment. This should be a good way for comparing the quality of alignments based on different methods or parameters.
https://github.com/gcollet/MstatX
GUIDANCE2
GUIDANCE2 seems to be a comprehensive package for evaluating multiple alignments. It is more polished than mstatx, and gives some pretty good output. The downside is it requires BioPerl and BioRuby modules, which may be an annoyance to install.
TCOFFEE
Replace TCOFFEE!!!
On MacOSX, t_coffee is v8.14. It has not been possible so far to get later versions to run on albacore. It was possible to compile the generic version but that also generates errors. It is not certain whether this is a problem with albacore specifically, or MacOSX in general. I ONCE installed TCOFFEE in an account on OSX, and the binary didn't work. Nonetheless, I was unable to run any previous version of TCOFFEE. Even after removing all of the TCOFFEE environment variables from all .rc files, and from the .MacOSX directory, every time I tried to run 8.14, it would create a new ~/tcoffee directory with the new version in it! This thing is like a virus. You just can't get rid of it. Somewhere in this account, there is a tcoffee script or settings lurking in a file.
Fortunately, the problem is limited to a single account.
There is now a Clustal Omega, which the authors claim is "The last alignment program you'll ever need". Maybe.
blnalign, blpalign
- Updated Weblogo to v3.7. This version fails to parse quoted arguments containing blank spaces. Menus for blpalign and blnalign have been revised to warn user that blanks are not allowed. Issue was reported to weblogo github as issue #102.
- Another odd thing: On OSX, using -t '%ltitle' generates an empty EPS file, whereas using --title '%ltitle%' works.
- consensus only works for DNA. It should be removed from blpalign. Is there a more recent consensus program we can use?
- add Quasi analysis to identify positions under positive selection. Stewart et al. (2001) BMC Bioinformatics 2,1.
- programs to create profiles
- Edit --> Extract - Add START and FINISH sliders to Extract to allow the user to select a part of a sequence or alignment. This is usually way easier than trying to select in the BioLegato sequence pane. For blnalign and blpalign, you would be able to select parts of an entire alignment. For bldna and blprotein, you would need to restrict it to a single sequence, since it doesn't make sense to extract blocks of sequence when they aren't aligned. This is a good idea, but needs some careful thought.
- For distance phylogeny programs:
- Create a separate menu item for calculating a distance matrix (DNADIST, PROTDIST, RESTDIST). Output to bltable.
- Create an option in phylogeny program menus to read in an already-existing distance matrix.
- Alternative programs for calculating distance matrices?
- blnalign - method for translating a DNA alignment into a protein alignment
Multiple Alignment Tutorial
Eliminate DIALIGN
Fix alignment editing tutorial to match alignment
- Update to latest version of MAFFT
- New section on alignment colours
blnfetch, blpfetch
According to the GenBank release notes section 1.4.2, GI numbers will be phased out. This has already begun with WGS and TSA sequences. Eventually, new sequences will ONLY get an Accession.Version ID, and GI numbers will no longer be assigned.
- add Sort and Find functions. These make it easier to hunt for an ACCESSION number when looking at BLAST output.
For BLAST output - could we replace these programs with programs called blpblast and blnblast, that give us a multi-column BLAST output, where one column has the ACCESSION numbers? Essentially, these would be live BLAST output, where you can select the ACCESSION numbers you want and then retrieve the sequences.
- Is there a straightforward way to get blnfetch and blpfetch to retrieve sequences with BLAST hits when the database was a User-Created FASTA file? Maybe seqkit grep will do the job (seqkit grep --help for more info)?
bltree
Archaeopteryx https://sites.google.com/site/cmzmasek/home/software/archaeopteryx Solution: Now replaces ATV.
- GUI is a bit weird and takes some getting used to
- We need to revise drawgram and drawtree to be able to use chooseviewer.py --ext. If that could be made to work, it would expand the list of bitmap viewers that can be used on OSX. The big problem with the OSX open command is that it is highly dependent on file extensions, and also that the formats generated by drawtree and drawgram are incredibly obsolete. This may not be worth the effort.
- Treevolution http://vis.usal.es/treevolution
blmarker
- Edit --> Sort rows
- Sort columns, as a separate function?
- break up phylogeny menus into tabs
- separate function for creating distance matrix
- add a choice to read in a distance matrix
- delete empty rows before doing analysis
blcont
BioLegato for continuous data. This would be an implementation of bltable, targeted at data expressed in real numbers, such as phenotypic data. We would start out with the appropriate programs from Phylip:
- contrast
- contml
- gendist
mauve
DONE
The latest release of mauve is from 2015. This works fine with Java8, but will not work with Java11.
Short term fix: Run under JDK8
https://edwards.sdsu.edu/research/running-mauve-with-java-10/
Modify mauve script to look for Java8. If found, use it. If not found, pop up a message saying to install Java8.
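The version check in the mauve script could be sketched as below. The function name is hypothetical. It relies on two facts about the java launcher: the version banner from java -version is written to stderr, and Java 8 and earlier report themselves as 1.x (eg. "1.8.0_292") while later releases report the major version first (eg. "11.0.2").

```python
import re


def java_major_version(version_text):
    """Extract the major Java version from the output of 'java -version'."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_text)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier use the old 1.x numbering scheme.
    if first == "1" and second is not None:
        return int(second)
    return int(first)
```

The script would capture the banner with subprocess.run(["java", "-version"], stderr=subprocess.PIPE, text=True), launch mauve if the result is 8, and otherwise pop up the message saying to install Java8.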
primer3
- New version: There is a new version 2.4 dated 2017. It looks like the command structure is the same, so hopefully it will be backward compatible with existing scripts. Primer3 is available at Sourceforge.net.
- Parse output so that it can go to bltable.
- Initially, I thought it doesn't look like it really is worth doing. The raw primer tables are not that useful, and the main report page already gives some alternative primers.
- However, Primer3 by default shows you the BEST primer pair. In many cases, one or both primers aren't where you'd like them to be. If the alternative primer lists were in bltable, they could be sorted. Most useful would be to sort by position.
- Toward that end, it may be that the best thing is to use the generic bltable, rather than creating a BioLegato specifically for Primer3 primer lists. We need to add BLsort, and there are probably a few other things that could be used to bring the functionality of bltable up to what is found in blmarker, blncbi, blnfetch etc.
- To support PCR-based cloning:
modify primer3.py and bldna to support TARGET command
- The Artemis manual makes it look like you can put a script into the artemis etc folder and get it to run. This is what it already does for runblastp and such. It might be very useful to get primer3 to run on sequences selected in Artemis. That would make it easy to go from a gene to a PCR fragment.
- It might be easy to implement electronic PCR from bldna.
mrtrans
DONE
It may be time to replace mrtrans with something better
- mrtrans only uses universal genetic code
- mrtrans has bizarre problems with both input and output
Possible replacements:
- BioPython module Bio.codonalign seems to do this.
- Done. <strike>pal2nal - Perl script; can read different genetic codes
- pycogent - Python; doesn't seem to have a way to use alternative genetic codes
- tranalign from EMBOSS (actually, a rewriting of mrtrans, but can read alternative genetic codes, among many other options); This is a good starting point. Justin has already installed EMBOSS 6.4.0 in /home/psgendb/local/install.
- write a replacement in Java
- just write a wrapper, similar to mrtrans.csh
</strike>
Basic Genomics Tools
It should be possible to identify a set of basic genomics tools that are used by common 3rd party packages.
- could save on disk space
- could make it easier to get some things to work
- exposes documentation for these tools to the documentation hierarchy, even if they are "hidden" many layers down in packages that include them.
Examples:
samtools
- bamtools
trimmomatic
- fastq
jellyfish