From Bioinformatics.Org Wiki


1 Future Releases
- 1.1 Installation and Updating
  - 1.1.1 getbirch
  - 1.1.2 Updating
  - 1.1.3 Desktop setup
  - 1.1.4 local files
- 1.2 Dependencies for 3rd party programs
- 1.3 BIRCH
  - 1.3.1 More modern looking web site
  - 1.3.2 Settings for each User
  - 1.3.3 Python
  - 1.3.4 Java
  - 1.3.5 Gnu Parallel
  - 1.3.6 Libraries
    - 1.3.6.1 Delete old libraries
    - 1.3.6.2 OSX Dependencies
  - 1.3.7 Documentation
  - 1.3.8 MacOSX M1 (aka Apple Silicon) support
- 1.4 BioLegato
  - 1.4.1 Need mechanism for BioLegato to run commands in the background
  - 1.4.2 Development
  - 1.4.3 GetInfo - Colourmask: new colours don't display
  - 1.4.4 system command appears to have no effect
  - 1.4.5 get rid of wrappers for text editors
  - 1.4.6 Output to console
  - 1.4.7 Table Canvas
  - 1.4.8 chooseviewer.py
  - 1.4.9 blsort.py
  - 1.4.10 BLHelper.py
- 1.5 birchadmin
- 1.6 birchdb
- 1.7 Quick and dirty patch/addon mechanism
  - 1.7.1 Definition of an add-on
  - 1.7.2 Algorithm
- 1.8 FSAP, XYLEM
  - 1.8.1 add support for gap in location
  - 1.8.2 Convert FSAP and XYLEM to Free Pascal?
- 1.9 BLAST+
  - 1.9.1 Blast output viewers
- 1.10 Phylogeny
- 1.11 bldna
- 1.12 blfeatures
- 1.13 blprotein
- 1.14 blncbi
- 1.15 bltable
- 1.16 blreads
- 1.17 blpandas
- 1.18 Multiple Alignment
- 1.19 High throughput multiple alignment programs
  - 1.19.1 MAFFT
  - 1.19.2 Grishin Lab Software
  - 1.19.3 MstatX
  - 1.19.4 GUIDANCE2
  - 1.19.5 TCOFFEE
  - 1.19.6 blnalign, blpalign
- 1.20 Multiple Alignment Tutorial
- 1.21 blnfetch, blpfetch
- 1.22 bltree
- 1.23 blmarker
- 1.24 blcont
- 1.25 mauve
- 1.26 Artemis
- 1.27 primer3
- 1.28 mrtrans
- 1.29 Basic Genomics Tools
2 BIRCHv4.01 (Current Development Version, UNSTABLE)
3 BIRCHv4.00 (Current Production Version, STABLE)
4 BIRCHv3.90
5 BIRCHv3.87
6 BIRCHv3.86
7 BIRCHv3.85
8 BIRCHv3.80
9 BIRCHv3.71
10 BIRCHv3.70
11 BIRCHv3.60
12 BIRCHv3.50
13 BIRCHv3.40
14 BIRCHv3.30
15 BIRCHv3.20
16 BIRCHv3.15
17 BIRCHv3.10
18 BIRCHv3.00
19 BIRCHv2.9
20 BIRCHv2.8

Future Releases

Installation and Updating

getbirch

Apparently, IST allows CGI scripts on personal home pages:

Brian,
Another option is to use our Homepages (home.cc) server.
It seems you already have psgendb set up on there. 
This server supports CGI and additionally uses SUexec, which causes
the CGI to run as the account owning the files (rather than the system httpd 
account), so it should be able to write to a log in its central CC home
directory.
https://home.cc.umanitoba.ca/cgi.html
This mechanism does impose some strict security rules, which can result in obtuse  
error messages. If you want to try this approach and run into "Internal Server  
Error"s, let us know, so we can troubleshoot.
Cam

birchtally.cgi - Currently uses the Python cgi library, which has been deprecated for some time and will be removed in Python 3.13. The current latest version is 3.12.1. However, we have a little more time, since typical OS releases are several versions of Python behind.

We should also look a bit at security. Can we make the cgi script executable but not readable? Would it be more secure to compile it as a pyc file?
See https://docs.python.org/3/library/cgi.html#module-cgi for alternatives to the cgi module.
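One stdlib-only way to drop the cgi module is to read the query string from the QUERY_STRING environment variable and parse it with urllib.parse.parse_qs. The following is only a sketch under assumptions: it handles GET requests only, and the function names and the field names used in the example are hypothetical, since the fields that birchtally.cgi actually receives are not listed here.

```python
#!/usr/bin/env python3
"""Sketch: read CGI GET parameters without the deprecated cgi module."""
import os
from urllib.parse import parse_qs


def get_fields(environ=os.environ):
    """Return a dict mapping each query field to its first value."""
    query = environ.get("QUERY_STRING", "")
    # parse_qs returns {name: [value, ...]}; keep only the first value
    return {name: values[0] for name, values in parse_qs(query).items()}


def response(fields):
    """Build a minimal plain-text CGI response: header, blank line, body."""
    body = "\n".join("{}={}".format(k, v) for k, v in sorted(fields.items()))
    return "Content-Type: text/plain\r\n\r\n" + body + "\n"


if __name__ == "__main__":
    print(response(get_fields()), end="")
```

For example, get_fields({"QUERY_STRING": "host=flamingo&version=4.0"}) gives a dict with the two fields host and version.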

The framework and binary files are very big, and downloads are often unreliable on WiFi. Two possible fixes:
- split files into several smaller files
- does Java have methods to resume downloads if they fail?
GetBirch Wizard - need to do better error checking and print better error messages during downloads, unzipping and untarring. In part this is because sometimes downloads are incomplete. There is no point in proceeding if these steps fail.
CRITICAL! Check for Java (non-headless) during install.
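The download and unpacking checks mentioned above could look something like this. It is a Python sketch rather than the Java used by the GetBirch wizard, and it assumes that a SHA-256 checksum is published alongside each file, which is not currently the case as far as these notes say. The function names are hypothetical.

```python
import hashlib
import tarfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(path, expected_sha256):
    """Return (ok, message); ok is False if the archive should not be unpacked."""
    try:
        actual = sha256_of(path)
    except OSError as err:
        return False, "cannot read {}: {}".format(path, err)
    if actual != expected_sha256:
        return False, "checksum mismatch for {} (incomplete download?)".format(path)
    try:
        # Listing the members makes tarfile read through the whole archive
        with tarfile.open(path) as archive:
            archive.getmembers()
    except (tarfile.TarError, EOFError, OSError) as err:
        return False, "{} is not a readable tar archive: {}".format(path, err)
    return True, "ok"
```

The installer would stop and show the message whenever ok is False, instead of carrying on with a partial file.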

These changes can be put into operation without waiting for a new release, since they don't affect the stable release files.
The question is, how?

- Wrap getbirch in a script, and check for what we need there?
- Just make it very prominent on the download page that you need to have these installed? For example, clicking on GETBIRCH could pop up a new web page that tells the user what must be installed, and how to test for it?
- Is there a way to do this with JavaScript in the browser?
- In any case, we need to revise the Install with GetBirch page to show the current look of the GetBirch wizard.

record of downloads - add failover to a second host, if sending data to flamingo fails
option to get onto mailing list

Java 1.8 on Ubuntu - Default doesn't work with getbirch.jar. Need to comment out a line in /etc/java-8-openjdk/accessibility.properties as described in [1]. This really sucks, because there doesn't seem to be a non-root way to fix this. The -Djavax switch recommended by another responder doesn't appear to work.

Test whether JDK is headless

dpkg -l | grep openjdk
ii  openjdk-8-jre-headless:amd64          8u171-b11-0ubuntu0.18.04.1            amd64        OpenJDK Java runtime, using Hotspot JIT (headless)

Solution: Document on GetBirch site that JDK MUST be full JDK, and not headless. If getbirch.jar won't run, install full JDK

Debian/Ubuntu:

sudo apt-get remove openjdk-8-jre-headless
sudo apt-get install openjdk-8-jre
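A check like the dpkg test above could be scripted, for example in a wrapper around getbirch. This is a minimal sketch, assuming Debian/Ubuntu package names of the form shown in the listing above; the function name is hypothetical, and other distros would need their own test.

```python
def headless_only(dpkg_listing):
    """Return True if only headless OpenJDK packages are installed.

    dpkg_listing is the text printed by 'dpkg -l | grep openjdk'.
    """
    full, headless = False, False
    for line in dpkg_listing.splitlines():
        fields = line.split()
        # Installed packages have status 'ii' in the first column
        if len(fields) < 2 or fields[0] != "ii":
            continue
        package = fields[1].split(":")[0]      # drop the ':amd64' suffix
        if not package.startswith("openjdk-"):
            continue
        if package.endswith("-headless"):
            headless = True
        elif package.endswith("-jre") or package.endswith("-jdk"):
            full = True
    return headless and not full
```

If headless_only returns True, the wrapper would tell the user to install the full JRE before running getbirch.jar.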

Updating

who do we notify about new versions of BIRCH, and how do we do it?

Desktop setup

Need to update desktop setup tutorials to reflect current Linux desktops.
- GNOME3 - avoid like the plague! If you must use, also install Fedy on Fedora. Is Fedy available for RedHat?
- GNOME Classic - preferred to GNOME3
- Xfce
Need a mechanism for figuring out which desktops are available, and putting the appropriate HTML documentation files into index.html when customdoc.py runs.

local files

update-local - automated change of local GDE menus to PCD menus
mechanism for adding a .obsolete extension to files or directories that are no longer supported, comparable to NewLocalFiles.list

Dependencies for 3rd party programs

First, we need to begin a list of programs that have special dependencies:

program	dependency
Weblogo	numpy
Weblogo	ghostscript (for formats other than eps)

It's probably a good idea to put this into birchdb, once the necessary fields for the table become stable.

Next, we need a way to detect each dependency during an install or update, and to somehow inform the user of the dependency.

Ideally, there would be a way to install the dependency during install or update, if the user is the BIRCH administrator. However, that would be a lot of work, and may not be stable. There could also be version dependencies, such as needing Python 3, or Java 8 or greater.
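Detection could start out as something very simple. The sketch below checks for Python modules with importlib and for executables on the PATH with shutil.which. The function name and the tuple layout are hypothetical, and 'gs' as the Ghostscript command name is an assumption.

```python
import importlib.util
import shutil


def missing_dependencies(requirements):
    """Return the (program, kind, name) entries whose dependency is absent.

    kind is 'module' for a Python module, or 'command' for an executable
    expected on the PATH.
    """
    missing = []
    for program, kind, name in requirements:
        if kind == "module":
            found = importlib.util.find_spec(name) is not None
        elif kind == "command":
            found = shutil.which(name) is not None
        else:
            raise ValueError("unknown dependency kind: " + kind)
        if not found:
            missing.append((program, kind, name))
    return missing


# Entries from the table above; 'gs' for Ghostscript is an assumption
REQUIREMENTS = [
    ("Weblogo", "module", "numpy"),
    ("Weblogo", "command", "gs"),
]
```

An install or update would call missing_dependencies(REQUIREMENTS) and report whatever comes back, rather than trying to install anything itself.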

BIRCH

Time to remove Staden from the local BIRCH site. The binaries are only for solaris. Should also remove Staden lines from local-generic local.cshrc/local.profile. Remove lbirchdb and $doc. $dat? install directory?
Solaris - No one in the bioinformatics world seems to use Solaris. However, it should be noted that Oracle still supports Solaris, which is now at version 11.3. We don't want to be too quick to drop Solaris, because it might make a comeback in bioinformatics servers on the cloud. Oracle also appears to be still developing the Sparc chips. Possibilities:
- drop Solaris support
- ~~Solaris appears to be free to download. Could either install on a PC, or run it in a VM (need a 64-bit VM).~~
~~PRIORITY!! Phase out of GI numbers - NCBI will begin phasing out GI numbers over 2016.~~
- ~~Re-test as software is updated.~~
Create a generic setgroup script for BIRCH
Build name substitution into scripts for running programs that may truncate long names (eg. Phylip).
- uniqid.py
- The TCOFFEE package has a program called seq_reformat that can encode and decode names.
add support for FISH shell http://www.fishshell.com
FAQ - Some of the questions and answers in the main BIRCH FAQ are a bit dated. This document needs to be reviewed and updated.
better documentation on how to get email notification working
add mechanism for desktop notification eg. GNOME, MAC?
more email notifications for long running jobs (eg. phylogeny, multiple alignment)
Using Unix pages
- link to Linux man pages - definitive site for Linux command manual pages
- For Unix command summary, make links to specific commands. The downside of the man7.org page is that it has ALL commands, so it's hard to find man pages for the common commands.
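For the name substitution item in the list above, the idea can be sketched as follows. This is not uniqid.py or seq_reformat, just an illustration under assumptions: names are replaced by short sequential IDs that fit in 10 characters (the Phylip name length), and the IDs are assumed not to occur elsewhere in the program output. The function names and the "S" prefix are hypothetical.

```python
import re


def encode_names(names, prefix="S"):
    """Map each original name to a short unique ID of at most 10 characters.

    Returns (short_names, decode_table), where decode_table maps ID -> name.
    """
    width = max(len(str(len(names))), 1)
    if len(prefix) + width > 10:
        raise ValueError("too many names for a 10-character ID")
    short_names, decode_table = [], {}
    for index, name in enumerate(names, start=1):
        short = "{}{:0{}d}".format(prefix, index, width)   # eg. S01, S02
        short_names.append(short)
        decode_table[short] = name
    return short_names, decode_table


def decode_text(text, decode_table):
    """Replace every short ID in program output with the original name."""
    if not decode_table:
        return text
    # One regex pass, so restored names are never themselves re-scanned
    pattern = re.compile("|".join(re.escape(s) for s in decode_table))
    return pattern.sub(lambda match: decode_table[match.group(0)], text)
```

A wrapper script would call encode_names before running the program, and decode_text on the output files (eg. a tree file) afterwards.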

More modern looking web site

Requirements

customdoc.py chooses separate content depending on whether this is a multiuser system or a single user system. This is more subtle than it may seem. For example, if the person installs birch on a lab workstation that is not accessible to the outside world, would we do anything different from a single user who installs on their own account?

Strategy

For next release: home page only. In subsequent releases, we'll slowly work on modernizing pages
ONLY use HTML5 with CSS stylesheet. Maybe experiment with Javascript, but there would have to be a compelling reason to do so.
Start with an alternative web site eg. newindex.html that has to prove it can do everything that index.html does. We need to tread very slowly with the use of styles, because we don't want to break existing pages.

Home Page Layout

We want a lot of links on the front page to make it easy to find things in a few clicks. You don't want to make the user explore a bunch of hierarchical menus.

Screenshot gallery? One of those graphics that shifts between screen shots to explain visually what BIRCH looks like

Settings for each User

We need to have a directory within the $HOME directory where settings for each user can be stored.

Variables in $HOME/.config/BIRCH/BIRCHsettings.source
variable name	description
BL_EMAIL	email address for notifications
BIRCH_PROMPT	Y or N - tells whether to use the BIRCH command line prompt. Y overrides prompt set in .bashrc file.
BL_TextEditor	text editor for BioLegato
BL_PDFViewer	PDF viewer for BioLegato
BL_PSViewer	PostScript viewer for BioLegato
BL_ImageViewer	bitmap image viewer for BioLegato
BL_Document	Word Processor for BioLegato
BL_Spreadsheet	Spreadsheet for BioLegato
BL_Browser	Web browser for BioLegato
BL_Terminal	Terminal program for BioLegato
NCBI_ENTREZ_KEY	API Key for Eutils

Where do we put it?

$HOME/.config/BIRCH - The results below make this directory the best choice. On one hand, it may not matter.
The important thing is once we commit to a location, we're committed! There's no going back.
- ~~Check: Fedora, Debian, MacOSX~~
- ~~Google searches to see if this question has a somewhat universal answer.~~


OS/desktop	directory
RHEL7/Xfce	$HOME/.config
RHEL7/GNOME	$HOME/.config
fedora31/GNOME	$HOME/.config
Ubuntu 18/MATE	$HOME/.config
Mac OSX	$HOME/.config

https://specifications.freedesktop.org/basedir-spec/latest/ar01s03.html defines environment variables that specify where files for applications are stored. Config files are usually in a directory within $HOME/.config.

OSX - https://eclecticlight.co/2019/08/28/preference-settings-where-to-find-them-in-mojave/ tells us that the place for preferences on the OSX desktop is ~/Library/Preferences. This directory contains a .plist file for each Mac application. The plist file is a binary. My feeling is that since BIRCH will be using a textfile for settings, we avoid this directory, especially because BIRCH is not an app, but in many respects more like an operating system.

Implementation

I can't think of any reason that the BioLegato Java application needs to be modified to accommodate, or even know about this file. Even in the case where we want the user to be able to create his own .blmenus files, the pcd.properties file includes pcd.menus.path, which has a list of paths to check for .blmenus files. Boy we sure were forward thinking when we set this up!
Newuser script should create this directory, if it doesn't exist. Probably we don't delete when we run nobirch. As well, we probably need to get newuser to ask for an email address, if one is not already set. The environment variable should be BL_EMAIL.
At launch, BioLegato should check for $HOME/.config/BIRCH/BIRCHsettings.source. However, we don't need to create it. If it doesn't exist, or the particular setting needed is not in this file, just go with the BIRCH default.
Need a BioLegato way to change info. Should there be a separate birchadmin for non-administrators? First we get it to work and user can edit files by hand. Then we worry about a BioLegato way of changing parameters. Call it 'birchuser', as opposed to 'birchadmin'?
Should each BioLegato have a link that launches birchuser, maybe under the name of Settings, or something like that?
Another approach would be to simply take the existing BLHelper menu item from birchadmin and replace it with a single menu that lets the user set all settings. We could organize it as tabs to add all necessary functionality. That way a separate application wouldn't be needed. It might even be possible to implement with a symbolic link to a file in $dat/birch that had the master copy of the .blmenu file, rather than duplicating this in all BioLegato directories.
Since both newuser and BioLegato (and maybe other things) need to make sure that this directory exists, there should be a standalone Python script that does that. Maybe it's a toolkit, along the lines of blastdbkit, that does all operations on $HOME/.config/BIRCH. BioLegato would call this script, as would newuser etc.

Strategy

1. Create a script that creates this directory if necessary, and populates it with settings
2. Get newuser to run it.
3. Get BioLegato to run it
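Step 1 could begin as a script along these lines. It is a sketch under assumptions: BIRCHsettings.source is taken to hold plain NAME=value lines (lines written as "export NAME=value" are not handled), and the default values and function names are hypothetical.

```python
import os

# Illustrative defaults only; the real BIRCH defaults live elsewhere
DEFAULTS = {"BIRCH_PROMPT": "N", "BL_EMAIL": ""}


def settings_path(home):
    """Return the path of the per-user settings file under home."""
    return os.path.join(home, ".config", "BIRCH", "BIRCHsettings.source")


def ensure_settings(home):
    """Create $HOME/.config/BIRCH and an empty settings file if missing."""
    path = settings_path(home)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        open(path, "a").close()
    return path


def load_settings(home, defaults=DEFAULTS):
    """Return defaults overridden by NAME=value lines in the settings file."""
    settings = dict(defaults)
    try:
        with open(settings_path(home)) as handle:
            for line in handle:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                name, value = line.split("=", 1)
                settings[name.strip()] = value.strip().strip('"')
    except FileNotFoundError:
        pass    # no file yet: go with the BIRCH defaults
    return settings
```

newuser would call ensure_settings(os.path.expanduser("~")), while anything that only reads settings can call load_settings and never needs the file to exist.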

Python

~~There should be a common Python directory for installed Python packages. We already have $BIRCH/python and should add $BIRCH/local/python. Added $BIRCH/local-generic/python.~~ Forget the idea of a common Python directory. Because many Python modules such as numpy require compilation of C code, libraries cannot be assumed to be portable across platforms. We will phase out use of $BIRCH/python and $BIRCH/local/python in favor of lib-linux-x86_64/python and lib-osx-x86_64/python.

Python libraries

Bioconvert - Bioconvert is a collaborative project to facilitate the interconversion of life science data from one format to another. Bioconvert currently contains 44 formats and 95 conversions.
BioPython - We should probably have a complete BioPython in each lib-xxx-x86_64 directory. There are too many dependencies with platform-specific object libraries to put this in $BIRCH/python. It could end up saving a lot of headaches in the long run.

Python3

The goal should be to make BIRCH completely compatible with Python3.
From Ubuntu 18.04 Release notes: "Python 2 is no longer installed by default. Python 3 has been updated to 3.6. This is the last LTS release to include Python 2 in main."
It probably makes sense to do a batch run of all .py scripts using 2to3, to create an inventory of scripts that need to be updated.
By now, we can probably count on Python3 being available on all systems. This may make it possible to avoid asking the user to install Python3.
- Spoke too soon. Scientific Linux 7 does NOT have Python3 installed by default. Say what? Really? At the same time, RHEL8 will NOT have Python2, by default. Sheeesh!
It looks like __future__ is the preferred method for allowing Python2 code to use Python3 features.
Most features of Python3 that potentially conflict with Python2 have been backported to python2.7.

Is there a way, on a script by script basis, to force use of Python3? That way, as we progress, we can focus on developing for Python3, and do 2to3 conversions that are not backward compatible with Python2.
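One way to do this per script is a python3 shebang line plus a version check at the top of the script, so that a script started with the wrong interpreter fails right away with a clear message. A minimal sketch; the function name is hypothetical:

```python
#!/usr/bin/env python3
import sys


def require_python3(version_info=sys.version_info, minimum=(3, 0)):
    """Exit with a clear message if the interpreter is older than minimum."""
    if tuple(version_info[:2]) < minimum:
        sys.exit("This script requires Python {}.{} or later; found {}.{}".format(
            minimum[0], minimum[1], version_info[0], version_info[1]))


require_python3()
```

The shebang only helps when the script is run directly. When a menu calls "python script.py", it is the version check that catches the problem.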

By now, we can count on Python3 being available, but not necessarily being the system default. We need to explicitly call Python3 in those cases where Python3 is required.

Ubuntu 22 - There is no 'python' command on Ubuntu22. There is a 'python3' command in the default install, but you have to explicitly say 'python3'. If you want python2, you install the python2 package and type 'python2'. You can get 'python' to give you python3 if you install the python-is-python3 deb package.

Two possibilities:

change every BioLegato menu and every python script to explicitly use python3. It is not entirely obvious that this is the best solution, since eventually, all systems will have a 'python' command that gives you python3. (Won't they?)
add mention of 'python-is-python3' to required packages. As python3 becomes the default, the need for this package should go away.

Machine	Python3 version
flamingo	3.6.9
canola	3.8.10
brassica	3.10.6
triticum	3.10.6
peacock	3.7.2
CCL	3.6.8
maui	3.8.10
fedora31	3.7
wotan	3.6.8

Compatibility issues

Handled by 2to3
- Print as a function - 2: print is a statement, eg. print "text" 3: print is a function and requires (), eg. print("text")
- chmod octal format - 2:os.chmod(PROGFN, 0664) 3:os.chmod(PROGFN, 0o664)
- Dict keys as list - 2:PROGLIST = PROGDICT.keys() 3:PROGLIST = list(PROGDICT.keys())
- In Python3, urllib has changed from Python2. See discussion of future.moves and six.moves, as well as examples of urllib intercompatibility at http://python-future.org/
- if os.environ.has_key('BIRCH') replaced by if 'BIRCH' in os.environ
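The items in the list above, written in the form that works in Python3, look like this. The names PROGDICT and PROGFN follow the examples in the list; the data is made up. The try/except import is one plain alternative to six for the urllib change. On Python2 a script would also need "from __future__ import print_function" as its first statement, which is left out here.

```python
import os
import stat
import tempfile

try:                                    # Python 3
    from urllib.request import urlopen
except ImportError:                     # Python 2
    from urllib2 import urlopen

PROGDICT = {"blastn": 1, "fasta": 2}    # hypothetical data

print("print is a function")            # 2: print "print is a statement"
PROGLIST = list(PROGDICT.keys())        # 2: PROGDICT.keys() was already a list
HAVE_BIRCH = "BIRCH" in os.environ      # 2: os.environ.has_key('BIRCH')

# 0o664 is accepted by Python 2.6+ and Python 3; 0664 is Python 2 only
with tempfile.NamedTemporaryFile(delete=False) as handle:
    PROGFN = handle.name
os.chmod(PROGFN, 0o664)
MODE = stat.S_IMODE(os.stat(PROGFN).st_mode)
os.remove(PROGFN)
```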

BIRCH Python compatibility

~~admin~~

newuser.py
nobirch.py

install-scripts
Python2&3 compliant:

birchhome.py
UNINSTALL-birch.py
Update_birch.py
update_local.py

Not yet compliant:

birch_install.py - needs urllib.request (try fixing with six)
setplatform.py - needs urllib.request (try fixing with six)
test.py - imports urllib but doesn't directly call it. Do we need this declaration?

scripts

Python2&3 compliant:

BIRCHSettings.py
blastdbkit.py
BLHelper.py
createlauncher.py
customdoc.py
htmldoc.py
linklauncher.py
vncsetup.py

3rd Party Python3 compatibility

How to set PYTHONPATH

The pip command installs Python packages from repositories. By default, they are installed system-wide, but we want to install them in $BIRCH/python. For example, to install the package gffutils, we type

pip3 install --install-option="--prefix=$birch/lib-$BIRCH_PLATFORM/python" gffutils

In python3.9 (and others?) --install-option is no longer supported. Instead,

pip3 install --root $birch/lib-$BIRCH_PLATFORM/python gffutils

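Will have to test to see if --root is in older pip versions. pip also has a --prefix option of its own (pip3 install --prefix=...), which may be a closer match to the old --install-option line. That needs to be tested as well.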

All packages installed in this manner will be in $BIRCH/lib-$BIRCH_PLATFORM/python.

We would have to add the PYTHONPATH environment variable to profile.source, cshrc.source etc.

PYTHONPATH=$birch/lib-$BIRCH_PLATFORM/python/lib/python3.5/site-packages
export PYTHONPATH
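The python3.5 in the path above is tied to one Python version. A script that builds the path could take the version from the running interpreter instead. A small sketch, assuming the lib/pythonX.Y/site-packages layout shown above; the function name is hypothetical:

```python
import os
import sys


def birch_site_packages(birch, platform, version_info=sys.version_info):
    """Return the site-packages directory for the given Python version."""
    pyver = "python{}.{}".format(version_info[0], version_info[1])
    return os.path.join(birch, "lib-" + platform, "python",
                        "lib", pyver, "site-packages")
```

For example, with birch set to /opt/birch and platform set to linux-x86_64, Python 3.5 gives the same kind of path as the PYTHONPATH line above.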

To install a package in $BIRCH/local

pip3 install --install-option="--prefix=$birch/local/lib-$BIRCH_PLATFORM/python" gffutils

Platform-dependent Python
In some Python packages (eg. cutadapt), platform-specific libraries (eg. C, C++) are part of the package, usually as .o files. These can be buried several directories down in the package, but they are there.

For such cases, we install in platform-specific python directories:

Linux-x86_64

pip3 install --install-option="--prefix=$birch/lib-linux-x86_64/python"

Mac-OSX

pip3 install --install-option="--prefix=$birch/lib-osx-x86_64/python"

Setting PYTHONPATH then becomes

PYTHONPATH=$birch/lib-$BIRCH_PLATFORM/python/lib/python3.5/site-packages
export PYTHONPATH

Portability

One of the big drawbacks with Python is that packages are expected to be installed in the root hierarchy with root privileges. pip3 often fails when trying to install in local directories. Worse, it seems that each package must be wedded to a particular version of Python eg. 3.5, 3.6, 3.7...

These problems are even true within a platform eg. from one Linux distro to another.

Potential solutions:

zipapp - Part of Python since 3.5. Not sure whether this will include dependencies but it's worth checking out.
Nuitka - Billed as a Python compiler. Claims to create an executable file that is fully standalone. I'll believe it when I see it, but it's worth a try.
package creators - There are various 3rd party programs that will create a single file that has all the dependencies
- PyInstaller
- probably a better description of how to use PyInstaller
- Create a Python wheel file

python4 not in the cards

There will be no python 4. Hooray!
https://www.techrepublic.com/article/programming-languages-why-python-4-0-will-probably-never-arrive-according-to-its-creator/

Java

needs to be done in NetBeans
need to re-compile all .jar files
test all non-BIRCH Java programs

Log4j2 vulnerability - At present there are no known vulnerabilities with the Java programs distributed as part of BIRCH. To a large extent, this is due to the intrinsic design of BIRCH. BIRCH does not run web servers or peer-to-peer functions. All applications run with end-user privileges only. Some applications do post requests to web services, either through URL queries or a Java API. Some applications do logging using log4j, but output is written to local files owned by the user.

Out of an abundance of caution, we will recompile existing applications where possible, or obtain updated jar files from the authors. Scans will be done that can detect applications using log4j2, including jar files (which are really just zipped archives, so they can easily be scanned.)
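Since jar files are zip archives, a first-pass scan is easy to script. The sketch below looks for classes in the log4j2 core package (org/apache/logging/log4j/core/). It does not look inside jars nested within jars, and it will miss copies that have been repackaged under another path, so a clean result is not proof. The function name is hypothetical.

```python
import zipfile

LOG4J2_MARKER = "org/apache/logging/log4j/core/"


def uses_log4j2(jar_path):
    """Return True if the jar (a zip archive) bundles log4j2 core classes."""
    with zipfile.ZipFile(jar_path) as jar:
        return any(name.startswith(LOG4J2_MARKER) for name in jar.namelist())
```

Running this over every .jar file found under $BIRCH would give a starting list for the Status column in the table below.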

Progress on applications:

Package	Comments	Status
Archaeopteryx
ArrayNorm
artemis	uses log4j2; Need to get update from EBI
axis2
BioLegato	needs to be recompiled
birchutils	doesn't use log4j2	OK
blrevcomp	doesn't use log4j2	OK
Blastviewer
BRIG-0.95-dist
Cytoscape	will be upgraded when a patched version is released
eutils
FastQC
genographer
getbirch	needs to be recompiled
Jalview	upgraded to version 2.11.1.5-j1.8	OK
mauve
Mesquite
MWCalculator
readseq
shuffle	doesn't use log4j2	OK
TM4
Trimmomatic
Trinity	has some Java tools; need to check with Trinity about updates, if any

Gnu Parallel

We should investigate ways in which GNU Parallel can be used to speed up programs. This looks like an amazing and versatile program that has a lot of ways to speed up serial code without changing the programs themselves. There is a good discussion on BioStars.

Libraries

Delete old libraries

especially those associated with GDE. The best way is to rename a library using the .old extension. The libraries to try are:

fc4libs
~~gde.fonts~~
~~openwin~~
MeV

Testing: The main programs of concern are acedb and treetool.

OSX Dependencies

xcode - ABySS - On one Mojave machine, gave the error: "xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun". Googling this error indicates that this sometimes appears after an update, but can be resolved in most cases by running "xcode-select --install". It is not clear how widely this will affect OSX machines, and it merits further investigation before a fix, if any, is done. It may be enough to check for xcrun as part of the pre-install process.

Documentation

There is a nice description of the meaning of accession numbers on Biostars, with some good links. We should find a place for those links in our documentation.
BIRCH Administrator's Guide
- Update doc on making BIRCH site web-accessible
  - need to install apache2
  - How to use an IP address in BIRCH Properties Tool.

MacOSX M1 (aka Apple Silicon) support

MacOS-ARM Development

BioLegato

Get BioLegato to compile in Java 11. All current Linux and MacOS releases default to at least Java 17, with the exception of Ubuntu 20, which defaults to Java 11. We would also be abandoning MacOS High Sierra, which defaults to Java 8, although High Sierra really has to be deprecated. We'll keep it in Java 8 for now, but for the future we want things to work with newer versions of Java.
BioLegato needs to be able to launch programs in the background
PCD
- VerifyBox - PCD should support insertion of Go/No-go statements into the Shell command. This would pop up a box asking the user to verify that they want to do a task.
  Implementation: This is best done in Java. There simply is no GUI for Python that is distributed as a STANDARD library on all platforms. Tkinter/tkinter claim such, but it simply isn't true, because they depend on Tcl/Tk being present, which (surprisingly) is not the case on all Linux distros. Java has this as a standard feature. The other advantage of doing this in Java is that we can later put the code directly into BioLegato, once we can add it to the PCD grammar.
- need a reset to default values button for menus
- Bugzilla #1204 - PCD system command appears to have no effect
- set default values or min/max numbers as the result of a command
- conditional grey-out of non-relevant items?
- cascading variables?

Need mechanism for BioLegato to run commands in the background

At present, there is no way for PCD shell commands to run jobs in the background. That is, the Java Virtual Machine cannot terminate until every shell command has terminated. Even if the command ends with an ampersand, it must terminate before the JVM will terminate. That is an annoyance when we want displayed output to persist even after a BioLegato job has terminated, and a potentially major problem if we want to launch long-running or resource-intensive jobs from BioLegato.

It's probably best to write a short demo program to experiment with different approaches.

Hints:

src/BioPCD/parser/src/org/biopcd/parser/CommandThread.java is the object that calls Runtime.getRuntime to execute commands as new threads. see shellCommand in this file.
As a temporary workaround, we can call scripts through a bash wrapper that uses nohup and & to run the script in the background.
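For experimenting with the wrapper idea outside of Java, here is a Python sketch (not the BioLegato CommandThread code) that starts a command detached from the caller, roughly what nohup plus & does in a bash wrapper. The function name is hypothetical, and start_new_session works on Linux and MacOSX only.

```python
import subprocess


def run_in_background(argv, log_path=None):
    """Start argv in its own session and return immediately.

    Output is appended to log_path if given, otherwise it is discarded,
    so the child never waits on a pipe owned by the parent.
    """
    out = open(log_path, "ab") if log_path else subprocess.DEVNULL
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=out,
            stderr=subprocess.STDOUT,
            start_new_session=True,   # POSIX only: detach from our session
        )
    finally:
        if log_path:
            out.close()               # the child keeps its own copy
```

The point of redirecting all three standard streams is that a caller waiting on the child's output has nothing left to wait for. Whether that is what holds up the JVM is something the demo program would have to confirm.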

Links:

Solution: In BioLegato 1.0.3, CommandThread.java has been modified so that if a command line ends in '&', it will be run in the background.

Remaining issues:

On Linux, we can run anything we want in the background using &. This doesn't always work in MacOSX, for reasons that are not clear. If you launch output in the background, the text editor doesn't pop up until the parent BioLegato process is terminated. It looks like what we have to do is to only use & when output is being sent to files. The rest of the time, we don't use &. This is probably okay, because if you log out, you certainly don't expect or want viewers to persist. If you don't log out, there's no need to kill BioLegato. One important thing is to first remove & from the birch launcher and birchadmin. We need to do more thorough testing on the remaining BioLegato programs. In some cases, there may be a need to run programs in the background through wrappers.
Most shell commands run the command in the background. However, if temp files aren't explicitly saved, they get deleted before the command has a chance to process them. Therefore, we need to go through the .blmenu files and make sure that "save true" is included for all input file declarations eg in1.
~~birch~~
~~birchadmin~~
~~bldna~~
blprotein
blnalign
blpalign
blnfetch
blpfetch
bltree
blncbi
blmarker

Development

make a roadmap of BioLegato structure. Probably several roadmaps, actually
- high-level structure
- initialization
- map for each canvas
- PCD --> menu, how?
changelog: Wiki, text, web page?

GetInfo - Colourmask: new colours don't display

Bugzilla #1201

Hints:

getInfoAction
SequenceWindow
SequenceList
SequenceTextArea

The Update action is contained in the SequenceWindow. My guess is that we need to pass the SequenceTextArea to the SequenceWindow so that it can call the repaint function for SequenceTextArea. It is worthy of note that there are numerous calls to repaint in SequenceTextArea that specify the area to repaint. This may be for efficiency during actions like select and scroll, and may not be necessary here.

system command appears to have no effect

Bugzilla #1204

Hints:

isSystemSupported may be relevant

get rid of wrappers for text editors

The BioLegato scripts call choose_edit_wrapper.sh, which in turn calls either nedit_wrapper.sh for nedit or gedit_wrapper.sh for gedit.

We can certainly eliminate nedit_wrapper.sh because the LD_LIBRARY_PATH problem no longer applies. It's probably best to first get rid of nedit_wrapper, because it appears that when it is called as we do now, through several layers of scripts, nedit sometimes doesn't launch.
Hopefully there is a way to eliminate the gedit wrapper too. gedit has a --new-window option that forces gedit to open a new window rather than a new tab (a new tab is what we don't want).

Output to console

We need to decide on a standard way to run programs so that we see the progress as the program runs. Currently this is done using the command stored in $GDE_TERM, but that is not necessarily platform independent. Some possibilities include:

an external script that we can pass the command to, that will pop up a console display, and an "Okay" button when the job is finished
an intrinsic BioLegato command that sends output of a job to a console display

Table Canvas

Need a built-in function to add rows or columns. There is already a Delete Row function, so it should be easy to add. This will be useful for things like blnfetch and blpfetch, when you want to type in ACCESSION numbers by hand.
MacOSX - table canvas doesn't have border lines between cells, even though you see them on Linux. Is there some Swing or AWT attribute that could be set to force border lines to be shown?

chooseviewer.py

should be able to open presentations eg. .pptx
other media like audio, video?
The problem with deleting an HTML file while it is still in the browser is that at least Firefox won't save the file under a new name if you ask it to. Since the default is to have it pop up in a temporary file, this makes it hard to save a copy of the output after you have viewed it in a temporary file.

blsort.py

sort dates. We already know how to do dates from blastdbkit.py, so we borrow that code here.
re-annotate parameters using pydoc @param operator

BLHelper.py

revisit how BinaryExists looks for apps. The command string could be any of
- open -aWXYZ appname (where WXYZ might be any number of options)
There should be a refresh button to update birchadmin to show any newly-installed applications
The command for all Save buttons should include a notify popup to tell the user to restart birchadmin for changes to be seen.

birchadmin

birchadmin is a birch system administration tool.

menus
- File
  - nobirch (some users may wish to turn off birch access. Think about this one. It's really only particularly relevant on multiuser sites. Or is it?)
- Web
  - global string change - we need an easy menu for recursively doing a string substitution in all files in a directory that have certain characteristics eg. a file extension. The program should make a backup copy of the file, preserving the file's metadata. An example would be "in all files with a .blmenu extension, replace $DOC with $BIRCH/doc"
  - change BIRCH URL
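The global string change item could be backed by a script along these lines. It is a sketch with hypothetical names: it does a plain (non-regex) substitution, assumes the files are UTF-8 text, and uses shutil.copy2 for the backup because copy2 keeps timestamps and permission bits.

```python
import os
import shutil


def replace_in_tree(top, extension, old, new, backup_suffix=".bak"):
    """Replace old with new in every file under top ending in extension.

    Each changed file is first copied to <name><backup_suffix>.
    Returns the list of files that were changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for filename in filenames:
            if not filename.endswith(extension):
                continue
            path = os.path.join(dirpath, filename)
            # newline="" keeps the file's own line endings untouched
            with open(path, encoding="utf-8", newline="") as handle:
                text = handle.read()
            if old not in text:
                continue
            shutil.copy2(path, path + backup_suffix)
            with open(path, "w", encoding="utf-8", newline="") as handle:
                handle.write(text.replace(old, new))
            changed.append(path)
    return changed
```

The example from the menu item would be replace_in_tree(birch_dir, ".blmenu", "$DOC", "$BIRCH/doc"). Note that a plain substitution would also change a longer name that begins with $DOC, so the list of changed files should be reviewed.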

birchdb

doc2ace.py - update list of supported file formats, especially to include newer OpenOffice/LibreOffice formats
reimplement birchdb using hsqldb.
Need a GUI front end.
- hsqldb has a few GUI tools.
- LibreOffice Base can act as an HSQLDB client.
- Top 10 MySQL GUI Tools
- Valentina Studio
URGENT! - Need to fix the birchdb and tbirchdb scripts to work properly on CCL, and still work elsewhere.
- Could it be that it works from a csh script, and not a bash script?

The problem is that failure of birchdb to launch Xace or tace has been inconsistent. It works on some days, and not on others. It is as if something keeps getting set or unset.

Although error messages aren't consistent, here's one (on jupiter):

Gtk-WARNING **: Failed to load module "libgail.so": libgail.so: cannot open shared object file: No such file or directory
Gtk-WARNING **: Failed to load module "libatk-bridge.so": libatk-bridge.so: cannot open shared object file: No such file or directory

Other times, this script gives a Segmentation Fault error. Once again, the only place I've had this trouble is on CCL.

As well, there is a GUI front end called RazorSQL which may be all that we need to manage birchdb.

- InstallAddUpdate
  - ~~update - this will have to wait until the new version of Getbirch, and the accompanying install-scripts directory, are ready.~~
  - install add-on - ditto. Ideally, it should be possible to choose an addon from a menu and have Getbirch install it.
need to revise documentation, and get doc into database

Quick and dirty patch/addon mechanism

We need a way to apply patches to an existing BIRCH install. This should be a very simple mechanism to start with, which will also teach us some things about exactly what it is that we want it to do. Initially, it should probably be nothing more than running a script that downloads a file and untars it, so that the files just go where they are supposed to go with permissions already set.

We need a mechanism to record in $BIRCH/local which addons are installed. This way, when a BIRCH update is installed, we can make sure to re-install any addons.

Definition of an add-on

An add-on includes:

a tar file which, when untarred in the $BIRCH directory, will install files in the right places.
an install script
a line to be appended to a flat-file database somewhere in $BIRCH/local. This is probably a csv-style line. It should at minimum give a unique name for the add on.

xxxxx.addon.d
    payload.tar
    install.py
    addon_spec.csv

An add-on can either be something new that is installed, or a patch that overwrites existing files, or even a script that runs and changes something. For example, a patch might be as simple as a script that changes important permissions, or changes the name of a file, or does a string substitution to correct an error.

Algorithm

get list of available addons/patches
user selects one or more
foreach addon selected
    cd $BIRCH
    download addon
    gunzip addon.tar.gz
    tar xvfp addon.tar
    cd xxxxx.addon.d
    mv payload.tar $BIRCH
    cd $BIRCH
    tar xvfp payload.tar
    cd xxxxx.addon.d
    python install.py
    cat addon_spec.csv >> $BIRCH/local/admin/addons.csv
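
The per-addon steps above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the archive has already been downloaded, it is trusted, and its first entry is the xxxxx.addon.d directory. The function names install_addon and installed_addons are hypothetical. Unlike the outline, the payload is untarred straight into $BIRCH without being moved first.

```python
import csv
import subprocess
import sys
import tarfile
from pathlib import Path

def install_addon(archive, birch):
    """Install one already-downloaded add-on archive (.tar.gz) into the BIRCH tree."""
    birch = Path(birch)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(birch)                      # creates xxxxx.addon.d under $BIRCH
        top = tar.getnames()[0].split("/")[0]      # assumed: first entry is the addon dir
    addon_dir = birch / top
    with tarfile.open(addon_dir / "payload.tar") as payload:
        payload.extractall(birch)                  # files land where the tar paths put them
    # Run the add-on's own install script from inside its directory
    subprocess.run([sys.executable, "install.py"], cwd=addon_dir, check=True)
    # Record the add-on so it can be re-installed after a BIRCH update
    registry = birch / "local" / "admin" / "addons.csv"
    registry.parent.mkdir(parents=True, exist_ok=True)
    with open(registry, "a") as out:
        out.write((addon_dir / "addon_spec.csv").read_text())
    return addon_dir

def installed_addons(birch):
    """Names (first CSV column) of the add-ons recorded in addons.csv."""
    registry = Path(birch) / "local" / "admin" / "addons.csv"
    if not registry.exists():
        return []
    with open(registry, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]
```

After a BIRCH update, installed_addons($BIRCH) would give the list of names to re-install.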

FSAP, XYLEM

add support for gap in location

The GenBank release notes define three forms of the gap operator as part of a feature location. For example:

CONTIG      join(complement(AADE01002756.1:1..10234),gap(1206),
            AADE01006160.1:1..1963,gap(323),AADE01002525.1:1..11915,gap(1633),
            AADE01005641.1:1..2377)

This is distinct from the gap feature key, which is supported by GETOB
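
FSAP and XYLEM are Pascal, but a quick Python sketch shows what recognizing the operator involves. It assumes the three forms are gap(), gap(N) and gap(unkN); that should be checked against the current release notes. GAP_RE and parse_gaps are names made up for the example.

```python
import re

# Assumed forms (verify against the GenBank release notes):
#   gap()       unknown length
#   gap(N)      estimated length of N bases
#   gap(unkN)   unknown length, represented by N bases
GAP_RE = re.compile(r"gap\((unk)?(\d*)\)")

def parse_gaps(location):
    """Return (length, length_is_known) for each gap operator in a location string."""
    gaps = []
    # Remove line wraps and blanks before matching
    for unk, digits in GAP_RE.findall("".join(location.split())):
        length = int(digits) if digits else None
        gaps.append((length, bool(digits) and not unk))
    return gaps
```

For the CONTIG line shown above, parse_gaps returns the three known-length gaps 1206, 323 and 1633.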

Convert FSAP and XYLEM to Free Pascal?

Free Pascal appears to still be supported. They have builds for Linux and MacOS, both on x86_64 and ARM64. There are even RPM and DEB packages.

GNU Pascal looks like it hasn't been supported since 2005. GNU Pascal has a great many features beyond the Jensen & Wirth standard, including support for most Borland features, and even abstract object types and methods. The main improvement would be that we could leave behind p2c. This should be done with great care and a lot of testing, because there could be surprises hiding in the implementation. See http://www.gnu-pascal.de/gpc/h-index.html

Have tested Free Pascal on macos-arm64 and linux-x86_64. Newly compiled programs work. Also, code compiled statically using fpc -XS on Ubuntu 18 executes on Red Hat 7. Presumably, RH7 has the oldest C-libraries, so if the binaries run there, they will run on any Linux system.

BLAST+

blastdbkit.py - NCBI packages the human_genome and mouse_genome databases under file names that start with GCF_. For now the exact file prefixes are hard-coded into the script, but we need a better way to "discover" the filenames.
blastdbkit.py - NCBI has a penchant for changing the basenames of databases. For example, there used to be swissprot.00.*, and now it's just swissprot.*. Betacoronavirus has undergone similar changes. Is it feasible for blastdbkit to check for basenames in the local database that are not found on the FTP site, and then delete the presumably obsolete local files?
Need more ways to find out sizes of databases. Possibilities:
- ~~Add an item to BIRCH, bldna, blprotein to open local database spreadsheet (DONE)~~
- Add a feature to blastdbkit that tacks on the size in Gb of the database to the name of the database in the Biolegato menu items. This shouldn't take up much width on the screen.
~~Add email notification. DONE v3.71~~ Can we also have some sort of desktop notification as an alternative?
Consider adding some sort of functions for blastdb_aliastool which lets the user create virtual databases which are subsets of the larger databases.
Consider how to deal with multiple query sequences. If you select more than 1 sequence in BioLegato, both FASTA and BLAST will use each as queries, and the report will have a summary of results for all. Is there a better way to handle multiple queries?
~~BIRCHv3.60 -Need to let user explicitly set CPU limits. Otherwise, BLAST uses ALL available CPUs.~~
Web blast appears to have a feature that says "download aligned sequences", meaning that the part of the hit that is in the alignment can be downloaded. tblastn appears to download the original DNA sequence, which was translated during the tblastn search. This could potentially be useful, in cases where a complete global alignment of the query and subject exists. It could also be very dangerous, when those conditions don't apply ie. a local alignment. We should look into whether or not we can get the same output from BLAST+. It would probably have to go through the history.
BIRCHv3.60 - BLAST+ 2.8 - need to be compliant with new database format (v5) that improves blast performance when limiting taxa or given a list of ID numbers, as well as retrieval of data using blastcmd. At present, I believe that only most common databases are available by FTP for v5. We need to come up with the best way to either support both old and new versions, or to easily upgrade to v5 databases.
Compute Canada mirror for BLAST databases - It would probably be useful to create a mirror for BLAST databases, so that there would be a 2nd North American mirror. (biomirror doesn't store .md5 data for database files, which undermines its usefulness.) The Rapid Access Service at Compute Canada provides easy access to projects using comparatively "small" resources. For example, the /project filesystem allow up to 10 Tb per group, which should be more than enough for NCBI databases for quite some time.
blastdbkit.py - could we speed things up a bit using unpigz?
Is there a way to get BLAST to report the length of the library sequence in the hit summary? This would be useful to give us some idea of the types of hits we're getting. Alternatively, could this size go into the blnfetch output as a separate column?
- Have a look at blastdbcmd, which takes a sequence id as input and retrieves a lot of types of information from a local BLAST database. This could have a whole lot of uses in BioLegato. Try 'blastdbcmd -help' for a manual page.
~~Create an environment variable that will tell the number of cores, so that BLAST and FASTA can use multithreading.~~BIRCH_CORES.

In Python:

import multiprocessing
# logical CPU count reported by the OS; a possible default for BIRCH_CORES
print(multiprocessing.cpu_count())

~~v3.60: multiple CPUs - simple approach is to just use all of them. Do we want to have a menu item to choose the number of CPUs? Biolegato has chooser to set number of CPUs~~
~~Automated install/update of NCBI Blast databases, run from birchadmin.~~

Algorithm:

if BLASTDB not set
    prompt for directory (default $BIRCH/GenBank)
read list of database divisions currently installed
read list of database divisions to be installed
uninstall those not in the list from previous step
install all divisions in the install list
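
The uninstall/install decision in the algorithm above is just a set comparison. A rough Python sketch, with plan_changes as a made-up name and the two lists assumed to be already read from disk:

```python
def plan_changes(installed, wanted):
    """Return (to_uninstall, to_install) as sorted lists of database division names."""
    installed, wanted = set(installed), set(wanted)
    # Anything installed that is no longer on the list gets removed
    to_uninstall = sorted(installed - wanted)
    # The algorithm installs every division on the list, which also refreshes
    # divisions that are already present
    to_install = sorted(wanted)
    return to_uninstall, to_install
```

If re-downloading divisions that are already current turns out to be too slow, to_install could be narrowed to wanted - installed plus whatever the update check flags as stale.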

Could do this as:

shell script with BioLegato front end
Python script with BioLegato front end. This could be implemented by adapting BLHelper.py. The menu layout would look something like:

Nucleotide (nt)            Installed        Install O    Delete O
Protein (nr)               Installed        Install O    Delete O
RefSeq RNA (refseq_rna)    Not installed    Install O    Delete O

Java application

Special cases of est and pdbnt databases
- Where to document, aside from Adding BLAST Databases page
- How should blastdbkit.py deal with these?
  - If est or pdbnt are installed, generate message to also install dependencies?
  - message in blastdbkit.log?
  - don't include in BLAST/FASTA menus unless dependencies are installed
  - In updates, automatically update dependencies if est or pdbnt are installed?
~~upgrade to FASTA 36.3.8 (April 2016)~~
- have a look at FASTA scripts for annotation.
work out in depth the ways to get the most performance out of local BLAST and FASTA.
add option to set number of CPUs
- This paper demonstrates the counterintuitive finding that multiple CPUs actually degrade performance when memory is smaller than the size of the database being searched. http://www.nersc.gov/users/computational-systems/genepool/performance-and-optimization. It may be good to add environment variables that are set at install time and default to reasonable values for one's system.
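
One way such a default might be chosen, as a very rough Python sketch. The rule (all cores if the database fits in RAM, otherwise one thread) is our reading of the finding above, not something taken from the paper, and the function name and the 0.8 headroom factor are arbitrary assumptions.

```python
def choose_blast_threads(db_bytes, ram_bytes, cores, headroom=0.8):
    """Suggest a thread count: all cores if the database fits in RAM, else one.

    headroom is the fraction of RAM assumed to be available for caching the
    database (hypothetical value; tune per site).
    """
    if db_bytes <= ram_bytes * headroom:
        return max(1, cores)
    return 1
```

The result could be written to an environment variable at install time and still be overridden from the BioLegato CPU chooser.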

output should always show real, user and sys times for search. Maybe these could be pseudocomments in table-report?
blastdbcmd integration into BioLegato (akin to blncbi)
launch BLAST/FASTA from Artemis
better support for the more specialized blasts such as rpsblast and psiblast
In BioLegato (and in reports?) should display both the descriptions of databases, and the codes for the databases.
FASTA output in multiple formats. Run fasta once with several -m options, one per output format,
eg. -m "F8 $NAME.fasta.tsv" -m "F2 $NAME.fasta2"

Blast output viewers

Firefox is a pain in the ass because if you run BioLegato on a remote server, sending output to Firefox will give an error message if you also have Firefox running on your local desktop. Two possible options:
- Does Firefox have a command line option that lets you avoid the lock?
- The Java API has a browser that can display HTML, and probably visit links. Could we launch this one as a default browser for BLAST output?
Blastviewer - [2]; The most intuitive of the tools I've tried. Requires NCBI XML file as input using -outfmt 5 option. Current version doesn't take filenames at the command line. I have contacted the author about fixing this. DONE: New version takes files at command line. Added to blast searches as an output option.
BLASTGrabber - BMC Bioinformatics 2014 15:128.
- Pro: Nice GUI, seems pretty solid; written in Java; fairly easy to figure out; most tables can be exported to csv; is able to parse BLAST+2.10.0 text output files.
- Con: doesn't take input files from the command line; no apparent way to retrieve hit sequences directly from the program; not as intuitive to user as Blastviewer.
jamblast - requires an SQL server, which makes installation non-trivial; last update 2013. [3]
Blast Output Viewer Generator - Looks like what you do is run BLAST at the NCBI web site, save output to a text file, and BOVG generates a web page that views the output interactively.
BlasterJS (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205286) - Looks like a JavaScript library that could be included in other web applications that use BLAST.

Phylogeny

HIGH PRIORITY: preserving metadata using phyloXML. Most newer phylogeny programs use phyloXML to represent trees. We need to have a way to link sequence IDs to sequence metadata eg. taxonomy. phyloXML does this. There are BioJava and BioPython classes for doing at least some of this. One approach might be to read Phylip Newick files, get metadata from IDs, and generate phyloXML files from that. Solution: bltree can add bootstrap values to a phylogenetic tree, and save output to a phyloxml file.
Booster - bootstrapping for large datasets. See Lemoine F, Domelevo Entfellner JB, Wilkinson E, Correia D, Davila Felipe M, de Oliveira T, Gascuel O. (2018) Renewing Felsenstein's phylogenetic bootstrap in the era of big data. Nature, doi:10.1038/s41586-018-0043-0
~~Phylip programs don't allow special characters in sequence names eg :[]() etc. Solution: This is documented in the man page for uniqid.py. By default, uniqid.py will not use these characters.~~
~~add uniqid to the pipeline to get over the limitation in Phylip that you need to have names <= 10 characters in length.~~
Archaeopteryx
- NEED TO REPLACE ARCHAEOPTERYX - The developer seems to have re-designed Archaeopteryx as a JavaScript applet at https://github.com/cmzmasek/archaeopteryx-js/tree/master. It doesn't seem to be runnable at this site, as the various links usually give a 404 Not found error. Downloading the 2024 package doesn't help, as all of the test html pages give errors when trying to find libraries. The Developers' documentation gives no insights. A skilled JavaScript developer might be able to get this to work, but the lack of support suggests that it is best to stick with the older standalone application for the time being, and to look for a better option for drawing phylogenetic trees.
  There is an outside chance that the Not Found errors could be due to phyloxml.org being temporarily down. We should have a look some other time, in case this gets fixed.
- ~~change all Phylip programs to launch Archaeopteryx instead of ATV.~~
- ~~Update tutorials for Archaeopteryx~~
- ~~delete ATV from birchdb~~
- ~~delete ATV from $BIRCH/java~~

bldna

FASTA search of local database seems to mess up columns in output, at least when the database is a User-created file. This may affect blprotein as well.
Replace readseq wherever possible. Readseq is no longer supported. SeqKit has many similar options.
Replace TCAG with something else. Candidates:
- Sequence Extractor - generates mouse-aware HTML with restriction sites and primers
FEATURES - Extract by feature keys. Using the Create a list of features option results in empty output. Only tested using NG_008290. Don't know if this is a problem for all GenBank files.
~~Prevention of runaway jobs: Failure to select a sequence for BACHREST still results in runaway jobs. This was fixed with a simple change to bachrest.py, that was already present in numseq.py.~~
New GenBank feature key: propeptide - needs to be implemented in GETOB and bldna. See GenBank Release Notes under Upcoming Changes.
When reading a GenBank file, if no ACCESSION number is given, insert an arbitrary and invalid number eg. XXXXXX. What this does is to make it possible to use FEATURES directly with a GenBank file created by Sequin.
Eliminate duplicates - Before constructing a multiple alignment, it is important to eliminate duplicate sequences, which would otherwise bias the alignment towards sequences with many duplicates. We need to add a tool that would let you eliminate duplicates, based on various criteria. For example, number of mismatches, alignment scores etc. This could either appear as a separate step in doing an alignment, or as a pre-filtering step in the menu for programs such as T-COFFEE or DIALIGN-T. It would probably be useful to have the output generate a list of deleted duplicates, either as a list of names, or as output to new bldna or blprotein windows.
Notes:
- ~~Possibilities:~~
  - ~~filter_fasta.py from Qiime http://qiime.org/scripts/filter_fasta.html Requires pip to install, and numpy.~~
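
As a starting point for the Eliminate duplicates item, here is a Python sketch of the simplest case only: removing exact duplicates, ignoring case. Criteria such as number of mismatches or alignment scores would need a clustering tool instead. read_fasta and drop_exact_duplicates are hypothetical names.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            records.append([fields[0] if fields else "", []])
        elif line and records:
            records[-1][1].append(line)
    return [(name, "".join(parts)) for name, parts in records]

def drop_exact_duplicates(records):
    """Keep the first copy of each sequence; return (kept, deleted_names)."""
    seen = set()
    kept, deleted = [], []
    for name, seq in records:
        key = seq.upper()          # compare case-insensitively
        if key in seen:
            deleted.append(name)
        else:
            seen.add(key)
            kept.append((name, seq))
    return kept, deleted
```

The deleted list is the "list of deleted duplicates" mentioned above; it could be sent to a new bldna or blprotein window.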

Primer3 -- new capabilities and interfaces (probably should replace the current Primer3). Is this what NCBI uses in Primer-BLAST? https://www.ncbi.nlm.nih.gov/pubmed/22730293
Source available at https://sourceforge.net/projects/primer3/
Dotplots
- This page links to a large number of dotplot programs. In particular, see D-Genies.
- DXHOM - Is it possible to simplify running these programs by automatically creating an opposite strand if one is specified? The current DXHOM programs demand that the X-axis be the opposite. Is there a conceptual downside to automatically running blrevcomp.py, as opposed to making the user do this manually? Maybe we can change the menu so that the only place you can specify an opposite strand is for the X-axis. That is, put STARTX and FINISHX and opposite all on one panel, and STARTY and FINISHY on a separate panel. That might not work because the menu would end up being pretty wide.
- dotter - compile for MacOSX - No luck so far in compiling dotter on albacore. Can't find libraries, even though they are installed. Maybe try compiling on another Mac?
  There is a 2016 paper describing SeqTools, which includes dotter. There is a more recent dotter that could be installed, which includes a Mac version. Also, Dotter claims to be able to read gff3 files, but I have so far not been able to generate gff3 files from GenBank files that dotter can read. The new version is probably worth looking into.

blfeatures

How about a BioLegato that displays a GenBank features table. It would be an output option from the Features program. blfeatures would use the table canvas to display feature information:

Accession    FeatureKey    Location    Qualifiers....

You could do the usual scan/sort/extract operations to get a narrowed-down list of features. Then retrieve the features you want from the GenBank files.
This might be far more useful than one might originally think.
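
A rough Python sketch of how the rows for such a table might be pulled out of a GenBank flat file. It assumes the usual column layout (feature key starting in column 6, location and qualifiers starting in column 22) and does only minimal handling of wrapped lines, so wrapped /translation values will pick up spaces. feature_rows is a made-up name.

```python
def feature_rows(genbank_text):
    """Return [accession, feature_key, location, qualifiers] rows from GenBank text."""
    rows, accession, in_features = [], "", False
    for line in genbank_text.splitlines():
        if line.startswith("ACCESSION"):
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else ""
        elif line.startswith("FEATURES"):
            in_features = True
        elif in_features and line[:1] not in (" ", ""):
            in_features = False                    # ORIGIN, CONTIG etc. end the table
        elif in_features and line[:21].strip():
            # New feature: key in columns 6-20, location from column 22
            rows.append([accession, line[:21].strip(), line[21:].strip(), ""])
        elif in_features and rows and line.strip():
            text = line.strip()
            if text.startswith("/"):
                rows[-1][3] += (" " if rows[-1][3] else "") + text
            elif rows[-1][3]:
                rows[-1][3] += " " + text          # wrapped qualifier value
            else:
                rows[-1][2] += text                # wrapped location
    return rows
```

Each row could be written out tab-separated for the table canvas.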

blprotein

Need to add the ability to extract features from GenBank protein files. In particular, most proteins have a /coded_by= tag that gives the exact Feature expression to retrieve the source CDS from the protein. This would be tremendously useful. Is there something in Eutils that does this?
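
Pulling the /coded_by expression out of a protein entry is straightforward. A small Python sketch, assuming the value is always double-quoted and may wrap across lines; coded_by is a made-up function name:

```python
import re

CODED_BY_RE = re.compile(r'/coded_by="([^"]+)"')

def coded_by(genpept_text):
    """Return every /coded_by location expression, with line wraps removed."""
    return ["".join(value.split()) for value in CODED_BY_RE.findall(genpept_text)]
```

The resulting expression could then be used for retrieving the source CDS from the nucleotide entry.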
~~Add Clustal Omega, delete TCOFFEE~~
cd-hit Eliminate duplicates - Before constructing a multiple alignment, it is important to eliminate duplicate sequences, which would otherwise bias the alignment towards sequences with many duplicates. We need to add a tool that would let you eliminate duplicates, based on various criteria. For example, number of mismatches, alignment scores etc. This could either appear as a separate step in doing an alignment, or as a pre-filtering step in the menu for programs such as T-COFFEE or DIALIGN-T. It would probably be useful to have the output generate a list of deleted duplicates, either as a list of names, or as output to new bldna or blprotein windows.
software for helical wheel plot
blprotein - PXHOM can't work with lowercase amino acids. Where do we do the fix? PXHOM, hom.py, or BioLegato?
Bugzilla #1216
- ~~Short term Fix: set readseq to convert aa sequences to uppercase before running PXHOM.~~
- Better long term fix is to modify P1HOM and P2HOM to accept lowercase. Actually, this will require modifying sunmods.p, p1hom.rp, p2hom.rp and prostat.rp. While we're at it, we should add support for the amino acids pyrrolysine (Pyl,O) and Leu/Ile (Xle,J).

blncbi

Do Eutils allow you to retrieve sequences from TSA? Current blncbi can't do that.
In Molecule type pull down menu, change label for mrna to "mRNA (cDNA)". Search term should still be mrna.
Entrez API keys - Need to add compliance with the new apikey parameter.
- ~~see below - has NCBI fixed the limitation on epost with a comma-separated list of ACCESSION numbers?~~
- ~~Etools have an option to add email address and things like the name of the program calling EUtils eg. seqfetch.py. This is a good idea, because it helps NCBI figure out where requests are coming from.~~
- ~~Maybe each user should have an NCBI_API_KEY environment variable, or at least we check for that before calling Eutils.~~
- If exceeding allowed number of requests per second, API generates an error message. We should probably be able to handle that.
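
One way to handle the too-many-requests error is to wait and retry with a growing delay. A Python sketch, assuming the error arrives as an HTTP 429 raised as urllib.error.HTTPError (that should be checked against what the Entrez library in use actually raises). call_with_backoff and its parameters are made up for illustration.

```python
import time
import urllib.error

def call_with_backoff(request, retries=4, first_delay=0.5, sleep=time.sleep):
    """Call request(); on HTTP 429, wait and retry, doubling the delay each time."""
    delay = first_delay
    for attempt in range(retries + 1):
        try:
            return request()
        except urllib.error.HTTPError as err:
            # Give up on other errors, or once the retries are used up
            if err.code != 429 or attempt == retries:
                raise
            sleep(delay)
            delay *= 2
```

Any Eutils call can be wrapped in a zero-argument function or lambda and passed in as request.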
Add a UNIQ option to the blsort.
Search of non-sequence databases - high priority should be GEO and PubMed. Also, ability to go from sequence to GEO and PubMed, and back.
>>>>CRITICAL>>>>>Compliance with NCBI requirement for https.
As of Sept. 1, 2016, http queries to NCBI will automatically be forwarded to https. See http://www.ncbi.nlm.nih.gov/news/06-10-2016-ncbi-https/. The current BioPython Entrez methods distributed with BIRCH (v1.65) still use http for running cgi scripts. At present, the API appears to work, even for efetch retrievals of > 200 sequences. (The documentation implies that these would not work with https.) This would seem to imply that the change shouldn't break the current Entrez API, but hopefully a new BioPython release will explicitly use https. It might be best to contact the BioPython Entrez developers to get them to check this.
seqfetch.py - Currently limited by NCBI's epost command. At present, epost will only accept GI numbers, not ACCESSION numbers. Presumably, NCBI will get around to fixing this, since they are committed to retiring GI numbers. Until they do, seqfetch.py uses efetch to retrieve based on ACCESSION numbers. This prevents the inclusion of equery terms. One practical consequence is that when retrieving based on ACCESSION, we can't specify minimum or maximum size limits.
File --> Export to spreadsheet
Visualization of NCBI search results
The web interface for Entrez searches is not very useful. When there are a lot of results, you just get short summaries of the first 20, 50 etc. results that really aren't manageable. We need a better way to visualize hits when there are a large number of them.
- Feature map - Is there a nice tool that given a set of GenBank entries as input, will generate a graphic map of each sequence? This might have to be a two step process. Step 1 would be to extract features into a GFF file, and then a program that renders GFF files into a graphical display. This pipeline could be called by blncbi to generate a document that lets you browse quickly through a bunch of sequences. For example, suppose you were shopping for vectors, and had it narrowed down to a short list of a few hundred. You could then browse through the maps to get a better idea of what was in them. We might be able to do this in a quick and dirty way by first generating a text-based GFF file to browse, and later introducing a feature display.
- Pie chart - This will emphasize the most prominent categories of hits
- Bar graph - this will usually be something like a decay curve, where a few cohorts take up most of the hits.
- Word cloud/Tag Cloud - might be useful if we wanted to look at some secondary search field, and see the relative proportions of, say, some key words.
  The main point is that each of these provides us with some information that might be helpful in narrowing down subsequent searches. It also tells us something about the data itself.
  The question is, do the Eutils tools make it easy to generate the statistics needed for these views? Ideally, we'd rather not have to iterate 1 by 1 through 200,000 hits to find a key:value pair for each hit.
  Word/Tag Cloud programs:
  - Semantic Word Cloud Visualization - Web application, and the .jar file can be downloaded as a standalone http://wordcloud.cs.arizona.edu/index.html
Nucleotide database query - In the Output tab, the Output Format combo box includes an obsolete choice for GI number. Can we replace this with an option to retrieve Accession numbers?
Entrezpy - claims to be a more complete set of Python tools for Entrez than the BioPython Entrez libraries. We should have a look.

bltable

csvtk - toolkit for performing operations on csv files
Function for splitting lines into columns based on a specific separator such as '|'
Look over options for reading column headings.
we should add a -uniq option to blsort.py
All BioLegato interfaces using the Table canvas should have a function to export to a spreadsheet, as determined by the $BL_Spreadsheet variable. ie. blmarker, blnfetch, blpfetch, blncbi, bltable

blreads

Quast has several specialized versions, including MetaQUAST and QUAST-LG (large genomes). We should incorporate these into blreads.
Additional RNAseq read mappers - It is a bit too simplistic to just have Hisat2. One paper shows that bwa and bowtie2 both were more accurate on a test dataset than hisat2. Hisat2 typically had AUCs less than 0.97, while bwa and bowtie2 were usually between around 0.98 - 0.99. Hisat2 uses less RAM.
Adaptor trimming: check out BBDuk from BBMap package. One comment on BioStars said that this was more accurate and faster than Trimmomatic. This, by the way, would be a good topic for a video.
As long as we're giving alternatives for read mapping, we should also see if there are tools that would let us compare the accuracy of mapping, much as we compare different genome assemblies in a spreadsheet.
File --> Open directory - Currently, you must select a directory to open. It would also be good if, when no directory was selected, a file chooser popped up so that you could open the new window in someplace other than a directory contained in the current working directory.
Would it be useful to have a function or two to calculate read coverage? There are two potential applications. A priori: given genome size and read size, calculate the number of reads needed for a certain coverage, or range of coverages. A posteriori: take read files as input, with information on genome size, and calculate actual coverage. In each case, generate an HTML report.
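
Both calculations come down to the usual relation coverage = (number of reads x read length) / genome size. A minimal Python sketch, assuming a single fixed read length; the function names are made up:

```python
import math

def expected_coverage(n_reads, read_length, genome_size):
    """A posteriori: mean coverage C = N * L / G."""
    return n_reads * read_length / genome_size

def reads_needed(coverage, read_length, genome_size):
    """A priori: smallest N such that N * L / G >= coverage."""
    return math.ceil(coverage * genome_size / read_length)
```

For paired-end data, each mate counts as a read. With real read files, N * L would be replaced by the total number of bases across all reads.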
Removing contaminating reads from fastq/fasta files - There are many situations in which the read pool will contain reads both from the primary organism being investigated, as well as symbionts. One example would be if you isolate a microbe from a plant, you might want to get rid of the plant reads. The other would be if you have a host and you want to get rid of the reads from the microbial symbiont eg. sequencing an alga and you get bacterial symbionts. Some of these appear to involve making a reference BLAST database of the contaminant genome, and then BLASTing all reads against the database, and running a filter to remove hits based on names.
One consideration: In principle, you could do the assembly, and then eliminate contigs that appear to be contaminants. Is that a better approach?
- using Bowtie (metagenomics)
- SeqTrim
- CLC protocol
- MEGAN
- DeconSeq
- Ideas from BioStars
miRNA
sPARTA
last-bisulfite - align bisulfite-converted DNA reads to a genome
PacBio support
- SequelTools - Tools for QC and processing of Pacbio reads
- A good paper illustrating combining PacBio and Illumina is Gulvik CA et al. (2019) Complete Genome Sequence of Nocardia farcinica W6977T Obtained by Combining Illumina and PacBio Reads, Microbiology Resource Announcements DOI: 10.1128/MRA.01373-18
- FALCON - PacBio assembler for diploid genomes.
- nanopack - long read processing
Transrate is no longer supported. Let's try
- rnaQUAST
- BUSCO
Genome assembly: Test on large datasets, different sequencing platforms - Need to do more testing on larger datasets such as human genome, and with other platforms. Most testing has been on Illumina, 454 and Ion Torrent so far.
- Genome In A Bottle
- CHM13 - Telomere to telomere consortium (https://github.com/nanopore-wgs-consortium/chm13)

blpandas

The Pandas API seems ideally suited for a BioLegato front end. The data paradigm seems to be the data frame (df). Pandas does an operation on data in a data frame, and the output is another data frame. Sound familiar? See http://pandas.pydata.org

Here's how to do this:

Break out BioLegato as a standalone project, perhaps in a Git repo.
Create a demo blpandas
Advertise blpandas on the Pandas Stack Overflow forum. Solicit collaborators from the Pandas community.

Multiple Alignment

Alternatives to WebLogo - Weblogo is a god-awful program to install and maintain. Possible alternatives:
- LogoMaker
- 33 Software Tools To Generate Sequence Logos
reform: put a blank space between sequence names and alignment. Can we add tick marks to the numbers, maybe as an option?
cd-hit - On OSX, won't run. It needs to have libgomp.1.dylib. This is fixed by installing gcc using homebrew. Can we package that library in bin-osx-x86_64?
Is there a neat, automated way to extract only those positions from a multiple alignment that exceed some confidence score? Does JalView do this? One way might be if there was a script that could calculate that from a raw alignment. This function, along with the elimination of gaps, might make a nice script. See GUIDANCE2, Sela et al. (2015) NAR 43, W7-W14.
There are programs for automated removal of gap positions from multiple alignments. As well, some phylogeny programs automatically ignore, remove or handle gap positions. This is actually a complex problem, but it seems useful if blpalign and blnalign had one or more programs available to remove or mask gap positions. Googling "remove gaps before phylogeny" brings up a lot of good discussion on this topic, as well as suggestions of programs for gap removal.
Programs:
- Done.~~Gblocks~~
- trimal - Appears to have a number of useful tools for evaluating and trimming multiple alignments.
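
For comparison with the programs listed, the simplest kind of gap removal can be sketched in a few lines of Python: drop every column in which the fraction of gap characters exceeds a threshold. The function name, the 0.5 default and the choice of '-' and '.' as gap characters are all assumptions; the dedicated programs do considerably more than this.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    `alignment` is a list of (name, aligned_sequence) pairs of equal length.
    """
    seqs = [seq for _, seq in alignment]
    if not seqs:
        return []
    n = len(seqs)
    # Indices of the columns to keep
    keep = [
        col for col in range(len(seqs[0]))
        if sum(seq[col] in gap_chars for seq in seqs) / n <= max_gap_fraction
    ]
    return [(name, "".join(seq[c] for c in keep)) for name, seq in alignment]
```

Setting max_gap_fraction=0 removes every column that contains any gap at all.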

High throughput multiple alignment programs

MAFFT

~~Implement multithreading using --thread and $BIRCH_CORES~~
~~Upgrade to version 7~~

Grishin Lab Software

The Grishin Lab at HHMI has a lot of publications and tools related to protein evolution, structure and multiple alignment. The Grishin scoring matrix is one of the ones used in NCBI BLAST. See http://prodata.swmed.edu/Lab/Software.htm

MstatX

Calculates statistics for multiple sequence alignments. Output includes various scores for multiple alignment. This should be a good way for comparing the quality of alignments based on different methods or parameters.

https://github.com/gcollet/MstatX

GUIDANCE2

GUIDANCE2 seems to be a comprehensive package for evaluating multiple alignments. It is more polished than mstatx, and gives some pretty good output. The downside is it requires BioPerl and BioRuby modules, which may be an annoyance to install.

TCOFFEE

~~Replace TCOFFEE!!!~~

On MacOSX, t_coffee is v8.14. It has not been possible so far to get later versions to run on albacore. It was possible to compile the generic version but that also generates errors. It is not certain whether this is a problem with albacore specifically, or MacOSX in general. I ONCE installed TCOFFEE in an account on OSX, and the binary didn't work. Nonetheless, I was unable to run any previous version of TCOFFEE. Even after removing all of the TCOFFEE environment variables from all .rc files, and from the .MacOSX directory, every time I tried to run 8.14, it would create a new ~/tcoffee directory with the new version in it! This thing is like a virus. You just can't get rid of it. Somewhere in this account, there is a tcoffee script or settings lurking in a file.

~~Fortunately, the problem is limited to a single account.~~

~~There is now a Clustal Omega, which the authors claim is "The last alignment program you'll ever need". Maybe.~~

blnalign, blpalign

Updated Weblogo to v3.7. This version fails to parse quoted arguments containing blank spaces. Menus for blpalign and blnalign have been revised to warn user that blanks are not allowed. Issue was reported to weblogo github as issue #102.
- Another odd thing: On OSX, using -t '%ltitle' generates an empty EPS file, whereas using --title '%ltitle%' works.
consensus only works for DNA. It should be removed from blpalign. Is there a more recent consensus program we can use?
add Quasi analysis to identify positions under positive selection. Stewart et al. (2001) BMC Bioinformatics 2,1.
programs to create profiles
Edit --> Extract - Add START and FINISH sliders to Extract to allow the user to select a part of a sequence or alignment. This is usually way easier than trying to select in the BioLegato sequence pane. For blnalign and blpalign, you would be able to select parts of an entire alignment. For bldna and blprotein, you would need to restrict it to a single sequence, since it doesn't make sense to extract blocks of sequence when they aren't aligned. This is a good idea, but needs some careful thought.
For distance phylogeny programs:
- Create a separate menu item for calculating a distance matrix (DNADIST, PROTDIST, RESTDIST). Output to bltable.
- Create an option in phylogeny program menus to read in an already-existing distance matrix.
- Alternative programs for calculating distance matrices?
blnalign - method for translating a DNA alignment into a protein alignment
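A minimal sketch of what translating a DNA alignment into a protein alignment could look like. It assumes the alignment is in frame, that gaps come in whole-codon blocks ("---"), and that the standard genetic code (NCBI translation table 1) is wanted. The names CODON_TABLE and translate_aligned are hypothetical and are not part of BIRCH.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), listed in the
# conventional TCAG codon order: first base varies slowest, third fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_aligned(dna):
    """Translate one row of an in-frame DNA alignment, column by column.

    A whole-codon gap ("---") becomes a single "-" in the protein row.
    Codons containing ambiguity codes or partial gaps become "X".
    Trailing columns that do not fill a codon are ignored.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        if codon == "---":
            protein.append("-")
        else:
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)
```

Applying translate_aligned to every row keeps the rows the same length, so the result is still an alignment. Supporting other genetic codes would mean swapping the AMINO_ACIDS string for the one belonging to the chosen translation table.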

Multiple Alignment Tutorial

~~Eliminate DIALIGN~~
~~Fix alignment editing tutorial to match alignment~~
Update to latest version of MAFFT
New section on alignment colours

blnfetch, blpfetch

According to the GenBank release notes section 1.4.2, GI numbers will be phased out. This has already begun with WGS and TSA sequences. Eventually, new sequences will ONLY get an Accession.Version ID, and GI numbers will no longer be assigned.
add ~~Sort~~ and Find functions. These make it easier to hunt for an ACCESSION number when looking at BLAST output
For BLAST output - could we replace these programs with programs called blpblast and blnblast, that give us a multi-column BLAST output, where one column has the ACCESSION numbers? Essentially, these would be live BLAST output, where you can select the ACCESSION numbers you want and then retrieve the sequences.
Is there a straightforward way to get blnfetch and blpfetch to retrieve sequences with BLAST hits when the database was a User-Created FASTA file? Maybe seqkit grep will do the job (seqkit grep --help for more info)?
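As a rough illustration of the last two ideas, the sketch below pulls the subject IDs out of BLAST+ tabular output and then retrieves the matching records from a User-Created FASTA file. It assumes the default column order of -outfmt 6, where the subject ID is the second tab-separated field, and that the FASTA ID is the first word of each header line. The function names are hypothetical. This is a pure-Python stand-in for the retrieval step, not a replacement for seqkit.

```python
def hit_ids(blast_tabular_lines):
    """Return subject IDs (column 2) in order of first appearance."""
    seen = {}
    for line in blast_tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:
            seen.setdefault(fields[1], None)
    return list(seen)


def _record_id(header):
    """First whitespace-delimited word of a FASTA header, or ''."""
    words = header.split()
    return words[0] if words else ""


def fetch_fasta(fasta_lines, wanted):
    """Yield (header, sequence) for records whose ID is in wanted."""
    wanted = set(wanted)
    header, chunks = None, []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None and _record_id(header) in wanted:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    # Do not forget the final record in the file.
    if header is not None and _record_id(header) in wanted:
        yield header, "".join(chunks)
```

Both functions accept any iterable of lines, so open file handles can be passed directly. Records come back in the order they occur in the FASTA file, not in BLAST hit order.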

bltree

~~Archaeopteryx https://sites.google.com/site/cmzmasek/home/software/archaeopteryx Solution: Now replaces ATV.~~
- GUI is a bit weird and takes some getting used to
- We need to revise drawgram and drawtree to be able to use chooseviewer.py --ext. If that could be made to work, it would expand the list of bitmap viewers that can be used on OSX. The big problem with the OSX open command is that it is highly dependent on file extensions, and also that the formats generated by drawtree and drawgram are incredibly obsolete. This may not be worth the effort.
Treevolution http://vis.usal.es/treevolution

blmarker

Edit --> Sort rows
Sort columns, as a separate function?
break up phylogeny menus into tabs
separate function for creating distance matrix
add a choice to read in a distance matrix
delete empty rows before doing analysis
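Reading in an existing distance matrix might look something like the sketch below. It assumes a PHYLIP-style square matrix: the first token is the number of taxa, and each taxon name is followed by that many distances, possibly wrapped over several lines. It also assumes that names contain no blanks and are separated from the numbers by whitespace, which strict 10-character PHYLIP names do not guarantee. The function names are hypothetical.

```python
def read_square_matrix(text):
    """Parse a square distance matrix into (names, rows)."""
    tokens = text.split()
    n = int(tokens[0])
    names, rows = [], []
    pos = 1
    for _ in range(n):
        names.append(tokens[pos])
        # The next n tokens are the distances, even if they wrapped lines.
        rows.append([float(t) for t in tokens[pos + 1:pos + 1 + n]])
        pos += n + 1
    return names, rows


def to_tsv(names, rows):
    """Format the matrix as tab-separated text with a header row."""
    lines = ["\t".join([""] + names)]
    for name, row in zip(names, rows):
        lines.append("\t".join([name] + [f"{d:.4f}" for d in row]))
    return "\n".join(lines)
```

The tab-separated form is meant as the kind of table that could be handed to bltable. Lower-triangular matrices would need a different loop, since each row then holds fewer values.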

blcont

BioLegato for continuous data. This would be an implementation of bltable, targeted at data expressed in real numbers, such as phenotypic data. We would start out with the appropriate programs from Phylip:

contrast
contml
gendist

mauve

DONE ~~The latest release of mauve is from 2015. This works fine with Java8, but will not work with Java11.~~

https://github.com/AdoptOpenJDK/openjdk-systemtest/issues/148

Short term fix: Run under JDK8
https://edwards.sdsu.edu/research/running-mauve-with-java-10/

~~Modify mauve script to look for Java8. If found, use it. If not found, pop up a message saying to install Java8.~~
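The Java8 check could be sketched along the following lines. This is an illustration only, not the code in the actual mauve script. It assumes that java -version prints a line containing version "..." (normally on stderr), and that releases up to Java 8 use the old 1.x numbering. The function names are hypothetical.

```python
import re
import subprocess


def java_major(version_output):
    """Extract the major Java version from 'java -version' output.

    '1.8.0_292' gives 8 (old 1.x numbering); '11.0.2' gives 11.
    Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def detect_java_major(java="java"):
    """Run the given java executable and return its major version, or None."""
    try:
        result = subprocess.run([java, "-version"],
                                capture_output=True, text=True)
    except FileNotFoundError:
        return None  # no such executable on this system
    return java_major(result.stderr + result.stdout)
```

A launcher could call detect_java_major() on each candidate Java installation, use the first one that returns 8, and otherwise pop up the message asking the user to install Java8.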

Artemis

update to latest version on Artemis GitHub. This is a low priority, because the change logs after v17.1 are mostly to do with downloads, installation, and adapting Artemis to GitHub.

primer3

New version: There is a new version 2.4 dated 2017. It looks like the command structure is the same, so hopefully it will be backward compatible with existing scripts. Primer3 is available at Sourceforge.net.
Parse output so that it can go to bltable.
- Initially, I thought it doesn't look like it really is worth doing. The raw primer tables are not that useful, and the main report page already gives some alternative primers.
- However, Primer3 by default shows you the BEST primer pair. In many cases, one or both primers aren't where you'd like them to be. If the alternative primer lists were in bltable, they could be sorted. Most useful would be to sort by position.
- Toward that end, it may be that the best thing is to use the generic bltable, rather than creating a BioLegato specifically for Primer3 primer lists. We need to add BLsort, and there are probably a few other things that could be used to bring the functionality of bltable up to what is found in blmarker, blncbi, blnfetch etc.
To support PCR-based cloning:
- ~~modify primer3.py and bldna to support TARGET command~~
- The Artemis manual makes it look like you can put a script into the artemis etc folder and get it to run. This is what it already does for runblastp and such. It might be very useful to get primer3 to run on sequences selected in Artemis. That would make it easy to go from a gene to a PCR fragment.
- It might be easy to implement electronic PCR from bldna.
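For the idea of parsing Primer3 output into bltable, a minimal sketch follows. It assumes the Boulder-IO (KEY=VALUE) output of primer3_core in the 2.x series rather than the formatted report, with tags such as PRIMER_LEFT_0=start,length, PRIMER_LEFT_0_SEQUENCE and PRIMER_LEFT_0_TM, and a record ending in a line containing only "=". Whether primer3.py in BIRCH produces this form is not known here. The function names are hypothetical.

```python
import re


def parse_boulder(text):
    """Read the first Boulder-IO record into a dict of tag -> value."""
    record = {}
    for line in text.splitlines():
        if line == "=":
            break  # a lone "=" terminates the record
        key, sep, value = line.partition("=")
        if sep:
            record[key] = value
    return record


def primer_rows(record, side="LEFT"):
    """Return (index, start, length, tm, sequence) rows sorted by start."""
    rows = []
    pattern = re.compile(rf"PRIMER_{side}_(\d+)$")
    for key, value in record.items():
        m = pattern.match(key)
        if m:
            start, length = (int(x) for x in value.split(","))
            seq = record.get(f"{key}_SEQUENCE", "")
            tm = record.get(f"{key}_TM", "")
            rows.append((int(m.group(1)), start, length, tm, seq))
    return sorted(rows, key=lambda row: row[1])
```

Joining each row with tabs gives a table that can be sorted and browsed, with the primers already ordered by position. Calling primer_rows with side="RIGHT" gives the reverse primers in the same layout.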

mrtrans

DONE
~~It may be time to replace mrtrans with something better~~

mrtrans only uses universal genetic code
mrtrans has bizarre problems with both input and output

Possible replacements:

~~BioPython module Bio.codonalign seems to do this.~~

~~Done. pal2nal - Perl script; can read different genetic codes~~
~~pycogent - Python; doesn't seem to have a way to use alternative genetic codes~~
~~tranalign from EMBOSS (actually, a rewriting of mrtrans, but can read alternative genetic codes, among many other options); This is a good starting point. Justin has already installed EMBOSS 6.4.0 in /home/psgendb/local/install.~~
~~write a replacement in Java~~
~~just write a wrapper, similar to mrtrans.csh~~

Basic Genomics Tools

It should be possible to identify a set of basic genomics tools that are used by common 3rd party packages.

could save on disk space
could make it easier to get some things to work
exposes documentation for these tools to the documentation hierarchy, even if they are "hidden" many layers down in packages that include them.

Examples:

~~samtools~~
bamtools
~~trimmomatic~~
fastq
~~jellyfish~~

Nucleotide (nt)	Installed	Install	Delete
Protein (nr)	Installed	Install	Delete
RefSeq RNA (refseq_rna)	Not installed	Install	Delete

BIRCH/Release To Do list
