[Biococoa-dev] reading macvector files

Davide Cittaro davide.cittaro at ifom-ieo-campus.it
Tue Mar 20 05:21:50 EDT 2007


Hi all, since we have people here who came with MacVector data, I
had to write something to read MacVector files and convert them to
an open format (I'm doing it in Python since we run on several OSes).
I downloaded BioCocoa, which lacks MV reading capabilities (or I
simply couldn't see them looking at BCSequenceReader.m). Unfortunately
I have very little time to collaborate; nevertheless I would like to
share how to read a MV file (or, at least, what I've found out...), so
if you want you can include MV reading capabilities in the BioCocoa code.
What I'm writing is valid for nucleic acid sequences; I haven't
looked into AA sequence files yet.

A hexdump of a MV file is something like:

00000000  00 00 00 01 01 01 43 80  01 06 07 60 00 00 00 00  |......C....`....|
00000010  00 00 00 01 00 00 11 09  00 00 00 01 00 00 11 09  |................|
00000020  00 00 11 09 08 08 02 08  02 01 08 04 08 08 08 04  |................|
00000030  01 02 01 04 02 08 08 01  08 02 01 08 02 04 01 08  |................|
00000040  01 01 04 02 08 08 08 01  01 08 04 02 04 04 08 01  |................|

The first line of the dump contains a header, but it seems to change
between files and I still have to understand why. Only the first 7
bytes are fairly conserved.
At byte offset 32 (0x20) there is a 4-byte integer holding the
sequence length (00 00 11 09 = 4361 in the dump above). Be careful:
MV saves files with PPC endianness (big endian), so if you want to
build a universal binary you should use some Foundation class that
handles this (I can't recall the name...). After that you are ready
to read the sequence itself, which runs from byte 36 up to (but not
including) byte 36+length. I've found that each byte is one
nucleotide, with this encoding:

0x00 => -
0x01 => A
0x02 => C
0x03 => M
0x04 => G
0x05 => R
0x06 => S
0x07 => V
0x08 => T
0x09 => W
0x0a => Y
0x0b => H
0x0c => K
0x0d => D
0x0e => B
0x0f => N
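
A side note on the table: the values are consistent with a bit mask
over the four bases (A = 0x01, C = 0x02, G = 0x04, T = 0x08), where
each ambiguity code is the OR of the bases it stands for in the
standard IUPAC nomenclature. That is only an inference from the table,
not something taken from MacVector documentation. A minimal Python
sketch that rebuilds the table under that assumption (all names here
are made up for illustration):

```python
# Assumed bit flags, inferred from the table: one bit per base.
BASE_BITS = {"A": 0x01, "C": 0x02, "G": 0x04, "T": 0x08}

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def build_code_table():
    """Return a dict mapping byte value -> symbol; 0x00 is the gap '-'."""
    table = {0x00: "-"}
    for symbol, bases in IUPAC.items():
        value = 0
        for base in bases:
            value |= BASE_BITS[base]  # OR together the flags of each base
        table[value] = symbol
    return table


CODE_TABLE = build_code_table()
```

If the assumption holds, CODE_TABLE reproduces the 16 entries listed
above, so a reader does not need to hard-code them.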

So, in the example above, the sequence starts TTCTCATGTTTGACAGCTTAT....
This information is enough to read at least the sequence.
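
The steps described above can be sketched in Python roughly as
follows. This is not the converter mentioned at the top of this mail,
just a minimal illustration that assumes the layout is exactly as
described (length at offset 32, sequence from offset 36, one byte per
nucleotide); the function and constant names are hypothetical.

```python
import struct

# Byte value -> symbol, indexed by position (0x00 is '-', 0x0f is 'N').
SYMBOLS = "-ACMGRSVTWYHKDBN"

LENGTH_OFFSET = 32    # 4-byte big-endian sequence length
SEQUENCE_OFFSET = 36  # first nucleotide byte


def decode_mv_sequence(data):
    """Decode the nucleotide sequence from the raw bytes of a MV file."""
    if len(data) < SEQUENCE_OFFSET:
        raise ValueError("data too short to contain a sequence length")
    # '>I' is a big-endian unsigned 32-bit integer, whatever the host CPU.
    (length,) = struct.unpack_from(">I", data, LENGTH_OFFSET)
    raw = bytearray(data[SEQUENCE_OFFSET:SEQUENCE_OFFSET + length])
    if len(raw) < length:
        raise ValueError("data ends before the declared sequence length")
    try:
        return "".join(SYMBOLS[b] for b in raw)
    except IndexError:
        raise ValueError("byte value above 0x0f found in the sequence")


def read_mv_file(path):
    """Read a MV nucleic acid file from disk and return its sequence."""
    with open(path, "rb") as handle:
        return decode_mv_sequence(handle.read())
```

Using struct with an explicit ">" format keeps the result the same on
PPC and Intel machines, which is the endianness concern raised above.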
Immediately after the sequence there is the "features" section. If you
are interested I will post another mail about that, even though I
still haven't completely understood it.

Cheers

d

/*
Davide Cittaro
HPC and Bioinformatics Systems @ Informatics Core

IFOM - Istituto FIRC di Oncologia Molecolare
via adamello, 16
20139 Milano
Italy

tel.: +39(02)574303007
e-mail: davide.cittaro at ifom-ieo-campus.it
*/

