Generic Feature Format

From Bioinformatics.Org Wiki

Revision as of 01:56, 7 May 2010 by Jeff

The Generic Feature Format (GFF) is a data format for describing the features of a sequence. Unlike GenBank and XML documents, GFF presents feature data as a tab-delimited table with one feature per line. This makes it well suited to the text-manipulation and data-analysis tools that work with tabular data, such as spreadsheets and various Unix commands.

GFF Version 2

In 2000, the Wellcome Trust Sanger Institute published the specification for GFF Version 2, with the following explanation:

The main change from Version 1 to Version 2 is the requirement for a tag-value type structure (essentially semicolon-separated .ace format) for any additional material on the line, following the mandatory fields. Version 2 also allows '.' as a score, for features for which there is no score.

Example from the Sanger specification:

SEQ1    EMBL    atg     103     105     .       +       0
SEQ1    EMBL    exon    103     172     .       +       0
SEQ1    EMBL    splice5 172     173     .       +       .
SEQ1    netgene splice5 172     173     0.94    +       .
SEQ1    genie   sp5-20  163     182     2.3     +       .
SEQ1    genie   sp5-10  168     177     2.1     +       .
SEQ2    grail   ATG     17      19      2.1     -       0
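
As an illustration of how such lines might be read, here is a minimal Python sketch. It assumes the fields are separated by tabs (the example above is rendered with spaces on this page), and the field names seqname, source, feature, start, end, score, strand and frame are my reading of the Sanger specification rather than anything stated here. The Feature type and the parse_gff2_line function are hypothetical names chosen for this example. A '.' score is mapped to None, and any material after the eight mandatory fields is kept as raw text.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    # Field names are assumed from the Sanger GFF2 specification.
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the line gives '.' (no score)
    strand: str
    frame: str
    attributes: str  # raw tag-value text after the mandatory fields, "" if absent


def parse_gff2_line(line: str) -> Feature:
    """Split one tab-delimited GFF2 line into its eight mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    return Feature(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame,
        "\t".join(fields[8:]),
    )


# Two lines from the example above, rebuilt with explicit tab separators.
lines = [
    "SEQ1\tEMBL\tsplice5\t172\t173\t.\t+\t.",
    "SEQ1\tnetgene\tsplice5\t172\t173\t0.94\t+\t.",
]
features = [parse_gff2_line(line) for line in lines]
```

Keeping the trailing material as an unparsed string is a deliberate simplification: the tag-value structure of Version 2 would need its own parser.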

Forks

Forks of GFF, such as GTF, were created to address the specific needs of some projects.

GFF Version 3

In 2006, Lincoln Stein wrote the specification for GFF Version 3, with the following explanation:

Although there are many richer ways of representing genomic features via XML, the stubborn persistence of a variety of ad-hoc tab-delimited flat file formats declares the bioinformatics community's need for a simple format that can be modified with a text editor and processed with shell tools like grep. The GFF format, although widely used, has fragmented into multiple incompatible dialects. When asked why they have modified the published Sanger specification, bioinformaticists frequently answer that the format was insufficient for their needs, and they needed to extend it. The proposed GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats. The new format:
  1. adds a mechanism for representing more than one level of hierarchical grouping of features and subfeatures
  2. separates the ideas of group membership and feature name/id
  3. constrains the feature type field to be taken from a controlled vocabulary
  4. allows a single feature, such as an exon, to belong to more than one group at a time
  5. provides an explicit convention for pairwise alignments
  6. provides an explicit convention for features that occupy disjunct regions

Example from the Sequence Ontology Project specification:

##gff-version 3
##sequence-region ctg123 1 1497228
ctg123 . gene            1000  9000  .  +  .  ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000  1012  .  +  .  Parent=gene00001
ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001
ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00002;Parent=gene00001
ctg123 . mRNA            1300  9000  .  +  .  ID=mRNA00003;Parent=gene00001
ctg123 . exon            1300  1500  .  +  .  Parent=mRNA00003
ctg123 . exon            1050  1500  .  +  .  Parent=mRNA00001,mRNA00002
ctg123 . exon            3000  3902  .  +  .  Parent=mRNA00001,mRNA00003
ctg123 . exon            5000  5500  .  +  .  Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . exon            7000  9000  .  +  .  Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . CDS             1201  1500  .  +  0  ID=cds00001;Parent=mRNA00001
ctg123 . CDS             3000  3902  .  +  0  ID=cds00001;Parent=mRNA00001
ctg123 . CDS             5000  5500  .  +  0  ID=cds00001;Parent=mRNA00001
ctg123 . CDS             7000  7600  .  +  0  ID=cds00001;Parent=mRNA00001
ctg123 . CDS             1201  1500  .  +  0  ID=cds00002;Parent=mRNA00002
ctg123 . CDS             5000  5500  .  +  0  ID=cds00002;Parent=mRNA00002
ctg123 . CDS             7000  7600  .  +  0  ID=cds00002;Parent=mRNA00002
ctg123 . CDS             3301  3902  .  +  0  ID=cds00003;Parent=mRNA00003
ctg123 . CDS             5000  5500  .  +  1  ID=cds00003;Parent=mRNA00003
ctg123 . CDS             7000  7600  .  +  2  ID=cds00003;Parent=mRNA00003
ctg123 . CDS             3391  3902  .  +  0  ID=cds00004;Parent=mRNA00003
ctg123 . CDS             5000  5500  .  +  1  ID=cds00004;Parent=mRNA00003
ctg123 . CDS             7000  7600  .  +  2  ID=cds00004;Parent=mRNA00003
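
The example above shows two of the listed changes at work: an exon with several comma-separated Parent values belongs to more than one group, and several CDS lines sharing one ID describe a single feature that occupies disjunct regions. The following Python sketch shows one way this might be read. It assumes tab-delimited lines with the attributes in the ninth column, and it ignores the escaping rules of the specification. The function names data_rows, parse_attributes, children_by_parent and segments_by_id are hypothetical and chosen for this example only.

```python
from collections import defaultdict


def parse_attributes(column9: str) -> dict:
    """Parse 'tag=value;tag=v1,v2' into {tag: [values]} (escaping is ignored)."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attrs[tag] = value.split(",")
    return attrs


def data_rows(lines):
    """Yield the tab-separated columns of each line, skipping blanks and '#' lines."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield line.rstrip("\n").split("\t")


def children_by_parent(lines) -> dict:
    """Map each Parent ID to the (type, start, end) of the features that name it."""
    children = defaultdict(list)
    for cols in data_rows(lines):
        label = (cols[2], int(cols[3]), int(cols[4]))
        for parent in parse_attributes(cols[8]).get("Parent", []):
            children[parent].append(label)
    return dict(children)


def segments_by_id(lines) -> dict:
    """Collect the (start, end) of every line that shares the same ID."""
    segments = defaultdict(list)
    for cols in data_rows(lines):
        for feature_id in parse_attributes(cols[8]).get("ID", []):
            segments[feature_id].append((int(cols[3]), int(cols[4])))
    return dict(segments)


# A subset of the example above, rebuilt with explicit tab separators.
rows = [
    ("ctg123", ".", "gene", "1000", "9000", ".", "+", ".", "ID=gene00001;Name=EDEN"),
    ("ctg123", ".", "mRNA", "1050", "9000", ".", "+", ".", "ID=mRNA00001;Parent=gene00001"),
    ("ctg123", ".", "mRNA", "1300", "9000", ".", "+", ".", "ID=mRNA00003;Parent=gene00001"),
    ("ctg123", ".", "exon", "3000", "3902", ".", "+", ".", "Parent=mRNA00001,mRNA00003"),
    ("ctg123", ".", "CDS", "1201", "1500", ".", "+", "0", "ID=cds00001;Parent=mRNA00001"),
    ("ctg123", ".", "CDS", "3000", "3902", ".", "+", "0", "ID=cds00001;Parent=mRNA00001"),
]
lines = ["##gff-version 3"] + ["\t".join(row) for row in rows]

tree = children_by_parent(lines)
segments = segments_by_id(lines)
```

In this subset the exon at 3000 to 3902 is listed under both mRNA00001 and mRNA00003, and cds00001 collects two segments. Resolving children through a dictionary keyed by ID means the lines may appear in any order, at the cost of holding the whole mapping in memory.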

Validation

Software
