Format description of GENCODE GTF

A. TAB-separated standard GTF columns

column-number   content                                     values/format
1               chromosome name                             chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M} or GRC accession (a)
2               annotation source                           {ENSEMBL,HAVANA}
3               feature type                                {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
4               genomic start location                      integer-value (1-based)
5               genomic end location                        integer-value
6               score (not used)                            .
7               genomic strand                              {+,-}
8               genomic phase (for CDS features)            {0,1,2,.}
9               additional information as key-value pairs   see below

(a) Scaffold, patch and haplotype names correspond to their GRC accessions. Please note that these are different from the Ensembl names.
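
The column layout above can be read with a few lines of code. The following is a minimal sketch, not an official GENCODE parser: the names `GtfRecord`, `parse_gtf_line` and `feature_length` are made up for this illustration, and the length calculation assumes that the end coordinate is inclusive, as is usual for GTF files.

```python
from typing import NamedTuple, Optional


class GtfRecord(NamedTuple):
    """One GTF line split into its nine columns (hypothetical helper type)."""
    chrom: str
    source: str
    feature: str
    start: int             # 1-based
    end: int               # assumed inclusive
    score: str             # not used, always "."
    strand: str
    phase: Optional[int]   # None when the column holds "."
    attributes: str        # raw 9th column, parsed separately


def parse_gtf_line(line: str) -> GtfRecord:
    """Split a single TAB-separated GTF line into typed columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 TAB-separated columns, got {len(cols)}")
    chrom, source, feature, start, end, score, strand, phase, attributes = cols
    return GtfRecord(chrom, source, feature, int(start), int(end), score,
                     strand, None if phase == "." else int(phase), attributes)


def feature_length(rec: GtfRecord) -> int:
    """Length in bases for 1-based coordinates with an inclusive end."""
    return rec.end - rec.start + 1


# Shortened version of the CDS example line shown further down this page.
example = "\t".join(["chr19", "HAVANA", "CDS", "407099", "408361", ".", "-", "0",
                     'gene_id "ENSG00000183186.7"; level 2;'])
rec = parse_gtf_line(example)
print(rec.feature, rec.start, rec.end, feature_length(rec))  # CDS 407099 408361 1263
```

Keeping the 9th column as a raw string here is deliberate: its key-value pairs follow their own rules, described in section B.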

B. Key-value pairs in 9th column (format: key "value"; )

B.1. Mandatory fields

key name                  feature type(s)                             value format                                                release
gene_id                   all                                         ENSGXXXXXXXXXXX.X (b,c) with optional _X suffix (g)         all
transcript_id (d)         all except gene                             ENSTXXXXXXXXXXX.X (b,c) with optional _X suffix (g)         all
gene_type                 all                                         list of biotypes                                            all
gene_status (e)           all                                         {KNOWN, NOVEL, PUTATIVE}                                    until 25 and M11
gene_name                 all                                         string                                                      all
transcript_type (d)       all except gene                             list of biotypes                                            all
transcript_status (d,e)   all except gene                             {KNOWN, NOVEL, PUTATIVE}                                    until 25 and M11
transcript_name (d)       all except gene                             string                                                      all
exon_number (f)           all except gene/transcript/Selenocysteine   integer (exon position in the transcript from its 5' end)   all
exon_id (f)               all except gene/transcript/Selenocysteine   ENSEXXXXXXXXXXX.X (b) with optional _X suffix (g)           all
level                     all                                         1 (verified loci),                                          all
                                                                      2 (manually annotated loci),
                                                                      3 (automatically annotated loci)

(b) From version 7, the gene/transcript version number has been appended to gene and transcript ids (e.g. ENSG00000160087.16).

(c) Gene and transcript ids on the chrY PAR regions have "_PAR_Y" appended (from release 25), or are in the format ENSGRXXXXXXXXXX and ENSTRXXXXXXXXXX (until release 24) to avoid redundancy.

(d) Until releases 21 and M4, the gene lines included transcript attributes.

(e) The 'gene_status' and 'transcript_status' attributes were removed after releases 25 (human) and M11 (mouse).

(f) Except in gene and transcript lines.

(g) In the annotation mapped back to GRCh37, mapping versions are appended to the identifiers (e.g. ENSG00000228327.3_2).
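
The id conventions in footnotes b, c and g can be taken apart programmatically, for example when joining GENCODE ids against a resource that uses unversioned ids. The sketch below is an illustration under stated assumptions: `GencodeId` and `split_gencode_id` are hypothetical names, the regular expression is inferred from the examples on this page, and the "_PAR_Y" id used in the demo is a made-up placeholder.

```python
import re
from typing import NamedTuple, Optional


class GencodeId(NamedTuple):
    stable_id: str                  # e.g. ENSG00000160087
    version: Optional[int]          # number after ".", footnote b
    mapping_version: Optional[int]  # number after "_", footnote g (GRCh37 mapping)
    par_y: bool                     # True if "_PAR_Y" was present, footnote c


# Assumed shape: ENS + letters + digits, optional ".version", optional "_mapping".
_ID_RE = re.compile(r"^(ENS[A-Z]*\d+)(?:\.(\d+))?(?:_(\d+))?$")


def split_gencode_id(raw: str) -> GencodeId:
    """Split a gene/transcript/exon id into stable id, versions and PAR flag."""
    par_y = "_PAR_Y" in raw
    core = raw.replace("_PAR_Y", "")  # tolerate the suffix wherever it appears
    match = _ID_RE.match(core)
    if match is None:
        raise ValueError(f"unrecognised id format: {raw!r}")
    stable, version, mapping = match.groups()
    return GencodeId(stable,
                     int(version) if version else None,
                     int(mapping) if mapping else None,
                     par_y)


print(split_gencode_id("ENSG00000160087.16"))   # version 16, no mapping version
print(split_gencode_id("ENSG00000228327.3_2"))  # version 3, mapping version 2
# Placeholder id, only to illustrate the "_PAR_Y" suffix:
print(split_gencode_id("ENSG00000000000.1_PAR_Y"))
```

Note that dropping the "_PAR_Y" flag makes chrX and chrY PAR entries indistinguishable by id, so the flag is kept as a separate field rather than discarded.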

B.2. Optional fields

key name                    description                           occurrences   value format
tag                         part of a special set                 [*]           list of tags
ccdsid                      official CCDS id                      [*]           CCDS*
havana_gene                 gene id in the havana db              [0,1]         OTTHUMGXXXXXXXXXXX.X
havana_transcript           transcript id in the havana db        [0,1]         OTTHUMTXXXXXXXXXXX.X
protein_id                  -                                     [0,1]         ENSPXXXXXXXXXXX.X
ont                         pseudogene (or other) ontology ids    [*]           {PGO:0000004 and others}
transcript_support_level    support score, see below              [0,1]         {1,2,3,4,5,NA}
remap_* (six keys)          mapping attributes, see below         [0,1]         -
hgnc_id                     HGNC id in human                      [0,1]         HGNC:*
mgi_id                      MGI id in mouse                       [0,1]         MGI:*

transcript_support_level: transcripts are scored according to how well mRNA and EST alignments match over their full length:
  1 (all splice junctions of the transcript are supported by at least one non-suspect mRNA),
  2 (the best supporting mRNA is flagged as suspect or the support is from multiple ESTs),
  3 (the only support is from a single EST),
  4 (the best supporting EST is flagged as suspect),
  5 (no single transcript supports the model structure),
  NA (the transcript was not analyzed)

remap_*: mapping attributes, only for GRCh38 annotation lifted back to GRCh37:
  remap_status
  remap_original_id
  remap_original_location
  remap_num_mappings
  remap_target_status
  remap_substituted_missing_target

Number of occurrences: [*] - zero or multiple, [0,1] - zero or one
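
Because some keys may occur several times ([*]) and others at most once ([0,1]), it can be convenient to parse the 9th column into a mapping from key to a list of values. The following is a minimal sketch under that assumption; `parse_attributes` and `single` are hypothetical helper names, and the regular expression is inferred from the `key "value";` format described above.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# One attribute: key, whitespace, then a quoted or an unquoted value, then ";".
_ATTR_RE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_attributes(column9: str) -> Dict[str, List[str]]:
    """Parse the 9th GTF column into {key: [value, ...]}, keeping file order."""
    attrs: Dict[str, List[str]] = defaultdict(list)
    for match in _ATTR_RE.finditer(column9):
        key, quoted, bare = match.groups()
        attrs[key].append(quoted if quoted is not None else bare)
    return dict(attrs)


def single(attrs: Dict[str, List[str]], key: str) -> Optional[str]:
    """Return the value of a [0,1] key, or None if the key is absent."""
    values = attrs.get(key, [])
    if len(values) > 1:
        raise ValueError(f"{key} occurs {len(values)} times, expected at most one")
    return values[0] if values else None


# 9th column of the transcript example line shown below on this page.
column9 = ('gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; '
           'gene_type "protein_coding"; gene_name "C2CD4C"; '
           'transcript_type "protein_coding"; transcript_name "C2CD4C-001"; '
           'level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; '
           'tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; '
           'havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";')
attrs = parse_attributes(column9)
print(attrs["tag"])                   # ['basic', 'appris_principal_1', 'CCDS']
print(single(attrs, "havana_gene"))   # OTTHUMG00000180534.3
print(single(attrs, "hgnc_id"))       # None
```

All values are kept as strings, including `level`, so that quoted and unquoted numbers are handled the same way.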

Example GTF lines:

chr19   HAVANA   gene   405438   409170   .   -   .   gene_id "ENSG00000183186.7"; gene_type "protein_coding"; gene_name "C2CD4C"; level 2; havana_gene "OTTHUMG00000180534.3";
chr19   HAVANA   transcript   405438   409170   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   exon   409006   409170   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   exon   405438   408401   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   CDS   407099   408361   .   -   0   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   start_codon   408359   408361   .   -   0   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   stop_codon   407096   407098   .   -   0   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   UTR   409006   409170   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   UTR   405438   407098   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   UTR   408362   408401   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
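
Reading the coordinates of the example lines above, the CDS and UTR features of exon 2 appear to split the exon without gaps or overlaps, and the stop codon falls inside a UTR interval rather than the CDS. The short sketch below only checks that arithmetic for these example lines; `length` and `tiles` are hypothetical helper names, and coordinates are assumed to be 1-based with an inclusive end.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based, both inclusive


def length(interval: Interval) -> int:
    """Number of bases covered by an inclusive interval."""
    return interval[1] - interval[0] + 1


def tiles(parent: Interval, parts: List[Interval]) -> bool:
    """True if the parts cover the parent exactly, with no gaps or overlaps."""
    position = parent[0]
    for start, end in sorted(parts):
        if start != position or end < start:
            return False
        position = end + 1
    return position == parent[1] + 1


# Coordinates copied from the example lines (transcript ENST00000332235.7, exon 2).
exon2 = (405438, 408401)
cds = (407099, 408361)
utr_low = (405438, 407098)
utr_high = (408362, 408401)
stop_codon = (407096, 407098)

print(tiles(exon2, [cds, utr_low, utr_high]))  # True
print(length(cds), length(cds) % 3)            # 1263 0
print(utr_low[0] <= stop_codon[0] and stop_codon[1] <= utr_low[1])  # True
```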

Examples for fetching specific parts from the file [Unix command line]:

  1. Get all "gene" lines:
    awk '{if($3=="gene"){print $0}}' gencode.gtf
    
  2. Get all "protein-coding transcript" lines:
    awk '{if($3=="transcript" && $0~"transcript_type \"protein_coding\""){print $0}}' gencode.gtf
    
  3. Get level 1 & 2 annotation (manually annotated) only:
    awk '{if($0~"level (1|2);"){print $0}}' gencode.gtf
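
The three filters above can also be written in a scripting language when they need to be combined with further processing. This is a minimal sketch, not a replacement for a full parser: the function names are made up for this illustration, the matching is done on the raw 9th column, and the last demo line is invented purely to have a level 3 entry.

```python
import re
from typing import Iterable, Iterator, List


def _columns(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the nine TAB-separated columns of every non-comment line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            yield cols


def gene_lines(lines: Iterable[str]) -> List[str]:
    """1. All "gene" lines."""
    return ["\t".join(c) for c in _columns(lines) if c[2] == "gene"]


def protein_coding_transcripts(lines: Iterable[str]) -> List[str]:
    """2. All "transcript" lines whose transcript_type is protein_coding."""
    needle = 'transcript_type "protein_coding";'
    return ["\t".join(c) for c in _columns(lines)
            if c[2] == "transcript" and needle in c[8]]


_LEVEL_1_OR_2 = re.compile(r"(?:^|\s)level (?:1|2);")


def level_1_and_2(lines: Iterable[str]) -> List[str]:
    """3. Lines with level 1 or 2."""
    return ["\t".join(c) for c in _columns(lines) if _LEVEL_1_OR_2.search(c[8])]


demo = [
    "##toy header line",
    "\t".join(["chr19", "HAVANA", "gene", "405438", "409170", ".", "-", ".",
               'gene_id "ENSG00000183186.7"; gene_type "protein_coding"; '
               'gene_name "C2CD4C"; level 2;']),
    "\t".join(["chr19", "HAVANA", "transcript", "405438", "409170", ".", "-", ".",
               'gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; '
               'gene_type "protein_coding"; gene_name "C2CD4C"; '
               'transcript_type "protein_coding"; level 2;']),
    # Invented line, for illustration only:
    "\t".join(["chr1", "ENSEMBL", "gene", "1", "100", ".", "+", ".",
               'gene_id "ENSG00000000000.1"; gene_type "lncRNA"; '
               'gene_name "EXAMPLE"; level 3;']),
]
print(len(gene_lines(demo)), len(protein_coding_transcripts(demo)),
      len(level_1_and_2(demo)))  # 2 1 2
```

Matching on the attribute text rather than on a fixed field position keeps the filters working across releases in which attributes were added or removed.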
    

Example for parsing the file [Perl]:

#!/usr/bin/perl

use strict;
use warnings;

my $gencode_file = "gencode.v23.annotation.gtf";
open(my $in, "<", $gencode_file) or die "Can't open $gencode_file: $!\n";
my %all_genes;
while(<$in>){
  next if(/^#/); #ignore header and comment lines
  chomp;
  my %attribs = ();
  my ($chr, $source, $type, $start, $end, $score,
    $strand, $phase, $attributes) = split("\t");
  next unless defined($attributes); #skip lines without nine columns
  #store nine columns in hash
  my %fields = (
    chr        => $chr,
    source     => $source,
    type       => $type,
    start      => $start,
    end        => $end,
    score      => $score,
    strand     => $strand,
    phase      => $phase,
    attributes => $attributes,
  );
  my @add_attributes = split(";", $attributes);
  # store ids and additional information in second hash;
  # each key maps to an array because keys such as 'tag' can occur several times
  foreach my $attr ( @add_attributes ) {
     # key is the first word, value is the rest of the pair
     next unless $attr =~ /^\s*(\S+)\s+(.+?)\s*$/;
     my $c_type  = $1;
     my $c_value = $2;
     $c_value =~ s/\"//g;
     if(length($c_value)){
       if(!exists($attribs{$c_type})){
         $attribs{$c_type} = [];
       }
       push(@{ $attribs{$c_type} }, $c_value);
     }
  }
  #work with the information from the two hashes...
  #eg. store them in a hash of arrays by gene_id:
  my $gene_id = $attribs{'gene_id'}->[0];
  next unless defined($gene_id);
  if(!exists($all_genes{$gene_id})){
    $all_genes{$gene_id} = [];
  }
  push(@{ $all_genes{$gene_id} }, \%fields);
}
close($in);
print "Example entry ENSG00000183186.7: ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"type"}.", ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"chr"}." ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"start"}."-".
  $all_genes{"ENSG00000183186.7"}->[0]->{"end"}."\n";

GTF Parsers in Other Programming Languages

Several programming languages already have GTF parsers provided by third-party libraries. Some of these are listed below; they should be used in preference to writing your own parser.

Java:   BioJava (JavaDoc)
Perl:   BioPerl (metacpan)
Python: gtfparse, gffutils
Ruby:   germ

For further questions, please contact our helpdesk.