Format description of GENCODE GTF

A. TAB-separated standard GTF columns

column-number	content	values/format
1	chromosome name	chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M} or GRC accession ^a
2	annotation source	{ENSEMBL,HAVANA}
3	feature type	{gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
4	genomic start location	integer-value (1-based)
5	genomic end location	integer-value
6	score (not used)	.
7	genomic strand	{+,-}
8	genomic phase (for CDS features)	{0,1,2,.}
9	additional information as key-value pairs	see below

^a Scaffold, patch and haplotype names correspond to their GRC accessions. Note that these differ from the Ensembl names.
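
As an illustration of the column layout in section A, the following Python sketch splits one tab-separated line into named fields and computes the feature length. The names `GtfRecord` and `parse_gtf_line` are hypothetical and not part of any GENCODE tool. The sample line is shortened from the examples on this page, with tabs between the columns.

```python
from typing import NamedTuple, Optional


class GtfRecord(NamedTuple):
    """One GTF line; fields follow the column table above (hypothetical type)."""
    chrom: str
    source: str
    feature: str
    start: int              # 1-based, inclusive
    end: int                # inclusive
    score: str              # not used, always "."
    strand: str
    phase: Optional[int]    # 0, 1 or 2; None when the column holds "."
    attributes: str         # raw key-value string from column 9

    @property
    def length(self) -> int:
        # Both coordinates are inclusive, hence the +1.
        return self.end - self.start + 1


def parse_gtf_line(line: str) -> GtfRecord:
    """Split one data line into its nine columns (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    chrom, source, feature, start, end, score, strand, phase, attributes = cols
    return GtfRecord(chrom, source, feature, int(start), int(end), score, strand,
                     None if phase == "." else int(phase), attributes)


# Shortened version of the CDS example line on this page.
line = "\t".join(["chr19", "HAVANA", "CDS", "407099", "408361", ".", "-", "0",
                  'gene_id "ENSG00000183186.7"; gene_name "C2CD4C"; level 2;'])
record = parse_gtf_line(line)
print(record.feature, record.start, record.end, record.length)
```

Keeping column 9 as a raw string here is a deliberate choice: the key-value pairs need their own parsing step because some keys occur more than once.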

B. Key-value pairs in 9th column (format: key "value"; )

B.1. Mandatory fields

key name	feature type(s)	value format	release
gene_id	all	ENSGXXXXXXXXXXX.X ^b,c _X^g	all
transcript_id ^d	all except gene	ENSTXXXXXXXXXXX.X ^b,c _X^g	all
gene_type	all	list of biotypes	all
gene_status ^e	all	{KNOWN, NOVEL, PUTATIVE}	until 25 and M11
gene_name	all	string	all
transcript_type ^d	all except gene	list of biotypes	all
transcript_status ^d,e	all except gene	{KNOWN, NOVEL, PUTATIVE}	until 25 and M11
transcript_name ^d	all except gene	string	all
exon_number ^f	all except gene/transcript/Selenocysteine	integer (exon position in the transcript from its 5' end)	all
exon_id ^f	all except gene/transcript/Selenocysteine	ENSEXXXXXXXXXXX.X ^b _X^g	all
level	all	1 (verified loci), 2 (manually annotated loci), 3 (automatically annotated loci)	all

^b From release 7 onwards, the version number is appended to gene and transcript ids (e.g. ENSG00000160087.16).

^c Gene and transcript ids on the chrY PAR regions have "_PAR_Y" appended (from release 25), or are in the format ENSGRXXXXXXXXXX and ENSTRXXXXXXXXXX (until release 24), to avoid redundancy.

^d Until releases 21 and M4, the gene lines included transcript attributes.

^e The 'gene_status' and 'transcript_status' attributes were removed after releases 25 (human) and M11 (mouse).

^f Except in gene and transcript lines.

^g In the annotation mapped back to GRCh37, mapping versions are appended to the identifiers (e.g. ENSG00000228327.3_2).
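
The identifier forms described in notes b, c and g can be taken apart with a small regular expression. The Python sketch below is one possible approach, not an official parser: `split_gencode_id` and `GencodeId` are hypothetical names, and the relative order of a mapping version and the "_PAR_Y" suffix on the same identifier is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Stable id, optional ".version", optional "_mapping" (GRCh37 lift-over),
# optional "_PAR_Y" suffix. The relative order of the last two is an assumption.
ID_PATTERN = re.compile(
    r"^(?P<stable>[A-Za-z]+\d+)"
    r"(?:\.(?P<version>\d+))?"
    r"(?:_(?P<mapping>\d+))?"
    r"(?P<par_y>_PAR_Y)?$"
)


class GencodeId(NamedTuple):
    stable_id: str
    version: Optional[int]
    mapping_version: Optional[int]
    par_y: bool


def split_gencode_id(identifier: str) -> GencodeId:
    """Split an identifier into its documented parts (hypothetical helper)."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    version = match.group("version")
    mapping = match.group("mapping")
    return GencodeId(
        stable_id=match.group("stable"),
        version=int(version) if version is not None else None,
        mapping_version=int(mapping) if mapping is not None else None,
        par_y=match.group("par_y") is not None,
    )


# Both identifiers are the examples given in the notes above.
for example in ("ENSG00000160087.16", "ENSG00000228327.3_2"):
    print(example, "->", split_gencode_id(example))
```

Comparing on `stable_id` alone is useful when joining annotation from different releases, where the version suffix may differ.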

B.2. Optional fields

key name	value format
tag	part of a special set [*]: list of tags
ccdsid	official CCDS id []; CCDS
havana_gene	gene id in the havana db [0,1]; OTTHUMGXXXXXXXXXXX.X
havana_transcript	transcript id in the havana db [0,1]; OTTHUMTXXXXXXXXXXX.X
protein_id	ENSPXXXXXXXXXXX.X [0,1]
ont	pseudogene (or other) ontology ids [*]; {PGO:0000004 and others}
transcript_support_level	{1,2,3,4,5,NA} [0,1]; transcripts are scored according to how well mRNA and EST alignments match over their full length: 1 (all splice junctions of the transcript are supported by at least one non-suspect mRNA), 2 (the best supporting mRNA is flagged as suspect or the support is from multiple ESTs), 3 (the only support is from a single EST), 4 (the best supporting EST is flagged as suspect), 5 (no single transcript supports the model structure), NA (the transcript was not analyzed)
remap_status, remap_original_id, remap_original_location, remap_num_mappings, remap_target_status, remap_substituted_missing_target	Mapping attributes [0,1] - only for GRCh38 annotation lifted back to GRCh37.
hgnc_id	HGNC id in human [0,1]; HGNC:*
mgi_id	MGI id in mouse [0,1]; MGI:*

Number of occurrences: [*] - zero or multiple, [0,1] - zero or one
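
Because keys marked [*] (such as tag) can occur several times on one line, a parser for the 9th column should keep a list of values per key. The Python sketch below shows one way to do this. `parse_attributes` is a hypothetical helper, and treating support levels 1 and 2 as "well supported" is an arbitrary threshold chosen only for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List

# A key, then either a quoted value or a bare token (e.g. level 2), then ";".
ATTRIBUTE_PATTERN = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_attributes(column9: str) -> Dict[str, List[str]]:
    """Map every key to the list of its values, in file order (hypothetical helper)."""
    attributes: Dict[str, List[str]] = defaultdict(list)
    for key, quoted, bare in ATTRIBUTE_PATTERN.findall(column9):
        attributes[key].append(quoted if quoted else bare)
    return dict(attributes)


# Attribute string shortened from the transcript example line on this page.
column9 = ('gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; '
           'level 2; transcript_support_level "2"; tag "basic"; '
           'tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1";')
attrs = parse_attributes(column9)
print(attrs["tag"])      # every tag value, in file order
print(attrs["level"])    # single-valued keys still come back as a list

# transcript_support_level occurs at most once; "NA" means not analyzed.
tsl = attrs.get("transcript_support_level", ["NA"])[0]
well_supported = tsl in {"1", "2"}   # illustrative threshold, not a GENCODE rule
print(tsl, well_supported)
```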

Example GTF lines:

chr19   HAVANA   gene   405438   409170   .   -   .   gene_id "ENSG00000183186.7"; gene_type "protein_coding"; gene_name "C2CD4C"; level 2; havana_gene "OTTHUMG00000180534.3";
chr19   HAVANA   transcript   405438   409170   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   exon   409006   409170   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   exon   405438   408401   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   CDS   407099   408361   .   -   0   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   start_codon   408359   408361   .   -   0   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   stop_codon   407096   407098   .   -   0   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   UTR   409006   409170   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   UTR   405438   407098   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   UTR   408362   408401   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
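
The coordinates in these example lines can be cross-checked with a little arithmetic. In this example the second exon is tiled exactly by its two UTR segments and the CDS, and the stop codon lies inside the 3' UTR segment rather than in the CDS. The Python sketch below copies the coordinates from the lines above; the variable names and the `length` helper are ours.

```python
# Coordinates copied from the example lines above (1-based, inclusive).
exon_2 = (405438, 408401)
utr_3prime = (405438, 407098)       # minus strand: lower coordinates are 3'
stop_codon = (407096, 407098)
cds = (407099, 408361)
start_codon = (408359, 408361)
utr_5prime_part = (408362, 408401)  # the part of the 5' UTR inside exon 2


def length(interval):
    """Number of bases in an inclusive interval."""
    start, end = interval
    return end - start + 1


# The three pieces are adjacent and together cover the exon.
assert utr_3prime[1] + 1 == cds[0]
assert cds[1] + 1 == utr_5prime_part[0]
assert length(utr_3prime) + length(cds) + length(utr_5prime_part) == length(exon_2)

# The CDS is a whole number of codons and excludes the stop codon, which
# sits at the CDS-adjacent edge of the 3' UTR segment in this example.
assert length(cds) % 3 == 0
assert stop_codon[1] == utr_3prime[1]

# On the minus strand the start codon is the highest three bases of the CDS.
assert start_codon[1] == cds[1]

print(length(exon_2), length(cds), length(cds) // 3)
```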

Examples for fetching specific parts from the file [Unix command line]:

Get all "gene" lines:

awk '{if($3=="gene"){print $0}}' gencode.gtf

Get all "protein-coding transcript" lines:

awk -F'\t' '$3=="transcript" && $9 ~ /transcript_type "protein_coding";/' gencode.gtf

Get level 1 & 2 annotation (manually annotated) only:

awk '{if($0~"level (1|2);"){print $0}}' gencode.gtf
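
For scripts that need the same filters outside the shell, a Python sketch follows. It matches on column 3 and on the attribute text rather than on a whitespace field position, so it does not depend on the order of the attributes. The function name `filter_gtf` is hypothetical, and the sample input (including its header line) is made up for illustration from the example lines on this page.

```python
import io
import re
from typing import Iterable, Iterator, Optional, Set


def filter_gtf(lines: Iterable[str],
               feature: Optional[str] = None,
               transcript_type: Optional[str] = None,
               levels: Optional[Set[int]] = None) -> Iterator[str]:
    """Yield GTF lines passing every filter that is set (None means no filter)."""
    for line in lines:
        if line.startswith("#"):            # header lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attributes = cols[8]
        if feature is not None and cols[2] != feature:
            continue
        if (transcript_type is not None
                and f'transcript_type "{transcript_type}";' not in attributes):
            continue
        if levels is not None:
            # Require whitespace before "level" so transcript_support_level is ignored.
            match = re.search(r"(?:^|\s)level (\d+);", attributes)
            if match is None or int(match.group(1)) not in levels:
                continue
        yield line


# Illustrative input: one header line plus two shortened example lines.
sample = io.StringIO(
    "##description: illustrative header line\n"
    'chr19\tHAVANA\tgene\t405438\t409170\t.\t-\t.\tgene_id "ENSG00000183186.7"; '
    'gene_type "protein_coding"; gene_name "C2CD4C"; level 2;\n'
    'chr19\tHAVANA\ttranscript\t405438\t409170\t.\t-\t.\tgene_id "ENSG00000183186.7"; '
    'transcript_id "ENST00000332235.7"; gene_type "protein_coding"; '
    'transcript_type "protein_coding"; level 2; transcript_support_level "2";\n'
)
selected = list(filter_gtf(sample, feature="transcript",
                           transcript_type="protein_coding", levels={1, 2}))
print(len(selected))
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` sample.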

Example for parsing the file [Perl]

#!/usr/bin/perl

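use strict;
use warnings;

my $gencode_file = "gencode.v23.annotation.gtf";
# three-argument open with a lexical filehandle; report the system error on failure
open(my $in, "<", $gencode_file) or die "Can't open $gencode_file: $!\n";
my %all_genes;
while(<$in>){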
  next if(/^##/); #ignore header
  chomp;
  my %attribs = ();
  my ($chr, $source, $type, $start, $end, $score,
    $strand, $phase, $attributes) = split("\t");
  #store nine columns in hash
  my %fields = (
    chr        => $chr,
    source     => $source,
    type       => $type,
    start      => $start,
    end        => $end,
    score      => $score,
    strand     => $strand,
    phase      => $phase,
    attributes => $attributes,
  );
  my @add_attributes = split(";", $attributes);
  # store ids and additional information in second hash
  foreach my $attr ( @add_attributes ) {
     next unless $attr =~ /^\s*(\S+)\s+(.+?)\s*$/; # key is the first token; the value may contain spaces
     my $c_type  = $1;
     my $c_value = $2;
     $c_value =~ s/\"//g;
     if($c_type  && $c_value){
       if(!exists($attribs{$c_type})){
         $attribs{$c_type} = [];
       }
       push(@{ $attribs{$c_type} }, $c_value);
     }
  }
  #work with the information from the two hashes...
  #eg. store them in a hash of arrays by gene_id:
  if(!exists($all_genes{$attribs{'gene_id'}->[0]})){
    $all_genes{$attribs{'gene_id'}->[0]} = [];
  }
  push(@{ $all_genes{$attribs{'gene_id'}->[0]} }, \%fields);
}
print "Example entry ENSG00000183186.7: ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"type"}.", ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"chr"}." ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"start"}."-".
  $all_genes{"ENSG00000183186.7"}->[0]->{"end"}."\n";
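
For readers who prefer Python, the same idea (grouping records by gene_id) can be sketched as follows. This is a parallel illustration under our own assumptions, not a translation maintained by GENCODE: `group_by_gene` is a hypothetical name and the two sample lines are shortened from the examples on this page.

```python
import io
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)";')


def group_by_gene(handle):
    """Map each gene_id to the list of its records (hypothetical helper).

    Each record is a dict holding the first eight columns of the line.
    """
    genes = defaultdict(list)
    for line in handle:
        if line.startswith("##"):           # ignore header
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        match = GENE_ID.search(cols[8])
        if match is None:
            continue
        genes[match.group(1)].append({
            "chr": cols[0], "source": cols[1], "type": cols[2],
            "start": int(cols[3]), "end": int(cols[4]),
            "score": cols[5], "strand": cols[6], "phase": cols[7],
        })
    return genes


# Two shortened example lines; a real script would pass an open file handle.
sample = io.StringIO(
    'chr19\tHAVANA\tgene\t405438\t409170\t.\t-\t.\t'
    'gene_id "ENSG00000183186.7"; gene_name "C2CD4C"; level 2;\n'
    'chr19\tHAVANA\texon\t409006\t409170\t.\t-\t.\t'
    'gene_id "ENSG00000183186.7"; exon_number 1; level 2;\n'
)
genes = group_by_gene(sample)
first = genes["ENSG00000183186.7"][0]
print(f'{first["type"]}, {first["chr"]} {first["start"]}-{first["end"]}')
```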

GTF parsers in other programming languages

Several programming languages already have GTF parsers available in third-party libraries. We list some of these below; they should be used in preference to writing your own parser.

Java
: BioJava (JavaDoc)
Perl
: BioPerl (metacpan)
Python
: gtfparse, gffutils
Ruby
: germ

For further questions, please contact our helpdesk.
