GENCODE

Format description of GENCODE GTF

A. TAB-separated standard GTF columns

column-number  content                                    values/format
1              chromosome name                            chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M} or GRC accession (a)
2              annotation source                          {ENSEMBL,HAVANA}
3              feature type                               {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
4              genomic start location                     integer-value (1-based)
5              genomic end location                       integer-value
6              score (not used)                           .
7              genomic strand                             {+,-}
8              genomic phase (for CDS features)           {0,1,2,.}
9              additional information as key-value pairs  see below

(a) Scaffold, patch and haplotype names correspond to their GRC accessions. Please note that these are different from the Ensembl names.
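
Example for reading the nine columns [Python]:

The column layout above can be sketched as a small record type. This is a minimal illustration, not part of the GENCODE tooling: the names GtfRecord and parse_gtf_line are hypothetical, and the length calculation assumes the usual GTF convention that the end coordinate is 1-based and inclusive.

```python
from typing import NamedTuple, Optional


class GtfRecord(NamedTuple):
    """One GTF line, split into its nine tab-separated columns."""
    chrom: str
    source: str
    feature: str
    start: int             # 1-based
    end: int               # assumed inclusive (usual GTF convention)
    score: str             # not used by GENCODE, always "."
    strand: str
    phase: Optional[int]   # None when the column holds "."
    attributes: str        # raw 9th column, parsed separately

    @property
    def length(self) -> int:
        # Both ends are counted, so add 1 to the difference.
        return self.end - self.start + 1


def parse_gtf_line(line: str) -> GtfRecord:
    """Split one non-header GTF line into a GtfRecord."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    chrom, source, feature, start, end, score, strand, phase, attributes = cols
    return GtfRecord(
        chrom, source, feature, int(start), int(end), score, strand,
        None if phase == "." else int(phase), attributes,
    )
```

Header lines starting with "##" have to be skipped before calling parse_gtf_line, since they do not have nine columns.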

B. Key-value pairs in 9th column (format: key "value"; )

B.1. Mandatory fields

key name               value format
gene_id                ENSGXXXXXXXXXXX.X (b,c) _X (g)
transcript_id (d)      ENSTXXXXXXXXXXX.X (b,c) _X (g)
gene_type              list of biotypes
gene_status (e)        {KNOWN, NOVEL, PUTATIVE}
gene_name              string
transcript_type (d)    list of biotypes
transcript_status (d,e)  {KNOWN, NOVEL, PUTATIVE}
transcript_name (d)    string
exon_number (f)        indicates the biological position of the exon in the transcript
exon_id (f)            ENSEXXXXXXXXXXX.X (b) _X (g)
level                  1 (verified loci),
                       2 (manually annotated loci),
                       3 (automatically annotated loci)

(b) From version 7 the gene/transcript version number was appended to gene and transcript ids (e.g. ENSG00000160087.16).
(c) Gene and transcript ids on the chrY PAR regions have "_PAR_Y" appended (from release 25), or are in the format ENSGRXXXXXXXXXX and ENSTRXXXXXXXXXX (until release 24) to avoid redundancy.
(d) Until releases 21 and M4, transcript attributes were included in the gene lines.
(e) The 'gene_status' and 'transcript_status' attributes were removed after releases 25 (human) and M11 (mouse).
(f) Except in gene and transcript lines.
(g) In the annotation mapped back to GRCh37, mapping versions are appended to the identifiers (e.g. ENSG00000228327.3_2).
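
Example for splitting an identifier into its parts [Python]:

The identifier notes above (version number, "_PAR_Y" suffix, GRCh37 mapping version) can be unpacked with one regular expression. This is a sketch under assumptions: the names GencodeId and split_gencode_id are hypothetical, and the pattern assumes that when a mapping version and "_PAR_Y" both occur, the mapping version comes first. Identifiers without a version number (before version 7) are not handled.

```python
import re
from typing import NamedTuple, Optional

_ID_RE = re.compile(
    r"^(?P<stable>[A-Z]+\d+)"    # stable id, e.g. ENSG00000160087
    r"\.(?P<version>\d+)"        # version number appended from version 7
    r"(?:_(?P<mapping>\d+))?"    # mapping version (GRCh37-mapped annotation only)
    r"(?P<par>_PAR_Y)?$"         # chrY PAR suffix (from release 25)
)


class GencodeId(NamedTuple):
    stable: str
    version: int
    mapping_version: Optional[int]
    par_y: bool


def split_gencode_id(identifier: str) -> GencodeId:
    """Split a versioned gene/transcript/exon id into its components."""
    m = _ID_RE.match(identifier)
    if m is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    mapping = m.group("mapping")
    return GencodeId(
        stable=m.group("stable"),
        version=int(m.group("version")),
        mapping_version=int(mapping) if mapping is not None else None,
        par_y=m.group("par") is not None,
    )


print(split_gencode_id("ENSG00000228327.3_2"))
```

Comparing on the stable part only is a common way to match identifiers between releases, since the version number changes whenever the underlying model changes.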

B.2. Optional fields

key name                  value format
tag                       part of a special set [*]: list of tags
ccdsid                    official CCDS id [*]; CCDS*
havana_gene               gene-id in the havana db [0,1]; OTTHUMGXXXXXXXXXXX.X
havana_transcript         transcript-id in the havana db [0,1]; OTTHUMTXXXXXXXXXXX.X
protein_id                ENSPXXXXXXXXXXX.X [0,1]
ont                       pseudogene (or other) ontology ids [*]; {PGO:0000004 and others}
transcript_support_level  transcripts are scored according to how well mRNA and EST alignments match over their full length [0,1]:
                          1 (all splice junctions of the transcript are supported by at least one non-suspect mRNA),
                          2 (the best supporting mRNA is flagged as suspect or the support is from multiple ESTs),
                          3 (the only support is from a single EST),
                          4 (the best supporting EST is flagged as suspect),
                          5 (no single transcript supports the model structure),
                          NA (the transcript was not analyzed)
remap_status
remap_original_id
remap_original_location
remap_num_mappings
remap_target_status
remap_substituted_missing_target
                          (all remap_* keys) Only for annotation lifted back to GRCh37.
                          Please see details here

Number of occurrences: [*] - zero or multiple, [0,1] - zero or one
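
Example for parsing the 9th column with repeated keys [Python]:

Because keys marked [*] (such as tag, ccdsid and ont) may occur several times on one line, a parser has to keep every value per key rather than only the last one. The following is a minimal sketch, assuming that no attribute value contains a semicolon or a double quote; parse_attributes and support_level are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def parse_attributes(column9: str) -> Dict[str, List[str]]:
    """Parse 'key "value"; key value; ...' into a dict of value lists."""
    attrs: Dict[str, List[str]] = defaultdict(list)
    for item in column9.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final ';'
        key, _, value = item.partition(" ")
        # Values may be quoted ("basic") or bare (level 2).
        attrs[key].append(value.strip().strip('"'))
    return dict(attrs)


def support_level(attrs: Dict[str, List[str]]) -> Optional[int]:
    """Return transcript_support_level as an int, or None if absent or NA."""
    values = attrs.get("transcript_support_level", [])
    if not values or values[0] == "NA":
        return None
    return int(values[0])


example = ('gene_id "ENSG00000183186.6"; transcript_id "ENST00000332235.6"; '
           'level 2; tag "basic"; transcript_support_level "2"; '
           'tag "CCDS"; ccdsid "CCDS45890.1";')
attrs = parse_attributes(example)
print(attrs["tag"])          # ['basic', 'CCDS']
print(support_level(attrs))  # 2
```

Keeping a list for every key, including those that occur at most once, means callers never need to know in advance which keys are repeatable.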

Example GTF lines:

chr19   HAVANA   gene   405438   409139   .   -   .   gene_id "ENSG00000183186.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "C2CD4C";  level 2; havana_gene "OTTHUMG00000180534.2";
chr19   HAVANA   transcript   405438   409139   .   -   .   gene_id "ENSG00000183186.6"; transcript_id "ENST00000332235.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "C2CD4C-001"; level 2; protein_id "ENSP00000328677.4"; tag "basic"; transcript_support_level "2"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.2"; havana_transcript "OTTHUMT00000451789.2";
chr19   HAVANA   exon   409006   409139   .   -   .   gene_id "ENSG00000183186.6"; transcript_id "ENST00000332235.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.4"; level 2; protein_id "ENSP00000328677.4"; tag "basic"; transcript_support_level "2"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.2"; havana_transcript "OTTHUMT00000451789.2";
chr19   HAVANA   exon   405438   408401   .   -   .   gene_id "ENSG00000183186.6"; transcript_id "ENST00000332235.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; tag "basic"; transcript_support_level "2"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.2"; havana_transcript "OTTHUMT00000451789.2";
chr19   HAVANA   CDS   407099   408361   .   -   .   gene_id "ENSG00000183186.6"; transcript_id "ENST00000332235.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; tag "basic"; transcript_support_level "2"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.2"; havana_transcript "OTTHUMT00000451789.2";
chr19   HAVANA   start_codon   408359   408361   .   -   .   gene_id "ENSG00000183186.6"; transcript_id "ENST00000332235.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; tag "basic"; transcript_support_level "2"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.2"; havana_transcript "OTTHUMT00000451789.2";
chr19   HAVANA   stop_codon   407096   407098   .   -   .   gene_id "ENSG00000183186.6"; transcript_id "ENST00000332235.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; tag "basic"; transcript_support_level "2"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.2"; havana_transcript "OTTHUMT00000451789.2";
chr19   HAVANA   UTR   409006   409139   .   -   .   gene_id "ENSG00000183186.6"; transcript_id "ENST00000332235.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "C2CD4C-001"; level 2; protein_id "ENSP00000328677.4"; tag "basic"; transcript_support_level "2"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.2"; havana_transcript "OTTHUMT00000451789.2";
chr19   HAVANA   UTR   405438   407098   .   -   .   gene_id "ENSG00000183186.6"; transcript_id "ENST00000332235.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "C2CD4C-001"; level 2; protein_id "ENSP00000328677.4"; tag "basic"; transcript_support_level "2"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.2"; havana_transcript "OTTHUMT00000451789.2";
chr19   HAVANA   UTR   408362   408401   .   -   .   gene_id "ENSG00000183186.6"; transcript_id "ENST00000332235.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "C2CD4C-001"; level 2; protein_id "ENSP00000328677.4"; tag "basic"; transcript_support_level "2"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.2"; havana_transcript "OTTHUMT00000451789.2";

Examples for fetching specific parts from the file [Unix command line]:

  1. Get all "gene" lines:
    awk '{if($3=="gene"){print $0}}' gencode.gtf
    
  2. Get all "protein-coding transcript" lines:
    awk '{if($3=="transcript" && $0~"transcript_type \"protein_coding\""){print $0}}' gencode.gtf
    
  3. Get level 1 & 2 annotation (manually annotated) only:
    awk '{if($0~"level (1|2);"){print $0}}' gencode.gtf
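
Example for filtering by attribute name [Python]:

awk splits on any whitespace by default, so the position of an attribute value depends on which attributes precede it, and that set differs between releases (for example, gene_status and transcript_status were removed after release 25). Filtering on the key name avoids that dependency. This is a hedged sketch: attribute and select are hypothetical helper names, and the pattern assumes values contain no semicolons or double quotes.

```python
import re
from typing import Iterable, Iterator, Optional


def attribute(column9: str, key: str) -> Optional[str]:
    """Return the first value of `key` in the 9th column, or None."""
    pattern = r'(?:^|;\s*)' + re.escape(key) + r'\s+"?([^";]+)"?;'
    m = re.search(pattern, column9)
    return m.group(1) if m else None


def select(lines: Iterable[str], feature: str, **wanted: str) -> Iterator[str]:
    """Yield lines of the given feature type whose attributes match `wanted`."""
    for line in lines:
        if line.startswith("#"):
            continue  # header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        if all(attribute(cols[8], k) == v for k, v in wanted.items()):
            yield line


# Usage, assuming a local file named gencode.gtf:
# with open("gencode.gtf") as handle:
#     for line in select(handle, "transcript", transcript_type="protein_coding"):
#         print(line, end="")
```

Bare values are compared as strings, so a level filter would be written as select(handle, "gene", level="2").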
    

Example for parsing the file [Perl]:

#!/usr/bin/perl

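use strict;
use warnings;

my $gencode_file = "gencode.v23.annotation.gtf";
# three-argument open with a lexical filehandle; report the system error on failure
open(my $in, "<", $gencode_file) or die "Can't open $gencode_file: $!\n";
my %all_genes;
while(<$in>){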
  next if(/^##/); #ignore header
  chomp;
  my %attribs = ();
  my ($chr, $source, $type, $start, $end, $score,
    $strand, $phase, $attributes) = split("\t");
  #store nine columns in hash
  my %fields = (
    chr        => $chr,
    source     => $source,
    type       => $type,
    start      => $start,
    end        => $end,
    score      => $score,
    strand     => $strand,
    phase      => $phase,
    attributes => $attributes,
  );
  my @add_attributes = split(";", $attributes);
  # store ids and additional information in second hash
  foreach my $attr ( @add_attributes ) {
     next unless $attr =~ /^\s*(.+)\s(.+)$/;
     my $c_type  = $1;
     my $c_value = $2;
     $c_value =~ s/\"//g;
     if($c_type  && $c_value){
       if(!exists($attribs{$c_type})){
         $attribs{$c_type} = [];
       }
       push(@{ $attribs{$c_type} }, $c_value);
     }
  }
  #work with the information from the two hashes...
  #eg. store them in a hash of arrays by gene_id:
  if(!exists($all_genes{$attribs{'gene_id'}->[0]})){
    $all_genes{$attribs{'gene_id'}->[0]} = [];
  }
  push(@{ $all_genes{$attribs{'gene_id'}->[0]} }, \%fields);
}
print "Example entry ENSG00000183186.7: ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"type"}.", ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"chr"}." ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"start"}."-".
  $all_genes{"ENSG00000183186.7"}->[0]->{"end"}."\n";

For further questions, please contact gencode_help.

 
This site is hosted by the Wellcome Trust Sanger Institute.