GENCODE

Format description of GENCODE GTF

A. TAB-separated standard GTF columns

column-number  content                                    values/format
1              chromosome name                            chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M}
2              annotation source                          {ENSEMBL,HAVANA}
3              feature-type                               {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
4              genomic start location                     integer-value (1-based)
5              genomic end location                       integer-value
6              score (not used)                           .
7              genomic strand                             {+,-}
8              genomic phase (for CDS features)           {0,1,2,.}
9              additional information as key-value pairs  see below
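
As a small illustration of the column layout above, the following sketch splits each line on TABs and prints the feature type with its length. The function name feature_lengths is hypothetical, and the length formula assumes that start and end are both inclusive, as the three-base start_codon in the example lines further down suggests.

```shell
# Hypothetical helper: read GTF lines from stdin and print
# "feature-type<TAB>length" for every annotation line.
feature_lengths() {
  awk -F'\t' '
    /^#/ { next }                            # skip header/comment lines
    NF >= 9 { print $3 "\t" ($5 - $4 + 1) }  # 1-based, inclusive coordinates
  '
}

# Usage (file name is an assumption):
#   feature_lengths < gencode.gtf
```

Splitting on TAB only (-F'\t') keeps the whole key-value list together in column 9, which awk's default whitespace splitting would break apart.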

B. Key-value pairs in 9th column (format: key "value"; )

B.1. Mandatory fields

key name           value format
gene_id            ENSGXXXXXXXXXXX *
transcript_id      ENSTXXXXXXXXXXX *
gene_type          list of biotypes
gene_status        {KNOWN, NOVEL, PUTATIVE}
gene_name          string
transcript_type    list of biotypes
transcript_status  {KNOWN, NOVEL, PUTATIVE}
transcript_name    string
exon_number        indicates the biological position of the exon in the transcript
exon_id            ENSEXXXXXXXXXXX *
level              1 (verified loci),
                   2 (manually annotated loci),
                   3 (automatically annotated loci)

*From version 7, the gene/transcript version number has been appended to Ensembl and HAVANA gene and transcript ids (e.g. ENSG00000160087.16, OTTHUMG00000001911.7).
In a few cases (PAR regions), ids are in the format ENSGRXXXXXXXXXX and ENSTRXXXXXXXXXX to avoid redundancy. These genes are tagged with "PAR" on the Y chromosome.
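
When ids from version 7 or later have to be compared with unversioned ids, the appended version number can be removed first. A minimal sketch, assuming the version is always a trailing dot followed by digits (strip_version is a hypothetical name, not part of any GENCODE tool):

```shell
# Hypothetical helper: remove a trailing ".<digits>" version from one id.
# Ids without a version suffix are printed unchanged.
strip_version() {
  printf '%s\n' "$1" | sed 's/\.[0-9][0-9]*$//'
}

# Usage:
#   strip_version 'ENSG00000160087.16'     # prints the id without ".16"
#   strip_version 'OTTHUMG00000001911.7'   # prints the id without ".7"
```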

B.2. Optional fields

key name                  value format
tag                       part of a special set [*]: {pseudo_consens,CCDS,seleno};
                          or annotation remarks ["cds_start_NF", "mRNA_end_NF", etc.]
                          list of tags
ccdsid                    official CCDS id [*]; CCDS*
havana_gene               gene-id in the havana db [0,1]; OTTHUMG*
havana_transcript         transcript-id in the havana db [0,1]; OTTHUMT*
protein_id                ENSPXXXXXXXXXXX [0,1] (Ensembl protein id of protein coding transcript)
ont                       pseudogene (or other) ontology ids [*]; {PGO:0000004 and others}
transcript_support_level  transcripts are scored according to how well mRNA and EST alignments match over their full length [0,1]
                          1 (all splice junctions of the transcript are supported by at least one non-suspect mRNA),
                          2 (the best supporting mRNA is flagged as suspect or the support is from multiple ESTs),
                          3 (the only support is from a single EST),
                          4 (the best supporting EST is flagged as suspect),
                          5 (no single transcript supports the model structure),
                          NA (the transcript was not analyzed)

number of occurrences:
    [*] - zero or multiple
    [0,1] - zero or one
Up to version 5, the selenocysteine transcript tag listed the position within the translation where the modification occurs ("seleno_354"). From version 6, there is a tag ("seleno") and a separate GTF line with its coordinates.
Before release 4, "cds_start_NF" was listed as "cds start not found", etc. These "not-found" tags indicate incomplete transcripts: we believe that the region is longer, but the evidence (cDNA, etc.) is missing.
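
Because keys marked [*] can occur several times on one line, a lookup has to return every value rather than only the first. The sketch below prints each value of one key on its own line. The name attr_values is hypothetical, and the code assumes that no value contains a semicolon and that key and value are separated by a blank.

```shell
# Hypothetical helper: print every value of one attribute key, one per line.
# $1 = key name (e.g. tag); GTF lines are read from stdin.
attr_values() {
  awk -F'\t' -v key="$1" '
    /^#/ { next }                          # skip header/comment lines
    {
      n = split($9, parts, ";")            # one element per key-value pair
      for (i = 1; i <= n; i++) {
        item = parts[i]
        sub(/^ +/, "", item)               # trim leading blanks
        sp = index(item, " ")              # the first blank separates key and value
        if (sp == 0) continue              # empty element after the last ";"
        if (substr(item, 1, sp - 1) != key) continue
        value = substr(item, sp + 1)
        gsub(/"/, "", value)               # drop the double quotes
        print value
      }
    }
  '
}

# Usage (file name is an assumption):
#   attr_values tag < gencode.gtf | sort | uniq -c
```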

Example GTF lines:

chr21   HAVANA  transcript      10862622        10863067        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  exon    10862622        10862667        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  CDS     10862622        10862667        .       +       0       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  start_codon     10862622        10862624        .       +       0       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  exon    10862751        10863067        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  CDS     10862751        10863064        .       +       2       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  stop_codon      10863065        10863067        .       +       0       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  UTR     10863065        10863067        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
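
The phase values of the two CDS lines above (0 and 2) can be checked by hand: the first CDS segment is 46 bases long, which leaves one base of an unfinished codon, so the next segment starts with two bases that complete it. The sketch below reproduces this calculation. It assumes TAB-separated input restricted to the CDS lines of a single plus-strand transcript, sorted by start, with a complete CDS start; cds_phases is a hypothetical name.

```shell
# Hypothetical helper: for the CDS lines of ONE plus-strand transcript
# (sorted by start), print "start<TAB>end<TAB>expected phase".
cds_phases() {
  awk -F'\t' '
    $3 == "CDS" {
      phase = (3 - (total % 3)) % 3   # bases to skip to reach the next codon start
      print $4 "\t" $5 "\t" phase
      total += $5 - $4 + 1            # coordinates are 1-based and inclusive
    }
  '
}

# Usage (file names are assumptions):
#   grep 'ENST00000302092' gencode.gtf | cds_phases
```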

Examples for fetching specific parts from the file [Unix command line]:

  1. get all "gene" lines:
    awk '{if($3=="gene"){print $0}}' gencode.gtf
    
  2. get all "protein-coding transcript" lines:
    awk '{if($3=="transcript" && $20=="\"protein_coding\";"){print $0}}' gencode.gtf
    
  3. get level 1 & 2 annotation (verified and manually annotated) only:
    awk '{if($26~/^[12];$/){print $0}}' gencode.gtf
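
The commands above address attributes by field position ($20, $26), which only holds while every line lists its keys in the same order and no value contains a blank. A sketch of a position-independent alternative is to split on TABs and match the key inside column 9. The function names are hypothetical, and the patterns assume the key "value"; layout shown in the example lines.

```shell
# Hypothetical helpers: filter GTF lines on stdin by key instead of by position.

# Transcript lines whose transcript_type is protein_coding.
protein_coding_transcripts() {
  awk -F'\t' '$3 == "transcript" && $9 ~ /transcript_type "protein_coding";/'
}

# Lines annotated at level 1 or 2. The (^|;) anchor keeps the pattern from
# matching inside another key such as transcript_support_level.
manual_levels() {
  awk -F'\t' '$9 ~ /(^|;) *level [12];/'
}

# Usage (file name is an assumption):
#   protein_coding_transcripts < gencode.gtf | manual_levels
```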
    

Example for parsing the file [Perl]:

#!/usr/bin/perl

use strict;
use warnings;

# path to the GENCODE GTF file, taken from the command line
my $gencode_file = shift @ARGV or die "Usage: $0 gencode.gtf\n";

open(my $in, '<', $gencode_file) or die "Can't open $gencode_file: $!\n";
my %all_genes;
while(<$in>){
  next if(/^#/); #ignore header and comment lines
  chomp;
  my %attribs = ();
  my ($chr, $source, $type, $start, $end, $score,
    $strand, $phase, $attributes) = split(/\t/);
  next unless defined $attributes; #skip lines without nine columns
  #store nine columns in hash
  my %fields = (
    chr        => $chr,
    source     => $source,
    type       => $type,
    start      => $start,
    end        => $end,
    score      => $score,
    strand     => $strand,
    phase      => $phase,
    attributes => $attributes,
  );
  my @add_attributes = split(";", $attributes);
  # store ids and additional information in second hash
  # (each key maps to an array, because some keys occur more than once)
  foreach my $attr ( @add_attributes ) {
     next unless $attr =~ /^\s*(\S+)\s+(.+?)\s*$/;
     my $c_type  = $1;
     my $c_value = $2;
     $c_value =~ s/^"(.*)"$/$1/; #remove the surrounding double quotes
     push(@{ $attribs{$c_type} }, $c_value);
  }
  #work with the information from the two hashes...
  #eg. store them in a hash of arrays by gene_id:
  next unless exists($attribs{'gene_id'});
  my $gene_id = $attribs{'gene_id'}->[0];
  push(@{ $all_genes{$gene_id} }, \%fields);
}
close($in);

#from version 7 the gene_id carries a version suffix (e.g. ENSG00000160087.16),
#so for those releases the key below must include the suffix as well
my $example_id = "ENSG00000223972";
if(exists($all_genes{$example_id})){
  print "Example entry $example_id: ".
    $all_genes{$example_id}->[0]->{"type"}.", ".
    $all_genes{$example_id}->[0]->{"chr"}." ".
    $all_genes{$example_id}->[0]->{"start"}."-".
    $all_genes{$example_id}->[0]->{"end"}."\n";
}

Using the ENSEMBL PERL API:

The current version of GENCODE is also the default geneset in the Ensembl database.

It can be accessed using one of the following options:

For further questions, please contact Admin.

 
This site is hosted by the Wellcome Trust Sanger Institute.