Gene/Transcript Biotypes in GENCODE & Ensembl

Please also compare with the VEGA descriptions.

Further details about the annotation of non-coding RNAs are listed on this Ensembl page.

GENCODE GTF format description.

Biotype Definition
IG_C_gene
IG_D_gene
IG_J_gene
IG_LV_gene
IG_V_gene
TR_C_gene
TR_J_gene
TR_V_gene
TR_D_gene
Immunoglobulin (Ig) variable chain and T-cell receptor (TcR) genes imported or annotated according to the IMGT.
IG_pseudogene
IG_C_pseudogene
IG_J_pseudogene
IG_V_pseudogene
TR_V_pseudogene
TR_J_pseudogene
Inactivated immunoglobulin gene.
Mt_rRNA
Mt_tRNA
miRNA
misc_RNA
rRNA
scRNA
snRNA
snoRNA
ribozyme
sRNA
scaRNA
Non-coding RNA predicted using sequences from Rfam and miRBase.
Mt_tRNA_pseudogene
tRNA_pseudogene
snoRNA_pseudogene
snRNA_pseudogene
scRNA_pseudogene
rRNA_pseudogene
misc_RNA_pseudogene
miRNA_pseudogene
Non-coding RNA predicted to be a pseudogene by the Ensembl pipeline.
TEC To be Experimentally Confirmed. This is used for non-spliced EST clusters that have polyA features. This category has been specifically created for the ENCODE project to highlight regions that could indicate the presence of protein coding genes that require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies.
nonsense_mediated_decay If the coding sequence (following the appropriate reference) of a transcript finishes more than 50 bp from a downstream splice site, then it is tagged as NMD. If the variant does not cover the full reference coding sequence, then it is annotated as NMD if NMD is unavoidable, i.e. no matter what the exon structure of the missing portion is, the transcript will be subject to NMD.
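The 50 bp rule in the nonsense_mediated_decay entry can be sketched in transcript coordinates. This is only an illustration, not the annotation pipeline itself: the function name is_nmd_candidate and the convention that cds_end counts the transcript bases up to and including the stop codon are assumptions made for this example. The partial-transcript case (where NMD must be unavoidable) is not handled.

```python
from typing import Sequence

NMD_THRESHOLD_BP = 50  # distance quoted in the nonsense_mediated_decay entry


def is_nmd_candidate(exon_lengths: Sequence[int], cds_end: int) -> bool:
    """Return True if the CDS ends more than 50 bp before a downstream splice junction.

    exon_lengths: exon lengths in 5'->3' transcript order.
    cds_end: number of transcript bases up to and including the stop codon
             (an assumed convention for this sketch).
    """
    junction = 0
    # There is one exon-exon junction after every exon except the last.
    for length in exon_lengths[:-1]:
        junction += length
        if junction - cds_end > NMD_THRESHOLD_BP:
            return True
    return False


# Exons of 200, 150 and 300 bp give junctions at transcript positions 200 and 350.
print(is_nmd_candidate([200, 150, 300], cds_end=250))  # True: 350 - 250 = 100 bp
print(is_nmd_candidate([200, 150, 300], cds_end=320))  # False: 350 - 320 = 30 bp
```

A CDS that ends in the last exon never triggers the rule, because no junction lies downstream of it.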
non_stop_decay Transcript that has polyA features (including signal) without a prior stop codon in the CDS, i.e. a non-genomic polyA tail attached directly to the CDS without 3' UTR. These transcripts are subject to degradation.
retained_intron Alternatively spliced transcript believed to contain intronic sequence relative to other, coding, variants.
protein_coding Contains an open reading frame (ORF).
processed_transcript Doesn't contain an ORF.
non_coding Transcript which is known from the literature to not be protein coding.
ambiguous_orf Transcript believed to be protein coding, but with more than one possible open reading frame.
sense_intronic Long non-coding transcript in introns of a coding gene that does not overlap any exons.
sense_overlapping Long non-coding transcript that contains a coding gene in its intron on the same strand.
antisense/antisense_RNA Has transcripts that overlap the genomic span (i.e. exons or introns) of a protein-coding locus on the opposite strand.
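The positional definitions of sense_intronic, sense_overlapping and antisense can be expressed as simple interval tests. The following is a minimal sketch under stated assumptions: the Locus class, the helper names and the 1-based inclusive coordinates are all hypothetical choices for this example, and real annotation relies on manual curation and more evidence than coordinates alone.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (start, end), as in GTF


@dataclass
class Locus:
    strand: str            # "+" or "-"
    exons: List[Interval]  # sorted by start, non-overlapping

    @property
    def span(self) -> Interval:
        return (self.exons[0][0], self.exons[-1][1])

    @property
    def introns(self) -> List[Interval]:
        # Gaps between consecutive exons; empty for a single-exon locus.
        return [(a_end + 1, b_start - 1)
                for (_, a_end), (b_start, _) in zip(self.exons, self.exons[1:])]


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def contains(outer: Interval, inner: Interval) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def positional_biotype(lnc: Locus, coding: Locus) -> Optional[str]:
    """Classify a long non-coding locus relative to one protein-coding locus."""
    if not overlaps(lnc.span, coding.span):
        return None
    if lnc.strand != coding.strand:
        return "antisense"
    # Same strand: lncRNA lies entirely inside an intron of the coding gene.
    if any(contains(intron, lnc.span) for intron in coding.introns):
        return "sense_intronic"
    # Same strand: coding gene lies entirely inside an intron of the lncRNA.
    if any(contains(intron, coding.span) for intron in lnc.introns):
        return "sense_overlapping"
    return None


coding = Locus("+", [(100, 200), (500, 600)])
print(positional_biotype(Locus("+", [(250, 300), (350, 400)]), coding))  # sense_intronic
print(positional_biotype(Locus("+", [(10, 50), (700, 800)]), coding))    # sense_overlapping
print(positional_biotype(Locus("-", [(150, 250)]), coding))              # antisense
```

Same-strand loci that overlap exons of the coding gene return None here, since none of the three definitions covers that case.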
known_ncrna
pseudogene Has homology to proteins but generally suffers from a disrupted coding sequence, and an active homologous gene can be found at another locus. Sometimes these entries have an intact coding sequence or an open but truncated ORF, in which case other evidence (for example genomic polyA stretches at the 3' end) is used to classify them as a pseudogene. Can be further classified as one of the following.
processed_pseudogene Pseudogene that lacks introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome.
polymorphic_pseudogene Pseudogene owing to a SNP/DIP, but in other individuals/haplotypes/strains the gene is translated.
retrotransposed Pseudogene owing to a reverse transcribed and re-inserted sequence.
transcribed_processed_pseudogene
transcribed_unprocessed_pseudogene
transcribed_unitary_pseudogene
Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression.
translated_processed_pseudogene
translated_unprocessed_pseudogene
Pseudogene that has mass spec data suggesting that it is also translated.
unitary_pseudogene A species-specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species.
unprocessed_pseudogene Pseudogene that can contain introns because it was produced by gene duplication.
artifact Used to tag mistakes in the public databases (Ensembl/Swiss-Prot/TrEMBL).
lincRNA Long, intervening non-coding (linc) RNA that can be found in evolutionarily conserved, intergenic regions.
macro_lncRNA Unspliced lncRNA that is several kb in size.
3prime_overlapping_ncRNA Transcript where ditag and/or published experimental data strongly supports the existence of short non-coding transcripts transcribed from the 3' UTR.
disrupted_domain Otherwise viable coding region omitted from this alternatively spliced transcript because the splice variation affects a region coding for a protein domain.
vaultRNA Short non-coding RNA gene that forms part of the vault ribonucleoprotein complex.
bidirectional_promoter_lncRNA A non-coding locus that originates from within the promoter region of a protein-coding gene, with transcription proceeding in the opposite direction on the other strand.
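
In practice, the biotypes above are read from the attribute column of a GTF file. As a rough sketch, the following tallies transcript biotypes from GTF lines. It assumes the GENCODE attribute key transcript_type, with the Ensembl key transcript_biotype as a fallback. The function names and the example lines are made up for illustration and are not real annotation.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Matches key "value" pairs; unquoted attributes such as `level 2` are skipped.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field: str) -> Dict[str, str]:
    """Parse GTF column 9 into a dict (repeated keys keep the last value)."""
    return {key: value for key, value in ATTR_RE.findall(field)}


def count_transcript_biotypes(lines: Iterable[str]) -> Counter:
    """Count biotypes over all 'transcript' features in an iterable of GTF lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        biotype = attrs.get("transcript_type") or attrs.get("transcript_biotype")
        if biotype:
            counts[biotype] += 1
    return counts


# Invented example lines, for illustration only.
EXAMPLE_LINES = [
    "##description: made-up example\n",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";\n',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; '
    'gene_type "protein_coding"; transcript_type "protein_coding"; level 2;\n',
    'chr1\tHAVANA\ttranscript\t100\t700\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; '
    'gene_type "protein_coding"; transcript_type "retained_intron"; level 2;\n',
]

print(count_transcript_biotypes(EXAMPLE_LINES))
```

To run it on a real file, pass an open text handle, for example `count_transcript_biotypes(open(path))` for an uncompressed GTF, or one opened with `gzip.open(path, "rt")` for a compressed one.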