RNAseq Genome Annotation Assessment Project (1/2)

Guidelines

Rules

  1. Input data
    • The prediction programs will be able to use the data in various ways, but they should only use the reference genome (human: GRCh37 (hg19), fly: version 5, worm: WS200) and the RNAseq data (1/2) as direct input data. There is a short description (1/2) of the datasets
    • Please ignore haplotypes and use the reference genome only
    • Ideally you should use the different files from each organism, each technology and each cell line and produce predictions and quantification for them (8 files for human, 4 for fly, 7 for worm). You can also submit a set of fully comprehensive predictions (without quantification) using all data for each organism combined
    • Please don't incorporate annotation produced from external sources such as GENCODE, RefSeq, WormBase or FlyBase directly. It is not a problem if your programs were trained on these or other datasets (proteins, gene models, cDNAs, ESTs), but these data should not be used for the prediction directly
  2. Prediction data
    • Programs should predict transcripts
    • Programs should predict genome wide
    • It will be possible to consider submissions in different tracks, e.g. transcript reconstruction using a reference genome, without using a reference genome, based only on transcript assembly, based on an underlying gene finding program, etc.
    • We are encouraging participants to submit predictions for all three organisms
    • We are also encouraging participants to submit quantification data for transcript predictions to indicate the level of expression in reads per kilobase per million mapped reads (RPKM)
    • Participants should predict for the different data types (human: Illumina, Helicos, SOLiD) and cell lines separately, but can use merged sets in order to get full coverage of the transcriptome
    • The format of predictions will be GTF with coordinates as specified below. We will have a parser in place to test and verify the format. Participants can submit a test file beforehand to check their format
    • Along with the data, participants need to submit submission notes listing the methods and input data used.
  3. Evaluation
    • Evaluation will be a) genome wide using GENCODE annotation and b) on a few selected chromosomes using unreleased HAVANA annotation
    • Evaluation metrics will be standard gene prediction assessment metrics: sensitivity, specificity and correlation coefficient at the nucleotide, exon, transcript and gene level
    • There will be an evaluation at the level of splice boundaries. This will assess the ability of the programs to align short reads across splice junctions
    • Predictions of novel transcripts not in the annotation will undergo a level of experimental verification (since the GENCODE annotation has limitations as it does not have transcript coverage from all tier 1 cell lines)
    • A subset of the quantification will be compared against experimental NanoString data produced by the Wold lab (human) and Hoskins lab (Drosophila)
    • For quantification we will also use spike-in control sequences
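
The nucleotide-level metrics named in the evaluation rules can be sketched as follows. This is a minimal illustration, assuming the conventions common in gene prediction assessment, where "specificity" is TP / (TP + FP) and the correlation coefficient is computed from the per-base 2x2 table. The exact definitions used by the RGASP evaluation are not given here, and the function names `positions` and `nucleotide_metrics` are hypothetical.

```python
import math


def positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    return {p for start, end in intervals for p in range(start, end + 1)}


def nucleotide_metrics(predicted, annotated, genome_length):
    """Return (sensitivity, specificity, correlation coefficient).

    predicted / annotated: sets of exonic positions on one sequence.
    genome_length: number of bases on that sequence (needed for true negatives).
    Assumption: specificity = TP / (TP + FP), as is usual in gene prediction.
    """
    tp = len(predicted & annotated)      # predicted and annotated
    fp = len(predicted - annotated)      # predicted only
    fn = len(annotated - predicted)      # annotated only
    tn = genome_length - tp - fp - fn    # neither
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sensitivity, specificity, cc


# Toy example: a 20-base sequence, prediction and annotation overlap by 5 bases.
pred = positions([(1, 10)])
anno = positions([(6, 15)])
print(nucleotide_metrics(pred, anno, 20))  # (0.5, 0.5, 0.0)
```

Exon, transcript and gene level metrics would count whole features rather than bases; how partial matches are scored at those levels is not specified in these rules.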

Conditions

  1. You agree that the submitted predictions will be evaluated by the RGASP team qualitatively and quantitatively
  2. You agree that the RGASP team may publish the results of these evaluations and your prediction sets both in a journal and on the web
  3. You agree to share the details of your method used with all participants after the submission
  4. You certify that your predictions are created using only the data provided and the methods you describe
  5. Prediction sets may be updated before the submission deadline, but after this deadline, the predictions may not be updated for any reason
  6. Predictions not submitted in a validated format and on the requested genome assembly cannot be evaluated

GTF format

Please take your time to read through the definition before writing your output data.

The fields are:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
The following feature types should be used where applicable: "transcript", "exon", "CDS", "start_codon", "stop_codon".

Coordinates are absolute genome locations for every chromosome. The start codon is included in the CDS and the stop codon is part of the UTR if given.

Please use the parser on the submission page to verify your format.

Example lines:

chr21   Predictor-1   exon   12004   12200   .   +   .   gene_id "gene_21_1"; transcript_id "transcript_21_1";
chr21   Predictor-1   CDS    12090   12200   .   +   .   gene_id "gene_21_1"; transcript_id "transcript_21_1";

To give quantification at the transcript and exon level, please use an "RPKM" key/value pair, e.g. for a simple, one-transcript gene model:

chr20   Predictor-1  transcript  12004  12200  .  +  .  gene_id "gene_20_1"; transcript_id "transcript_21_1"; RPKM "20031.5"
chr20   Predictor-1   exon   12004  12100  .  +  .  gene_id "gene_20_1"; transcript_id "transcript_21_1"; RPKM "22051.9"
chr20   Predictor-1   exon   12150  12200  .  +  .  gene_id "gene_20_1"; transcript_id "transcript_21_1"; RPKM "18000.1"
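
For reference, an RPKM value (reads per kilobase per million mapped reads) can be derived from a raw read count as sketched below. This is only an illustration: the function names are hypothetical, and it assumes that a per-feature read count and the total number of mapped reads are already available and that feature length is taken from the 1-based inclusive GTF coordinates.

```python
def feature_length(start, end):
    """Length in bases of a feature with 1-based inclusive coordinates."""
    return end - start + 1


def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    # reads / (length / 1e3) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (length_bp * total_mapped_reads)


print(feature_length(12004, 12100))   # 97
print(rpkm(500, 2000, 10_000_000))    # 25.0
```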

(preferred) A two-transcript gene model with a shared exon, allocating RPKMs between the transcripts at the exon level:

chr21  Predictor-1  transcript  756009  767009  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_1"; RPKM "131.5"
chr21  Predictor-1  exon  756009  756509  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_1"; RPKM "120.5"
chr21  Predictor-1  exon  766500  767009  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_1"; RPKM "140.1"
chr21  Predictor-1  transcript  760300  767009  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_2"; RPKM "3050.5"
chr21  Predictor-1  exon  760300  763080  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_2"; RPKM "3100.4"
chr21  Predictor-1  exon  766500  767009  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_2"; RPKM "3010.1"

(alternative, must be specified ahead of time) Shared exons are listed with duplicated exon-level RPKMs:

chr21  Predictor-2  transcript  756009  767009  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_1"; RPKM "131.5"
chr21  Predictor-2  exon  756009  756509  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_1"; RPKM "120.5"
chr21  Predictor-2  exon  766500  767009  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_1"; RPKM "3160.5"
chr21  Predictor-2  transcript  760300  767009  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_2"; RPKM "3050.5"
chr21  Predictor-2  exon  760300  763080  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_2"; RPKM "3100.4"
chr21  Predictor-2  exon  766500  767009  .  +  .  gene_id "gene_21_1"; transcript_id "transcript_21_2"; RPKM "3160.5"
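
The difference between the two conventions can be checked mechanically. The sketch below groups exon records by coordinates and reports, for each exon shared by several transcripts, whether its RPKM values differ per transcript or are repeated. The function name and the labels "allocated" and "duplicated" are hypothetical, and the input is assumed to be already parsed into tuples.

```python
from collections import defaultdict


def shared_exon_convention(exons):
    """Classify shared exons by how their RPKM values are reported.

    exons: iterable of (transcript_id, start, end, rpkm) tuples for one gene.
    Returns {(start, end): "allocated" | "duplicated"} for exons that occur
    in more than one transcript.
    """
    by_coords = defaultdict(dict)
    for transcript_id, start, end, value in exons:
        by_coords[(start, end)][transcript_id] = value

    report = {}
    for coords, per_transcript in by_coords.items():
        if len(per_transcript) < 2:
            continue  # exon belongs to a single transcript
        distinct = set(per_transcript.values())
        report[coords] = "duplicated" if len(distinct) == 1 else "allocated"
    return report


# Exon records taken from the two example blocks above.
preferred = [
    ("transcript_21_1", 756009, 756509, 120.5),
    ("transcript_21_1", 766500, 767009, 140.1),
    ("transcript_21_2", 760300, 763080, 3100.4),
    ("transcript_21_2", 766500, 767009, 3010.1),
]
alternative = [
    ("transcript_21_1", 756009, 756509, 120.5),
    ("transcript_21_1", 766500, 767009, 3160.5),
    ("transcript_21_2", 760300, 763080, 3100.4),
    ("transcript_21_2", 766500, 767009, 3160.5),
]
print(shared_exon_convention(preferred))    # {(766500, 767009): 'allocated'}
print(shared_exon_convention(alternative))  # {(766500, 767009): 'duplicated'}
```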

The nine main columns are separated by one "tab", the fields within the last column are separated by ";space" and the key from the value by one "space".
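
The separator rules above can be illustrated with a small line parser. This is a sketch, not the parser on the submission page: the function name and the dictionary keys (which follow common GTF column naming) are assumptions, and the only checks performed are the column count and the coordinate order.

```python
def parse_gtf_line(line):
    """Split one GTF line into its nine columns and an attribute dictionary."""
    columns = line.rstrip("\n").split("\t")          # columns: one tab
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = columns

    attributes = {}
    # Fields in the last column: ";space". A trailing ";" is tolerated.
    for field in attr_text.strip().rstrip(";").split("; "):
        key, _, value = field.partition(" ")         # key and value: one space
        attributes[key] = value.strip('"')

    record = {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "frame": frame,
        "attributes": attributes,
    }
    if record["start"] > record["end"]:
        raise ValueError("start coordinate is greater than end coordinate")
    return record


example = ("chr21\tPredictor-1\texon\t12004\t12200\t.\t+\t.\t"
           'gene_id "gene_21_1"; transcript_id "transcript_21_1"; RPKM "120.5"')
parsed = parse_gtf_line(example)
print(parsed["feature"], parsed["start"], parsed["end"])  # exon 12004 12200
print(parsed["attributes"]["RPKM"])                       # 120.5
```

Attribute values are kept as strings, so a quantification value such as RPKM needs an explicit `float(...)` conversion before use.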

The findings will be published and made available here once they are finalized.