The refTSS is an annotated reference dataset for transcriptional start sites (TSS) in human and mouse. The dataset is generated by collecting, reprocessing and assembling various public resources.

For questions and inquiries about the data, please contact reftss-help@riken.jp

[Files]
Under the directory of each organism (human and mouse), we store the following files.

* refTSS_v3.1_{human,mouse}_coordinate.{hg38,mm10}.bed
  Coordinates of the TSS in the refTSS dataset. This is a BED file with nine columns:
    1. chromosome
    2. start of TSS region
    3. end of TSS region
    4. name (ID) of the TSS (refTSS ID)
    5. score (=1)
    6. strand of the TSS
    7. start of the center position in the TSS region
    8. end of the center position in the TSS region
    9. item RGB (255,255,0)

* refTSS_v3.1_{human,mouse}_ids.list.txt.gz
  Relationship between TSS peaks in refTSS and their original sources.

* gene_annotation/refTSS_v3.1_{human,mouse}_annotation.txt
  Gene, transcript, and protein associated with each TSS (annotation). The columns are organized as follows:
    1. refTSS ID
    2. Transcript name (accession number)
    3. Distance between the TSS and 5’-end of the transcript
    4. Entrez Gene ID
    5. HGNC/MGI ID
    6. UniProt ID
    7. Gene name
    8. Gene symbol
    9. Gene synonyms
    10. Source of the gene annotation

* gene_annotation/refTSS_v3.1_{human,mouse}_transcript.txt
  All (candidate) transcripts located around each TSS. The columns are as follows:
    1. refTSS ID
    2. Transcript name (accession number)
    3. Distance between the TSS and 5’-end of the transcript
    4. The number of transcripts located around the TSS
    5. All accession numbers of the transcripts

* sources/*.bed
  Sources of the refTSS data, which were assembled to construct the representative TSS set. The files are in BED6.

[Data sources]
In the current release of refTSS, we assembled the following TSS sets.
* FANTOM5 promoter atlas (CAGE mapping of TSS)
  http://fantom.gsc.riken.jp/5/
    * human/source/hg38_v3.CAGE_peaks_merged.bed
    * mouse/source/mm10_v3.CAGE_peaks_merged.bed
* dbTSS (TSS-Seq mapping of TSS)
  http://dbtss.hgc.jp/
    * human/source/dbtss_paraclu_output_mv464_simplified_L20.bed
* EPD (The Eukaryotic Promoter Database) (manually created promoter databases)
  http://epd.vital-it.ch/
    * human/source/Hs_EPDnew_004_hg38.one.bed
    * mouse/source/Mm_EPDnew_002_mm10.one.bed
* DRA000914
  https://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=DRA000914
    * human/source/DRA000914_hg38_CAGE_paraclu_output_mv99_simplified_L20.bed
    * mouse/source/DRA000914_mm10_CAGE_paraclu_output_mv56_simplified_L20.bed
* ENCODE CAGE
  https://www.encodeproject.org/matrix/?type=Experiment&status=released&award.project=ENCODE&assay_title=CAGE
    * human/source/ENCODE_CAGE_paraclu_output_mv1412_simplified_L20.bed
* RAMPAGE
  https://www.encodeproject.org/rampage/
    * human/source/rampge_paraclu_output_mv17233_simplified_L20.bed

The genes, transcripts, and proteins associated with each TSS were generated with the following databases (as of October 2, 2018): GENCODE (v28 and vM18); Entrez Gene; RefSeq; HUGO Gene Nomenclature Committee (HGNC) database; the Mouse Genome Database (MGD); the UCSC Genome Browser; and UniProt.

[Analysis data]
* tss_classfication/TSS.clssification.{hg38,mm10}.bed
  CAGE peaks identified as true TSS by the TSS classifier (https://sourceforge.net/p/tometools/wiki/TssClassifier/). The files contain the set of predicted TSS.

* tata_box_annotations/{hg38,mm10}_tata_annotation_V2.txt
  The TATA-Box was annotated in refTSS by using Homer software for motif discovery and next-gen sequencing analysis (http://homer.ucsd.edu/homer/ngs/annotation.html).
  The resulting text file contains the following set of attributes:
    1: PeakID
    2: Chr
    3: Start
    4: End
    5: Strand
    6: Peak Score
    7: Focus Ratio/Region Size
    8: Annotation
    9: Detailed Annotation
    10: Distance to TSS
    11: Nearest PromoterID
    12: Entrez ID
    13: Nearest Unigene
    14: Nearest Refseq
    15: Nearest Ensembl
    16: Gene Name
    17: Gene Alias
    18: Gene Description
    19: Gene Type
    20: CpG%
    21: GC%
    22: TATA-Box(TBP)/Promoter/Homer Distance From Peak(sequence,strand,conservation)

* tata_box_annotations/reftss.{hg38,mm10}.basic.homer.annotations.txt
  The TATA-Box was annotated in refTSS by using Homer software for motif discovery and next-gen sequencing analysis (http://homer.ucsd.edu/homer/ngs/annotation.html). The resulting text file contains the following set of attributes:
    1: PeakID
    2: Chr
    3: Start
    4: End
    5: Strand
    6: Peak Score
    7: Focus Ratio/Region Size
    8: Annotation
    9: Detailed Annotation
    10: Distance to TSS
    11: Nearest PromoterID
    12: Entrez ID
    13: Nearest Unigene
    14: Nearest Refseq
    15: Nearest Ensembl
    16: Gene Name
    17: Gene Alias
    18: Gene Description
    19: Gene Type

* conservation/liftovered_mm10_to_hg38_peaks_overlapped_reftss_hg38_500bp.bed
  A list of mouse refTSS peaks that, after liftover to hg38, fall within 500 bp of any human refTSS peak. The file shows the locations of the lifted-over mouse refTSS peaks in hg38 genomic coordinates.

* conservation/reftss_hg38_peaks_overlapped_liftovered_mm10_to_hg38_500bp.bed
  A list of human refTSS peaks that are located within 500 bp of any lifted-over mouse refTSS peak.

* conservation/liftovered_hg38_to_mm10_peaks_overlapped_reftss_mm10_500bp.bed
  A list of human refTSS peaks that, after liftover to mm10, fall within 500 bp of any mouse refTSS peak. The file shows the locations of the lifted-over human refTSS peaks in mm10 genomic coordinates.

* conservation/reftss_mm10_peaks_overlapped_liftovered_hg38_to_mm10_500bp.bed
  A list of mouse refTSS peaks that are located within 500 bp of any lifted-over human refTSS peak.

* regulatory_build/{hg38,mm10}.RegBuild.gz
  Regulatory annotations from the Ensembl Regulatory Build (ERB) in which each TSS peak is located.

[Future release plans]
We are planning further releases that include:
* Assembly of more TSS data sources
* Integration of various resources, including TSS activities, transcriptional regulation, and epigenetic information, into refTSS.
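
[Example: reading the coordinate file]
The following is a minimal Python sketch, not part of the refTSS distribution, showing one way to read the nine columns listed above for the coordinate file. It assumes the file is tab-delimited and that the score is an integer. The names RefTssRecord and parse_reftss_bed, and the input line with the ID "example_tss_1", are hypothetical and exist only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class RefTssRecord(NamedTuple):
    """One row of the coordinate file (hypothetical container type)."""
    chrom: str
    start: int
    end: int
    reftss_id: str
    score: int
    strand: str
    center_start: int
    center_end: int
    item_rgb: str


def parse_reftss_bed(lines: Iterable[str]) -> Iterator[RefTssRecord]:
    """Yield one record per data line, assuming tab-delimited columns."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and the header-style lines some BED files carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
        yield RefTssRecord(
            chrom=fields[0],
            start=int(fields[1]),
            end=int(fields[2]),
            reftss_id=fields[3],
            score=int(fields[4]),
            strand=fields[5],
            center_start=int(fields[6]),
            center_end=int(fields[7]),
            item_rgb=fields[8],
        )


# Made-up input line for illustration only; it is not taken from refTSS.
example_lines = [
    "# illustrative comment line\n",
    "chr1\t1000\t1020\texample_tss_1\t1\t+\t1010\t1011\t255,255,0\n",
]

for record in parse_reftss_bed(example_lines):
    width = record.end - record.start
    print(record.reftss_id, record.chrom, record.strand, f"width={width}")
```

To read a real file, pass an open file handle instead of the list, for example open("refTSS_v3.1_human_coordinate.hg38.bed"), adjusting the path to wherever the file was downloaded.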