The current version of CuAS contains two cucumber varieties: Cucumis sativus L. var. sativus cv. 9930 and Cucumis sativus var. hardwickii PI 183967.

CuAS can be queried using three input formats: IDs (gene ID, isoform ID, UniProt ID or gene name), gene families, and chromosomal positions. Any of the three can be used.

The CuAS database integrates AS events inferred from multiple tissues, isoform functions, isoform features, and tissue-specific splicing events. The results are organized at three levels: gene, transcript and isoform. At the gene level, we list basic information about the gene and its homologs between the two cucumber varieties. At the transcript level, transcript expression abundance, predicted AS events and the PSI values of these events are reported for each query gene across multiple tissues. At the isoform level, functional annotations (GO annotation and KEGG pathway annotation) and features of each isoform are provided.

In previous research using RNA-Seq data from ten tissues of Cucumis sativus L. var. sativus cv. 9930, we aligned reads and assembled transcripts with TopHat (Trapnell, et al., 2009) and Cufflinks (Trapnell, et al., 2010), respectively (Sun, et al., 2018). The resulting transcripts were then compared with the reference genome annotation file using Cuffcompare. To ensure transcript accuracy, we filtered the transcripts with the following strategy:

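(1) Transcripts with one of three class codes (=, j, o) were extracted from the Cuffcompare output. In Cuffcompare, "=" marks a complete match of the intron chain to a reference transcript, "j" marks a potentially novel isoform that shares at least one splice junction with a reference transcript, and "o" marks a generic exonic overlap with a reference transcript.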

(2) Transcripts in the “j” and “o” classes were considered novel. Novel transcripts with a single exon were removed.

(3) Each novel splice junction was required to be supported by at least ten reads, and each known splice junction by at least one read. Transcripts poorly supported by splice junction reads were removed.

(4) Transcript expression was quantified with Salmon (Patro, et al., 2017), and transcripts with a TPM of at least one in at least one sample were used for the analysis (Wagner, et al., 2013).

The same software and parameters were used for Cucumis sativus var. hardwickii PI 183967.
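The four filtering steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline actually used for CuAS: the `Transcript` record, its field names and the function `passes_filters` are hypothetical, and the per-junction read counts and per-sample TPM values are assumed to have already been parsed from the aligner and Salmon outputs.

```python
from dataclasses import dataclass
from typing import Dict, List

KEPT_CLASS_CODES = {"=", "j", "o"}   # step (1)
NOVEL_CLASS_CODES = {"j", "o"}       # step (2)
MIN_READS_NOVEL_JUNCTION = 10        # step (3)
MIN_READS_KNOWN_JUNCTION = 1         # step (3)
MIN_TPM = 1.0                        # step (4)


@dataclass
class Transcript:
    """Hypothetical record holding the values needed by the filters."""
    transcript_id: str
    class_code: str                  # Cuffcompare class code
    n_exons: int
    junction_reads: List[int]        # supporting reads per splice junction
    junction_is_known: List[bool]    # parallel to junction_reads
    tpm_by_sample: Dict[str, float]  # sample name -> TPM


def passes_filters(t: Transcript) -> bool:
    # (1) Keep only the "=", "j" and "o" class codes.
    if t.class_code not in KEPT_CLASS_CODES:
        return False
    # (2) Drop novel ("j"/"o") transcripts that have a single exon.
    if t.class_code in NOVEL_CLASS_CODES and t.n_exons == 1:
        return False
    # (3) Every junction needs enough supporting reads.
    for reads, known in zip(t.junction_reads, t.junction_is_known):
        needed = MIN_READS_KNOWN_JUNCTION if known else MIN_READS_NOVEL_JUNCTION
        if reads < needed:
            return False
    # (4) TPM >= 1 in at least one sample.
    return any(tpm >= MIN_TPM for tpm in t.tpm_by_sample.values())
```

A list of transcripts could then be reduced with `[t for t in transcripts if passes_filters(t)]`.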

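CuAS shows five types of AS events: retained intron (RI), skipping exon (SE), alternative 5' splice site (A5), alternative 3' splice site (A3) and mutually exclusive exons (MX).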

To investigate tissue-specific splicing events, the percent spliced-in index (PSI), a representative measure of AS events, was quantified for all AS events. The PSI measures the fraction of the mRNAs expressed from a gene that contain a specific form of the AS event (Wang, et al., 2008). After transcript quantification, PSI values were calculated with SUPPA (Alamancos, et al., 2015) for all AS events.
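As a rough illustration of how a PSI value follows from transcript abundances, the sketch below divides the summed TPM of the transcripts that contain one form of the event by the summed TPM of all transcripts that define the event. It is not SUPPA's code: the function name `psi`, its arguments and the isoform IDs in the usage note are hypothetical.

```python
from typing import Dict, Iterable, Optional


def psi(tpm: Dict[str, float],
        inclusion_ids: Iterable[str],
        event_ids: Iterable[str]) -> Optional[float]:
    """Return the PSI of one event in one sample, or None if undefined.

    tpm           -- transcript ID -> TPM in the sample
    inclusion_ids -- transcripts that contain the form of interest
    event_ids     -- all transcripts that define the event
                     (inclusion form plus alternative form)
    """
    included = sum(tpm.get(t, 0.0) for t in inclusion_ids)
    total = sum(tpm.get(t, 0.0) for t in event_ids)
    if total <= 0.0:
        # No expression from any transcript of the event: PSI is undefined.
        return None
    return included / total
```

For example, with `tpm = {"iso1": 6.0, "iso2": 2.0}`, the call `psi(tpm, ["iso1"], ["iso1", "iso2"])` gives 0.75, meaning that three quarters of the expression assigned to the event comes from the form carried by `iso1`.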

To better understand the impact of differently spliced isoforms encoded by one gene, we used TransDecoder (https://github.com/TransDecoder/TransDecoder, version 3.0.1) to identify candidate coding regions in the assembled transcripts. TransDecoder uses homology searches against Pfam 30.0 (Finn, et al., 2016) and the UniProt database (version 2016_11) (UniProt Consortium, 2018) as supporting evidence for the open reading frames (ORFs). We selected the single best ORF for each transcript using the parameter “-single_best_orf”. If a premature termination codon was located more than 55 nucleotides upstream of the last splice junction, the transcript was considered a candidate target of nonsense-mediated mRNA decay (NMD) (Nagy, et al., 1998; Kalyna, et al., 2012; Ohtani, et al., 2019). Any transcript with an ORF of at least 300 bp that was not a predicted NMD target was retained for further analysis.
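The NMD rule and the ORF length cutoff can be sketched as below. This is a minimal illustration and not the code used for CuAS. It assumes 1-based transcript coordinates, exon lengths listed in 5' to 3' order, a distance measured from the last nucleotide of the stop codon to the last exon-exon junction upstream of which the stop codon lies, and an ORF length that includes the stop codon. The function names are hypothetical.

```python
from typing import Sequence

NMD_DISTANCE_NT = 55     # distance threshold from the text
MIN_ORF_LENGTH_NT = 300  # minimum ORF length from the text


def is_nmd_candidate(exon_lengths: Sequence[int], stop_end: int) -> bool:
    """True if the stop codon ends more than 55 nt upstream of the last junction.

    exon_lengths -- exon lengths in transcript (5' to 3') order
    stop_end     -- 1-based transcript position of the last stop-codon base
    """
    if len(exon_lengths) < 2:
        # A single-exon transcript has no splice junction.
        return False
    # The last junction lies right after the final base of the second-to-last exon.
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_end > NMD_DISTANCE_NT


def keep_transcript(exon_lengths: Sequence[int],
                    orf_start: int, stop_end: int) -> bool:
    """Retain transcripts with ORF >= 300 nt that are not NMD candidates."""
    orf_length = stop_end - orf_start + 1  # assumption: stop codon included
    return (orf_length >= MIN_ORF_LENGTH_NT
            and not is_nmd_candidate(exon_lengths, stop_end))
```

For a transcript with exons of 200, 300 and 150 nt, the last junction sits after position 500, so a stop codon ending at position 400 is flagged (100 nt upstream), while one ending at position 450 is not (50 nt upstream).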

All the functions in CuAS are predicted at the isoform level. First, we performed a Blast2GO (Conesa and Gotz, 2008) analysis that assigned gene ontology terms to each isoform. Blast2GO used a BLASTP search (E-value 1e-05) against UniProt (release 2017_06). Then, the identified isoforms were mapped to reference canonical pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.kegg.jp/kegg/, version 85.0) (Kanehisa, et al., 2004). KAAS (KEGG Automatic Annotation Server, https://www.genome.jp/tools/kaas/ ) was used to assign KEGG pathways.

In total, 15 types of features were predicted: amino acid composition, sequence features, transmembrane segments, secondary structure, intrinsically disordered regions, signal peptides, subcellular localization, PEST regions, low complexity regions, coiled coils, phosphorylation sites, N-linked glycosylation sites, O-GalNAc glycosylation sites, domains and motifs.


| Feature group | Software | Reference |
| --- | --- | --- |
| Amino acid composition | EMBOSS-6.6.0 | (Rice, et al., 2000) |
| Sequence features | EMBOSS-6.6.0 | (Rice, et al., 2000) |
| Gravy | GRAVY CALCULATOR | (no warranty) |
| Transmembrane segments | MEMSAT 3.0 | (Jones, et al., 1994) |
| Secondary structure | PSIPRED 4.0 | (Jones, 1999) |
| Intrinsically disordered regions | DISOPRED 3.16 | (Jones and Cozzetto, 2015) |
| Signal peptides | SignalP 4.0 | (Bendtsen, et al., 2004) |
| Subcellular localization | YLoc | (Briesemeister, et al., 2010) |
| PEST regions | EMBOSS-6.6.0 | (Rice, et al., 2000) |
| Low complexity regions | EMBOSS-6.6.0 | (Rice, et al., 2000) |
| Coiled coils | EMBOSS-6.6.0 | (Rice, et al., 2000) |
| Phosphorylation sites | NetPhos-3.1 | (Blom, et al., 2004) |
| N-linked glycosylation sites | NetNGlyc-1.0c | (Blom, et al., 2004) |
| O-GalNAc glycosylation sites | NetOglyc-3.1d | (Blom, et al., 2004) |
| Domains (Pfam) | InterProScan 5.24 | (Zdobnov and Apweiler, 2001) |
| Motifs (Prosite) | InterProScan 5.24 | (Zdobnov and Apweiler, 2001) |

Gene families were identified with iTAK (Zheng, et al., 2016) and comprise three types: transcription factors (TFs), transcriptional regulators (TRs) and protein kinases (PKs).

Splicing-related genes were identified with OrthoFinder (version 2.3.1) (Emms, et al., 2019) by comparison with Arabidopsis sequences (Wang, et al., 2004). They include small nuclear ribonucleoproteins, splicing factors, splicing regulation proteins, novel spliceosome proteins and possible splicing-related proteins.

Gene descriptions were assigned with the AHRD tool (https://github.com/groupschoof/AHRD), based on BLASTP results against UniProt and TAIR.

First, the longest protein of each gene was extracted for each of the two cucumber varieties. Homologs among these representative protein sequences were then identified using OrthoFinder (Emms, et al., 2019) with default parameters.
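Selecting one representative protein per gene can be sketched as follows. This is an illustrative helper only: the function name and the `(gene_id, isoform_id, sequence)` input layout are assumptions, and ties in length are resolved by keeping the first isoform seen, which the original description does not specify.

```python
from typing import Dict, Iterable, Tuple


def longest_protein_per_gene(
    proteins: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each gene ID to the (isoform ID, sequence) of its longest protein.

    proteins -- iterable of (gene_id, isoform_id, amino_acid_sequence)
    """
    best: Dict[str, Tuple[str, str]] = {}
    for gene_id, isoform_id, sequence in proteins:
        current = best.get(gene_id)
        # Replace only when strictly longer, so the first isoform wins ties.
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (isoform_id, sequence)
    return best
```

The resulting sequences, written to one FASTA file per variety, would form the kind of input that an orthology tool such as OrthoFinder expects.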

Alamancos, GP. et al. (2015) Leveraging transcript quantification for fast computation of alternative splicing profiles. RNA, 21, 1521-1531.

Bendtsen, JD. et al. (2004) Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol., 340, 783-795.

Blom, N. et al. (2004) Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics, 4, 1633-1649.

Briesemeister, S. et al. (2010) YLoc - an interpretable web server for predicting subcellular localization. Nucleic Acids Res, 38(Web Server issue), 497-502.

Conesa, A. et al. (2008) Blast2GO: A comprehensive suite for functional analysis in plant genomics. Int J Plant Genomics, 2008, 619832.

Dong, C. et al. (2018) Alternative Splicing Plays a Critical Role in Maintaining Mineral Nutrient Homeostasis in Rice (Oryza sativa). Plant Cell, 30(10), 2267-2285.

Emms, DM. et al. (2019) OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol, 20(1), 238.

Finn, RD. et al. (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279-285.

Jones, DT. et al. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol, 292, 195-202.

Jones, DT. et al. (2015) DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics, 31, 857-863.

Jones, DT. et al. (1994) A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry, 33, 3038-3049.

Kalyna, M. et al. (2012) Alternative splicing and nonsense-mediated decay modulate expression of important regulatory genes in Arabidopsis. Nucleic Acids Res, 40(6), 2454-2469.

Kanehisa, M. et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res. 32, D277-280.

Nagy, E. et al. (1998) A rule for termination-codon position within intron-containing genes: when nonsense affects RNA abundance. Trends Biochem Sci, 23(6), 198-199.

Ohtani, M. et al. (2019) NMD-Based Gene Regulation-A Strategy for Fitness Enhancement in Plants? Plant Cell Physiol, 60(9), 1953-1960.

Patro, R. et al. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods, 14(4), 417-419.

Rice, P. et al. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276-277.

Sun, Y. et al. (2018) The comparison of alternative splicing among the multiple tissues in cucumber. BMC Plant Biol, 18, 5.

Trapnell, C. et al. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25, 1105-1111.

Trapnell, C. et al. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol, 28, 511-515.

UniProt Consortium, T. (2018) UniProt: the universal protein knowledgebase. Nucleic Acids Res, 46, 2699.

Wagner, GP. et al. (2013) A model based criterion for gene expression calls using RNA-seq data. Theory Biosci, 132(3), 159-164.

Wang, BB. et al. (2004) The ASRG database: identification and survey of Arabidopsis thaliana genes involved in pre-mRNA splicing. Genome Biol, 5(12), R102.

Wang, ET. et al. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature, 456, 470-476.

Zdobnov, EM. et al. (2001) InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics, 17, 847-848.

Zheng, Y. et al. (2016) iTAK: A Program for Genome-wide Prediction and Classification of Plant Transcription Factors, Transcriptional Regulators, and Protein Kinases. Mol Plant, 9(12), 1667-1670.