Full-length transcriptome sequencing and comparative transcriptomic analysis to uncover genes involved in early gametogenesis in the gonads of Amur sturgeon (Acipenser schrenckii)

Background Sturgeons (Acipenseriformes) are polyploid chondrostean fish that constitute an important model species for studying development and evolution in vertebrates. To better understand the mechanisms of reproduction regulation in sturgeon, this study combined PacBio isoform sequencing (Iso-Seq) with Illumina short-read RNA-seq methods to discover full-length genes involved in early gametogenesis of the Amur sturgeon, Acipenser schrenckii. Results A total of 50.04 G subread bases were generated from two SMRT cells, and herein 164,618 nonredundant full-length transcripts (unigenes) were produced with an average length of 2782 bp from gonad tissues (three testes and four ovaries) from seven 3-year-old A. schrenckii individuals. The number of ovary-specific expressed unigenes was greater than those of testis (19,716 vs. 3028), and completely different KEGG pathways were significantly enriched between the ovary-biased and testis-biased DEUs. Importantly, 60 early gametogenesis-related genes (involving 755 unigenes) were successfully identified, and exactly 50% (30/60) genes of those showed significantly differential expression in testes and ovaries. Among these, the Amh and Gsdf with testis-biased expression, and the Foxl2 and Cyp19a with ovary-biased expression strongly suggested the important regulatory roles in spermatogenesis and oogenesis of A. schrenckii, respectively. We also found the four novel Sox9 transcript variants, which increase the numbers of regulatory genes and imply function complexity in early gametogenesis. Finally, a total of 236,672 AS events (involving 36,522 unigenes) were detected, and 10,556 putative long noncoding RNAs (lncRNAs) and 4339 predicted transcript factors (TFs) were also respectively identified, which were all significantly associated with the early gametogenesis of A. schrenckii. Conclusions Overall, our results provide new genetic resources of full-length transcription data and information as a genomic-level reference for sturgeon. Crucially, we explored the comprehensive genetic characteristics that differ between the testes and ovaries of A. schrenckii in the early gametogenesis stage, which could provide candidate genes and theoretical basis for further the mechanisms of reproduction regulation of sturgeon.


Background
Sturgeons (Acipenseriformes) are polyploid chondrostean fish that originated during the Devonian period and have over 200 million years of history; thus, they constitute an important model species for studying development and evolution in vertebrates [1,2]. As the source of caviar food, the sturgeon has high economic value, which has resulted in intense fishing pressure on wild stocks, so leading them to be listed among the more endangered group of species. In the updated February 2019 press release, the International Union for Conversation of Nature and Natural Resource (IUCN) Red List identified sturgeon as one of the most endangered animal groups; some 85% of sturgeon species are on the verge of extinction (http://www.iucnredlist.org/search). Efforts have been made worldwide to develop a sturgeon aquaculture industry for artificial reproduction since the early 1960s. However, to date, many new problems have emerged that require attention. For instance, The developmental asynchronism of the double sex gametes i.e. sperm and ovum and reproduction interval of 2-7 years markedly reduce reproduction efficiency in sturgeon artificial propagation practices, causing more aquaculture cost. Therefore, studies of gametogenesis mastering the mechanism of reproduction regulation in sturgeon are greatly useful for reproduction evaluation and aquaculture management.
Gametogenesis includes oogenesis and spermatogenesis, which mainly are comprised of germ cell growth and proliferation, primary spermatocytes and primary oocytes formation, until matured gametes of double sexes production. In fish, the onset of meiosis of germ cell is one of the most important steps in gametogenesis, which involved in the sex steroid hormones stimulating and the endocrine regulation [3,4]. The previous studies reported various abnormalities in sturgeon gametogenesis and occurs of aberrant intersex gonads in cultured sturgeons [5,6], which significantly impact on the viability of progeny and decrease the efficiency of sturgeon stock enhancement. In this context, the first and core step in investigating the key genes involved in early gametogenesis in sturgeon is to acquire full-length nucleotide sequences information of the genes. Wholegenome sequencing and assembly combined with transcriptome data would be an efficient way to systematically characterize gene models. However, the sturgeon genome is large and includes numerous mini-chromosomes and substantial polyploidy caused by genome duplication [7,8], which possess significant difficulties to entire-genome sequencing. Recently, transcriptome-scale sex-related gene characterization was conducted in different sturgeon species with next-generation high-throughput sequencing technologies, including Adriatic sturgeon (A. naccarii) [9], Chinese sturgeon (A. sinensis) [10], Amur sturgeon (A. schrenckii) [11,12] and Russian sturgeon (A. gueldenstaedtii) [13]. However, the genes acquired by only assemble procedure used in above studies signify incomplete sequences, which extremely restricts the yield of full-length genes. The third-generation long-read sequencing platform can overcome this difficulty.
In comparison with short-read sequencing, the methodological advantages of PacBio Isoform sequencing (Iso-Seq) include better completeness in sequencing both the 5′ and 3′ ends of the full-length cDNA molecules and greater accuracy in producing isoform-level transcripts. Recently, PacBio Iso-Seq technology has been successfully used in multiple species, such as Populus [14], Dolly Varden char [15], human cell lines and tissues [16], rabbits [17] and primates [18]. The transcriptomic data produced by PacBio Iso-Seq provide innovative research materials. For example, deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon has opened the door for gaining a deep understanding of genome evolution and human diseases [19], and the full-length transcripts from the Iso-Seq platform have provided new insight into the extreme metabolism of the ruby-throated hummingbird [20]. Meanwhile, the reconstruction and annotation of fulllength transcripts also plays a critical role in gene discovery, particularly for species with no reference genome, such as the transcript variants involved in the innate immune system in Litopenaeus vannamei [21], gene families with two and more isoforms in Misgurnus anguillicaudatus [22] and transcript diversity in bioactive compound biosynthesis of Astragalus membranaceus [23]. Full-length transcriptome analyses may also drive new innovative progress for understanding the mechanisms of reproduction regulation in sturgeon.
In this study, we adopted joint PacBio Iso-Seq and short-read RNA-seq to generate a high-confidence fulllength transcripts dataset of the gonads of 3-year-old A.
schrenckii individuals and further used them to obtain comparative transcriptomic analysis for quantification investigation between testes and ovaries. Subsequently, a functional annotation of these full-length transcripts was systematically conducted with well-curated databases. Alternative splicing (AS) events, long noncoding RNAs (LncRNAs) and transcription factors (TFs) were detected. Most importantly, searching for genes involved in early gametogenesis and association analyses of related factors and early gametogenesis were performed. Herein, we not only provide a valuable resource i.e. a comprehensive full-length transcript set for the genomic-level reference of sturgeon but also systematically characterize early gametogenesis related genes of A. schrenckii to further investigate the functions of molecules during gametogenesis in sturgeon.

Samples collection and histological analysis
A total of ten 3-year-old healthy Amur sturgeon (five males and five females) were sampled from the Engineering and Technology Center of Sturgeon Breeding and Cultivation of the Chinese Academy of Fishery Sciences (Beijing, China) in this study. Before sampling, the experimental individuals were anesthetized with eugenol in water for 1-3 min according to the AVMA guidelines for use (2013 version). The gonads i.e. ovaries and testes were separately collected and treated. One part of each gonad tissues was preserved in Bouin's fixation for histological procedures, and another part was immediately immersed into liquid nitrogen until total RNA extraction. The histological analysis of each gonad tissue was performed using hematoxylin and eosin (HE) staining.

RNA extraction and quality evaluation
Total RNA was extracted from each gonad tissue (testis or ovary) using RNAiso reagent (Takara, Tokyo, Japan). To prevent genomic DNA contamination, RNA sample was treated to digest DNA using RNase-free DNase I during extraction of total RNA. RNA purity and concentration were checked using a Nanodrop 2000 spectrophotometer (Thermo Scientific). RNA integrity was assessed using an Agilent RNA 6000 Nano reagents part I Kit in an Agilent 2100 Bioanalyzer System (Agilent Technologies, CA, USA). The RNA quality criteria for the RNA samples were RIN ≥ 7.0 (RNA Integrity Number), and 1.8 < OD260/ 280 < 2.2. The qualified RNAs were used for further PacBio and Illumina library construction, respectively. All the sequencing operations were conducted at Biomarker Technologies CO., LTD (Beijing, China). Considering the relatively large size of the sturgeon genome and the available A. schrenckii individuals (a presumed octaploid species), the standard of sequencing quantity (clean data) was set as follows: Pacbio Iso-Seq of the library was at least 40 G and Illumina short-read RNA-seq was greater than 20 G for each sample.

PacBio library construction and sequencing
To construct the library for PacBio sequencing, the qualified RNA from seven tissues, including the three testes and four ovaries, were mixed in equal amounts. The mixed RNA sample was reverse-transcribed for mRNA using the SMARTer™ PCR cDNA Synthesis Kit. PCR amplification was performed using the KAPA HiFi HotStart PCR Kit. Then, the PCR product for the SMRTbell library was constructed using the SMRTbell template pre kit. The concentration of the SMRTbell library was measured using a Qubit 3.0 fluorometer with a Qubit™ 1X dsDNA HS Assay kit (Invitrogen, Carlsbad, USA). The quantified criteria of library quality were concentration > 10 ng/μl with dispersive but continuous distribution in the range of 1-10kbp. A total of 2.5 ng of the library was sequenced for each SMRT cell using the binding kit 2.1 from the PacBio Sequel platform, producing 20 h of movies. The sample information was first registered as BioProject with accession number PRJNA532819 and BioSample with accession number SAMN11415730. Subsequently, the subread sequence generated by the PacBio Iso-Seq platform was deposited into the NCBI Sequence Read Archive (SRA) with accession number SRR7453063.

Error correction of PacBio Iso-Seq reads
According to PacBio's protocol, the raw polymerase reads were first processed using SMRTlink 5.0 software. Briefly, after removing the SMRTbell™ adapter and the low-quality data, postfilter polymerase reads were obtained. The circular consensus sequence (CCS) was generated from the subreads BAM files, also known as the reads of insert (ROI). All the ROIs whose number of full passes were > 1 were further classified into full-length (FL) and non full-length (nFL) transcript sequences based on whether the 5′ primer, 3′ primer and poly A tail could be simultaneously observed. We employed a three-step strategy for error correction to improve the accuracy of the full-length transcripts produced by the PacBio Iso-Seq platform. First, the circle sequencing with > 1 pass provided an opportunity for ROI selfcorrection. Second, full-length, nonchimeric (FLNC) reads were subjected to nonredundant and clustering treatments by the ICE Quiver algorithm and to Arrow polishing with the nFL sequence, producing high-quality and polished full-length consensus sequences. Finally, these polished consensus sequences were further subjected to correction and redundancy removal with Illumina short reads using the Proovread tool and the CD-HIT program with a -c 0.99 parameter cutoff [24], respectively. The above three corrections resulted in nonredundant, nonchimeric, full-length unigenes (isoform level) with high accuracy for subsequent analyses.

Illumina library construction and sequencing
The Illumina library was prepared using the NEBNext, Ultra™ RNA Library Prep Kit (E7530 L) for Illumina (NEB, USA). Briefly, polyadenylated RNA was isolated and randomly separated into fragments. First-strand cDNA was synthesized using random hexamer primers, followed by second-strand synthesis. The purification of the double-stranded cDNA was performed using VAHTSTM mRNA capture beads. The purified and repaired double-stranded cDNA fragments were selected by size in the range of 250 bp~350 bp. The concentration and quality control of the Illumina library were measured using a Qubit 3.0 fluorometer with an ExKubit dsDNA HS Assay kit (Invitrogen, Carlsbad, USA) and a Qsep 400 fragment analyzer, respectively. The quantified criteria of the library quality were a concentration > 1 ng/μl in a range of 380 bp~480 bp fragments. The Illumina libraries were finally sequenced on the Illumina HiSeq platform.
Raw short reads in FASTQ format were first processed through in-house Perl scripts; clean short reads were obtained by removing reads containing adapter and ploy-N, and with low quality from the raw short reads. Simultaneously, the Q30, GC content and sequence duplication level of the clean data were calculated. The clean short reads were then mapped to the PacBio reference sequence using Tophat2 tools. Only reads with perfect matches or only one mismatch were further analyzed.

Quantification of unigene expression levels and differential expression analysis
Using the full-length transcripts yielded from the SMRT Iso-Seq analysis as reference sequences, the unigene expression levels between the ovaries and testes of A. schrenckii were further analyzed based on the shortreads datasets generated by the Illumina sequencing platform. The extracts of three testes and four ovaries from seven A. schrenckii individuals were separately used as examples of the analysis.
Quantifications of the unigene expression values from the Illumina reads of each sample were determined with RSEM using the default parameters [34]. Briefly, the clean data from the Illumina sequencing were mapped back onto the reference sequences, and the readcount values of the unigenes for each sample were obtained. To eliminate the effects of the sequencing depth and transcript length, all the readcounts were transformed into FPKM values (expected number of fragments per kilobase of transcript sequence per million base pairs sequenced). The 0.1 threshold of FPKM value is regarded as the expression criterion of the unigene in tested tissue (FPKM > 0.1) according to the method [35].
To detect the differentially expressed genes (DEGs), all the readcounts of each sample were first normalized into a standardized readcount using the edgeR package. Then, the differential expression analysis of testes and ovaries was performed using the EBSeq R package mode based on the negative binomial distribution from the biological replicates. The resulting false discovery rate (FDR) values were adjusted using the posterior probability of being DE (PPDE) approach. Herein, the |log2(Fold Change)| > 1 and FDR < 0.05 were used as the threshold for determining DEGs.

Alternative splicing prediction and analysis
Based on the BLAST method [25], all the unigenes were used for pairwise alignment. Finally, BLAST alignments that met all three of the following criteria were considered products of candidate AS events [36]. Briefly, 1) the length of two unigenes was both greater than 1000 bp, and there were two high-scoring segment pairs (HSPs) in the alignment. 2) The AS gap between two aligned unigenes was greater than 100 bp and at least 100 bp from the 3′ end and 5′ end. 3) A 5 bp overlap could be tolerated.

LncRNA prediction and analysis
Unigenes with a length of over 200 nt and having more than two exons were selected as lncRNA candidates. Then, four computational approaches were employed to further screen the protein-coding unigenes from the noncoding unigenes: 1) we performed the codingnoncoding-Index (CNCI) with default settings [37] to assess the coding potential; 2) we used the coding potential calculator (CPC) to search for the unigenes in the NCBI eukaryotic protein database with a score < 0 setting [38]; 3) we translated each transcript in all three possible frames and used the Pfam scan utility with default parameters to identify the occurrence of any of the known protein family domains documented in the Pfam database [39]; 4) we used the coding potential assessment tool (CPAT) to assess the putative protein-coding unigenes by calculating the Fickett and Hexamer scores based on the logistic regression model [40]. As a result, all the isoform transcripts with coding potential were filtered, and the intersecting unigenes without coding potential formed our candidate set of lncRNAs.

Transcription factor prediction and further analysis
Transcription factor (TF)-related unigene sequences were predicted using the BLAST method with the AnimalTFDB database [41]. The HMG domain of SOX was further verified using the SMART mode (http:// smart.emblheidelbergde/index2.cgi).
According to the lengths of the four nonredundant full-length Sox9 genes, we renamed them as Asc_Sox9-1, Asc_Sox9-2, Asc_Sox9-3 and Asc_Sox9-4, respectively. A neighbor-joining phylogenetic tree was reconstructed based on amino acid sequences using the MEGA 7.0 package with the following parameter settings: p-distance model and partial deletion treatment. Meanwhile, Sox2 from the zebrafish Danio rerio (accession number: BAE48583.1) was chosen as the out-group protein sequence.

Histology characteristics from the gonads of 3-year-old A. schrenckii
The gonads of 3-year-old A. schrenckii were visible at the early gametogenesis stage. In 3-year-old female individuals, the histological section of ovary showed deep, branching ovarian lamellae structure and was mainly composed of primary growth oocytes of different diameter sizes ranging from 100 to 500 μm. Some oocytes at the perinucleolar stage were clearly observed with a high number of small nucleoli along the nuclear perimeter. In 3-year-old male individuals, section of the entire germ region of testis showed smooth surface of the lateral side. By histological observation, it was found that the testis tissue displayed alveolate seminiferous lobules organization structure. The anastomosing tubules were separated from each other by thin layers of compact connective tissues and filled with spermatogonia enveloped by their own Sertoli cells. Meanwhile, we also found that there were some degenerating oocytes probably undergoing autophagy and some spermatogonia with pyknotic nuclei were probably apoptotic. Histomorphological characteristics of the ovary and testis from 3-year-old A. schrenckii individuals are shown in Fig. 1.

Full-length transcripts from the gonads of A. schrenckii
The full-length transcriptome of A. schrenckii was generated using the PacBio Sequel platform on the pooled RNA from seven tissues, including three testes and four ovaries. The resulting total of 50.04 G subread bases was generated by two SMRT cells from the PacBio library; therefore, the 1,260,958 reads of insert were produced with a mean read quality of 0.95 and mean passes of 14 circles (Supplementary Table 1). By applying the standard Iso-Seq classification and clustering protocol, all the ROIs were further classified into 358,153 nFL sequences and 860,617 FLNC reads with a mean length of 2548 bp. Based on the ICE Quiver and Arrow polishing algorithms, we produced 461,596 polished full-length consensus transcripts with a mean length of 2782 bp, including 335,067 high-quality (HQ) and 125,969 lowquality (LQ) sequences. After correction using short reads produced by Illumina short-read RNA-seq and subsequently removing redundancies using the CD-Hit program, the consensus transcripts were finally clustered into a total of 164,618 unigenes for subsequent analysis. We found that 91.09% of the unigenes main length distribution ranging from 1 to 6 kbp ( Fig. 2 and Supplementary Table 2). The Iso-Seq statistics from the gonads of A. schrenckii by the PacBio Sequel platform are listed in Table 1.
Meanwhile, the clean short reads were respectively generated from seven gonads of A. schrenckii (including three testes and four ovaries) using Illumina sequencing platform. The quality evaluation indexes of the clean short reads are summarized in Supplementary Table 3. In sum, the high-quality clean short reads from each sampled Illumina library was obtained with at least 6.96 × 10 7 read numbers and a Q30 of over 90%.

Efficient gene annotation of full-length unigenes from A. schrenckii gonads
To obtain a comprehensive functional annotation from the full-length gonad transcriptome of A. schrenckii, all the 164,618 nonredundant unigenes were assigned to align with seven different databases, including NR, KOG, Pfam, SwissProt, KEGG, GO, and eggNOG. A total of 93.55% of the unigenes (154,006 of 164,618) were successfully annotated with significant hits (E-value <1E − 5 ) from these well-curated databases. The statistics of the full-length unigene annotations are listed in Table 2. The remaining unannotated unigenes (10,612 unigenes) might represent novel A. schrenckii species-specific genes.
Among the annotated 60 classified GO terms, cellular process was identified as the most common annotation in the biological process; metabolic process and biological regulation were the next most abundant GO terms. In the molecular function and cellular component categories, binding and cell part annotations were identified as the most abundant terms, respectively. Two putative early gametogenesis related GO terms, including reproductive process (involving 1188 unigenes) and reproduction (involving 1134 unigenes), were successfully annotated. The GO classifications of the full-length unigenes from the gonads of A. schrenckii are shown in Supplementary Figure 1 and Supplementary Table 4.
In the KEGG classification, a total of 295 pathways annotated from 87,086 nonredundant unigenes were extracted from the gonad transcriptome of A. schrenckii (Supplementary Table 5). The results showed that the protein processing endoplasmic reticulum (2329 unigenes), RNA transport (2312 unigenes) and cell cycle (2263 unigenes) were the top three pathways with the most abundant unigenes. Notably, we paid attention to 14 KEGG pathways, which may be closely associated with early gametogenesis of sturgeon. Among these, MAPK signaling pathway (1834 unigenes), oocyte meiosis (1731 unigenes) and progesterone-mediated oocyte maturation (1398 unigenes) were the top three pathways with the most abundant unigenes distribution.

Search for genes involved in early gametogenesis of A. schrenckii
A total of 60 genes (involving 755 unigenes), reported as active in the gonad development of sturgeon [10,[42][43][44], were detected to be present in the full-length gonad transcriptome of 3-year-old A. schrenckii. Therefore, the 60 early gametogenesis related genes and their NR annotations of full-length unigenes from the Iso-Seq of A. scipenserkii gonads are listed in Table 3 and Supplementary Table 6, respectively. Mainly, two major sex determination related genes in fish, Dmrt1and Gsdf, were also present in the full-length gonad transcriptome. Nine spermatogenesis-related genes were significantly matched with the full-length gonad transcriptome, including AR, Vasa, ER, Rspo1, Igf I, Dkk1, Hsd11b2, Cyp11b and ATRX. Meanwhile, four oogenesis-related genes were also predicted from the full-length gonad transcriptome, including Foxl2, Cyp19a, VR and Ozf. Four hormone receptor genes, including AR, ER, Gnrhr and Fshr. Two members of transforming growth factor (TGF)-β superfamily, Amh and Gsdf. Nine genes belonging to the sox subfamily, Sox3, Sox4, Sox5, Sox6, Sox7, Sox8, Sox9, Sox10 and Sox18, and five genes belonging to the double sex and mab-s (DM) domain, Dmrt1, Dmrt2, DmrtA1, DmrtA2 and DmrtB1, were found. In addition, Nine Sap family genes and four Hsp family genes were also concerned. We also found that Oct4 has the most abundant Differential expression genes in early gametogenesis in the gonad transcriptome of A. schrenckii Using nonredundant full-length transcripts as genome sequence references and combining the clean short reads datasets from the Illumina sequencing platform, the expression values (FPKM) of all 164,618 unigenes in three testes and four ovaries of seven A. schrenckii individuals were separately obtained. Therefore, the expression characteristics of all the full-length unigenes between the testes and ovaries of A. schrenckii were classified into the following three categories. 1) 19,481 unigenes were not expressed in either ovaries or testes (FPKM = 0 or FPKM ≤0.1); 2) 19,716 unigenes (13.6%) were ovaryspecific expression patterns, including 14,284 unigenes in 0.1 < FPKM< 2, 4242 unigenes in 2 < FPKM< 10 and 1190 unigenes with FPKM> 10; 3) In contrast, only 3028 unigenes (2.1%) were exclusively transcribed in testis tissues, including 2751 unigenes in 0.1 < FPKM< 2, 229 unigenes in 2 < FPKM< 10 and 48 unigenes with FPKM> 10. Here, the distribution of testis-specific and ovaryspecific unigenes are shown in a Venn diagram (Fig. 3a).
To uncover the biological function of DEUs, we separately performed KEGG pathway enrichment analysis for the ovary-biased and testis-biased DEUs. As shown in Fig. 3b and c, we found that completely different KEGG pathways were enriched between the ovary-biased and testis-biased DEUs. A total of 17 and 11 terms significantly enriched from the ovary-biased and testis-biased DEUs were discovered, respectively (corrected P < 0.05). Among the ovary-biased DEUs, most of the unigenes were involved in the three top KEGG pathways, including the cell cycle, oocyte meiosis and phagosome. However, most of the unigenes of the testis-biased DEUs were related to cell adhesion molecules, neuroactive ligand-receptor interactions, calcium signaling pathways, etc. We further analyzed the distribution of DEUs in early gametogenesis-related GO terms and KEGG pathways. We found that oocyte meiosis is the most significant pathway with 277 DEUs of 1731 unigenes, followed by the progesterone-mediated oocyte maturation pathway with 201 DEUs (Table 5).
Association analysis of alternative splicing (AS) and early gametogenesis from the gonad full-length transcriptomes of A. schrenckii Because sturgeon species still had no reference genome, we detected AS transcript isoforms of the full-length gonad transcriptome from A. schrenckii by referring to the pipeline of Amborella trichopoda without a reference genome [36]. As a result, a total of 236,672 AS events  however, it is interesting to note that 5314 unigenes (14.55%) were predicted to have more than 10 isoforms (Fig. 4a). Importantly, we found that sixteen early gametogenesisrelated genes were predicted to be involved in AS events (Supplementary Table 7). We selected two genes as examples, including Vasa (unigene ID: F01_cb6729_c68/f1p2/ 2928) predicted with four alternative isoforms and Fem1 (unigene ID: F01_cb8161_c15/f1p2/1763) predicted with five alternative isoforms. In Fig. 4b, the cluster heatmap (Log 2 (FPKM+ 1) values) indicates the expression patterns of different alternative isoforms in the testes and ovaries of A. schrenckii. Vasa, as a exclusively expressed marker gene in germ-line, showed the predominant expression transcript with higher expression level in testes than that in ovary among 4 isoforms. Isoform 1 of Vasa had an obviously high expression in ovaries, and the other isoforms showed very low expression levels in both testes and ovaries. Meanwhile, we also found that expression levels of Fem and its five isoforms might seem like a paradoxical pattern. For example, isoform 2 and isoform 3 were the two main expression transcripts and had relatively high expression levels in both the ovaries and the testes, but isoform 4 and isoform 5 showed the opposite pattern between the ovaries and the testes. Therefore, these alternative isoforms suggest an important role in the regulation of gene expression through compensation or neutralization effects.
Subsequently, we investigated the distribution of AS events in the early gametogenesis-related GO terms and KEGG signaling pathways. The results indicated that the AS event is a pervasive feature involved in the early gametogenesis processes of sturgeon. The column plot showed that oocyte meiosis, progesterone-mediated oocyte maturation and the MAPK signaling pathway are the three most abundant AS events unigenes, while steroid biosynthesis is the least abundant event, with only eight unigenes (Fig. 4c).
Association analysis of long noncoding RNAs (LncRNAs) and early gametogenesis from full-length gonad transcriptomes of A. schrenckii A total of 10,566 unigenes were identified as putative LncRNAs from the full-length gonad transcriptome of A. schrenckii (Fig. 5a). Further analysis indicated that 2388 of the detected lncRNAs were not expressed in either the testes or ovaries (FPKM = 0 or FPKM≤0.1). Meanwhile, 524 LncRNAs present exclusively in ovaryspecific expression and 551 LncRNAs present exclusively in testis-specific expression were also detected. Overall, 961 putative LncRNAs were differentially expressed between the ovaries and testes of A. schrenckii, including 235 ovary-biased LncRNAs and 726 testis-biased LncRNAs (Supplementary Figure 3).

Association analysis of transcript factors (TFs) and early gametogenesis from full-length gonad transcriptomes of A. schrenckii
A total of 4339 nonredundant TF-related unigene sequences were matched against the AnimalTFDB database using BLAST, corresponding to 53 TF-related Pfam family domains. The top 20 abundant terms are listed in Fig. 5b. The sixty unigenes were found to have two or more different TF-related domains. For example, the results indicated that F01_cb10567_c40/f1p0/2683 was predicted to have a Homebox and miscellaneous domain, which suggests that further identification needs to be performed.
Many SRY-related HMG (high-mobility group) box (Sox) transcription factors play an important role in male gonadal differentiation, spermatogenesis and gonadal function maintenance in vertebrate species. Because the SOX family often shares the conserved HMG box domain, members of the Sox family were herein identified from the full-length gonad transcriptome of A. schrenckii. After further validation using SMART protein motif analysis, 82 nonredundant unigenes were found to have both HMG domains and SOX NR annotations with significant matches. We also found that the sequence variations of Sox increased the numbers of unigenes, and the 82 nonredundant unigenes belonged to seven members of the SOX family, including Sox3 (52 unigenes), Sox4 (4 unigenes), Sox5 (5 unigenes), Sox7 (10 unigenes), Sox8 (6 unigenes), Sox9 (4 unigenes) and Sox10 (1 unigene) ( Fig. 5c and Supplementary Table 8).
Further analysis was performed using Sox9 as an example. The four nonredundant full-length unigenes were renamed Asc_Sox9-1-4 according to the length of PacBio Iso-Seq. The characteristics of the only completed sturgeon Sox9 protein from A. sinensis in the NCBI  database (accession number: AHZ62758.1) and the Asc_ Sox9-1-4 sequence are listed in detail in Table 6. The results showed two main differences: 1) compared to the length of Asi_Sox9 (2145 bp), those of the four Asc_ Sox9-1-4 were longer and varied from 2873 bp to 3491 bp.
2) The lengths of the amino acids in Asc_Sox9-1-4 also varied, from 429 aa to 488 aa. The information for the Asc_Sox9-1-4 sequences is listed in Supplementary File 1. Due to the significant differences in amino acid lengths, HMG domain position, UTR length and expression abundance, the four Asc_Sox9 genes suggested putatively novel transcripts in A. schrenckii. Figure 6 shows a schematic diagram of the gene structure and expression abundance (FPKM levels) among the Asc_Sox9-1-4 genes.
The phylogenetic tree was constructed from the alignment of the amino acid sequences in the four Asc_ Sox9-1-4 genes with those from forty-six other animal species from five classes, including Mammalia, Aves, Amphibia, Reptilias and Osteichthyes (Supplementary Table 9). From Fig. 7, we found that all four Asc_Sox9-1-4 genes and those of other sturgeon species were perfectly clustered into one group. Obviously, Asc_Sox9-1 gene was closer to that of A. baerii; however, it was genetically distant from Asc_Sox9-3, Asc_Sox9-2 and Asc_Sox9-4. As expected, compared with the other four classes, the sturgeon Sox9 genes had the closest evolutionary relationship with other teleost fish species among all the five classes. We also found differences between the Sox9 gene tree and the species tree, which suggests that Sox9 is under evolutionary selection pressure.
Meanwhile, Sox9 genes from five different classes were clustered together into five groups, which indicates that their functions may be relatively conserved in vertebrates.

Discussion
Female sturgeons are more valuable than males due to the valuable caviar produced by their ovaries; consequently identifying sturgeon sex as early as possible and mastering the regulation technology of female gonad development could reduce production costs for enterprises. However, we still have poor knowledge for the regulatory mechanisms of reproductive processes, for example sex determination and differentiation, gonad development and gametogenesis. Amur sturgeon (A. schrenckii) is a critically endangered fish in the Acipenseridae family distributed in the Amur river in China and Russia [45]. Artificial reproduction of A. schrenckii began in the 1930s in China, and it has been the most popular sturgeon aquaculture species since the early 1990s. Currently, A. schrenckii has been one of the dominant caviar productions of farmed sturgeons and the popular crossing parents for sturgeon aquaculture in China [46,47]. Here, the study of searching for genes involved in early gametogenesis and regulation mechanisms of reproduction control of A. schrenckii can provide practice application and theory basis. In the present study, we first produced high confidence and full-length transcriptome data from two independent types of gonad tissues (three testes and four ovaries), and maximized transcript diversity using the PacBio Sequel sequencing approach. As expected, a large naccarii [9], A. sinensis [10], and A. schrenckii [42].
Here, a total of 154,006 (93.55%) out of 164,618 fulllength unigenes were successfully annotated as known homologous genes using seven well-curated databases. Due to the absence of reference genomic information, the remaining 10,612 full-length unigenes may suggest putative novel genes in sturgeon. The testes tissue of 3-year-old A. schrenckii individuals was active in the mitotic proliferation stage of spermatogonia. In teleost fish, R-spondin 1 (Rspo1) expression was upregulated just before meiotic initiation in both the ovary and testis during the early developmental stages and the deficiency of Rspo1 was intriguingly found to cause a delay in spermatogenesis in XY fish [48]. In human, two estrogen receptor β (ERβ) wild-type transcript variants suggested specific functions in spermatogenesis due to their expression mainly located in somatic cells and primary spermatocytes [49]. The 11-betahydroxysteroid dehydrogenase (Hsd11b2) was expressed higher in male than female of 2-year-old A. ruthenus [50]. ATRX protein was present in adult human and rat testis and was expressed in spermatogonia, early meiotic spermatocytes and somatic cells [51]. Many members of spermatogenesis associated protein family genes (Sap) are identified in testes and play important roles in the spermatogenesis process in vertebrates including fish, for example, Sap2 [52,53], Sap4 [54], Sap6 [55], Sap22 [56], and so on. In our study, above mentioned genes were also identified to be sex-significant expression genes in testes, which suggest they may paly conserved role in the early spermatogenesis of sturgeon.
Morphological observations indicated that ovary tissue was active in the stage of ovarian follicles development, when it filled with growing oocytes with different diameter sizes and abundant nucleoli in 3-year-old A. schrenckii. The forkhead transcription factor (Foxl2) plays an essential role in early ovarian development, subsequent maintenance of female trait and reproduction function. The knockout of Foxl2 was reported to give rise to complete female to male sex reversal in mammals and teleost fish [57,58]. In fish, Foxl2 may play an important role in ovarian differentiation by maintaining cyp19a (cytochrome P450 1A) expression and antagonizing the expression of Dmrt1(double sex and mab-3 related transcription factor 1) [59]. In the present study, foxl2 and cyp19a were sexual dimorphic expression pattern and significantly higher expressed in the ovaries The relative expression level of genes in ovary compared to that in testis. OV ovary, TE testis b "inf" indicates tissue-specific expression pattern compared to the testes. Considering their significant role in the ovary, it would be necessary to explore the function of foxl2 and cyp19a further in the A. schrenckii. As the markers of undifferentiated spermatogonia, the expression patterns of Oct4 and Sox3 were found to have differentially expressed in the ovaries than that in the testes of the A. schrenckii transcriptome. The similar patterns were also reported in the previous Chinese sturgeon [10], which maybe signify a novel function in sturgeon. At the molecular level, functional assignment indicated that the abundant transcript numbers and DEGs were mainly distributed in directly ovaryrelated KEGG pathways, including oocyte meiosis, progesterone-mediated oocyte maturation, oxytocin signaling pathway, estrogen signaling pathway, prolactin signaling pathway and ovarian steroidogenesis. Above results may suggest that the gonads of 3-yearold female Amur sturgeon are still in the core development stage for launching reproduction-related function activity. In mammals, the TGF-β signaling pathway has been shown to play important roles in the development of both ovarian and testicular functions [60,61]. Moreover, the identification of Amh and Gsdf genes indicated that the TGF-β signaling pathway played a critical role in gonadal differentiation of teleost fish [62,63]. In the present study, Amh and Gsdf, the two TGF-β subfamily numbers, were more significantly expressed in the testis compared to the ovary; in particular, the Amh gene had a testis-specific expression pattern. Therefore, whether the TGF-β signaling pathway is involved in sturgeon gonadal differentiation and how it works are worth further investigation.
In mammalian Mus musculus, the transcription factor Dmrt1 is sufficient to determine male fate, subsequent testicular development and both intrinsic and extrinsic control of spermatogenesis [64,65]. In zebrafish, Dmrt1 functions in male sex determination and testis development [66]. In this study, Dmrt1 expression levels were not found to be significantly different between the testes and ovaries of A. schrenckii. A similar result was also observed in Chinese sturgeon [10] and the starlet A. ruthenus [43]. Coincidentally, Dmrt1 was reported that it does not participate in the initial steps of gonad differentiation in Siberian sturgeon [67] or Russian sturgeon [13]. Recent study indicated that Dmrt1 proteins were abundantly expressed in spermatocytes and spermatids, but not in type A spermatogonia of the Japanese eel [68]. Therefore, Dmrt1 expression also may be the stage-specific manner due to the pre-meiotic testes of 3-years-old A.schrenckii. However, another DM domain gene, DmrtB1, was found to be differentially expressed in the testes and ovaries of Amur sturgeon. DmrtB1 has been reported to play a pivotal role in coordinating the transition between mitosis and meiosis in murine germ cells and has a relevant role in the entry of spermatogonia into meiosis in humans [69]. Thus, it would be worth investigating the special role that DmrtB1 plays during the spermatogenesis of Amur sturgeon. Although sturgeons have Müllerian ducts, which different from the teleost fish, anti-müllerian hormone (Amh) is reported to be conserved in a wide range of fish species and has a possible regulatory interaction with sex steroids and gonadotropic hormones in gonad development in fishes [70]. For example, the expression of Amh started in the gonads before sex differentiation, and its levels surged in the differentiated testes in Danio rerio [71]. An Amh testis-specific expression pattern was also detected in this study, strengthening the specific expression pattern in differentiated testes of sturgeon such as A. baeri [67] and A. ruthenus [50]. In teleost fish, a high diversity of sex candidate genes has been reported, for example Gsdf in Oryzias luzonensis [63] and Amhy (a duplicated copy of Amh) in the Patagonian pejerrey (Odontesthes hatcheri) [62]. Recent studies have identified major candidate genes for sex determination and differentiation based on conserved molecular mechanisms in many developmental events for vertebrate taxa, but the results vary among different species of sturgeon. For the early gametogenesis related genes, we compared our results with published studies of other sturgeons. We found two distinct characteristics among all the studies. First, two hormone genes, Gnrh and Fsh, were reported to be transcribed in the gonads of 3-year-old A. sinensis [10]. However, no similar hormone genes could be found in the present study, the transcriptome data from 6 m A. naccarii [9], or even in five-year-old A. gueldenstaedtii [13]; we only found their corresponding receptors. Fish gonadal development is regulated by the hypothalamo-pituitary-gonad (HPG) axis. In axis, gonadotropin-releasing hormone (GnRH) expression is generally restricted to hypothalamus and promotes the synthesis and release of FSH and LH. FSH and LH are expressed in the pituitary by gonadotropic ells and activate their receptors and stimulate synthesis of the various sex steroid hormones in the gonads to regulate gametogenesis. That may be the main cause of no their transcripts in gonadal tissues. Second, Sox family numbers were found widely, which suggests a universal regulatory role in sturgeon gonadal differentiation. In this study, we provided an additional 31 full-length cDNA sequences grouped into five novel Sox family numbers for sturgeon species, including Sox5 (7 unigenes), Sox7 (15 unigenes), Sox8 (7 unigenes), Sox10 (1 unigene) and Sox18 (1 unigene). Therefore, a total of 14 Sox family numbers have been identified thus far from transcriptomes of sturgeon gonads.
The transcription factor Sox9, which is both necessary and sufficient for male fate and hence the maintenance of testis function in mammals [72], was found to have significantly higher expression in the testes than that in the ovaries of 3-year-old Amur sturgeon. The Sox9 gene expression characteristics in the other sturgeon species  can be summarized as follows. In Russian sturgeon, Sox9 was expressed at significantly higher levels in the gonads of 2-year-old and 5-year-old males [13]. A similar report was made for 4-year-old stellate sturgeon (A. stellatus) individuals [73] and Siberian sturgeon during the early sex differentiation period [67]. These previous studies may suggest that Sox9 may have a core role in the early spermatogenesis of Amur sturgeon; however, additional studies i.e. function confirmation are necessary to verify the role of Sox9. In our study, we also found four novel Sox9 transcript variants with significantly different UTR region sequence and length. To our knowledge, the UTR region can play regulatory roles in mRNA transcription and translation. For example, the eukaryotic 5′ UTR is critical for ribosome recruitment to the mRNA and to start codon choice and that it plays a major role in the control of translation efficiency and shaping the cellular proteome [74]. The sequence and length of 3'UTR region impact on mRNA expression by association with microRNA (miRNAs). MiRNAs are endogenous, small non-coding RNAs (19-25 nucleotides in length) that functions as post-transcriptional repressors of gene expression by binding to complementary sites of 3'UTR of target mRNA. MiRNAs participate in pervasive biological processes, including growth, development, tumorigenesis [75]. Recent advances have been reported gonadal miRNAs directly control the differential expression of many sex-related genes to be involved in gametogenesis [76]; for example, miR-124 was involved in regulating the fate of developing ovarian cells by preventing the expression of Sox9 in mice [77]. An other study also reported that miRNAs can control mRNA fate by compartmentalization based on the 3'UTR length in male germ cells [78]. The present study is the first to report the four novel Sox9 transcript variants in the gonads of A. schrenckii. Excitedly, the identification of gonadal miRNAs and their differential expression in testis and ovary of A. schrenckii were also reported [79]. Therefore, investigation combined the early gametogenesis related full-length genes with gonadal miRNAs may help us to further understand the roles of reproduction regulation mechanism in sturgeon. Isoform identification from the PacBio Iso-Seq strategy is one of is extremely important advantages. When using short read RNA-Seq strategies, extensive alternative prediction is impractical and high variability of isoform expression quantification is impossible in sturgeon without a true genome reference. However, the PacBio Iso-Seq conveniently provides the ability to finding greater numbers of AS events in genes of many species, even those of reference-free species [23,36]. From the full-length gonad transcriptome data, we detected a huge number (236,672, involving 36,522 nonredundant full-length unigenes) of AS events in A. schrenckii. We also revealed that these AS events are universal, involving early gametogenesis related genes and widely existing in early gametogenesis related GO terms and KEGG pathways, which most likely corresponds to the complexity of the early gametogenesis function. For example, Vasa was proven to be important role in animal reproduction. The absence of functional Vasa leads to sterility, but the penetrance of this phenotype vary by sex. In Drosophila, Vasa deficiency results in female-specific sterility [80]. In mice and zebrafish, Vasa mutants have led to male-specific sterility [81,82]. However, the regulation role of Vasa in gametogenesis is still unclear and previous reports mainly rely on expression studies. Vasa expression is restricted to only in both the ovary and testis with sexually dimorphic expression pattern. During gametogenesis, the Vasa protein was dynamically expressed in differentiating germ cells at different stages in adult gonads of the anadromous fish [83]. In A. dabrynaus, Vasa mRNA and protein displayed mitotic and meiotic expression in females, and Fig. 7 The phylogenetic tree of the Sox9 protein sequences was constructed using the neighbor-joining method. The node values represent the percent bootstrap confidence level derived from replicates. The accession numbers of the Sox9 proteins are shown in Supplementary Table 9. The five classes are comprised of Mammalia, Aves, amphibian, Reptilian and Osteichthyes. Meanwhile, Sox2 from zebrafish Danio rerio (accession number: BAE48583.1) was chosen as the out-group protein sequence mainly showed mitotic expression in males [84]. In our study, Vasa expression in both ovary and testis with sexually dimorphic pattern was consolidated in A. schrenckii, which maybe suggest the potential role in gametogenesis of sturgeon. Meanwhile, Vasa had relatively higher expression levels in testis than that in ovary, which is contrary with the ovary-biased pattern in Japanses flounder [85]. Importantly, we also revealed the four isoform variants of Vasa, which may imply having different biology functions during sturgeon early gametogenesis. Recent evidences indicate that different splice variants of Vasa exist in many species, including bovine [86] and Chinese mitten crab [87]. The qPCR further analysis indicated that the splicing variants of Vasa had different relative proportions during bull spermatogenesis. As 3-yearold A. schrenckill individuals were in the mitotic proliferation stage of spermatogonia, the testis-biased expression of Vasa suggests its roles in spermatogenesis, and further studies are required to investigate.
In summary, we combined PacBio Iso-Seq with Illumina short-read sequencing methods to conduct a comprehensive transcriptome analysis of the gonads of the Amur sturgeon, A. schrenckii. This approach enabled the generation of full-length transcripts as well as related analysis, that is, efficient gene annotation, alternative splicing, long noncoding RNAs and transcript factors. More importantly, isoform variants and expression profiles survey of the early gametogenesis related molecules of A. schrenckii contributed to a comprehensive insight into the early gametogenesis process and the reproduction regulation mechanisms of Amur sturgeon. Therefore, our study provides a valuable resource-a comprehensive full-length transcript set for genomiclevel reference-which is both interesting and worthy of further in-depth studies in sturgeon.

Conclusions
The present study provides the new genetic resources of full-length gonad transcriptome data and comparative transcriptome information of the gonads of the Amur sturgeon (A. schrenckii), an importantly economic aquaculture sturgeon species. A total of 164,618 high-quality nonredundant full-length transcripts (unigenes) generated from 50.04 G subread bases were herein produced with an average length of 2782 bp, which represents a significant advance in sturgeon genetics. The study discovered the number of 60 full-length genes identified to be related to early gametogenesis, further out of the 30 genes showed differential expression in the testes and ovaries suggesting significant function in the early gametogenesis of sturgeon. Interestingly, Amh with testis-specific expression and Gsdf with significantly higher expression in testes than ovaries (fold change > 200) belong to two key numbers of TGF-β superfamily, which may play core regulatory roles in the spermatogenesis of A. schrenckii; while the foxl2 combining with Cyp19a imply significant regulatory role in the oogenesis of A. schrenckii. Meanwhile, the four Vasa isoforms and four novel Sox9 transcript variants also hint function complexity of early gametogenesis of A. schrenckii. Finally, a total of 236,672 AS events, the 10,556 putative lncRNAs and the 4339 predicted TFs were identified to be involved in biological process of early gametogenesis of A. schrenckii. In total, our results provide first fulllength gonad transcription data and information as a genomic-level reference for sturgeon. These importantly provide candidate genes and theoretical basis for further exploration of reproduction regulation of sturgeon. length; FLNC: Full-length, nonchimeric reads; NR: NCBI nonredundant protein sequences; COG: Cluster of Orthologous Groups of proteins; Pfam: Protein family; Swiss-Prot: A manually annotated and reviewed protein sequence database; KEGG: Kyoto Encyclopedia of Genes and Genomes; GO: Gene Ontology; eggnog: Cluster of Orthologous Groups of proteins; FPKM: Expected number of fragments per kilobase of transcript sequence per million base pairs sequenced; DEGs: Differentially expressed genes; FDR: False discovery rate; CNCI: Coding-noncoding-Index; CPC: Coding potential calculator; CPAT: Coding potential assessment tool