Characterization of EST‐SSR markers in bread wheat EST related to drought tolerance and functional analysis of SSR‐containing unigenes

Bread wheat (Triticum aestivum) is an important staple food around the world. The enormous volume of the genome of wheat makes it quite slow to progress in traditional scientific research. On the other hand, incessant databases and suitable tools on web sites make progress in wheat research quicker and easier. Drought is a major abiotic stress in accordance with weather changes and accelerated increase in drylands. In this study, 9077 ESTs related to drought tolerance in hexaploid wheat were downloaded from NCBI and assembled into 12062 contigs and 4141 singletons. It was found that trinucleotide had the highest frequency 64.71%. Moreover, 53.80% of SSRs found in coding regions in respect of ORFs. The highest amino acids found for tri-and hexanucleotides were Arginine. In addition, 81% of SSR-containing unigenes had one chromosome location and the highest number of loci was found in chromosomes 1B (69). The distribution of genic SSR loci among the 21 wheat chromosomes, the three subgenomes, and the seven homoeologous groups of wheat chromosomes was significant, with P<0.01 indicating a non-random distribution. Functional annotation and characterization of SSR-containing unigenes have been performed. Eighty-six sequences were identified and sorted into 25 putative TF families and establish 166 pathways using KEGG. Primer-BLAST was used to predict the polymorphism, which was 39% of the 63 primer pairs of SSR markers. Our current study attempts to help farmers in wheat breeding programs to have drought-tolerant accessions, particularly in developing countries.


Introduction
Bread wheat (Triticum aestivum) is the most widely grown cereal crop around the world and is cultivated extensively in dry regions. It resulted from hybridization between cultivated tetraploid wheat (AABB, T. dicoccoides) and diploid goatgrass (DD, Aegilops tauschii) about 8,000 years ago [1], so, it is composed of three subgenomes, A, B, and D, every subgenome has 7 chromosomes, n = 21. Therefore, seven homoeologous groups are organized; each homoeologous set has three similarly associated chromosomes from each subgenome. Moreover, the A, B, and D subgenomes have been determined their progenitors [2]. Bread wheat is the enormous volume of the wheat genome about 17 GB [3] which is a consequence of polyploidy and the existence of areas with repeat motifs, make it hard to sequence the complete genome until recently, so presently there is an opportunity for the research to make real advances in this field [4].

Retrieving data, cleaning, and assemblage
A total of 9077 TSEs were downloaded from the GeneBank database, linked to Triticum aestivum and some wild relatives, expressed in dry conditions from the diverging tissue (leaf, root and seed) of the plant Then, the data was pre-processed to reduce the errors. The sequences were cleaned and trimmed for contamination, poor quality and short sequences using the EGassembler web tool [21]. Next, in order to reduce redundancy, the processed ESTs were clustered and assembled into contigs using the Cap3 [22] program. The remaining ESTs without assembly resulted in singletons.

Mining of SSR, the position of SSR in respect to ORF, and amino acids
All contigs and singletons engendered in the study were used to search for SSR using the MISA Perl script (http:// pgrc.ipk-gatersleben.de/misa/). The parameters used were 5, 4, 4, 3, 3 repeat units for di-, tri-, tetra-, penta-, and hexanucleotide, respectively. The ORF Finder in the NCBI detected the position of the SSRs corresponding to the coding regions. In addition, all amino acids for tri-and hexanucleotide SSR have been identified.

Functional characterization of unigenes containing SSR
All SSR-containing unigenes were analyzed using the PlantTFcat online tool (http://plantgrn.noble.org/PlantTFcat) for TFs. Gene ontology (GO) was analyzed using Blast2GO v 2.5 [23]. Whereas, these unigenes were first translated in all reading frames. Then, compared with the protein sequence database (NCBI nr) to identify potential translation products using BLASTX. All unigenic sequences containing SSR homologous with annotated proteins in the nr database have been selected for functional characterization. According to the GO vocabulary, the EST-contigs sequences were then categorized into three groups, i.e. molecular function, cellular component and biological process. While, KEGG pathways were identified by the online KAAS [24], (http://www.genome.jp/kegg/kaas) using SBH (single-directional best hit) method.

Chromosome localization
The positions of all unigenes containing SSR were determined using the downloaded IWGSC Reference Sequence v1.0 assembly and a stand-alone blat [25] Blat produces alignment at the DNA level between two sequences that were of 90% or higher similarity, but which may contain large inserts. The criteria used here for the blat were -minidentity=97% and -minscore=400. In addition, it was excluded all aliments there match less than 100 bp. For the primers design, we chose, at first, the unigenes related to functions as "response to stimulation, enzymes, transcription factors, and transport".

Random Distribution of EST-SSR mapped loci using Statistical Tests
The predictable number of genic SSR loci on the wheat chromosomes was evaluated on the knowledge of the size of genome assembly Ref v1. The equation has been used X 2 = ∑[ (Oi -Ei)2 / Ei ], where Oi is the observed number of SSR loci and Ei is the expected SSR loci.
To predict primers polymorphism, the online Primer-BLAST tool was used (https://www.ncbi.nlm.nih.gov/tools/primer-blast/). The parameters was sated as follows: primer melting temperature ranged from 50 to 60 and 55 as an optimal tm with a difference of 2 max tm, ' nr ' as a database and Triticum (taxid:4564), Triticum aestivum (taxid:4565) and Triticum aestivum subsp. aestivum (taxid:4565) as an organisms. The mismatch at the 3 ' end was set as 3 in the last bps, and the maximum target size was detected as 350 bp. Primer Size was set from 18 to 24 and 20 as optimal size, GC primer content as (30-70 %) and max self-complementarity as 6. The rest of the parameters remained as standard. By this step in the validation of polymorphism, the expected results of the proposed drought-related primers have been computationally improved.

Assembling of EST sequences
A total of 9077 EST sequences, related to drought stress tolerant of wheat, were downloaded from GeneBank. The average length of these ESTs was 552 bp. The initial sequences analysis showed a total length of 5010591 bp and a total GC count of non-redundant EST set was 55%. Out of these 9077 EST sequences, 18 sequences discarded through the sequence cleaning process; the total repetitive elements were masked were 163114 bp i.e. about 3.26 % of the total size of the query sequence. The remaining ESTs sequences for Triticum aestivum were assembled into 12062 contigs and 4141 singletons. Most of the contigs consist of two or three ESTs. The assembled ESTs account for 56.17% of total ESTs. The singletons representing expressed transcripts could not be assembled into larger contigs. These singletons may act as genes where only one mRNA represented one gene.
Mainly, sixty-one types of EST-SSR motifs were recognized in the study (Table 1). In general, the SSRs were found to be unequally distributed across motif types. The motifs CGC, GGC, CCG, and GCC have the highest frequency of (5.35), (4.66), (4.49), and (3.88) % in trinucleotide respectively, followed by the motifs GC (2.50 %), CG (2.33), and CA (2.07 %) in dinucleotides. The other types of motifs own a frequency of < 3 percentage (Table 1). In addition, the motifs with 4 repeat copy number were the most abundant (53.06%) followed by 5 repeat copy number (26.14%) ( Figure 2). In others studies, SSRs based on the repeat copy number were also usually commenced at 4 [13].

Distribution of SSRs in open reading frames and the amino acids for tri-and hexanucleotide
Of all identified microsatellites, 53.80% were distributed across the middle region or coding regions, while about 34.87% were located near the 5 ' end, and 11.34 % were located near the 3 ' end. In addition, hexanucleotide (68.42%) and tri-nucleotide (61.25%) SSRs were the most frequent in the coding regions, whereas, tetra and SSRs were the least frequent there. In contrast, pentanucleotide (60.71%) and tetranucleotide (50 %) SSRs were the most frequent in UTRs 5'. Each tri-nucleotide or hexanucleotide motif codes for amino acids that have a potential role in the biological activity of protein molecules. Of the 838 tri-and hexanucleotides identified in the present study, 17.45% encode Arginine, 16.78 % encode Alanine, 11.95 % Proline, and 8.99 % Glycine ( Figure 3).

Functional characterization of unigenes containing SSR
Out of 959 unigenes, 854 have homologous with known proteins. The outcome showed that 6.46 % is exclusive to Triticum aestivum and more than 90 % was conserved between the other species like Zea mays, Aegilops tauschii, Oryza sativa, Brachypodium distachyon, and Hordeum vulgare ( Figure 4). The percentages of the conserved regions for Zea mays, Aegilops tauschii, and Oryza sativa were 12.13 %, 7.78 %, and 7.34% respectively. Additionally, these 854 unigenes were submitted to additional annotation using Gene Ontology vocabulary.    Out of 854 unigenes, GO terms were available for 706. It has been observed that 2719 GO term obtained, that means on average 4 for a unigene. Maximum 30 GO terms for one and minimum one GO for 100 unigenes. So, unigenes classified consistent with the GO Annotation into three groups: Cellular Component, Molecular Function, and Biological Process ( Figure 5) according to the level 2.

Chromosome localization
It was found that 929 unigenes had positions on the assembly out of 959 unigenes, and 1242 loci were located on all the 21 wheat chromosomes (in addition on the Un-named sequences) ( Table 3; Figure 8). Of the mapped unigenes, 754 (81%) sequences have one location on a chromosome, 88 (9.5%) sequences have two homo-loci, and 40 (4.3%) sequences have three homo-loci pointed on three homologous chromosomes, while 47 (5%) sequences have multiple loci on nonhomologous chromosomes, which may expose duplications. The chromosomes 1B, 1D, 2A, 2D and 5A had the highest number of loci. In another hand, the 401, 392 and 376 loci, were localized on subgenome A, D and B respectively (Table 3). Table 3. Distribution of unigenes containing SSR across the chromosomes, the three subgenomes, and the seven homologous of bread wheat.

Random Distribution test of EST-SSR Loci on Individual
Chromosomes,

Primers design & validation
Using primer3, 301 primers from 328 SSR-containing unigenes were designed to be related to enzymes, transport, transcription factors, and proteins responsible to stimulations that appear to be more related to drought stress genes. Then, a set of 63 designed drought-related primers were chosen from all 301 designed primers depending on the in-silico PCR results ( Table 3). The selecting process of primers in this study was with respect to gain the expected product size on the predictable position of the related chromosomes and making sure there were no other undesirable sequences could give the same product size, so this EST-SSR primers were more related to genes expressed in the drought stress. Of the 63 primer pairs, 39% expected to produce polymorphic bands. Polymorphism was the difference in length of the predicted size of each expected band from the designed primers (Table 4, Figure 9).

Discussion
Drought is one of the most common environmental stresses affecting plant growth and development. It continues to be a critical challenge for breeders and researchers. Knowledge of functional classification, placement on chromosomes, and the use of appropriate tools to predicate and select unique specific SSR markers may have a positive impact on breeding programs and may dramatically improve the effectiveness of a specific investigation strategy or the inference of markers linked to favorable agronomic features such as drought tolerance.
In this study, the dehydration-related SSR markers were mined and characterized using a computational method using the WGSC Reference Sequence v1.0 hexaploid wheat assembly. According to the Go terms, SSR-containing unigenes have been categorized as cellular components biological processes and molecular function categories ( Figure 5). Among the biological process group, cellular process (72.6 %), metabolic process (71.82 %) and single-organism process (42.6 %) were the most strongly represented categories. The cell (85.5%), cell part (85.5%), organelle (65.55%), membrane (38.45%), membrane part (32.77%), organelle part (28.78%), and macromolecular complex (25.21%) categories within the cellular function categories participated the biggest portion of all annotations. In the molecular function group, binding (59.25%) and catalytic activity (49.35%) were the only two major groups. These results provide a good picture of the fact that the unigenes examined, expressed during drought stress, participate in the various biological interactions of the organization, growth and development process, metabolism and normal death of wheat cells, and are in agreement with a similar wheat study [15]. Drought is known to be regulated at the transcriptional level and TFs have been at the center of interest for improved yields of cultivars since targeting one TF will affect other regulatory aspects of drought tolerant [7]. In this study, the most abundant was C2H2, followed by Hap3/NF-YB and MYB-HB-like. A number of studies showed the importance of these TF families in abiotic stress-tolerant. C2H2 has been shown to play a significant role in biological processes and in reacting to various abiotic stresses including oxidative and drought stress in rice [27]. In the other hand, they found NF-YB family genes are responsive to drought stress in a study on Poplar [28]. Over-expression of OsMYB3R-2 in Arabidopsis OsMYB48-1 and OsMYB48-1 in rice increases resistance to drought stress [29]. In addition, depending on the similarity search (BLASTx), First, the enzymes had the highest frequency in our study were hydrolase and transferase. In wheat, glutathione S-transferase (GST) has been found to be high in drought stress [30]. In a study on wheat endosperm cells that were affected by drought stress, nucleic acid hydrolase activity was higher in dry treatment compared to control [31].
Another approach to identifying gene functions with a concentration on biochemical pathways was used in our study by conducting a single-directional BLAST scan against the KEGG protein database to better clarify the relationship between the putative proteins and the biological functions derived from 714 SSR-containing unigens. According to KEGG findings, 714 unigenes were mapped to 166 expected metabolic pathways ( Table 2); abundance unigenes of "metabolism" were in "carbohydrate metabolism," "energy metabolism" and "amino acid metabolism". This suggests that carbohydrates, energy metabolism and amino acids have been involved under abiotic stress and that are in line with other research [32]. KEGG pathway is a valuable method for predicting potential genes and their functions at the overall transcriptome level.
In our study, the abundance of SSRs (perfect, compound and imperfect) in unigenes was 1 in 4.7 unigenes and 1 in 2.7 kb, compared to 1 in 3.1 Mb in wheat assembly. It is vital to note that the total frequency and frequency of different lengths of SSRs and repeat motifs are reverting to the difference in SSR search tools with different SSR search principles and different datasets used for EST database mining, which change significantly in different studies. For instance, in wheat, the frequency of SSRs in ESTs varied from 1 in 0.74 kb to 1 in 9.2 kb [33].
The trinucleotides had the highest frequency of SSR motifs followed by dinucleotides (Figure 1). In a study of SSR analysis using the database of the whole hexaploid wheat genome, it was observed that di-followed by hexanucleotide were the most abundant. However, many of the previous studies have agreed that trinucleotide are the most abundant motif [15]. The repeated copy number at the locus is unsteady. In fact, the increase or decrease in the number of repeated copies in microsatellites in non-coding regions believed to modify gene regulation, while in coding regions it usually results in alterations in the reading frames resulting in changes in protein products [13].
The relative location of the di-, tri-, tetra-, penta-and hexanucleotide repeats was identified and labelled in regards to the open farm reading (ORF). From this analysis, it was observed that more than half of all microsatellites were distributed along the coding regions, while the remainder were unevenly separated between the 5 ' end and the 3 ' end. This agreed with earlier studies of the genus Prunus [34] and Gentianaceae [35]. In addition, trinucleotides and hexanucleotides have been the most abundant in coding regions. Similar studies have parallel outcomes [9,13,15]. This showed a significant bias in the frequency of SSRs in coding regions and UTRs. The increased rate of SSRs in coding regions indicating their role in protein structure and therefore their function [34]. In addition, many trimeric SSRs in ESTs were correlated with the non-existence of frameshift mutations in coding regions when the length of these SSRs changed [9].
Arginine was the most common amino acid followed by alanine, proline and glycine (Figure 3). In a recent study, the metabolome analysis showed amino acids, organic acids, and sugars as the main metabolites varied in multitude upon to water shortages. In particular, a spring wheat cultivar (Bahar cv) display increased levels of proline, arginine and lysine in drought stress [36]. In addition, a peanut study found that drought increased the amount of arginine in all genotypes [37]. In a study of Triticeae species to compare metabolite profiles of drought stress, alanine was found to increase exposure to drought stress [38]. In addition, proline amino acid is commonly known to occur in higher plants and usually accumulates in large quantities in response to environmental pressures; therefore, proline has advanced positive effects as the root/shoot proportion of wheat germplasm has improved under water-limited stress [39]. The arginine code sequences were CGG, AGA, CGC, CGT, and CGA, both tri-and hexanucleotide. It has been reported that in coding regions, this triple could possibly encode amino acids while in coding regions or UTRs. Such forms of motifs are considered to be amino acid coding sequences that could be used as a source of several functional markers [13].
The average number of mapped loci per genic SSR in the present study on the genome assembly was 1.3 and ranged from 1 to 4; these results are consistent with previous studies [40]. From the mapped genic SSR, most sequences have a single location and are consistent with the knowledge that ESTs usually appear as a single locus [15,40]. While the presence of EST-SSRs with multiple loci is essential and may be referred to the conserved nature of EST-SSRs that tend to occur as homoeologous in two or three genetically similar regions in subgenomes [40]. The presence of multiple positions on non-homoeologous chromosomes (5%) may be represented by repetition. This concept is consistent with the gene duplication reported for the wheat genome [40], but differs from the expected 25 to 30%; and this may return to many purposes, one of which used a different alignment tool (e.g. blat) and used EST return to a specific trait, drought.
In the present study, in drought conditions, the distribution of genic SSR markers to all 21 chromosomes was non-random (P < 0.001) with a maximum of 69 genic SSR loci mapped to chromosome 1B and a minimum of 43 genic SSR loci to chromosome 4B and 7B and 49 genic SSR loci to chromosome 5B. These results are consistent with previous studies in which chromosome 1B was observed to have the maximum number of genic SSR loci [40], while 4B had the minimum number of SSR loci with the lowest genic intensity [40]. In addition, 2D and 5A were found to have high loci numbers in this study, and this is consistent with a previous study on bread wheat under drought where QTLs on chromosomes 5A and 2D were involved in controlling stomatal density and size traits and were useful for molecular breeding [41]. In addition, 5A was found to carry many more QTL for a number of traits within short distances for agronomic traits under drought stress in wheat [42]. While 4B found to have no QTLs for stay-green and agronomic traits in wheat under drought stress [43]. The distribution of the genic SSR loci in the three subgenomes (Table 3.) was also a non-random distribution (P < 0.001) with the maximum number of genic SSR mapped on the subgenome A while the minimum number was on the genome B, although the subgenome B was larger than the subgenome A or the subgenome D [44]. This is not consistent with earlier studies that recorded more EST-SSR loci mapped to the B subgenome compared to those mapped to the A or D subgenome [33]. It can be returned to the disparity between the methods used for mapping sequences found in the SSR, the genome assemblies used, the type of EST datasets, and the reliance on the computational method instead of the wet-lab as in the previous studies. The distribution of genic SSR loci among the seven homoeologous groups of wheat chromosomes was also non-random (P < 0.001). The most loci were found to be in the homologous groups 1, 2, and 5, while the least was in the homologous groups 3, 4 and 7, which tend to be close to the QTL mapping study of traits such as stay-green and agronomic in wheat with multiple water shortage levels [43]. Nevertheless, the distribution of genic SSR loci within the three subgenomes vas the seven homoeologous assemblies was independent (P > 0.05) and is consistent with the other wheat study [40].
The rate of successfully designed primers ranges from 60-90 % for both genomic and genetic SSRs [9]. In our research, variations between the predicted size of the primer3 and that of the wheat genome were identified using in-silico PCR analysis. In these cases, the PCR product was absent, smaller or greater than predicted. These agreements with many similar studies, in which similar results are obtained by testing EST-SSR primers on the wheat genome in the wet lab [9]. Such drought stress-related SSR-markers are known to be useful markers and must be tested to be inserted afterwards in breeding programs for cultivated wheat. The rat of predicted polymorphism in our study was low and that consists with many studies on ginec SSR e.g. [11].
Due to greater conservation of DNA sequences in transcribed regions, EST-SSR primers in crop plants were found to be less polymorphic than genomic SSRs [9]. However, our findings were comparable to other studies [14,45,46]. Here we have tried to overcome one of the common issues with primers designed when performed in a web lab, which is 'primers polymorphism' (Table 4).
Usually, studies give a picture of the SSR-marker and the relative maps dedicating their positions on the chromosomes using real PCR. However, in addition to the positions of high-quality predicted SSR-markers and their polymorphism on one of the newest wheat assemblies, our research provided a comprehensive view of the SSR-containing drought-related sequences and their distribution and the most accurate chromosomal positioning using a wide range of tools.

Conclusion
Due to the size and complexity of the wheat genome, the bioinformatics tools are needed to facilitate and accelerate research, particularly with regard to the availability of databases and resources. A comprehensive survey using computational approaches was performed in this study, using several tools for characterization, detection of random distribution of EST-SSR mapped loci, designed and validation of SSR marker polymorphism related to drought-tolerant. The methods and findings used in the present study may help researchers and breeders accelerate breeding programs by inserting computational way to laboratory work subsequently reducing the time and labor intensive needed to select suitable markers associated with a particular trait.