Bioinformatic Tools In Arabidopsis Research Paper

The genomics era has produced vast amounts of biological data that await examination. In order to “mine” such data effectively, a bioinformatics approach can be utilized to identify genes of interest, subject them to various in silico analyses, and extract relevant biological information on them from various public databases. Examination of such data produces novel insights with respect to the genes in question and can be used to facilitate and guide further research in the field. Such is the case here, where bioinformatics tools were developed to identify, classify, and analyze members of the Hyp-rich glycoprotein (HRGP) superfamily encoded by the Arabidopsis (Arabidopsis thaliana) genome.

HRGPs are a superfamily of plant cell wall proteins that are subdivided into three families, arabinogalactan proteins (AGPs), extensins (EXTs), and Pro-rich proteins (PRPs), and have been extensively reviewed (Showalter, 1993; Kieliszewski and Lamport, 1994; Nothnagel, 1997; Cassab, 1998; José-Estanyol and Puigdomènech, 2000; Seifert and Roberts, 2007). However, it has become increasingly clear that the HRGP superfamily is perhaps better represented as a spectrum of molecules ranging from the highly glycosylated AGPs to the moderately glycosylated EXTs and finally to the lightly glycosylated PRPs. Moreover, hybrid HRGPs, composed of HRGP modules from different families, and chimeric HRGPs, composed of one or more HRGP modules within a non-HRGP protein, also can be considered part of the HRGP superfamily. Given that many HRGPs are composed of repetitive protein sequences, particularly the EXTs and PRPs, and many have low sequence similarity to one another, particularly the AGPs, BLAST searches typically identify only a few closely related family members and do not represent a particularly effective means to identify members of the HRGP superfamily in a comprehensive manner.

Building upon the work of Schultz et al. (2002) that focused on the AGP family, a new bioinformatics software program, BIO OHIO, developed at Ohio University, makes it possible to search all 28,952 proteins encoded by the Arabidopsis genome and identify putative HRGP genes. Two distinct types of searches are possible with this program. First, the program can search for biased amino acid compositions in the genome-encoded protein sequences. For example, classical AGPs can be identified by their biased amino acid compositions of greater than 50% Pro (P), Ala (A), Ser (S), and Thr (T), as indicated by greater than 50% PAST. Similarly, arabinogalactan peptides (AG peptides) are identified by biased amino acid compositions of greater than 35% PAST, but the protein (i.e. peptide) must also be between 50 and 90 amino acids in length. Likewise, PRPs can be identified by a biased amino acid composition of greater than 45% PVKCYT. Second, the program can search for specific amino acid motifs that are commonly found in known HRGPs. For example, SP4 pentapeptide and SP3 tetrapeptide motifs are associated with EXTs, a fasciclin H1 motif is found in fasciclin-like AGPs (FLAs), and PPVX(K/T) (where X is any amino acid) and KKPCPP motifs are found in several known PRPs (Fowler et al., 1999). In addition to searching for HRGPs, the program can analyze proteins identified by a search. For example, the program checks for potential signal peptide sequences and glycosylphosphatidylinositol (GPI) plasma membrane anchor addition sequences, both of which are associated with HRGPs (Showalter, 1993, 2001; Youl et al., 1998; Sherrier et al., 1999; Svetek et al., 1999). Moreover, the program can identify repeated amino acid sequences within the sequence and has the ability to search for biased amino acid compositions within a sliding window of user-defined size, making it possible to identify HRGP domains within a protein sequence.
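The composition-based search described above can be illustrated with a minimal Python sketch. It is not the BIO OHIO implementation; the function names are hypothetical, and the thresholds are applied as "at least" (as in the searches reported in "Results"), although the text also states them as "greater than."

```python
def composition_percent(sequence: str, residues: str) -> float:
    """Percentage of positions in `sequence` occupied by any residue in `residues`."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    wanted = set(residues.upper())
    hits = sum(1 for aa in sequence if aa in wanted)
    return 100.0 * hits / len(sequence)


def composition_classes(sequence: str) -> list[str]:
    """Return the candidate HRGP classes whose composition thresholds are met.

    Thresholds follow the text: 50% PAST (classical AGPs), 35% PAST with a
    length of 50 to 90 residues (AG peptides), and 45% PVKCYT (PRPs).
    A hit is only a candidate; manual inspection is still required.
    """
    past = composition_percent(sequence, "PAST")
    pvkcyt = composition_percent(sequence, "PVKCYT")
    classes = []
    if past >= 50.0:
        classes.append("classical AGP candidate")
    if past >= 35.0 and 50 <= len(sequence) <= 90:
        classes.append("AG peptide candidate")
    if pvkcyt >= 45.0:
        classes.append("PRP candidate")
    return classes
```

A single protein may satisfy more than one rule, which is consistent with the overlap among AGP, EXT, and PRP candidates reported for the individual searches.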

Here, we report on the use of this bioinformatics program in identifying, classifying, and analyzing members of the HRGP superfamily (i.e. AGPs, EXTs, PRPs, hybrid HRGPs, and chimeric HRGPs) in the genetic model plant Arabidopsis. An overview of this bioinformatics approach is presented in Figure 1. In addition, public databases and programs were accessed and utilized to extract relevant biological information on these HRGPs in terms of their expression patterns, most similar sequences via BLAST analysis, available genetic mutants, and coexpressed HRGP, glycosyl transferase (GT), prolyl 4-hydroxylase (P4H), and peroxidase genes in Arabidopsis. This information provides new insight to the HRGP superfamily and can be used by researchers to facilitate and guide further research in the field. Moreover, the bioinformatics tools developed here can be readily applied to protein sequences from other species to analyze their HRGPs or, for that matter, any given protein family by altering the input parameters.

RESULTS

Finding and Classifying AGPs

The BIO OHIO program was used to identify potential classical AGPs, including the Lys-rich classical AGPs, AG peptides, and chimeric AGPs (i.e. FLAs and other chimeric AGPs) from the Arabidopsis proteome (Table I). The program initially identified 64 possible classical AGPs by searching for biased amino acid compositions of at least 50% PAST. Similarly, 86 potential AG peptides were identified by searching for proteins between 50 and 90 amino acids in length with biased amino acid compositions of at least 35% PAST. Finally, 25 potential FLAs were identified by searching for the following fasciclin H1 motif: [MALIT]T[VILS][FLCM][CAVT][PVLIS][GSTKRNDPEIV]+[DNS][DSENAGE]+[ASQM]. The 175 proteins identified by the program were further examined individually to determine if they appeared to be AGPs. The presence of a signal peptide was one such factor, as was the presence and location of AP, PA, SP, and TP repeats, since these dipeptide sequences are often present in known AGPs (Nothnagel, 1997). Finally, the presence of a GPI anchor addition sequence provided additional support, although not all AGPs have this sequence. By these criteria, 64 of the original 175 were classified as AGPs; moreover, they fall into several distinct classes: 20 classical AGPs, three Lys-rich (classical) AGPs, 16 AG peptides, 21 chimeric FLAs, three chimeric plastocyanin AGPs (PAGs), and one other chimeric AGP (Tables I and II). Additionally, one other AGP was documented in the literature, AGP30, a nonclassical or chimeric AGP, but was not identified by the program given that its PAST value of 34% was below the 50% threshold value used by the program (Baldwin et al., 2001; van Hengel and Roberts, 2003). Consequently, this AGP was added to the list of AGPs appearing in Table II but was not counted in Table I. 
In addition, four PRPs (PRP18, PRP5, PRP6, PRP16), 20 EXTs (EXT40, EXT17, EXT38, EXT19, EXT22, EXT18, EXT15, EXT7, EXT9, EXT10, EXT2, EXT11, EXT13, EXT16, EXT6, EXT12, EXT14, EXT8, EXT20, EXT21), and three hybrid AGP/EXTs (HAEs; HAE1, HAE3, HAE4) were identified by the program using the 50% PAST rule; further information on these HRGP sequences is presented below.
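As an illustration of the motif-based part of this search, the fasciclin H1 pattern quoted above can be used directly as a regular expression. The following Python sketch assumes the bracketed pattern follows standard regular-expression character-class syntax; the function names are hypothetical and the code is not taken from BIO OHIO.

```python
import re

# Fasciclin H1 motif exactly as quoted in the text.
FASCICLIN_H1 = re.compile(
    r"[MALIT]T[VILS][FLCM][CAVT][PVLIS][GSTKRNDPEIV]+[DNS][DSENAGE]+[ASQM]"
)

# Dipeptides that are often present in known AGPs.
AGP_DIPEPTIDES = ("AP", "PA", "SP", "TP")


def has_fasciclin_h1(sequence: str) -> bool:
    """True if the fasciclin H1 motif occurs anywhere in the sequence."""
    return FASCICLIN_H1.search(sequence.upper()) is not None


def dipeptide_positions(sequence: str) -> dict[str, list[int]]:
    """Map each AGP dipeptide to its 0-based start positions (overlaps allowed)."""
    seq = sequence.upper()
    return {
        dp: [i for i in range(len(seq) - 1) if seq[i:i + 2] == dp]
        for dp in AGP_DIPEPTIDES
    }
```

Reporting positions rather than counts alone makes it possible to judge the location of the repeats, which was one of the criteria used in the manual examination.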

Some AGPs, particularly chimeric AGPs, can be below the 50% PAST threshold but were identified by searching the Arabidopsis protein database annotations and then subjecting such proteins to further analysis (i.e. searching for signal peptides, AP, PA, SP, and TP repeats, or GPI anchor addition sequences). With this approach, 21 additional AGPs were found, including two classical AGPs (AGP50C and AGP57C), 14 PAGs, and five other chimeric AGPs, including AGP30. The locus identifiers of these sequences are indicated in italics in Table II.

With the addition of these AGPs from the protein database annotations, the total number of potential AGPs became 85 and included 22 classical AGPs, three Lys-rich classical AGPs, 16 AG peptides, 21 chimeric FLAs, 17 chimeric PAGs, and six other chimeric AGPs (Table II). Representative amino acid sequences of these potential AGPs, including the predicted locations of their signal peptides and GPI anchor addition sequences, are displayed in Figure 2 and Supplemental Figure S1. The classical AGPs ranged in size from 87 to 739 amino acids. The majority (19 of 22) were predicted to have a signal peptide, and many (14 of 22) were also predicted to have a GPI anchor. The Lys-rich, classical AGPs ranged in size from 185 to 247 amino acids. All three were predicted to have a signal peptide, but only two were predicted to have a GPI anchor. The AG peptides ranged in size from 58 to 87 amino acids. All 16 AG peptides were predicted to have a signal peptide, but only 12 were predicted to have a GPI anchor. The FLAs ranged in size from 247 to 462 amino acids. The majority (20 of 21) were predicted to have a signal peptide, but only 11 were predicted to have a GPI anchor. The FLAs are a type of chimeric AGP; each FLA contains either one or two AGP domains. Such AGP domains were readily visualized with the BIO OHIO program by utilizing the sliding windows feature to search for biased amino acid sequences within a user-defined amino acid window size (e.g. 80% PAST in a 10-amino acid window) that slides along the protein sequence. Usually, such domains were also apparent by examining the location of the AP, PA, SP, and TP repeat units, which was easily done by the BIO OHIO program. The PAGs ranged in size from 177 to 370 amino acids. The 17 PAGs were all predicted to have a signal peptide, and 16 were predicted to have a GPI anchor. The other chimeric AGPs ranged in size from 222 to 826 amino acids. 
All but one (five of six) of these chimeric AGPs were predicted to have a signal peptide, and only one was predicted to have a GPI anchor as well as a signal peptide.
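The sliding-window search mentioned above (e.g. 80% PAST in a 10-amino acid window) can be sketched as follows. This is an illustrative Python version under stated assumptions, not the BIO OHIO code: the function names are hypothetical, the threshold is treated as "at least," and merging of overlapping windows into domains is an added convenience.

```python
def biased_windows(sequence, residues="PAST", window=10, threshold=80.0):
    """Return 0-based start positions of windows meeting the composition threshold."""
    seq = sequence.upper()
    wanted = set(residues)
    if window <= 0 or len(seq) < window:
        return []
    flags = [1 if aa in wanted else 0 for aa in seq]
    count = sum(flags[:window])  # hits in the first window
    starts = []
    for start in range(len(seq) - window + 1):
        if start > 0:
            # Slide by one: add the residue entering, drop the one leaving.
            count += flags[start + window - 1] - flags[start - 1]
        if 100.0 * count / window >= threshold:
            starts.append(start)
    return starts


def merge_windows(starts, window):
    """Merge overlapping or adjacent windows into half-open (start, end) intervals."""
    domains = []
    for s in starts:
        e = s + window
        if domains and s <= domains[-1][1]:
            domains[-1] = (domains[-1][0], max(domains[-1][1], e))
        else:
            domains.append((s, e))
    return domains
```

Because windows that only partly overlap a biased region can still pass the threshold, the merged intervals may extend a few residues beyond the region itself.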

Figure 2.

Protein sequences encoded by representative AGP gene classes in Arabidopsis. Colored sequences at the N and C termini indicate predicted signal peptide (green) and GPI anchor (light blue) addition sequences if present. AP, PA, SP, and TP repeats (yellow) and Lys-rich regions (olive) are also indicated.

BLAST analysis was also conducted using The Arabidopsis Information Resource (TAIR) WU-Blast 2.0 to identify other potential AGP sequences and to provide insight to AGP sequences with the greatest similarity (Table II; Supplemental Table S1). BLAST searches were initially conducted with the filtering option on, but they were repeated with filtering off for those searches that found no other HRGPs. Such analysis showed that not all AGPs can be found with this method, but it did reveal sequences showing high degrees of similarity. BLAST was most successful for locating other FLAs and PAGs. In other words, a BLAST search using any one FLA sequence found most, but typically not all, other known FLA sequences.

Table I. AGPs identified from the Arabidopsis genome based on biased amino acid compositions, size, and the presence of fasciclin domains

The number in parentheses indicates the number of proteins that had a predicted signal peptide sequence.

Table II. Identification, characterization, and classification of the AGP genes in Arabidopsis

AGP Gene Expression and Coexpressed HRGPs, GTs, P4Hs, and Peroxidases

In order to elucidate patterns of gene expression for these predicted AGPs, three public databases were searched: Genevestigator (https://www.genevestigator.ethz.ch/), the Arabidopsis Membrane Protein Library (http://www.cbs.umn.edu/arabidopsis/), and the Arabidopsis Massively Parallel Signature Sequencing (MPSS) Plus Database (http://mpss.udel.edu/at/). While about half of the AGPs had a broad range of expression throughout the plant, the other half showed organ-specific expression. Notably, several AGPs were specifically or preferentially expressed in the pollen, while others were expressed in roots, stems, leaves, and siliques (Table II; Supplemental Figs. S2–S5). Moreover, in examining the expression levels of all the AGP genes, the ones specifically or preferentially expressed in the pollen were the most highly expressed, as indicated by their high relative signal intensities. Furthermore, there was no observed correlation between organ-specific expression and a particular AGP class or between environmental stress-induced expression and a particular AGP class.

In order to elucidate HRGP gene networks and identify genes involved with AGP biosynthesis, the AGP genes were next examined with respect to coexpressed genes using The Arabidopsis Co-Response Database (http://csbdb.mpimp-golm.mpg.de/csbdb/dbcor/ath.html; Table III; Supplemental Table S2). Unfortunately, 39 of the 85 AGPs had no coexpression data available, so the following information was based on the 46 AGPs for which data were available. In analyzing the data, a focus was placed not only on other HRGPs but on GTs, P4Hs, and peroxidases, since GTs and P4Hs, and possibly peroxidases (Kjellbom et al., 1997), are responsible for posttranslational modification of AGPs. In terms of AGPs being expressed with other HRGPs, a total of 73 HRGPs were coexpressed with one or more AGPs. Among all HRGPs, FLA7 was coexpressed with the most AGPs, a total of 22 different AGPs. Interestingly, several different EXT and PRP genes were also coexpressed with numerous AGP genes. For the GTs, 27 of the 42 members of the GT2 family, 17 of the 42 members of the GT8 family, 11 of the 33 members of the GT47 family, and two of the three members of the GT29 family were coexpressed with various AGPs, to name just a few. Most notably, two members of the GT47 family (At5g22940 and At4g38040) were found to be coexpressed with 17 and 15 AGP genes, respectively. Also notable was the one member of the GT29 family (At1g08660) that was coexpressed with 14 different AGP genes and the three members of the GT8 family (At1g24170, At5g47780, At1g13250) that were coexpressed with 13, 11, and 10 different AGPs, respectively. In conducting this GT analysis, it was observed that not all of the CAZY members are annotated as GTs in the coexpression database. Consequently, coexpressed genes had to be cross-referenced against the gene identifiers listed in the CAZY database. For the P4Hs, five of 13 members of the P4H gene family were coexpressed with various AGPs. 
Among these, one P4H gene (At3g06300 or P4H2) was coexpressed with 10 different AGPs. Many peroxidase genes showed evidence of coexpression. The greatest amount of coexpression was exhibited by At4g26010, which was coexpressed with 13 different AGPs.
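The tallying step behind statements such as "coexpressed with 13 different AGPs," including the cross-referencing against family membership lists, can be sketched in Python as follows. The data layout (a mapping from each query gene to its coexpressed gene identifiers) and the function name are assumptions for illustration; they do not describe the export format of The Arabidopsis Co-Response Database.

```python
from collections import Counter


def partner_counts(coexpressed, family_ids):
    """Count, for each gene in `family_ids`, how many query genes it is coexpressed with.

    coexpressed: dict mapping a query gene ID to an iterable of coexpressed gene IDs.
    family_ids:  set of gene IDs belonging to the family of interest
                 (e.g. identifiers taken from a curated family list).
    """
    counts = Counter()
    for partners in coexpressed.values():
        # set() ensures a partner is counted once per query gene.
        for partner in set(partners) & family_ids:
            counts[partner] += 1
    return counts
```

Filtering by an external identifier list, rather than by database annotation, mirrors the cross-referencing that was needed because not all family members are annotated as such in the coexpression database.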

Table III. HRGPs, GTs, P4Hs, and peroxidases coexpressed with AGPs

AGP Gene Organization and Mutants

Information was extracted from the TAIR and SALK Web sites with regard to the gene structure and available genetic mutants for each of the predicted AGP genes. The AGP genes contained few, if any, introns. Of the 85 AGPs, 46 had no introns and 32 had only one intron (Table II; Supplemental Table S3). One chimeric AGP (At5g21160 or AGP32I), however, was predicted to have 14 introns.

Examination of the various mutant lines available for research showed that nearly 99% (84 of 85) of the AGP genes had one or more mutants available. Of these mutants, 33% were in the promoter region, 19% were in the 5′ untranslated region (UTR), 25% were in an exon, 6% were in an intron, and 17% were in the 3′ UTR (Table II; Supplemental Table S4).

Finding and Classifying EXTs

The BIO OHIO program was used to identify potential EXTs by searching for SP3 and SP4 sequences repeated two or more times (Table IV). The program initially identified 114 and 63 potential EXTs by searching for these tetrapeptide and pentapeptide repeats, respectively.

The 114 and 63 proteins identified by the program were further examined individually to determine if they appeared to be EXTs, with the realization that the 63 proteins are a subset of the 114. The presence of a signal peptide was one such factor, as was the presence and location of SP3, SP4, and SP5 repeats, since these peptide sequences are often present in known EXTs. GPI anchor addition sequences are not known to be associated with EXTs; nonetheless, testing for the presence of such a sequence was performed out of curiosity. By these criteria, 57 of the 114 and 50 of the 63 proteins were classified as EXTs. While the SP4 criteria yielded a high percentage of EXT sequences, they did not locate all potential EXTs; the SP3 criteria found more EXTs but with a higher rate of false positives. Subsequent analysis involved examining the 57 EXT sequences and attempting to classify them. Based upon the repeat sequences found in these EXTs, they were placed into nine classes: three SP5 EXTs, two SP5/SP4 EXTs, 12 SP4 EXTs, two SP4/SP3 EXTs, one SP3 EXT, 12 short EXTs, 11 (chimeric) Leu-rich repeat EXTs (LRXs) that include pollen extensin-like (PEX) proteins, 11 (chimeric) Pro-rich extensin-like receptor kinases (PERKs), and three other chimeric EXTs (Tables IV and V; Fig. 3). YXY repeats were observed in most of the EXT sequences. Such sequences are involved in cross-linking EXTs (Brady et al., 1996, 1998; Schnabelrauch et al., 1996; Held et al., 2004; Cannon et al., 2008). Forty of the 59 EXTs identified contain this YXY sequence. Although YVY is the most common repeat, YIY, YYY, and YAY repeats also occur less frequently. Interestingly, several EXTs have a YPY sequence immediately following the signal peptide.
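The repeat-based EXT search can be illustrated with a short Python sketch. It assumes that SP3 and SP4 denote the tetrapeptide SPPP and the pentapeptide SPPPP, as defined in the introduction, and that occurrences are counted without overlap; the function names are hypothetical and this is not the BIO OHIO code.

```python
import re

# Lookahead so that overlapping Tyr-X-Tyr tripeptides are all reported.
YXY = re.compile(r"(?=(Y[A-Z]Y))")


def ext_search_hits(sequence, min_repeats=2):
    """Report whether SP3 (SPPP) and SP4 (SPPPP) each occur at least `min_repeats` times."""
    seq = sequence.upper()
    return {
        "SP3": seq.count("SPPP") >= min_repeats,   # str.count is non-overlapping
        "SP4": seq.count("SPPPP") >= min_repeats,
    }


def yxy_motifs(sequence):
    """List every YXY tripeptide in order of occurrence, overlaps included."""
    return YXY.findall(sequence.upper())
```

Because every SPPPP contains an SPPP, any protein that passes the SP4 test also passes the SP3 test, which is why the SP4 hits form a subset of the SP3 hits.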

Figure 3.

Protein sequences encoded by representative EXT and hybrid HRGP gene classes in Arabidopsis. Colored sequences at the N and C termini indicate predicted signal peptide (green) and GPI anchor (light blue) addition sequences if present. SP3 (blue), SP4 (red), SP5 (purple), and YXY (dark red) repeats are also indicated. AP, PA, SP, and TP (yellow) repeats are indicated on hybrid HRGPs only.

The Arabidopsis protein database annotations were searched, but no additional EXTs were found beyond those already identified by the program. Additionally, four other PERKs were documented in the literature but were not identified by the program, because three (At5g24400 or PERK2, At1g68690 or PERK9, At4g32710 or PERK14) were not included in the Arabidopsis protein database and one (At1g52290 or PERK15) found in the database contained only one SPP. The PERK14 sequence was subsequently found on the TAIR Web site but lacked SP3/SP4 repeats. Nonetheless, PERK14 and PERK15, being members of the PERK family and having publicly available sequences, were added in italics to the list of EXTs appearing in Table V and subjected to subsequent analyses. PERK2 and PERK9 were described as pseudogenes on the TAIR Web site and had no sequences available. Thus, they were not added to the table or analyzed further. In addition, two AGPs (AGP9C, AGP19K) and four HAEs (HAE1, HAE2, HAE3, HAE4) were identified by the program using the SP3 rule. Analysis of these AGP sequences was already presented in the AGP section above; however, the four hybrid HRGPs were considered here along with the EXT family members.

The three other chimeric EXTs were annotated in the Arabidopsis protein database as late embryogenesis abundant protein (EXT50), expressed protein (EXT51), and plastocyanin-like protein (EXT52). EXT50, EXT51, and EXT52 contained five, seven, and three SP4 repeats, respectively. EXT51 also contained numerous TP and SP repeats, reminiscent of AGPs.

A hybrid HRGP was defined as a protein that contains sequence characteristics of different HRGPs, such as EXT and AGP sequence modules, within the same protein. The four hybrid proteins identified in the EXT search had sequence characteristics of both EXTs and AGPs. Three of these hybrids, HAE1, HAE3, and HAE4, were identified because they passed an EXT test as well as the classical AGP test, having at least 50% PAST and multiple PA and TP repeats. The other hybrid, HAE2, contained two SP4 repeats and one additional SP3 module but did not pass the 50% PAST threshold, having only 43% PAST. Nonetheless, it contained multiple AP, PA, SP, and TP repeats, which are indicative of AGPs.

BLAST analysis was also conducted with each of the EXTs, chimeric EXTs, and HAEs to identify other related sequences and to provide insight to EXT sequences with the greatest similarity (Table V; Supplemental Table S1). Such analysis showed that not all EXTs were found with this method but did reveal sequences showing high degrees of similarity and clearly showed many more potential EXT sequences compared with the results from the similar strategy for analysis of the AGPs. Such BLAST analysis of LRXs and PERKs proved especially effective, as a BLAST query using any one LRX or PERK resulted in the identification of all other members in their respective class. Analysis of the other chimeric EXTs revealed that only EXT52 resulted in BLAST hits; these hits were PAG17, PAG9, and PAG10. This result was expected, since EXT52 contains a plastocyanin domain along with the EXT motifs. BLAST analysis of the At4g11430 hybrid HRGP (HAE3) as the query sequence showed similarity to both AGP and EXT genes, providing support for its identification as a hybrid HRGP. BLAST results for the other HAEs were less informative, with HAE1 showing similarity to no other HRGPs and HAE2 and HAE4 showing similarity to only one PRP and multiple chimeric PRPs, respectively.

As seen in Table V and in Supplemental Figure S6, the 20 SP5, SP5/SP4, SP4, SP4/SP3, and SP3 EXTs ranged in size from 212 to 1,018 amino acids. The majority (17 of 20) were predicted to have a signal peptide, and none was predicted to have a GPI anchor. The 12 short EXTs ranged in size from 96 to 181 amino acids. All but one was predicted to have a signal peptide, and surprisingly, seven were predicted to have a GPI anchor. The 11 LRXs ranged in size from 433 to 956 amino acids and consisted of an N-terminal Leu-rich repeat domain and a C-terminal EXT domain. All but two were predicted to have a signal peptide, and none was predicted to have a GPI anchor. The 13 PERKs ranged in size from 509 to 760 amino acids and consisted of an N-terminal EXT domain and a C-terminal kinase domain. None was predicted to have a signal peptide or a GPI anchor. The three chimeric EXTs contained three to seven diagnostic EXT repeats; two had signal peptides, and none contained GPI anchor addition sequences. The four HAEs contained 219 to 375 amino acids; three had a signal peptide and none had GPI anchor addition sequences. The EXT domains/motifs in the LRXs, PERKs, and other chimeric EXTs as well as the EXT/AGP hybrids were readily visualized with the BIO OHIO program by observing the locations of the SP3, SP4, and SP5 repeat units.

Table IV. EXTs identified from the Arabidopsis genome based on SP3 and SP4 amino acid repeat units

The number in parentheses indicates the number of proteins that had a predicted signal peptide sequence.

Table V. Identification, characterization, and classification of the EXT genes in Arabidopsis

EXT Gene Expression and Coexpressed HRGPs, GTs, P4Hs, and Peroxidases

In order to elucidate patterns of gene expression for these predicted EXTs, including the various chimeric EXTs and four HRGP hybrids, the same three public databases were searched as with the AGPs. While several EXTs had a broad range of expression throughout the plant, most of the EXT genes showed organ-specific expression. Notably, several EXTs were specifically or preferentially expressed in the root (27), while several others were specifically or preferentially expressed in the pollen/stamen (14) or siliques (one; Table V; Supplemental Figs. S7–S10). Moreover, in examining the expression levels of all the EXT genes, many of those specifically or preferentially expressed in the pollen were the most highly expressed ones, as indicated by their high relative signal intensities.

Next, the EXT and hybrid HRGP genes were examined with respect to coexpressed genes (Table VI; Supplemental Table S5). For EXTs, there was no information for 29 out of the 59 genes in The Arabidopsis Co-Response Database, and the four hybrid HRGP genes were also not listed in this database. In analyzing the data, a focus was placed not only on other HRGPs but on GTs, P4Hs, and peroxidases, since GTs, P4Hs, and EXT peroxidases are responsible for posttranslational modification of EXTs; this approach represents one potential avenue to identify genes involved in the posttranslational modification of EXTs. In terms of EXTs being expressed with other HRGPs, a total of 67 HRGPs were coexpressed with one or more EXTs. The most highly coexpressed HRGP was FLA2, which was coexpressed with a total of 15 EXTs, while FLA9 was next on the list, being coexpressed with 14 EXTs. As reported above, FLA2 and FLA9 were also coexpressed with many AGP genes. A number of EXT genes, including EXT9, EXT13, EXT14, EXT6, EXT10, EXT2, and LRX4, were also coexpressed with 10 or more EXT genes.

For the GTs, the most coexpressed was CslB04, a member of the GT2 family, which was coexpressed with nine EXTs. Also highly coexpressed were At1g24170 (GT8), At1g74380 (GT34), At4g15290 (GT2), and At5g22940 (GT47), all of which were coexpressed with seven EXTs. Notably, several of the GTs that were coexpressed with EXTs were also coexpressed with AGPs. For example, one member of the GT8 family, At1g24170, was coexpressed with seven different EXTs and 13 different AGPs. For the P4Hs, four of 13 members of the P4H gene family were coexpressed with various EXTs. Among these, one P4H gene (At3g06300 or P4H2) was coexpressed with six different EXTs. As reported above, this P4H gene was also coexpressed with 10 different AGPs. Many peroxidase genes were coexpressed, but the greatest amount of coexpression was exhibited by At1g05240, At3g49960, At4g26010, At5g17820, and At5g67400, which were all coexpressed with eight different EXTs. Interestingly, these same peroxidase genes were coexpressed with the greatest number of AGP genes as well (Table III). Given that EXTs are known to be cross-linked at YXY sequence motifs by an EXT peroxidase with an acidic pI, it was interesting to observe that the At3g03670-encoded peroxidase, which had a predicted endomembrane localization and a predicted pI of 4.8, was coexpressed with two of the three EXTs containing the greatest numbers of YXY sequence repeats (i.e. EXT20 and EXT21).

Table VI. HRGPs, GTs, P4Hs, and peroxidases coexpressed with EXTs

EXT Gene Organization and Mutants

Information was extracted from the TAIR and SALK Web sites with regard to the gene structure and available genetic mutants for each of the predicted EXTs. With the exception of the PERK genes, EXT genes including the four HRGP hybrid genes contain few, if any, introns (Table V; Supplemental Table S6). Of the 46 non-PERK EXT genes, 36 had no introns and eight had only one or two introns. All four HAEs contained either zero or one intron. One chimeric EXT (At3g11030), however, was predicted to have four introns. In contrast, the PERK genes contained between six and eight introns.

Examination of the various mutant lines available for research showed that all of the EXT genes (including HAEs) had one or more mutants available. Of these mutants, 29% are in the promoter region, 17% are in the 5′ UTR, 30% are in an exon, 4% are in an intron, and 20% are in the 3′ UTR (Table V; Supplemental Table S7).

Finding and Classifying PRPs

The BIO OHIO program was used to identify potential PRPs primarily by searching for proteins with a biased amino acid composition of at least 45% PVKCYT. In addition, PRPs were identified by searching for KKPCPP and PPVX(K/T) sequences repeated two or more times (Fowler et al., 1999). The program initially identified 113 potential PRPs by searching for 45% PVKCYT and identified 13 and two potential PRPs by searching for the PPVX(K/T) and KKPCPP repeats, respectively. Eleven of these 13 potential PRPs and both of these two potential PRPs were also identified with the 45% PVKCYT search criteria (Table VII).
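The two PRP motif searches can be sketched in Python as follows, assuming that X in PPVX(K/T) stands for any single amino acid and that repeats are counted without overlap. The function name is hypothetical and the code is an illustration rather than the BIO OHIO implementation.

```python
import re

PPVXKT = re.compile(r"PPV[A-Z][KT]")  # PPVX(K/T), X = any amino acid
KKPCPP = re.compile(r"KKPCPP")


def prp_motif_hits(sequence, min_repeats=2):
    """Report whether each PRP motif occurs at least `min_repeats` times."""
    seq = sequence.upper()
    return {
        "PPVX(K/T)": len(PPVXKT.findall(seq)) >= min_repeats,
        "KKPCPP": len(KKPCPP.findall(seq)) >= min_repeats,
    }
```

A motif hit, like a composition hit, only nominates a candidate; signal peptide prediction and inspection of PPV repeats were still used to decide whether a protein was classified as a PRP.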

The 113 proteins identified by the program were further examined individually to determine if they appeared to be PRPs. The presence of a signal peptide was one such factor, as was the presence and location of PPV repeats, since these peptide sequences are often present in known PRPs. The PRPs, like the EXTs, are not known to contain GPI anchor addition sequences, but the presence of such sequences was queried nonetheless. By these criteria, 15 of the 113 were classified as PRPs. The 45% PVKCYT search criteria failed to find all the potential PRP sequences and had a high rate of false positives. In addition to the 15 PRPs, nine AGPs (AGP45P, AGP56C, AGP9C, AGP7C, AGP4C, AGP18K, AGP19K, AGP30I, AGP33I), 31 EXTs (EXT40, EXT17, EXT32, EXT37, EXT41, LRX3, LRX1, EXT39, EXT20, EXT21, EXT3/5, EXT8, EXT7, EXT35, EXT9, EXT10, EXT2, EXT11, EXT13, EXT16, EXT15, EXT18, EXT1/4, EXT22, EXT19, EXT30, PEX3, EXT6, EXT12, EXT14, EXT51), and three hybrid HRGPs (HAE2, HAE3, HAE4) were found with the 45% PVKCYT search. In addition, two AGPs (AGP4C, AGP9C), one EXT (EXT1/4), and one hybrid HRGP (HAE3) were found with the two PPVX(K/T) repeat search; further information on these sequences was presented in the AGP and EXT sections above. Three additional PRPs (PRP8, PRP9, PRP11) did not pass the biased amino acid test but were found instead by a database annotation search. The locus identifiers of these sequences are indicated in italics in Table VIII. With these additional PRPs, 18 total PRPs were found and subjected to further analysis. Six of the 18 PRPs contained a non-HRGP domain along with a PRP domain and thus were classified as chimeric PRPs. The remaining 12 PRPs were not divided further into subclasses (Table VIII). Representative sequences of these two classes of PRPs are shown in Figure 4.

Figure 4.

Protein sequences encoded by representative PRP gene classes in Arabidopsis. Colored sequences at the N terminus indicate predicted signal peptide (green). PPVX(K/T) (gray), KKPCPP (teal), and PPV (pink) repeats are also indicated.

BLAST analysis was conducted to identify other potential PRP sequences and to provide insight to PRP sequences with the greatest similarity (Table VIII

Table VII. PRPs identified from the Arabidopsis genome based on biased amino acid composition and repeat units

The number in parentheses indicates the number of proteins that had a predicted signal peptide sequence.

Table VIII. Identification, characterization, and classification of the PRP genes in Arabidopsis
  • American Society of Plant Biologists

With more and more high-throughput data becoming available, scientists are faced with the challenge of developing or applying intelligent software to extract essential information from large-scale data sets. If used in a smart way, some bioinformatic programs can aid in many ways to elucidate the function of a gene of interest, including modes of regulation and synthesis, its posttranslational modifications and potential interaction partners, and last but not least processes that are regulated by its gene products. Examples of combinatory applications of bioinformatic tools that lead to the generation (and subsequent confirmation) of hypotheses (Table I) are described below, with a focus on the deciphering of cellular processes regulated by mitogen-activated protein kinase (MAPK) cascades in Arabidopsis (Arabidopsis thaliana).

Plants need to cope with a wide range of challenging environmental conditions. The successful adaptation/response to such stresses requires the efficient and specific transduction of environmental signals. In stress signal transduction, a prominent role is played by MAPK cascades, which minimally consist of a MAPK kinase kinase (MAPKKK), a MAPK kinase (MAPKK), and a MAPK. Via a phosphorelay mechanism, these modules transduce incoming signals to activate MAPKs that subsequently phosphorylate specific target proteins (for review, see Colcombet and Hirt, 2008; Pitzschke et al., 2009c). So far, experimental evidence exists for only a very few MAPK substrates, but a proteomic phosphoarray approach suggests that transcription factors (TFs) are the major targets of MAPKs (Popescu et al., 2009). Phosphorylation of TFs can potentially alter their subcellular localization, protein stability, or DNA-binding activity. MAPK cascades may thus be primary regulators of stimulus-dependent adaptation of gene expression. The Arabidopsis genome encodes 60 to 80 MAPKKKs, 10 MAPKKs, and 20 MAPKs. The present challenge is to elucidate (1) which of the thousands of theoretically possible signaling modules are indeed formed, (2) which stimuli are conveyed, (3) which targets are addressed, and (4) what is the biological role of the respective signaling modules.

COMPARISON OF EXPRESSION PROFILES

One approach is to use correlative transcriptome analysis as a relatively unbiased technique. Hereby, microarray profiles of signaling cascade mutants can be compared from a wide range of organisms, e.g. by using the Genevestigator tool (https://www.genevestigator.com/gv/index.jsp). Mutants whose transcriptome profiles significantly overlap are likely to act in common signaling cascades. The extent of such overlaps can be conveniently visualized in Venn diagrams, e.g. using the tool at http://www.pangloss.com/seidel/Protocols/venn4.cgi, where expression profiles of up to four mutants can be compared. The program also generates lists of gene IDs occurring in two, three, or four entered data sets. Further inspection of the list of commonly regulated genes can give indications on the processes controlled by a theoretical signaling module, e.g. using Genevestigator—a rich source for transcriptome data on spatio-temporal expression patterns, mutant profiles, and responses to numerous treatments/growth conditions.
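The overlap analysis behind such a Venn diagram can be sketched in a few lines. The example below is only an illustration: the mutant names and gene IDs are placeholders, not real data, and the function reports the genes shared by every data set in each combination, which may differ from how a particular Venn tool partitions its lists.

```python
from itertools import combinations

def shared_genes(profiles):
    """Map each combination of two or more data sets to the gene IDs they all share."""
    overlaps = {}
    names = sorted(profiles)
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            # Intersect the gene sets of all data sets in this combination.
            overlaps[combo] = set.intersection(*(profiles[name] for name in combo))
    return overlaps

# Hypothetical lists of differentially expressed gene IDs (placeholders).
profiles = {
    "mutant_a": {"gene1", "gene2", "gene3", "gene4"},
    "mutant_b": {"gene2", "gene3", "gene4", "gene5"},
    "mutant_c": {"gene3", "gene4", "gene6"},
}

overlaps = shared_genes(profiles)
for combo, genes in overlaps.items():
    print(" & ".join(combo), sorted(genes))
```

A large intersection relative to the size of the individual lists is what would suggest that the mutated genes act in a common cascade.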

The following example emphasizes the consistencies with respect to similarities in expression profiles, phenotype, and hormone accumulation and thus documents the robustness and usefulness of transcriptome-based approaches. Rather than confirming correlations predicted from experimental results, with comparatively little effort such tools can generate reasonable hypotheses, which can subsequently be experimentally validated.

EXAMPLE: MEKK1-MKK1/2-MPK4 AND BEYOND

MEKK1-MKK1/2-MPK4 engage in a signaling cascade that is activated in response to pathogen attack (Gao et al., 2008). Bimolecular fluorescence complementation analysis showed that both MPK4 and MEKK1 interact with MKK1 and MKK2 (Gao et al., 2008). mekk1, mpk4, and mkk1/mkk2 double knockout mutants show spontaneous cell lesions and highly elevated levels of reactive oxygen species. Moreover, they display a severely dwarfed phenotype, which is correlated with the strong accumulation of salicylic acid (SA), a major hormone in biotic pathogen responses. Accordingly, the sensitivity to the plant pathogen Pseudomonas syringae is reduced in these pathway mutants. For these reasons, the MEKK1-MKK1/2-MPK4 cascade has been ascribed a role as a negative regulator of innate immune responses in plants (Gao et al., 2008).

For all mutants affected in this MAPK module, transcriptome analyses have been performed (Qiu et al., 2008a; Pitzschke et al., 2009a). Indeed, the gene expression profiles of these mutants are highly similar. Consistent with the hierarchical order in the signaling cascade, mekk1 shows the largest set of differentially regulated genes, followed by mkk1/2 and eventually mpk4. Moreover, many of the common differentially regulated genes are known to be SA-responsive genes and/or are associated with redox regulation (Qiu et al., 2008a; Pitzschke et al., 2009a). In agreement with the partial redundancy of MKK1 and MKK2, the expression profiles of mkk1 or mkk2 single mutants hardly overlap with those of mekk1, mkk1/mkk2, and mpk4 mutants (Gao et al., 2008). Microarray data-based online tools (e.g. https://www.genevestigator.com/gv/index.jsp) also reveal a strong correlation of transcriptome profiles of mekk1, mkk1/mkk2, and mpk4 with several other mutants, such as constitutive expression of PR genes5 (cpr5) and nonexpressor of pathogenesis-related genes1 (npr1), suggesting further commonalities between these mutants. A more targeted bioinformatic approach of comparative transcriptome studies, Functional Associations by Response Overlap, has also highlighted the relatedness of mekk1, mkk1/2, and mpk4 profiles with those of cpr5 and npr1 (Nielsen et al., 2007). Indeed, cpr5 and npr1 mutants are also dwarfed and have highly elevated SA levels (Cao et al., 1994; Bowling et al., 1997).

An interesting characteristic of many SA-accumulating mutants is that their dwarfed phenotype (often associated with sterility/poor seed production) can be rescued by growing these plants at elevated temperatures, in line with the observed negative correlation of the heat-induced and the mpk4 mutant gene expression profiles (revealed by Functional Associations by Response Overlap analysis; Nielsen et al., 2007). Applying this knowledge to lines of interest may assist the positioning of the corresponding gene in the signaling cascade and can help to yield a larger pool of precious seeds through adjustment of growth conditions.

Web-based tools that integrate large sets of microarrays have the potential to reveal novel correlations between responses. To give an example, we observed a strong negative correlation between the expression response to SA and CO2 by Genevestigator analysis. It may therefore be worth testing the CO2 response of mekk1, mkk1/2, and mpk4 mutants with respect to phenotype, SA levels, and transcriptional changes. Likewise, this observation may also indicate that increasing environmental pollution (CO2) renders plants more susceptible to pathogen attack and a recent study provides experimental evidence for this in silico-based assumption (Lake and Wade, 2009).

HOW TO GO FROM GENE TO FUNCTION

To understand the functional significance of gene expression profiles displayed by a mutant of interest, a search for statistically overrepresented “functional” and “cellular compartment” terms, using the gene ontology (GO) tool (e.g. http://www.arabidopsis.org/tools/bulk/go/index.jsp), is another promising approach. Not unexpectedly, in our example, this tool detects an enrichment of the GO terms “stress-responsive” and “transcription factor activity” in the list of mekk1, mkk1/2, and mpk4 up-regulated genes. Moreover, GO term analysis revealed that among the genes down-regulated in mekk1 and mpk4 those encoding plastidic or chloroplastic proteins are significantly overrepresented (Pitzschke et al., 2009a), which may indicate that the corresponding kinases also regulate processes related to photosynthesis to prevent further production of reactive oxygen species.
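Whether a GO term is statistically overrepresented in a gene list is commonly judged with a hypergeometric test. The sketch below assumes that test; the statistic used by the TAIR tool is not stated here, and the function name and the counts are hypothetical.

```python
from math import comb

def enrichment_p_value(genome_size, term_size, list_size, term_hits):
    """Upper-tail hypergeometric probability of finding at least term_hits
    annotated genes in a list of list_size genes drawn from the genome."""
    upper = min(term_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(genome_size - term_size, list_size - k)
        for k in range(term_hits, upper + 1)
    )
    return tail / comb(genome_size, list_size)

# Hypothetical counts, chosen only to illustrate the call.
p = enrichment_p_value(genome_size=10, term_size=4, list_size=3, term_hits=2)
print(f"P(at least 2 annotated genes in the list) = {p:.4f}")
```

When many GO terms are tested at once, the resulting P values would additionally need a multiple-testing correction.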

HOW TO IDENTIFY POTENTIALLY COREGULATED GENES—THE ARABIDOPSIS CHROMOSOME MAP TOOL

The highly user-friendly setup and the diversity of tools provided by The Arabidopsis Information Resource (TAIR) enable the researcher to subject genes of interest to further bioinformatic analysis. For example, the tool (http://www.arabidopsis.org/jsp/ChromosomeMap/tool.jsp), which displays the position of entered genes on the five Arabidopsis chromosomes, is useful for revealing clustering of genes on a particular chromosomal region. Such local clustering can be an indication for transcriptional coregulation, as, for example, evidenced in a study on the cluster of Arabidopsis RPP5 locus R genes involved in pathogen response (Yi and Richards, 2007).
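Local clustering can also be checked programmatically once gene coordinates are at hand. The following is a minimal sketch under stated assumptions: the coordinates are invented placeholders, and the thresholds (`max_gap`, `min_genes`) are arbitrary illustration values rather than established cutoffs.

```python
from collections import defaultdict

def find_clusters(gene_positions, max_gap=50_000, min_genes=3):
    """Group genes on the same chromosome whose neighboring start positions
    lie no more than max_gap bp apart; keep groups with at least min_genes."""
    by_chrom = defaultdict(list)
    for gene, (chrom, start) in gene_positions.items():
        by_chrom[chrom].append((start, gene))

    clusters = []
    for chrom in sorted(by_chrom):
        run = []
        for start, gene in sorted(by_chrom[chrom]):
            # A gap larger than max_gap closes the current run of genes.
            if run and start - run[-1][0] > max_gap:
                if len(run) >= min_genes:
                    clusters.append((chrom, [g for _, g in run]))
                run = []
            run.append((start, gene))
        if len(run) >= min_genes:
            clusters.append((chrom, [g for _, g in run]))
    return clusters

# Hypothetical gene coordinates (chromosome, start position in bp).
positions = {
    "geneA": ("chr1", 1_000), "geneB": ("chr1", 20_000),
    "geneC": ("chr1", 45_000), "geneD": ("chr1", 900_000),
    "geneE": ("chr4", 5_000), "geneF": ("chr4", 30_000),
}
clusters = find_clusters(positions)
print(clusters)
```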

HOW TO MANAGE THE FLOOD OF DATA

Despite their unquestionable value for driving research progress, whole-genome microarrays have their drawbacks. The experiment as such is very cost intensive. Moreover, the huge data set generated from these arrays often confronts the scientist with the (hard) decision of which subset of differentially expressed genes to investigate further (in silico). Prior selection might therefore be advisable. For those researchers particularly interested in defense-related responses, the small-scale expression array (Sato et al., 2007), which analyzes transcript abundance of 321 genes associated with pathogen response (and a set of genes for normalization), may be a good alternative, either for the experiment as such or as a preselection for the downstream data analysis. Results from a recent miniarray study investigating the response of seven defense-affected mutants (coi1, dde2, ein2, mpk3, pad4, cbp60g-1, and sid2) to P. syringae treatment (Wang et al., 2009) provide a manageable data set for a first comparison with one's own data. The closer the transcriptome profile of a mutant of interest is to that of any of these mutants, the higher the probability that the corresponding proteins engage in a common pathway. Moreover, if a mutant profile shows the strongest overlap with the subset of genes coregulated in several of the other mutants (e.g. subsets of the seven defense-related mutants), the corresponding protein may be an upstream regulator acting before stress signaling bifurcates into individual pathways.

Similar to the above-described miniarray for pathogen response-focused research, a global map of gene expression within 15 different zones of the root, corresponding to cell types and tissues at progressive developmental stages, allows researchers to preselect from large data sets (retrieved from published or their own microarrays) for the analysis of developmental aspects (Birnbaum et al., 2005). Likewise, a report from Leonhardt et al. (2004) provides a list of genes that allows a preselection for guard cell- and mesophyll-expressed genes.

MAPPING OF MICROARRAY DATA ONTO PATHWAYS AND GENETIC MAPS—THE MAPMAN TOOL

One valuable tool, which allows analysis of large data sets and facilitates the assignment of clusters of genes showing major transcriptional changes to areas of function, is MapMan (http://gabi.rzpd.de/projects/MapMan). MapMan groups the genes on the Arabidopsis Affymetrix 22K array into >200 hierarchical categories, thereby providing an overview of various cellular processes. Due to its complexity, we will not describe this tool in detail, but recommend the following articles (Thimm et al., 2004; Usadel et al., 2009). Ideally, after reading these articles, readers should visit MapMan and explore data sets, including those of their own experiments.

Briefly, MapMan allows superimposition of different data sets in overlay plots and thus facilitates the identification of shared features, both globally and on a gene-to-gene basis.

By grouping genes that are probably involved in a common area of function, the MapMan tool can reveal trends toward repression or induction, which might not be obvious at the single gene level. The data sets of responses of interest can originate from your own experiments or can be downloaded from published microarrays. The analysis is also facilitated by the option to focus and visualize certain major pathways, such as “metabolism” or “DNA synthesis.”

The usefulness of MapMan has been demonstrated by the analysis of the Arabidopsis starvation response: The transcript profile of wild-type seedlings harvested at the end of the night was compared either to wild-type seedlings that had been incubated in the dark for an additional 6-h period or to starchless pgm mutants harvested at the end of the night. The MapMan-generated overlay plot revealed strong correlation between these two sugar-depletion conditions. As might be expected, the common transcriptional response indicates repression of photosynthesis and Suc, starch, and lipid synthesis, while genes involved in lipid, amino acid, and carbohydrate breakdown are largely induced. Novel aspects of sugar depletion were also revealed, e.g. a trend to preferential induction of cell wall synthesis-involved genes and repression of genes involved in cell wall breakdown. Furthermore, previous indications on a cross talk between sugar-sensing and abscisic acid- and ethylene-sensing pathways (Rook et al., 2001; Brocard et al., 2002; Leon and Sheen, 2003) could be substantiated. MapMan is being updated continuously, and a conversion of this tool now also allows comparison of responses in different organisms, as demonstrated by the comparison of diurnal changes in Arabidopsis and tomato (Solanum lycopersicum) expression profiles (Urbanczyk-Wochniak et al., 2006).

Despite its unquestionable value, MapMan has the major drawback that many genes cannot be categorized into any MapMan-defined area of function and are therefore not considered in the analysis. For example, in a study on the Arabidopsis response to Fusarium, the majority of genes could not be assigned to any of the known function categories (Yuan et al., 2008). If one's list of genes of interest contains several “genes of unknown function,” a further separate inspection might be advisable. Using ClustalW (http://www.genebee.msu.su/clustal/basic.html), potential phylogenetic relatedness between the corresponding proteins can be detected, which will help to assign putative roles of those proteins in the process that is investigated. This, in turn, can help to refine the MapMan data sets and thus facilitate future analyses.

PREDICTION OF PATHWAY MODULES THROUGH CORRELATION OF GENE EXPRESSION

Genes that are coexpressed over multiple data sets are likely to show functional relatedness. This knowledge may help to predict which proteins act in a common pathway or, as in this particular case, which MAPK signaling components engage in a common module. Using the AttedII tool (http://atted.jp/), lists of genes whose expression correlates with that of a gene of interest can be generated and correlation coefficients calculated. To test its suitability, we queried AttedII to predict components potentially associated with MKK4, a stress-related MAPKK whose transcript abundance changes in response to numerous stimuli (e.g. as evidenced in Genevestigator). AttedII reveals strong gene expression correlation of MKK4 with MKK5 and also with MPK3, but not with any other MAPK signaling component. MKK4 and MKK5 are known to be functionally redundant, to be controlled by the MAPKKK YODA, and to act as upstream regulators of the MAPKs MPK3 and MPK6 (Wang et al., 2007). Neither YODA nor MPK6 is among the predicted MKK4-correlated genes, most likely due to their ubiquitous expression. Further genes correlating with MKK4 expression are promising candidates for encoding additional components involved in MKK4-mediated signal transduction. The above example shows the usefulness of gene expression correlation-based hypothesis generation, but also reveals the limitations of this approach for constitutively expressed genes.
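The ranking of coexpressed genes can be illustrated with a plain Pearson correlation. This is a sketch under assumptions: the measure used by AttedII is not specified here, the gene names and expression values are invented, and a flat profile is scored as 0.0, which mirrors the limitation noted above for constitutively expressed genes.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat profile carries no correlation signal.
        return 0.0
    return cov / sqrt(var_x * var_y)

def rank_coexpressed(expression, query):
    """Rank all other genes by their correlation with the query gene."""
    scores = {
        gene: pearson(expression[query], values)
        for gene, values in expression.items()
        if gene != query
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical expression values across four conditions.
expression = {
    "query": [1.0, 2.0, 3.0, 4.0],
    "partner": [2.0, 4.0, 6.0, 8.0],
    "flat": [5.0, 5.0, 5.0, 5.0],
    "opposite": [4.0, 3.0, 2.0, 1.0],
}
ranked = rank_coexpressed(expression, "query")
for gene, r in ranked:
    print(f"{gene}\t{r:+.2f}")
```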

HOW TO EXPLOIT MICROARRAY DATA FOR THE IDENTIFICATION OF TARGETS OF SIGNALING CASCADES

The rich pool of publicly available microarray data can not only be screened by bioinformatic tools for hypothesizing the composition of signaling pathways, but is also suitable for making predictions on the TFs and promoter elements that control a set of coexpressed genes.

Although each type of signal requires a specific cellular response, the transcript abundance of some genes is altered in response to multiple signals. One example is the identification of a set of common stress genes by a clustering method (Ma and Bohnert, 2007). Using publicly available microarray data of transcriptional changes in response to various abiotic and biotic stresses, 197 common stress-responsive genes were identified. Similar studies were reported by Swindell (2006) and by Kant et al. (2008) for nine and 16 abiotic stress conditions, respectively. Based on GO annotation (kinase, TF, etc.), the latter report classified a subset of 289 genes as multiple stress regulatory genes (MSTRs), including several members of the WRKY and bZIP protein families, which are known to be stress associated (for review, see Jakoby et al., 2002; Ulker and Somssich, 2004). Considering the transcriptional response of these factors to very diverse signals, one may position MSTRs at the early steps of stress signaling responses. MSTRs can be expected to have a high turnover and to be controlled at multiple levels to allow fast adaptation and to prevent a prolonged activation of downstream signaling processes that would interfere with plant growth and development. These characteristics render MSTRs prime candidates for posttranslational modifications such as phosphorylation by protein kinases or ubiquitin-mediated stability control.
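The idea of a gene responding to multiple signals can be approximated by counting the conditions under which it is differentially expressed. Note that this simple counting is a stand-in for illustration, not the clustering method of the cited studies; the condition names, gene IDs, and threshold are hypothetical.

```python
from collections import Counter

def multi_stress_genes(responsive_by_condition, min_conditions):
    """Return genes that respond in at least min_conditions conditions."""
    counts = Counter(
        gene
        for genes in responsive_by_condition.values()
        for gene in set(genes)  # count each gene once per condition
    )
    return {gene for gene, n in counts.items() if n >= min_conditions}

# Hypothetical sets of responsive genes per stress condition.
responsive = {
    "cold": {"g1", "g2", "g3"},
    "salt": {"g2", "g3", "g4"},
    "pathogen": {"g3", "g5"},
}
print(sorted(multi_stress_genes(responsive, min_conditions=2)))
```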

Given that some signaling cascades are activated very rapidly (e.g. MAPK cascades are activated within minutes), candidate targets for diverse signaling pathways might be found by screening transcriptome data sets for very early responses in a similar fashion as done for MSTRs.

The identification of early targets of signaling cascades and knowledge on the modes controlling their activity is also of tremendous value for applied science, because appropriate manipulation may minimize the effort of creating crops with desired traits such as resistance to multiple stresses. The key regulators may be expressed in a controllable system, e.g. by using chemically inducible expression or nuclear translocation systems, thereby circumventing undesirable side effects on growth/development that are often associated when overexpressing genes constitutively.

HOW TO EXPLOIT MICROARRAY DATA FOR THE IDENTIFICATION OF TRANSCRIPTIONAL REGULATORS

To decipher the modes controlling the expression of a set of coexpressed genes, an in-depth inspection of their upstream regulatory regions may provide further information. During the immediate responses to a given stimulus, the signaling may not have bifurcated yet into highly complex downstream pathways. Therefore, early induced genes are likely to underlie regulation by common TF(s) and therefore share common DNA motifs in their regulatory regions. Promoter sequences of user-defined length can be downloaded from several Arabidopsis databases, e.g. TAIR (http://www.arabidopsis.org/tools/bulk/sequences/index.jsp), and subsequently screened for the presence of certain DNA motifs. While PLACE (www.dna.affrc.go.jp/PLACE/) or PlantCARE (http://sphinx.rug.ac.be:8080/PlantCARE/) are useful for detecting known cis-elements within a set of promoters, the TAIR motif finder (http://www.arabidopsis.org/tools/bulk/motiffinder/index.jsp) and AlignACE tool (http://atlas.med.harvard.edu/cgi-bin/alignace.pl) allow identification of potentially novel DNA motifs shared by multiple promoters. Once candidate motifs have been identified, the statistical significance of their enrichment can be assessed using the POBO tool (http://ekhidna.biocenter.helsinki.fi/poxo/pobo/pobo), which compares motif abundance in the given promoter set to the Arabidopsis background frequencies. This tool has, for example, proven useful for documenting the strong enrichment of W boxes in the promoters of WRKY18-dependent, SA-inducible genes (Wang et al., 2006).
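The enrichment test can be sketched as a bootstrap comparison, loosely in the spirit of what is described for POBO but not a reimplementation of it. Assumptions: the W box is taken as the commonly cited core TTGAC(C/T), both strands are counted, and all sequences are invented placeholders.

```python
import random
import re

# W-box core TTGAC[CT] and its reverse complement [AG]GTCAA; the lookahead
# lets overlapping occurrences be counted.
W_BOX = re.compile(r"(?=(TTGAC[CT]|[AG]GTCAA))")

def count_motif(seq):
    """Count W-box occurrences on both strands of one promoter."""
    return sum(1 for _ in W_BOX.finditer(seq.upper()))

def bootstrap_p(test_promoters, background_promoters, n_boot=1000, seed=0):
    """Empirical P value: how often a random background sample of the same
    size contains at least as many motifs as the test set."""
    rng = random.Random(seed)
    observed = sum(count_motif(s) for s in test_promoters)
    hits = 0
    for _ in range(n_boot):
        sample = rng.choices(background_promoters, k=len(test_promoters))
        if sum(count_motif(s) for s in sample) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_boot + 1)

# Hypothetical promoter fragments (placeholders, not real sequences).
test_set = ["AATTGACCAA", "GGTTGACTGG"]
background = ["AAAAAAAAAA", "CCCCCCCCCC", "ACACACACAC"]
observed, p_value = bootstrap_p(test_set, background)
print(observed, p_value)
```

In practice the background would be a large collection of promoters of the same length, so that the resampled counts reflect genome-wide motif frequencies.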

Subsequent to this statistical analysis, the functional relevance of enriched candidate DNA motifs in mediating stress responses can be experimentally validated using synthetic promoter-reporter gene constructs in transgenic plants or transfected protoplasts. The latter system also allows candidate TFs to be tested, with minimal effort, for their ability to induce or repress gene expression driven by a motif of interest, as shown, for example, by Rushton et al. (2002) and Pitzschke et al. (2009b).

HOW TO FIND THE TARGETS OF TRANSCRIPTION FACTORS

Alternatively to starting with the identification of multiple signal-responsive genes through the comparison of multiple signal-dependent expression profiles, an equally attractive approach for the elucidation of signaling cascades is the detailed characterization of a TF of interest, e.g. a known or predicted substrate of a signaling cascade.

For such a characterization, a phylogenetic analysis may provide first indications about the dimerization behavior and sometimes even about potential DNA target motifs. However, high homology within the DNA-binding domain of two TFs does not necessarily correlate with target motif similarity. For example, the bZIP factors of tobacco (Nicotiana tabacum) RSG2 and tomato VSF-1 have highly conserved bZIP domains, yet they bind to completely different DNA motifs (Ringli and Keller, 1998; Fukazawa et al., 2000). The bZIP domain of Arabidopsis VIP1 is strongly related to those of RSG2 and VSF-1, and VIP1 had been shown earlier to be phosphorylated by MPK3 in a stress-dependent manner and to undergo cytoplasmic-nuclear translocation (Djamei et al., 2007). Where no further information on the DNA motifs targeted by a TF of interest is available, random DNA selection assays (RDSAs) may be applied to generate data that subsequently can be analyzed by a range of bioinformatic tools (Pitzschke et al., 2009b).

In RDSA (Fig. 1), random double-stranded DNA fragments, usually 15 to 20 nucleotides long and flanked by defined primer-annealing sites, are incubated with recombinant TF protein. Candidate motifs are enriched through a repetitive selection-amplification procedure. RDSA yields a range of candidate DNA motifs that can be screened for common elements and aligned using the STAMP tool (http://www.benoslab.pitt.edu/stamp/). Electrophoretic mobility shift assays and mutagenesis of the candidate motif(s) are then used for confirming the binding and specificity of the TF to those motifs. Once such a motif has been found and confirmed, target genes of the TF can be predicted. For this, the TAIR patmatch tool (http://www.arabidopsis.org/cgi-bin/patmatch/nph-patmatch.pl) provides a tab-delimited file of position, number, and orientation for all genes harboring such motifs in a user-defined region (e.g. within 500 bp promoter regions). In the case of VIP1, this information aided the prediction of one of its target genes, MYB44, which was later confirmed by promoter-reporter gene activation and chromatin immunoprecipitation (Pitzschke et al., 2009b). For several multimeric TFs the spacing between adjacent target DNA motifs is crucial for transactivating activity. If knowledge about the spacing exists (e.g. the preferred spacing between W boxes targeted by certain groups of WRKYs; Ciolkowski et al., 2008), the number of further candidate target genes can be narrowed down. A fast visual tool for this application is the MotifMatcher tool (http://users.soe.ucsc.edu/~kent/improbizer/motifMatcher.html), which depicts multiple user-defined motifs (entered as matrix), each in a different color, on a set of promoters of interest as beads on a string (Fig. 1).
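The kind of output described for a pattern search (position and orientation of each hit) and the spacing between adjacent hits can be reproduced in a short sketch. This is not the patmatch implementation; the motif and the promoter sequence are hypothetical, and only exact matches are considered.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def scan_promoter(seq, motif):
    """Return sorted (position, strand) hits of an exact motif on both strands."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in re.finditer(f"(?={re.escape(motif)})", seq)]
    rc = reverse_complement(motif)
    if rc != motif:  # palindromic motifs would otherwise be reported twice
        hits += [(m.start(), "-") for m in re.finditer(f"(?={re.escape(rc)})", seq)]
    return sorted(hits)

def spacings(hits):
    """Distances (bp) between the start positions of adjacent hits."""
    positions = [pos for pos, _ in hits]
    return [b - a for a, b in zip(positions, positions[1:])]

# Hypothetical promoter fragment and motif (placeholders).
promoter = "TTGACCAAAAGGTCAA"
hits = scan_promoter(promoter, "TTGACC")
print(hits, spacings(hits))
```

Filtering promoters by a preferred spacing then amounts to keeping those for which `spacings` contains a value in the desired range.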

Figure 1.

Flow chart of a possible strategy to identify the DNA motif(s) and promoters targeted by a TF of interest. Bioinformatic tools are shown in italics.

HOW TO USE PROTEOMIC DATA IN CONSTRUCTING SIGNALING NETWORKS

Hypotheses on signaling pathway compositions can not only be generated through gene expression-based analyses, but also through proteomic approaches. The classical experimental approaches for retrieving a list of candidate interactors of a protein of interest are yeast two-hybrid (Y2H) screens and mass spectrometry (MS) analysis of purified protein complexes. Whereas Y2H analyses have the potential to predict direct protein interaction partners, MS analysis of protein complexes primarily indicates that the proteins are part of more or less complicated assemblies of proteins. The low degree of overlap between Y2H and MS studies in yeast further cautions against a naïve interpretation of these data sets. Y2H screens suffer from a relatively high rate of false positives that can be generated by multiple factors inherent in the system, including overexpression, artificial interaction of two components in the same compartment, or misfolding of the protein of interest due to fusion to yeast bait or prey proteins. MS studies of protein complexes, on the other hand, suffer from copurification of more or less abundant contaminants and the possibility that the proteins may not be interacting directly. Despite these drawbacks, valuable information can nonetheless be obtained from in silico analysis of publicly available interaction data, e.g. by using the tool provided at http://bar.utoronto.ca/interactions/cgi-bin/arabidopsis_interactions_viewer.cgi. This tool queries a huge database of confirmed Arabidopsis interacting proteins retrieved from the Biomolecular Interaction Network Database and from high-density Arabidopsis protein microarrays, and provides details about the experimental evidence. It also integrates data from macro- and microarray-based phosphoprotein arrays that led to the identification of Arabidopsis MAPK candidate substrates (Feilner et al., 2005; Popescu et al., 2009).

Because a Y2H- or protein microarray-based predicted interaction does not necessarily mean that two proteins truly interact in planta, the list of candidate interacting proteins can be narrowed down by applying additional selection criteria: (1) check the spatio-temporal expression pattern of the corresponding genes (does gene x expression overlap with that of gene y; useful tools are: https://www.genevestigator.com/gv/index.jsp and http://atted.jp/); and (2) compare the subcellular localization of the proteins (a chloroplast-localized protein is unlikely to interact with a nuclear protein).
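The two selection criteria can be applied as a simple filter. In this sketch all names, tissues, and compartments are hypothetical annotations; candidates without any annotation are excluded, which is a conservative choice made for the example and may be too strict for poorly characterized proteins.

```python
def filter_candidates(candidates, bait_localization, bait_tissues, annotations):
    """Keep candidates that share at least one compartment and at least one
    expression domain with the protein of interest (the bait)."""
    kept = []
    for protein in candidates:
        info = annotations.get(protein)
        if info is None:
            continue  # no annotation available: excluded in this sketch
        shares_compartment = bool(bait_localization & info["localization"])
        shares_expression = bool(bait_tissues & info["tissues"])
        if shares_compartment and shares_expression:
            kept.append(protein)
    return kept

# Hypothetical annotations for four candidate interactors.
annotations = {
    "cand1": {"localization": {"nucleus"}, "tissues": {"leaf"}},
    "cand2": {"localization": {"chloroplast"}, "tissues": {"leaf"}},
    "cand3": {"localization": {"cytosol"}, "tissues": {"pollen"}},
}
kept = filter_candidates(
    ["cand1", "cand2", "cand3", "cand4"],
    bait_localization={"nucleus", "cytosol"},
    bait_tissues={"leaf", "root"},
    annotations=annotations,
)
print(kept)
```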

Obviously, data merely based on prediction algorithms (e.g. http://wolfpsort.org/, http://www.cbs.dtu.dk/services/TargetP/) need to be interpreted with caution. The more complex SUBA tool (http://www.plantenergy.uwa.edu.au/suba2/) integrates prediction-based information with data based on experimental evidence (MS/MS, GFP fusion protein localization studies). Integration of transcriptomic and proteomic data, from your own and published arrays, can further facilitate the identification of top candidates (Fig. 2). Once a list of a manageable number of candidate interaction partners has been established, their ability to bind to the protein of interest can be experimentally validated (coimmunoprecipitation/bimolecular fluorescence complementation/fluorescence resonance energy transfer).

Figure 2.

Prediction of signaling components through integration of transcriptome and proteome arrays. By searching for the overlap between candidate proteins identified from peptide-based microarrays (e.g. interactors of a regulatory protein in a process of interest; right) and proteins encoded by genes differentially expressed under conditions of interest (left), high-priority candidates involved in the immediate downstream signaling can be defined. As exemplified here, multiple stress-responsive genes that encode proteins for which phosphorylation by a stress-associated kinase or dephosphorylation by a phosphatase has been observed are likely to be key components in early stress signal transduction.

The interaction viewer tool and the screening of published lists of protein-protein interactions can also aid the prediction of (further) partners that interact with a protein of interest (if protein A interacts with B, and B with C, then A might also interact with C). For instance, MPK4 has been shown to interact with MKS1 (for MAPK substrate 1). MKS1, in turn, was found to interact with two WRKY TFs, WRKY25 and WRKY33, in yeast. Both WRKYs are involved in biotic stress signaling, which in turn is clearly linked to MPK4. In an elegant series of experiments, Qiu et al. (2008b) showed that MPK4 exists in nuclear complexes with the WRKY33 TF. This complex depends on the MPK4 substrate MKS1. Challenge with pathogenic elicitors leads to the activation of MPK4 and phosphorylation of MKS1. Subsequently, complexes with MKS1 and WRKY33 are released from MPK4, and WRKY33 is recruited to the promoter of PAD3, encoding an enzyme required for the synthesis of antimicrobial camalexin. MKS1 serves to fine-tune WRKY33-mediated PAD3 expression. In line with this scenario, wrky33 mutants exhibit enhanced susceptibility to necrotrophic pathogens, whereas overexpression of WRKY33 increases resistance (Zheng et al., 2006). A recent transcriptome study has revealed further potential target genes of WRKY33, including CYP71A13, which encodes a cytochrome P450 monooxygenase required for camalexin synthesis (Petersen et al., 2008). The AttedII tool predicts a strong coregulation of PAD3 with CYP71A13, and the promoters of both genes carry multiple W boxes, suggesting that both genes are subject to a common regulatory mechanism, i.e. through WRKY33 (Petersen et al., 2008).
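The "A interacts with B, B with C" reasoning can be expressed as a search for second-degree neighbors in an interaction list. The sketch below uses only the pairs named in the text (MPK4 with MKS1, MKS1 with WRKY25 and WRKY33); the function name is hypothetical, and the output is a list of hypotheses to test, not of confirmed interactions.

```python
from collections import defaultdict

def indirect_partners(interactions, protein):
    """Return proteins reachable via one bridging partner, mapped to the
    bridging proteins, excluding known direct partners."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)

    direct = set(graph[protein])
    candidates = defaultdict(set)
    for bridge in direct:
        for other in graph[bridge]:
            if other != protein and other not in direct:
                candidates[other].add(bridge)
    return dict(candidates)

# Pairwise interactions mentioned in the text.
interactions = [("MPK4", "MKS1"), ("MKS1", "WRKY25"), ("MKS1", "WRKY33")]
predicted = indirect_partners(interactions, "MPK4")
print(predicted)
```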

In-depth analysis of available phosphopeptide sequences may aid the prediction of peptide motifs that are recognized by a given kinase. Moreover, similar to the screening of Arabidopsis genes carrying a motif of interest in their upstream regulatory region, the TAIR patmatch tool can be applied to generate a list of candidate Arabidopsis proteins that harbor a given peptide motif. Additional confidence about the functional relevance of a candidate peptide motif may also be obtained through phylogenetic analysis. For example, Arabidopsis NPR1 and its orthologs in other plants carry a characteristic DSXXXS peptide, a phosphodegron that marks them for phosphorylation-dependent proteasomal degradation (Spoel et al., 2009). Phylogenetic analysis can, for example, be performed using the tool at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/, which provides information about the homologs of a protein of interest in other plant species.
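Screening protein sequences for a peptide motif such as DSXXXS reduces to a regular-expression search. In this minimal sketch the protein names and sequences are invented placeholders, and X is allowed to be any character.

```python
import re

# DSXXXS with X = any residue; the lookahead also reports overlapping hits.
PHOSPHODEGRON = re.compile(r"(?=(DS.{3}S))")

def find_motif(proteins, pattern=PHOSPHODEGRON):
    """Map each protein containing the motif to its hits, given as
    (1-based start position, matched peptide)."""
    results = {}
    for name, seq in proteins.items():
        hits = [(m.start() + 1, m.group(1)) for m in pattern.finditer(seq.upper())]
        if hits:
            results[name] = hits
    return results

# Hypothetical protein sequences (placeholders, not real proteins).
proteins = {
    "protein_with_motif": "MKDSAAASLL",
    "protein_without_motif": "MKLLPPGG",
}
motif_hits = find_motif(proteins)
print(motif_hits)
```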

The functional relevance of candidate peptide motifs can then be experimentally verified (e.g. through in vitro phosphorylation). Subsequently, hybrid/artificial kinases can be created that modify proteins other than their true targets or that prevent phosphorylation of a protein by outcompeting the true modifying upstream kinase. Given that phosphorylation events are a common feature in the signaling of critical responses/processes in animals, this approach has high potential, for example, for tumorigenesis/cancer therapy research.

In summary, this review documents the usefulness, robustness, and limitations of applying various transcriptome-, promoterome-, and proteome-based bioinformatic tools for deciphering signaling pathways in Arabidopsis. Clearly, only a small subset of the available tools is described, and the practically unlimited number of ways in which they can be combined has high potential to significantly speed up progress in signaling research. Modeling approaches, for example those based on kinetic data, also have great potential for dissecting signaling pathways. In the future, experiments can be designed in a highly targeted manner, and in silico analysis will replace bench work to a large extent.

Footnotes

  • This work was supported by grants from the Austrian Science Foundation.

  • The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Heribert Hirt (hirt{at}evry.inra.fr).

  • www.plantphysiol.org/cgi/doi/10.1104/pp.109.149583

  • Received October 16, 2009.
  • Accepted November 12, 2009.
  • Published November 13, 2009.

LITERATURE CITED

Table I.

List of bioinformatic tools described in this review
