Interpreting machine learning models to investigate circadian regulation and facilitate exploration of clock function - pnas.org

jokbanga.blogspot.com

The circadian clock is an internal molecular 24-h timer that is a critical adaptation to life on Earth. It temporally orchestrates physiology, biochemistry, and metabolism across the day/night cycle. As a result, it regulates many traits associated with fitness and survival (1, 2). The clock is a well-characterized transcriptional regulatory network, which drives complex, widespread, and robust patterns of temporal gene expression (3, 4). However, our understanding of such complex transcriptional regulatory systems is limited by our ability to assay them, requiring the generation of long, high-resolution time series datasets.

In plants, much of our understanding of circadian regulation comes from our study of the model plant Arabidopsis thaliana. This has yielded a plethora of public, multiomic resources (5 ⇓–7) that can be reanalyzed to give additional insights into the roles and functions of complex regulatory networks. In this study, we use newly generated datasets, published temporal datasets (8 ⇓–10) (SI Appendix, Table S1), and Arabidopsis genomes, in combination with machine learning (ML) approaches (see SI Appendix, Glossary for definitions of terms), to make predictions about circadian gene regulation and expression patterns. ML models are frequently described as “black boxes,” meaning that because of their complexity their inner logic is not easily understood by a human. Critically, we advance existing approaches using explainable AI algorithms and interpretation of our models to illuminate what is inside the black box (SI Appendix, Glossary), such methods help us to understand the predictions made by ML models. There are many model interpretation strategies in which methods can identify important patterns and/or features that underly an ML model (11). For example, model interpretation has been successfully implemented in drug discovery to enable the mechanistic interpretation of drug action and drug response (12 ⇓⇓–15). We use such approaches to give insight into circadian biology and experimental design, alongside our predictions. Clarity with respect to how a model makes its predictions, we propose, will also generate confidence and trust in the model, promoting its usage. We use the Arabidopsis circadian clock as an example of a complex transcriptional regulatory network since some of its key regulatory elements are already known, allowing the validation of our findings with experimental evidence.

Circadian gene expression rhythms reflect a variety of waveform shapes with a characteristic periodicity of ∼24 h (16). Recent computational methods for identifying these rhythms from transcriptomic time course datasets have achieved circadian gene classification with as few as 3 to 6 timepoints (saving time for sampling and money for sequencing) (17). However, some of the most popular approaches describe optimal sampling strategies for the identification of rhythms running with >3 d of data and 2-h sampling resolution (18, 19). This is partly due to concern for the loss of information and accuracy, as a result of downsampling. Since the cost implications of this are high, our focus is on designing trusted downsampling strategies for capturing circadian oscillations using a nonoptimal number of timepoints and on improving accuracy compared to existing methods to minimize the impact of information loss. Firstly, we develop ML models that classify circadian expression patterns using an iteratively lower numbers of transcriptomic timepoints, improving accuracy compared to the state of the art. Moreover, we use model interpretation to quantify the best, transcriptomic timepoints for sampling. We believe that predictive insight on when to sample will be a valuable reference for experimental biologists when planning experiments.

Next, we redefine the field, developing ML models that distinguish circadian transcripts using no transcriptomic timepoint information and instead using DNA sequence features (SI Appendix, Glossary). The theory supporting this is that a major mechanism of (circadian or otherwise) gene expression control is through transcription factor (TF) binding to the regulatory DNA sequence. Considering previous work in Arabidopsis, it is likely that the promoter, 5′ untranslated region (UTR), and the first part of the coding region are the most useful locations for TF binding site (TFBS) detection (20). Genes expressed with similar patterns are more likely to be controlled by similar sets of TFBSs. In addition, small RNAs (sRNAs), comprising microRNAs (miRNAs) and small interfering RNAs, are thought to affect transcript abundance via posttranscriptional regulation of messenger RNA (mRNA) (21). Plant miRNAs predominately bind to the coding regions of mRNA and, to a lesser extent, 5′UTR and 3′UTR regions (22, 23). As such, we consider both coding and noncoding regions to classify circadian genes using DNA sequence. Our DNA sequence features are profiles of k-mer–based motif representations that are identified de novo and embody a comprehensive picture of TFBS, sRNA/RNA binding sites, and other sequence-based regulatory elements, since we incorporate the promoter 5′UTR, 3′UTR, and coding regions.

A key strength of our DNA sequence–based approach is that we classify circadian transcripts using k-mer–based motif representations generated from preexisting, public, and genomic resources, facilitating downstream application of our methods with no experimental work or prior knowledge of regulatory elements needed. Computational regulatory motif discovery methods typically search for overrepresented words across DNA sequences using methods such as expectation maximization and Gibbs sampling (24 ⇓⇓–27). Approaches are typically limited by a requirement for input information [e.g., coexpressed genes, site abundance, or a fixed motif length (28 ⇓–30)]. Artificial intelligence (AI) has been used to take DNA sequence information as an input to predict outputs that likely impact DNA function. Examples include predicting variant effects on chromatin features, such as TF binding, and histone profiles [e.g., DeepSEA (31) and DanQ (32)]. Furthermore, AI has been used to predict transcriptomic profiles directly using features such as DNA sequence or epigenetic marks. These features typically include representations of TFBS [e.g., Xpresso (33, 34)], enhancers [e.g., McEnhancer (35)], histone modifications [e.g., DeepChrome (36)], open chromatin regions (37), or promoters (38). However, these approaches typically require experimental data beyond training, prior knowledge of regulatory elements that our approach does not need, or they focus on single-gene expression states and do not consider complex patterns, as our methods do.

Traditionally, AI work to predict expression has lacked comprehensive model explanation (39). Increasingly, efforts focus on developing interpretive methods for expression prediction models (34 ⇓–36, 38, 40). For example, for DNA sequence–based models, the studies (33, 38, 40) evaluate feature relevance or importance and derive predictive DNA regions by aligning differential nucleotide importance with differentially expressed genes; these regions can then be bioinformatically analyzed to identify regulatory motifs. Here, alongside our DNA sequence–based predictive model, we use explainable AI to derive regulatory motifs directly from the ML model and explore their functional consequences. We exploit model explanation to identify, on a transcript-by-transcript basis, the ranked regulatory sequences that guide the classification of its expression pattern as circadian. We identify both small and larger combinations of regulatory elements that, in combination, give a larger, overall impact on gene classification. These regulatory sequences are candidate genetic features that could control gene expression and allow us to understand the regulatory mechanisms governing circadian expression patterns and even to manipulate regulation. Ultimately, we use model explanation to generate and validate hypotheses in silico, facilitating both gene expression prediction and derivative regulatory element discovery.

Finally, assaying circadian clock function, as opposed to identifying transcript rhythmicity, has been a challenge for the study of the circadian regulation in organisms ranging from mammals to plants. Recent work applied ML to circadian time course transcriptomic datasets from human blood, to predict the phase of the endogenous circadian clock (circadian time, CT), using a single timepoint from a set of marker genes (41, 42). This allows the use of one timepoint to identify altered clock function (e.g., due to disease or environmental conditions). An equivalent major challenge exists in plant sciences. As such, we use ML to predict the circadian time in Arabidopsis from a single, transcriptomic timepoint using marker genes. To advance previous offerings, we identify marker genes as part of our interpretable approach, ensuring that they represent a diverse range of temporal patterns with consistent amplitudes across datasets to facilitate accurate and robust phase prediction, irrespective of sample phase. Counter intuitively, our marker genes do not include the core clock genes used in previous studies for time prediction (43). Taken together, these tools constitute a suite of informative resources for both experimental biologists and the interpretation of further circadian datasets.

Results and Discussion

ML Interpretation Optimizes Timepoint Downsampling to Define Circadian Transcripts.

We used MetaCycle for the detection of circadian signals in dense time series transcriptomic data (18). MetaCycle is one of the most well-maintained and accessible tools within the community, incorporating a variety of the most widely used methods, ARSER (44), JTK_CYCLE (45), and Lomb–Scargle (46), and integrating their results so that rhythmic prediction is a cumulation of different statistical approaches. We ran MetaCycle (see Materials and Methods) on the Arabidopsis time series transcriptomic dataset generated by ref. 8, which was sampled every 4 h for 48 h (SI Appendix, Table S1). The data were processed to produce normalized counts per transcript (see Materials and Methods). MetaCycle classified 9,394 out of 44,963 transcripts as circadian (q < 0.05), with 7,734 denoted as high confidence (q < 0.02) (SI Appendix, Note S1). We trained a series of ML classifiers to predict if a transcript was circadian or noncircadian in a binary classification using 7,734 of the least likely candidates to be circadian (q > 0.99), labeled by MetaCycle alongside the 7,734 highly circadian transcripts (q < 0.02) (see Materials and Methods and SI Appendix, Glossary and Note S2). For ML models, we report the F1 scores that measure the accuracy of the model on a scale of 0 to 1, with 1 being most accurate (SI Appendix, Glossary). Considering all 12 transcriptomic timepoints, the best model was generated with Light Gradient-Boosting Machine (LightGBM) after optimization (Materials and Methods and SI Appendix, Fig. S1A and Table S2), with an F1 score of 0.999 on the training data, 0.955 on the (held out) test data, and a mean F1 cross validation score of 0.939 (SI Appendix, Glossary). Our confusion matrix (SI Appendix, Fig. S1B and Glossary) highlights consistently high-model accuracy, irrespective of the class that is being predicted (circadian/noncircadian).

Our best ML model (LightGBM) assigned a matching, circadian/noncircadian label to the majority of the transcripts that MetaCycle labeled. However, the overlap was not 100%, so we examined the small proportion of transcripts that were “inaccurately” classified. We found that the inaccurately classified cases by our ML model were often intermediate or borderline cases for MetaCycle (Fig. 1) or edge cases (e.g., with slightly longer period lengths [SI Appendix, Fig. S1]). We deduced this because cases rejected by MetaCycle as circadian but accepted by the ML (false positives [FPs]) had significantly lower (MetaCycle derived) P values than the cases that were rejected by both MetaCycle and ML (true negatives [TNs]) (P < 0.0001, t = 6.8795, df = 7753). Conversely, cases accepted by MetaCycle as rhythmic but rejected by ML (false negatives [FNs]) had higher (MetaCycle derived) P values than cases categorized as rhythmic by both MetaCycle and ML (true positives [TPs]) (P < 0.0001, t = 5.7744, df = 7711) (Fig. 1A). Additionally, cases rejected by MetaCycle as circadian but accepted by the ML (FP) have significantly lower relative amplitudes (rAMPs) compared to the TP calls in which both methods agree (P < 0.0001, t = 8.3845, df = 7732). Conversely, cases accepted by Metacycle as rhythmic but rejected by ML (FN) had a significantly higher rAMP than the TN calls (P = 0.036, t = 2.0924, df = 7732) (Fig. 1B). Therefore, the ML model is not simply using high- and low-expression levels to discriminate the circadian and noncircadian status of transcripts, and interestingly, the distribution of rAMPs for the FNs reflects that of the TPs far more closely than that of the TN calls.

Fig. 1.

Arabidopsis circadian/noncircadian comparative ML binary classification with 12 transcriptomic timepoints. Class 0 = Noncircadian and Class 1 = Circadian. Histograms in A and B relate to the best model from SI Appendix, Fig. S1A generated using LightGBM; the histograms are color coded as per the confusion matrix shown to the right (i.e., showing where our model assigned TP labels, FP labels, FN labels, and TN labels). The histograms show bins representing the frequency of transcripts that had various P values (A) or rAMPs (B) assigned to them by Metacycle.

We assessed the effect of reducing the number of transcriptomic timepoints on the accuracy of our classification of circadian/noncircadian transcripts. For our best ML model (using 12 timepoints), we reduced the number of timepoints (or features) sequentially from 12 down to 3. To obtain each interim-reduced set of timepoints from 12 to 3, we used well-known feature selection tools χ² and eli5 (SI Appendix, Glossary) and compared these against testing every possible feature combination for the timepoint number (see Materials and Methods). The method of trialing every possible feature combination for each reduced timepoint number enabled us to most accurately classify transcripts as circadian/noncircadian (Fig. 2A). Using this approach with six timepoints, we achieved a mean classification F1 score of 0.886 on cross-validation and of 0.792 using only three timepoints (SI Appendix, Table S3). The mean F1 scores on cross-validation varied by 0.09 between our best and least predictive six timepoints, likewise they varied by 0.06 between our best and least predictive three timepoints, highlighting the impact of timepoint selection. SI Appendix, Table S3 highlights that we have consistent accuracy, irrespective of the class that is being predicted (circadian/noncircadian). Using model interpretation (i.e., identifying the combinations of features that gave the highest F1 scores), we defined the most optimal sampling strategies for the different number of timepoints. Selecting six or more timepoints, the best combinations tended to be consecutive timepoints extending across the intersect of day 1 and day 2. In contrast, when selecting low numbers of timepoints, more accurate classifications were made when timepoints were spaced across a single day (Fig. 2B), and there was a distinct bias for selecting certain timepoints with others appearing to be much less informative (e.g., ZT28, ZT40, ZT52, and ZT64 that were never selected). Fig. 2C shows the best combination of reduced timepoints in each category 12 to 3 for the example transcript phytochrome A (PHYA). When we followed the same strategy, creating transcriptomic ML circadian classification models for wheat, a divergent plant species from Arabidopsis (SI Appendix, Table S1), we saw similar trends for the best combinations of reduced timepoints (SI Appendix, Note S3 and Fig. S2).

Fig. 2.

Arabidopsis circadian/noncircadian ML binary classification to reduce the number of transcriptomic timepoints. For our best ML model, we reduced the number of timepoints sequentially from 12 to 3. (A) To obtain each reduced timepoint set, we compare using χ² (Chi2) and eli5 (Eli5) feature selection, with the best set comparing every possible random feature combination (All). We show the best F1 score after fivefold cross validation (CV) for each reduced timepoint set. (B) Detailing the 10 best combinations of features that gave the highest F1 score for each reduced set of timepoints. Labels N3 to N11 show the number of reduced timepoints. Labels avexp_24 to avexp_68 show sampling times. Counts 0 to 10 represent the number of times each timepoint appeared in the 10 best feature combinations. (C) For the example gene PHYA, a line plot of the gene’s expression across the best combination of reduced timepoints in each set 12 to 3. Expression values are reduced by ∼5% for each timepoint combination to allow the separation of lines for visualization.

To test how generalizable our model is on unseen data (SI Appendix, Glossary), we used the most accurate three-timepoint model (timepoints 36, 48, and 60) for the binary classification of, firstly, a second Arabidopsis transcriptomic time series dataset developed by ref. 9 and secondly, a wheat transcriptomic dataset representing a divergent plant species from Arabidopsis (SI Appendix, Table S1). These test datasets represent different sampling strategies and experimental setups (see Materials and Methods). Both test datasets were processed bioinformatically as per our original (8) dataset, in which we generated ground truth circadian/noncircadian labels for transcripts using MetaCycle with all available timepoints (see Materials and Methods). For the Arabidopsis (9) dataset, the timepoints did not match those used to train our model; sampling started 2 h after exposure to constant light (rather than 24 h after), and samples were taken every 3 h instead of every 4 h. As such, we selected the closest times to those that were used to train our model according to time of day relative to dawn (timepoints 11, 23, and 35). Unmatched timepoints are likely to have a negative effect on performance, which we observed with an F1 score for the classification of this gene set of 0.714, amounting to a decrease in accuracy of 0.08, compared to the dataset that the model was trained on. For the wheat dataset, sampling started 24 h after exposure to constant light, and measurements were taken every 2 h instead of every 4 h. Therefore, here, matching the time of day relative to dawn, we could select equivalent timepoints (12, 24, and 36 h), and the F1 score was slightly higher at 0.769, amounting to a decrease of only 0.02 on a highly divergent species. The model therefore generalizes well, irrespective of the sample’s species, and we observe a much similar performance by matching the timepoints that are used (relative to dawn), as we show for wheat.

We compared our timepoint reduction analysis using ML to a range of analyses representing the state of the art across the different timepoint numbers. MetaCycle requires a minimum of six timepoints for circadian analysis and benefits from these timepoints being evenly sampled across the chosen time period (18). As such, we reduced timepoints from 12 to 6 to enable comparison, including evenly spaced sampling patterns: 4 h/1 d and 8 h/2 d versus the best suggested sampling times from our ML analysis (4 h/1 d from 36 to 56 h from Fig. 2 B and C). The reduction to six timepoints significantly decreased the number of positive circadian gene calls by MetaCycle that were conserved with the 12-timepoint analysis, independently of the sampling technique used. The highest proportion of the 9,394 circadian genes, identified with 12 timepoints by MetaCycle that were also identified with six timepoints (P < 0.05), was 63.7% (SI Appendix, Table S4). This accuracy is ∼25% lower than the F1 score we achieved with six timepoints and our best ML model (SI Appendix, Table S3). Furthermore, when comparing the F1 score of our three-timepoint ML model, it was more appropriate to use a three-timepoint state-of-the-art analysis performed by Spörl et al. (17). SI Appendix, Table S4 highlights that we achieve a 12% higher accuracy with only three timepoints in a like-for-like comparison with Spörl et al. (17). This improvement is in addition to the experimental design insight that we provide.

Circadian Genes Can Be Classified from DNA Sequence.

We next eliminated transcriptomic timepoints and used DNA sequence features alone to classify transcripts as circadian/noncircadian. To achieve this, we generated k-mer profiles (counts) de novo for the mRNA and promoter sequences associated with each transcript, comparing a range of k-mer lengths (see Materials and Methods and SI Appendix, Glossary). We trained a series of ML classifiers to predict if a transcript was circadian or noncircadian in a binary classification using the derived k-mer profiles for the same set of transcripts and MetaCycle-derived labels used previously (for the transcriptomic ML model). Across the range of k-mers, the best models were consistently generated with the classifier LightGBM, and the most accurate model used a k-mer length of six, with separate feature sets for promoter and mRNA regions (8,192 features of k-mer counts per transcript) that were both inputted into the model (see Materials and Methods). This best optimized model showed the following (Fig. 3A and SI Appendix, Table S2): a mean F1 score of 0.766 on cross-validation (SD 0.006) and a test F1 score of 0.751 on class 0 (noncircadian) and 0.804 on class 1 (circadian). Our accuracy was largely balanced between the classes. An optimal k-mer length of 6 bp for this analysis could reflect this being the smallest length that we would not expect to simply occur by chance, therefore, giving ideal resolution. Because of the large number of features created using feature selection, we tested the accuracy of our rhythmic classification when subsets of the feature set were used (see Fig. 3B and Materials and Methods and SI Appendix, Glossary). We can reduce the feature number to ∼200 and maintain an F1 score above 0.7, but the highest accuracy was achieved with all 8,192 features, and as such, for downstream investigations, we used the full feature set.

Fig. 3.

Arabidopsis circadian/noncircadian ML binary classification using k-mer profiles. (A) For our best performing classifier LightGBM, we compare F1 scores for the test data and after cross validation (CV). These f1 scores were generated using different k-mer lengths (4 to 7 bp), with or without the use of oversampling (OS) since our classes are not perfectly balanced (SI Appendix, Glossary). (B) To obtain each reduced set of k-mers, we use χ² (Chi2) feature selection. We show the best F1 score after fivefold cross-validation for each set of reduced features. (C) The top 30 most impactful features for predicting class 1 (circadian), considering all samples in training and test as calculated using SHAP (SI Appendix, Glossary). Feature value denotes the frequency of a k-mer per transcript. When the frequency of a k-mer per transcript is high (red) and it has a positive SHAP value, this high frequency is driving the prediction of a circadian transcript. This is often coupled to the lower frequency of the same k-mer per transcript (blue) having a negative SHAP value, so the absence of the k-mer is driving the prediction of a noncircadian transcript. On the contrary, when the frequency of a k-mer per transcript is high (red) and has a negative SHAP value, the high frequency is driving the prediction of a noncircadian transcript. This is often coupled to the lower frequency of the k-mer per transcript (blue) that has a positive SHAP value, so the absence of the k-mer is driving the prediction of a circadian transcript. Features (e.g., the k-mer TATTGC) are labeled as “TATTGC” for counts from the promoter and “TATTGC.1” for counts from the mRNA. The corresponding plot for the class 0 (noncircadian) transcripts contains the same list of k-mers, but the SHAP value plots will be the exact inverse of this figure.

Our de novo k-mer generation approach allows the downstream identification and investigation of both known and previously unknown sites, with only the annotation of the TSS/TTS of a transcript required. Our short k-mers (6 bp) from promoter/UTR regions should mainly represent regulatory elements such as TFBSs. However, our inclusion of coding regions could encompass additional regulators (e.g., miRNA binding sites). Although miRNAs tend to be 20 to 24 bp in length, our k-mers may represent miRNA seed regions that are typically ∼6 bp in length and perfectly/near perfectly match targets (22).

Explanation of DNA Sequence–Based ML Model Links to Circadian Regulation.

We next wanted to explain our model, to identify which k-mer’s were most influential in guiding it to predict transcripts as circadian, since these k-mer’s could represent the most critical regulatory elements for circadian regulation. If we observe known circadian regulatory elements in this process, this is also a means of validation of the model. We used SHAP (Shapley Additive exPlanations) to explain our best DNA sequence–based model’s predictions by computing the contribution of each feature or k-mer to that prediction (i.e., ranked feature impact on classification) (SI Appendix, Glossary) (47). We did this firstly at a global level, looking at the top 30 most impactful features across all of the transcripts for distinguishing class 1 (circadian) from class 0 (noncircadian) (SI Appendix, Glossary and Fig. 3C). Approximately half of the most impactful k-mers in Fig. 3C show a positive correlation between k-mer frequency and the SHAP value or feature impact on the model. Higher frequencies of these k-mers for a transcript indicate a higher impact on it being classified circadian. Correlations between k-mer count and SHAP impact value for the top four most impactful k-mers from Fig. 3C are all highly significant (r > 0.7, P < 0.001; SI Appendix, Fig. S3). Of the positively correlated top 30 k-mers, 55% of those that contributed to the circadian classification of a transcript were predominantly in the promoter/UTR. We hypothesized that these k-mers represent TFBSs for TFs linked to circadian regulation.

To investigate if our most impactful promoter/UTR k-mers for prediction were TFBSs, we aligned known Arabidopsis TFBSs to each k-mer and filtered the most significant matches (SI Appendix, Table S5 and Materials and Methods). We then validated the k-mers that match/likely represent TFBSs using experimental evidence or insight from the literature, many closely associated with circadian regulation or circadian related processes. k-mers of interest included (k-mer 1; SI Appendix, Table S5) matches to TFBS for two photo-responsive TFs (AT3G58630 and AT5G05550) (P value 0.0002, e-value 0.18), which form interactions with a number of circadian-related proteins [e.g., LIGHT INSENSITIVE PERIOD1 (LIP1), CONSTANS-Like (Col) 11 (48) and REVEILLE 2 (49)]. Another k-mer (k-mer 7; SI Appendix, Table S5) matched a motif bound by several ethylene-responsive binding proteins (P = 0.00003, e = 0.02); ethylene synthesis is known to be both circadian controlled and a moderator of the circadian clock (50, 51). We also found matches for binding sites of known circadian TFs, including LUX ARRYTHMO (LUX) (52), CIRCADIAN CLOCK ASSOCIATED 1 (CCA1) (53), and LATE ELONGATED HYPOCOTYL (LHY) (54), alongside motifs associated with light-induced or -repressed sequences (SI Appendix, Table S5).

Four of the positively correlated top 30 most impactful k-mers defined by SHAP were observed primarily in coding regions rather than promoter/UTRs across the circadian, predicted transcripts. Since miRNAs are thought to influence circadian controlled processes (55, 56) and are common in coding regions, we tested the possibility that these k-mers could represent miRNAs by aligning them (plus the surrounding sequence) to mature ath-miRNA sequences to identify matches (see Materials and Methods). Two of the four k-mers matched miRNA sequences that were associated with developmental timing (57) and chloroplast biogenesis (58). Therefore, for a subset of transcripts, the k-mers could represent putative miRNA binding sites that have been experimentally linked to circadian, regulated processes, although this only accounts for a small proportion of the transcripts (SI Appendix, Table S5). We next investigated if these k-mers could represent RNA-binding motifs (see Materials and Methods). RNA-binding proteins are key regulators of gene expression in eukaryotes, and because of strong sequence conservation, their recognition preferences can be inferred from RNA-binding motifs. We validated two of the four coding sequence k-mers, linking them to RNA-binding motifs associated with circadian, related processes (SI Appendix, Table S5; P < 0.05). The first motif is targeted by the RNA-binding protein serine and arginine-rich splicing factor 7 (SRSF7). This links to circadian processes since circadian temperature cycles are known to drive rhythmic SR protein phosphorylation to control alternative splicing (59). The Arabidopsis protein RSZ22 is a known ortholog of the SRSF7 SR factor that this alignment could represent (60). The second k-mer–matched motif is targeted by the RNA-binding protein LIN28A (Homo sapiens). The Arabidopsis protein cold shock protein 1 (CSP1) is a known homolog of LIN28A, with a similar, functional role in reprogramming that this alignment could represent (61). CSP1 has been implicated in seed germination timing that is clock related (62).

For the remaining k-mers in the top 30 most impactful (Fig. 3C), those not associated with promoters/UTRs/miRNAs/RNA binding sites, we investigated their spatial distribution across the transcripts (SI Appendix, Fig. S4). Strikingly, there was a clear tendency for them to appear near to the start or else in the first half of the transcript that includes the first exon. By comparison, when we look at the spatial distribution of the promoter-derived k-mers from the top 30 most impactful features (SI Appendix, Fig. S5), they were distributed more uniformly across the promoters which they occurred in. We investigated if there were any changes in nucleotide composition between our most predictive k-mers compared to our nonpredictive k-mers (SI Appendix, Fig. S4N). As a baseline for comparison, we compared these k-mer groups to the mRNA and promoter sequences that were used to generate k-mers to train the Arabidopsis DNA sequence–based model. SI Appendix, Fig. S4N highlights that promoter sequences show a slightly higher percentage of GC content than mRNA sequence (34% versus 39%, respectively). Our most predictive k-mers have a percentage of GC content of 38% that falls in between the baseline for mRNA and promoter, while our nonpredictive k-mers significantly deviate from this profile showing a percentage of GC content of 70%. It appears that a high percentage of GC content is more likely to result in a nonpredictive k-mer.

Transcript-Specific Explanations Reveal Subclasses within the Binary Circadian Class.

Our DNA sequence–based model discriminated transcripts under circadian regulation from those that are not, which is useful to identify circadian regulatory elements from model explanations. However, circadian rhythms reflect a variety of waveform shapes. As such, we bioinformatically identified coexpression modules (SI Appendix, Glossary) from the transcriptomic profiles of the circadian transcripts that were used to train our ML models using weighted gene coexpression network analysis (WGCNA) (63). This resulted in eight modules with distinct circadian expression profiles. These modules embody groups of transcripts that can be differentiated by their phase of expression with the following phases/groups observed (SI Appendix, Fig. S6): morning phases of 0 h (cluster 7) and 4 h (cluster 5/6), day phase of 8 h (cluster 3), day/evening phase of 12 h (cluster 2), evening phase of 16 h (cluster 1), and night phase of 20 h (cluster 4/8).

We next sought to group our circadian transcripts into subgroups representative of different phases of expression, but rather than using transcriptomic information, we wanted to use the SHAP impact values of their k-mers. This effectively divides our DNA sequence–based model’s binary class circadian into multiple subclasses, providing further insight into transcript rhythmicity. To enable this, we used model explanation of our best DNA sequence–based predictive model, but rather than identifying the most impactful k-mers in general (global explanation) for predicting class 1 (circadian), as previously, we identified the most impactful k-mers for the classification of each circadian transcript individually (local explanation) (SI Appendix, Glossary). For this, we focus on the TP circadian transcripts in which MetaCycle and our ML model both predict the transcripts as circadian. Local explanations are transcript specific and could highlight k-mers that are regulating each transcript’s expression. Each transcript has a calculated SHAP impact value per feature (8,192 k-mers), and with this set of values, we denote the SHAP value profile for a transcript. In an SHAP value profile, the k-mer with the highest SHAP value is the most influential on the transcript’s classification as circadian. Comparison of these profiles allows us to compare and subdivide the transcripts using DNA sequence composition related to gene regulation, rather than transcriptomic profile.

After deriving local explanations, we filtered the most circadian transcripts according to their SHAP explanation (“most positive cumulative SHAP value”; see Fig. 4A and Materials and Methods and SI Appendix, Glossary). Then, we focused on known circadian genes that were within this set (i.e., experimentally validated and known TP genes from previous studies). We clustered the derivative transcripts of these genes based on the similarity of their SHAP value profiles, which represent the relative impact of the k-mers on their classification as circadian (Fig. 4B). In groups to the right of the dendrogram (purple), 85% of transcripts peak in their expression in the morning/day, whereas in groups to the left 77% of transcripts peak in the evening/night (phases determined by MetaCycle). Therefore, circadian transcripts with more similar k-mer SHAP value profiles also had similar expression phases, thus dividing our circadian class into subclasses representing phases of rhythmicity using k-mer information. Since we selected the top five most impactful k-mers per transcript for clustering, clustered transcripts represent those with similar combinations of k-mers that we hypothesize to be guiding expression phase. For example, PRR3 and LUX had similar SHAP value profiles, and we validated this by observing their similar transcriptomic expression profiles, with the evening phases of expression of ZT15 and ZT13, respectively. Exceptions included the two LNK genes, which have morning phase transcript expression profiles but have SHAP profiles similar to evening- and night-expressed genes. This suggests that LNK1/LNK2 may be regulated by a separate mechanism to that regulating other dawn-expressed genes. We also observed TIC, which peaks at dusk in the transcriptomic data, in the morning/day cluster; previously, rhythmicity of TIC was not detected in whole seedlings (64), whereas here we confidently classify this transcript as circadian from aerial tissue (MetaCycle q = 0.004). Previous work concluded that TIC functions in the late evening (65) but plays a role regulating LHY that is in the same morning/day cluster as TIC; this may explain its appearance here (64). Finally, we also see the night gene PHYB in the morning/day cluster, perhaps because of the presence of the similarly regulated PHYA in this cluster (66).

Fig. 4.

Investigating Arabidopsis circadian TP transcripts after ML DNA sequence–based classification. This figure relates to the LightGBM ML model that we selected as our best classifier. (A) Violin plot showing the range of SHAP values across TP transcripts (correctly predicted circadian). A positive SHAP value for a k-mer, for a transcript, indicates the k-mer is driving a circadian prediction, while a negative SHAP value indicates the k-mer is driving the prediction of noncircadian for that transcript. SHAP values are summed for each transcript to produce a cumulative SHAP value. (B) Dendrogram clustering known core circadian transcripts according to their profiles of SHAP values if the transcripts were also present in Q1 to Q3 of A. We clustered transcripts using hierarchical clustering with average linkage and Euclidean distance (see Materials and Methods). Dendrogram branches are colored using a color threshold to color all descendant links below a cluster node k the same color if k is the first node below the cut threshold t (∼5). Dendrogram labels colored according to peak phases of expression: morning (0 to 5.99999 h), day (6 to 11.99999 h), evening (12 to 17.99999 h), and night (18 to 23.99999 h), as determined by 1) MetaCycle or 2) the module of origin of the transcript from our eight WGCNA-generated modules. (C and D) Violin plots show the range of SHAP values across all TP transcripts in groups morning/day/evening/night for the k-mers GATATT (C) (EE) and AAACCC (D) (Telo-box).

From our transcript SHAP value profile clustering (Fig. 4B), for subclasses of transcripts with similar expression phases, the most impactful k-mers per subclass could represent sequences that are regulating time-of-day–specific expression. Identifying these using model explanation could facilitate the estimation of circadian expression phase without the need for a transcriptomic time course. To test this, we split the transcripts into morning/day/evening/night and investigated which k-mers differentiated the groups. We identified the top 30 most variable k-mers between the four groups’ consensus SHAP explanations; these k-mers vary most in their impact between the groups (see Materials and Methods) (SI Appendix, Table S6). Since we are comparing k-mers that differentiate groups of transcripts that are separated by their phase of expression, we validated our hypothesis by matching the k-mers to binding sites that have been experimentally associated with specific times of day. For example, the late night–specific Telo-box (67), a G-box–related sequence thought to associate with late night and dawn genes (68), and the evening element (EE) (69) that appeared twice in the top 30 with two k-mers matching it. When we compared the importance of these k-mers between the morning, day, evening, and night groups, the EE had a higher impact on model prediction in the evening group than in the other three groups, and this difference was statistically significant compared to both morning and night (Fig. 4C and SI Appendix, Table S7). Additionally, the Telo-box had a higher impact on model prediction when observed in the night group compared to all other groups, and this difference was statistically significant compared to day and evening, fitting with its late night specificity (Fig. 4D and SI Appendix, Table S7).

Case Study: Explanation for PHYA to PHYE Guides Reclassification of PHYC.

The PHYTOCHROME (PHY) genes encode red and far-red photoreceptors directly involved in setting the clock. Previous studies have identified circadian regulation of PHY transcripts A to E as rhythmic with patterns ranging from strong to weak (70, 71). Here, from the (8) transcriptomic data PHYC/PHYD/PHYE were all called noncircadian by MetaCycle with q-values of 0.99, 0.60, and 0.13, respectively, while, from the same dataset, the software BooteJTK (72) (that had few differentially annotated transcripts compared to MetaCycle; SI Appendix, Note S1) classed PHYD/E as circadian and only PHYC as noncircadian (q-value = 0.2), with some evidence of a weak, cyclic pattern. From this and previous work, these genes could be rhythmic, but this may not be clear in the transcriptomic data, likely because of reported low-amplitude expression patterns and dependent on testing conditions, particularly for PHYC/E (SI Appendix, Fig. S7A). These genes were missing from our ML analysis and can be used as unseen test datapoints (SI Appendix, Glossary) for the ML models. The mixture of strong (PHYA/B) and weak cyclical patterns (PHYD/E and potentially PHYC) are ideal for a test of the limits of the ML models. Working with the assumption that all of the PHY primary transcripts A to E are circadian as a baseline for comparison, for the PHY primary transcripts A to E, SI Appendix, Table S8 highlights MetaCycle’s 40% accuracy, only classifying PHYA/B as circadian, compared to our ML (12 timepoint) model’s 80% accuracy, since we additionally classify PHYD/E as circadian. This is supported by the BooteJTK analysis and visually evident rhythmic expression in the transcriptomic data for PHYE and to a lesser extent for PHYD (SI Appendix, Fig. S7A). We maintain our 80% accuracy when we generate k-mer profiles for the PHY transcripts A to E and use our DNA sequence– or k-mer–based ML model to predict circadian/noncircadian. Both of our ML models (transcriptomic and DNA sequence–based) classify PHYC as noncircadian, with the other primary PHY transcripts predicted circadian. Even the DNA sequence–based ML model discriminated PHYC from the other PHY transcripts to align with the transcriptomic information, despite sequence similarity between them. Moreover, the transcriptomic expression profile for PHYC provides an unconvincing, circadian rhythm, with an amplitude tending toward zero (0.02), compared to the other transcripts (SI Appendix, Fig. S7A). This may reflect previous work that concluded a weak or noncycling steady state of PHYC mRNA potentially due to posttranscriptional, circadian regulation (70, 71).

We used the SHAP explanations for the PHY transcripts A to E to identify the regulatory elements that were most impactful in guiding their classifications using the DNA sequence–based model. We compared the SHAP impact values between each of the PHY transcripts A/B/D/E (circadian) and PHYC (noncircadian) to identify those k-mers or regulatory elements that are most impactful in predicting PHYA/B/D/E to be circadian but also in predicting PHYC to be noncircadian (six identified in SI Appendix, Table S9). The change in frequency of these k-mers (within the transcript) is most likely to be responsible for the circadian/noncircadian, predictive differences between the transcripts according to our model (SI Appendix, Note S4, Figs. S7 B and C, and S8). To investigate if altering any of the six identified k-mers (SI Appendix, Table S9) had more or less potential to induce rhythmicity in PHYC, we sequentially evolved the spectrum of PHYC, one k-mer at a time, to mimic the robustly rhythmic PHYA/B transcripts more and more with each iteration. We used our DNA sequence–based ML model to classify the evolved transcripts. Firstly, removing k-mers GGTAGA then TTTCTG sites resulted in predictive probabilities for the circadian class of 0.42 and 0.48, respectively (increasing from 0.38). Secondly, adding AAATAA increased the predictive probability of circadian class membership to 0.58. Finally, adding TCTCCG resulted in a circadian class predictive probability of 0.75, placing this transcript’s classification confidently as circadian. Some potential regulatory elements were more important than others, having a larger effect on the classification of the transcript; for example, k-mers in the 5′UTR had a larger effect. Additionally, we show that multiple elements combine to have a greater impact on transcript classification and potentially regulation.

We aligned known Arabidopsis TFBSs to the UTR-based k-mers from PHYA/B that most positively impacted PHYCs circadian reclassification during our evolution to suggest biological reasons why these sites may be having such a large effect. AAATAA aligned to the TFBS of MYB56 that is involved in the regulation of anthocyanin levels in response to circadian rhythms (73) (SI Appendix, Table S5). While TCTCCG-matched TFBS of AT3G58630 that has a protein–protein interaction with LIP1, a gene known to function in the clock regulating light input downstream of photoreceptors such as PHYB (74).

We collated a further 41 known key circadian genes, with published evidence of rhythmic expression from the literature and compared the classification accuracy of their associated primary transcripts between MetaCycle, our transcriptomic ML model, and our DNA sequence ML model (SI Appendix, Table S10). MetaCycle shows an overall accuracy of 80.49%, classifying the 41 transcripts as circadian, compared to 95.12% with the ML transcriptomic model (SI Appendix, Table S11). Approximately 10 of the 41 genes (SI Appendix, Table S10) were not used to train either of our ML models and were unseen datapoints, mainly from MetaCycle not assigning a highly confident classification to their transcripts (q < 0.01) because of low-amplitude expression profiles. These are problematic transcripts for classification and measure the worst-case scenario for predictions. Using 12 transcriptomic timepoints, our ML model was more accurate at correctly classifying these transcripts as circadian, despite their problematic, low-amplitude rhythms (80% accuracy versus 20% for MetaCycle). This suggests that our model can generalize well to unseen transcripts. Interestingly, our model that used DNA sequence achieved a higher accuracy of 90% on the unseen datapoints, which was close to its recorded accuracy on all 41 genes (92.68%), sidestepping the problems associated with low amplitudes using genetic sequence features.

Predictions Using DNA Sequence Generalize to Other Arabidopsis Ecotypes.

Our ML model (using DNA sequence) can accurately make predictions on unseen datapoints. We assessed this in both our initial testing (with held out test data; SI Appendix, Glossary) and in our case study analysis of known circadian genes. We next assess how well our model performs on a different source to that used for model training (Col-0). We selected the Arabidopsis ecotype Ws-2 primarily for this test but also to highlight the suitability of our approach for a divergent species like wheat (details in SI Appendix, Note S5 and Fig. S9). For Ws-2, we firstly generated k-mer spectra for related transcripts and used the two-timepoint, transcriptomic dataset generated by ref. 10 to label Ws-2 transcripts circadian/noncircadian to gauge accuracy (SI Appendix, Table S1, Note S6, and Fig. S10). From this analysis, 71.4% of Ws-2 DNA sequence–based classifications matched their labels derived from (10) transcriptomic data. This is ∼5% lower than the accuracy given by the DNA sequence–based model using Col-0 (mean F1 score of 0.766 on cross-validation) (SI Appendix, Note S7). The 5% decrease may be impacted by imperfect Ws-2 labeling; our Ws-2 labeling strategy using two transcriptomic timepoints was >90% accurate for Col-0 (SI Appendix, Note S6 and Fig. S10), and from previous work, we expect an additional dropout in performance between species (0.02 for matched timepoints).

We used our DNA sequence–based model to identify transcripts that differentiated in rhythmicity between Arabidopsis ecotypes. Then, we used model explanation to explain which regulatory elements influenced this to validate findings. Such functionality gives tremendous power for downstream gene expression manipulation, even in the absence of transcriptomic information. As an initial proof of concept for this, we ranked the transcripts according to the predictive probability of them being circadian for Col-0 and the corresponding predictive probability of them being noncircadian for Ws-2. We identified 12 transcripts that were classified as circadian for Col-0 but noncircadian for Ws-2 by the DNA sequence–based model (predictive probability >0.8) (SI Appendix, Table S12). Our most confident or top ranked transcript was AT1G58602.1-RECOGNITION OF PERONOSPORA PARASITICA 7 (RPP7) (i.e., most probable circadian transcript in Col-0 [probability 0.999] and most probable noncircadian in Ws-2 [probability 0.991]). RPP genes confer resistance to races of Peronospora parasitica in an ecotype-specific manner. Functional RPP7 is thought to mediate resistance to infection by the Peronospora isolate Hiks1. Work by ref. 75 found that while Col-0 has a functional RPP7 and is resistant to Hiks1, Ws-2 is susceptible to attack by this pathogen. This coincides with our DNA sequence predictions suggesting that the circadian regulation of RPP7 is important for defense functionality. This conclusion is supported in the experimental, transcriptomic data, in which RPP7 in Ws-2 shows consistent low expression but in Col-0 is expressed rhythmically at higher levels (SI Appendix, Fig. S11A) (75). RPP7 has been linked to circadian regulation: firstly because resistance (R)-genes in the RPP family were reported to be under CCA1 control (76) and secondly via RPP7’s required interactor EDM2 that is involved in the promotion of floral transition by regulating the floral repressor FLC (77).

Previous evidence supports our observed differentiation in the rhythmicity of RPP7 between Col-0 and Ws-2. We next use model explanation to understand which elements differ between Col-0 and Ws-2; in this example, in Ws-2, this could represent the elements to change to render it resistant to Hiks2. As such, for each k-mer, we compared the SHAP impact values from the DNA sequence–based model between the Col-0 and Ws-2 homologs of AT1G58602.1 (RPP7). We ranked the k-mers in ascending order, as the difference in SHAP impact values between the homologs increased, to highlight the regulatory elements that were most impactful in guiding the differential, circadian/noncircadian predictions (SI Appendix, Fig. S11). The top five ranked k-mers closely linked either to the circadian clock or to disease resistance mechanisms or both (SI Appendix, Note S8). We then sequentially evolved the k-mer spectrum for AT1G58602.1 in Ws-2, a k-mer, at a time to match Col-0 more and more with each iteration. Each iterative, evolved transcript was classified using the DNA sequence–based model, in which we observed that the predictive probability of the circadian class for each evolved transcript quickly increased (SI Appendix, Fig. S11B). The adaptation of 26 Ws-2 k-mers to match Col-0 changed the prediction for Ws-2 from noncircadian to circadian, and the adaptation of 124 Ws-2 k-mers was needed to reach the maximum predictive probability of 0.999. The predictive probability of the circadian class for Ws-2 was highly positively correlated (0.676), with the difference in SHAP values between the Col-0 and Ws-2 k-mers (SI Appendix, Fig. S11C). Our analysis shows that some regulatory elements have a larger effect on transcript classification than others and that this effect is quantifiable using model explanation. We show the potential for large combinations of regulatory elements to work together, potentially each contributing a small amount, to result in a large overall impact on gene classification and potentially regulation (e.g., the 26 k-mers that we changed here to convert Ws-2 to be classified as circadian).

AT1G58602.1 showed no expression in Ws-2 across the two transcriptomic timepoints (SI Appendix, Fig. S11A). To highlight that our DNA sequence–based model was classifying circadian Col-0/noncircadian Ws-2 transcripts, not simply expressed/nonexpressed, we investigated other transcripts from SI Appendix, Table S12. We identified four additional transcripts in our ranked top 10, in which we observe expression of both Ws-2 and Col-0 consistently across the two transcriptomic timepoints (SI Appendix, Fig. S12). Here, the expression profile of Ws-2 is still seen to be largely flat versus potentially cyclical profiles of Col-0.

Identifying Transcriptional Biomarkers that Predict Internal Circadian Time.

Here, we use ML to determine the circadian time of sampling (i.e., predicting the phase of the endogenous circadian clock) using a set of transcriptional biomarkers from any single, transcriptomic timepoint. Previous studies have developed such models for human and mammalian transcriptome datasets (41 ⇓–43, 78, 79). We developed a method that we applied to plant data that innovatively uses model interpretation to identify Arabidopsis biomarker transcripts to guide predictions. This incorporates biomarker selection from across circadian phases to increase accuracy and robustness.

To train our model, we used the transcripts per million–normalized, circadian dataset described earlier (8) and the two further transcriptomic datasets (9, 10) for validation and testing (see Materials and Methods and SI Appendix, Glossary). Firstly, we aggregated a selection of metrics to rank and select transcript subsets from ref. 8 according to their confidence of rhythmicity for model training (see Materials and Methods). SI Appendix, Table S13 highlights the mean absolute errors (MAEs) of the predictions of circadian time without hyperparameter optimization (SI Appendix, Glossary) on the three temporal, transcriptomic datasets using different sized subsets of the highest-ranked, rhythmic genes. The lowest MAE, based on the (10) test dataset, was 104 min and was observed with a selected subset of 50 transcripts. Using the confidence of rhythmicity for transcript prioritization, we noted that the representation of our subsets of transcripts across the eight coexpression modules, generated by the WGCNA gene coexpression network analysis, was not uniform (SI Appendix, Fig. S13A and Glossary). This reflects an uneven representation across the phases of rhythmic expression. Therefore, we prioritized the selection of transcripts using model interpretation in the form of feature selection to make the frequency distribution across the modules more uniform (see Materials and Methods and SI Appendix, Glossary). Optimizing performance based on the validation dataset, our best performing model overall used a final subset of 15 transcripts (SI Appendix, Table S14) and had an MAE of 21 min on the training data, 56 min on the (9) validation data, and 46 min on the test data from ref. 10. SI Appendix, Fig. S13 B and C highlight that after feature selection there was a decrease in the generalization error on average across the (10) test dataset, with the improvements in MAE decreasing as the number of genes increased. This supports the theory that features containing different, temporal patterns of varying strengths outperform features containing strong but highly correlated patterns.

The performance of our best model (15 transcripts with an MAE of 46 min on the test data) is in line with the ∼1-h test error reported by ref. 78 using their state-of-the-art method ZeitZeiger. As such, we applied ZeitZeiger to our datasets (8 ⇓–10) to compare directly with our model. To reflect our previous approach, the dataset (8) was used to fit ZeitZeiger, with predictions then being generated on the validation (9) and testing (10) datasets to compare with the predictions generated by our method. Our approach significantly outperformed ZeitZeiger on the test dataset (MAE of 46 compared to 143 min; SI Appendix, Fig. S14), demonstrating our efficacy at generating highly accurate predictions for circadian time. We also noted a large disparity in training, validation, and test errors by ZeitZeiger (MAE of 6 min on training, 119 on validation, and 143 on test) that suggests overfitting (SI Appendix, Glossary). We hypothesized that our selection of biomarker transcripts, to ensure even representation across the phases of rhythmic expression, would yield a more robust or generalizable mapping from expression data to internal, circadian time (i.e., less overfitting); this analysis supports this hypothesis.

The 15 transcripts in our final subset act as a small subgroup of biomarker transcripts that are sufficient to predict the circadian time (SI Appendix, Table S14). Interestingly, the 15 transcripts did not include any core clock genes. This analysis was conducted using the ecotype Col-0. However, using the Ws-2 data (10), an MAE on this ecotype of only 53 min was observed (5 min lower than for Col-0 on which the model was trained). Generally, we observed no relationship between circadian time and prediction error, except for in the training dataset, in which errors at the 20-h timepoint were significantly larger than the other times (SI Appendix, Fig. S13D). However, variation in error across the timepoints typically stayed under 90 min, allowing the sufficient resolution of circadian time, given that typical sampling strategies are between 2 to 4 h.

Adblock test (Why?)

"machine" - Google News
August 06, 2021 at 08:06AM
https://ift.tt/3ChBtxL

Interpreting machine learning models to investigate circadian regulation and facilitate exploration of clock function - pnas.org
"machine" - Google News
https://ift.tt/2VUJ7uS
https://ift.tt/2SvsFPt

Bagikan Berita Ini

Search