Predicting Sequence Dependent Fluorescence with Classic Machine Learning Models

23 December 2025, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Terminally labeled DNA oligonucleotides are widely used in modern biology and biotechnology. It has been observed that the fluorescence intensity of these labels is strongly influenced by the terminal nucleotide sequence. Recent studies have assayed and published the raw fluorescence values of Cy3 and Cy5 as a function of the five nucleotides adjacent to the label, resulting in 1024 (4^5) data points. While this sequence space is experimentally tractable, enlarging it would greatly increase the experimental cost and time required. Machine learning is well suited to addressing this problem of experimental tractability; however, there is a wide design space in the choice of algorithm. In this work we use classic machine learning models such as Support Vector Machines, Multilayer Perceptrons and Random Forests to both predict the raw intensity value and classify the intensity magnitude of the fluorophore using the sequence as input. We demonstrate that the performance of these models depends heavily on the numerical transformation of the sequence and that Random Forest consistently outperforms all other models in both regression and classification tasks, irrespective of the sequence transformation.
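
To make the size of the sequence space and the idea of a "numerical transformation of the sequence" concrete, the following is a minimal sketch that enumerates all 5-nucleotide sequences and applies two common encodings. The abstract does not state which transformations the authors used, so the ordinal and one-hot encodings, the `ACGT` ordering, and all function names here are assumptions chosen purely for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide ordering; not specified in the source


def all_kmers(k=5):
    """Enumerate every DNA sequence of length k (4**k sequences in total)."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def ordinal_encode(seq):
    """Hypothetical encoding 1: one integer index per position."""
    return [ALPHABET.index(base) for base in seq]


def one_hot_encode(seq):
    """Hypothetical encoding 2: a 4-element indicator per position, flattened."""
    vec = []
    for base in seq:
        vec.extend(1 if base == ref else 0 for ref in ALPHABET)
    return vec


kmers = all_kmers(5)
print(len(kmers))                     # 1024
print(ordinal_encode("ACGTA"))        # [0, 1, 2, 3, 0]
print(len(one_hot_encode("ACGTA")))   # 20
```

Either feature vector could then be passed to a regressor or classifier. The ordinal form imposes an arbitrary ordering on the bases while the one-hot form does not, which is one plausible reason model performance might vary with the choice of transformation.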

Keywords

Fluorescence
Biotechnology
Machine Learning

Supplementary materials

Supplementary Figures and Data
Contains all figures, tables and plots that were not included in the main body of the text, including neural network architectures, confusion matrices, and validation error data.
