Abstract
Terminally labeled DNA oligonucleotides are widely used in modern biology and biotechnology. The fluorescence intensity of these labels has been observed to depend heavily on the terminal nucleotide sequence. Recent studies have assayed and published the raw fluorescence values of Cy3 and Cy5 as a function of the five nearest nucleotides, resulting in 1024 data points (4^5 possible sequences). Although a sequence space of this size is experimentally tractable, enlarging it would vastly increase the experimental cost and time. Machine learning is well suited to addressing this issue of experimental tractability; however, there is a wide design space in the choice of algorithms. In this work, we use classic machine learning models, namely Support Vector Machines, Multilayer Perceptrons, and Random Forests, both to predict the raw intensity value and to classify the intensity magnitude of the fluorophore, using the sequence as input. We demonstrate that the performance of these models depends heavily on the numerical transformation of the sequence, and that Random Forest consistently outperforms all other models in both regression and classification tasks irrespective of the sequence transformation.
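To make the idea of a "numerical transformation of the sequence" concrete, the following is a minimal sketch of two common encodings, ordinal and one-hot, applied to the 5-nucleotide sequence space. The abstract does not state which transformations were actually used, so both encodings, the base ordering, and the function names `all_kmers`, `ordinal_encode`, and `one_hot_encode` are assumptions made purely for illustration.

```python
from itertools import product

# Assumption: bases are ordered A, C, G, T; the ordering is arbitrary.
ALPHABET = "ACGT"


def all_kmers(k=5):
    """Enumerate every length-k DNA sequence (4**k in total)."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def ordinal_encode(seq):
    """Map each base to its integer index in ALPHABET (A=0, C=1, G=2, T=3)."""
    return [ALPHABET.index(base) for base in seq]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator vector and flatten the result."""
    vec = []
    for base in seq:
        vec.extend(1 if base == ref else 0 for ref in ALPHABET)
    return vec


kmers = all_kmers(5)
print(len(kmers))                     # 1024
print(ordinal_encode("ACGTA"))        # [0, 1, 2, 3, 0]
print(len(one_hot_encode("ACGTA")))   # 20
```

The ordinal encoding yields 5 features per sequence but imposes an artificial ordering on the bases, whereas the one-hot encoding yields 20 features with no implied ordering. Differences of this kind are one plausible reason a model's performance could vary with the chosen transformation.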
Supplementary materials
Title
Supplementary Figures and Data
Description
Contains all figures, tables, and plots that were not included in the main body of the text. These materials include neural network architectures, confusion matrices, validation error data, etc.