Starting an internship at University Medical Center Groningen

The what, the why, and the how.

"I have about four weeks to come up with topics for posts to write and write a few drafts", he said. "The first one will be easy", he said. Yeah, right, until it's the day before your self-imposed deadline, and you haven't written anything yet.

Regardless, let me introduce the topic at hand: my internship, or "work placement," as my university likes to call it, at the University Medical Center Groningen. I will develop a web application to annotate large-scale transcriptomic data automatically using machine learning. What? Why? How? Elaboration time!

Large-scale trans-what?

Large-scale transcriptomic data! The research group I'll join for a few months has aggregated approximately 280,000 mRNA expression profiles from publicly available sources. I can imagine that doesn't explain much of anything yet, so let's take a step back and talk about something you have probably heard of: deoxyribonucleic acid, better known by its abbreviation, DNA.

DNA makes up the genetic instructions for how to run the cells in our bodies and is stored in the nucleus (centre) of these cells. A specialised set of enzymes (a particular class of large bio-molecules) can read your DNA and copy specific messages into a slightly different language called ribonucleic acid, or RNA. RNA comes in several varieties, but since this one carries messages, we call it mRNA, for messenger RNA. The cell post-processes the mRNA, which then moves to the cell's protein production factory. At this factory, the mRNA serves as the blueprint for building a particular protein. The produced protein may then fulfil one of a variety of functions within or outside of the cell. In short, mRNA is the messenger molecule between the instructions stored in DNA and the protein factory.

Building on that, you can take a snapshot of a cell and measure how many mRNA copies of a specific section of DNA you find, which we call the expression level of that section of DNA. An mRNA expression profile, then, is a list of all the sections of DNA together with their respective expression levels, as found in the snapshot we took of the cell. Approximately 280,000 of these expression profiles form the data underpinning this project.
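To make the idea of an expression profile a bit more concrete, here is a toy sketch in Python. It assumes, purely for illustration, that a snapshot is a list of mRNA molecules, each labelled with the section of DNA it was copied from. The labels and counts are made up, and real measurement techniques are considerably more involved than counting items in a list.

```python
from collections import Counter

# Hypothetical snapshot: one entry per mRNA molecule found in the cell,
# labelled with the section of DNA it was copied from (labels are made up).
snapshot = ["section_A", "section_B", "section_B", "section_C", "section_B", "section_A"]


def expression_profile(mrna_molecules):
    """Count how many mRNA molecules were found per section of DNA."""
    return dict(Counter(mrna_molecules))


profile = expression_profile(snapshot)
print(profile)
```

Each key is a section of DNA, and each value is its expression level in this snapshot.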

And you annotate those things?

Yes! Or, well, kind of. A small subset of the data has been annotated manually by collecting information from the corresponding publications and attaching it to the respective samples. Most of the samples, however, have incomplete annotations, which poses a considerable issue when processing this much data. My first task will be to take the fully annotated samples and use them to train a machine-learning algorithm to do the annotations for us.
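As a rough illustration of "train on the annotated samples, then annotate the rest", here is a minimal sketch using a nearest-centroid classifier. This is not the algorithm the project will use (that has yet to be decided); it is just about the simplest thing that fits the description. The tissue labels and the three-value profiles are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(profiles, labels):
    """Average the profiles of each label into one 'typical' profile (centroid)."""
    sums = {}
    counts = defaultdict(int)
    for profile, label in zip(profiles, labels):
        if label not in sums:
            sums[label] = [0.0] * len(profile)
        sums[label] = [s + v for s, v in zip(sums[label], profile)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def annotate(profile, centroids):
    """Return the label whose centroid is closest (Euclidean distance) to the profile."""
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))


# Hypothetical fully annotated samples: expression levels for three sections of DNA.
annotated_profiles = [[9.0, 1.0, 0.5], [8.5, 1.5, 0.7], [1.0, 7.0, 6.5], [0.5, 8.0, 7.0]]
annotated_labels = ["breast", "breast", "lung", "lung"]

centroids = train_centroids(annotated_profiles, annotated_labels)

# A sample whose annotation is missing gets the label of the nearest centroid.
predicted = annotate([8.8, 1.2, 0.4], centroids)
print(predicted)
```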

A straightforward way of annotating samples would be guilt by association: each sample is annotated according to the dominant signal found in it. That is where we run into a significant issue. Most of the samples involved are derived from biopsies, in which various signals (e.g., biological pathways, non-cancer tissues, non-biological effects) are intermingled in the mRNA expression profile. An annotation made using guilt by association could be heavily influenced by whichever signal dominates, producing a biased annotation. To get around this, the research group has used independent component analysis (ICA), a statistical signal-processing method for separating a multivariate signal into its components, to disentangle these signals. That, in turn, allows me to work my machine learning magic on the dataset.
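The bias described above can be shown with a small toy example. The sketch below does not perform ICA; it only illustrates the problem ICA is meant to address. It assumes, for illustration, that a biopsy profile is a weighted mix of two made-up reference signals, and that guilt by association picks the reference signal that correlates best with the sample.

```python
import math

# Hypothetical reference signals: expression levels over five sections of DNA.
signals = {
    "tumour": [9.0, 1.0, 8.0, 1.0, 2.0],
    "non-cancer tissue": [1.0, 8.0, 2.0, 9.0, 1.0],
}


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def mix(weights):
    """Build a sample profile as a weighted sum of the reference signals."""
    length = len(next(iter(signals.values())))
    return [sum(weights[name] * signals[name][i] for name in signals) for i in range(length)]


def guilt_by_association(profile):
    """Annotate a profile with the reference signal it correlates with most."""
    return max(signals, key=lambda name: pearson(profile, signals[name]))


# A biopsy that does contain tumour signal, but mostly surrounding tissue.
biopsy = mix({"tumour": 0.3, "non-cancer tissue": 0.7})
label = guilt_by_association(biopsy)
print(label)
```

Even though 30% of this made-up biopsy is tumour signal, the annotation only reflects the dominant one. Separating the mixture into its components first means each signal can be looked at on its own.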

So, how does this involve a web application?

An excellent question, indeed! Once the model is trained and the data has been annotated, the problem is solved, is it not? Why, yes, but also no. Sure, this research group's problem has indeed been solved, but what about other researchers needing their mRNA expression profiles annotated? That's what the web application will be for. Researchers can upload their data and have the model annotate it before the annotated dataset is returned to them. The real beauty is in designing this system in such a way that components are easily interchangeable and querying the model can be done in an automated fashion. The former means that, for example, the model can easily be swapped out when a better one is trained. The latter could take the form of an API, enabling researchers with programming abilities to use the model in their own programs and scripts.
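Nothing has been built yet, so the following is only a sketch of what "interchangeable components" might look like in Python. Every name in it (`AnnotationModel`, `AnnotationService`, the two placeholder models) is hypothetical. The idea is that the service only relies on a model having a `predict` method, so any model that offers one can be dropped in.

```python
from typing import Dict, Protocol, Sequence


class AnnotationModel(Protocol):
    """Anything with a predict method can act as the model (hypothetical interface)."""

    def predict(self, profile: Sequence[float]) -> str:
        ...


class MeanThresholdModel:
    """Placeholder model: labels a profile by its average expression level."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, profile: Sequence[float]) -> str:
        mean = sum(profile) / len(profile)
        return "high expression" if mean > self.threshold else "low expression"


class ConstantModel:
    """Second placeholder model, used here only to demonstrate swapping."""

    def predict(self, profile: Sequence[float]) -> str:
        return "unknown"


class AnnotationService:
    """Annotates uploaded datasets with whichever model is currently plugged in."""

    def __init__(self, model: AnnotationModel) -> None:
        self._model = model

    def swap_model(self, model: AnnotationModel) -> None:
        self._model = model

    def annotate(self, dataset: Dict[str, Sequence[float]]) -> Dict[str, str]:
        return {sample_id: self._model.predict(profile) for sample_id, profile in dataset.items()}


service = AnnotationService(MeanThresholdModel(threshold=5.0))
uploaded = {"sample-1": [9.0, 8.0, 7.0], "sample-2": [1.0, 2.0, 0.5]}
first = service.annotate(uploaded)

# A better model comes along: swap it in without touching the service.
service.swap_model(ConstantModel())
second = service.annotate(uploaded)
print(first, second)
```

An API could then be a thin layer that accepts an uploaded dataset, calls something like `annotate`, and returns the result.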

Neat, when can I try it?

Well, my internship lasts until the end of January 2024, so I hope to be finished by then. If you're interested in following along, consider subscribing to the newsletter above to receive all my articles right in your inbox. If you're not into that, consider checking back in a few months to see how it's all coming along. In any case, thank you for making it to the end!