Training mRNA Language Models Across 25 Species for $165

seamossfet · 2026-04-04T18:49:42 1775328582

The problem with models like this is they're built on very little actual training data we can trace back to verifiable protein data. The protein data back, and other sources of training data for stuff like this, has a lot of broken structures in them and "creative liberties" taken to infer a structure from instrument data. It's a very complex process that leaves a lot for interpretation.

On top of that, we don't have a clear understanding on how certain positions (conformations) of a structure affect underlying biological mechanisms.

Yes, these models can predict surprisingly accurate structures and sequences. Do we know if these outputs are biologically useful? Not quite.

This technology is amazing, don't get me wrong, but to the average person they might see this and wonder why we can't go full futurism and solve every pathology with models like these.

We've come a long way, but there's still a very very long way to go.

stardust2 · 2026-04-04T21:17:22 1775337442

How do we get more verifiable protein data? So even if we had better data, we don't yet understand how the structure impacts the biology?

maziyar · 2026-04-01T20:38:33 1775075913

full article: https://huggingface.co/blog/OpenMed/training-mrna-models-25-...

pfisherman · 2026-04-04T20:11:57 1775333517

Nice work! Here is an article you may find helpful if you have not already come across it.[0]. You may also want to consider benchmarking against some non ML methods.[1]

0. https://pubmed.ncbi.nlm.nih.gov/35318324/

1. https://www.nature.com/articles/s41586-023-06127-z

xyz100 · 2026-04-04T15:18:34 1775315914

What makes this dataset or problem worth solving compared to other health datasets? Would the results on this task be broadly useful to health?

CyberDildonics · 2026-04-04T16:44:57 1775321097

What other "datasets" are you talking about? How do you "solve a dataset" ?

xyz100 · 2026-04-05T05:39:49 1775367589

You solve a dataset when you learn what there is to learn about the phenomenon of interest. The limit of such phenomenon is “cure all disease”, and clearly this is not solving that.

CyberDildonics · 2026-04-05T13:43:56 1775396636

What are you talking about? "the phenomenon of interest"? There is nothing you wrote in either comment that makes sense.

What is a "dataset" that has been "solved" and what did the program do that 'solved' it?

xyz100 · 2026-04-05T22:21:09 1775427669

MNIST (the number classification task) has been “solved” a billion times and it is hard to imagine any subsequent advances there as scores using a variety of methods have hit the saturation point of accuracy. Any further improvements are likely overfitting to noise. Therefore, we know that it is easy to detect handwritten numbers. However, we may not know how to detect other things as well, like reading an MRI. Those datasets/tasks are clearly different and require different techniques. Training an LLM is likewise different.

CyberDildonics · 2026-04-05T23:16:06 1775430966

has been “solved” a billion times

If it was really solved, wouldn't it just need to happen once?

You think classifying handwriting of 10 numbers is the same as this that took 55 hours of GPU time for someone to go through?

I have no idea what point you're trying to make and I can't tell if you do either. You were talking about "solving" other "health datasets" but you can't even come up with one or what that means.

xyz100 · 2026-04-06T18:43:11 1775500991

If you want to be literal with language, then do you ever really “solve” anything? Even tying your shoes is not solved. One day you may tie them better, but for practical purposes we can say it is solved.

Likewise, you can spend 55 hours of GPU time to produce very different things. Can those 55 hours cure cancer? Definitely not. Can it pick up correlations with a small subset of proteins that are perhaps not representative of practical problems? Probably. Can it learn a pattern to tie your shoes, given all your life experiences tying them? Sure.

I asked the question to determine what is the impact of the task and dataset. Curing cancer is huge, tying shoes is not. What are the strengths and limitations?

CyberDildonics · 2026-04-06T19:35:46 1775504146

If you want to be literal with language, then do you ever really “solve” anything?

You are the one who said it and you can't even explain what you meant, you just get mad that anyone would ask.

xyz100 · 2026-04-06T20:03:37 1775505817

Since I am hitting the reply depth: You “solve” a dataset or task when you translate some model into actual real world problems by creating a model that actually “works” (not just high accuracy). What is otherwise the point of training the model other than writing blog posts? Second to that, you can train a model that performs well on the dataset but is less useful in the real world.

This is a health dataset, there are many inputs and outputs to health (e.g., cell level, protein level, tumors, organs, etc.). In this case, it is mRNA focused, which is a broad category that translates to potentially immune responses like vaccines (exactly what kind of therapy, I’m not sure other than “25 species”). Once the model is trained, you can use it to solve real problems, perhaps to develop a therapy that makes its way to clinical trials and eventually actually treats some disease. The model by itself is useless without the ability to have that impact.

So for other examples, take any disease (e.g., Covid19), create a dataset to mirror that problem using some technique (e.g., Covid19 mRNA prediction of some sort), and solve it to create a treatment (e.g., get a safe and effective vaccine). Obviously, you can say the vaccine can be improved so it is not “solved”, but most people would be quite happy with a “almost cure for cancer” even if it wasn’t literally optimal (we don’t even know if a cure for cancer is possible).

My suggestion and question to the author is to outline what is the implications of the work rather than focusing on accuracy statistics that are meaningless without such context.

basyt · 2026-04-06T03:25:35 1775445935

yeah lol no shit. lets not get bothered by reactionaries...

nradclif · 2026-04-05T03:57:46 1775361466

"Complete results, architectural decisions, and runnable code below."

This is a weird post, there doesn't seem to be any "below" here. Another comment linked the article: https://huggingface.co/blog/OpenMed/training-mrna-models-25-...

justinclift · 2026-04-05T17:00:30 1775408430

Yeah. Things like "Complete results, architectural decisions, and runnable code below." is literally how AI outputs stuff, so I'd expect the post was AI written too. :(

rubicon33 · 2026-04-04T16:11:19 1775319079

Can someone explain what one might use this model for? As a developer with a casual interest in biology it would be fun to play with but honestly not sure what I would do

colechristensen · 2026-04-04T16:25:41 1775319941

You can get your feet wet with genetic engineering for surprisingly little money.

This guy shows a lot of how it's done: https://www.youtube.com/@thethoughtemporium

Basically you can design/edit/inject custom genes into things and see real results spending on the scale of $100-$1000.

com2kid · 2026-04-05T00:41:44 1775349704

We actually did this in my highschool genetics class back in 1999! We made bacteria change color by splicing in a gene. Awesome stuff.

The (public!) school had a grant from one of Seattle's biotech boom companies.

someuser54541 · 2026-04-04T16:40:02 1775320802

Is there something like this in text/readable format?

_zoltan_ · 2026-04-04T19:12:11 1775329931

My main concern is using fungi. If it ends up in my lungs I'm most likely screwed, right?

nurettin · 2026-04-04T20:02:00 1775332920

Yes, but most students produce their best work while infected.

colechristensen · 2026-04-04T20:20:27 1775334027

This is the classic meme https://www.reddit.com/r/labrats/comments/mmv2ig/lab_strains...

Lab strains of things tend to be extremely sensitive and not human adapted. You shouldn't study and modify human-infecting organisms in your basement anyway. While you shouldn't ignore protective equipment and proper procedure... paranoia about infecting yourself with a lab leak isn't warranted.

_zoltan_ · 2026-04-05T19:04:24 1775415864

I'd love to experiment with this stuff, just literally have no idea how it would be safe to start.

jazzpush2 · 2026-04-05T06:36:26 1775370986

A Codon-based model is cool. I know NVIDIA is building quite a large one.

At GTC they showed an SAE they built on a smaller version of it, allowing you to see what their model learned: https://research.nvidia.com/labs/dbr/blog/sae/

dhruv3006 · 2026-04-05T03:39:42 1775360382

Interesting work - Looks like AI for science is having it's day right now.

khalic · 2026-04-04T15:28:22 1775316502

> In Progress: CodonJEPA

JEPA is going to break the whole industry :D

digdugdirk · 2026-04-04T15:50:21 1775317821

Can you explain this? I haven't heard of JEPA, and from a quick search it seems to be vision/robotics based?

khalic · 2026-04-04T16:42:16 1775320936

It’s a self supervised learning architecture, and it’s pretty much universal. The loss function runs on embeddings, and some other smart architectural choices allover. Worth diving into for a few hours, Yann LeCun gives some interesting talks about it

lukeinator42 · 2026-04-04T15:51:31 1775317891

https://openreview.net/pdf?id=BZ5a1r-kVsf

colingauvin · 2026-04-04T20:07:59 1775333279

HN's blindspots never cease to amaze me.

I am a structural biologist working in pharmaceutical design and this type of thing could be wildly useful (if it works).

justinclift · 2026-04-05T17:07:15 1775408835

Blind spot?

simianwords · 2026-04-04T15:28:27 1775316507

What makes these Domain specific models work when we don’t have good domain models for health care, chemistry, economics and so on

colechristensen · 2026-04-04T16:26:20 1775319980

>we don’t have good domain models for health care, chemistry, economics and so on

Who says we don't?

simianwords · 2026-04-04T16:38:31 1775320711

Examples please?

colechristensen · 2026-04-04T17:13:43 1775322823

No, it's really simple to search for domain specific models being used "in production" all over the place

simianwords · 2026-04-04T17:16:42 1775323002

I didn’t find a single one that outperforms a general model.

colechristensen · 2026-04-04T17:56:53 1775325413

Ok, alphafold.

simianwords · 2026-04-04T18:01:21 1775325681

It’s not a large language model

yieldcrv · 2026-04-04T16:12:41 1775319161

Distributing the load on this will probably be infinitely more useful than “folding at home”

HocusLocus · 2026-04-04T15:15:53 1775315753

gray goo of the future

skyskys · 2026-04-04T18:14:37 1775326477

hmmmm seems like some fake hype.