Chris Miller
chrismiller.science
I study cancer at Washington University in St Louis. Cancer Genomics, Bioinformatics, Data Viz, Tumor Evolution, AML, Immunotherapy, Irreverent humor 🧬 🖥️ mostly @chrisamiller on other platforms
Apologies - I misread that and I've deleted the post. FWIW, if we figure 4-5 characters per word, it becomes more like 20-30 years. In any case, I'm all for new ways to express just how big the genome is!
November 2, 2025 at 1:15 AM
Well, this joke aged poorly 😆
bsky.app/profile/chri...

Obviously not ideal in many situations, and thanks for putting an explainer out there, Claus!
When you want to do reproducible analysis in R, some packages require you to set an RNG seed. I'm not sure I trust anyone who doesn't immediately run `set.seed(42)`
October 22, 2025 at 2:06 PM
Reposted by Chris Miller
he's going to come after you either way, so you might as well act with integrity
October 8, 2025 at 11:47 PM
I can't ever see the MTHFR gene symbol without doing a double take
October 7, 2025 at 7:10 PM
The lessons here:
1) Many gene names are stupid.
2) Edge cases may be rare, but they often matter. (TP53 is a key cancer gene that wouldn't be accessible without some special accommodations here).
3) As always, check your assumptions!
(fin)
October 6, 2025 at 6:54 PM
For our little internal app, this probably won't matter much, and I will either set the number of records to 200 (because we generate almost no traffic) or might code up something that dynamically decides how many records to return, based on which genes are in the input data. (8/n)
October 6, 2025 at 6:54 PM
For those who are interested, the plot showing the cumulative percentage of human HUGO gene names (from Ensembl protein-coding genes) covered by a set number of records looks like this. So 8 results cover 99% of genes, 34 results cover 99.9% of genes, and it takes 199 to cover everything. (7/n)
October 6, 2025 at 6:54 PM
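A rough sketch of how coverage numbers like these could be estimated offline. This is not the method used in the thread (that was bash/grep, and the details aren't shown). It assumes, as a simplification, that a symbol's competitors are the other symbols that start with it, which only approximates how mygene.info actually ranks hits. The function names `records_needed` and `size_for_coverage` are hypothetical.

```python
import math


def records_needed(symbols):
    """For each symbol, count the symbols (itself included) that start with it.

    Assumption: prefix matches stand in for the hits a search could return
    ahead of the exact symbol. Real search ranking may differ.
    """
    unique = sorted(set(symbols))
    needed = {}
    for i, symbol in enumerate(unique):
        # In sorted order, everything sharing this prefix sits right after it.
        j = i
        while j < len(unique) and unique[j].startswith(symbol):
            j += 1
        needed[symbol] = j - i
    return needed


def size_for_coverage(needed, fraction):
    """Smallest result size that covers at least `fraction` of the symbols."""
    counts = sorted(needed.values())
    index = max(math.ceil(fraction * len(counts)) - 1, 0)
    return counts[index]


# Toy input for illustration only, not real coverage data.
toy = ["AR", "ARF1", "ARG1", "BRCA1", "TP53", "TP53TG5"]
needed = records_needed(toy)
```

With a table like `needed`, the dynamic option is just `max(needed[g] for g in input_genes)`: request only as many records as the worst-case gene in the input requires.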
So in order to guarantee that we'll get "AR" in the list, the value should be 200 records, which seems excessive. My instinctual guess of 30 wasn't bad, and covers 99.89% of gene names, but that's not all of them! (6/n)
ALT: a group of Pokémon standing next to each other with "gotta catch 'em all" written on the bottom (GIF via media.tenor.com)
October 6, 2025 at 6:54 PM
It introduces a new question, though - this failed on TP53 with 10 results, so how many results need to be returned to handle all genes correctly? A few seconds of bash/grep later, I get the following list of 21 genes that will still fail. (5/n)
October 6, 2025 at 6:54 PM
After some digging, it turns out that mygene.info has a default max of 10 records returned for each query, and the first 10 hits include genes like "TP53TG5", "TP53TG3F", "TP53RK-DT", but not "TP53" itself. Adding "&size=30" to the query allows it to return 30 hits, which solved this problem. (4/n)
October 6, 2025 at 6:54 PM
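A minimal sketch of the fix described above, for anyone who wants to try it. The query string in the thread is truncated, so the URL below is an assumption built from the public mygene.info v3 query endpoint and its `q`, `species`, `fields`, and `size` parameters; it is not necessarily the exact query the app uses. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; the thread's own URL is truncated and is not reconstructed here.
BASE_URL = "https://mygene.info/v3/query"


def build_query_url(symbol, size=10, species="human"):
    """Build a query URL; `size` caps how many hits the service returns."""
    params = {"q": symbol, "species": species, "fields": "symbol", "size": size}
    return f"{BASE_URL}?{urlencode(params)}"


def has_exact_symbol(response, symbol):
    """True if any hit's symbol matches exactly, not merely as a prefix."""
    return any(hit.get("symbol") == symbol for hit in response.get("hits", []))


def fetch(symbol, size=10):
    """Run the query over the network and parse the JSON response."""
    with urlopen(build_query_url(symbol, size=size)) as resp:
        return json.load(resp)
```

Calling `has_exact_symbol(fetch("TP53", size=30), "TP53")` checks whether the exact gene made it into the hits. It is worth checking explicitly, because a response full of near-matches looks perfectly healthy at a glance.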
But when I manually tried the query string - something like mygene.info/v3/query?spe... - TP53 didn't appear in the returned JSON. I know that's not right! (3/n)
October 6, 2025 at 6:54 PM