Alex Lovell-Troy
@lovelltroy.org
Open Source HPC. Formerly cloudy and data-y. Before that astronomy.

Twilio was fun, huh?
Pinned
Hello again.

You may know me from that slack we’re in together.

My job is to build an Open Source community around secure management of on-premise HPC.

OpenCHAMI.org
OpenCHAMI
The OpenCHAMI consortium represents some of the largest HPC operators in the world. Our goal is to build a community of sysadmins and operators around modular system management for all scales of syst...
There’s still room for a few more at the #ISC25 OpenCHAMI tutorial tomorrow. Come join us and learn why the community is growing so fast! #HPC

app.swapcard.com/event/isc-hi...
June 12, 2025 at 6:58 AM
Want to move beyond xCAT with a provisioner that’s ready for Confidential Computing?

I’ll be at #ISC25 this week talking about OpenCHAMI.

Free and Open Source with a growing community.

openchami.org
June 8, 2025 at 11:00 AM
Reposted by Alex Lovell-Troy
I’ve been working on OpenCHAMI for a couple of years now. This is an exciting step for the community!
OpenCHAMI has officially joined the High Performance Software Foundation (HPSF) 🚀 OpenCHAMI is a cloud-like HPC provisioning & management toolkit for on-premise systems—whether you're managing a few nodes or a full-scale supercomputer. Learn more: hpsf.io/blog/2025/op...
April 10, 2025 at 3:21 PM
This is the worst thing I can tell you about Japan.
I spent a week in Japan with wet hands before anyone told me that I needed to carry my own hand towel. Apparently they’ve been pulling this shit on foreigners for centuries.
January 28, 2025 at 12:23 AM
I spent a week in Japan with wet hands before anyone told me that I needed to carry my own hand towel. Apparently they’ve been pulling this shit on foreigners for centuries.
January 27, 2025 at 11:17 PM
Reposted by Alex Lovell-Troy
Inside of you there are two wolves. Inside each wolf there are zero, one, or two wolves. Write a function to rebalance an arbitrary wolftree B such that it has minimal depth. The function should execute in O(log n) time. Show your work.
January 14, 2025 at 3:19 AM
Reposted by Alex Lovell-Troy
🌟 Hello BlueSky! 🌟

We’re Honeycomb, the observability platform for teams who manage software that matters. Send any data to our one-of-a-kind data store, solve problems with all the relevant context, and fix issues before your customers find them.
December 5, 2024 at 9:21 PM
Reposted by Alex Lovell-Troy
Moving from cloud #SRE to #HPC often means recalibrating what metrics matter.

Time to job launch?
Time to completion?
Mean time to job failure?
Time to snapshot recovery?

Cloud makes node loss a non-event. HPC typically doesn’t work that way.
Let’s make #HPC cool again.

I’ll start

Your five nines of uptime can go suck it. HPC workloads need 100% completion 100% of the time no matter how many nodes or network connections fail.

HPC is the OG king of resiliency.
November 30, 2024 at 2:20 PM
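A rough sketch of how the job-centric metrics named in the post above might be computed from scheduler accounting data. The `JobRecord` fields and the `summarize` helper are hypothetical names for illustration, not any scheduler's API. "Mean time to job failure" is assumed here to mean total run time across all jobs divided by the number of failed jobs; sites may define it differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class JobRecord:
    # Hypothetical accounting record; real schedulers expose different fields.
    submitted: datetime
    started: datetime
    ended: datetime
    succeeded: bool


def summarize(jobs):
    """Return job-centric metrics, in seconds, for a non-empty list of jobs."""
    # Time to job launch: how long each job waited in the queue.
    launch = [(j.started - j.submitted).total_seconds() for j in jobs]
    # Time to completion: submit-to-finish, successful jobs only.
    done = [(j.ended - j.submitted).total_seconds() for j in jobs if j.succeeded]
    # Assumed definition: total run time divided by the number of failures.
    run_total = sum((j.ended - j.started).total_seconds() for j in jobs)
    failures = sum(1 for j in jobs if not j.succeeded)
    return {
        "mean_time_to_launch_s": mean(launch),
        "mean_time_to_completion_s": mean(done) if done else None,
        "mean_time_to_job_failure_s": run_total / failures if failures else None,
    }


t0 = datetime(2024, 11, 30, 12, 0, 0)
jobs = [
    # Waited 60 s, ran for an hour, finished cleanly.
    JobRecord(t0, t0 + timedelta(seconds=60), t0 + timedelta(seconds=3660), True),
    # Waited 120 s, ran for 30 minutes, then failed.
    JobRecord(t0, t0 + timedelta(seconds=120), t0 + timedelta(seconds=1920), False),
]
summary = summarize(jobs)
```

None of these numbers depend on node uptime, which is the contrast the post draws with cloud-style availability targets.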
Reposted by Alex Lovell-Troy
GW4 Isambard 3 #Supercomputer is officially online🎉🧠 !

Part of a collaboration between the universities of Bath, Bristol, Cardiff and Exeter, alongside partners HPE, NVIDIA and Arm, Isambard 3 will push the boundaries of science.

🔗 https://buff.ly/4g7HtMK

December 4, 2024 at 8:05 AM
I once had to email someone with an important corporate email address.

Last Name: Fuchs
First Initial: E
Inexplicable Extra Letter: X

That’s right, fuchsex was his official email address.

I often wonder why the X.
December 3, 2024 at 11:09 PM
“It’s like watching someone unlock a padlock on a wrench so they can use it to drive a nail”

Why?

“The padlock doesn’t fit the hammer.”
December 2, 2024 at 8:38 PM
Every single time I’ve been to Bristol, the weather has been fantastic. Highly recommend.
Bristol looking lovely in the winter sun today. Walking from the University to meet the Brunel Archive to explore collaboration around Isambard-AI. #HPC BriCS
December 2, 2024 at 2:23 PM
Reposted by Alex Lovell-Troy
I spent the Thanksgiving break typing up my notes from #SC24 which I've posted online. 30% more words than my notes from SC23 (sorry!). Feedback is welcome!

https://buff.ly/41fBhho

#HPC
SC'24 recap
The premier annual conference of the high-performance computing community, SC24, was held in Atlanta last week, and it attracted a reco...
December 2, 2024 at 7:53 AM
Reposted by Alex Lovell-Troy
This is, technically, a sandwich.
December 1, 2024 at 5:51 PM
Lotsa fake Astronomy photos on Bluesky these days. Just remember, if they’re not credited, it’s not credible.

NASA has the original feed and does a good job of curation.

apod.nasa.gov/apod/
Astronomy Picture of the Day
A different astronomy and space science related image is featured each day, along with a brief explanation.
December 1, 2024 at 5:20 PM
As it turns out, when the Thanksgiving pies don’t last all weekend, you’re allowed to make more pie. Who’s going to stop you?

🥧 Maple Pumpkin
🥧 Bourbon Apple
November 30, 2024 at 9:51 PM
Reposted by Alex Lovell-Troy
I should write a bittorrent client
November 30, 2024 at 7:26 AM
Tail latency has entered the chat!
I'd also recommend looking at these metrics broken down by user/project, and try to make sure your 1% least reliable subset is still doing ok, or at least getting support, since failures are often not evenly distributed.

I really like this post on the topic: rachelbythebay.com/w/2019/07/15...
November 30, 2024 at 4:58 PM
Reposted by Alex Lovell-Troy
I'd also recommend looking at these metrics broken down by user/project, and try to make sure your 1% least reliable subset is still doing ok, or at least getting support, since failures are often not evenly distributed.

I really like this post on the topic: rachelbythebay.com/w/2019/07/15...
November 30, 2024 at 4:57 PM
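A minimal sketch of the per-user breakdown suggested in the post above, assuming job outcomes are available as `(user, succeeded)` pairs. The function names and the sample data are made up for illustration. The point is that a healthy-looking overall completion rate can hide a user whose jobs mostly fail.

```python
from collections import defaultdict


def completion_rate_by_user(jobs):
    """Map each user to the fraction of their jobs that succeeded."""
    totals = defaultdict(int)
    ok = defaultdict(int)
    for user, succeeded in jobs:
        totals[user] += 1
        if succeeded:
            ok[user] += 1
    return {user: ok[user] / totals[user] for user in totals}


def least_reliable(rates, fraction=0.01):
    """Return the worst `fraction` of users by completion rate (at least one)."""
    ranked = sorted(rates.items(), key=lambda item: item[1])
    count = max(1, int(len(ranked) * fraction))
    return ranked[:count]


# Made-up sample: one heavy user with few failures, one light user with many.
jobs = (
    [("alice", True)] * 98
    + [("alice", False)] * 2
    + [("bob", True)] * 1
    + [("bob", False)] * 3
)

rates = completion_rate_by_user(jobs)
overall = sum(1 for _, succeeded in jobs if succeeded) / len(jobs)
worst = least_reliable(rates)
```

Here the overall rate is 99 of 104 jobs, while the least reliable user completes only one job in four. The same grouping could be done by project instead of user.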
Reposted by Alex Lovell-Troy
I have a half-written blog post about this that I should finish sometime.

I haven’t seen an SLO framework broadly adopted in HPC, but some sites adopt metrics like:

- % nodes up
- Scheduler RPC latency
- FS latency and BW
- Performance on standard benchmarks, either after maintenance or weekly
November 30, 2024 at 3:43 PM
GitHub's hosted attestation tooling is really amazing to work with.

github.com/actions/atte...
November 30, 2024 at 1:48 PM
Another fun experiment following #SC24 has been attested custom builds of iPXE binaries. Thanks to @felicitas.pojtinger.com for the original repo.

Working on how to make it easier for others to use.

github.com/OpenCHAMI/ip...
Attestations · OpenCHAMI/ipxe-binaries
Weekly builds of https://ipxe.org/, with an embedded script that chainloads /config.ipxe.
November 30, 2024 at 1:46 PM
Inspired by conversations at #SC24, I've started experimenting with using BitTorrent for loading the #HPC kernel and initrd on large clusters.

Does anyone else have prior art for this that I couldn't find? Papers? Alternatives?

github.com/OpenCHAMI/ar...
GitHub - OpenCHAMI/aria2-initrd: experimental repository for using aria2 to download the kernel/initrd over bittorrent
November 30, 2024 at 1:23 PM