Twilio was fun, huh?
You may know me from that Slack we’re in together.
My job is to build an open source community around secure management of on-premises HPC.
OpenCHAMI.org
app.swapcard.com/event/isc-hi...
I’ll be at #ISC25 this week talking about OpenCHAMI.
Free and Open Source with a growing community.
openchami.org
We’re Honeycomb, the observability platform for teams who manage software that matters. Send any data to our one-of-a-kind data store, solve problems with all the relevant context, and fix issues before your customers find them.
Time to job launch?
Time to completion?
Mean time to job failure?
Time to snapshot recovery?
Cloud makes node loss a non-event. HPC typically doesn’t work that way.
I’ll start
Your five nines of uptime can go suck it. HPC workloads need 100% completion 100% of the time no matter how many nodes or network connections fail.
HPC is the OG king of resiliency.
Part of a collaboration between the universities of Bath, Bristol, Cardiff and Exeter, alongside partners HPE, NVIDIA and Arm, Isambard 3 will push the boundaries of science.
🔗 https://buff.ly/4g7HtMK
Last Name: Fuchs
First Initial: E
Inexplicable Extra Letter: X
That’s right: fuchsex was his official email address.
I often wonder why the X.
Why?
“The padlock doesn’t fit the hammer.”
https://buff.ly/41fBhho
#HPC
NASA has the original feed and does a good job of curation.
apod.nasa.gov/apod/
🥧 Maple Pumpkin
🥧 Bourbon Apple
I really like this post on the topic: rachelbythebay.com/w/2019/07/15...
I haven’t seen an SLO framework broadly adopted in HPC, but some sites track metrics like:
- % nodes up
- Scheduler RPC latency
- FS latency and BW
- Performance on standard benchmarks, either after maintenance or weekly
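The first of those metrics is easy to sketch. Here's a minimal, hypothetical illustration of computing "% nodes up" from Slurm-style `sinfo` summary lines (partition, state, node count); the state names and sample output are illustrative assumptions, not from any specific site.

```python
# Hypothetical sketch: "% nodes up" from Slurm-style sinfo summary output.
# The healthy-state set and sample lines below are assumptions for
# illustration, not a definitive site policy.

def percent_nodes_up(sinfo_lines):
    """Sum nodes in healthy states and divide by the total node count."""
    healthy = {"idle", "alloc", "mix"}
    up = total = 0
    for line in sinfo_lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip the header row or malformed lines
        state, nodes = parts[1].rstrip("*"), int(parts[2])
        total += nodes
        if state in healthy:
            up += nodes
    return 100.0 * up / total if total else 0.0

sample = [
    "PARTITION STATE NODES",
    "compute idle 120",
    "compute alloc 70",
    "compute down 10",
]
print(percent_nodes_up(sample))  # 190 of 200 healthy -> 95.0
```

In practice you'd feed this the output of `sinfo -h -o "%P %t %D"` (or query the scheduler's API) and export the number to whatever dashboard the site already runs.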
Working on how to make it easier for others to use.
github.com/OpenCHAMI/ip...
Does anyone else have prior art for this that I couldn't find? Papers? Alternatives?
github.com/OpenCHAMI/ar...