Nawaf Alageel
@nawafalageel.bsky.social
Trying to teach computers how to see through math.
Yes, in some cases CPU and GPU users can be different. However, that wasn't the issue here. Oversubscribing CPU memory was a concern initially, but again, we never actually ran into it.
July 19, 2025 at 3:35 PM
Yes, that was the main setup. But soon the question came up: if Project X (container/person) isn’t using GPUs this week, shouldn’t we allocate them to other projects? From then on, allocation/monitoring moved from node level to container level, tied directly to a person or project.
July 16, 2025 at 5:08 PM
Yep yup, you are right! The team used to run "--gpus device=0,2", and we would tag each container with the GPU IDs in its name. But as the workload grew, they started running "--gpus all". That’s when things started getting messy.
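Roughly, the two styles look like this (container names and the CUDA image tag are just placeholders; note the extra quoting the device list needs so Docker doesn't split on the comma):

```
# Old approach: pin the container to specific GPUs and encode them in its name.
docker run -d --gpus '"device=0,2"' --name projx-gpu0-2 nvidia/cuda:12.2.0-base-ubuntu22.04 sleep infinity

# Later approach: hand every GPU to the container; nothing in the name or the
# Docker CLI tells you which GPUs it actually ends up using.
docker run -d --gpus all --name projx-all nvidia/cuda:12.2.0-base-ubuntu22.04 sleep infinity
```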
July 16, 2025 at 11:16 AM
You can find the tool that I built here: github.com/nawafalageel...

I hope it helps you squeeze more out of your expensive GPUs.

#Nvidia #GPU #Docker #MachineLearning #MLOps #CloudDev #DataScience #OpenSource #LLM #AI #GenAI #AIagents
July 15, 2025 at 11:39 AM
Now, instead of guessing or jumping through hoops to find the answer, the tool can tell us:
"Container X is occupying 12GB on GPU #1 with Y memory utilization"

We went from blindfolded resource tracking to actual insight.

And our question is finally answered 🥳🎉
July 15, 2025 at 11:39 AM
- Nvidia tools (e.g., nvidia-smi) show processes, but not container names.
- Docker tools (e.g., docker stats) show CPU and memory, but no GPU data.

We would still be blindfolded. And our question is not answered yet!
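A rough sketch of how the two views can be stitched together by hand (not the actual tool, just the idea), assuming a Linux host where a container's ID shows up in its processes' /proc/<pid>/cgroup paths: ask nvidia-smi for every GPU compute process, then match each PID's cgroup against the container IDs from docker ps.

```
#!/usr/bin/env bash
# Sketch: join nvidia-smi's per-process view with Docker's container names.
# Assumes a Linux host where the owning container's ID appears in each
# process's /proc/<pid>/cgroup path (true for typical Docker setups).

# Running containers: full ID plus name.
containers=$(docker ps --no-trunc --format '{{.ID}} {{.Names}}')

# Every GPU compute process: PID, GPU UUID, and the memory it holds (MiB).
nvidia-smi --query-compute-apps=pid,gpu_uuid,used_memory \
           --format=csv,noheader,nounits |
while IFS=', ' read -r pid uuid mem; do
    cgroup=$(cat "/proc/$pid/cgroup" 2>/dev/null)
    owner="host process $pid"
    while read -r id name; do
        # The container whose ID appears in the cgroup path owns this PID.
        if [[ -n "$id" && "$cgroup" == *"$id"* ]]; then
            owner="$name"
            break
        fi
    done <<< "$containers"
    echo "$owner is using ${mem} MiB on $uuid"
done
```

Mapping the GPU UUID back to an index and adding utilization, like the output shown above, is left out to keep the sketch short.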
July 15, 2025 at 11:39 AM
When it comes to monitoring GPU usage in a containerized environment, both Nvidia and Docker provide good out-of-the-box tools, but they don't talk to each other.

Neither of them can answer my simple question:
"Which container uses which GPU?"
July 15, 2025 at 11:39 AM