rajith77.bsky.social
🎯 RED metrics aren’t just another dashboard—they’re the foundation for SLO-driven monitoring and proactive issue detection.

Start small. Instrument a few key endpoints.
Build SLOs. Alert on what matters.

Monitor what your users feel—not just what your servers see.
August 15, 2025 at 5:29 PM
💡 Bonus: If you're using OpenTelemetry, you're already halfway there.

✅ Enable span metrics

✅ Export them to Prometheus or your observability backend

✅ Start building dashboards that track what users actually care about
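
To make the span-metrics idea concrete, here is a conceptual Python sketch of turning finished spans into RED-style counters and a latency histogram. It is not the OpenTelemetry Collector's implementation; the span fields (`service`, `name`, `status`, `duration_ms`) and the bucket bounds are assumptions chosen for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical histogram upper bounds in milliseconds; the last slot is overflow.
BUCKETS_MS = [5, 10, 25, 50, 100, 250]


def spans_to_red(spans):
    """Aggregate span records into per-(service, span name) RED metrics."""
    metrics = defaultdict(
        lambda: {"calls": 0, "errors": 0, "buckets": [0] * (len(BUCKETS_MS) + 1)}
    )
    for span in spans:
        entry = metrics[(span["service"], span["name"])]
        entry["calls"] += 1                      # Rate: count every span
        if span["status"] == "ERROR":
            entry["errors"] += 1                 # Errors: count failed spans
        # Duration: bucket i holds spans with duration <= BUCKETS_MS[i]
        entry["buckets"][bisect_left(BUCKETS_MS, span["duration_ms"])] += 1
    return dict(metrics)


spans = [
    {"service": "orders", "name": "POST /orders", "status": "OK", "duration_ms": 7},
    {"service": "orders", "name": "POST /orders", "status": "ERROR", "duration_ms": 60},
    {"service": "orders", "name": "POST /orders", "status": "OK", "duration_ms": 300},
]
red = spans_to_red(spans)
```

In a real setup the Collector or your backend does this aggregation for you; the point is only that spans already carry everything RED needs.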
August 15, 2025 at 5:29 PM
Let’s take an example:

Say you run an order service, and you want

📌 <5% error rate

📌 99% of orders complete in <50ms

With RED metrics, you can:

▪️Detect when latency spikes or errors increase

▪️Set meaningful alerts tied to user experience

▪️Debug faster with clear SLOs and performance signals
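
A minimal sketch of checking the two order-service targets above against one window of data. The function names and the nearest-rank percentile method are assumptions for illustration; a real backend would typically compute the percentile from histograms instead.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def check_order_slos(durations_ms, error_count,
                     max_error_ratio=0.05, p99_target_ms=50.0):
    """Return whether the error-rate and latency targets hold for the window."""
    p99 = percentile(durations_ms, 99)
    error_ratio = error_count / len(durations_ms)
    return {
        "error_ratio": error_ratio,
        "p99_ms": p99,
        "error_slo_met": error_ratio < max_error_ratio,   # <5% error rate
        "latency_slo_met": p99 < p99_target_ms,           # 99% under 50ms
    }


# 100 orders: 99 fast, 1 slow, 4 failures.
result = check_order_slos([10.0] * 99 + [200.0], error_count=4)
```

Here one slow order out of 100 still meets the latency target, because only the fastest 99% need to finish in under 50ms.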
August 15, 2025 at 5:29 PM
That’s where RED Metrics come in:

▪️Rate – how many requests are happening

▪️Errors – how many of them are failing

▪️Duration – how long they take (latency)
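
As a rough illustration, the three signals can be computed from a window of request records like this. The `Request` type and its fields are hypothetical, and the median stands in for whichever latency statistic you prefer.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_metrics(requests, window_seconds):
    """Compute Rate, Errors and Duration for one time window."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": total / window_seconds,                 # Rate
        "error_ratio": failed / total if total else 0.0,      # Errors
        "median_ms": median(r.duration_ms for r in requests)  # Duration
        if total else None,
    }


window = [Request(10, True), Request(20, True), Request(30, False), Request(40, True)]
summary = red_metrics(window, window_seconds=2)
```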
August 15, 2025 at 5:29 PM
We implemented log monitoring using the same concept. Because the logs are analyzed locally, it saves a ton of money and also detects issues faster.
www.youtube.com/watch?v=R2tP...
How to catch Recurring Issues faster with In-cluster Log Analysis
YouTube video by Randoli
August 15, 2025 at 4:26 PM
Our solution is based on federated control planes, and it's starting to resonate with customers.

Apart from the obvious cost savings, there are other advantages.
A smart agent can process the data locally and identify issues faster.
August 15, 2025 at 4:26 PM
Going through this exercise helps you understand which telemetry data you really need to run your operations, so you can remove the rest. This will reduce your cost, improve customer experience, and reduce the burden on your SRE teams.
August 12, 2025 at 2:56 AM
In a nutshell, high-quality telemetry helps you proactively identify & manage your customers' experience, while low-value telemetry adds noise & increases your cost.
August 12, 2025 at 2:56 AM
💡 6. BONUS - Think about keeping telemetry data local & retrieving it on demand, i.e., separating the control plane from the data plane.
August 12, 2025 at 2:56 AM
📌 5. Think about an information hierarchy when building dashboards and monitoring. You only need the details when you need to drill down.
August 12, 2025 at 2:56 AM
📌 4. Aggressively filter, transform, and aggregate to create high-quality telemetry. Delete low-value metrics.
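
One way to picture the aggregation step: drop the labels you never query on and sum what remains. This is a sketch with made-up label names (`service`, `status`, `pod`), and summing only makes sense for counter-style values.

```python
from collections import defaultdict


def aggregate(samples, keep_labels):
    """Sum counter samples after dropping every label not in keep_labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep_labels))
        totals[key] += value
    return dict(totals)


samples = [
    ({"service": "orders", "status": "200", "pod": "orders-a"}, 120.0),
    ({"service": "orders", "status": "200", "pod": "orders-b"}, 80.0),
    ({"service": "orders", "status": "500", "pod": "orders-a"}, 3.0),
]
# Three per-pod series collapse into two per-service series.
rolled_up = aggregate(samples, keep_labels={"service", "status"})
```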
August 12, 2025 at 2:56 AM
📌 3. High-cardinality metrics can provide high visibility into your customer experience/SLO, but they are expensive, so choose wisely.
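
A quick back-of-the-envelope sketch of why cardinality gets expensive: in the worst case, the number of time series for one metric is the product of the distinct values of each label. The label names and counts below are invented for illustration.

```python
from math import prod


def max_series(label_cardinalities):
    """Worst-case series count for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


base = {"endpoint": 20, "status_code": 5, "region": 3}
with_customer = {**base, "customer_id": 10_000}

base_series = max_series(base)                    # 20 * 5 * 3
customer_series = max_series(with_customer)       # base * 10,000
```

Adding a single per-customer label multiplies the worst case by the number of customers, which is why it pays to add such labels only where an SLO actually needs them.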
August 12, 2025 at 2:56 AM
📌 2. Figure out how you can map (aggregate, enrich, correlate) your service-level telemetry to measure & monitor those SLOs.
August 12, 2025 at 2:56 AM
💡The platform did its job: it made building and deploying applications so easy that the newer interns were fully insulated from having to work at the Kubernetes level.

In general, this is not a downside. For us it was, because our team required deep Kubernetes knowledge to build our products.
July 23, 2025 at 12:15 PM
📌After one year, there was a clear difference in Kubernetes knowledge between the interns we hired before the platform existed and those hired once it was in place.
July 23, 2025 at 12:15 PM
We eat our own dog food. In fact, some of the features and visualisations are motivated by our own observability needs.
www.randoli.io/product/kube...
July 11, 2025 at 10:04 PM