The Platform
Case Study
Creating Operational Visibility in a Rapidly Scaling IoT Platform
-
As the company grew from a startup into one of Europe’s largest residential Virtual Power Plants, the complexity of the platform outpaced its operational visibility.
Hundreds of thousands of connected devices interacted with cloud infrastructure, mobile applications, and energy market services. The organisation was expanding rapidly, engineering teams were shipping new capabilities, and customer adoption continued to accelerate.
Yet understanding what was happening across the platform remained surprisingly difficult.
Customer complaints were often treated as platform-wide incidents. Investigations relied heavily on Slack conversations, fragmented logs, and individual expertise. Different teams had visibility into different parts of the system, but no shared understanding of how the platform behaved as a whole.
The challenge was not a lack of data. The challenge was a lack of visibility.
Without a clear view of system behaviour, diagnosing issues was slow, ownership was often unclear, and operational decisions relied more on assumptions than evidence. As the platform prepared for significant growth, this way of operating was becoming unsustainable.
-
The solution was not a dashboard. It was the creation of a complete operational picture.
A platform-wide observability capability was established, bringing together backend services, frontend applications, infrastructure components, and IoT devices into a common monitoring framework. Logs, traces, telemetry, and operational events became accessible through a shared view of platform behaviour.
Alongside technical observability, a second layer of visibility was introduced through operational and product analytics.
Observability
Backend logs and distributed traces
Frontend monitoring and error tracking
Infrastructure monitoring
IoT device telemetry and health monitoring
Operational investigation tooling
Product & Operational Analytics
Product usage and adoption
Feature performance monitoring
Installation quality metrics
Device health indicators
Configuration drift analysis
Customer behaviour insights
Leadership performance reporting
For the first time, platform behaviour was visible end-to-end, and product teams, engineering teams, operations, customer success, and leadership were working from the same reality.
-
As visibility improved, a second problem became obvious.
Many incidents were no longer difficult to identify, but the organisation lacked a consistent way to respond to them.
Visibility alone does not improve reliability. Ownership does.
A structured incident management framework was introduced to ensure that issues could be consistently identified, owned, and resolved.
Incident Management
Incident severity definitions
Ownership models
Escalation paths
Service level expectations
Operational runbooks
Structured postmortems
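To make the framework concrete, the sketch below shows one way severity definitions, ownership, and escalation paths could be expressed in code. It is illustrative only: the severity names, the customer-share thresholds, the owner roles, and the acknowledgement windows are all assumptions made for this example, not the definitions the organisation actually used.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    # Hypothetical severity levels; the real definitions are not given in the case study.
    SEV1 = 1  # platform-wide impact
    SEV2 = 2  # a significant share of customers affected
    SEV3 = 3  # isolated or single-customer issue


@dataclass(frozen=True)
class Policy:
    owner: str                      # who owns the incident by default
    escalate_to: str                # next step on the escalation path
    acknowledge_within: timedelta   # service level expectation for acknowledgement


# Assumed ownership and escalation table, for illustration only.
POLICIES = {
    Severity.SEV1: Policy("on-call engineer", "head of engineering", timedelta(minutes=15)),
    Severity.SEV2: Policy("owning team lead", "on-call engineer", timedelta(hours=1)),
    Severity.SEV3: Policy("customer success", "owning team lead", timedelta(hours=8)),
}


def classify(affected_customers: int, total_customers: int) -> Severity:
    """Map the share of affected customers to a severity (thresholds are assumptions)."""
    if total_customers <= 0:
        raise ValueError("total_customers must be positive")
    share = affected_customers / total_customers
    if share >= 0.25:
        return Severity.SEV1
    if share >= 0.01:
        return Severity.SEV2
    return Severity.SEV3


def needs_escalation(severity: Severity, unacknowledged_for: timedelta) -> bool:
    """Escalate once an incident has gone unacknowledged past its expected window."""
    return unacknowledged_for > POLICIES[severity].acknowledge_within
```

With explicit definitions of this kind, a single customer complaint is classified as an isolated issue with a named owner rather than being treated by default as a platform-wide incident.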
Working closely across Engineering, Product Operations, Cybersecurity, and Customer Success, reliability gradually evolved from a reactive activity into an operational discipline.
-
Further analysis revealed that a large proportion of incidents originated from software changes and releases.
This insight shifted the focus from reacting to failures toward preventing them.
Release Governance
Risk assessments before development begins
Release readiness reviews
Deployment validation checklists
Rollback and mitigation planning
Post-release monitoring
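As a rough illustration, the governance steps above could be encoded as a simple pre-deployment gate. This is a minimal sketch under stated assumptions: the `ReleaseCandidate` fields and the `blocking_items` helper are hypothetical names invented for this example, and the case study does not describe how the checks were actually implemented or enforced.

```python
from dataclasses import dataclass


@dataclass
class ReleaseCandidate:
    # All fields are hypothetical; they mirror the governance steps listed above.
    name: str
    risk_assessed: bool
    readiness_reviewed: bool
    validation_checklist_done: bool
    rollback_plan: str  # an empty string means no plan has been written


def blocking_items(rc: ReleaseCandidate) -> list[str]:
    """Return the governance steps that are still missing for this release."""
    missing = []
    if not rc.risk_assessed:
        missing.append("risk assessment")
    if not rc.readiness_reviewed:
        missing.append("release readiness review")
    if not rc.validation_checklist_done:
        missing.append("deployment validation checklist")
    if not rc.rollback_plan.strip():
        missing.append("rollback and mitigation plan")
    return missing


def is_ready(rc: ReleaseCandidate) -> bool:
    """A release is ready only when no governance step is outstanding."""
    return not blocking_items(rc)
```

Returning the list of missing steps, rather than a bare yes or no, keeps the gate useful as a checklist: a blocked release states exactly what remains to be done.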
Reliability became embedded earlier in the development lifecycle rather than addressed only after failures occurred.
-
Over time, the platform transformed from a system that generated data into a system that generated understanding.
Issues that previously required hours of investigation became diagnosable within minutes. Ownership became clearer. Teams collaborated more effectively. Incident frequency decreased, recovery times improved, and customer satisfaction increased.
Results
Incident resolution time reduced from approximately 4 hours to less than 90 minutes
Improved root-cause identification across more than 70 services
Reduced incident frequency through release governance
Increased customer satisfaction
Improved collaboration across Product, Engineering, Operations, and Customer Success
Greater leadership confidence in platform reliability
Scalable operational foundations supporting continued growth
Most importantly, platform reliability became a managed process rather than a reactive activity.
The lasting outcome was not the observability tooling itself.
It was the creation of a shared operational language that allowed the organisation to understand, operate, and scale an increasingly complex platform with confidence.