The Platform

Case Study

Creating Operational Visibility in a Rapidly Scaling IoT Platform

  • As the company grew from a startup into one of Europe’s largest residential Virtual Power Plants, the complexity of the platform grew faster than its operational visibility.

    Hundreds of thousands of connected devices interacted with cloud infrastructure, mobile applications, and energy market services. The organisation was expanding rapidly, engineering teams were shipping new capabilities, and customer adoption continued to accelerate.

    Yet understanding what was happening across the platform remained surprisingly difficult.

    Customer complaints were often treated as platform-wide incidents. Investigations relied heavily on Slack conversations, fragmented logs, and individual expertise. Different teams had visibility into different parts of the system, but no shared understanding of how the platform behaved as a whole.

    The challenge was not a lack of data. The challenge was a lack of visibility.

    Without a clear view of system behaviour, diagnosing issues was slow, ownership was often unclear, and operational decisions relied more on assumptions than evidence. As the platform prepared for significant growth, this way of operating was becoming unsustainable.

  • The solution was not a dashboard. It was the creation of a complete operational picture.

    A platform-wide observability capability was established, bringing together backend services, frontend applications, infrastructure components, and IoT devices into a common monitoring framework. Logs, traces, telemetry, and operational events became accessible through a shared view of platform behaviour.

    Alongside technical observability, a second layer of visibility was introduced through operational and product analytics.

    Observability

    • Backend logs and distributed traces

    • Frontend monitoring and error tracking

    • Infrastructure monitoring

    • IoT device telemetry and health monitoring

    • Operational investigation tooling

    Product & Operational Analytics

    • Product usage and adoption

    • Feature performance monitoring

    • Installation quality metrics

    • Device health indicators

    • Configuration drift analysis

    • Customer behaviour insights

    • Leadership performance reporting

    For the first time, product teams, engineering teams, operations, customer success, and leadership were working from the same reality.

    For the first time, platform behaviour became visible end-to-end.

  • As visibility improved, a second problem became obvious.

    Many incidents were no longer difficult to identify, but the organisation lacked a consistent way to respond to them.

    Visibility alone does not improve reliability. Ownership does.

    A structured incident management framework was introduced to ensure that issues could be consistently identified, owned, and resolved.

    Incident Management

    • Incident severity definitions

    • Ownership models

    • Escalation paths

    • Service level expectations

    • Operational runbooks

    • Structured postmortems

    Working closely across Engineering, Product Operations, Cybersecurity, and Customer Success, reliability gradually evolved from a reactive activity into an operational discipline.
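To make the framework above concrete, here is a minimal sketch of how severity definitions, ownership, escalation paths, service level expectations, and runbooks might be captured in one place. The case study does not describe the actual implementation, so every severity level, role name, time window, and runbook path below is a hypothetical placeholder.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    # Hypothetical levels; the real definitions are not given in the source.
    SEV1 = "platform-wide impact"
    SEV2 = "degraded service for many customers"
    SEV3 = "isolated or cosmetic issue"


@dataclass(frozen=True)
class SeverityPolicy:
    owner: str                        # who owns the incident by default
    escalation_path: tuple[str, ...]  # ordered list of roles to engage
    acknowledge_within: timedelta     # service level expectation
    runbook: str                      # placeholder path to a runbook


# Illustrative values only.
POLICIES = {
    Severity.SEV1: SeverityPolicy(
        owner="on-call engineer",
        escalation_path=("on-call engineer", "engineering lead", "head of platform"),
        acknowledge_within=timedelta(minutes=15),
        runbook="runbooks/sev1.md",
    ),
    Severity.SEV2: SeverityPolicy(
        owner="service team",
        escalation_path=("service team", "engineering lead"),
        acknowledge_within=timedelta(hours=1),
        runbook="runbooks/sev2.md",
    ),
    Severity.SEV3: SeverityPolicy(
        owner="service team",
        escalation_path=("service team",),
        acknowledge_within=timedelta(hours=8),
        runbook="runbooks/sev3.md",
    ),
}


def next_escalation(severity: Severity, unacknowledged_for: timedelta) -> str:
    """Return the role to engage for an incident that is still unacknowledged.

    Each full acknowledgement window that passes moves one step along the
    escalation path, stopping at the last entry.
    """
    policy = POLICIES[severity]
    steps = unacknowledged_for // policy.acknowledge_within
    index = min(steps, len(policy.escalation_path) - 1)
    return policy.escalation_path[index]
```

Keeping these definitions in a single structure is one way to make ownership explicit: anyone opening an incident can look up who owns it, who comes next, and which runbook applies, without relying on individual expertise.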

  • Further analysis revealed that a large proportion of incidents originated from software changes and releases.

    This insight shifted the focus from reacting to failures toward preventing them.

    Release Governance

    • Risk assessments before development began

    • Release readiness reviews

    • Deployment validation checklists

    • Rollback and mitigation planning

    • Post-release monitoring

    Reliability became embedded earlier in the development lifecycle rather than addressed only after failures occurred.
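As an illustration of the release governance steps listed above, the following sketch models a release readiness gate that reports which steps are still outstanding. It is an assumption about how such a gate could look, not a description of the tooling actually used; the class, field names, and messages are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReleaseCandidate:
    # Hypothetical record; one field per governance step named in the text.
    name: str
    risk_assessed: bool = False          # risk assessment before development
    readiness_reviewed: bool = False     # release readiness review
    checklist_completed: bool = False    # deployment validation checklist
    rollback_plan: str = ""              # rollback and mitigation planning
    monitoring_owner: str = ""           # post-release monitoring


def blocking_gaps(release: ReleaseCandidate) -> list[str]:
    """Return the governance steps that are still missing for a release."""
    gaps = []
    if not release.risk_assessed:
        gaps.append("risk assessment missing")
    if not release.readiness_reviewed:
        gaps.append("release readiness review missing")
    if not release.checklist_completed:
        gaps.append("deployment validation checklist incomplete")
    if not release.rollback_plan.strip():
        gaps.append("no rollback or mitigation plan")
    if not release.monitoring_owner.strip():
        gaps.append("no owner for post-release monitoring")
    return gaps


def is_ready(release: ReleaseCandidate) -> bool:
    """A release is ready only when no governance step is outstanding."""
    return not blocking_gaps(release)
```

Returning the list of gaps, rather than a bare pass or fail, keeps the conversation on what is missing and who needs to supply it before the release goes out.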

  • Over time, the platform transformed from a system that generated data into a system that generated understanding.

    Issues that previously required hours of investigation became diagnosable within minutes. Ownership became clearer. Teams collaborated more effectively. Incident frequency decreased, recovery times improved, and customer satisfaction increased.

    Results

    • Incident resolution time reduced from approximately 4 hours to less than 90 minutes

    • Improved root-cause identification across more than 70 services

    • Reduced incident frequency through release governance

    • Increased customer satisfaction

    • Improved collaboration across Product, Engineering, Operations, and Customer Success

    • Greater leadership confidence in platform reliability

    • Scalable operational foundations supporting continued growth

    Most importantly, platform reliability became a managed process rather than a reactive activity.

    The lasting outcome was not the observability tooling itself.

    It was the creation of a shared operational language that allowed the organisation to understand, operate, and scale an increasingly complex platform with confidence.
