Service Operations & Assurance Specialist

Arm
Cambridge

On-site

Senior Level

The Senior Service Operations & Assurance Specialist acts as the operational integrity authority across Enterprise IT services. The position governs the end-to-end effectiveness of ITIL Service Operations processes through a structured Service Assurance framework.

This is a data-driven, risk-aware role focused on ensuring that detection, response, analysis, and risk reduction operate as a unified reliability model, not as disconnected processes.

Requirements

  • ITIL v4 certification (Foundation as a minimum), with strong knowledge of ITIL v4 across Incident, Major Incident, Problem, Event, Monitoring, Availability, and Continual Improvement.
  • Experience governing operational performance in high-availability engineering environments.
  • Experience implementing and governing a Service Assurance framework.
  • Familiarity with SRE principles.
  • Experience with observability platforms, monitoring, and alerting tooling.
  • Strong data analysis capability, including trend interpretation and risk modelling.
  • Experience reducing repeat incidents and systemic operational risk.
  • Understanding of CI/CD and change governance integration.
  • Structured decision-making capability in high-pressure environments.
  • Good communication and influencing skills across technical and leadership audiences.
  • Experience implementing AIOps or event correlation tooling.
  • Experience designing predictive reliability and performance dashboards.
  • Exposure to SRE operating models in mature engineering organisations.
  • Experience within semiconductor, SaaS, or complex global technology environments.
  • Experience with ServiceNow.

Responsibilities

  • Own and evolve Incident, Major Incident, Problem, Event, and Availability Management processes.
  • Govern operational integrity, performance standards, and reliability compliance across ITSM practices.
  • Present service reliability risk posture and trigger corrective action when thresholds or error budgets are breached.
  • Ensure operational processes function as an integrated, end-to-end reliability system.
  • Govern performance against the 15-minute P1 response SLA and monitor MTTR, response quality, and the effectiveness of critical issue handling.
  • Drive structured improvements through incident trend analysis and repeat incident reduction, identifying patterns in incident response performance.
  • Ensure root cause investigations (PIRs) produce actionable outcomes, and govern the Known Error lifecycle through to permanent resolution.
  • Identify and address systemic architectural, process, and change-related risks.
  • Optimise monitoring and alerting to improve signal-to-noise ratio and Mean Time to Detect (MTTD).
  • Embed AI-assisted triage, correlation, and automation into detection and response workflows.
  • Monitor SLA and SLO performance, availability trends, reliability, and error budget consumption, ensuring they align with and contribute to IT's overall service health goals.
  • Align reliability insights with engineering backlogs and platform roadmaps.
  • Ensure resilience controls (failover, redundancy, disaster recovery) are visible and governed.
  • Use data and trend analysis to predict risk, prevent instability, and shift operations from reactive recovery to predictive prevention.

Benefits

  • With Arm’s growth trajectory, you’ll have clear opportunities to develop your career, take on new challenges, and make a real impact on our continued success.

Skills

ITIL v4

Incident management

Major incident management

Problem management

Event management

Availability management

SRE

AI

Automation

Observability platforms

Monitoring

Alerting

Data analysis

Risk modelling

CI/CD

AIOps

ServiceNow