
Service Operations & Assurance Specialist
On-site
Senior Level
The Senior Service Operations & Assurance Specialist acts as the operational integrity authority across Enterprise IT services. The position governs the end-to-end effectiveness of ITIL Service Operations processes through a structured Service Assurance framework.
This is a data-driven, risk-aware role focused on ensuring that detection, response, analysis, and risk reduction operate as a unified reliability model, not as disconnected processes.
Requirements
- ITIL v4 certification (Foundation as a minimum), with strong knowledge of ITIL v4 across Incident, Major Incident, Problem, Event, Monitoring, Availability, and Continual Improvement.
- Experience governing operational performance in high-availability engineering environments.
- Experience implementing and governing a Service Assurance framework.
- Familiarity with SRE principles.
- Experience with observability platforms, monitoring, and alerting tooling.
- Strong data analysis capability, including trend interpretation and risk modelling.
- Experience reducing repeat incidents and systemic operational risk.
- Understanding of CI/CD and change governance integration.
- Structured decision-making capability in high-pressure environments.
- Good communication and influencing skills across technical and leadership audiences.
- Experience implementing AIOps or event correlation tooling.
- Experience designing predictive reliability and performance dashboards.
- Exposure to SRE operating models in mature engineering organisations.
- Experience within semiconductor, SaaS, or complex global technology environments.
- Experience with ServiceNow.
Responsibilities
- Own and evolve Incident, Major Incident, Problem, Event, and Availability Management processes.
- Govern operational integrity, performance standards, and reliability compliance across ITSM practices.
- Present service reliability risk posture and trigger corrective action when thresholds or error budgets are breached.
- Ensure operational processes function as an integrated, end-to-end reliability system.
- Govern performance against the 15-minute P1 response SLA and monitor MTTR, response quality, and the effectiveness of critical issue handling.
- Drive structured improvements through incident trend analysis and repeat incident reduction, identifying patterns in incident response performance.
- Ensure post-incident reviews (PIRs) deliver actionable root cause investigations, and govern the Known Error lifecycle through to permanent resolution.
- Identify and address systemic architectural, process, and change-related risks.
- Optimise monitoring and alerting to improve signal-to-noise ratio and Mean Time to Detect (MTTD).
- Embed AI-assisted triage, correlation, and automation into detection and response workflows.
- Monitor SLA and SLO performance, availability trends, reliability, and error budget consumption, keeping them aligned with and contributing to IT's overall service health goals.
- Align reliability insights with engineering backlogs and platform roadmaps.
- Ensure resilience controls (failover, redundancy, disaster recovery) are visible and governed.
- Use data and trend analysis to predict risk, prevent instability, and shift operations from reactive recovery to predictive prevention.
Benefits
- With Arm’s growth trajectory, you’ll have clear opportunities to develop your career, take on new challenges, and make a real impact on our continued success.
Skills
ITIL v4
Incident management
Major incident management
Problem management
Event management
Availability management
SRE
AI
Automation
Observability platforms
Monitoring
Alerting
Data analysis
Risk modelling
CI/CD
AIOps
ServiceNow





