For this project, you will be responsible for ensuring the stability, availability, and performance of applications within the team's scope. You will play a key role in incident management, ensuring timely resolution by collaborating with internal teams (Development and Infrastructure) and external stakeholders (Service Providers), while driving sustainable, long-term solutions.
Key Responsibilities
Application Stability & Availability
- Monitor, maintain, and support in-scope applications to ensure high availability, reliability, and performance.
- Actively participate in incident management activities, including:
  - Situation Rooms for P1/P2 incidents
  - Root Cause Analysis (RCA)
  - Identification of incident trends and contribution to permanent solutions
- Ensure compliance with ITIL governance within IT Production, including SLA management.
- Execute change requests and deployments in accordance with ITIL and DevOps processes and tools.
- Proactively identify and resolve technical issues to ensure smooth business operations.
- Participate in on-call rotations and provide 24/7 support for critical applications, when required.
Technical Support & Cross-Team Collaboration
- Serve as a primary point of contact for Development teams, supporting troubleshooting activities and coordinating fixes.
- Work closely with Scrum and Agile teams to design, deploy, and continuously improve systems.
- Implement upgrades, patches, and new functionalities while ensuring minimal impact on end users.
Platform Monitoring & Observability
- Implement, configure, and optimize monitoring solutions within the production environment (e.g., Dynatrace).
- Collaborate with Development teams and Centers of Expertise to define effective monitoring and observability practices.
- Promote observability awareness to enable early detection and proactive resolution of potential issues.
- Utilize distributed tracing, logging, and metrics tools (e.g., Jaeger, Grafana, Prometheus, ELK).
Documentation & Knowledge Sharing
- Create, maintain, and update technical documentation, including processes, configurations, and troubleshooting guides.
- Share best practices and technical knowledge with global support teams to improve service quality and operational efficiency.