DevOps & SRE Engineer

Manus AI · Singapore · Full-time

2+ years · Posted 27 Aug 2025

Quick Summary

  • Manage and maintain container clusters (Kubernetes, Docker) across multiple business units
  • Design, build, and enhance infrastructure operation platforms and CI/CD pipelines
  • Ensure maximum uptime for production services through proactive monitoring and incident response

Full Description

Key Responsibilities

Cluster Operations & Management

- Manage and maintain container clusters (Kubernetes, Docker) and open-source component clusters (Kafka, Redis, Elasticsearch) across multiple business units

- Ensure optimal performance, scalability, and reliability of distributed systems

Infrastructure Platform Development

- Design, build, and enhance infrastructure operation platforms

- Develop and maintain systems for infrastructure management, CI/CD pipelines, monitoring/alerting, and centralized logging

- Drive platform standardization and automation initiatives

High Availability & Reliability

- Ensure maximum uptime for production services through proactive monitoring and incident response

- Continuously optimize service architecture, deployment strategies, and operational processes

- Implement and maintain SLA/SLO frameworks and reliability engineering practices

Automation & Process Improvement

- Lead the development of automated operations and maintenance systems

- Create self-service tools and workflows to improve team productivity

- Establish best practices for Infrastructure as Code (IaC) and configuration management

Required Qualifications

Experience & Education

- 2+ years of hands-on experience in Systems Operations, DevOps, or Site Reliability Engineering (SRE)

- Bachelor's degree in Computer Science, Engineering, or a related technical field preferred

Cloud & Infrastructure

- Experience with public cloud platforms (AWS, Azure, or GCP) is highly valued

- Strong understanding of large-scale internet architecture and distributed systems

- Proven experience with infrastructure monitoring, logging, and observability tools

Technical Skills

- Proficiency in scripting and automation using Shell, Python, or similar languages

- Strong knowledge of containerization technologies (Kubernetes, Docker)

- Hands-on experience operating production-grade container clusters and managing CI/CD pipelines

- Strong familiarity with common infrastructure components: Nginx, MySQL, Redis, Kafka, Elasticsearch

Advanced Networking (Preferred)

- Experience with Service Mesh architectures, Cilium CNI, and eBPF technologies

- Understanding of network security, load balancing, and traffic management

- Knowledge of cloud-native networking patterns and best practices

About Manus AI

Manus is a general AI agent that bridges minds and actions: it doesn't just think, it delivers results. Manus excels at various tasks in work and life, getting everything done while you rest. At Manus AI, we offer a highly collaborative and innovative environment where experts across engineering, research, and business come together to push the boundaries of AI applications. If you're passionate about cutting-edge technology and making a real impact, we’d love to hear from you!

Contact us: [email protected]
