Linux Monitoring Desk Associate - L1
Job Title: Monitoring Desk Associate – Service Assurance Team
Location: Mahape, Navi Mumbai
Type: Onsite - Work from office
About Neysa:
Neysa is an AI Acceleration Cloud System provider, dedicated to democratizing AI adoption with purpose-built platforms and services for AI-native applications and workloads. Co-founded by industry leaders, we empower businesses to discover, deploy, and scale Generative AI (Gen AI) and AI use cases securely and cost-effectively. Our flagship platforms—Neysa Velocis, Neysa Overwatch, and Neysa Aegis—accelerate AI deployment, optimize network performance, and safeguard AI/ML landscapes. We are committed to enabling AI-led innovation across industries and geographies.
Position Overview:
We are looking for a Monitoring Desk Associate to join our Service Assurance Team. This position plays a key role in keeping Neysa’s AI platforms performing optimally by monitoring system health, responding to incidents, and troubleshooting and resolving them in real time. The ideal candidate will have hands-on experience with Linux systems, a passion for operational excellence, and the ability to quickly resolve issues impacting service availability.
Key Responsibilities:
- Incident Monitoring & Response: Monitor Neysa's AI platforms and infrastructure for any system alerts, performance issues, or service disruptions. Respond to incidents promptly and escalate issues as needed to ensure timely resolution.
- Incident Management: Follow defined processes for incident identification, classification, and escalation. Ensure incidents are managed effectively, with minimal disruption to service and in alignment with service level agreements (SLAs).
- Troubleshooting & Resolution: Use your Linux expertise to investigate, diagnose, and resolve incidents affecting system performance. Troubleshoot system-level issues, application failures, and network-related problems.
- Proactive Monitoring: Continuously monitor the operational status of servers, applications, and networks, proactively identifying potential issues before they impact customers. Utilize monitoring tools such as Nagios, Prometheus, or Grafana to track system health.
- Documentation & Reporting: Accurately document incidents, actions taken, and resolutions in incident management systems. Provide detailed reports on recurring issues, root causes, and preventive measures for the Service Assurance team.
- Collaboration with Technical Teams: Work closely with system administrators, engineers, and developers to identify areas of improvement, share insights, and ensure issues are resolved with minimal business impact.
- Root Cause Analysis: Participate in post-incident reviews to analyze root causes, provide feedback, and suggest improvements to incident management processes.
- System Maintenance: Support periodic system checks, patch management, and routine maintenance to ensure systems are secure, optimized, and operating at peak efficiency.
Qualifications:
- Experience: 1-5 years of experience in an operations or service assurance role, with a focus on incident management and system monitoring in a Linux environment.
- Linux Skills: Solid experience with Linux operating systems (e.g., CentOS, Ubuntu, RHEL), including system administration, basic troubleshooting, and performance tuning.
- Incident Management: Knowledge of ITIL processes, specifically incident management, with the ability to handle incidents efficiently while maintaining communication with stakeholders.
- Troubleshooting Skills: Strong ability to troubleshoot technical issues in a timely manner, including server failures, network connectivity issues, and application problems.
- Monitoring Tools: Experience with monitoring and alerting tools such as Nagios, Prometheus, Grafana, or similar, and a strong understanding of how to use these tools to monitor system health and performance.
- Communication: Excellent communication skills, both verbal and written, with the ability to provide clear and concise updates to both technical and non-technical stakeholders.
- Team Player: Ability to work effectively in a collaborative team environment, with a proactive approach to problem-solving and incident resolution.
- Technical Aptitude: Basic understanding of cloud platforms (AWS, Azure, or Google Cloud) and networking fundamentals is a plus.
Preferred Qualifications:
- Experience with containerized environments (e.g., Docker, Kubernetes).
- Familiarity with automated scripting for incident resolution and process improvement (e.g., Bash, Python).
- ITIL certification or similar incident management qualifications.