The first rule of server management is that shit breaks! And it breaks at 2 AM. An engineer has to wake up and spend an hour or two troubleshooting and fixing the problem behind the alert, and for that whole time the company is living through an IT outage.
Today, DevOps engineers face the challenge of keeping their IT infrastructure up and running 24x7. Servers and applications generate thousands of alerts per day, many of which need manual intervention, resulting in several hours of IT outages and lost business. It’s unbelievable how many of these alerts are still resolved manually! There are two reasons for this: 1) it’s hard to automate remediation actions, and 2) the availability and reliability requirements of such an automation tool are an order of magnitude higher than those of a typical app, since the tool kicks in only during an outage.
That’s why big companies like Amazon, Netflix, and Facebook built proprietary auto-remediation tools to fix a large chunk of their server alerts automatically. Unfortunately, most other companies don’t have these tools because they don’t have the resources to build them for themselves. I experienced the pain of on-call burnout (aka alert fatigue) firsthand for five years at Amazon. As an early founding engineer on AWS S3 and DynamoDB, I architected and built an auto-remediation tool for Amazon that managed hundreds of thousands of servers. This tool not only reduced downtime and improved IT operations efficiency but also made a tremendous difference to engineers’ quality of life.
Satish and I founded Neptune.io to solve this exact problem. Through Neptune.io, we want to make an auto-remediation tool available to any company. We have known each other for more than a decade, since our undergraduate days, and have been friends ever since. When we found ourselves finishing each other’s sentences, we knew that we wanted to build a great technology company together.
Provisioning, config management, deployment, monitoring, and notifications have been widely accepted as part of today’s DevOps engineer’s tool set. However, there hasn’t been any disruptive innovation in the automation or auto-remediation space, even though a typical DevOps team spends close to 50% of its time on alert remediation. We were annoyed by the fact that every company has to build its own auto-remediation solution from the ground up, which takes around 1-2 years with 5-6 engineers, plus the usual maintenance budget cycles thereafter. With Neptune, we are tackling that problem and aim to become the de facto auto-remediation service for any company.
We believe it is inevitable that every server and application on the planet will need to be self-healing. If you are interested in making that a reality, come join us!
Going forward, we’ll be sharing our experiences in building our product and company here. Stay tuned!
Kiran & Satish
Neptune.io: Make your servers and applications self-healing!