Jan 27, 2022. 12 min
How can startups leverage the latest breakthroughs in AI for observability?
As the world becomes increasingly digitized, software is becoming ubiquitous. The phrase “software is eating the world” is often used to describe this phenomenon. However, modern software architecture is becoming increasingly complex, and the amount of observability data (e.g., logs, metrics, traces, code changes, etc.) is growing in tandem.
Cloud computing is a rapidly growing industry, with 94% of enterprises using cloud services. According to a 2022 study, the market size of cloud computing is $480 billion. However, running applications 24x7 on the cloud has become one of the most challenging parts of cloud adoption.
Enterprises on average experience about 8.7 major incidents every year, with the cost of downtime ranging between $100,000 and $540,000 per hour according to Gartner. The stakes are high and the burden on DevOps and SRE teams is enormous.
What does this mean for a startup? Here are three questions most often asked by early-stage founders:
- What monitoring solution would you put in place for a small team with no dedicated OPS/SRE group?
- Are there any third-party monitoring tools that are affordable for startups, given that existing products are expensive?
- How do you monitor CloudWatch logs?
For the small development team of a typical startup, maintaining the production environment is a major challenge, and the current cost of observability solutions makes it worse. Striking a balance between cloud operations work and development work is clearly a struggle.
As a startup founder, you may consider the following while deciding on observability.
Keeping your observability data (e.g., logs, metrics, traces, tickets, code changes, documentation, chats, etc.) in different silos can make investigations challenging, especially if you are facing costly downtime. If you cannot correlate data across silos, your war room may fill with 30 to 40 people trying to troubleshoot, and finger-pointing will ensue.
For instance, if you use a few AWS services with your microservices-style application, keeping the logs separate may result in many separate log files. This complicates the investigation. Getting a full context from your data makes analysis go faster. Look for solutions that automate the collection and aggregation of logs and metrics from different services. Additionally, you want services that analyze the data and give you insights automatically. For logs, look for a solution that helps you parse your data into a consistent structure regardless of the source. Being able to search across all your data and filter through them will make investigation convenient.
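To make "a consistent structure regardless of the source" concrete, here is a minimal sketch that normalizes two log shapes into one record. Both input formats (a JSON line and a space-separated text line) and all field names are assumptions chosen for illustration; real services will need their own parsers.

```python
import json
import re
from typing import Optional

# Hypothetical plain-text format: "<timestamp> <LEVEL> <service>: <message>"
TEXT_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>[\w-]+):\s+(?P<message>.*)$"
)


def normalize(raw: str) -> Optional[dict]:
    """Return a record with timestamp, level, service, and message keys.

    Returns None when the line matches neither assumed format.
    """
    raw = raw.strip()
    if not raw:
        return None
    if raw.startswith("{"):
        # Assumed JSON shape with either short or long key names.
        try:
            rec = json.loads(raw)
        except json.JSONDecodeError:
            return None
        return {
            "timestamp": rec.get("time") or rec.get("timestamp"),
            "level": str(rec.get("level", "INFO")).upper(),
            "service": rec.get("service", "unknown"),
            "message": rec.get("msg") or rec.get("message", ""),
        }
    match = TEXT_LINE.match(raw)
    return match.groupdict() if match else None
```

Once every source produces the same four keys, searching and filtering across services becomes a matter of querying one schema instead of many.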
You want to look for a solution that can capture anomalies and proactively detect abnormal conditions without requiring constant configuration tuning. It's also important to find a solution that doesn't generate false alarms or redundant events. For log management, you want to use a solution that can identify the clues without digging through the logs. If the service can automatically surface the root-cause indicators from your logs, that can save a lot of time.
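As a rough illustration of anomaly detection that adapts without hand-set limits, the sketch below flags a metric value when it deviates strongly from a rolling window of recent values. It is a simple rolling z-score, not the method of any particular product, and the `window` and `threshold` defaults are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(values, window=10, threshold=3.0):
    """Return the indexes of values that deviate from the recent baseline.

    The baseline is the mean and standard deviation of the previous
    `window` values, so it follows the metric as it drifts.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # A perfectly flat history (sigma == 0) is skipped in this
            # sketch; a real detector would need a rule for that case.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(index)
        history.append(value)
    return anomalies
```

Because the baseline is recomputed from recent data, the same code works for a latency metric around 10 ms and a queue depth around 10,000 without retuning.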
When it comes to analyzing logs, their unstructured nature can make it challenging. The ability to quickly discover patterns from your logs can greatly simplify the investigation of a complex problem. You want a tool that allows you to group logs into event sequences (i.e., workflows). This better illustrates the system runtime execution paths and helps identify common event sequences.
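One common way to discover patterns in unstructured logs is to mask the variable parts of each message and count the resulting templates. The following is a minimal sketch of that idea; the three masking rules are assumptions and would need to be extended for real data.

```python
import re
from collections import Counter

# Order matters: mask the most specific patterns first.
MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def template_of(message: str) -> str:
    """Replace variable tokens so similar messages share one template."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message


def group_by_template(messages):
    """Count how many messages fall under each template."""
    return Counter(template_of(m) for m in messages)
```

Sorting the templates by count gives a quick picture of which events dominate a log stream, and ordering templates by timestamp within a request is a starting point for reconstructing event sequences.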
You want a solution that discovers insights from your data. For example, application logs typically contain many different exceptions during an incident, and investigating all of them manually is time-consuming. Instead, you want a solution that finds all the exceptions, catalogs them, and reports their frequency, recency, and duration (such as when each was first and last seen). It should also suggest how to fix them and notify you when they cross normal limits or when new exceptions show up. In other words, you want a helpful assistant that manages your application exceptions so that you do not have to do the heavy lifting.
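The cataloging part of this can be sketched in a few lines. The example below assumes exceptions have already been extracted from logs as `(timestamp, exception_type)` pairs; that input shape and the names used are illustrative, not a specific tool's API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ExceptionStats:
    count: int
    first_seen: datetime
    last_seen: datetime


def catalog_exceptions(events):
    """Build a catalog keyed by exception type.

    `events` is an iterable of (timestamp, exception_type) tuples.
    """
    catalog = {}
    for timestamp, exc_type in events:
        stats = catalog.get(exc_type)
        if stats is None:
            catalog[exc_type] = ExceptionStats(1, timestamp, timestamp)
        else:
            stats.count += 1
            stats.first_seen = min(stats.first_seen, timestamp)
            stats.last_seen = max(stats.last_seen, timestamp)
    return catalog


def new_exceptions(catalog, known_types):
    """Return exception types in the catalog that were not seen before."""
    return sorted(set(catalog) - set(known_types))
```

Comparing today's catalog against the set of previously known types is enough to drive a "new exception" notification; fix suggestions and threshold alerts would build on the same structure.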
For the data ingested from your applications, it's important to place it in context of your application topology. You want to look for a solution that can show the application topology automatically and provide you with proper context.
For instance, you want a tool that can automatically discover components and dependencies of your entire technology stack in real-time using AI. It may extract and auto-discover all context-relevant topology information of your full-stack with the installation of a single agent.
You want a tool that helps you visualize your complex application services and navigate based on topology relationships. Additionally, you want to troubleshoot alerts within your application with the application topology as the context.
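To show why topology is useful context, here is a minimal sketch that builds a dependency graph from observed service-to-service calls and then answers "which services are affected if this one fails?". The `(caller, callee)` input and the service names in the usage note are hypothetical; a real tool would derive these edges from traces or agent data.

```python
from collections import defaultdict, deque


def build_topology(calls):
    """Map each caller to the set of services it depends on."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
    return graph


def impacted_by(graph, failing):
    """Return every service that depends, directly or not, on `failing`."""
    # Reverse the edges so we can walk from a service to its callers.
    callers = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            callers[callee].add(caller)

    impacted = set()
    queue = deque([failing])
    while queue:
        node = queue.popleft()
        for caller in callers[node]:
            if caller != failing and caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted
```

With calls such as `("web", "cart")`, `("cart", "db")`, and `("web", "auth")`, asking what is impacted by `"db"` walks upstream through `cart` to `web`, which is the same navigation you would do by hand on a topology map.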
You want to look for a solution that provides answers to critical questions such as “Why is my service unhealthy?” It is immensely helpful if the solution analyzes all available data such as metrics, logs, traces, and deployments together to accurately pinpoint the root-cause of failures. Instead of spending hours investigating a problem, you want a tool that can work as a co-pilot.
For instance, you want a tool that can analyze all available data in real time and provide insights into the root cause of failures within seconds. If the solution provides a visual representation of the entire incident timeline and helps you understand how different events are related, that will speed up the investigation.
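A very small version of an incident timeline is just a merge of events from every source, restricted to the window before the alert. The sketch below assumes events arrive as `(timestamp, source, description)` tuples and uses a 15-minute lookback; both are assumptions for illustration.

```python
from datetime import datetime, timedelta


def incident_timeline(events, alert_time, lookback=timedelta(minutes=15)):
    """Return events from all sources in the window before the alert.

    `events` is an iterable of (timestamp, source, description) tuples
    drawn from logs, metrics, deployments, and so on. The result is
    ordered by timestamp so that cause and effect can be read in order.
    """
    start = alert_time - lookback
    in_window = [e for e in events if start <= e[0] <= alert_time]
    return sorted(in_window, key=lambda e: e[0])
```

Seeing a deployment entry a few minutes before a spike in errors is often the fastest clue to the root cause, which is why putting all sources on one time axis matters.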
When it comes to choosing a solution for SRE/DevOps, you want to look for a tool that can understand and respond to your questions using the latest breakthroughs in AI. You also want to avoid training your team members on how to use a certain tool and ensure that the on-ramp is fast and effective.
A conversational interface can be an excellent solution for this. Conversational AI tools such as Google's Vertex AI Conversation, Dialogflow CX, and Microsoft's Conversational AI tools can help you quickly create and deploy generative AI-powered chat and voice bots. These tools provide a simple UX that enables your team to interact with the backend services. Alternatively, look for an observability vendor that offers the integrated experience.
You want to look for a tool that is cost-effective and can grow with your needs. For log management, smart tiering of logs is a feature that can help you optimize your storage costs. It automatically moves your data to the most cost-effective storage tier and eventually deletes it based on a policy. This ensures that you are not paying more than you need to.
For instance, Amazon S3 Intelligent-Tiering is a cloud storage class that automatically optimizes costs by moving data to the most cost-effective access tier when access patterns change. It can monitor access patterns and can automatically move objects that have not been accessed to lower-cost access tiers.
Alternatively, look for a vendor that offers the integrated experience. Re-hydration, that is, bringing archived logs back for search, should also be cost-effective so that you are not paying a lot to reach older data. Avoid vendors that do a 'bait and switch'.
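If you manage log storage on Amazon S3 yourself, the "tier, then eventually delete" policy described above can be expressed as a bucket lifecycle configuration. The sketch below only builds the configuration as a Python dictionary; the `logs/` prefix, the rule ID, and the 30-day and 365-day values are assumptions to adjust for your own retention needs.

```python
def build_log_lifecycle(prefix="logs/", tier_after_days=30, expire_after_days=365):
    """Build an S3 lifecycle configuration for log objects.

    Objects under `prefix` move to the Intelligent-Tiering storage class
    after `tier_after_days` and are deleted after `expire_after_days`.
    """
    if expire_after_days <= tier_after_days:
        raise ValueError("expiry must come after the tiering transition")
    return {
        "Rules": [
            {
                "ID": "tier-then-expire-logs",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": tier_after_days, "StorageClass": "INTELLIGENT_TIERING"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }
```

With boto3, a dictionary of this shape would be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call, together with your bucket name. Check current S3 pricing before choosing the day values, since transitions and small objects carry their own costs.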
You want to use a solution that can provide greater insights and visualizations from the vast amount of observability data. You want to use an observability solution that is more proactive and less noisy. This type of observability will understand context from logs, metrics, traces, tickets, code changes, documentation, chats, etc. As your observability data grows, the AI-powered solution will automatically handle the new volume.
Is there such a tool?
I am biased, but check out CloudAEye! Here is the high-level pitch:
Imagine AI models that can detect an anomaly, contextualize it, prioritize it, and generate hypotheses about it by performing a root-cause analysis. All of this is done automatically so that customers can test fixes as soon as the problem arises.
CloudAEye Log Management offers a robust set of features that enable you to use the latest breakthroughs in AI. Check out this overview video.
Curious about AIOps?
A seasoned engineering executive, Nazrul has been building enterprise products and services for 20 years. Nazrul is the founder and CEO of CloudAEye. Previously, he was Sr. Dir and Head of CloudBees Core, where he focused on the enterprise version of Jenkins. Before that, he was Sr. Dir of Engineering, Oracle Cloud. Nazrul graduated from the executive MBA program with high distinction (top 10% of the cohort) at the University of Michigan Ross School of Business. Nazrul is a named inventor on 47 patents.