Oct 12, 2023. 10 min

How can startups leverage the latest AI breakthroughs in observability?

As the world becomes increasingly digitized, software is becoming ubiquitous. The phrase “software is eating the world” is often used to describe this phenomenon. However, modern software architecture is becoming increasingly complex, and the amount of observability data (e.g., logs, metrics, traces, and code changes) is growing in tandem.


Cloud computing is a rapidly growing industry, with 94% of enterprises using cloud services. According to a 2022 study, the market size of cloud computing is $480 billion. However, running applications 24x7 on the cloud has become one of the most challenging parts of cloud adoption.


Enterprises on average experience about 8.7 major incidents every year, with the cost of downtime ranging between $100,000 and $540,000 per hour according to Gartner. The stakes are high and the burden on DevOps and SRE teams is enormous.


Startup Challenges

What does this mean for a startup? Here are the three questions early-stage founders ask most often:

  1. What monitoring solution would you put in place for a small team with no dedicated OPS/SRE group?
  2. Are there any third-party tools for monitoring spend that are affordable to startups, as existing products are expensive?
  3. How do you monitor CloudWatch logs?

For the small development team of a typical startup, maintaining the production environment is a major challenge, and the current cost of observability solutions makes it worse. Striking a balance between cloud operations work and development work is clearly a struggle.


Observability Considerations

As a startup founder, you may consider the following while deciding on your strategy for observability.


AI Won't Replace Humans - But Humans With AI Will Replace Humans Without AI

            -- Karim Lakhani, Professor, Harvard Business School


Data Consolidation

Keeping your observability data (e.g., logs, metrics, traces, tickets, code changes, documentation, and chats) in different silos can make investigations challenging, especially if you are facing costly downtime. If you cannot correlate data across silos, your war room will fill with a large number of people (30-40) trying to troubleshoot, and finger-pointing will ensue.


For instance, if you use a few AWS services with your microservices-style application, keeping each service's logs separate leaves you with many log files to search, which complicates the investigation. Getting full context from your data makes analysis go faster. Look for solutions that automate the collection and aggregation of logs and metrics from different services. Additionally, you want services that analyze the data and give you insights automatically. For logs, look for a solution that helps you parse your data into a consistent structure regardless of the source. Being able to search across all your data and filter it will make investigation convenient.
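To make "a consistent structure regardless of the source" concrete, here is a minimal sketch in Python. It assumes two hypothetical input formats: a JSON line and a space-separated text line. The field names (`time`, `msg`, `service`) and the text layout are assumptions made for illustration; real services emit many other formats.

```python
import json
import re
from typing import Optional

# Hypothetical plain-text layout, assumed for this example:
# "2023-10-12T08:15:01Z WARN payment-api Card declined"
TEXT_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def normalize(line: str) -> Optional[dict]:
    """Return a record with timestamp, level, service and message, or None."""
    line = line.strip()
    if not line:
        return None
    if line.startswith("{"):
        # JSON-formatted log line; the key names here are assumptions.
        try:
            raw = json.loads(line)
        except json.JSONDecodeError:
            return None
        return {
            "timestamp": raw.get("time") or raw.get("timestamp"),
            "level": str(raw.get("level", "INFO")).upper(),
            "service": raw.get("service", "unknown"),
            "message": raw.get("msg") or raw.get("message", ""),
        }
    # Fall back to the plain-text layout.
    match = TEXT_PATTERN.match(line)
    return match.groupdict() if match else None
```

Once every line has the same four fields, searching and filtering across services becomes a matter of querying one shape of record instead of many.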


Intelligent Monitoring

You want to look for a solution that can capture anomalies and proactively detect abnormal conditions without requiring constant configuration tuning. It's also important to find a solution that doesn't generate false alarms or redundant events. For log management, you want a solution that can identify the clues without requiring you to dig through the logs. If the service can automatically surface the root-cause indicators from your logs, that can save a lot of time.
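As a rough illustration of detecting abnormal conditions without hand-tuned static thresholds, the sketch below flags points that deviate sharply from a trailing window of recent values. This is a simple statistical baseline written for this article, not the method used by any particular product; the window size, threshold, and minimum spread are assumed values.

```python
from statistics import mean, stdev


def find_anomalies(counts, window=5, threshold=3.0, min_sigma=1.0):
    """Return indexes whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # A floor on sigma keeps a perfectly flat history from turning
        # every tiny change into an alert.
        sigma = max(stdev(history), min_sigma)
        if abs(counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Errors per minute for one service (made-up numbers).
errors_per_minute = [4, 5, 4, 6, 5, 5, 40, 5]
print(find_anomalies(errors_per_minute))
```

Because the baseline is computed from recent history, the same code adapts to a service that normally logs 5 errors a minute and one that normally logs 500, which is the point of avoiding fixed thresholds.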


Pattern Discovery

Logs are largely unstructured, which makes them hard to analyze. The ability to quickly discover patterns in your logs can greatly simplify the investigation of a complex problem. You want a tool that allows you to group logs into event sequences (i.e., workflows). This better illustrates the system's runtime execution paths and helps identify common event sequences.
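One simple way to see what pattern discovery means is to mask the variable parts of each message so that lines produced by the same log statement collapse into one template. The sketch below does this with a few regular expressions; the masks are assumptions chosen for the example, and production tools use far more robust techniques.

```python
import re
from collections import Counter

# Order matters: mask IP addresses before generic numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<ID>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]


def to_template(message):
    """Replace variable tokens so similar messages share one template."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message


def top_patterns(messages, n=3):
    """Return the n most common templates with their counts."""
    return Counter(to_template(m) for m in messages).most_common(n)


messages = [
    "Request 101 from 10.0.0.5 took 35 ms",
    "Request 102 from 10.0.0.9 took 120 ms",
    "Cache miss for key 9f8e7d6c5b4a",
]
for template, count in top_patterns(messages):
    print(count, template)
```

Grouping by template turns thousands of raw lines into a short list of distinct events, which is the starting point for spotting common event sequences.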


Insights

You want a solution that discovers insights from your data. For example, application logs typically contain many different exceptions that occur during an incident. Manually investigating all of these can be time-consuming. Instead, you want a solution that finds all the exceptions, catalogs them, reports their frequency, recency, and duration (for example, when each was first and last seen), suggests how to fix them, and notifies you when they cross normal limits or when new exceptions show up. In other words, you want a helpful assistant that manages your application exceptions so that you do not have to do the heavy lifting.
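The cataloging part of this idea can be sketched in a few lines of Python. The example assumes log records arrive as `(timestamp, message)` pairs and that exception names end in `Exception` or `Error`; both are simplifying assumptions, and the fix suggestions and notifications described above are out of scope here.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Assumed naming convention: names such as java.lang.NullPointerException
# or TimeoutError.
EXCEPTION_NAME = re.compile(r"\b([A-Za-z_][\w.]*(?:Exception|Error))\b")


@dataclass
class ExceptionStats:
    count: int
    first_seen: datetime
    last_seen: datetime


def catalog_exceptions(records):
    """Build {exception name: ExceptionStats} from (timestamp, message) pairs."""
    catalog = {}
    for timestamp, message in records:
        match = EXCEPTION_NAME.search(message)
        if not match:
            continue
        name = match.group(1)
        stats = catalog.get(name)
        if stats is None:
            # First sighting: this is where a "new exception" alert would go.
            catalog[name] = ExceptionStats(1, timestamp, timestamp)
        else:
            stats.count += 1
            stats.first_seen = min(stats.first_seen, timestamp)
            stats.last_seen = max(stats.last_seen, timestamp)
    return catalog
```

A caller could compare each `count` against a normal limit, or compare the catalog's keys against yesterday's, to decide when to notify the team.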


Topology

For the data ingested from your applications, it's important to place it in the context of your application topology. You want to look for a solution that can show the application topology automatically and provide you with proper context.


For instance, you want a tool that can automatically discover the components and dependencies of your entire technology stack in real time using AI. Ideally, it extracts and auto-discovers all context-relevant topology information for your full stack with the installation of a single agent.


You want a tool that helps you visualize your complex application services and navigate based on topology relationships. Additionally, you want to troubleshoot alerts within your application with the application topology as the context.


Root-cause Analysis

You want to look for a solution that provides answers to critical questions such as “Why is my service unhealthy?” It is immensely helpful if the solution analyzes all available data such as metrics, logs, traces, and deployments together to accurately pinpoint the root-cause of failures. Instead of spending hours investigating a problem, you want a tool that can work as a co-pilot.


For instance, you want a tool that can analyze all available data in real time and provide insights into the root cause of failures within seconds. If the solution provides a visual representation of the entire incident timeline and helps you understand how different events are related, that will speed up the investigation.


Conversational

When it comes to choosing a solution for SRE/DevOps, you want to look for a tool that can understand and respond to your questions using the latest breakthroughs in AI. You also want to avoid training your team members on how to use a certain tool and ensure that the on-ramp is fast and effective.


A conversational interface can be an excellent solution for this. Conversational AI tools such as Google's Vertex AI Conversation, Dialogflow CX, and Microsoft's Conversational AI tools can help you quickly create and deploy generative AI-powered chat and voice bots. These tools provide a simple UX that enables your team to interact with the backend services. Alternatively, look for an observability vendor that offers the integrated experience.


Cost Effective

You want to look for a tool that is cost-effective and can grow with your needs. For log management, smart tiering of logs is a feature that can help you optimize your storage costs. It automatically moves your data to the most cost-effective storage tier and eventually deletes it based on a policy. This ensures that you are not paying more than you need to.


For instance, Amazon S3 Intelligent-Tiering is a cloud storage class that automatically optimizes costs by moving data to the most cost-effective access tier when access patterns change. It can monitor access patterns and can automatically move objects that have not been accessed to lower-cost access tiers.
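To show the kind of decision a tiering policy makes, here is a small Python sketch. It is not how Amazon S3 Intelligent-Tiering is implemented; the tier names and the 30/90/365-day values are hypothetical and exist only to illustrate a policy that moves idle data to cheaper tiers and eventually deletes it.

```python
from datetime import datetime, timedelta

# Hypothetical policy values, chosen only for illustration.
WARM_AFTER = timedelta(days=30)     # idle this long: move to a cheaper tier
COLD_AFTER = timedelta(days=90)     # idle this long: move to archive
DELETE_AFTER = timedelta(days=365)  # older than this: delete


def choose_tier(created, last_accessed, now):
    """Pick a storage tier for one log object under the policy above."""
    if now - created >= DELETE_AFTER:
        return "delete"
    idle = now - last_accessed
    if idle >= COLD_AFTER:
        return "cold"
    if idle >= WARM_AFTER:
        return "warm"
    return "hot"


now = datetime(2023, 10, 12)
print(choose_tier(datetime(2023, 9, 1), datetime(2023, 10, 10), now))
print(choose_tier(datetime(2023, 1, 1), datetime(2023, 6, 1), now))
```

When you evaluate a vendor, it is worth asking which of these thresholds you can configure yourself and what it costs to read data back from the colder tiers.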


Alternatively, look for a vendor that offers the integrated experience. Re-hydration, that is, restoring archived logs so they can be searched again, should be cost-effective so that you are not paying a lot to get your own data back. Avoid vendors that do a 'bait and switch'.


Semantic Observability

You want to use a solution that can provide greater insights and visualizations from the vast amount of observability data. You want to use an observability solution that is more proactive and less noisy. This type of observability will understand context from logs, metrics, traces, tickets, code changes, documentation, chats, etc. As your observability data grows, the AI-powered solution will automatically handle the new volume.


The latest advances in generative AI can simplify access to deep insights and provide automated context. A simple conversational interface with an intelligent observability backend can reduce the need for developers to understand cloud technologies at an expert level. This will narrow the skills gap between developers and SRE/DevOps engineers. Every startup and enterprise will need to evaluate and implement an AI-based strategy to remain competitive. Every organization may need to deal with prompt engineering, large language models, domain-specific AI model development, MLOps, user feedback, etc. to handle the complexity and scale of the system.


Is there such a tool?

You must be conscious about 'build vs buy' and have a clear set of objectives in mind. You need to clearly understand your business problem. You must be clear about what success would look like. A good use case for AI is one where you have lots of available data. How you will handle data security concerns and ensure ethical and responsible AI use are key considerations. And finally, it is important to clearly understand what your customers will demand that can only be delivered by an AI-powered solution.


I am obviously biased, but check out CloudAEye for Semantic Observability! Here is the high-level pitch:

Imagine AI models that can detect an anomaly, contextualize it, prioritize it, and generate hypotheses about it by doing a root-cause analysis. All of this will be done automatically to enable customers to test fixes as soon as the problem arises.


CloudAEye Log Management offers a robust set of features that enable you to use the latest AI breakthroughs. Check out this overview video.


Curious about AIOps?

Did you know that CloudAEye offers one of the most advanced AIOps solutions for Log Management? It focuses on answering the what/why/how questions regarding your cloud operations. Get started with our free tier (see overview video) today!

Nazrul Islam

A seasoned engineering executive, Nazrul has been building enterprise products and services for 20 years. Nazrul is the founder and CEO of CloudAEye. Previously, he was Sr. Dir and Head of CloudBees Core, where he focused on the enterprise version of Jenkins. Before that, he was Sr. Dir of Engineering, Oracle Cloud. Nazrul graduated from the executive MBA program with high distinction (top 10% of the cohort) at the University of Michigan Ross School of Business. Nazrul is a named inventor on 47 patents.