How Grafana made observability accessible

Observability has evolved from a specialized platform and site reliability engineering discipline to a massive market in the “you build it, you run it” era. Ten years into Grafana’s open source journey, the project has made observability accessible to a huge audience of developers.

Grafana couldn’t have come at a better time. When Grafana creator Torkel Ödegaard first introduced his open source project for visualizing service behavior and performance 10 years ago, enterprises were still in catch-up mode, trying to figure out scalable methods for troubleshooting the proliferation of moving parts that microservices-based architectures had spawned. It was the dawn of a new era when, as Amazon CTO Werner Vogels declared in 2006, “You build it, you run it.”

Exciting? Sure. Complicated? Oh, yes.

After all, though continuous delivery and microservices were dramatically speeding up the “build” portion of that equation, the “run” part created all sorts of complicated questions about how to troubleshoot and maintain these services in production. Vogels’ pithy phrase glossed over just how complicated this was in practice, as he illustrated in 2008 with his famous “Death Star” diagram.

Grafana emerged as a way to resolve this Death Star situation, and now, 10 years into its evolution, Grafana looks set to get even better.

The birth of a new open source project

Ödegaard was an engineer and architect at eBay Sweden during those heady times for microservices and continuous delivery. His efforts to observe running systems led him to Graphite, the default time-series database before Prometheus. It had built-in and amazingly rich graphing and query capabilities. Working with tools like StatsD and other metrics frameworks, Ödegaard was struck by how easy it was to send metrics to Graphite to graph them. It was a new playground for effectively monitoring services across scale-out environments.
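To give a sense of how little it took to emit a metric in that era, here is a minimal sketch in Python. It assumes the standard StatsD counter line format and Graphite's plaintext protocol (one "path value timestamp" line per metric, conventionally accepted by Carbon on TCP port 2003). The metric names, host, and helper function names are hypothetical and chosen only for illustration.

```python
import socket


def statsd_counter(name: str, value: int = 1) -> str:
    """Format a StatsD counter line, e.g. 'checkout.orders:1|c'."""
    return f"{name}:{value}|c"


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one Graphite plaintext line: path, value, Unix timestamp, newline."""
    return f"{path} {value} {timestamp}\n"


def send_to_graphite(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Send one plaintext line over TCP.

    The host and port are assumptions for a local Carbon listener.
    """
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))
```

Calling send_to_graphite(graphite_line("web.latency_ms", 42.5, 1700000000)) would write a single data point, provided a Carbon listener is actually running at that address.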

“I just loved seeing the applications and services come to life in real time,” said Ödegaard. “Being able to visualize service behavior and performance, as well as user behavior metrics—and see how these were impacted in real time and over time as changes were rolled out—was really transformative for me.”

But as is typical for creators of open source projects, Ödegaard hit a wall. Graphite had a big problem: usability. Creating the actual queries, graphs, and dashboards was very complex. Worse, the graphs were not interactive, so it wasn’t possible to simply mark a region to zoom in, for example. Soon after, though, Ödegaard’s eBay team started using Kibana, a new open source tool for viewing and searching logs stored in Elasticsearch. As Ödegaard remembers it, Kibana changed the game for centralized log visualization and analytics.

Bouncing between the capabilities and limitations of these two tools inspired him: What if Kibana could query Graphite and visualize time-series metrics? It was an interesting thought, but one without an easy solution. After all, Kibana was fully focused on Elasticsearch as a data source, and the project maintainers didn’t want other data sources, such as Graphite.

This was all the motivation Ödegaard needed to fork it and follow his vision.

On December 5, 2013, he started modifying bits of Kibana, playing around with the graph visualization, trying to get it to visualize data coming from a Graphite query. “I honestly got lost in time as I spent days and weeks working almost nonstop on this passion project,” said Ödegaard. “I don’t remember eating, drinking, or sleeping.” He released Grafana v1 in January 2014. Its clean and good-looking UI, fast and feature-rich graphing, easy editing and query builder UI, and the addition of dashboard template variables to make dashboards more generic and reusable added up to something he thought others might want.

They did.

From Prometheus to all infrastructure data

Grafana was an instant hit with developers, plus devops and platform teams working with time-series data. It quickly became the de facto visualization engine for Prometheus monitoring data. A great start for sure, but the technology promised so much more. Indeed, the biggest reason it took off so dramatically was its open architecture and plug-in model, which allowed developers to easily connect and visualize all their disparate data without being limited to a single data source or locked into a proprietary observability solution.

Ödegaard refers to this as Grafana’s “big tent” philosophy, and it continues to guide the project today.

This decision to make Grafana multi-data-source inverted the model for a monitoring industry ruled by proprietary players that forced users to bring their data into a single solution. The Grafana community leveraged that plug-in model to create 150 data source plug-ins that cover just about every popular data store, letting users query their infrastructure data from within a single open source tool and bringing the native query language for each data source into the Grafana experience.

This unification of data, coupled with the emphasis on not forcing developers to learn new query languages, translated into much smoother ergonomics for developers. One way of thinking about this is in the context of “tab hell,” or the context switching that occurs when chasing down root causes of problems in production and having to traverse logins to numerous data sources and monitoring vendors. Grafana’s approach freed developers from this hell. Salvation never felt so good.

What’s more, Grafana is increasingly an industry default for unifying the “three pillars” of observability: logs, traces, and metrics. Through the years, Grafana has incrementally introduced scalable back ends for logs (Loki), traces (Grafana Tempo), and metrics (Mimir), resulting in the “LGTM” stack. In the process, Grafana has liberated developers from having to understand how to stand up and scale the back ends for these observability data sources.

For most issues, you really need all three telemetry types to get full visibility. For example, if you have only metrics and logs, you might know what went wrong and why, but if it’s a database issue you’d still need traces to see where it went wrong. By making it possible to query and correlate all these telemetry types from one view, Grafana further reduces cognitive overload.
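The idea of correlating telemetry types can be illustrated with a toy Python sketch. This is not Grafana's API or implementation. It only assumes that log lines and trace spans share a trace ID, and every class and function name here is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LogLine:
    trace_id: str
    message: str


@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: float


def slowest_span_for_errors(logs, spans):
    """Map each error log's trace ID to the slowest span in that trace."""
    # Group spans by trace ID so each error log can be matched to its trace.
    by_trace = {}
    for span in spans:
        by_trace.setdefault(span.trace_id, []).append(span)

    result = {}
    for log in logs:
        if "error" in log.message.lower() and log.trace_id in by_trace:
            result[log.trace_id] = max(
                by_trace[log.trace_id], key=lambda s: s.duration_ms
            )
    return result
```

In this sketch the log says what failed, while the matching span points to where the time went, such as a slow database call.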

20 million users later

Today the observability market is on track to reach $9 billion by 2025, according to IDC. On the 10-year anniversary of its creation, Grafana has passed 20 million users in its community of developers around the world. Grafana’s use cases now go well beyond application and infrastructure monitoring. Its visualization capabilities and its ability to unify disparate data stores have put Grafana in student projects, video games, and recently the SpaceX control center.

With Grafana’s popularity, Grafana Labs, Ödegaard’s company and the primary corporate sponsor of Grafana, also found itself in the enviable position of being approached by all the cloud hyperscalers. Grafana Labs has built strong relationships with AWS, Microsoft Azure, Tencent, Alibaba, and Google Cloud, which all have first-party Grafana offerings through official partnerships. AWS was actually the first (which I wrote about while I worked for AWS), and Grafana gives AWS a lot of credit for that. Since establishing the partnership a few years ago, the two companies have expanded it so AWS customers can now get the entire Grafana Cloud LGTM Stack and more running on their infrastructure, via AWS Marketplace.

“The idea of observability has exploded because it finally speaks at a level that is beyond just monitoring or application performance,” said Ödegaard. “It’s about understanding the current state of operations. It’s a uniting concept.”

No doubt at GrafanaCon in Sweden this week, Ödegaard and team will unveil a host of new Grafana 10 capabilities that will continue to up-level the developer experience for visualizing observability data. It’s what Ödegaard and the swelling Grafana community have been doing for a decade now with no signs of slowing down.
