Activating Observability - Building a Reliable Foundation
The Modern Data Stack’s foundation is built on metadata. As we build more complicated architectures, introduce new layers, pipe data into and out of our tools, stream it, transform it, and analyze it, we’ve pushed metadata to the side, and the trust and reliability of our data are suffering.
The resulting problems are numerous and multi-faceted - discoverability, quality, security, automation, siloed context, trustworthiness, availability... the list goes on.
The data stack is rapidly breaking into smaller and smaller pieces, and as we celebrate a plethora of specialized, best-of-breed, self-service tooling, the whole thing threatens to topple without a system to bind it together.
Today, there exist two categories of tooling largely built on metadata: data discovery and data observability. We believe building a truly reliable data stack requires an open, vendor-neutral solution for active data observability.
What is Active Observability?
The concept of “observability” was popularized in infrastructure monitoring (Datadog, New Relic), but it is rapidly becoming a key initiative for data organizations, which increasingly monitor their data directly for quality issues.
Fundamentally, observability seeks to provide answers about the critical components of your environment without needing to anticipate the questions in advance - it measures how well you can understand a system by observing it from the outside.
Active observability bridges traditional data observability practices (data quality, auto-profiling, debugging, and incident management) with active metadata systems.
Active observability is accomplished through:
- Automatic, real-time collection of metadata on the systems, the tooling, and the data itself, scaling to the volumes and variety anticipated by modern data organizations
- A powerful enrichment pipeline performing modular analytics and processing to produce the insights that power core observability practices
- Built-in automation rules that orchestrate changes in your stack with full context, safely and reliably accessing, transforming, and sending data (a minimal sketch follows this list)
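To make the last point concrete, here is a minimal sketch of what such a rule engine might look like in Python. Everything in it - `MetadataEvent`, `Rule`, `pause_pipeline` - is a hypothetical illustration, not a Monosi API:

```python
# Hypothetical sketch of an automation rule engine; none of these names
# are Monosi APIs. A rule pairs a condition on incoming metadata with an
# action to take on the stack.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MetadataEvent:
    source: str    # e.g. "warehouse.orders"
    metric: str    # e.g. "freshness_minutes"
    value: float

@dataclass
class Rule:
    condition: Callable[[MetadataEvent], bool]
    action: Callable[[MetadataEvent], None]

def pause_pipeline(event: MetadataEvent) -> None:
    # A real action would call your orchestrator's API (Airflow, Dagster, ...).
    print(f"Pausing pipelines downstream of {event.source}")

stale_orders = Rule(
    condition=lambda e: e.metric == "freshness_minutes" and e.value > 60,
    action=pause_pipeline,
)

def dispatch(event: MetadataEvent, rules: list[Rule]) -> None:
    """Evaluate every rule against an incoming metadata event."""
    for rule in rules:
        if rule.condition(event):
            rule.action(event)

dispatch(MetadataEvent("warehouse.orders", "freshness_minutes", 90.0), [stale_orders])
```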
Unfortunately, today, we have yet to achieve a true, active observability system. The major blocker to such a system lies in our ability to operationalize the metadata powering it.
Why Haven’t We Operationalized Metadata Yet?
1. Open Standard Fatigue
There are a number of projects that have attempted to create an open standard for ingesting, analyzing, and exposing metadata, with limited success. The data industry is getting crowded, and numerous projects and companies have emerged touting ‘the newest, best, open standard’.
It is unlikely that all the diverse participants in the data stack will consolidate around a single standard. Standards can’t keep pace with the industry’s evolution well enough to support every emerging use case and tool, let alone guarantee end-to-end interoperability. They are destined to fail before they even launch.
2. Impossible Interoperability: Too Many Tools, Vendors, Users, Use Cases
To achieve true interoperability, we must align incentives across the industry: vendors regardless of their position in the stack, practitioners regardless of their use case, and processes as they evolve.
Such a solution would likely be impossible to create in a proprietary fashion. An effective approach necessitates open, vendor-neutral systems that can be ubiquitously accessed and built upon, and that are flexible enough to integrate with unforeseen systems and use cases.
3. Data Volume & Gravity
Today there is more metadata being generated, collected, and analyzed than ever before, and the trend is only accelerating. The volume of metadata that can be accessed and analyzed is being driven upward by an increase in tooling, processes, and innovation, and is reaching ‘big metadata’ status.
Where we store this data matters. Data has gravity: as it grows, it becomes more difficult to move. Organizations seeking full data enablement and ownership of their own data are currently at a loss, forced to maintain relationships with vendors whose interests are not aligned with theirs. Migration costs are high, and these vendors know they hold all of the power.
4. Required Foresight
Today, teams are largely reactive to problems occurring in their data stack. They write tests based on past failures, experiences, or their imagination; trust that expensive vendors’ analysis will detect all potential issues; and build custom, internal processes to avoid that last big problem.
A different approach is needed: one that helps us understand the health of our data systems, derive intelligence from the metadata they create, and power the use cases built on that intelligence. We need a system that doesn’t require anticipating problems in advance, but instead collects all possible metadata and makes it available for analysis across the many use cases, both existing and yet to be imagined.
5. Data Security & Privacy
Companies will always need to consider both data security and privacy, especially at the enterprise-level - a trend that is not likely to change in the next few decades.
These requirements severely limit such organizations’ options, and adopting new tooling demands lengthy compliance processes, especially in industries with particularly sensitive data such as healthcare and finance.
Requirements
Here is the world we envision at Monosi in 5 years:
- Metadata should live directly in your own cloud; companies should have full control over their data.
- Metadata should be queryable by any/all tooling and stored in open formats to share context.
- It should be cheap and easy to collect any and all metadata from day 1, even as it continues to grow into ‘big metadata’.
- There should be built-in ingestion, with support for real-time changes and reactivity/automation - an active system.
- Existing systems and data tooling should update in real time and natively provide context on the system as a whole.
- Metadata should be schema-agnostic and format-agnostic to ensure future use cases are supported instead of conforming to a standard.
- The system should be powered by a graph engine and a type system flexible enough to capture the state of the entire stack.
- Observability should be made a priority over discoverability in the operationalization of metadata, with the flexibility to support emerging use cases.
Only then can we consider metadata a first-class citizen of a more reliable, modern data stack powered by active observability.
How do we Activate Data Observability?
1. Open Source: Bring the ecosystem together
Delivering this capability as open source is a critical part of this initiative, since multiple stakeholders must buy into the ecosystem. Vendors need access to contribute to the system to improve and integrate with it, and data teams need to maintain full control over how they manage their metadata.
The industry has tried this before with open standards, but we propose a new system that doesn’t require vendors and users to conform to a single standard, yet still lets them benefit from community contributions.
2. Metadata Lake & Open Formats
Introducing a metadata lake solves for the storage and retrieval of an organization’s observability data for the foreseeable future. It can store huge volumes of metadata cost effectively, even as the volume of metadata being collected increases. Data delivered to the metadata lake doesn’t require a specific schema before processing, and can also be stored in open formats that can be read by any number of solutions.
This has a number of interesting second-order effects: the lake is queryable, can hydrate entirely new systems for analysis, supports configurable retention at whatever fidelity you need, is schema-agnostic, and enables others yet to be realized. Access to orders of magnitude more metadata at high granularity has the potential to open up entirely new use cases.
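To illustrate the schema-agnostic write path, here is a rough sketch that appends arbitrary metadata events to a lake as JSON Lines, one open format among several (Parquet is another common choice). The path and field names are invented for the example, not a prescribed schema:

```python
# Sketch: schema-agnostic metadata events appended to a lake as JSON Lines,
# an open format that engines like DuckDB, Spark, or Athena can read later.
import json
import pathlib
import time

LAKE = pathlib.Path("metadata-lake/events")  # in production: object storage
LAKE.mkdir(parents=True, exist_ok=True)

def record(event: dict) -> None:
    """No schema is enforced on write; structure is applied at query time."""
    event.setdefault("ingested_at", time.time())
    day_file = LAKE / f"{time.strftime('%Y-%m-%d')}.jsonl"
    with day_file.open("a") as f:
        f.write(json.dumps(event) + "\n")

# Two very different event shapes land in the same lake, no migration needed.
record({"type": "table_profile", "table": "orders", "row_count": 120_431})
record({"type": "dbt_run", "model": "fct_orders", "status": "success"})
```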
3. Graph Engine & Type System
It’s critical that metadata in this system can be modeled and interrelated, and that the system is flexible enough for continually evolving use cases. Because the metadata lake accepts schema-agnostic data, users can send their data to Monosi in the format they already have, model it dynamically, and integrate new tooling and data in very little time.
A built-in graph system is also critical to many core observability use cases, today’s and tomorrow’s alike: understanding the impact, lineage, and root cause of data issues.
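As an illustration of the kinds of questions a graph engine makes trivial, the sketch below uses networkx as a stand-in and answers impact and root-cause queries over a toy lineage graph (the node names are invented):

```python
# Toy lineage graph; networkx stands in for the proposed graph engine.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw.orders", "staging.orders"),
    ("staging.orders", "marts.fct_orders"),
    ("marts.fct_orders", "looker.revenue_dashboard"),
])

def downstream_impact(node: str) -> set:
    """Everything that could break if `node` has a data quality issue."""
    return nx.descendants(lineage, node)

def root_cause_candidates(node: str) -> set:
    """Everything upstream that could have caused an issue at `node`."""
    return nx.ancestors(lineage, node)

print(downstream_impact("staging.orders"))
# {'marts.fct_orders', 'looker.revenue_dashboard'}
print(root_cause_candidates("looker.revenue_dashboard"))
# {'raw.orders', 'staging.orders', 'marts.fct_orders'}
```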
4. Built-In Ingestion, Streaming Support
An always-available, active ingestion system is critical to how quickly teams can react to data issues. Supporting streaming use cases is also necessary as the ecosystem moves to real time; it underpins automation and many other use cases that have yet to be realized.
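Here is a toy sketch of what an always-on ingestion loop looks like, with an in-process queue standing in for a real stream (a Kafka topic, a webhook feed, or similar):

```python
# Toy always-on ingestion loop; the in-process queue stands in for a real
# stream such as a Kafka topic or a webhook feed.
import json
import queue
import threading
import time

events: "queue.Queue[dict]" = queue.Queue()

def producer() -> None:
    """Simulates tools emitting metadata in real time."""
    for i in range(3):
        events.put({"type": "row_count", "table": "orders", "value": 100 + i})
        time.sleep(0.1)

def ingest_forever() -> None:
    """Consume each event as it arrives: store it, then evaluate rules."""
    while True:
        event = events.get()
        print("ingested:", json.dumps(event))
        events.task_done()

threading.Thread(target=ingest_forever, daemon=True).start()
producer()
events.join()  # wait until every emitted event has been processed
```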
5. Automation & Reverse Metadata
Existing observability systems ingest and analyze metadata to detect and root-cause issues, but vendor lock-in leaves them little incentive to share that data with other tools. To truly support automation, we need to complete the loop (much like reverse ETL) and let our existing tooling access contextual metadata across the entire system.
For instance: seeing lineage, quality, and ownership directly in your Looker dashboards; automatically stopping data pipelines when bad data is detected; preventing a schema change that would break downstream reports; and more.
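A rough sketch of that reverse flow, with an entirely hypothetical webhook endpoint standing in for each tool’s real annotation API:

```python
# Hypothetical "reverse metadata" push: quality context flows back into the
# tools where people already work. The endpoint and payload are invented.
import json
from urllib import request

def push_context(tool_webhook: str, asset: str, context: dict) -> None:
    payload = {"asset": asset, "context": context}
    req = request.Request(tool_webhook, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    # A real integration would send it: request.urlopen(req)
    print(f"would POST to {req.full_url}: {payload}")

push_context(
    "https://bi.example.com/hooks/annotate",  # hypothetical endpoint
    asset="looker.revenue_dashboard",
    context={"quality": "degraded", "cause": "staging.orders went stale"},
)
```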
6. Beyond Observability - Automation, Discovery, etc.
It is inevitable that, down the line, the multiple categories powered by metadata will converge. Because the system is open source, contributors can build faster and deeper for more use cases: data process automation, embedded collaboration, data discovery, and more.
7. Enterprise Requirements
To overcome privacy and security concerns, there are also enterprise requirements to consider, such as compliance, user access management, SSO, and other features. Implementing them lowers the barrier for enterprises to adopt open-source technology, and likely forms the roadmap to an enterprise version.
What’s the Roadmap?
We’re just getting started on this vision today. We will spend the next year building this proposed architecture and the core observability use cases - where we believe the ecosystem most needs, and most benefits from, open tooling.
Right now, you can download our project and self-host it. We built a UI that lets anyone auto-profile their data warehouse in minutes to understand the health of their data, much like many proprietary systems today. This is just the first use case as we continue to build towards this vision.
To accelerate adoption, we built our own scheduler, initial data collectors, and alerting integrations so you can get started quickly and analyze your data for quality issues.
Our MVP supports Snowflake, PostgreSQL, and Redshift (we wanted to launch fast and get community feedback as early as possible). We will add support for non-SQL systems and many more sources in the coming weeks.
Give it a spin: https://github.com/monosidev/monosi. Let us know what you think, and don’t hesitate to ⭐ star the project if you like our vision for active observability.
Written in collaboration by Kevin Unkrich & Ivan Porollo
Interested in partnering with us? Get in touch
Interested in contributing? Join the community