Data is increasingly core to many products, whether it is used to provide recommendations to users, to gain insight into how they use the product, or to improve the experience with machine learning. This creates a critical need for reliable data operations and for understanding how data flows through our systems. Data pipelines must be auditable, reliable, and run on time, which proves particularly difficult in a constantly changing, fast-paced environment.
Collecting lineage metadata as data pipelines run provides an understanding of the dependencies between the many teams consuming and producing data, and of how constant changes impact them. It is the foundation that enables the many use cases related to data operations.
The OpenLineage project defines an API that standardizes this metadata across the ecosystem, reducing the complexity and duplicated work of collecting lineage information. It enables the many projects that consume lineage, whether they focus on operations, governance, or security.
Marquez is an open source project, part of the LF AI & Data Foundation, that instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making dependencies across organisations and technologies visible as they change over time.
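To make the idea of standardized lineage metadata more concrete, here is a minimal sketch of what a single run event might look like when built by hand in Python. The field names follow the core of the OpenLineage run event model (event type, event time, run, job, inputs, outputs, producer); the spec defines further fields, such as a schema URL and facets, that are left out here. The helper name make_run_event, the namespaces, the job and dataset names, and the producer URL are all hypothetical and chosen only for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_namespace, job_name, run_id, inputs, outputs):
    """Build a dict shaped like the core of an OpenLineage run event.

    `inputs` and `outputs` are lists of (namespace, name) tuples.
    This is an illustrative helper, not part of any OpenLineage client.
    """
    def dataset(ref):
        namespace, name = ref
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},  # same id for every event of one run
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [dataset(ref) for ref in inputs],
        "outputs": [dataset(ref) for ref in outputs],
        # Hypothetical identifier for the code that emitted the event.
        "producer": "https://example.com/hypothetical-producer",
    }


# Hypothetical job that reads one table and writes another.
event = make_run_event(
    event_type="COMPLETE",
    job_namespace="demo",
    job_name="daily_orders",
    run_id=str(uuid.uuid4()),
    inputs=[("warehouse", "raw.orders")],
    outputs=[("warehouse", "analytics.daily_orders")],
)
print(json.dumps(event, indent=2))
```

Because every pipeline reports the same shape (a job, a run, and the datasets it read and wrote), a backend that implements the API, such as Marquez, can stitch events from different tools into one dependency graph.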