The Value of a Data Catalog
Most organizations are moving their data to the Cloud for improved flexibility, supporting the needs of the business by scaling both horizontally and vertically on demand. Once data is in the Cloud, enterprises can also begin unlocking its value by generating insights with AI and ML services. Data platform offerings from cloud vendors have traditionally included relational databases such as SQL Server and data warehouses, but more recently have grown to include data lakes, cloud warehouses, data engineering and ETL stacks, and analytical workbenches for storing and working with data at scale. However, this data modernization also makes it harder to actively manage and govern data, given its distribution across hybrid and multi-cloud landscapes. Within this context, the value of a data catalog that can integrate metadata across different environments cannot be overstated.
At its core, a catalog scans any database, ranging from relational databases to NoSQL or graph stores, and surfaces useful information. Some of this useful information includes the following (a short profiling sketch follows the list):
Table and column names
Modeled data types
Inferred data types
Patterns in the data
Data length, with minimum and maximum thresholds
Minimum and maximum values
Other profiling characteristics, such as the frequency of values and their distribution
End-to-end lineage across data sources, along with Extract-Transform-Load (ETL) characteristics
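To make the profiling items above concrete, here is a minimal sketch in plain Python that derives several of these statistics for a single column. It assumes the column's raw values have already been read into memory as strings; the function name and the chosen set of statistics are illustrative, not any particular vendor's implementation.

    from collections import Counter

    def profile_column(values):
        """Derive basic catalog-style statistics for one column of raw values."""
        non_null = [v for v in values if v is not None]
        # Infer a crude type: integer if every value is all digits, else string.
        is_int = all(str(v).lstrip("-").isdigit() for v in non_null)
        typed = [int(v) for v in non_null] if is_int else [str(v) for v in non_null]
        lengths = [len(str(v)) for v in non_null]
        return {
            "inferred_type": "int" if is_int else "string",
            "min_length": min(lengths),
            "max_length": max(lengths),
            "min_value": min(typed),
            "max_value": max(typed),
            # Top values and their counts approximate a frequency distribution.
            "value_frequencies": Counter(typed).most_common(5),
        }

    print(profile_column(["10", "42", "7", "42", None]))

A production profiler would sample large tables and push these computations down to the source engine, but the shape of the output is the same kind of metadata a catalog stores.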
However, most on-prem catalog vendor solutions lack the functionality to scan different environments, and in particular lack the capability to scan cloud object storage. Cloud vendor solutions, on the other hand, ship with cloud connectors that provide deeper insight into the physical location of data assets and their semantic meaning, even for assets distributed across different environments. To contrast cloud-based catalogs with traditional on-prem catalogs, let's pick a few examples of the former and dig a bit deeper into their core capabilities.
Managing Metadata with a Catalog on the Cloud
To pick an example, the AWS Glue Data Catalog provides features that include scanning relevant storage, such as S3, RDS, and Redshift, to build out the physical characteristics of tables and files. Further, references to data used as sources and targets in Glue jobs are stored as metadata in the catalog, which helps provide the lineage of the data as it moves through its lifecycle. Similarly, Azure has a catalog offering named Purview.
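As a small illustration of what the catalog holds, the sketch below uses boto3 to list the tables, physical locations, and column types registered in one catalog database. The region and the database name "sales_db" are placeholder assumptions, and the calls need AWS credentials with Glue read permissions.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # List every table registered in one Glue Data Catalog database.
    # (get_tables is paginated; a single page is enough for a sketch.)
    for table in glue.get_tables(DatabaseName="sales_db")["TableList"]:
        storage = table["StorageDescriptor"]
        print(table["Name"], storage.get("Location"))  # physical location, e.g. an S3 path
        for col in storage["Columns"]:
            print("  ", col["Name"], col["Type"])      # column names and modeled types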
Let us look at how a scanner, popularly called a crawler, works in the Glue Data Catalog (a scripted version of these steps follows the list):
1. The crawler connects to the data store of choice in AWS storage, such as S3 or Redshift. Connection properties include configuring the data store settings, data paths, and so on.
2. You specify the type of crawl required: full folder scans or incremental folder scans.
3. The crawler infers a schema from your data stores.
4. The crawler writes metadata to the Data Catalog, creating definitions for databases and tables.
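The same steps can be scripted. Below is a minimal sketch, again with boto3, that defines a crawler over an S3 path and starts it; the crawler name, IAM role ARN, database name, and S3 path are placeholders to substitute for your own environment.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_crawler(
        Name="sales_s3_crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # role that can read the S3 path
        DatabaseName="sales_db",                                # catalog database to write into
        Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
        # Full scan; "CRAWL_NEW_FOLDERS_ONLY" requests incremental crawls instead.
        RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVERYTHING"},
    )

    # Running the crawler infers schemas and writes table definitions to the catalog.
    glue.start_crawler(Name="sales_s3_crawler")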
The Benefits of Metadata
Most data catalogs like Glue are not built to be glossaries; instead, they are used for the maintenance of schemas. The use of metadata has exploded over the last few years, much of it attributable to technological advances and public policy changes such as the GDPR and CCPA regulations, to name a few. Some of the benefits of an enterprise metadata function include simplifying the data landscape, democratizing the search for data assets, and managing schema drift.
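Schema drift, in particular, becomes tractable once the catalog keeps table versions. The sketch below, reusing the placeholder database from earlier and an illustrative table named "orders", compares the two most recent Glue table versions to report added, removed, and retyped columns.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    versions = glue.get_table_versions(DatabaseName="sales_db", TableName="orders")["TableVersions"]
    versions.sort(key=lambda v: int(v["VersionId"]), reverse=True)  # newest first

    def columns(version):
        """Map column name -> type for one table version."""
        return {c["Name"]: c["Type"] for c in version["Table"]["StorageDescriptor"]["Columns"]}

    if len(versions) >= 2:
        new, old = columns(versions[0]), columns(versions[1])
        print("added:  ", sorted(new.keys() - old.keys()))
        print("removed:", sorted(old.keys() - new.keys()))
        print("retyped:", sorted(k for k in new.keys() & old.keys() if new[k] != old[k]))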
To summarize, what are the basic benefits of managing metadata in a catalog?
1. Increased availability of intelligence about data, bringing better context to insights
2. Reduced turnaround time to find answers during analysis
3. Increased efficiency of subject-matter experts in producing information for impact analysis
4. Reduced ambiguity in the relationships among data across the landscape
5. Simplified views of data through meaning, identified redundancy, and relationships