Volume 1: Metadata Management – Part 2: Deep-dive on Metadata Management with DataHub

Let’s look at implementing DataHub 

The metadata management options highlighted in our previous blog Volume 1: Metadata Management – Part 1 depend on several considerations, and any of them may be a great choice for your specific needs. However, let’s pick one tool and dive into what such an implementation might look like. For this purpose, we are going to choose DataHub, because a) it’s a great tool originally developed at LinkedIn, and b) it’s open source, allowing us to get started quickly.

There are three main options for deploying DataHub:

  1. Docker;
  2. Kubernetes; or
  3. Managed DataHub.

For this tutorial we will be using Docker. 

Prerequisites 

Before we begin, make sure you have the following prerequisites installed: 

  • Docker and Docker Compose v2 
  • Python 3.8+ 
  • A computer with at least 2 CPUs, 8 GB RAM and 10GB disk space 
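
A quick way to confirm these prerequisites from the terminal (assuming Docker Desktop, or Docker Engine plus the Compose plugin, is already installed):

Zsh
docker --version          # Docker Engine
docker compose version    # Compose v2 plugin
python3 --version         # should report 3.8 or newer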

Step 1: Installing DataHub CLI 

The DataHub CLI is a tool that simplifies the installation process and helps keep the container images up to date. We can install it using Python’s pip.

Zsh
python3 -m pip install --upgrade pip wheel setuptools  
python3 -m pip install --upgrade acryl-datahub 
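
Once the installation finishes, it’s worth confirming that the CLI is on your PATH and checking which version was installed:

Zsh
datahub version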

Step 2: Starting DataHub 

With the CLI installed, let’s start DataHub with the following command: 

Zsh
datahub docker quickstart 

This will deploy a DataHub instance using docker-compose. If you want to verify what is being installed, the docker-compose.yaml file is downloaded to your home directory under `.datahub/quickstart`. 
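
For example, you can inspect the generated file to see exactly which services will be started (path and file name as described above):

Zsh
ls ~/.datahub/quickstart/
less ~/.datahub/quickstart/docker-compose.yaml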

 If the process goes well, it should output the following message: 

Zsh
Finished pulling docker images!
Starting up DataHub...
[+] Running 12/12
 Container datahub-zookeeper-1               Healthy    0.0
 Container datahub-elasticsearch-1           Healthy    0.0
 Container datahub-mysql-1                   Healthy    0.0
 Container datahub-elasticsearch-setup-1     Exited     3.4
 Container datahub-broker-1                  Healthy    0.0
 Container datahub-mysql-setup-1             Exited     3.6
 Container datahub-schema-registry-1         Healthy    0.0
 Container datahub-kafka-setup-1             Exited     0.0
 Container datahub-datahub-upgrade-1         Exited    26.5
 Container datahub-datahub-gms-1             Healthy   81.6
 Container datahub-datahub-actions-1         Started   82.2
 Container datahub-datahub-frontend-react-   Started   82.2

...............  

 DataHub is now running 

Ingest some demo data using `datahub docker ingest-sample-data` or head to http://localhost:9002 (username: datahub, password: datahub) to play around with the frontend. Need support? Get in touch on Slack: https://slack.datahubproject.io/
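
One more CLI tip before we move on: when you’re done experimenting, or if you want to reset a broken deployment, the same quickstart command can stop the containers, and `nuke` wipes everything, including ingested metadata (double-check `datahub docker --help` for the flags available in your CLI version):

Zsh
datahub docker quickstart --stop   # stop the containers, keep the data
datahub docker nuke                # remove containers, volumes and all metadata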

Step 3: Accessing DataHub 

After the setup is complete, access DataHub by navigating to http://localhost:9002 in your web browser. The default credentials are: 

Zsh
Username: datahub  
Password: datahub 

You should see the DataHub UI where you can start exploring your data. 
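
If the page doesn’t load, a quick sanity check from the terminal shows whether the frontend container is answering at all:

Zsh
curl -I http://localhost:9002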

Step 4: Loading Data 

As you can see, our DataHub instance does not have any data loaded! To jumpstart our exploration of DataHub’s potential, let’s use the DataHub CLI to ingest sample data with the following command: 

Zsh
datahub docker ingest-sample-data

After refreshing the DataHub page, we are now presented with data from different sources. 

Step 5: Data Exploration 

Let’s check the Hive data that we have available: 

As shown in the screenshot above, we have one database, one schema, and some datasets. 

Let’s start exploring this data hierarchically, starting from the database: 

We can see that the database `datahub_db` contains the schema `datahub_schema`.  

Looking at the right side of the page, we can see that there are some attributes available for our database. 

  • About: A brief description of the object 
  • Owners: Which users own and are responsible for the object 
  • Tags: Custom tags that can make data more discoverable 
  • Glossary Terms: Glossary terms linked to the object, helping distribute knowledge company-wide 
  • Domain: A scope tied to your organizational structure that determines which business unit is responsible for the object 
  • Data Product: Which Data Product is associated with the object 

Continuing our exploration, let’s view our `datahub_schema`: 

Here we can see that one of the datasets is associated with our schema. We can also see a summary of the dataset, its shape (number of rows and columns), and its owners. 

Exploring further, clicking on `SampleHiveDataset` greets us with more information:  

On the ‘Schema‘ tab, we can see the dataset’s fields, their data types, whether they are primary or foreign keys, and any tags associated with each field. 
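
The schema and other aspects shown in the UI can also be pulled as JSON from the CLI. A small sketch, assuming the CLI is pointed at the local quickstart instance (the URN below is the one used by the sample dataset; adjust it for your own assets):

Zsh
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"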

Clicking on ‘Documentation‘, we can see that this dataset has documentation attached and that we can add more by clicking ‘Add Link‘. 

Clicking on ‘Lineage‘, we can explore the dependencies between datasets. 

⚠️ This page can only show tables that are either downstream or upstream, not both at the same time, a small limitation that DataHub will hopefully improve! 

However, clicking on ‘Visualize Lineage‘, we can see both downstream and upstream relationships at the same time.  

Going back to the table page (pressing back in the browser) and then clicking on the Queries tab, we can see highlighted queries that use our dataset. 

⚠️ Note that these queries are not automatically populated; you need to supply them yourself. This is a feature DataHub needs to improve on, as it would be quite useful to directly see the transformation logic applied to a dataset. 

Navigating to the Stats page, we can see that our data has been profiled. Profiling provides summary information about the data, such as null percentage, mean, median, and other metrics, giving any DataHub user quick insight into data accuracy. 

The profiler runs when DataHub ingests new data, and you can filter which datasets to scan or skip. To see how the profiler can be configured, check the DataHub documentation for the `profiling` options of your source. Reference: Hive | DataHub  
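
If you ingest your own Hive metadata from the CLI rather than relying on the quickstart sample, profiling is switched on in the ingestion recipe itself. A minimal sketch, assuming a Hive server reachable on `localhost:10000`, the quickstart GMS endpoint, and the Hive plugin installed (e.g. `pip install 'acryl-datahub[hive]'`); the database name below is illustrative:

Zsh
cat > hive_recipe.yml <<'EOF'
source:
  type: hive
  config:
    host_port: localhost:10000
    database: datahub_db        # illustrative database name
    profiling:
      enabled: true             # switch table profiling on
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
EOF

datahub ingest -c hive_recipe.yml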

Clicking on ‘Validation‘, we can see that this dataset has had data quality checks run against it using Great Expectations. 


Great Expectations is an open-source Python library that enables users to create assertions about data assets that are validated when data gets updated. It’s a powerful tool for data quality management, providing a way to define, test, and monitor data quality. 

The integration between DataHub and Great Expectations allows data owners to automatically surface validation outcomes from Great Expectations in the DataHub UI. This seamless integration provides a full 360-degree view of data assets, enhancing data quality and collaboration. 

You can find more information about this integration in the Great Expectations blog post: Better Together: DataHub and Great Expectations. 
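
The wiring for this lives on the Great Expectations side: DataHub ships a validation action that you add to a checkpoint’s `action_list`, so every validation run is pushed to the GMS endpoint. A hedged sketch of that fragment, assuming the v0.13-era plugin layout (module and class names as documented at the time; the file name is just illustrative):

Zsh
# Assumes the DataHub GE plugin is installed, e.g.: pip install 'acryl-datahub[great-expectations]'
# Fragment to place under `action_list:` in a Great Expectations checkpoint config.
cat > datahub_action_fragment.yml <<'EOF'
- name: datahub_action
  action:
    module_name: datahub.integrations.great_expectations.action
    class_name: DataHubValidationAction
    server_url: http://localhost:8080   # the quickstart GMS endpoint
EOF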

On the Incidents page, we can check whether there are any active incidents, as well as review past ones. For our dataset, no incidents were raised. 

⚠️ Note that alerts can only be created in the Managed version of DataHub; this would have been a great open-source feature. 

Step 6: DataHub – Analytics 

Clicking on Analytics in the main webapp bar, we can check how DataHub is being used in our organization. This can be a great way to track adoption and foster a playfully competitive view across domains and/or departments in the organization. 

We can see how many weekly or monthly users access the platform, how many searches they’ve run, and the top search keywords, among other metrics. This information can be further filtered by pre-defined Data Domains.  

These can be very useful metrics for defining priorities alongside the actual usage of the data products themselves. 

Step 7: DataHub – Govern 

Clicking on Govern in the main webapp bar and then on Business Glossary, we are greeted with a dictionary of our business terms. 

Terms can be grouped into term groups; in our case, we have the group `ClientsAndAccounts`. 

Clicking on the term group, we can see its definition and related terms.  

Clicking on the term `AccountBalance`, we can see its definition (we’ll sketch below, after the list, how such a glossary can also be defined as code). 

  • Clicking on ‘Related Entities‘, it shows which dataset fields are related to our term; in this case, we don’t have any related entities. 
  • Clicking on ‘Related Terms‘, as with the previous screen, it shows related terms; in our case, there aren’t any. 
  • Related terms can have the following types of relationships: contains, inherits, contained by, inherited by. 
  • And last but not least, clicking on ‘Properties‘, we can check whether there are any properties associated with our term. In this case, it states that any field describing an account balance needs to be written as its FQDN (Fully Qualified Domain Name), in this case AccountBalance. 
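
Glossaries like this don’t have to be maintained by hand in the UI: DataHub can also ingest a glossary defined in YAML through its `datahub-business-glossary` source. A minimal, hedged sketch of how the sample group and term might look in that format (file names and descriptions here are illustrative; check the Business Glossary source docs for the full schema):

Zsh
cat > business_glossary.yml <<'EOF'
version: 1
source: DataHub
owners:
  users:
    - datahub
nodes:
  - name: ClientsAndAccounts
    description: Terms related to clients and their accounts   # illustrative description
    terms:
      - name: AccountBalance
        description: The current available balance of an account   # illustrative description
EOF

cat > glossary_recipe.yml <<'EOF'
source:
  type: datahub-business-glossary
  config:
    file: ./business_glossary.yml
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
EOF

datahub ingest -c glossary_recipe.yml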

Clicking on ‘Govern‘ in the main webapp bar and then on Domains, we can take a look at our Data Domains and Data Products. 

Our sample data doesn’t include a pre-defined Data Domain, so let’s define our own! 

Clicking on ‘Create Domain‘ opens a pop-up for us to define our domain. 

After creating our Supply Chain domain, clicking on it shows a clean-slate Domain with no associated entities, documents, or products. 

Let’s add some assets to our domain by clicking on `Add assets`. Then, let’s add our Hive database `datahub_db`. 

We have successfully added the database.  

⚠️ Recall that in step 5, this database was associated with a schema. However, despite this pre-existing relationship, the schema was not automatically included in the Supply Chain domain. 

Step 8: DataHub – Ingestion 

Navigating to Ingestion in the main webapp bar leads us to where we can integrate new data sources into DataHub using its web-based interface. 

Here we can add new data sources. Clicking on `Create new source` shows a diverse set of databases and dashboard tools to choose from. 

Let’s add a new CSV file. It needs to be hosted on the web, as we can’t upload a file from our PC as of version 0.13. 

Let’s add the addresses.csv file from https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv. It contains 6 records of fake addresses. 
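
Before pointing DataHub at it, it can be worth previewing the file from the terminal to confirm it’s reachable and to see what the columns look like:

Zsh
curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv | head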

After adding the file, we have the option to add a schedule to our ingestion. Let’s click on the toggle below `Run on a schedule` to create it. 

Finally, let’s give this ingestion the name `FSU-Addresses` and click `Save & Run`. 

Right after creating our ingestion, it starts running. 

Conclusion

Adding a metadata management tool to an organization can improve its data strategy, reducing the effort needed to generate insights and breaking down knowledge silos in the company. It provides a comprehensive view of data assets, facilitating data discovery, data governance, and data management. 

I hope you enjoyed this detailed hands-on tutorial. Stay tuned for the next blog in our Data Governance blog series.

