
2.5. Data catalog solutions
A variety of solutions offering data catalog, data management, and metadata management
services are already available on the market. In some cases they are part of a wider data
platform; in others they exist as standalone services. Essentially all major technology
companies market their own products, including big names such as IBM, Microsoft, Oracle,
and Amazon. However, for the purposes of this thesis, the focus will be on open-source
software, developed to be freely used, modified, and distributed. In particular, seven
solutions will be presented: Apache Atlas, Amundsen, CKAN, Kylo, Magda, Truedat,
and iRODS, concentrating on the aspects most pertinent to metadata.
2.5.1. Apache Atlas
Apache Atlas [4], hosted by the Apache Software Foundation, is a metadata management
and data governance tool that allows users to ingest, discover, catalog, classify, and
govern data from multiple data sources.
It employs a metadata model named ‘Type System’, which consists of definitions called
types. Instances of types, called entities, represent the managed metadata objects. The
Type System allows types and entities to be defined and managed. Metadata objects are
persisted through a graph model, under the control of a Graph Engine, which also
creates the appropriate indices for the metadata objects. Ingest and export components
are included as well. Two integration methods are provided. The primary mechanism to
query and discover metadata is a REST API, which also enables types and entities to be
created, updated, and deleted. In addition, a messaging interface based on Kafka is in
place, useful for communicating metadata objects to Atlas and for consuming metadata
change events from Atlas. At present, the supported metadata sources include
HBase, Hive, Sqoop, Storm, and Kafka.
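As an illustration of how types and entities relate, the following minimal Python
sketch builds the JSON payloads that could be submitted through the REST API. It is
not taken from the Atlas documentation: the type name sample_table, the attribute
owner_team, the helper functions, and the server address are hypothetical, while the
endpoint paths types/typedefs and entity and the DataSet supertype are assumed from
version 2 of the Atlas REST API. The sketch only constructs and prints the payloads;
it does not contact a server.

```python
import json

# Assumed address of a local Atlas server; adjust for a real deployment.
ATLAS_BASE_URL = "http://localhost:21000/api/atlas/v2"


def endpoint(path):
    """Return the full URL of a REST resource below the assumed base URL."""
    return f"{ATLAS_BASE_URL}/{path.lstrip('/')}"


def build_type_definition(type_name, attribute_names):
    """Build a payload declaring one entity type with optional string attributes."""
    attribute_defs = [
        {
            "name": name,
            "typeName": "string",
            "isOptional": True,
            "cardinality": "SINGLE",
        }
        for name in attribute_names
    ]
    return {
        "entityDefs": [
            {
                "name": type_name,
                # Assumption: the new type extends the predefined DataSet type.
                "superTypes": ["DataSet"],
                "attributeDefs": attribute_defs,
            }
        ]
    }


def build_entity(type_name, attributes):
    """Build a payload describing one entity, i.e. an instance of a type."""
    return {"entity": {"typeName": type_name, "attributes": dict(attributes)}}


if __name__ == "__main__":
    # Hypothetical type with a single custom attribute.
    typedef = build_type_definition("sample_table", ["owner_team"])
    # Hypothetical instance of that type.
    entity = build_entity(
        "sample_table",
        {
            "qualifiedName": "sales.orders@cluster1",
            "name": "orders",
            "owner_team": "analytics",
        },
    )
    print("POST", endpoint("types/typedefs"))
    print(json.dumps(typedef, indent=2))
    print("POST", endpoint("entity"))
    print(json.dumps(entity, indent=2))
```

The type definition is submitted once, after which any number of entities can refer
to it through typeName. This mirrors the distinction between types and entities
described above.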
Among the functionalities offered by Atlas, it is worth mentioning the presence of
predefined metadata types, coupled with the possibility of defining new types. Each type
can have attributes and object references. Moreover, it is possible to dynamically
create classifications and propagate them through data lineage. A basic search is
available, which allows data to be queried by type, classification, attribute value, or
free text, as well as an advanced search based on a SQL-like language named Domain
Specific Language (DSL). A rich REST API is also available to search by complex criteria,
and the search results can be filtered. Furthermore, Atlas provides a glossary of business
terms, an intuitive UI to view lineage of data as they move through various processes and