MetaCat Concepts#
MetaCat is a general-purpose metadata catalog. It has four major functions:
Store metadata associated with a file
Provide a mechanism to retrieve the metadata associated with a file
Efficiently query the metadata database to find files matching the list of criteria expressed in terms of file metadata
Provide a mechanism to seamlessly and efficiently integrate metadata stored in external sources into the query results and use it to select files based on their metadata values
Namespace#
A namespace is a named entity that provides a mechanism to avoid name clashes. Files, datasets and named queries are uniquely identified by the combination of their namespace and name. A namespace is a close equivalent of a Rucio scope.
Another function of a namespace is object ownership.
File#
MetaCat is purely a metadata database. Replica management is outside of MetaCat's scope. That is why, in MetaCat, a file is a rather abstract object. A file can be a member of one or more datasets.
MetaCat separates file metadata into two parts: fixed file attributes and flexible metadata (or simply metadata). Flexible metadata can be any JSON dictionary.
File Attributes#
The following file attributes are defined for every file. Some of them have default values.
File ID - ASCII string with some limitations, unique for the MetaCat instance. It can be defined by the user at the time of the file declaration or it can be generated by MetaCat.
File namespace and name - file name is unique within the namespace. File name and namespace correspond to Rucio concept of file name and scope
Creator username and creation timestamp - username of the user who declared the file and the declaration timestamp
Update username and timestamp - username of the user who last updated the file metadata and the timestamp of that update
File size - integer number of bytes
Checksums - a dictionary with checksum types and values for the file
Parents - file or files used to produce the file
Children - files produced from this file. This property is not actually stored in the database, but rather derived from Parents
File size and checksums are not used by MetaCat, therefore, depending on the application, they do not have to have meaningful values.
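Because Children are derived from Parents rather than stored, the relationship can be reconstructed by inverting the parent links. The following is a minimal sketch of that idea, not MetaCat's implementation; the `parents_by_file` mapping and the file IDs are hypothetical.

```python
from collections import defaultdict

def derive_children(parents_by_file):
    """Invert a {file ID: [parent file IDs]} mapping into {file ID: [child file IDs]}."""
    children = defaultdict(list)
    for fid, parents in parents_by_file.items():
        for parent in parents:
            children[parent].append(fid)
    return dict(children)

# Hypothetical provenance: two raw files produce one reconstructed file,
# which in turn produces one ntuple.
parents_by_file = {
    "raw1": [],
    "raw2": [],
    "reco1": ["raw1", "raw2"],
    "ntuple1": ["reco1"],
}

children = derive_children(parents_by_file)
print(children)
```

Files that have no children, such as `ntuple1` here, simply do not appear as keys in the derived mapping.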
A MetaCat file also has a dictionary of flexible metadata. This can be any dictionary representable in JSON. MetaCat itself does not put any restrictions on the values stored in this dictionary. In this dictionary, top-level key names must have the following structure:
<category>.<name>
<category>.<subcategory>.<name> and so on
Category names, subcategory names and parameter names are separated with dots. Categories and subcategories can be used to define restrictions on the metadata namespace.
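The key naming rule can be checked on the client side before a file is declared. The sketch below is illustrative only: the function names are hypothetical, and rejecting empty components (such as a trailing dot) is an assumption rather than documented MetaCat behavior.

```python
import json

def invalid_metadata_keys(metadata):
    """Return the top-level keys that are not of the form <category>.<name>."""
    bad = []
    for key in metadata:
        parts = key.split(".") if isinstance(key, str) else []
        # Require at least a category and a name, with no empty components
        # (the empty-component rule is an assumption made for this sketch).
        if len(parts) < 2 or any(part == "" for part in parts):
            bad.append(key)
    return bad

def is_json_representable(metadata):
    """Return True if the dictionary can be serialized to JSON."""
    try:
        json.dumps(metadata)
        return True
    except (TypeError, ValueError):
        return False

meta = {
    "detector.beam.status": "on",  # <category>.<subcategory>.<name>
    "run.number": 12,              # <category>.<name>
    "status": "bad",               # no category
    "run.": 1,                     # empty parameter name
}
print(invalid_metadata_keys(meta))
print(is_json_representable(meta))
```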
Dataset#
In MetaCat, Dataset is a relatively static collection of files. “Relatively static” means that files are added to and removed from a dataset explicitly. There is no such thing as a “dynamic” dataset, which automatically contains files matching certain criteria.
A dataset can be set to monotonic or frozen. If the dataset is frozen, no files can be added to or removed from it. If the dataset is monotonic, files can only be added to it, but not removed. The dataset owner can change these flags at any time.
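The effect of the two flags can be summarized with a small in-memory model. This is a sketch of the rules stated above, not MetaCat code; the `Dataset` class and `DatasetStateError` exception are hypothetical names.

```python
class DatasetStateError(Exception):
    """Raised when a dataset flag forbids the requested change."""

class Dataset:
    def __init__(self, namespace, name, frozen=False, monotonic=False):
        self.namespace = namespace
        self.name = name
        self.frozen = frozen
        self.monotonic = monotonic
        self.files = set()  # file IDs explicitly added to the dataset

    def add_file(self, fid):
        # A frozen dataset accepts no changes at all.
        if self.frozen:
            raise DatasetStateError("dataset is frozen")
        self.files.add(fid)

    def remove_file(self, fid):
        # Removal is blocked by either flag.
        if self.frozen:
            raise DatasetStateError("dataset is frozen")
        if self.monotonic:
            raise DatasetStateError("dataset is monotonic")
        self.files.discard(fid)

ds = Dataset("test", "calibration", monotonic=True)
ds.add_file("f1")          # allowed: monotonic datasets can grow
try:
    ds.remove_file("f1")   # rejected: monotonic datasets cannot shrink
except DatasetStateError as e:
    print("rejected:", e)
```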
A file can be added to multiple datasets.
Like a file, a dataset has attributes and flexible metadata. The flexible dataset metadata can be any JSON dictionary.
Dataset Attributes#
Dataset attributes are:
Name and namespace
Name and namespace of parent dataset, if any
Creator username
Creation time
Zero or more child datasets
Frozen
Monotonic
Description
Dataset metadata restrictions#
A dataset can define metadata restrictions. They can be used to enforce certain requirements for files added to the dataset. The rules can be defined to:
Require certain metadata fields to be present in the file metadata
Define acceptable ranges, enumeration or patterns for parameter values
Parameter Categories#
All metadata parameters are grouped into categories. Categories may have subcategories. Each metadata parameter name must include at least one dot.
Parameter category is the portion of the parameter name before the last dot and parameter name within the category is the part of the name after the last dot. For example, if the full parameter name is detector.beam.status, then the category name is detector.beam and the parameter name is status.
The purpose of the parameter category is to allow adding metadata constraints for the parameters in the category. Each constraint is defined in terms of:
The name of the parameter
Parameter type (any, int, string, float, dictionary, list, list of ints, etc.)
Allowed parameter values (range, enumeration, pattern)
The category can be either restricted or not. If restricted, the category may not contain any parameters other than those listed in the category constraints.
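The rules above can be sketched as a simple validator. Everything here is illustrative: the constraint dictionary layout (`type`, `range`, `values`, `pattern`) and the function names are assumptions made for this example and do not reflect how MetaCat stores or applies constraints.

```python
import re

def split_parameter_name(full_name):
    """Split 'detector.beam.status' into ('detector.beam', 'status')."""
    category, _, name = full_name.rpartition(".")
    return category, name

def check_value(value, constraint):
    """Check one value against a hypothetical constraint dictionary."""
    expected = constraint.get("type")
    if expected is not None and not isinstance(value, expected):
        return False
    if "range" in constraint:
        low, high = constraint["range"]
        if not (low <= value <= high):
            return False
    if "values" in constraint and value not in constraint["values"]:
        return False
    if "pattern" in constraint:
        if not isinstance(value, str) or re.fullmatch(constraint["pattern"], value) is None:
            return False
    return True

def violations(metadata, category, constraints, restricted):
    """List the problems found for parameters belonging to one category."""
    problems = []
    for full_name, value in metadata.items():
        cat, name = split_parameter_name(full_name)
        if cat != category:
            continue  # parameter belongs to another category
        if name not in constraints:
            if restricted:
                problems.append(f"{full_name}: not allowed in restricted category")
            continue
        if not check_value(value, constraints[name]):
            problems.append(f"{full_name}: value {value!r} violates constraint")
    return problems

constraints = {
    "status": {"type": str, "values": ["on", "off"]},
    "energy": {"type": float, "range": (0.0, 120.0)},
}
meta = {
    "detector.beam.status": "standby",  # not in the enumeration
    "detector.beam.energy": 80.0,       # within range
    "detector.beam.color": "red",       # not listed in the constraints
    "run.number": 1,                    # different category, ignored
}
for problem in violations(meta, "detector.beam", constraints, restricted=True):
    print(problem)
```

With `restricted=False`, the unlisted `detector.beam.color` parameter is accepted and only the `status` value is reported.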
Ownership#
MetaCat has the notion of object ownership. Namespaces and Parameter Categories are explicitly owned either by an individual user or by a role (group of users). The owner of a namespace automatically owns all the datasets, files and named queries in the namespace. The following operations can only be performed by the object owner:
Adding a file to a dataset - can only be done by the dataset owner
Creating a dataset in a namespace or modifying a dataset's flags - by the namespace owner
Adding a dataset to another dataset - by the parent dataset owner
Creating parameter subcategories - if the category is restricted, only by the parent category owner
Modifying metadata constraints in the category - by the category owner
Creating or modifying a named query in the namespace - by the namespace owner
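The user-or-role ownership rule can be illustrated with a short check. This is a sketch under stated assumptions: the `Namespace` class, the `ROLE_MEMBERS` table and `can_create_dataset` are hypothetical names, not part of the MetaCat API.

```python
# Hypothetical role membership table: role name -> set of usernames.
ROLE_MEMBERS = {"production": {"alice", "bob"}}

class Namespace:
    def __init__(self, name, owner_user=None, owner_role=None):
        # In this sketch a namespace is owned by exactly one user or one role.
        if (owner_user is None) == (owner_role is None):
            raise ValueError("specify exactly one of owner_user or owner_role")
        self.name = name
        self.owner_user = owner_user
        self.owner_role = owner_role

    def is_owned_by(self, username):
        if self.owner_user is not None:
            return username == self.owner_user
        # Role ownership: any member of the role acts as the owner.
        return username in ROLE_MEMBERS.get(self.owner_role, set())

def can_create_dataset(username, namespace):
    """Creating a dataset requires owning the namespace."""
    return namespace.is_owned_by(username)

personal = Namespace("carol_test", owner_user="carol")
shared = Namespace("prod", owner_role="production")

print(can_create_dataset("carol", personal))  # user-owned namespace
print(can_create_dataset("alice", shared))    # role-owned namespace, member
print(can_create_dataset("carol", shared))    # role-owned namespace, non-member
```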
Query#
A MetaCat query is an algorithm for selecting files based on a set of criteria defined by the user. The result of a query execution is a file set. A file set is an unordered collection of files that match the given set of criteria at the time when the query is executed. Because the contents of the database are dynamic and can change at any time, the same query is not guaranteed to return the same results the next time it is executed.
Currently, MetaCat does not have a mechanism to specify the order of the resulting file set. Therefore, even if the set of files returned by the query is the same, MetaCat does not guarantee that they are returned in the same order.
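A client that needs a reproducible order can sort the returned files itself. The sketch below assumes each result is a dictionary with `namespace` and `name` keys; that record layout is an assumption made for illustration.

```python
# Hypothetical query results, in whatever order they happened to arrive.
results = [
    {"namespace": "test", "name": "b.dat"},
    {"namespace": "prod", "name": "z.dat"},
    {"namespace": "test", "name": "a.dat"},
]

# Namespace and name together identify a file, so they form a stable sort key.
ordered = sorted(results, key=lambda f: (f["namespace"], f["name"]))

for f in ordered:
    print(f["namespace"], f["name"])
```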
MetaCat queries are written in Metadata Query Language (MQL). The fundamental concept behind MQL is that it provides a mechanism to build a complicated query from simpler queries. The file sets produced by simple queries are transformed into results of more complicated queries as the query is executed.
Most MQL queries are translated internally into SQL. This allows most of the queries to be executed by the database engine, which is designed to do that efficiently. The only exception is when an external data filter is used in the query. In this case, MetaCat translates as much of the query as possible into SQL and evaluates the rest of the query outside the database engine.
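The idea of combining file sets can be illustrated with ordinary Python sets. This is not MQL syntax and not how MetaCat evaluates queries; the file records, the `select` helper and the metadata keys are hypothetical.

```python
# Hypothetical file records with flexible metadata.
files = [
    {"fid": "f1", "metadata": {"run.type": "calibration", "detector.beam.status": "on"}},
    {"fid": "f2", "metadata": {"run.type": "calibration", "detector.beam.status": "off"}},
    {"fid": "f3", "metadata": {"run.type": "physics", "detector.beam.status": "on"}},
]

def select(records, predicate):
    """A 'simple query': the set of file IDs whose record satisfies the predicate."""
    return {r["fid"] for r in records if predicate(r)}

# Two simple queries, each producing a file set.
calibration = select(files, lambda r: r["metadata"].get("run.type") == "calibration")
beam_on = select(files, lambda r: r["metadata"].get("detector.beam.status") == "on")

# More complicated results are built by transforming the simpler file sets.
both = calibration & beam_on           # files in both sets
either = calibration | beam_on         # files in at least one set
calib_beam_off = calibration - beam_on # calibration files without beam

print(sorted(both), sorted(either), sorted(calib_beam_off))
```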
The following file and dataset attributes can be used in a metadata query without a category:
fid - file ID
namespace
name
creator - creator username
create_timestamp - floating point number, standard UNIX epoch timestamp
size - file size
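Since create_timestamp is a floating point UNIX epoch value, calendar dates have to be converted before they are compared with it. A minimal sketch using only the Python standard library follows; the sample values are made up.

```python
from datetime import datetime, timezone

# Convert a calendar date (UTC) into an epoch timestamp usable as a cutoff.
cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp()

# Convert an epoch timestamp, such as a create_timestamp value, back to a date.
create_timestamp = 1700000000.0  # made-up sample value
created = datetime.fromtimestamp(create_timestamp, tz=timezone.utc)

print(cutoff)
print(created.isoformat())
print(create_timestamp > cutoff)  # was the file created after the cutoff?
```

Passing an explicit `timezone.utc` avoids results that depend on the local time zone of the machine running the code.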
Currently, queries do not require any authorization.