MetaCat API#

Creating MetaCat Client Interface#

The MetaCatClient object is the client-side interface to the MetaCat server. To create a MetaCatClient, you need the URL of the server:

from metacat.webapi import MetaCatClient

client = MetaCatClient("http://host.domain:8080/metacat/instance")

For client authentication, the MetaCatClient object uses a token created by calling one of its login methods or by using the CLI. To get a token with the CLI, use the metacat auth login command:

$ metacat auth login alice
Password:...

Not all client methods require client authentication. Most read-only methods can be used without authentication. Authentication is required to get information about users and roles (these methods are not yet implemented).

MetaCatClient Class Methods Reference#

class metacat.webapi.MetaCatClient(server_url=None, auth_server_url=None, max_concurrent_queries=5, token=None, token_file=None, token_library=None, timeout=None)

Initializes the MetaCatClient object

Parameters:
  • server_url (str) – The server endpoint URL, default = from METACAT_SERVER_URL environment variable

  • auth_server_url (str) – The endpoint URL for the Authentication server, default = server_url + “/auth”

  • max_concurrent_queries (int, optional) – Controls the concurrency when asynchronous queries are used

  • token_file (str) – File path to read the authentication token from

  • token (bytes or str or SignedToken) – Use this token for authentication, optional

  • timeout (int or float) – Request timeout in seconds. Default: None - use default timeout, which is 300 seconds
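
A minimal sketch of creating a client; the server URL and token file path below are placeholders, not real endpoints:

import os
from metacat.webapi import MetaCatClient

# explicit server URL plus a previously saved token file (paths are hypothetical)
client = MetaCatClient(
    "https://host.domain:8080/metacat/instance",
    token_file="/home/alice/.metacat_token",
    timeout=60
)

# or rely on the METACAT_SERVER_URL environment variable
os.environ["METACAT_SERVER_URL"] = "https://host.domain:8080/metacat/instance"
client = MetaCatClient()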

add_child_dataset(parent_spec, child_spec)

Adds a child dataset to a dataset.

Parameters:
  • parent_spec (str) – Parent namespace, name (“namespace:name”)

  • child_spec (str) – Child namespace, name (“namespace:name”)

add_files(dataset, file_list=None, namespace=None, query=None)

Add existing files to an existing dataset. Requires client authentication.

Parameters:
  • dataset (str) – “namespace:name” or “name”, if namespace argument is given

  • query (str) – MQL query to run and add files matching the query

  • file_list (list) –

    List of dictionaries, one dictionary per file. Each dictionary must contain either a file id

    { "fid": "abcd12345" }
    

    or namespace/name:

    { "name": "filename.data", "namespace": "my_namespace" }
    

    or DID:

    { "did": "my_namespace:filename.data" }
    

  • namespace (str, optional) – Default namespace. If a file_list item is specified with a name without a namespace, the default namespace will be used.

Returns:

number of files added to the dataset

Return type:

int

Notes

Either file_list or query must be specified, but not both
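
A short sketch of both ways of selecting the files to add; the dataset, namespace and file names are hypothetical:

# add explicitly listed files
client.add_files("production:good_runs", file_list=[
    {"did": "raw:run_001.data"},
    {"name": "run_002.data", "namespace": "raw"},
    {"fid": "abcd12345"}
])

# or add all files matching an MQL query
n = client.add_files("production:good_runs",
        query="files from raw:all where created_timestamp > '2020-10-10'")
print(n, "files added")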

async_query(query, data=None, **args)

Run the query asynchronously. Requires client authentication if save_as or add_to are used.

Parameters:
  • query (str) – Query in MQL

  • data (anything) – Arbitrary data associated with this query

  • args – Same keyword arguments as for the query() method

Returns:

pythreader Promise object associated with this query. The promise object will have Data attribute containing the object passed as the data argument to the async_query call.

See notes below for more on how to use this method.

Return type:

Promise

auth_info()

Returns information about current authentication token.

Returns:

  • str – username of the authenticated user

  • numeric – token expiration timestamp
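
For example, assuming the method returns the (username, expiration) pair described above:

user, expiration = client.auth_info()
print("authenticated as", user, "- token expires at", expiration)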

create_dataset(did, frozen=False, monotonic=False, metadata=None, metadata_requirements=None, files_query=None, subsets_query=None, description='')

Creates new dataset. Requires client authentication.

Parameters:
  • did (str) – “namespace:name”

  • frozen (bool)

  • monotonic (bool)

  • metadata (dict) – Dataset metadata

  • metadata_requirements (dict) – Metadata requirements for files in the dataset

  • files_query (str) – Run MQL file query and add resulting files to the new dataset

  • subsets_query (str) – Run MQL dataset query and add resulting datasets to the new dataset as subsets

  • description (str)

Returns:

created dataset attributes

Return type:

dict
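
A minimal sketch of creating a dataset; the namespace, metadata keys and description are illustrative:

info = client.create_dataset(
    "production:good_runs",
    metadata={"campaign": "2020"},           # hypothetical dataset metadata
    description="Good runs selected in 2020"
)
print(info)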

create_namespace(name, owner_role=None, description=None)

Creates new namespace. Requires client authentication.

Parameters:
  • name (str) – Namespace name

  • owner_role (str) – Owner role for the new namespace. The user must be a member of the role. Optional. If unspecified, the new namespace will be owned by the user.

  • description (str) – New namespace description

Returns:

New namespace information

Return type:

dict

declare_file(did=None, namespace=None, name=None, auto_name=None, dataset_did=None, dataset_namespace=None, dataset_name=None, size=0, metadata={}, fid=None, parents=[], checksums={}, dry_run=False)

Declare new file and add it to the dataset. Requires client authentication.

Parameters:
  • did (str) – file “namespace:name”

  • namespace (str) – file namespace

  • name (str) – file name

  • auto_name (str) – pattern to use for file name auto generation, default None - do not auto-generate file name

  • dataset_did (str) – dataset “namespace:name”

  • dataset_namespace (str) – dataset namespace

  • dataset_name (str) – dataset name

  • size (int) – file size in bytes, default 0

  • metadata (dict) – file metadata, default empty dictionary

  • fid (str) – file id, default None - to be auto-generated

  • checksums (dict) – dictionary with checksum values by the checksum type: {“type”:”value”, …}

  • parents (list of dicts) –

    each dict represents one parent file. The dict must contain one of the following
    • ”fid” - parent file id

    • ”namespace” and “name” - parent file namespace and name

    • ”did” - parent file DID (“<namespace>:<name>”)

  • dry_run (boolean) – If true, run all the necessary checks but stop short of the actual file declaration or of adding it to a dataset. If not all checks are successful, generate either InvalidMetadataError or WebAPIError. Default: False = do declare

Returns:

dictionary with file name, namespace and file id. Names and file ids will be auto-generated as necessary.

Return type:

dict

Notes

At least one of the following must be specified for the file:
  • did

  • namespace and either name or auto_name

At least one of the following must be specified for the dataset:
  • dataset_did

  • dataset_namespace and dataset_name

Auto-name pattern can be any string with the following substrings, which will be replaced with appropriate values to generate the file name:

  • $clock - current integer timestamp in milliseconds

  • $clock3 - last 3 digits of $clock - milliseconds only

  • $clock6 - last 6 digits of $clock

  • $clock9 - last 9 digits of $clock

  • $uuid - random UUID in hexadecimal representation, 32 hex digits

  • $uuid16 - 16 hex digits from random UUID hexadecimal representation

  • $uuid8 - 8 hex digits from random UUID hexadecimal representation

  • $fid - file id
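
A sketch of declaring a single file into an existing dataset; the names, size, metadata and checksum values are illustrative:

result = client.declare_file(
    did="raw:run_001.data",
    dataset_did="production:good_runs",
    size=1024,
    metadata={"run_number": 1},             # hypothetical metadata
    checksums={"adler32": "0a1b2c3d"},
    parents=[{"did": "raw:run_000.data"}]
)
print(result["fid"])                        # auto-generated if fid was not given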

declare_files(dataset, files, namespace=None, dry_run=False, as_required=None)

Declare new files and add them to an existing dataset. Requires client authentication.

Parameters:
  • dataset (str) – “namespace:name”

  • files (list or dict) – List of dictionaries, one dictionary per file to be declared. See Notes below for the expected contents of each dictionary. For convenience, when declaring a single file, the argument can be that file's dictionary instead of a list.

  • namespace (str, optional) – Default namespace for files to be declared

  • dry_run (boolean) – If true, run all the necessary checks but stop short of the actual file declaration or of adding to a dataset. If not all checks are successful, generate either InvalidMetadataError or WebAPIError. Default: False = do declare

Returns:

list of dictionaries, one dictionary per file with file ids: { “fid”: “…” }

Return type:

list

Notes

Each file to be declared must be represented with a dictionary. The dictionary must contain one of:

  • “did” - string in the format “<namespace>:<name>”

  • “name” - file name and optionally “namespace”. If the namespace is not present, the namespace argument will be used as the default namespace

  • “auto_name” - pattern to auto-generate the file name

{
    "namespace": "namespace",           # optional, namespace can be specified for each file explicitly or implicitly using the namespace=... argument
    "name": "filename",                 # optional,
    "did": "namespace:filename",        # optional, convenience for Rucio users
                                        # either "did" or "name", "namespace" must be present
    "size": ...,                        # required, integer number of bytes
    "metadata": {...},                  # optional, file metadata, a dictionary with arbitrary JSON'able contents
    "fid":  "...",                      # optional, file id. Will be auto-generated if unspecified.
                                        # if specified, must be unique
    "parents": [...],                   # optional, list of dicts, one dict per parent. See below.
    "checksums": {                      # optional, checksums dictionary
        "method": "value",...
    },
    "auto_name": "..."                  # optional, pattern to auto-generate file name if name is not specified or null
},...

Parents are specified with dictionaries, one dictionary per file. Each dictionary specifies the parent file in one of three ways:

  • “did”: “<namespace>:<name>”

  • “namespace”:”…”, “name”:”…”

  • “fid”: “<file id>”

DEPRECATED: if the parent is specified with a string instead of a dictionary, it is interpreted as the parent file id.
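
A sketch of declaring two files at once, one with an explicit name and one with an auto-generated name; all values are illustrative:

declared = client.declare_files("production:good_runs", [
    {
        "did": "raw:run_001.data",
        "size": 1024,
        "metadata": {"run_number": 1}       # hypothetical metadata
    },
    {
        "namespace": "raw",
        "auto_name": "run_$uuid8.data",     # name will be auto-generated
        "size": 2048
    }
])
for item in declared:
    print(item["fid"])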

delete_file(did=None, namespace=None, name=None, fid=None)

Delete an existing file. The file will be removed from all datasets and the database and its name and file id can be reused.

Parameters:
  • did (str) – file “namespace:name”

  • fid (str) – file id

  • namespace (str) – file namespace

  • name (str) – file name

get_category(path)

Get category information

Returns:

A dictionary with category information or None if not found

Return type:

dict

get_dataset(did=None, namespace=None, name=None, exact_file_count=False)

Gets a single dataset

Parameters:
  • did (str - "namespace:name")

  • namespace (str)

  • name (str)

Returns:

dataset attributes or None if the dataset was not found

Return type:

dict

get_dataset_counts(did=None, namespace=None, name=None)

Gets counts of files, subsets, supersets, etc. for a single dataset

Parameters:
  • did (str - "namespace:name")

  • namespace (str)

  • name (str)

Returns:

dataset counts or None if the dataset was not found

Return type:

dict

get_dataset_files(did, namespace=None, name=None, with_metadata=False, include_retired_files=False)

Gets the list of files in a single dataset

Parameters:
  • did (str - "namespace:name")

  • namespace (str)

  • name (str)

Returns:

generates a sequence of dictionaries, one dictionary per file

Return type:

generator
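
Because the method returns a generator, the file records can be iterated over directly; the dataset name below is hypothetical:

for f in client.get_dataset_files("production:good_runs", with_metadata=True):
    print(f)            # one dictionary per file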

get_file(name=None, namespace=None, fid=None, did=None, with_metadata=True, with_provenance=True, with_datasets=False)

Get one file record

Parameters:
  • fid (str, optional) – File id

  • name (str, optional)

  • namespace (str, optional) – name and namespace must be specified together

  • did (str, optional) – “namespace:name”

  • with_metadata (boolean) – whether to include file metadata

  • with_provenance (boolean) – whether to include parents and children list

  • with_datasets (boolean) – whether to include the list of datasets the file is in

Returns:

dictionary with file information or None if the file was not found

{
    "name": "namespace:filename",       # file name, namespace
    "fid":  "...",                      # files id
    "creator":  "...",                  # username of the file creator
    "created_timestamp":   ...,         # numeric UNIX timestamp
    "size": ...,                        # file size in bytes
    "checksums": { ... },               # file checksums

    # included if with_provenance=True
    "parents":  ["fid",...],            # list of ids for the file parent files
    "children": ["fid",...],            # list of ids for the file child files

    # included if with_metadata=True
    "metadata": { ... },                # file metadata

    # included if with_datasets=True
    "datasets": [
        {"namespace":"...", "name":"..."}, ...
    ]
}

Return type:

dict

Notes

Retrieving file provenance and metadata takes slightly longer.
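
For example (the DID is hypothetical):

f = client.get_file(did="raw:run_001.data", with_provenance=False)
if f is None:
    print("file not found")
else:
    print(f["fid"], f["size"], f["metadata"])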

get_files(lookup_list, with_metadata=True, with_provenance=True)

Get many file records

Parameters:
  • lookup_list (list) –

    List of dictionaries, one dictionary per file. Each dictionary must have either

    ”did”:”namespace:name”, or “namespace”:”…” and “name”:”…” or “fid”:”file id”

  • with_metadata (boolean) – whether to include file metadata

  • with_provenance (boolean) – whether to include the parents and children lists

Return type:

List of file records, each record is the same as returned by get_file()

get_named_query(namespace, name)

Get named query

Parameters:
  • namespace (str)

  • name (str)

Returns:

A dictionary with information about the named query or None if the named query does not exist.

Return type:

dict or None

get_namespace(name)

Get information about a namespace

Parameters:

name (str) – Namespace name

Returns:

Namespace information or None if the namespace was not found

Return type:

dict

get_namespaces(names)

Get information for multiple namespaces

Parameters:

names (list of str) – Namespace names

Returns:

List of dictionaries with namespace information

Return type:

list

get_version()

Returns server version as text

list_categories(root=None)

List categories

Parameters:

root (str) – Optional, if present, list only categories under the root

Returns:

List of dictionaries with category information sorted by category path

Return type:

list

list_datasets(namespace_pattern=None, name_pattern=None, with_counts=False)

Gets the list of datasets with namespace/name matching the templates. The templates are Python fnmatch module style templates where '*' matches any substring and '?' matches a single character.

Parameters:
  • namespace_pattern (str)

  • name_pattern (str)

  • with_counts (boolean) – controls whether the results should include file counts or dataset names only

Yields:

generator – yields dictionaries like {“namespace”:…, “name”:…, “file_count”:…}
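
For example, listing datasets whose names start with “run_” in a hypothetical “production” namespace:

for d in client.list_datasets(namespace_pattern="production", name_pattern="run_*",
                              with_counts=True):
    print(d["namespace"], d["name"], d.get("file_count"))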

list_named_queries(namespace=None)

Get multiple named queries

Parameters:

namespace (str) – optional, if specified the list will include all named queries in the namespace. Otherwise, all named queries will be returned

Returns:

List of dictionaries with information about the named queries.

Return type:

list

list_namespaces(pattern=None, owner_user=None, owner_role=None, directly=False)

List namespaces

Parameters:
  • pattern (str) – Optional fnmatch style pattern to filter namespaces by name

  • owner_user (str) – Optional, return only namespaces owned by the specified user

  • directly (boolean) – If False and owner_user is specified, also return namespaces owned by all the roles the user is in. Ignored if owner_user is not specified

  • owner_role (str) – Optional, return only namespaces owned by the specified role. Ignored if owner_user is also specified

Returns:

List of dictionaries with namespace information sorted by the namespace name

Return type:

list

login_digest(username, password, save_token=False)

Performs password-based authentication and stores the authentication token locally.

Parameters:
  • username (str)

  • password (str) – Password is not sent over the network. It is hashed and then used for digest authentication (RFC 2617).

Returns:

  • str – username of the authenticated user (same as username argument)

  • numeric – token expiration timestamp

login_ldap(username, password)

Performs password-based authentication and stores the authentication token locally using LDAP.

Parameters:
  • username (str)

  • password (str) – Password

Returns:

  • str – username of the authenticated user (same as username argument)

  • numeric – token expiration timestamp

login_password(username, password)

Combines LDAP and RFC 2617 digest authentication by calling login_ldap first and then, if that fails, the login_digest method

Parameters:
  • username (str)

  • password (str) – Password

Returns:

  • str – username of the authenticated user (same as username argument)

  • numeric – token expiration timestamp
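
A sketch of a password login; getpass is used here only to avoid hard-coding the password:

import getpass

user, expiration = client.login_password("alice", getpass.getpass("Password: "))
print("logged in as", user, "- token expires at", expiration)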

login_token(username, encoded_token)

Authenticate using a JWT or a SciToken.

Parameters:
  • username (str)

  • encoded_token (str or bytes)

Returns:

  • str – username of the authenticated user (same as username argument)

  • numeric – authentication expiration timestamp

login_x509(username, cert, key=None)

Performs X.509 authentication and stores the authentication token locally.

Parameters:
  • username (str)

  • cert (str) – Path to the file with the X.509 certificate or the certificate and private key

  • key (str) – Path to the file with the X.509 private key

Returns:

  • str – username of the authenticated user (same as username argument)

  • numeric – token expiration timestamp

move_files(namespace, file_list=None, query=None)

Moves existing files to a different namespace.

Parameters:
  • namespace (str) – namespace to move files to

  • query (str) – MQL query to run; files matching the query will be moved

  • file_list (list) –

    List of dictionaries, one dictionary per file. Each dictionary must contain either a file id

    { "fid": "abcd12345" }
    

    or namespace/name:

    { "name": "filename.data", "namespace": "my_namespace" }
    

    or DID:

    { "did": "my_namespace:filename.data" }
    

Returns:

number of files moved, list of errors, if any

Return type:

tuple
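
A sketch of moving all files matching a query into a hypothetical “archive” namespace:

nmoved, errors = client.move_files("archive",
        query="files from raw:all where created_timestamp < '2019-01-01'")
print(nmoved, "files moved")
if errors:
    print(errors)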

query(query, namespace=None, with_metadata=False, with_provenance=False, save_as=None, add_to=None, include_retired_files=False, summary=None, batch_size=0)

Run file query. Requires client authentication if save_as or add_to are used.

Parameters:
  • query (str) – Query in MQL

  • namespace (str) – default namespace for the query

  • include_retired_files (boolean) – whether to include retired files in the query results, default False

  • with_metadata (boolean) – whether to return file metadata

  • with_provenance (boolean) – whether to return parents and children list

  • save_as (str) – namespace:name for a new dataset to create and add found files to

  • add_to (str) – namespace:name for an existing dataset to add found files to

  • summary (str or None) – “count” - return [{“count”: n, “total_size”: nbytes}]; “keys” - return the list of all top level metadata keys for the selected files. summary can not be used together with save_as or add_to

Returns:

list of dictionaries with file information, one dictionary per file

Return type:

list of dicts

Notes

Retrieving file provenance and metadata takes slightly longer.
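
For example (the dataset name is illustrative):

files = client.query(
    "files from production:good_runs where created_timestamp > '2020-10-10'",
    with_metadata=True
)
for f in files:
    print(f["fid"], f.get("metadata"))

# or request only the summary instead of the file list
print(client.query("files from production:good_runs", summary="count"))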

remove_dataset(dataset)

Remove a dataset. Requires client authentication.

Parameters:

dataset (str) – “namespace:name”

remove_files(dataset, file_list=None, namespace=None, query=None)

Remove files from a dataset. Requires client authentication.

Parameters:
  • dataset (str) – “namespace:name” or “name”, if namespace argument is given

  • query (str) – MQL query to run; files matching the query will be removed

  • file_list (list) –

    List of dictionaries, one dictionary per file. Each dictionary must contain either a file id

    { "fid": "abcd12345" }
    

    or namespace/name:

    { "name": "filename.data", "namespace": "my_namespace" }
    

    or DID:

    { "did": "my_namespace:filename.data" }
    

  • namespace (str, optional) – Default namespace. If a file_list item is specified with a name without a namespace, the default namespace will be used.

Returns:

actual number of files removed from the dataset

Return type:

int

Notes

Either file_list or query must be specified, but not both

retire_file(did=None, namespace=None, name=None, fid=None, retire=True)

Modify the retired status of the file. A retired file remains in the database and “occupies” the name in the namespace, but is not visible to normal queries. A retired file can be brought back to normal using this method too.

If you need to completely remove the file, use the delete_file method.

Parameters:
  • did (str) – file “namespace:name”

  • fid (str) – file id

  • namespace (str) – file namespace

  • name (str) – file name

  • retire (bool) – whether the file should be retired

Returns:

Dictionary with updated file information

Return type:

dict

retry_request(method, url, timeout=None, **args)

Implements retrying on a 503 response with a random, exponentially growing delay. Use timeout=0 to try the request exactly once. Returns the response with status=503 on timeout.

search_named_queries(query)

Run MQL query for named queries

Parameters:

query (str) – Query in MQL

Returns:

The list contains one dictionary per matching named query with the query information.

Return type:

list of dicts

update_dataset(dataset, metadata=None, mode='update', frozen=None, monotonic=None, description=None)

Update dataset. Requires client authentication.

Parameters:
  • dataset (str) – “namespace:name”

  • metadata (dict or None) – New metadata values, or, if None, leave the metadata unchanged

  • mode (str) – Either "update" or "replace". If "update", metadata will be updated with new values. If "replace", metadata will be replaced with new values. If metadata is None, mode is ignored

  • frozen (boolean or None) – if boolean, new value for the flag. If None, leave it unchanged

  • monotonic (boolean or None) – if boolean, new value for the flag. If None, leave it unchanged

  • description (str or None) – if str, new dataset description. If None, leave the description unchanged

Returns:

dictionary with new dataset information

Return type:

dict

update_file(did=None, namespace=None, name=None, fid=None, replace=False, size=None, checksums=None, parents=None, children=None, metadata=None)

Updates attributes of an existing file.

Parameters:
  • did (str) – file “namespace:name”

  • fid (str) – file id

  • namespace (str) – file namespace

  • name (str) – file name

  • replace (bool) – If True, the specified attribute values will be replaced with new values. Otherwise they will be added (for parents and children) or updated (for checksums and metadata)

  • size (int >= 0) – file size, optional

  • checksums (dict) – checksum values, optional

  • parents (list) – list of parent file ids, optional

  • children (list) – list of child file ids, optional

  • metadata (dict) – dictionary with metadata to update or replace, optional

Returns:

Dictionary with updated file information

Return type:

dict
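
A sketch of updating the metadata and checksums of a single file; the DID and values are illustrative:

updated = client.update_file(
    did="raw:run_001.data",
    metadata={"quality": "good"},           # hypothetical metadata key
    checksums={"adler32": "0a1b2c3d"}
)
print(updated)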

update_file_meta(metadata, files=None, names=None, fids=None, namespace=None, dids=None, mode='update')

Updates metadata for existing files. Requires client authentication.

DEPRECATED: update_file() should be used instead

Parameters:
  • metadata (dict) – see Notes

  • files (list of dicts) – Each dict specifies a file. See Notes

  • names (list of strings) – List of file names. Requires namespace to be specified

  • dids (list of strings) – List of DIDs (“namespace:name”) strings

  • fids (list of strings) – List of file ids. The list of files can be specified with fids or with names argument, but not both.

  • namespace (string) – Default namespace

  • mode (str) – Either "update" (default) or "replace". If mode is "update", existing metadata will be updated with values in metadata. If "replace", then new values will replace existing metadata. Also, see notes below.

Returns:

list of dictionaries, one dictionary per file with file ids: { “fid”: “…” }

Return type:

list

Notes

This method can be used to apply common metadata changes to a list of files. This method can not be used to update file provenance information.

The metadata argument is used to specify the common changes to apply to the metadata of multiple files. The metadata dictionary will be used to either update the existing metadata of the listed files (if mode="update") or replace it (if mode="replace").

Files to update have to be specified in one of the following ways:

  • files = [list of dicts] - each dict must be in one of the following formats:

    • {“fid”:”<file id>”}

    • {“namespace”:”<file namespace>”, “name”:”<file name>”} - namespace is optional. Default: the value of the “namespace” method argument

    • {“did”:”<file namespace>:<file name>”}

  • dids = [list of file DIDs]

  • names = [list of file names] - the “namespace” method argument must be used to specify the common namespace

  • fids = [list of file ids]

wait_queries()

Wait for all issued asynchronous queries to complete

Exceptions#

MetaCatClient methods can raise the following exceptions:

  • metacat.webapi.InvalidArgument - method was called with an invalid argument

  • metacat.webapi.NotFoundError - an object is not found in the database

  • metacat.webapi.BadRequestError - invalid request

  • metacat.webapi.AlreadyExistsError - an object already exists in the database

  • metacat.webapi.PermissionDeniedError - permission denied

  • metacat.webapi.InvalidMetadataError - metadata validation failed

All these exceptions inherit from the metacat.webapi.WebAPIError class. WebAPIError.__str__ can be used to get a human-readable description of the exception, e.g.:

from metacat.webapi import WebAPIError

try:
    results = metacat_client.method(...)
except WebAPIError as e:
    print(e)

Asynchronous Queries#

When you need to run multiple queries, you can use the async_query method to run them concurrently by starting them asynchronously and then waiting for their results:

client = MetaCatClient(url)

datasets = [ "production:A", "production:B" ]

promises = []
for dataset_name in datasets:
    query = f"files from {dataset_name} where created_timestamp > '2020-10-10'"
    promise = client.async_query(query, dataset_name)
    promises.append(promise)

for promise in promises:
    results = promise.wait()
    n = len(results)
    dataset_name = promise.Data
    print(f"Dataset {dataset_name}: {n} files")

In this example, we start two queries concurrently. Each will get files from its own dataset. When we start the asynchronous queries, the client object returns promises instead of query results. A promise is an object on which you can wait for the actual results. Also, we pass the dataset name to the async_query method as the data argument so that we can associate the returned results with the dataset.

In the second for-loop, we wait for the results from each query and use the promise's Data attribute to refer to the actual dataset name when printing the results. Note that the second for-loop goes through the promises in the same order in which they were created, but that does not mean that we expect the queries to complete in the same order. If a query completes before we call the corresponding promise's wait method, wait will simply return the results immediately.

Another way to wait for all asynchronous queries to complete is to call the wait_queries method of the MetaCatClient:

client = MetaCatClient(url)

datasets = [ "production:A", "production:B" ]
promises = {}

for dataset_name in datasets:
    query = f"files from {dataset_name} where created_timestamp > '2020-10-10'"
    promise = client.async_query(query, None)
    promises[dataset_name] = promise

client.wait_queries()

for dataset_name, promise in promises.items():
    results = promise.wait()
    n = len(results)
    print(f"Dataset {dataset_name}: {n} files")

The wait_queries method will block until all asynchronous queries started by the client complete. In this case, calling the wait method of each promise is still necessary to get the results of each individual query, but because we called wait_queries first, wait will return the results immediately without blocking.