MetaCat API#
Creating MetaCat Client Interface#
MetaCatClient object is the client side interface to the MetaCat server. To create a MetaCatClient, you need the URL for the server:
from metacat.webapi import MetaCatClient
client = MetaCatClient("http://host.domain:8080/metacat/instance")
For client authentication, the MetaCatClient object uses the token created by calling one of its login methods or by using the CLI.
To get a token using the CLI, use the metacat auth login command:
$ metacat auth login alice
Password:...
Not all client methods require client authentication; most read-only methods can be used without it. Authentication is required to get information about users and roles (these methods are not yet implemented).
MetaCatClient Class Methods Reference#
- class metacat.webapi.MetaCatClient(server_url=None, auth_server_url=None, max_concurrent_queries=5, token=None, token_file=None, token_library=None, timeout=None)
Initializes the MetaCatClient object
- Parameters:
server_url (str) – The server endpoint URL, default = value of the METACAT_SERVER_URL environment variable
auth_server_url (str) – The endpoint URL for the Authentication server, default = server_url + “/auth”
max_concurrent_queries (int, optional) – Controls the concurrency when asynchronous queries are used
token_file (str) – File path to read the authentication token from
token (bytes or str or SignedToken) – Use this token for authentication, optional
timeout (int or float) – Request timeout in seconds. Default: None - use default timeout, which is 300 seconds
- add_child_dataset(parent_spec, child_spec)
Adds a child dataset to a dataset.
- Parameters:
parent_spec (str) – Parent namespace, name (“namespace:name”)
child_spec (str) – Child namespace, name (“namespace:name”)
- add_files(dataset, file_list=None, namespace=None, query=None)
Add existing files to an existing dataset. Requires client authentication.
- Parameters:
dataset (str) – “namespace:name” or “name”, if namespace argument is given
query (str) – MQL query to run and add files matching the query
file_list (list) –
List of dictionaries, one dictionary per file. Each dictionary must contain either a file id
{ "fid": "abcd12345" }
or namespace/name:
{ "name": "filename.data", "namespace": "my_namespace" }
or DID:
{ "did": "my_namespace:filename.data" }
namespace (str, optional) – Default namespace. If a file_list item is specified with a name without a namespace, the default namespace will be used.
- Returns:
number of files added to the dataset
- Return type:
int
Notes
Either file_list or query must be specified, but not both.
- async_query(query, data=None, **args)
Run the query asynchronously. Requires client authentication if save_as or add_to are used.
- Parameters:
query (str) – Query in MQL
data (anything) – Arbitrary data associated with this query
args – Same keyword arguments as for the query() method
- Returns:
pythreader Promise object associated with this query. The promise object will have a Data attribute containing the object passed as the data argument to the async_query call. See notes below for more on how to use this method.
- Return type:
Promise
- auth_info()
Returns information about current authentication token.
- Returns:
str – username of the authenticated user
numeric – token expiration timestamp
- create_dataset(did, frozen=False, monotonic=False, metadata=None, metadata_requirements=None, files_query=None, subsets_query=None, description='')
Creates new dataset. Requires client authentication.
- Parameters:
did (str) – “namespace:name”
frozen (bool)
monotonic (bool)
metadata (dict) – Dataset metadata
metadata_requirements (dict) – Metadata requirements for files in the dataset
files_query (str) – Run MQL file query and add resulting files to the new dataset
subsets_query (str) – Run MQL dataset query and add resulting datasets to the new dataset as subsets
description (str)
- Returns:
created dataset attributes
- Return type:
dict
- create_namespace(name, owner_role=None, description=None)
Creates new namespace. Requires client authentication.
- Parameters:
name (str) – Namespace name
owner_role (str) – Owner role for the new namespace. The user must be a member of the role. Optional. If unspecified, the new namespace will be owned by the user.
description (str) – New namespace description
- Returns:
New namespace information
- Return type:
dict
- declare_file(did=None, namespace=None, name=None, auto_name=None, dataset_did=None, dataset_namespace=None, dataset_name=None, size=0, metadata={}, fid=None, parents=[], checksums={}, dry_run=False)
Declare new file and add it to the dataset. Requires client authentication.
- Parameters:
did (str) – file “namespace:name”
namespace (str) – file namespace
name (str) – file name
auto_name (str) – pattern to use for file name auto generation, default None - do not auto-generate file name
dataset_did (str) – dataset “namespace:name”
dataset_namespace (str) – dataset namespace
dataset_name (str) – dataset name
size (int) – file size in bytes, default 0
metadata (dict) – file metadata, default empty dictionary
fid (str) – file id, default None - to be auto-generated
checksums (dict) – dictionary with checksum values by the checksum type: {“type”:”value”, …}
parents (list of dicts) –
- each dict represents one parent file. The dict must contain one of the following:
”fid” - parent file id
”namespace” and “name” - parent file namespace and name
”did” - parent file DID (“<namespace>:<name>”)
dry_run (boolean) – If true, run all the necessary checks but stop short of the actual file declaration or adding to a dataset. If not all checks are successful, generate either InvalidMetadataError or WebApiError. Default: False = do declare
- Returns:
dictionary with file name, namespace and file id. Names and file ids will be auto-generated as necessary.
- Return type:
dict
Notes
- At least one of the following must be specified for the file:
did
namespace and either name or auto_name
- At least one of the following must be specified for the dataset:
dataset_did
dataset_namespace and dataset_name
Auto-name pattern can be any string with the following substrings, which will be replaced with appropriate values to generate the file name:
$clock - current integer timestamp in milliseconds
$clock3 - last 3 digits of $clock - milliseconds only
$clock6 - last 6 digits of $clock
$clock9 - last 9 digits of $clock
$uuid - random UUID in hexadecimal representation, 32 hex digits
$uuid16 - 16 hex digits from random UUID hexadecimal representation
$uuid8 - 8 hex digits from random UUID hexadecimal representation
$fid - file id
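The substitution rules above can be illustrated client-side. The expand_auto_name helper below is only a local sketch of those rules, not the server code (the real expansion happens on the MetaCat server); $fid is omitted because the file id is not known on the client side, and the choice of which 16 or 8 hex digits to take from the UUID is an assumption:

```python
import time
import uuid

def expand_auto_name(pattern):
    """Local illustration of the server-side auto-name substitutions."""
    clock = str(int(time.time() * 1000))  # integer timestamp in milliseconds
    hexuuid = uuid.uuid4().hex            # 32 hex digits
    # Longer placeholders must be replaced before their prefixes.
    return (pattern
            .replace("$clock9", clock[-9:])
            .replace("$clock6", clock[-6:])
            .replace("$clock3", clock[-3:])
            .replace("$clock",  clock)
            .replace("$uuid16", hexuuid[:16])
            .replace("$uuid8",  hexuuid[:8])
            .replace("$uuid",   hexuuid))

name = expand_auto_name("run_$clock6_$uuid8.data")
# name looks like "run_123456_0a1b2c3d.data"
```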
- declare_files(dataset, files, namespace=None, dry_run=False, as_required=None)
Declare new files and add them to an existing dataset. Requires client authentication.
- Parameters:
dataset (str) – “namespace:name”
files (list or dict) – List of dictionaries, one dictionary per a file to be declared. See Notes below for the expected contents of each dictionary. For convenience, if declaring single file, the argument can be the single file dictionary instead of a list.
namespace (str, optional) – Default namespace for files to be declared
dry_run (boolean) – If true, run all the necessary checks but stop short of the actual file declaration or adding to a dataset. If not all checks are successful, generate either InvalidMetadataError or WebApiError. Default: False = do declare
- Returns:
list of dictionaries, one dictionary per file with file ids: { “fid”: “…” }
- Return type:
list
Notes
Each file to be declared must be represented with a dictionary. The dictionary must contain one of:
“did” - string in the format “<namespace>:<name>”
- “name” - file name and optionally “namespace”. If namespace is not present, the namespace argument will be used as the default namespace
“auto_name” - pattern to auto-generate file name
{
    "namespace": "namespace",    # optional; can be specified for each file explicitly or implicitly via the namespace=... argument
    "name": "filename",          # optional
    "did": "namespace:filename", # optional, convenience for Rucio users
                                 # either "did" or "name"/"namespace" must be present
    "size": ...,                 # required, integer number of bytes
    "metadata": {...},           # optional file metadata, a dictionary with arbitrary JSON'able contents
    "fid": "...",                # optional file id; auto-generated if unspecified, must be unique if specified
    "parents": [...],            # optional, list of dicts, one dict per parent. See below.
    "checksums": {               # optional checksums dictionary
        "method": "value", ...
    },
    "auto_name": "..."           # optional pattern to auto-generate the file name if "name" is not specified or null
}, ...
Parents are specified with dictionaries, one dictionary per file. Each dictionary specifies the parent file in one of three ways:
“did”: “<namespace>:<name>”
“namespace”:”…”, “name”:”…”
“fid”: “<file id>”
DEPRECATED: if the parent is specified with a string instead of a dictionary, it is interpreted as the parent file id.
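A sketch of the files argument (all names and sizes below are hypothetical; the call itself is commented out because it needs a live, authenticated server):

```python
# Two files to declare: one fully specified by DID, one by name
# relying on the default namespace passed as namespace=...
files = [
    {"did": "my_namespace:fileA.data", "size": 1024,
     "metadata": {"run": 1234},
     "checksums": {"adler32": "01234567"}},
    {"name": "fileB.data", "size": 2048,
     "parents": [{"did": "my_namespace:fileA.data"}]},
]

# With a live, authenticated client (hypothetical dataset name):
# fids = client.declare_files("my_namespace:my_dataset", files,
#                             namespace="my_namespace")

# Each file must be identified by "did", "name" (+ default namespace), or "auto_name".
assert all(set(f) & {"did", "name", "auto_name"} for f in files)
```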
- delete_file(did=None, namespace=None, name=None, fid=None)
Delete an existing file. The file will be removed from all datasets and the database and its name and file id can be reused.
- Parameters:
did (str) – file “namespace:name”
fid (str) – file id
namespace (str) – file namespace
name (str) – file name
- get_category(path)
Get category information
- Returns:
A dictionary with category information or None if not found
- Return type:
dict
- get_dataset(did=None, namespace=None, name=None, exact_file_count=False)
Gets a single dataset
- Parameters:
did (str - "namespace:name")
namespace (str)
name (str)
- Returns:
dataset attributes or None if the dataset was not found
- Return type:
dict
- get_dataset_counts(did=None, namespace=None, name=None)
Gets counts of files, subsets, supersets, etc. for a single dataset
- Parameters:
did (str - "namespace:name")
namespace (str)
name (str)
- Returns:
dataset counts or None if the dataset was not found
- Return type:
dict
- get_dataset_files(did, namespace=None, name=None, with_metadata=False, include_retired_files=False)
Gets the files of a single dataset
- Parameters:
did (str - "namespace:name")
namespace (str)
name (str)
- Returns:
generates a sequence of dictionaries, one dictionary per file
- Return type:
generator
- get_file(name=None, namespace=None, fid=None, did=None, with_metadata=True, with_provenance=True, with_datasets=False)
Get one file record
- Parameters:
fid (str, optional) – File id
name (str, optional)
namespace (str, optional) – name and namespace must be specified together
did (str, optional) – “namespace:name”
with_metadata (boolean) – whether to include file metadata
with_provenance (boolean) – whether to include parents and children list
with_datasets (boolean) – whether to include the list of datasets the file is in
- Returns:
dictionary with file information or None if the file was not found
{
    "name": "namespace:filename", # file name, namespace
    "fid": "...",                 # file id
    "creator": "...",             # username of the file creator
    "created_timestamp": ...,     # numeric UNIX timestamp
    "size": ...,                  # file size in bytes
    "checksums": { ... },         # file checksums
    # included if with_provenance=True
    "parents": ["fid", ...],      # list of ids of the file's parent files
    "children": ["fid", ...],     # list of ids of the file's child files
    # included if with_metadata=True
    "metadata": { ... },          # file metadata
    # included if with_datasets=True
    "datasets": [ {"namespace": "...", "name": "..."}, ... ]
}
- Return type:
dict
Notes
Retrieving file provenance and metadata makes the request slightly slower
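Since get_file accepts a file either as a DID or as a separate namespace and name, a small hypothetical helper shows the equivalence (splitting on the first ':' mirrors the "namespace:name" DID convention used throughout this API):

```python
def split_did(did):
    """Split a "namespace:name" DID on the first ':'."""
    namespace, _, name = did.partition(":")
    return namespace, name

ns, n = split_did("my_namespace:filename.data")
# These two calls would be equivalent against a live server:
# record = client.get_file(did="my_namespace:filename.data")
# record = client.get_file(namespace=ns, name=n)
```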
- get_files(lookup_list, with_metadata=True, with_provenance=True)
Get many file records
- Parameters:
lookup_list (list) –
- List of dictionaries, one dictionary per file. Each dictionary must contain either “did”:”namespace:name”, or “namespace”:”…” and “name”:”…”, or “fid”:”file id”
with_metadata (boolean) – whether to include file metadata
with_provenance (boolean) – whether to include parents and children list
- Return type:
List of file records, each record is the same as returned by get_file()
- get_named_query(namespace, name)
Get named query
- Parameters:
namespace (str)
name (str)
- Returns:
A dictionary with information about the named query or None if the named query does not exist.
- Return type:
dict or None
- get_namespace(name)
Get information about a namespace
- Parameters:
name (str) – Namespace name
- Returns:
Namespace information or None if the namespace was not found
- Return type:
dict
- get_namespaces(names)
Get information for multiple namespaces
- Parameters:
names (list of str) – Namespace names
- Returns:
Namespace information
- Return type:
list
- get_version()
Returns server version as text
- list_categories(root=None)
List categories
- Parameters:
root (str) – Optional, if present, list only categories under the root
- Returns:
List of dictionaries with category information sorted by category path
- Return type:
list
- list_datasets(namespace_pattern=None, name_pattern=None, with_counts=False)
Gets the list of datasets with namespace/name matching the templates. The templates are Python fnmatch module style patterns, where '*' matches any substring and '?' matches a single character.
- Parameters:
namespace_pattern (str)
name_pattern (str)
with_counts (boolean) – controls whether the results should include file counts or dataset names only
- Yields:
generator – yields dictionaries like {“namespace”:…, “name”:…, “file_count”:…}
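The wildcard semantics are exactly those of Python's fnmatch module and can be checked locally (the dataset names below are made up):

```python
from fnmatch import fnmatch

names = ["mc_2023", "mc_2024", "data_2024"]
matches = [n for n in names if fnmatch(n, "mc_*")]     # '*' matches any substring
single  = [n for n in names if fnmatch(n, "mc_202?")]  # '?' matches one character

# Against a live server (hypothetical pattern):
# for d in client.list_datasets(name_pattern="mc_*"):
#     print(d["namespace"], d["name"])
```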
- list_named_queries(namespace=None)
Get multiple named queries
- Parameters:
namespace (str) – optional; if specified, the list will include all named queries in the namespace. Otherwise all named queries will be returned
- Returns:
List of dictionaries with information about the named queries.
- Return type:
list
- list_namespaces(pattern=None, owner_user=None, owner_role=None, directly=False)
List namespaces
- Parameters:
pattern (str) – Optional fnmatch style pattern to filter namespaces by name
owner_user (str) – Optional, return only namespaces owned by the specified user
directly (boolean) – If False and owner_user is specified, also return namespaces owned by all roles the user is in. Ignored if owner_user is not specified
owner_role (str) – Optional, return only namespaces owned by the specified role. Ignored if owner_user is also specified
- Returns:
List of dictionaries with namespace information sorted by the namespace name
- Return type:
list
- login_digest(username, password, save_token=False)
Performs password-based authentication and stores the authentication token locally.
- Parameters:
username (str)
password (str) – Password is not sent over the network. It is hashed and then used for digest authentication (RFC 2617).
- Returns:
str – username of the authenticated user (same as the username argument)
numeric – token expiration timestamp
- login_ldap(username, password)
Performs password-based authentication and stores the authentication token locally using LDAP.
- Parameters:
username (str)
password (str) – Password
- Returns:
str – username of the authenticated user (same as the username argument)
numeric – token expiration timestamp
- login_password(username, password)
Combines LDAP and RFC 2617 digest authentication by calling login_ldap first and then, if that fails, login_digest
- Parameters:
username (str)
password (str) – Password
- Returns:
str – username of the authenticated user (same as the username argument)
numeric – token expiration timestamp
- login_token(username, encoded_token)
Authenticate using a JWT or a SciToken.
- Parameters:
username (str)
encoded_token (str or bytes)
- Returns:
str – username of the authenticated user (same as the username argument)
numeric – authentication expiration timestamp
- login_x509(username, cert, key=None)
Performs X.509 authentication and stores the authentication token locally.
- Parameters:
username (str)
cert (str) – Path to the file with the X.509 certificate or the certificate and private key
key (str) – Path to the file with the X.509 private key
- Returns:
str – username of the authenticated user (same as the username argument)
numeric – token expiration timestamp
- move_files(namespace, file_list=None, query=None)
Moves files to another namespace.
- Parameters:
namespace (str) – namespace to move files to
query (str) – MQL query to run; files matching the query will be moved
file_list (list) –
List of dictionaries, one dictionary per file. Each dictionary must contain either a file id
{ "fid": "abcd12345" }
or namespace/name:
{ "name": "filename.data", "namespace": "my_namespace" }
or DID:
{ "did": "my_namespace:filename.data" }
- Returns:
number of files moved, list of errors, if any
- Return type:
tuple
- query(query, namespace=None, with_metadata=False, with_provenance=False, save_as=None, add_to=None, include_retired_files=False, summary=None, batch_size=0)
Run file query. Requires client authentication if save_as or add_to are used.
- Parameters:
query (str) – Query in MQL
namespace (str) – default namespace for the query
include_retired_files (boolean) – whether to include retired files in the query results, default False
with_metadata (boolean) – whether to return file metadata
with_provenance (boolean) – whether to return parents and children list
save_as (str) – namespace:name for a new dataset to create and add found files to
add_to (str) – namespace:name for an existing dataset to add found files to
summary (str or None) – “count” - return [{“count”: n, “total_size”: nbytes }]; “keys” - return the list of all top level metadata keys for the selected files. summary can not be used together with save_as or add_to
- Returns:
list of dictionaries, one dictionary per found file
- Return type:
list of dicts
Notes
Retrieving file provenance and metadata makes the request slightly slower
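As a sketch (the dataset name is hypothetical and the calls are commented out because they need a live server), a typical MQL query string is built and the summary option is shown:

```python
dataset = "my_namespace:my_dataset"
query = f"files from {dataset} where created_timestamp > '2020-10-10'"

# results = client.query(query, with_metadata=True)  # list of file dicts
# counts  = client.query(query, summary="count")     # [{"count": n, "total_size": nbytes}]
```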
- remove_dataset(dataset)
Remove a dataset. Requires client authentication.
- Parameters:
dataset (str) – “namespace:name”
- remove_files(dataset, file_list=None, namespace=None, query=None)
Remove files from a dataset. Requires client authentication.
- Parameters:
dataset (str) – “namespace:name” or “name”, if namespace argument is given
query (str) – MQL query to run; files matching the query will be removed
file_list (list) –
List of dictionaries, one dictionary per file. Each dictionary must contain either a file id
{ "fid": "abcd12345" }
or namespace/name:
{ "name": "filename.data", "namespace": "my_namespace" }
or DID:
{ "did": "my_namespace:filename.data" }
namespace (str, optional) – Default namespace. If a file_list item is specified with a name without a namespace, the default namespace will be used.
- Returns:
actual number of files removed from the dataset
- Return type:
int
Notes
Either file_list or query must be specified, but not both.
- retire_file(did=None, namespace=None, name=None, fid=None, retire=True)
Modify the retired status of the file. A retired file remains in the database and “occupies” the name in the namespace, but is not visible to normal queries. A retired file can be brought back to normal using this method too.
If you need to completely remove the file, use delete_file method.
- Parameters:
did (str) – file “namespace:name”
fid (str) – file id
namespace (str) – file namespace
name (str) – file name
retire (bool) – whether the file should be retired
- Returns:
Dictionary with updated file information
- Return type:
dict
- retry_request(method, url, timeout=None, **args)
Implements retrying on a 503 response with a random, exponentially growing delay. Use timeout = 0 to try the request exactly once. Returns the response with status=503 on timeout
- search_named_queries(query)
Run MQL query for named queries
- Parameters:
query (str) – Query in MQL
- Returns:
The list contains one dictionary per matching named query with the query information.
- Return type:
list of dicts
- update_dataset(dataset, metadata=None, mode='update', frozen=None, monotonic=None, description=None)
Update dataset. Requires client authentication.
- Parameters:
dataset (str) – “namespace:name”
metadata (dict or None) – New metadata values, or, if None, leave the metadata unchanged
mode (str) – Either "update" or "replace". If "update", metadata will be updated with the new values. If "replace", metadata will be replaced with the new values. If metadata is None, mode is ignored
frozen (boolean or None) – if boolean, new value for the flag. If None, leave it unchanged
monotonic (boolean or None) – if boolean, new value for the flag. If None, leave it unchanged
description (str or None) – if str, new dataset description. If None, leave the description unchanged
- Returns:
dictionary with new dataset information
- Return type:
dict
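The difference between the two mode values can be illustrated with plain dictionaries (a local sketch of the semantics described above, not the server code; the dataset and metadata are hypothetical):

```python
existing   = {"campaign": "winter", "version": 1}  # metadata already on the dataset
new_values = {"version": 2}

updated  = {**existing, **new_values}  # mode="update": merge new values into existing metadata
replaced = dict(new_values)            # mode="replace": existing metadata is discarded

# Against a live, authenticated server (hypothetical dataset name):
# info = client.update_dataset("my_namespace:my_dataset",
#                              metadata=new_values, mode="update")
```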
- update_file(did=None, namespace=None, name=None, fid=None, replace=False, size=None, checksums=None, parents=None, children=None, metadata=None)
- Parameters:
did (str) – file “namespace:name”
fid (str) – file id
namespace (str) – file namespace
name (str) – file name
replace (bool) – If True, the specified attribute values will be replaced with new values. Otherwise added (for parents and children) and updated (for checksums and metadata)
size (int >= 0) – file size, optional
checksums (dict) – checksum values, optional
parents (list) – list of parent file ids, optional
children (list) – list of child file ids, optional
metadata (dict) – dictionary with metadata to update or replace, optional
- Returns:
Dictionary with updated file information
- Return type:
dict
- update_file_meta(metadata, files=None, names=None, fids=None, namespace=None, dids=None, mode='update')
Updates metadata for existing files. Requires client authentication.
DEPRECATED: update_file() should be used instead
- Parameters:
metadata (dict) – see Notes
files (list of dicts) – Each dict specifies a file. See Notes
names (list of strings) – List of file names. Requires namespace to be specified
dids (list of strings) – List of DIDs (“namespace:name”) strings
fids (list of strings) – List of file ids. The list of files can be specified with the fids or the names argument, but not both.
namespace (string) – Default namespace
mode (str) – Either "update" (default) or "replace". If mode is "update", existing metadata will be updated with the values in metadata. If "replace", then the new values will replace the existing metadata. Also, see notes below.
- Returns:
list of dictionaries, one dictionary per file with file ids: { “fid”: “…” }
- Return type:
list
Notes
This method can be used to apply common metadata changes to a list of files. This method can not be used to update file provenance information.
The metadata argument is used to specify the common changes to apply to multiple files. The metadata dictionary will be used to either update the existing metadata of the listed files (if mode="update") or replace it (if mode="replace").
Files to update have to be specified in one of the following ways:
files = [list of dicts] - each dict must be in one of the following formats:
{“fid”:”<file id>”}
{“namespace”:”<file namespace>”, “name”:”<file name>”} - namespace is optional. Default: the value of the “namespace” method argument
{“did”:”<file namespace>:<file name>”}
dids = [list of file DIDs]
names = [list of file names] - the “namespace” method argument must be used to specify the common namespace
fids = [list of file ids]
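A sketch of the arguments (the file identifiers are hypothetical; the call is commented out because it needs a live, authenticated server):

```python
metadata = {"quality": "good"}   # common changes applied to every listed file
files = [
    {"fid": "abc123"},                   # by file id
    {"did": "my_namespace:fileA.data"},  # by DID
    {"name": "fileB.data"},              # namespace comes from the namespace= argument
]

# client.update_file_meta(metadata, files=files,
#                         namespace="my_namespace", mode="update")

assert all(set(f) & {"fid", "did", "name"} for f in files)
```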
- wait_queries()
Wait for all issued asynchronous queries to complete
Exceptions#
MetaCatClient methods can raise the following exceptions:
metacat.webapi.InvalidArgument
- method was called with an invalid argumentmetacat.webapi.NotFoundError
- an object is not found in the databasemetacat.webapi.BadRequestError
- invalid requestmetacat.webapi.AlreadyExistsError
- an object already exists in the databasemetacat.webapi.PermissionDeniedError
- permission deniedmetacat.webapi.InvalidMetadataError
- metadata validation failed
All these exceptions inherit from the metacat.webapi.WebAPIError class. WebAPIError.__str__ can be used to get a human-readable description of the exception, e.g.:
try:
results = metacat_client.method(...)
except metacat.webapi.WebAPIError as e:
print(e)
Asynchronous Queries#
When you need to run multiple queries, you can use the async_query method to run them concurrently, starting them asynchronously and then waiting for their results:
client = MetaCatClient(url)
datasets = [ "production:A", "production:B" ]
promises = []
for dataset_name in datasets:
query = f"files from {dataset_name} where created_timestamp > '2020-10-10'"
promise = client.async_query(query, dataset_name)
promises.append(promise)
for promise in promises:
results = promise.wait()
n = len(results)
dataset_name = promise.Data
print(f"Dataset {dataset_name}: {n} files")
In this example, we start 2 queries concurrently. Each gets the files from its own dataset. When we start the asynchronous queries, the client object returns promises instead of query results. A promise is an object on which you can wait for the actual results. Also, we pass the dataset name to the async_query method as the data argument, so that we can associate the returned results with the dataset.
In the second for-loop, we wait for the results of each query and use the promise's Data attribute to get the actual dataset name when printing the results. Note that the second for-loop iterates through the promises in the same order as they were created, but that does not mean we expect the queries to complete in the same order. If a query completes before we call the corresponding promise's wait method, wait simply returns the results immediately.
Another way to wait for all asynchronous queries to complete is to call the wait_queries method of the MetaCatClient:
client = MetaCatClient(url)
datasets = [ "production:A", "production:B" ]
promises = {}
for dataset_name in datasets:
query = f"files from {dataset_name} where created_timestamp > '2020-10-10'"
promise = client.async_query(query, None)
promises[dataset_name] = promise
client.wait_queries()
for dataset_name, promise in promises.items():
results = promise.wait()
n = len(results)
print(f"Dataset {dataset_name}: {n} files")
The wait_queries method blocks until all asynchronous queries started by the client complete. In this case, calling the wait method of each promise is still necessary to get the results of the individual queries, but because we called wait_queries first, wait returns the results immediately without blocking.