Abstract Interfaces

Registry

A database that holds metadata, relationships, and provenance for managed Datasets.

Many Registry implementations will consist of both a client and a server (though the server will frequently be just a database server with no additional code).

A Registry is almost always backed by a SQL database (e.g. PostgreSQL, MySQL or SQLite) that exposes a schema common to all Registries, described in the many “SQL Representation” sections of this document. As the common schema is used only for SELECT queries, concrete Registries can implement it as set of direct tables, a set of views against private tables, or any combination thereof.
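The view-based option can be sketched with SQLite. This is purely illustrative: the table, view, and column names (`private_dataset`, `dataset`, `dataset_id`, `uri`) are placeholders, not the actual common schema.

```python
# Sketch: a Registry may expose the common schema as views over private
# tables. All table/column names here are illustrative, not the real schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE private_dataset "
    "(dataset_id INTEGER PRIMARY KEY, uri TEXT, internal_flag INTEGER)"
)
conn.execute("INSERT INTO private_dataset VALUES (1, 'file:///data/a.fits', 0)")
# The public view exposes only the columns in the common schema; since the
# common schema is used only for SELECT queries, a view suffices.
conn.execute("CREATE VIEW dataset AS SELECT dataset_id, uri FROM private_dataset")
rows = conn.execute("SELECT dataset_id, uri FROM dataset").fetchall()
print(rows)  # [(1, 'file:///data/a.fits')]
```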

In some important contexts (e.g. processing data staged to scratch space), only a small subset of the full Registry interface is needed, and we may be able to use a simple key-value database instead. These are called limited Registries. A limited Registry implements only a small subset of the full Registry Python interface and has no SQL interface at all, and methods that would normally accept a DatasetLabel require a full DatasetRef instead. In general, limited Registries have enough functionality to support Butler.get() and Butler.put(), but no more. A limited Registry may be implemented on top of a simple persistent key-value store (e.g. a YAML file) rather than a full SQL database. The operations supported by a limited Registry are indicated in the Python API section below.
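A limited Registry could look something like the following sketch. It is an assumption-laden illustration: the class and attribute layout are invented here, DatasetRef is reduced to a hashable (datasetType, dataId) key, and a YAML file could be substituted for the in-memory dict for persistence.

```python
# Minimal sketch of a limited Registry backed by an in-memory dict.
# DatasetRef is modeled as a hashable (datasetType, dataId) tuple for
# illustration only; the real classes are richer.
class LimitedRegistry:
    """Supports just enough lookup for Butler.get()/Butler.put()."""

    def __init__(self):
        self._datasets = {}  # (datasetType, dataId) -> URI

    def addDataset(self, ref, uri):
        self._datasets[ref] = uri

    def find(self, ref):
        # A full DatasetRef is required; no DatasetLabel expansion is possible
        # because there is no SQL database to expand it against.
        return self._datasets.get(ref)

reg = LimitedRegistry()
ref = ("calexp", (("visit", 42), ("ccd", 7)))
reg.addDataset(ref, "file:///scratch/calexp-42-7.fits")
print(reg.find(ref))  # file:///scratch/calexp-42-7.fits
```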


It is not yet clear if limited registries will ever exist, or if we can simply always assume SQLite as the minimum.


Limited registries that are used on scratch space during processing need to handle provenance and can require DatasetRefs instead of just DatasetLabels, but dumb ones used for one-off, interactive work do not need provenance and would want to use DatasetLabels. We should probably formalize the different levels of functionality required of Registries in different contexts as different Python class interfaces (with a possibly-multiple inheritance relationship).


The v14 Butler’s Mapper class contains a Registry object that is also implemented as a SQL database, but the new Registry concept differs in several important ways:

  • new Registries can hold multiple Collections, instead of being identified strictly with a single Data Repository;
  • new Registries also assume most of the responsibilities of the v14 Butler’s Mapper;
  • non-limited Registries now have a much richer set of tables, permitting many more types of queries.

Python API

class Registry
query(self, sql, parameters)

Execute an arbitrary SQL SELECT query on the Registry’s database and return the results.

The given SQL statement should be restricted to the schema and SQL dialect common to all Registries, but Registries are not required to check that this is the case.


This should be a very simple pass-through to SQLAlchemy or a DBAPI driver. We should be explicit about exactly what that means for parameters and returned objects.

Not supported by limited Registries.
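A pass-through implementation could be as thin as the sketch below, here using the stdlib sqlite3 DBAPI driver. The class name and the choice to return fetchall() tuples are assumptions; whether results should instead be Row objects or SQLAlchemy result proxies is exactly the open question in the note above.

```python
# Sketch of query() as a thin pass-through to a DBAPI driver (sqlite3 here).
# Table contents are illustrative only.
import sqlite3

class SqlRegistry:
    def __init__(self, conn):
        self._conn = conn

    def query(self, sql, parameters=()):
        # Pass-through: parameters use the driver's placeholder style, and
        # results come back as whatever the driver returns (tuples here).
        return self._conn.execute(sql, parameters).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (dataset_id INTEGER, uri TEXT)")
conn.execute("INSERT INTO dataset VALUES (1, 'file:///a'), (2, 'file:///b')")
reg = SqlRegistry(conn)
print(reg.query("SELECT uri FROM dataset WHERE dataset_id = ?", (2,)))
# [('file:///b',)]
```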

registerDatasetType(self, datasetType)

Add a new DatasetType to the Registry.

Parameters: datasetType (DatasetType) – the DatasetType to be added.

Not supported by limited Registries.


If the new DatasetType already exists, we need to make sure it's consistent with what's already present, and if it is, we probably shouldn't throw. We need to determine whether there is also a use case for throwing if the DatasetType exists, or for overwriting if it's inconsistent.

getDatasetType(self, name)

Return the DatasetType associated with the given name.

addDataset(self, ref, uri, components, run, producer=None)

Add a Dataset to a Collection.

This always adds a new Dataset; to associate an existing Dataset with a new Collection, use associate().

The Quantum that generated the Dataset can optionally be provided to add provenance information.

Parameters:

  • ref – a DatasetRef that identifies the Dataset and contains its DatasetType.
  • uri (str) – the URI that has been associated with the Dataset by a Datastore.
  • components (dict) – if the Dataset is a composite, a {name : URI} dictionary of its named components and storage locations.
  • run (Run) – the Run instance that produced the Dataset. Ignored if producer is passed (producer.run is then used instead). A Run must be provided by one of the two arguments.
  • producer (Quantum) – the Quantum instance that produced the Dataset. May be None to store no provenance information, but if present the Quantum must already have been added to the Registry.

Returns: a newly-created DatasetHandle instance.


Raises: an exception if a Dataset with the given DatasetRef already exists in the given Collection.
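The run/producer argument interaction described above can be sketched as a small resolution helper. The helper name (`resolveRun`) and the stub classes are hypothetical; only the precedence rule itself (producer.run wins, and one of the two must supply a Run) comes from the text.

```python
# Sketch of addDataset()'s run/producer resolution: producer takes
# precedence, and one of the two arguments must supply a Run.
class Run:
    def __init__(self, collection):
        self.collection = collection

class Quantum:
    def __init__(self, run):
        self.run = run

def resolveRun(run=None, producer=None):
    if producer is not None:
        return producer.run  # producer.run is used instead of run
    if run is None:
        raise ValueError("A Run must be provided via run or producer")
    return run

q = Quantum(Run("u/alice/reproc"))
print(resolveRun(run=Run("ignored"), producer=q).collection)  # u/alice/reproc
```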

associate(self, collection, handles)

Add existing Datasets to a Collection, possibly creating the Collection in the process.



Not supported by limited Registries.

disassociate(self, collection, handles, remove=True)

Remove existing Datasets from a Collection.


Returns: if remove is True, the list of DatasetHandles that were removed.

collection and handle combinations that are not currently associated are silently ignored.

Not supported by limited Registries.


What is the interface for removing Datasets that are no longer part of any Collection?

makeRun(self, collection)

Create a new Run in the Registry and return it.

Parameters: collection (str) – the Collection used to identify all inputs and outputs of the Run.
Returns: a Run instance.

Not supported by limited Registries.

updateRun(self, run)

Update the environment and/or pipeline of the given Run in the database, using the DatasetHandle attributes of the given Run.

Not supported by limited Registries.

getRun(self, collection=None, id=None)

Get a Run corresponding to its Collection or id.

Parameters:

  • collection (str) – the Collection associated with the Run.
  • id (int) – if given, look up by id instead and ignore collection.

Returns: a Run instance.

addQuantum(self, quantum)

Add a new Quantum to the Registry.

Parameters: quantum (Quantum) – a Quantum instance to add to the Registry.

The given Quantum must not already be present in the Registry (or any other); its pkey attribute must be None.

The predictedInputs attribute must be fully populated with DatasetHandles. The actualInputs and outputs will be ignored.

markInputUsed(self, quantum, handle)

Record the given DatasetHandle as an actual (not just predicted) input of the given Quantum.

This updates both the Registry’s Quantum table and the Python Quantum.actualInputs attribute.

Raises an exception if handle is not already in the predicted inputs list.
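The in-memory half of this behavior can be sketched as below; the matching update to the Registry's Quantum table is omitted, and the Quantum stub (list-valued predictedInputs/actualInputs) is an assumption for illustration.

```python
# Sketch of markInputUsed(): the handle must already be a predicted input,
# and is then recorded as an actual input. The Registry table update that
# accompanies this in a real implementation is omitted.
class Quantum:
    def __init__(self, predictedInputs):
        self.predictedInputs = list(predictedInputs)
        self.actualInputs = []

def markInputUsed(quantum, handle):
    if handle not in quantum.predictedInputs:
        raise ValueError(f"{handle!r} is not a predicted input")
    quantum.actualInputs.append(handle)

q = Quantum(predictedInputs=["raw-1", "raw-2"])
markInputUsed(q, "raw-1")
print(q.actualInputs)  # ['raw-1']
```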

addDataUnit(self, unit, replace=False)

Add a new DataUnit, optionally replacing an existing one (for updates).

Parameters:

  • unit (DataUnit) – the DataUnit to add or replace.
  • replace (bool) – if True, replace any matching DataUnit that already exists (updating its non-unique fields) instead of raising an exception.

Not supported by limited Registries.


This will need to update many-to-many join tables between DataUnits in some cases. We may want to vectorize this operation or otherwise allow many new DataUnits to be added before updating the join tables.

findDataUnit(self, cls, values)

Return a DataUnit given a dictionary of values.

Parameters:

  • cls (type) – a class that inherits from DataUnit.
  • values (dict) – a dictionary of values that uniquely identify the DataUnit.

Returns: a DataUnit instance of type cls, or None if no matching unit is found.

See also DataUnitMap.findDataUnit().

Not supported by limited Registries.

expand(self, label)

Expand a DatasetLabel, returning an equivalent DatasetRef.

Must be a simple pass-through if label is already a DatasetRef.

For limited Registries, label must be a DatasetRef, making this a guaranteed no-op (but still callable, for interface compatibility).

find(self, collection, label)

Look up the location of the Dataset associated with the given DatasetLabel.

This can be used to obtain the URI that permits the Dataset to be read from a Datastore.

Must be a simple pass-through if label is already a DatasetHandle.


Returns: a DatasetHandle instance.

makeDataGraph(self, collections, expr, neededDatasetTypes, futureDatasetTypes)

Evaluate a filter expression and lists of DatasetTypes and return a QuantumGraph.

Parameters:

  • collections (list[str]) – an ordered list of Collections to search for Datasets.
  • expr (str) – an expression that limits the DataUnits and (indirectly) the Datasets returned.
  • neededDatasetTypes (list[DatasetType]) – the list of DatasetTypes whose instances should be included in the graph and limit its extent.
  • futureDatasetTypes (list[DatasetType]) – the list of DatasetTypes whose instances may be added to the graph later, which requires that their DataUnit types be present in the graph.

Returns: a QuantumGraph instance with a QuantumGraph.units attribute that is not None.

Not supported by limited Registries.


More complete description for expressions.

subset(self, collection, expr, datasetTypes)

Create a new Collection by subsetting an existing one.


Returns: a str Collection name.

Not supported by limited Registries.


This should probably have the same signature as makeDataGraph; since that implies it can merge as it subsets, it might need a new name.

merge(self, outputCollection, inputCollections)

Create a new Collection from a series of existing ones.

Entries earlier in the list will be used in preference to later entries when both contain Datasets with the same DatasetRef.

Parameters:

  • outputCollection (str) – the name to use for the new Collection.
  • inputCollections (list[str]) – a list of Collections to combine.

Not supported by limited Registries.
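The precedence rule above can be sketched as a dict merge applied in reverse order, so that earlier input Collections overwrite later ones on a shared DatasetRef. Modeling a Collection as a dict mapping DatasetRef to DatasetHandle is an assumption made purely for illustration.

```python
# Sketch of merge()'s precedence rule: iterate input Collections in reverse
# so that entries earlier in the list win when the same DatasetRef appears
# in more than one. Collections are modeled as ref -> handle dicts.
def mergeCollections(inputCollections):
    merged = {}
    for collection in reversed(inputCollections):
        merged.update(collection)  # earlier collections are applied last, so they win
    return merged

a = {"ref1": "handle-from-a", "ref2": "handle-from-a"}
b = {"ref2": "handle-from-b", "ref3": "handle-from-b"}
print(mergeCollections([a, b])["ref2"])  # handle-from-a
```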

makeProvenanceGraph(self, expr, types=None)

Return a QuantumGraph that contains the full provenance of all Datasets matching an expression.

Parameters: expr (str) – an expression (SQL query that evaluates to a list of dataset_id) that selects the Datasets.
Returns: a QuantumGraph instance (with units set to None).


Should have convenience versions that operate on e.g. DatasetHandles provided by the user.

export(self, expr)

Export contents of the Registry, limited to those reachable from the Datasets identified by the expression expr, into a TableSet format such that it can be imported into a different database.

Parameters: expr (str) – an expression (SQL query that evaluates to a list of dataset_id) that selects the Datasets, or a QuantumGraph that can be similarly interpreted.
Returns: a TableSet containing all rows, from all tables in the Registry, that are reachable from the selected Datasets.

Not supported by limited Registries.


Should have convenience versions that operate on e.g. DatasetHandles provided by the user. Should also have a version that operates on a QuantumGraph; that may also permit some optimizations, especially if QuantumGraph.units is not None.

import_(self, tables, collection)

Import (previously exported) contents into the (possibly empty) Registry.

Parameters:

  • tables (TableSet) – a TableSet containing the exported content.
  • collection (str) – an additional Collection assigned to the newly imported Datasets.

Limited Registries will import only some of the information exported by a full Registry.

transfer(self, src, expr, collection)

Transfer contents from a source Registry, limited to those reachable from the Datasets identified by the expression expr, into this Registry and associate them with a Collection.


Trivially implemented as:

def transfer(self, src, expr, collection):
    self.import_(src.export(expr), collection)


Datastore

A system that holds persisted Datasets and can read and optionally write them.

This may be based on a (shared) filesystem, an object store, a SQL database, or some other system.

Many Datastore implementations will consist of both a client and a server.


Datastore represents a refactoring of some responsibilities previously held by the v14 Butler and Mapper objects.

Datastore implementations are the most likely place in the new design where existing v14 Butler code could be used.

Python API

class Datastore
get(self, uri, storageClass, parameters=None)

Load an InMemoryDataset from the store.


Returns: an InMemoryDataset or slice thereof.

put(self, inMemoryDataset, storageClass, storageHint, typeName=None)

Write an InMemoryDataset with a given StorageClass to the store.


Returns: the str URI and a dictionary of URIs for the Dataset's components. The latter will be empty (or None?) if the Dataset is not a composite.

remove(self, uri)

Indicate to the Datastore that a Dataset can be removed.

Some Datastores may implement this method as a silent no-op to disable Dataset deletion through standard interfaces.
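A minimal filesystem-backed Datastore could look like the sketch below. It assumes URIs are plain file paths, ignores StorageClass-driven serialization (pickle stands in for real formatters), and shows the silent no-op remove() variant mentioned above as a subclass; none of these names come from the spec itself.

```python
# Sketch of a minimal POSIX-filesystem Datastore. URIs are plain paths and
# pickle stands in for StorageClass-specific serialization; both are
# simplifying assumptions for illustration.
import os
import pickle
import tempfile

class PosixDatastore:
    def __init__(self, root):
        self.root = root

    def put(self, inMemoryDataset, storageClass, storageHint, typeName=None):
        uri = os.path.join(self.root, storageHint)
        with open(uri, "wb") as f:
            pickle.dump(inMemoryDataset, f)
        return uri, {}  # empty component dict: not a composite

    def get(self, uri, storageClass, parameters=None):
        with open(uri, "rb") as f:
            return pickle.load(f)

    def remove(self, uri):
        os.unlink(uri)

class ReadOnlyDatastore(PosixDatastore):
    def remove(self, uri):
        pass  # silent no-op: deletion disabled through standard interfaces

store = PosixDatastore(tempfile.mkdtemp())
uri, components = store.put({"pixels": [1, 2, 3]}, "Dict", "calexp-42.pickle")
print(store.get(uri, "Dict"))  # {'pixels': [1, 2, 3]}
```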

transfer(self, inputDatastore, inputUri, storageClass, storageHint, typeName=None)

Retrieve a Dataset with a given URI from an input Datastore, and store the result in this Datastore.


Returns: the str URI and a dictionary of URIs for the Dataset's components. The latter will be empty (or None?) if the Dataset is not a composite.


This interface does not permit composite Datasets stored as separate components to be transferred into a single file. We have a clear use case for that, but it probably can't be done with a signature that transfers one Dataset at a time.


TableSet

A serializable set of exported database tables.


A TableSet does not need to contain all information needed to recreate the database tables themselves (since the tables are part of the common schema), but should contain all necessary information to recreate all the content within them.
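One possible realization is a mapping from table name to a list of row dicts, serializable to JSON for export/import round trips. The class layout and method names below are invented for illustration; only the content-without-schema property comes from the text.

```python
# Sketch of TableSet as table-name -> list-of-row-dicts, serializable to
# JSON. Schema (CREATE TABLE) information is deliberately absent, since the
# tables are part of the common schema; only row content is carried.
import json

class TableSet:
    def __init__(self, tables=None):
        self.tables = tables or {}  # name -> list of row dicts

    def dumps(self):
        return json.dumps(self.tables, sort_keys=True)

    @classmethod
    def loads(cls, s):
        return cls(json.loads(s))

ts = TableSet({"dataset": [{"dataset_id": 1, "uri": "file:///a"}]})
roundtrip = TableSet.loads(ts.dumps())
print(roundtrip.tables == ts.tables)  # True
```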



Transition

Fill in transition details.

Python API


Specify Python API.