Abstract Interfaces
Registry
A database that holds metadata, relationships, and provenance for managed Datasets.
Many Registry implementations will consist of both a client and a server (though the server will frequently be just a database server with no additional code).
A Registry is almost always backed by a SQL database (e.g. PostgreSQL, MySQL or SQLite) that exposes a schema common to all Registries, described in the many “SQL Representation” sections of this document. As the common schema is used only for SELECT queries, concrete Registries can implement it as a set of direct tables, a set of views against private tables, or any combination thereof.
In some important contexts (e.g. processing data staged to scratch space), only a small subset of the full Registry interface is needed, and we may be able to utilize a simple key-value database instead. These are called limited Registries.
A limited Registry implements only a small subset of the full Registry Python interface and has no SQL interface at all, and methods that would normally accept a DatasetLabel require a full DatasetRef instead.
In general, limited Registries have enough functionality to support Butler.get() and Butler.put(), but no more.
A limited Registry may be implemented on top of a simple persistent key-value store (e.g. a YAML file) rather than a full SQL database.
The operations supported by a limited Registry are indicated in the Python API section below.
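As a rough illustration of the key-value idea, the sketch below stores Dataset locations in a JSON file (JSON stands in for YAML only to keep the example within the standard library). The class name `LimitedRegistrySketch`, the flattened key format, and the simplified `addDataset`/`find` signatures are all assumptions made for this example; they are not the interface specified in the Python API section.

```python
import json
import os
import tempfile


class LimitedRegistrySketch:
    """Hypothetical limited Registry backed by a single JSON file."""

    def __init__(self, path):
        self.path = path
        # Reload previously persisted entries, if any.
        if os.path.exists(path):
            with open(path) as f:
                self._entries = json.load(f)
        else:
            self._entries = {}

    @staticmethod
    def _key(collection, datasetTypeName, dataId):
        # Flatten the lookup key into a string so it can be a JSON object key.
        units = ",".join(f"{k}={dataId[k]}" for k in sorted(dataId))
        return f"{collection}|{datasetTypeName}|{units}"

    def addDataset(self, collection, datasetTypeName, dataId, uri):
        key = self._key(collection, datasetTypeName, dataId)
        if key in self._entries:
            raise KeyError(f"Dataset already exists: {key}")
        self._entries[key] = uri
        # Persist after every write; simple, but adequate for a sketch.
        with open(self.path, "w") as f:
            json.dump(self._entries, f)

    def find(self, collection, datasetTypeName, dataId):
        return self._entries[self._key(collection, datasetTypeName, dataId)]


# Usage: write an entry, reopen the file, and look the entry up again.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "registry.json")
    registry = LimitedRegistrySketch(path)
    registry.addDataset("scratch", "calexp", {"visit": 42, "sensor": 7},
                        "file:///tmp/calexp-42-7.fits")
    reopened = LimitedRegistrySketch(path)
    found_uri = reopened.find("scratch", "calexp", {"visit": 42, "sensor": 7})
```

Because every key must be fully specified, a store like this can only answer lookups for complete references, which mirrors the requirement that limited Registries receive a full DatasetRef.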
Warning
It is not yet clear if limited registries will ever exist, or if we can simply always assume SQLite as the minimum.
Note
Limited registries that are used on scratch space during processing need to handle provenance and can require DatasetRefs instead of just DatasetLabels, but dumb ones used for one-off, interactive work do not need provenance and would want to use DatasetLabels. We should probably formalize the different levels of functionality required of Registries in different contexts as different Python class interfaces (with a possibly-multiple inheritance relationship).
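One possible way to express such levels of functionality is a small hierarchy of abstract base classes combined through multiple inheritance. The sketch below is purely illustrative: the class names and the way methods are grouped are assumptions for this example, not a decided design.

```python
from abc import ABC, abstractmethod


class DatasetLookupInterface(ABC):
    """Hypothetical minimal level: enough for Butler.get() and Butler.put()."""

    @abstractmethod
    def find(self, collection, label):
        raise NotImplementedError

    @abstractmethod
    def addDataset(self, ref, uri, components, run, producer=None):
        raise NotImplementedError


class ProvenanceInterface(ABC):
    """Hypothetical level for Registries that must record provenance."""

    @abstractmethod
    def addQuantum(self, quantum):
        raise NotImplementedError

    @abstractmethod
    def markInputUsed(self, quantum, handle):
        raise NotImplementedError


class SqlQueryInterface(ABC):
    """Hypothetical level for Registries that expose the common SQL schema."""

    @abstractmethod
    def query(self, sql, parameters):
        raise NotImplementedError


class ScratchRegistryInterface(DatasetLookupInterface, ProvenanceInterface):
    """Lookup plus provenance, as needed on scratch space during processing."""


class FullRegistryInterface(ScratchRegistryInterface, SqlQueryInterface):
    """Everything, including arbitrary SELECT queries."""
```

Code that only needs lookups could then accept any `DatasetLookupInterface`, while pipeline execution code could require the provenance-aware level.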
Transition
The v14 Butler’s Mapper class contains a Registry object that is also implemented as a SQL database, but the new Registry concept differs in several important ways:
- new Registries can hold multiple Collections, instead of being identified strictly with a single Data Repository;
- new Registries also assume most of the responsibilities of the v14 Butler’s Mapper;
- non-limited Registries now have a much richer set of tables, permitting many more types of queries.
Python API
-
class
Registry
¶ -
query
(self, sql, parameters)¶ Execute an arbitrary SQL SELECT query on the Registry’s database and return the results.
The given SQL statement should be restricted to the schema and SQL dialect common to all Registries, but Registries are not required to check that this is the case.
Todo
This should be a very simple pass-through to SQLAlchemy or a DBAPI driver. We should be explicit about exactly what that means for parameters and returned objects.
Not supported by limited Registries.
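To make the pass-through idea concrete, here is a minimal sketch using the standard-library `sqlite3` DBAPI driver. The class name `SqliteRegistrySketch`, the `Dataset` table, and its `uri` column are assumptions for this example; the actual common schema is defined elsewhere in this document, and the choice between SQLAlchemy and a raw DBAPI driver remains open.

```python
import sqlite3


class SqliteRegistrySketch:
    """Hypothetical Registry whose query() forwards directly to a DBAPI driver."""

    def __init__(self, connection):
        self._connection = connection

    def query(self, sql, parameters=()):
        # No validation of the statement is attempted, matching the note that
        # Registries are not required to check the SQL they are given.
        cursor = self._connection.execute(sql, parameters)
        return cursor.fetchall()


# Usage with an in-memory database and a made-up table.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Dataset (dataset_id INTEGER PRIMARY KEY, uri TEXT)"
)
connection.executemany(
    "INSERT INTO Dataset (dataset_id, uri) VALUES (?, ?)",
    [(1, "file:///a.fits"), (2, "file:///b.fits")],
)
registry = SqliteRegistrySketch(connection)
rows = registry.query("SELECT uri FROM Dataset WHERE dataset_id = ?", (2,))
```

In this sketch the parameters use the driver's own placeholder style (`?` for `sqlite3`) and the result is a list of row tuples; a real interface would need to pin both down.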
-
registerDatasetType
(self, datasetType)¶ Add a new DatasetType to the Registry.
Parameters: datasetType (DatasetType) – the DatasetType to be added.
Returns: None
Not supported by limited Registries.
Todo
If the new DatasetType already exists, we need to make sure it’s consistent with what’s already present, but if it is, we probably shouldn’t throw. Need to see if there’s also a use case for throwing if the DatasetType exists or overwriting if it’s inconsistent.
-
getDatasetType
(self, name)¶ Return the DatasetType associated with the given name.
-
addDataset
(self, ref, uri, components, run, producer=None)¶ Add a Dataset to a Collection.
This always adds a new Dataset; to associate an existing Dataset with a new Collection, use associate().
The Quantum that generated the Dataset can optionally be provided to add provenance information.
Parameters: - ref – a DatasetRef that identifies the Dataset and contains its DatasetType.
- uri (str) – the URI that has been associated with the Dataset by a Datastore.
- components (dict) – if the Dataset is a composite, a {name : URI} dictionary of its named components and storage locations.
- run (Run) – the Run instance that produced the Dataset. Ignored if producer is passed (producer.run is then used instead). A Run must be provided by one of the two arguments.
- producer (Quantum) – the Quantum instance that produced the Dataset. May be None to store no provenance information, but if present the Quantum must already have been added to the Registry.
Returns: a newly-created DatasetHandle instance.
Raises: an exception if a Dataset with the given DatasetRef already exists in the given Collection.
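The interplay between the run and producer arguments can be sketched as a small helper. `RunSketch`, `QuantumSketch`, and `resolveRun` are hypothetical stand-ins introduced for this example, and the choice of `ValueError` is an assumption; the text above does not say which exception type is used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSketch:
    """Stand-in for Run; only the collection name matters here."""
    collection: str


@dataclass
class QuantumSketch:
    """Stand-in for Quantum; only its run attribute matters here."""
    run: RunSketch


def resolveRun(run: Optional[RunSketch],
               producer: Optional[QuantumSketch]) -> RunSketch:
    # A producer, when given, takes precedence and the run argument is ignored.
    if producer is not None:
        return producer.run
    # Otherwise the run argument is the only source and must be present.
    if run is None:
        raise ValueError("A Run must be provided via 'run' or 'producer'.")
    return run


explicit = resolveRun(RunSketch("a"), None)
fromProducer = resolveRun(RunSketch("a"), QuantumSketch(RunSketch("b")))
```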
-
associate
(self, collection, handles)¶ Add existing Datasets to a Collection, possibly creating the Collection in the process.
Parameters: - collection (str) – the Collection the Datasets should be associated with.
- handles (list[DatasetHandle]) – a list of DatasetHandle instances that already exist in this Registry.
Returns: None
Not supported by limited Registries.
-
disassociate
(self, collection, handles, remove=True)¶ Remove existing Datasets from a Collection.
Parameters: - collection (str) – the Collection the Datasets should no longer be associated with.
- handles (list[DatasetHandle]) – a list of DatasetHandle instances that already exist in this Registry.
- remove (bool) – if True, remove Datasets from the Registry if they are not associated with any Collection (including via any composites).
Returns: If remove is True, the list of DatasetHandles that were removed.
collection and handle combinations that are not currently associated are silently ignored.
Not supported by limited Registries.
Todo
What is the interface for removal of datasets that are no longer part of any collection?
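The bookkeeping described for associate() and disassociate() can be sketched with an in-memory mapping from Collection names to sets of handles. This is a simplified model under stated assumptions: `CollectionTableSketch` is a hypothetical name, handles are represented by plain hashable values, and composites are not considered.

```python
class CollectionTableSketch:
    """Hypothetical in-memory model of Collection membership."""

    def __init__(self):
        self._members = {}  # Collection name -> set of handles

    def associate(self, collection, handles):
        # Creates the Collection on first use.
        self._members.setdefault(collection, set()).update(handles)

    def disassociate(self, collection, handles, remove=True):
        members = self._members.get(collection, set())
        removed = []
        for handle in handles:
            if handle not in members:
                continue  # not currently associated: silently ignored
            members.discard(handle)
            # A handle is orphaned once no Collection references it any more.
            orphaned = not any(handle in other
                               for other in self._members.values())
            if remove and orphaned:
                removed.append(handle)
        return removed if remove else None


table = CollectionTableSketch()
table.associate("a", [1, 2])
table.associate("b", [2])
removed = table.disassociate("a", [1, 2, 3])
```

Here handle 1 is removed because no Collection references it any more, handle 2 survives through Collection "b", and handle 3 is ignored because it was never associated with "a".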
-
makeRun
(self, collection)¶ Create a new Run in the Registry and return it.
Parameters: collection (str) – the Collection used to identify all inputs and outputs of the Run.
Returns: a Run instance.
Not supported by limited Registries.
-
updateRun
(self, run)¶ Update the environment and/or pipeline of the given Run in the database, given the DatasetHandle attributes of the given Run.
Not supported by limited Registries.
-
getRun
(self, collection=None, id=None)¶ Get a Run corresponding to its collection or id.
Parameters:
Returns: a Run instance
-
addQuantum
(self, quantum)¶ Add a new Quantum to the Registry.
Parameters: quantum (Quantum) – a Quantum instance to add to the Registry.
The given Quantum must not already be present in the Registry (or any other); its pkey attribute must be None.
The predictedInputs attribute must be fully populated with DatasetHandles. The actualInputs and outputs will be ignored.
-
markInputUsed
(self, quantum, handle)¶ Record the given DatasetHandle as an actual (not just predicted) input of the given Quantum.
This updates both the Registry’s Quantum table and the Python Quantum.actualInputs attribute.
Raises an exception if handle is not already in the predicted inputs list.
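The Python-side half of this behavior might look like the sketch below. It assumes predictedInputs and actualInputs are flat lists and uses a hypothetical `QuantumSketch` stand-in; the corresponding update to the Registry's Quantum table is omitted, and `ValueError` is an assumed exception type.

```python
class QuantumSketch:
    """Stand-in for Quantum with only the two input lists."""

    def __init__(self, predictedInputs):
        self.predictedInputs = list(predictedInputs)
        self.actualInputs = []


def markInputUsed(quantum, handle):
    # Only inputs that were predicted may be recorded as actually used.
    if handle not in quantum.predictedInputs:
        raise ValueError(f"{handle!r} is not a predicted input")
    # Recording the same input twice has no further effect.
    if handle not in quantum.actualInputs:
        quantum.actualInputs.append(handle)


quantum = QuantumSketch(["h1", "h2"])
markInputUsed(quantum, "h1")
markInputUsed(quantum, "h1")
```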
-
addDataUnit
(self, unit, replace=False)¶ Add a new DataUnit, optionally replacing an existing one (for updates).
Parameters:
Not supported by limited Registries.
Todo
This will need to update many-to-many join tables between DataUnits in some cases. We may want to vectorize this operation or otherwise allow many new DataUnits to be added before updating the join tables.
-
findDataUnit
(self, cls, values)¶ Return a DataUnit given a dictionary of values.
Parameters:
Returns: a DataUnit instance of type cls, or None if no matching unit is found.
See also DataUnitMap.findDataUnit().
Not supported by limited Registries.
-
expand
(self, label)¶ Expand a DatasetLabel, returning an equivalent DatasetRef.
Must be a simple pass-through if label is already a DatasetRef.
For limited Registries, label must be a DatasetRef, making this a guaranteed no-op (but still callable, for interface compatibility).
-
find
(self, collection, label)¶ Look up the location of the Dataset associated with the given DatasetLabel.
This can be used to obtain the URI that permits the Dataset to be read from a Datastore.
Must be a simple pass-through if label is already a DatasetHandle.
Parameters: - collection (str) – the Collection to search.
- label (DatasetLabel) – a DatasetLabel that identifies the Dataset. For limited Registries, must be a DatasetRef.
Returns: a DatasetHandle instance
-
makeDataGraph
(self, collections, expr, neededDatasetTypes, futureDatasetTypes)¶ Evaluate a filter expression and lists of DatasetTypes and return a QuantumGraph.
Parameters: - collections (list[str]) – an ordered list of the Collections to search for Datasets.
- expr (str) – an expression that limits the DataUnits and (indirectly) the Datasets returned.
- neededDatasetTypes (list[DatasetType]) – the list of DatasetTypes whose instances should be included in the graph and limit its extent.
- futureDatasetTypes (list[DatasetType]) – the list of DatasetTypes whose instances may be added to the graph later, which requires that their DataUnit types be present in the graph.
Returns: a QuantumGraph instance with a QuantumGraph.units attribute that is not None.
Not supported by limited Registries.
Todo
More complete description for expressions.
-
subset
(self, collection, expr, datasetTypes)¶ Create a new Collection by subsetting an existing one.
Parameters: - collection (str) – the input Collection to subset.
- expr (str) – an expression that limits the DataUnits and (indirectly) the Datasets in the subset.
- datasetTypes (list[DatasetType]) – the list of DatasetTypes whose instances should be included in the subset.
Returns: a str Collection
Not supported by limited Registries.
Todo
This should probably have the same signature as makeDataGraph; since that implies it can merge as it subsets, it might need a new name.
-
merge
(self, outputCollection, inputCollections)¶ Create a new Collection from a series of existing ones.
Entries earlier in the list will be used in preference to later entries when both contain Datasets with the same DatasetRef.
Parameters: - outputCollection – a str Collection to use for the new Collection.
- inputCollections (list[str]) – a list of Collections to combine.
Not supported by limited Registries.
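The precedence rule for merge() can be illustrated by modelling each input Collection as a dictionary from DatasetRef to DatasetHandle. This is only a sketch of the ordering semantics: `mergeCollections` is a hypothetical helper, and the string keys and values stand in for real DatasetRef and DatasetHandle objects.

```python
def mergeCollections(inputCollections):
    """Combine {ref: handle} mappings, preferring earlier inputs."""
    merged = {}
    for collection in inputCollections:
        for ref, handle in collection.items():
            # setdefault keeps the first handle seen for each ref, so
            # earlier Collections win over later ones.
            merged.setdefault(ref, handle)
    return merged


first = {"calexp/42": "handle-1"}
second = {"calexp/42": "handle-2", "src/42": "handle-3"}
merged = mergeCollections([first, second])
```

The shared key "calexp/42" resolves to the handle from `first`, while "src/42", present only in `second`, is carried over unchanged.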
-
makeProvenanceGraph
(self, expr, types=None)¶ Return a QuantumGraph that contains the full provenance of all Datasets matching an expression.
Parameters: expr (str) – an expression (SQL query that evaluates to a list of dataset_id) that selects the Datasets.
Returns: a QuantumGraph instance (with units set to None).
Todo
Should have convenience versions that operate on e.g. DatasetHandles provided by the user.
-
export
(self, expr)¶ Export contents of the Registry, limited to those reachable from the Datasets identified by the expression expr, into a TableSet format such that it can be imported into a different database.
Parameters: expr (str) – an expression (SQL query that evaluates to a list of dataset_id) that selects the Datasets, or a QuantumGraph that can be similarly interpreted.
Returns: a TableSet containing all rows, from all tables in the Registry, that are reachable from the selected Datasets.
Not supported by limited Registries.
Todo
Should have convenience versions that operate on e.g. DatasetHandles provided by the user. Should also have a version that operates on a QuantumGraph; that may also permit some optimizations, especially if QuantumGraph.units is not None.
-
import_
(self, tables, collection)¶ Import (previously exported) contents into the (possibly empty) Registry.
Parameters:
Limited Registries will import only some of the information exported by a full Registry.
-
transfer
(self, src, expr, collection)¶ Transfer contents from a source Registry, limited to those reachable from the Datasets identified by the expression expr, into this Registry and associate them with a Collection.
Parameters:
Trivially implemented as:
def transfer(self, src, expr, collection):
    self.import_(src.export(expr), collection)
Datastore
A system that holds persisted Datasets and can read and optionally write them.
This may be based on a (shared) filesystem, an object store, a SQL database, or some other system.
Many Datastore implementations will consist of both a client and a server.
Transition
Datastore represents a refactoring of some responsibilities previously held by the v14 Butler and Mapper objects.
Datastore implementations are the most likely place in the new design where existing v14 Butler code could be used.
Python API
-
class
Datastore
¶ -
get
(self, uri, storageClass, parameters=None)¶ Load an InMemoryDataset from the store.
Parameters: - uri (str) – a URI that specifies the location of the stored Dataset.
- storageClass (StorageClass) – the StorageClass associated with the DatasetType.
- parameters (dict) – StorageClass-specific parameters that specify a slice of the Dataset to be loaded.
Returns: an InMemoryDataset or slice thereof.
-
put
(self, inMemoryDataset, storageClass, storageHint, typeName=None)¶ Write an InMemoryDataset with a given StorageClass to the store.
Parameters: - inMemoryDataset – the InMemoryDataset to store.
- storageClass (StorageClass) – the StorageClass associated with the DatasetType.
- storageHint (str) – A StorageHint that provides a hint that the Datastore may use as (part of) the URI.
- typeName (str) – The DatasetType name, which may be used by the Datastore to override the default serialization format for the StorageClass.
Returns: the str URI and a dictionary of URIs for the Dataset’s components. The latter will be empty (or None?) if the Dataset is not a composite.
-
remove
(self, uri)¶ Indicate to the Datastore that a Dataset can be removed.
Some Datastores may implement this method as a silent no-op to disable Dataset deletion through standard interfaces.
-
transfer
(self, inputDatastore, inputUri, storageClass, storageHint, typeName=None)¶ Retrieve a Dataset with a given URI from an input Datastore, and store the result in this Datastore.
Parameters: - inputDatastore (Datastore) – the external Datastore from which to retrieve the Dataset.
- inputUri (str) – the URI of the Dataset in the input Datastore.
- storageClass (StorageClass) – the StorageClass associated with the DatasetType.
- storageHint (str) – A StorageHint that provides a hint that this Datastore may use as [part of] the URI.
- typeName (str) – The DatasetType name, which may be used by this Datastore to override the default serialization format for the StorageClass.
Returns: the str URI and a dictionary of URIs for the Dataset’s components. The latter will be empty (or None?) if the Dataset is not a composite.
Todo
This interface does not permit composite datasets stored as separate components to be transferred into a single file. We have a clear use case for that, but it probably can’t be done with a signature that transfers one Dataset at a time.
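The four Datastore methods can be sketched with a toy implementation that keeps pickled objects in a dictionary. Everything here beyond the method signatures is an assumption for illustration: the class name `DictDatastoreSketch`, the `mem://` URI scheme, the use of `pickle`, and the string passed as storageClass. StorageClass-specific formats, slicing parameters, and composites are ignored, and the empty components dictionary is just one of the two options left open above.

```python
import pickle


class DictDatastoreSketch:
    """Hypothetical Datastore that keeps serialized Datasets in memory."""

    def __init__(self, root):
        self._root = root  # made-up URI prefix, e.g. "mem://scratch"
        self._blobs = {}

    def put(self, inMemoryDataset, storageClass, storageHint, typeName=None):
        # The hint is used directly as the tail of the URI.
        uri = f"{self._root}/{storageHint}"
        self._blobs[uri] = pickle.dumps(inMemoryDataset)
        return uri, {}  # no components: nothing here is a composite

    def get(self, uri, storageClass, parameters=None):
        # parameters (slicing) is accepted but not honored in this sketch.
        return pickle.loads(self._blobs[uri])

    def remove(self, uri):
        self._blobs.pop(uri, None)

    def transfer(self, inputDatastore, inputUri, storageClass, storageHint,
                 typeName=None):
        # Read from the other store, then write through the normal put path.
        dataset = inputDatastore.get(inputUri, storageClass)
        return self.put(dataset, storageClass, storageHint, typeName)


source = DictDatastoreSketch("mem://source")
target = DictDatastoreSketch("mem://target")
uri, components = source.put({"flux": [1.0, 2.0]}, "dict", "catalog/42")
newUri, _ = target.transfer(source, uri, "dict", "catalog/42")
roundTripped = target.get(newUri, "dict")
source.remove(uri)
```

Implementing transfer() as a get() followed by a put() handles one Dataset at a time, which is why it cannot merge separately stored components into a single file, as the Todo above notes.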
TableSet
A serializable set of exported database tables.
Note
A TableSet does not need to contain all information needed to recreate the database tables themselves (since the tables are part of the common schema), but should contain all necessary information to recreate all the content within them.
Transition
Todo
Fill in transition details.
Python API
Todo
Specify Python API.