Grouping and Provenance¶
Collection¶
An entity that contains Datasets, with the following conditions:
- Has at most one Dataset per DatasetRef.
- Has a unique, human-readable identifier string.
- Can be combined with a DatasetRef to obtain a globally unique URI.
Most Registries contain multiple Collections.
Transition¶
The v14 Butler’s Data Repository concept plays a similar role in many contexts, but with a very different implementation and a very different relationship to the Registry concept.
Python API¶
Collections are simply Python strings.
A QuantumGraph may be constructed to hold exactly the contents of a single Collection, but does not do so in general.
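As an illustration of the points above (a Collection is just a string, and combining it with a DatasetRef yields a globally unique URI), here is a minimal sketch; the URI scheme, the helper name, and the way the DatasetRef is spelled out are assumptions for illustration only, not part of the design:

    # Hypothetical sketch only: the scheme, helper name, and the way a DatasetRef
    # is written out (DatasetType name + DataUnit values) are assumptions.
    def make_dataset_uri(registry_netloc, collection, dataset_type_name, data_id):
        """Combine a Collection string with the identifying content of a DatasetRef."""
        units = "/".join(f"{key}={value}" for key, value in sorted(data_id.items()))
        return f"datasets://{registry_netloc}/{collection}/{dataset_type_name}/{units}"

    # Because a Collection holds at most one Dataset per DatasetRef, the resulting
    # URI identifies at most one Dataset.
    print(make_dataset_uri("registry.example.org", "shared/winter2018",
                           "calexp", {"visit": 903334, "ccd": 16}))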
SQL Representation¶
Collections are defined by a many-to-many “join” table that links Dataset to Collections. Because Collections are just strings, we have no independent Collection table.
DatasetCollections¶
- Fields:
    - collection (varchar NOT NULL)
    - dataset_id (int NOT NULL)
    - registry_id (int NOT NULL)
- Primary Key:
    - (collection, dataset_id, registry_id)
- Foreign Keys:
    - (dataset_id, registry_id) references Dataset (dataset_id, registry_id)
This table should be present even in Registries that only represent a single Collection (though in this case it may of course be a trivial view on Dataset).
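The following is a minimal sketch of this join table, written against SQLite via Python's sqlite3 module, with the Dataset table reduced to only the columns needed for the foreign key; the exact DDL is illustrative rather than normative:

    import sqlite3

    # Sketch only: types and constraints follow the field list above, but the
    # full Dataset table has many more columns than shown here.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Dataset (
        dataset_id   INTEGER NOT NULL,
        registry_id  INTEGER NOT NULL,
        PRIMARY KEY (dataset_id, registry_id)
    );

    CREATE TABLE DatasetCollections (
        collection   VARCHAR NOT NULL,
        dataset_id   INTEGER NOT NULL,
        registry_id  INTEGER NOT NULL,
        PRIMARY KEY (collection, dataset_id, registry_id),
        FOREIGN KEY (dataset_id, registry_id)
            REFERENCES Dataset (dataset_id, registry_id)
    );
    """)

    # Tagging the same Dataset into two Collections is just two rows in the join table.
    conn.execute("INSERT INTO Dataset VALUES (1, 1)")
    conn.executemany("INSERT INTO DatasetCollections VALUES (?, 1, 1)",
                     [("ingest/raw",), ("rerun/my-test",)])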
Todo
Storing the collection for every Dataset is costly (but may be mitigated by compression).
Perhaps it would be better to have a separate Collection table and reference Collections by collection_id instead?
Run¶
An action that produces Datasets, usually associated with a well-defined software environment.
Most Runs will correspond to a launch of a SuperTask Pipeline.
Every Dataset must be associated with a Run, though Registries may define one or more special Runs to act as defaults or label continuous operations (e.g. raw data ingest).
Transition¶
A Run is at least initially associated with a Collection, making it (like Collection) similar to the v14 Data Repository concept. Again like Collection, its implementation is entirely different.
Python API¶
class Run¶

    A concrete, final class representing a Run. Run instances in Python can only be created by Registry.makeRun().

    collection¶
        The Collection associated with a Run. While a new collection is created for a Run when the Run is created, that collection may later be deleted, so this attribute may be None.

    environment¶
        A DatasetHandle that can be used to retrieve a description of the software environment used to create the Run.

    pipeline¶
        A DatasetHandle that can be used to retrieve the Pipeline (including configuration) used during this Run.

    pkey¶
        The (run_id, registry_id) tuple used to uniquely identify this Run.
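A usage sketch under stated assumptions: registry is an existing Registry instance obtained elsewhere, and passing a Collection name to makeRun() is an assumption about its signature; only the attribute names come from the class description above:

    # Sketch only: `registry` is assumed to exist, and the makeRun() argument
    # shown (a Collection name) is an assumption about its signature.
    run = registry.makeRun("shared/winter2018")

    # A new Collection is created for the Run, so this attribute is initially set
    # (it may become None later if that Collection is deleted).
    print(run.collection)

    # The compound key that uniquely identifies the Run in its Registry.
    run_id, registry_id = run.pkey

    # environment and pipeline are DatasetHandles describing the software
    # environment and the Pipeline configuration used for the Run.
    for handle in (run.environment, run.pipeline):
        print(handle)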
Todo
If a Collection table is adopted, the collection field can be replaced by a collection_id for increased space efficiency.
Quantum¶
A discrete unit of work that may depend on one or more Datasets and produces one or more Datasets.
Most Quanta will be executions of a particular SuperTask’s runQuantum method, but they can also be used to represent discrete units of work performed manually by human operators or other software agents.
Transition¶
The Quantum concept does not exist in the v14 Butler.
A Quantum is analogous to an Open Provenance Model “process”.
Python API¶
class Quantum¶

    predictedInputs¶
        A dictionary of input datasets that were expected to be used, with DatasetType names as keys and a set of DatasetRef instances as values.
        Input Datasets that have already been stored may be DatasetHandles, and in many contexts may be guaranteed to be.
        Read-only; update via addPredictedInput().

    actualInputs¶
        A dictionary of input datasets that were actually used, with the same form as predictedInputs.
        All returned sets must be subsets of those in predictedInputs.
        Read-only; update via Registry.markInputUsed().

    addPredictedInput(ref)¶
        Add an input DatasetRef to the Quantum.
        This does not automatically update a Registry; all predictedInputs must be present before Registry.addQuantum() is called.

    outputs¶
        A dictionary of output datasets, with the same form as predictedInputs.
        Read-only; update via Registry.addDataset(), QuantumGraph.addDataset(), or Butler.put().

    task¶
        If the Quantum is associated with a SuperTask, this is the SuperTask instance that produced and should execute this set of inputs and outputs. If not, a human-readable string identifier for the operation. Some Registries may permit the value to be None, but are not required to in general.
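A sketch of how these attributes and methods could be used together, assuming registry, a Run run, and DatasetRefs flat_ref, bias_ref, and mask_ref already exist; the constructor and method arguments shown are assumptions, and only the names documented above are taken from the design:

    # Sketch only: the Quantum constructor arguments and the exact signatures of
    # the Registry methods are assumptions.
    quantum = Quantum(run=run, task="manual-defect-masking")

    # All predicted inputs must be attached before the Quantum is registered.
    quantum.addPredictedInput(flat_ref)
    quantum.addPredictedInput(bias_ref)
    registry.addQuantum(quantum)

    # Outputs are associated when the output Dataset is added with this Quantum
    # as its producer (here via Registry.addDataset(); Butler.put() would do the
    # same thing at a higher level).
    registry.addDataset(mask_ref, run=run, producer=quantum)

    # After execution, record which predicted inputs were actually read; actual
    # inputs are always a subset of the predicted ones.
    registry.markInputUsed(quantum, flat_ref)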
SQL Representation¶
Quanta are stored in a single table that records their scalar attributes:
Quantum¶
- Fields:
    - quantum_id (int NOT NULL)
    - registry_id (int NOT NULL)
    - run_id (int NOT NULL)
    - task (varchar)
- Primary Key:
    - (quantum_id, registry_id)
- Foreign Keys:
    - (run_id, registry_id) references Run (run_id, registry_id)
Quantum uses the same compound primary key approach as Dataset.
The Datasets produced by a Quantum (the Quantum.outputs attribute in Python) are recorded via the producer_id field in the Dataset table.
The inputs, both predicted and actual, are stored in an additional join table:
DatasetConsumers¶
- Fields:
    - quantum_id (int NOT NULL)
    - quantum_registry_id (int NOT NULL)
    - dataset_id (int NOT NULL)
    - dataset_registry_id (int NOT NULL)
    - actual (bool NOT NULL)
- Primary Key:
    - None
- Foreign Keys:
    - (quantum_id, quantum_registry_id) references Quantum (quantum_id, registry_id)
    - (dataset_id, dataset_registry_id) references Dataset (dataset_id, registry_id)
There is no guarantee that the full provenance of a Dataset is captured by these tables in all Registries, because subset and transfer operations do not require provenance information to be included. Furthermore, Registries may or may not require a Quantum to be provided when calling Registry.addDataset() (which is called by Butler.put()), making it the caller's responsibility to add provenance when needed.
However, all Registries (including limited Registries) are required to record provenance information when it is provided.
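To make the relationship between producer_id and DatasetConsumers concrete, here is a sketch of a query that recovers the inputs of the Quantum that produced a given Dataset. It assumes the producing Quantum lives in the same Registry as the Dataset (so producer_id pairs with the Dataset's own registry_id to form the compound Quantum key); the query text and aliases are illustrative:

    # Sketch only: assumes producer_id on Dataset pairs with the Dataset's own
    # registry_id to form the compound key of the producing Quantum.
    FIND_PROVENANCE = """
    SELECT inp.dataset_id, inp.registry_id, dc.actual
    FROM Dataset AS out
    JOIN DatasetConsumers AS dc
        ON dc.quantum_id = out.producer_id
        AND dc.quantum_registry_id = out.registry_id
    JOIN Dataset AS inp
        ON inp.dataset_id = dc.dataset_id
        AND inp.registry_id = dc.dataset_registry_id
    WHERE out.dataset_id = :dataset_id
      AND out.registry_id = :registry_id
    """
    # With a DB-API connection to a database containing these tables:
    #     rows = connection.execute(FIND_PROVENANCE,
    #                               {"dataset_id": 42, "registry_id": 1}).fetchall()
    # Rows with actual = TRUE were really read; the rest were only predicted.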
Note
As with everything else in the common Registry schema, the provenance system used in the operations data backbone will almost certainly involve additional fields and tables, and what’s in the schema will just be a view. But the provenance tables here are even more of a blind straw-man than the rest of the schema (which is derived more directly from SuperTask requirements), and I certainly expect it to change based on feedback; I think this reflects all that we need outside the operations system, but how operations implements their system should probably influence the details.
QuantumGraph¶
A graph in which the nodes are DatasetRefs and Quanta and the edges are the producer/consumer relations between them.
Python API¶
class QuantumGraph¶

    datasets¶
        A dictionary with DatasetType names as keys and sets of DatasetRefs of those types as values.
        Read-only (possibly only by convention); use addDataset() to insert new DatasetRefs.

    quanta¶
        A sequence of Quantum instances whose order is consistent with their dependency ordering.
        Read-only (possibly only by convention); use addQuantum() to insert new Quanta.

    addQuantum(quantum)¶
        Add a Quantum to the graph.
        Any entries in Quantum.predictedInputs or Quantum.actualInputs must already be present in the graph. The Quantum.outputs attribute should be empty.

    addDataset(ref, producer)¶
        Add a DatasetRef to the graph.
        Parameters:
            - ref (DatasetRef) – a pointer to the Dataset to be added.
            - producer (Quantum) – the Quantum responsible for producing the Dataset. Must already be present in the graph.

    units¶
        A DataUnitMap that describes the relationships between the DataUnits that label the graph's Datasets.
        May be None in some QuantumGraphs.
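A sketch of assembling a small QuantumGraph, assuming DatasetRefs raw_ref and calexp_ref and a SuperTask instance isr_task already exist; whether graph-external inputs are added with producer=None is an assumption, and only the method and attribute names come from the class description above:

    # Sketch only: producer=None for graph-external inputs and the Quantum
    # constructor arguments are assumptions.
    graph = QuantumGraph()

    # Inputs that are not produced inside the graph must still be present before
    # any Quantum that consumes them is added.
    graph.addDataset(raw_ref, producer=None)

    quantum = Quantum(task=isr_task)
    quantum.addPredictedInput(raw_ref)   # raw_ref is already in the graph
    graph.addQuantum(quantum)            # quantum.outputs must still be empty here

    # Adding the output through the graph records quantum as its producer.
    graph.addDataset(calexp_ref, producer=quantum)

    # quanta is ordered consistently with dependencies, so it can be walked
    # front to back for execution.
    for q in graph.quanta:
        print(q.task, sorted(q.predictedInputs.keys()))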