Grouping and Provenance

Collection

An entity that contains Datasets, with the following conditions:

  • Has at most one Dataset per DatasetRef.
  • Has a unique, human-readable identifier string.
  • Can be combined with a DatasetRef to obtain a globally unique URI.

Most Registries contain multiple Collections.
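
As an illustration of the last condition, the sketch below (not part of the design) shows one hypothetical way a Collection string and the identifying content of a DatasetRef could be combined into a globally unique URI; the function name, URI scheme, and example values are all assumptions.

    def make_dataset_uri(collection, dataset_type_name, data_id):
        """Combine a Collection string with a DatasetRef's identifying content
        (DatasetType name + DataUnit values) into a URI (illustrative sketch)."""
        query = ",".join(f"{k}={v}" for k, v in sorted(data_id.items()))
        return f"registry:///{collection}/{dataset_type_name}?{query}"

    # Hypothetical usage; the Collection name and DataUnit values are invented.
    make_dataset_uri("shared/reprocessing-2018", "calexp",
                     {"visit": 903334, "ccd": 22})
    # -> 'registry:///shared/reprocessing-2018/calexp?ccd=22,visit=903334'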

Transition

The v14 Butler’s Data Repository concept plays a similar role in many contexts, but with a very different implementation and a very different relationship to the Registry concept.

Python API

Collections are simply Python strings.

A QuantumGraph may be constructed to hold exactly the contents of a single Collection, but does not do so in general.

SQL Representation

Collections are defined by a many-to-many “join” table that links Datasets to Collections. Because Collections are just strings, there is no independent Collection table.

DatasetCollections

Fields:
collection varchar NOT NULL
dataset_id int NOT NULL
registry_id int NOT NULL
Primary Key:
  • (collection, dataset_id, registry_id)
Foreign Keys:
  • (dataset_id, registry_id) references Dataset (dataset_id, registry_id)

This table should be present even in Registries that only represent a single Collection (though in this case it may of course be a trivial view on Dataset).
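
As a concrete but purely illustrative rendering of the table above, a SQLAlchemy sketch might look like the following; the column length and the assumption that the Dataset table is defined in the same MetaData are not part of the design.

    from sqlalchemy import (MetaData, Table, Column, Integer, String,
                            PrimaryKeyConstraint, ForeignKeyConstraint)

    metadata = MetaData()

    dataset_collections = Table(
        "DatasetCollections", metadata,
        Column("collection", String(128), nullable=False),  # length is illustrative
        Column("dataset_id", Integer, nullable=False),
        Column("registry_id", Integer, nullable=False),
        PrimaryKeyConstraint("collection", "dataset_id", "registry_id"),
        ForeignKeyConstraint(
            ["dataset_id", "registry_id"],
            ["Dataset.dataset_id", "Dataset.registry_id"],
        ),
    )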

Todo

Storing the collection string for every Dataset is costly (though the cost may be mitigated by compression). It may be better to add a separate Collection table and reference it by collection_id instead.

Run

An action that produces Datasets, usually associated with a well-defined software environment.

Most Runs will correspond to a launch of a SuperTask Pipeline.

Every Dataset must be associated with a Run, though Registries may define one or more special Runs to act as defaults or label continuous operations (e.g. raw data ingest).

Transition

A Run is at least initially associated with a Collection, making it (like Collection) similar to the v14 Data Repository concept. Again like Collection, its implementation is entirely different.

Python API

class Run

A concrete, final class representing a Run.

Run instances in Python can only be created by Registry.makeRun().

collection

The Collection associated with a Run. A new Collection is created whenever a Run is created, but that Collection may later be deleted, so this attribute may be None.

environment

A DatasetHandle that can be used to retrieve a description of the software environment used to create the Run.

pipeline

A DatasetHandle that can be used to retrieve the Pipeline (including configuration) used during this Run.

pkey

The (run_id, registry_id) tuple used to uniquely identify this Run.
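
A minimal Python sketch of the class described above; the attribute storage and constructor arguments are assumptions (real instances are only created by Registry.makeRun()):

    class Run:
        """A concrete, final class representing a Run (illustrative sketch)."""

        def __init__(self, runId, registryId, collection=None,
                     environment=None, pipeline=None):
            self._runId = runId
            self._registryId = registryId
            self.collection = collection    # may be None if the Collection was deleted
            self.environment = environment  # DatasetHandle for the software environment
            self.pipeline = pipeline        # DatasetHandle for the Pipeline and its configuration

        @property
        def pkey(self):
            """The (run_id, registry_id) tuple uniquely identifying this Run."""
            return (self._runId, self._registryId)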

Todo

If a separate Collection table is adopted, the collection attribute can be replaced by a collection_id for increased space efficiency.

SQL Representation

Run

Fields:
run_id int NOT NULL
registry_id int NOT NULL
collection varchar  
environment_id int  
pipeline_id int  
Primary Key:
  • (run_id, registry_id)
Foreign Keys:
  • (environment_id, registry_id) references Dataset (dataset_id, registry_id)
  • (pipeline_id, registry_id) references Dataset (dataset_id, registry_id)

Run uses the same compound primary key approach as Dataset.
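
Continuing the SQLAlchemy sketch shown for DatasetCollections (and reusing its imports and MetaData, with the Dataset table assumed to be defined there as well), the Run table might be rendered as:

    run_table = Table(
        "Run", metadata,
        Column("run_id", Integer, nullable=False),
        Column("registry_id", Integer, nullable=False),
        Column("collection", String(128)),   # nullable: the Collection may be deleted
        Column("environment_id", Integer),
        Column("pipeline_id", Integer),
        PrimaryKeyConstraint("run_id", "registry_id"),
        ForeignKeyConstraint(["environment_id", "registry_id"],
                             ["Dataset.dataset_id", "Dataset.registry_id"]),
        ForeignKeyConstraint(["pipeline_id", "registry_id"],
                             ["Dataset.dataset_id", "Dataset.registry_id"]),
    )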

Quantum

A discrete unit of work that may depend on one or more Datasets and produces one or more Datasets.

Most Quanta will be executions of a particular SuperTask’s runQuantum method, but they can also be used to represent discrete units of work performed manually by human operators or other software agents.

Transition

The Quantum concept does not exist in the v14 Butler.

A Quantum is analogous to an Open Provenance Model “process”.

Python API

class Quantum
run

The Run this Quantum is a part of.

predictedInputs

A dictionary of input datasets that were expected to be used, with DatasetType names as keys and sets of DatasetRef instances as values.

Input Datasets that have already been stored may be DatasetHandles, and in many contexts may be guaranteed to be.

Read-only; update via addPredictedInput().

actualInputs

A dictionary of input datasets that were actually used, with the same form as predictedInputs.

All returned sets must be subsets of those in predictedInputs.

Read-only; update via Registry.markInputUsed().

addPredictedInput(ref)

Add an input DatasetRef to the Quantum.

This does not automatically update a Registry; all predictedInputs must be present before Registry.addQuantum() is called.

outputs

A dictionary of output datasets, with the same form as predictedInputs.

Read-only; update via Registry.addDataset(), QuantumGraph.addDataset(), or Butler.put().

task

If the Quantum is associated with a SuperTask, this is the SuperTask instance that produced and should execute this set of inputs and outputs. If not, it is a human-readable string identifier for the operation. Some Registries may permit the value to be None, but they are not required to.

pkey

The (quantum_id, registry_id) tuple used to uniquely identify this Quantum, or None if it has not yet been inserted into a Registry.
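
A minimal Python sketch consistent with the attributes and methods described above; the internal storage, the exact constructor signature, and the use of ref.type.name to obtain a DatasetType name are assumptions:

    class Quantum:
        """A discrete unit of work (illustrative sketch only)."""

        def __init__(self, run, task=None):
            self.run = run               # the Run this Quantum is a part of
            self.task = task             # SuperTask instance or descriptive string
            self._predictedInputs = {}   # {DatasetType name: set of DatasetRef}
            self._actualInputs = {}      # same form; always subsets of predictedInputs
            self._outputs = {}           # same form; filled via Registry/QuantumGraph/Butler
            self._pkey = None            # (quantum_id, registry_id) once inserted into a Registry

        @property
        def predictedInputs(self):
            return self._predictedInputs

        @property
        def actualInputs(self):
            return self._actualInputs

        @property
        def outputs(self):
            return self._outputs

        @property
        def pkey(self):
            return self._pkey

        def addPredictedInput(self, ref):
            # Does not update any Registry; all predicted inputs must be added
            # before Registry.addQuantum() is called.
            self._predictedInputs.setdefault(ref.type.name, set()).add(ref)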

SQL Representation

Quanta are stored in a single table that records their scalar attributes:

Quantum

Fields:
quantum_id int NOT NULL
registry_id int NOT NULL
run_id int NOT NULL
task varchar  
Primary Key:
  • (quantum_id, registry_id)
Foreign Keys:
  • (run_id, registry_id) references Run (run_id, registry_id)

Quantum uses the same compound primary key approach as Dataset.

The Datasets produced by a Quantum (the Quantum.outputs attribute in Python) are recorded in the producer_id field of the Dataset table. The inputs, both predicted and actual, are stored in an additional join table:

DatasetConsumers

Fields:
quantum_id int NOT NULL
quantum_registry_id int NOT NULL
dataset_id int NOT NULL
dataset_registry_id int NOT NULL
actual bool NOT NULL
Primary Key:
None
Foreign Keys:
  • (quantum_id, quantum_registry_id) references Quantum (quantum_id, registry_id)
  • (dataset_id, dataset_registry_id) references Dataset (dataset_id, registry_id)
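
Purely as an illustration of how these tables relate, the following SQLAlchemy query sketch retrieves a Quantum's predicted inputs, actual inputs, and outputs. The quantum_provenance helper is hypothetical, the table objects are assumed to follow the field lists in this document, and the pairing of Dataset.producer_id with Dataset.registry_id is an assumption.

    from sqlalchemy import select

    def quantum_provenance(conn, dataset, dataset_consumers, quantum_id, registry_id):
        """Return (predicted input ids, actual input ids, output ids) for one Quantum."""
        # SQLAlchemy 1.4+ style; 'dataset' and 'dataset_consumers' are Table objects.
        inputs = select(dataset_consumers.c.dataset_id).where(
            dataset_consumers.c.quantum_id == quantum_id,
            dataset_consumers.c.quantum_registry_id == registry_id,
        )
        predicted = conn.execute(inputs).scalars().all()
        actual = conn.execute(
            inputs.where(dataset_consumers.c.actual.is_(True))
        ).scalars().all()
        outputs = conn.execute(
            select(dataset.c.dataset_id).where(
                dataset.c.producer_id == quantum_id,   # assumed to pair with registry_id
                dataset.c.registry_id == registry_id,
            )
        ).scalars().all()
        return predicted, actual, outputs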

There is no guarantee that the full provenance of a Dataset is captured by these tables in all Registries, because subset and transfer operations do not require provenance information to be included. Furthermore, Registries may or may not require a Quantum to be provided when calling Registry.addDataset() (which is called by Butler.put()), making it the caller's responsibility to add provenance when needed. However, all Registries (including limited Registries) are required to record provenance information when it is provided.

Note

As with everything else in the common Registry schema, the provenance system used in the operations data backbone will almost certainly involve additional fields and tables, and what appears in the schema here will just be a view. The provenance tables, however, are even more of a blind straw-man than the rest of the schema (which is derived more directly from SuperTask requirements), and I certainly expect them to change based on feedback. I think this reflects all that we need outside the operations system, but how operations implements its system should probably influence the details.

QuantumGraph

A graph in which the nodes are DatasetRefs and Quanta and the edges are the producer/consumer relations between them.

Python API

class QuantumGraph
datasets

A dictionary with DatasetType names as keys and sets of DatasetRefs of those types as values.

Read-only (possibly only by convention); use addDataset() to insert new DatasetRefs.

quanta

A sequence of Quantum instances whose order is consistent with their dependency ordering.

Read-only (possibly only by convention); use addQuantum() to insert new Quanta.

addQuantum(quantum)

Add a Quantum to the graph.

Any entries in Quantum.predictedInputs or Quantum.actualInputs must already be present in the graph. The Quantum.outputs attribute should be empty.

addDataset(ref, producer)

Add a DatasetRef to the graph.

Parameters:
  • ref - the DatasetRef to add.
  • producer - the Quantum that produces the Dataset (used to record the producer/consumer edge).

units

A DataUnitMap that describes the relationships between the DataUnits that label the graph’s Datasets.

May be None in some QuantumGraphs.
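
A minimal Python sketch consistent with the description above; the internal containers, the dependency-order bookkeeping, and the use of ref.type.name are assumptions:

    class QuantumGraph:
        """A graph of DatasetRefs and Quanta (illustrative sketch only)."""

        def __init__(self, units=None):
            self.units = units       # DataUnitMap, or None in some QuantumGraphs
            self._datasets = {}      # {DatasetType name: set of DatasetRef}
            self._quanta = []        # Quanta in an order consistent with dependencies

        @property
        def datasets(self):
            return self._datasets

        @property
        def quanta(self):
            return tuple(self._quanta)

        def addDataset(self, ref, producer):
            # 'producer' is taken to be the Quantum that produces the Dataset
            # (or None if no producer is recorded in the graph).
            self._datasets.setdefault(ref.type.name, set()).add(ref)
            if producer is not None:
                producer.outputs.setdefault(ref.type.name, set()).add(ref)

        def addQuantum(self, quantum):
            # All predicted and actual inputs must already be in the graph;
            # appending after that check keeps self._quanta in dependency order.
            for refs in quantum.predictedInputs.values():
                for ref in refs:
                    if ref not in self._datasets.get(ref.type.name, set()):
                        raise ValueError("input DatasetRef not present in the graph")
            self._quanta.append(quantum)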