Describing Datasets

Dataset

A Dataset is a discrete entity of stored data.

Datasets are uniquely identified by either a URI or the combination of a Collection and a DatasetRef.

Example: a “calexp” for a single visit and sensor produced by a processing run.

A Dataset may be composite, which means it contains one or more named component Datasets (for example, a “WCS” is a component of a “calexp”). Composites may be stored either by persisting the parent in a single file or by storing the components separately. Some composites simply aggregate Datasets that are always written as part of other Datasets; such composites are themselves read-only.

Datasets may also be sliced, which yields an InMemoryDataset of the same type containing a smaller amount of data, defined by some parameters. Subimages and filters on catalogs are both considered slices.
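As a toy illustration (not the Butler API), the defining property of a slice is that some parameters select a subset of the data while the result keeps the parent's type; here a hypothetical bbox parameter selects a subimage:

```python
# Toy sketch of slicing: a bbox parameter selects a subimage, and the
# result has the same type (a list of rows) as the parent, just smaller.

def slice_image(image, bbox):
    """Return the subimage selected by bbox = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in image[y0:y1]]

full = [[x + 10 * y for x in range(5)] for y in range(5)]
sub = slice_image(full, (1, 1, 3, 3))   # same type as full, less data
```

A filter expression on a catalog would play the same role as bbox here: parameters in, same-typed but smaller InMemoryDataset out.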

Datasets may include metadata in their persisted form in a Datastore, but Registries never hold Dataset metadata directly; all metadata is instead associated with the DataUnits associated with a Dataset. For example, metadata associated with an observation (e.g. zenith angle) would be attached to a Visit (or perhaps Snap or ObservedSensor) rather than to the raw Dataset itself. Because a raw is associated with those DataUnits, it is still associated with the metadata, but the association is indirect, and the metadata is also automatically associated with other Datasets that share those units, such as calexp or src.

Transition

The Dataset concept has essentially the same meaning that it did in the v14 Butler.

A Dataset is analogous to an Open Provenance Model “artifact”.

Python API

The Python representation of a Dataset is in some sense an InMemoryDataset, and hence we have no Python “Dataset” class. However, we have several Python objects that act like pointers to Datasets. These are described in the Python API section for DatasetRef.

SQL Representation

Datasets are represented by records in a single table that includes everything in a Registry, regardless of Collection or DatasetType:

Dataset

Fields:
dataset_id int NOT NULL
registry_id int NOT NULL
dataset_type_name varchar NOT NULL
uri varchar  
run_id int NOT NULL
producer_id int  
unit_hash binary NOT NULL
Primary Key:
dataset_id, registry_id
Foreign Keys:
  • dataset_type_name references DatasetType (name)
  • (run_id, registry_id) references Run (run_id, registry_id)
  • (producer_id, registry_id) references Quantum (quantum_id, registry_id)
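The schema above can be expressed as SQLite DDL for illustration (types mapped loosely, and the foreign keys omitted since the referenced tables are not created here; a real Registry may use a different SQL dialect):

```python
# Sketch: the Dataset table schema realized in an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dataset (
    dataset_id        INTEGER NOT NULL,
    registry_id       INTEGER NOT NULL,
    dataset_type_name VARCHAR NOT NULL,
    uri               VARCHAR,
    run_id            INTEGER NOT NULL,
    producer_id       INTEGER,
    unit_hash         BLOB    NOT NULL,
    PRIMARY KEY (dataset_id, registry_id)
);
""")
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
cols = [row[1] for row in conn.execute("PRAGMA table_info(Dataset)")]
```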

Using a single table (instead of per-DatasetType and/or per-Collection tables) ensures that table-creation permissions are not required when adding new DatasetTypes or Collections. It also makes it easier to store provenance by associating Datasets with Quanta.

The disadvantage of this approach is that the connections between Datasets and DataUnits must be stored in a set of additional join tables (one for each DataUnit table). The connections are summarized by the unit_hash field, which contains a sha512 hash that is unique only within a Collection for a given DatasetType, constructed by hashing the values of the associated units. While a unit_hash value cannot be used to reconstruct a full DatasetRef, a unit_hash value can be used to quickly search for the Dataset matching a given DatasetRef. It also allows Registry.merge() to be implemented purely as a database operation by using it as a GROUP BY column in a query over multiple Collections.
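A minimal sketch of how such a unit_hash might be computed, assuming the hash covers the DatasetType name and the associated unit values in a canonical order (the exact fields hashed are TBD in this design, so the details here are illustrative only):

```python
import hashlib

def compute_unit_hash(dataset_type_name, units):
    """Sketch: sha512 over the DatasetType name and sorted unit values.

    `units` maps DataUnit type names to their values; sorting the keys
    makes the hash independent of insertion order.
    """
    h = hashlib.sha512()
    h.update(dataset_type_name.encode("utf-8"))
    for key in sorted(units):
        h.update(key.encode("utf-8"))
        h.update(str(units[key]).encode("utf-8"))
    return h.digest()

h1 = compute_unit_hash("calexp", {"Visit": 12345, "Sensor": "1_1"})
h2 = compute_unit_hash("calexp", {"Sensor": "1_1", "Visit": 12345})
```

Because equal unit combinations hash to equal values, the unit_hash column can serve both as a fast lookup key for a given DatasetRef and as a GROUP BY column when merging Collections.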

Dataset uses a compound primary key that combines an autoincrement dataset_id field, populated by the Registry in which the Dataset originated, with a registry_id that identifies that Registry. When a Dataset is transferred between Registries, its registry_id is preserved without modification; this allows the destination Registry to assign its own new Datasets dataset_id values that have already been used (or may be used in the future) by the transferred-from Registry, without any risk of collision.

DatasetComposition

Fields:
parent_dataset_id int NOT NULL
parent_registry_id int NOT NULL
component_dataset_id int NOT NULL
component_registry_id int NOT NULL
component_name varchar NOT NULL
Primary Key:
  • (parent_dataset_id, parent_registry_id, component_dataset_id, component_registry_id)
Foreign Keys:
  • (parent_dataset_id, parent_registry_id) references Dataset (dataset_id, registry_id)
  • (component_dataset_id, component_registry_id) references Dataset (dataset_id, registry_id)

A self-join table that links composite datasets to their components.

  • If a virtual Dataset was created by writing multiple component Datasets, the parent DatasetType’s template field and the parent Dataset’s uri field may be null (depending on whether there was also a parent Dataset stored whose components should be overridden).
  • If a single Dataset was written and we’re defining virtual components, the component DatasetTypes should have null template fields, but the component Datasets will have non-null uri fields with values returned by the Datastore when Datastore.put() was called on the parent.

DatasetType

A named category of Datasets that defines how they are organized, related, and stored.

In addition to a name, a DatasetType includes:

  • a template used to generate StorageHints;
  • the set of DataUnit types used to identify its Datasets;
  • a StorageClass that defines how its Datasets are persisted.

Transition

The DatasetType concept has essentially the same meaning that it did in the v14 Butler.

Python API

class DatasetType

A concrete, final class whose instances represent DatasetTypes.

DatasetType instances may be constructed without a Registry, but they must be registered via Registry.registerDatasetType() before corresponding Datasets may be added.

DatasetType instances are immutable.

Note

In the current design, DatasetTypes are not type objects, and the DatasetRef class is not an instance of DatasetType. We could make that the case with a lot of metaprogramming, but this adds a lot of complexity to the code with no obvious benefit. It seems most prudent to just rename the DatasetType concept and class to something that doesn’t imply a type-instance relationship in Python.

__init__(name, template, units, storageClass)

Public constructor. All arguments correspond directly to instance attributes.

name

Read-only instance attribute.

A string name for the DatasetType; must correspond to the same DatasetType across all Registries.

template

Read-only instance attribute.

A string with str.format-style replacement patterns that can be used to create a StorageHint from a Run (and optionally its associated Collection) and a DatasetRef.

May be None to indicate a read-only Dataset or one whose templates must be provided at a higher level.

units

Read-only instance attribute.

A DataUnitTypeSet that defines the DatasetRefs corresponding to this DatasetType.

storageClass

Read-only instance attribute.

A StorageClass subclass (not instance) that defines how this DatasetType is persisted.
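Given the constructor and attributes above, constructing a DatasetType might look like the following sketch. The DatasetType class itself is not defined here, so a minimal stand-in factory is used; the specific name, template, and unit names are illustrative assumptions:

```python
from types import SimpleNamespace

# Minimal stand-in for the DatasetType constructor described above; the
# real class is immutable and would validate units and storageClass.
def make_dataset_type(name, template, units, storageClass):
    return SimpleNamespace(name=name, template=template,
                           units=tuple(units), storageClass=storageClass)

calexp_type = make_dataset_type(
    name="calexp",
    template="{collection}/calexp/v{visit}/s{sensor}.fits",
    units=("Visit", "Sensor"),
    storageClass="Exposure",   # stand-in for a StorageClass subclass
)
```

A real instance would then be registered via Registry.registerDatasetType() before any corresponding Datasets are added.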

SQL Representation

DatasetTypes are stored in a Registry using two tables. The first has a single record for each DatasetType and contains most of the information that defines it:

Todo

I’m a bit worried about relying on name being globally unique across Registries, but clashes should be very rare, and it might be good from a confusion-avoidance standpoint to force people to use new names when they mean something different.

DatasetType

Fields:
dataset_type_name varchar NOT NULL
template varchar  
storage_class varchar NOT NULL
Primary Key:
dataset_type_name
Foreign Keys:
None

The second table has a many-to-one relationship with the first and holds the names of the DataUnit types utilized by its DatasetRefs:

DatasetTypeUnits

Fields:
dataset_type_name varchar NOT NULL
unit_name varchar NOT NULL
Primary Key:
  • (dataset_type_name, unit_name)
Foreign Keys:
  • dataset_type_name references DatasetType (dataset_type_name)

StorageClass

A category of DatasetTypes that utilize the same in-memory classes for their InMemoryDatasets and can be saved to the same file format(s).

Transition

The allowed values for “storage” entries in v14 Butler policy files are analogous to StorageClasses.

Python API

class StorageClass

An abstract base class whose subclasses are StorageClasses.

subclasses

Concrete class attribute: provided by the base class.

A dictionary holding all StorageClass subclasses, keyed by their name attributes.

name

Virtual class attribute: must be provided by derived classes.

A string name that uniquely identifies the derived class.

components

Virtual class attribute: must be provided by derived classes.

A dictionary that maps component names to the StorageClass subclasses for those components. Should be empty (or None?) if the StorageClass is not a composite.
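One way the base class could maintain the subclasses registry is via `__init_subclass__`, as in this sketch (the real base class may use a metaclass instead, and the Exposure subclass here is a hypothetical example):

```python
# Sketch of the StorageClass subclass registry described above.

class StorageClass:
    subclasses = {}   # name -> subclass, shared on the base class
    name = None       # must be provided by derived classes
    components = {}   # component name -> StorageClass subclass

    def __init_subclass__(cls, **kwargs):
        # Called automatically when a subclass is defined; register it
        # under its declared name.
        super().__init_subclass__(**kwargs)
        if cls.name is not None:
            StorageClass.subclasses[cls.name] = cls

class Exposure(StorageClass):
    name = "Exposure"
    components = {}   # would map e.g. "wcs" to a Wcs StorageClass if composite
```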

assemble(parent, components, parameters=None)

Assemble a compound InMemoryDataset.

Virtual class method: must be implemented by derived classes.

Parameters:
  • parent – An instance of the compound InMemoryDataset to be returned, or None. If no components are provided, this is the InMemoryDataset that will be returned.
  • components (dict) – A dictionary whose keys are a subset of the keys in the components class attribute and whose values are instances of the component InMemoryDataset type.
  • parameters (dict) – details TBD; may be used for slices of Datasets.
Returns:

an InMemoryDataset matching parent with components replaced by those in components.
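The contract above might be implemented as in this toy sketch, where the InMemoryDataset type and its component names (image, wcs) are illustrative assumptions rather than the real component set:

```python
class ToyExposure:
    """Stand-in InMemoryDataset with two named components."""
    def __init__(self, image=None, wcs=None):
        self.image = image
        self.wcs = wcs

class ToyExposureStorageClass:
    # Component StorageClasses elided; only the names matter here.
    components = {"image": None, "wcs": None}

    @classmethod
    def assemble(cls, parent, components):
        """Return parent with the given components replaced."""
        result = parent if parent is not None else ToyExposure()
        for name, value in components.items():
            if name not in cls.components:
                raise KeyError(f"unknown component: {name}")
            setattr(result, name, value)
        return result

exp = ToyExposureStorageClass.assemble(ToyExposure(image="pixels"),
                                       {"wcs": "wcs-obj"})
```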

SQL Representation

The DatasetType table holds StorageClass names in a varchar field. As a name is sufficient to retrieve the rest of the StorageClass definition in Python, the additional information is not duplicated in SQL.

Note

A need has been identified to have per-StorageClass tables that have a single row of metadata for each Dataset of that StorageClass, but details have not been worked out (including how to ensure those rows are populated when adding Datasets to the registry).

DatasetRef

An identifier for a Dataset that can be used across different Collections and Registries. A DatasetRef is effectively the combination of a DatasetType and a tuple of DataUnits.

Transition

The v14 Butler’s DataRef class played a similar role.

The DatasetLabel class also described here is more similar to the v14 Butler Data ID concept, though (like DatasetRef and DataRef, and unlike Data ID) it also holds a DatasetType name.

Python API

Warning

The Python representation of Datasets will likely change in the new preflight design. In particular, DatasetLabel and DatasetHandle will disappear and be subsumed into DatasetRef.

The DatasetRef class itself is the middle layer in a three-class hierarchy of objects that behave like pointers to Datasets.

digraph Dataset {
node[shape=record]
edge[dir=back, arrowtail=empty]

DatasetLabel;
DatasetRef;
DatasetHandle;

DatasetLabel -> DatasetRef;
DatasetRef -> DatasetHandle;
}

The ultimate base class and simplest of these, DatasetLabel, is entirely opaque to the user; its internal state is visible only to a Registry (with which it has some Python approximation to a C++ “friend” relationship). Unlike the other classes in the hierarchy, instances can be constructed directly from Python PODs, without access to a Registry (or Datastore). Like a DatasetRef, a DatasetLabel only fully identifies a Dataset when combined with a Collection, and can be used to represent Datasets before they have been written. Most interactive analysis code will interact primarily with DatasetLabels, as these provide the simplest, least-structured way to use the Butler interface.

The next class, DatasetRef itself, provides access to the associated DataUnit instances and the DatasetType. A DatasetRef instance cannot be constructed without complete DataUnits and a complete DatasetType, making it somewhat more cumbersome to use in interactive contexts. The SuperTask pattern hides those extra construction steps from both SuperTask authors and operators, however, and DatasetRef is the class SuperTask authors will use most.

Instances of the final class in the hierarchy, DatasetHandle, always correspond to Datasets that have already been stored in a Datastore. A DatasetHandle instance cannot be constructed without interacting directly with a Registry. In addition to the DataUnits and DatasetType exposed by DatasetRef, a DatasetHandle also provides access to its URI and component Datasets. The additional functionality provided by DatasetHandle is rarely needed unless one is interacting directly with a Registry or Datastore (instead of a Butler), but the DatasetRef instances that appear in SuperTask code may actually be DatasetHandle instances (in a language other than Python, this would have been handled as a DatasetRef pointer to a DatasetHandle, ensuring that the user sees only the DatasetRef interface, but Python has no such concept).

All three classes are immutable.

class DatasetLabel
__init__(self, name, **units)

Construct a DatasetLabel from the name of a DatasetType and keyword arguments that describe DataUnits, with DataUnit type names as keys and DataUnit “values” as values.

name

Name of the DatasetType associated with the Dataset.
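A usage sketch of the constructor above, with a minimal stand-in implementation (the DataUnit keyword names visit and sensor are illustrative; the real internal state is opaque and readable only by a Registry):

```python
class DatasetLabel:
    """Stand-in for the DatasetLabel described above."""
    def __init__(self, name, **units):
        self._name = name
        self._units = dict(units)   # opaque to users; read by a Registry

    @property
    def name(self):
        return self._name

# Plain Python values suffice; no Registry or Datastore is needed.
label = DatasetLabel("calexp", visit=12345, sensor="1_1")
```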

class DatasetRef(DatasetLabel)
__init__(self, type, units)

Construct a DatasetRef from a DatasetType and a complete tuple of DataUnits.

type

Read-only instance attribute.

The DatasetType associated with the Dataset the DatasetRef points to.

units

Read-only instance attribute.

A tuple of DataUnit instances that label the DatasetRef within a Collection.

makeStorageHint(run, template=None) → StorageHint

Construct the StorageHint part of a URI by filling in template with the Collection and the values in the units tuple.

The result is only a hint: the Datastore will likely have to deviate from the provided storageHint (in the case of an object store, for instance).

Although a Dataset may belong to multiple Collections, only the first Collection it is added to is used in its StorageHint.

Parameters:
  • run (Run) – the Run to which the new Dataset will be added; a Run always implies a Collection that can also be used in the template.
  • template (str) – a storageHint template to fill in. If None, the template attribute of type will be used.
Returns:

a str StorageHint
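Stripped of the surrounding classes, the core of makeStorageHint is str.format expansion, as in this minimal sketch (the template and the collection/visit/sensor names are assumptions):

```python
# Minimal sketch of makeStorageHint: fill a str.format template with the
# Run's Collection name and the Dataset's unit values.

def make_storage_hint(template, collection, **unit_values):
    return template.format(collection=collection, **unit_values)

hint = make_storage_hint("{collection}/calexp/v{visit}/s{sensor}.fits",
                         "run-2024", visit=12345, sensor="1_1")
```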

producer

The Quantum instance that produced (or will produce) the Dataset.

Read-only; update via Registry.addDataset(), QuantumGraph.addDataset(), or Butler.put().

May be None if no provenance information is available.

predictedConsumers

A sequence of Quantum instances that list this Dataset in their predictedInputs attributes.

Read-only; update via Quantum.addPredictedInput().

May be an empty list if no provenance information is available.

actualConsumers

A sequence of Quantum instances that list this Dataset in their actualInputs attributes.

Read-only; update via Registry.markInputUsed().

May be an empty list if no provenance information is available.

class DatasetHandle(DatasetRef)
uri

Read-only instance attribute.

The URI that holds the location of the Dataset in a Datastore.

components

Read-only instance attribute.

A dict holding DatasetHandle instances that correspond to this Dataset’s named components.

Empty (or None?) if the Dataset is not a composite.

run

Read-only instance attribute.

The Run the Dataset was created with.

SQL Representation

As discussed in the description of the Dataset SQL representation, the DataUnits in a DatasetRef are related to Datasets by a set of join tables. Each of these connects the Dataset table’s dataset_id to the primary key of a concrete DataUnit table.

InMemoryDataset

The in-memory manifestation of a Dataset.

Example: an afw.image.Exposure instance with the contents of a particular calexp.

Transition

The “python” and “persistable” entries in v14 Butler dataset policy files refer to Python and C++ InMemoryDataset types, respectively.

StorageHint

A storage hint provided to aid in constructing a URI.

Frequently (in e.g. filesystem-based Datastores) the storageHint will be used as the full filename within a Datastore, and hence each Dataset in a Registry must have a unique storageHint (even if they are in different Collections). This can only guarantee that storageHints are unique within a Datastore if a single Registry manages all writes to the Datastore. Having a single Registry responsible for writes to a Datastore (even if multiple Registries are permitted to read from it) is thus probably the easiest (but by no means the only) way to guarantee storageHint uniqueness in a filesystem-based Datastore.

StorageHints are generated from string templates, which are expanded using the DataUnits associated with a Dataset, its DatasetType name, and the Collection the Dataset was originally added to. Because a Dataset may ultimately be associated with multiple Collections, one cannot infer the storageHint for a Dataset that has already been added to a Registry from its template. That means it is impossible to reconstruct a URI from the template, even if a particular Datastore guarantees a relationship between storageHints and URIs. Instead, the original URI must be obtained by querying the Registry.

The actual URI used for storage is not required to respect the storageHint (e.g. for object stores).

Todo

Use Runs instead of Collections to define StorageHints.

Transition

The filled-in templates provided in Mapper policy files in the v14 Butler play the same role as the new StorageHint concept when writing Datasets. Mapper templates were also used in reading files in the v14 Butler, however, and StorageHints are not.

Python API

StorageHints are represented by simple Python strings.

SQL Representation

StorageHints do not appear in SQL at all, but the defaults for the templates that generate them are a field in the DatasetType table.

URI

A standard Uniform Resource Identifier pointing to a Dataset in a Datastore.

The Dataset pointed to may be primary or a component of a composite, but should always be serializable on its own. When supported by the Datastore the query part of the URI (i.e. the part behind the optional question mark) may be used for slices (e.g. a region in an image).
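As an illustration, a slice expressed in the query part of a URI could be parsed with the standard library; note that the "bbox" parameter name is an assumption, and whether query-based slicing is retained is an open question in this design:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical Dataset URI with a slice expressed as a query parameter.
uri = "s3://bucket/calexp/v12345/s1_1.fits?bbox=100,200,150,250"
parsed = urlparse(uri)
params = parse_qs(parsed.query)          # {"bbox": ["100,200,150,250"]}
bbox = [int(v) for v in params["bbox"][0].split(",")]
```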

Todo

Datastore.get also accepts parameters for slices; is the above still true?

Transition

No similar concept exists in the v14 Butler.

Python API

We can probably assume a URI will be represented as a simple string initially.

It may be useful to create a class type to enforce grammar and/or provide convenience operations in the future.

SQL Representation

URIs are stored as a field in the Dataset table.