Describing Datasets¶
Dataset¶
A Dataset is a discrete entity of stored data.
Datasets are uniquely identified by either a URI or the combination of a Collection and a DatasetRef.
Example: a “calexp” for a single visit and sensor produced by a processing run.
A Dataset may be composite, which means it contains one or more named component Datasets (for example, a “WCS” is a subcomponent of a “calexp”). Composites may be stored either by storing the parent in a single file or by storing the components separately. Some composites simply aggregate Datasets that are always written as part of other Datasets, and are themselves read-only.
Datasets may also be sliced, which yields an InMemoryDataset of the same type containing a smaller amount of data, defined by some parameters. Subimages and filters on catalogs are both considered slices.
Datasets may include metadata in their persisted form in a Datastore, but Registries never hold Dataset metadata directly - all metadata is instead associated with the DataUnits associated with a Dataset.
For example, metadata associated with an observation (e.g. zenith angle) would be associated with a Visit (or perhaps Snap or ObservedSensor) rather than a raw DatasetType. Because a raw Dataset is associated with those DataUnits, it is still associated with the metadata, but the association is indirect, and the metadata is also automatically associated with other Datasets that are associated with those units, such as calexp or src.
Transition¶
The Dataset concept has essentially the same meaning that it did in the v14 Butler.
A Dataset is analogous to an Open Provenance Model “artifact”.
Python API¶
The Python representation of a Dataset is in some sense an InMemoryDataset, and hence we have no Python “Dataset” class. However, we have several Python objects that act like pointers to Datasets. These are described in the Python API section for DatasetRef.
SQL Representation¶
Datasets are represented by records in a single table that includes every Dataset in a Registry, regardless of Collection or DatasetType:
Dataset¶
- Fields:
  - dataset_id int NOT NULL
  - registry_id int NOT NULL
  - dataset_type_name varchar NOT NULL
  - uri varchar
  - run_id int NOT NULL
  - producer_id int
  - unit_hash binary NOT NULL
- Primary Key:
  - (dataset_id, registry_id)
- Foreign Keys:
  - dataset_type_name references DatasetType (name)
  - (run_id, registry_id) references Run (run_id, registry_id)
  - (producer_id, registry_id) references Quantum (quantum_id, registry_id)
Using a single table (instead of per-DatasetType and/or per-Collection tables) ensures that table-creation permissions are not required when adding new DatasetTypes or Collections. It also makes it easier to store provenance by associating Datasets with Quanta.
The disadvantage of this approach is that the connections between Datasets and DataUnits must be stored in a set of additional join tables (one for each DataUnit table).
The connections are summarized by the unit_hash field, which contains a sha512 hash that is unique only within a Collection for a given DatasetType, constructed by hashing the values of the associated units. While a unit_hash value cannot be used to reconstruct a full DatasetRef, it can be used to quickly search for the Dataset matching a given DatasetRef. It also allows Registry.merge() to be implemented purely as a database operation, by using unit_hash as a GROUP BY column in a query over multiple Collections.
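As a rough illustration of how such a hash could be constructed (the exact ordering and encoding of the unit values are not specified here and are assumptions of this sketch):

```python
import hashlib

def compute_unit_hash(units):
    """Illustrative only: digest the values of a Dataset's associated
    DataUnits into the 64 raw bytes expected by a binary column.

    ``units`` is assumed to be a mapping from DataUnit type name to its
    value, e.g. {"Visit": 903334, "Sensor": "1_14"}.
    """
    message = hashlib.sha512()
    # Sort by unit type name so the digest does not depend on insertion order.
    for name in sorted(units):
        message.update(name.encode("utf-8"))
        message.update(str(units[name]).encode("utf-8"))
    return message.digest()

# Two refs with the same units always produce the same unit_hash value.
assert compute_unit_hash({"Visit": 903334, "Sensor": "1_14"}) == \
       compute_unit_hash({"Sensor": "1_14", "Visit": 903334})
```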
Dataset uses a compound primary key that combines an autoincrement dataset_id field, populated by the Registry in which the Dataset originated, and a registry_id that identifies that Registry. When a Dataset is transferred between Registries, its registry_id should be transferred without modification, allowing new Datasets in the receiving Registry to be assigned dataset_id values that were used (or may be used in the future) in the transferred-from Registry without any conflict.
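A tiny sketch of why the compound key matters during transfers; the table below is reduced to just the two key columns and the values are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Dataset (
        dataset_id INTEGER NOT NULL,
        registry_id INTEGER NOT NULL,
        PRIMARY KEY (dataset_id, registry_id)
    )
""")
# A Dataset transferred from Registry 1 keeps both of its key values ...
conn.execute("INSERT INTO Dataset VALUES (42, 1)")
# ... so the receiving Registry (registry_id=2) can still hand out
# dataset_id=42 to one of its own new Datasets without any conflict.
conn.execute("INSERT INTO Dataset VALUES (42, 2)")
```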
DatasetComposition¶
- Fields:
  - parent_dataset_id int NOT NULL
  - parent_registry_id int NOT NULL
  - component_dataset_id int NOT NULL
  - component_registry_id int NOT NULL
  - component_name varchar NOT NULL
- Primary Key:
  - (parent_dataset_id, parent_registry_id, component_dataset_id, component_registry_id)
- Foreign Keys:
  - (parent_dataset_id, parent_registry_id) references Dataset (dataset_id, registry_id)
  - (component_dataset_id, component_registry_id) references Dataset (dataset_id, registry_id)

A self-join table that links composite Datasets to their components.
- If a virtual Dataset was created by writing multiple component Datasets, the parent DatasetType’s template field and the parent Dataset’s uri field may be null (depending on whether there was also a parent Dataset stored whose components should be overridden).
- If a single Dataset was written and we’re defining virtual components, the component DatasetTypes should have null template fields, but the component Datasets will have non-null uri fields with values returned by the Datastore when Datastore.put() was called on the parent.
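For concreteness, a minimal sketch of how the components of a composite could be looked up through this self-join table, using sqlite3 and the column names above; the reduced schema, the sample rows, and the "calexp.wcs" DatasetType name are illustrative assumptions, not part of the design:

```python
import sqlite3

# Illustrative subset of the schema described above (types simplified for SQLite).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dataset (
        dataset_id INTEGER NOT NULL,
        registry_id INTEGER NOT NULL,
        dataset_type_name TEXT NOT NULL,
        uri TEXT,
        PRIMARY KEY (dataset_id, registry_id)
    );
    CREATE TABLE DatasetComposition (
        parent_dataset_id INTEGER NOT NULL,
        parent_registry_id INTEGER NOT NULL,
        component_dataset_id INTEGER NOT NULL,
        component_registry_id INTEGER NOT NULL,
        component_name TEXT NOT NULL,
        PRIMARY KEY (parent_dataset_id, parent_registry_id,
                     component_dataset_id, component_registry_id)
    );
""")

# A hypothetical "calexp" parent with a separately stored "wcs" component.
conn.execute("INSERT INTO Dataset VALUES (1, 1, 'calexp', 'calexp-903334-1_14.fits')")
conn.execute("INSERT INTO Dataset VALUES (2, 1, 'calexp.wcs', NULL)")
conn.execute("INSERT INTO DatasetComposition VALUES (1, 1, 2, 1, 'wcs')")

# Look up all components of the parent Dataset (1, 1) via the self-join table.
rows = conn.execute("""
    SELECT c.component_name, d.dataset_id, d.dataset_type_name
    FROM DatasetComposition AS c
    JOIN Dataset AS d
      ON d.dataset_id = c.component_dataset_id
     AND d.registry_id = c.component_registry_id
    WHERE c.parent_dataset_id = ? AND c.parent_registry_id = ?
""", (1, 1)).fetchall()
print(rows)  # [('wcs', 2, 'calexp.wcs')]
```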
DatasetType¶
A named category of Datasets that defines how they are organized, related, and stored.
In addition to a name, a DatasetType includes:
- a template string that can be used to construct a StorageHint (may be overridden);
- a tuple of DataUnit types that define the structure of DatasetRefs;
- a StorageClass that determines how Datasets are stored and composed.
Transition¶
The DatasetType concept has essentially the same meaning that it did in the v14 Butler.
Python API¶
class DatasetType¶

    A concrete, final class whose instances represent DatasetTypes.

    DatasetType instances may be constructed without a Registry, but they must be registered via Registry.registerDatasetType() before corresponding Datasets may be added. DatasetType instances are immutable.

    Note

    In the current design, DatasetTypes are not type objects, and the DatasetRef class is not an instance of DatasetType. We could make that the case with a lot of metaprogramming, but this adds a lot of complexity to the code with no obvious benefit. It seems most prudent to just rename the DatasetType concept and class to something that doesn’t imply a type-instance relationship in Python.

    __init__(name, template, units, storageClass)¶

        Public constructor. All arguments correspond directly to instance attributes.

    name¶

        Read-only instance attribute.

        A string name for the Dataset; must correspond to the same DatasetType across all Registries.

    template¶

        Read-only instance attribute.

        A string with str.format-style replacement patterns that can be used to create a StorageHint from a Run (and optionally its associated Collection) and a DatasetRef.

        May be None to indicate a read-only Dataset or one whose templates must be provided at a higher level.

    units¶

        Read-only instance attribute.

        A DataUnitTypeSet that defines the DatasetRefs corresponding to this DatasetType.

    storageClass¶

        Read-only instance attribute.

        A StorageClass subclass (not instance) that defines how this DatasetType is persisted.
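A minimal usage sketch under the constructor signature above; the namedtuple stand-in, the template string, the DataUnit type names, and the storage class value are illustrative assumptions rather than the real implementation:

```python
from collections import namedtuple

# Toy, immutable stand-in with the same attributes as the class described
# above; the real DatasetType lives in the Butler package.
DatasetType = namedtuple("DatasetType", ("name", "template", "units", "storageClass"))

calexpType = DatasetType(
    name="calexp",
    template="{collection}/calexp/{visit:06d}-{sensor}.fits",
    units=("Camera", "Visit", "Sensor"),   # a DataUnitTypeSet in the real design
    storageClass="Exposure",               # a StorageClass subclass in the real design
)

# Before any corresponding Datasets can be added, the DatasetType must be
# registered with a Registry:
#     registry.registerDatasetType(calexpType)
```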
SQL Representation¶
DatasetTypes are stored in a Registry using two tables. The first has a single record for each DatasetType and contains most of the information that defines it:
Todo
I’m a bit worried about relying on name being globally unique across Registries, but clashes should be very rare, and it might be good from a confusion-avoidance standpoint to force people to use new names when they mean something different.
DatasetType¶
- Fields:
  - dataset_type_name varchar NOT NULL
  - template varchar
  - storage_class varchar NOT NULL
- Primary Key:
  - name
- Foreign Keys:
  - None
The second table has a many-to-one relationship with the first and holds the names of the DataUnit types utilized by its DatasetRefs:
DatasetTypeUnits¶
- Fields:
  - dataset_type_name varchar NOT NULL
  - unit_name varchar NOT NULL
- Primary Key:
  - (dataset_type_name, unit_name)
- Foreign Keys:
  - (dataset_type_name) references DatasetType (name)
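A short sketch of how a single DatasetType could map onto these two tables, written as plain SQL via sqlite3; the SQL types are simplified and the "calexp" rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DatasetType (
        dataset_type_name TEXT NOT NULL PRIMARY KEY,
        template TEXT,
        storage_class TEXT NOT NULL
    );
    CREATE TABLE DatasetTypeUnits (
        dataset_type_name TEXT NOT NULL,
        unit_name TEXT NOT NULL,
        PRIMARY KEY (dataset_type_name, unit_name)
    );
""")

# One row describes the DatasetType itself ...
conn.execute(
    "INSERT INTO DatasetType VALUES (?, ?, ?)",
    ("calexp", "{collection}/calexp/{visit:06d}-{sensor}.fits", "Exposure"),
)
# ... and one row per DataUnit type records its unit tuple (the many-to-one
# relationship described above).
conn.executemany(
    "INSERT INTO DatasetTypeUnits VALUES (?, ?)",
    [("calexp", "Camera"), ("calexp", "Visit"), ("calexp", "Sensor")],
)
```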
StorageClass¶
A category of DatasetTypes that utilize the same in-memory classes for their InMemoryDatasets and can be saved to the same file format(s).
Transition¶
The allowed values for “storage” entries in v14 Butler policy files are analogous to StorageClasses.
Python API¶
class StorageClass¶

    An abstract base class whose subclasses are StorageClasses.

    subclasses¶

        Concrete class attribute: provided by the base class.

        A dictionary holding all StorageClass subclasses, keyed by their name attributes.

    name¶

        Virtual class attribute: must be provided by derived classes.

        A string name that uniquely identifies the derived class.

    components¶

        Virtual class attribute: must be provided by derived classes.

        A dictionary that maps component names to the StorageClass subclasses for those components. Should be empty (or None?) if the StorageClass is not a composite.

    assemble(parent, components)¶

        Assemble a compound InMemoryDataset.

        Virtual class method: must be implemented by derived classes.

        Parameters:
          - parent – An instance of the compound InMemoryDataset to be returned, or None. If no components are provided, this is the InMemoryDataset that will be returned.
          - components (dict) – A dictionary whose keys are a subset of the keys in the components class attribute and whose values are instances of the component InMemoryDataset type.
          - parameters (dict) – details TBD; may be used for slices of Datasets.

        Returns: an InMemoryDataset matching parent with components replaced by those in components.
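A hedged sketch of what a concrete StorageClass might look like under this interface; the automatic registration via __init_subclass__, the Exposure/Wcs class names, and the setWcs call are illustrative assumptions, not part of the design:

```python
class StorageClass:
    """Illustrative base class following the interface described above."""

    subclasses = {}   # name -> StorageClass subclass, maintained by the base class
    name = None       # virtual: must be provided by derived classes
    components = {}   # virtual: component name -> StorageClass subclass

    def __init_subclass__(cls, **kwargs):
        # Assumption: the base class registers subclasses automatically; the
        # text above only says the dictionary is provided by the base class.
        super().__init_subclass__(**kwargs)
        if cls.name is not None:
            StorageClass.subclasses[cls.name] = cls

    @classmethod
    def assemble(cls, parent, components, parameters=None):
        raise NotImplementedError


class Wcs(StorageClass):
    """Assumed non-composite StorageClass."""
    name = "Wcs"
    components = {}

    @classmethod
    def assemble(cls, parent, components, parameters=None):
        return parent


class Exposure(StorageClass):
    """Assumed composite StorageClass with a single 'wcs' component."""
    name = "Exposure"
    components = {"wcs": Wcs}

    @classmethod
    def assemble(cls, parent, components, parameters=None):
        # With no components provided, the parent is returned unchanged.
        if parent is not None and "wcs" in components:
            # setWcs stands in for however the in-memory Exposure class
            # actually replaces its WCS component.
            parent.setWcs(components["wcs"])
        return parent
```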
SQL Representation¶
The DatasetType table holds StorageClass names in a varchar field. As a name is sufficient to retrieve the rest of the StorageClass definition in Python, the additional information is not duplicated in SQL.
Note
A need has been identified to have per-StorageClass tables that have a single row of metadata for each Dataset of that StorageClass, but details have not been worked out (including how to ensure those rows are populated when adding Datasets to the registry).
DatasetRef¶
An identifier for a Dataset that can be used across different Collections and Registries. A DatasetRef is effectively the combination of a DatasetType and a tuple of DataUnits.
Transition¶
The v14 Butler’s DataRef class played a similar role.
The DatasetLabel class also described here is more similar to the v14 Butler Data ID concept, though (like DatasetRef and DataRef, and unlike Data ID) it also holds a DatasetType name.
Python API¶
Warning
The Python representation of Datasets will likely change in the new preflight design. In particular, DatasetLabel and DatasetHandle will disappear and be subsumed into DatasetRef.
The DatasetRef class itself is the middle layer in a three-class hierarchy of objects that behave like pointers to Datasets.

The ultimate base class and simplest of these, DatasetLabel, is entirely opaque to the user; its internal state is visible only to a Registry (with which it has some Python approximation to a C++ “friend” relationship). Unlike the other classes in the hierarchy, instances can be constructed directly from Python PODs, without access to a Registry (or Datastore). Like a DatasetRef, a DatasetLabel only fully identifies a Dataset when combined with a Collection, and can be used to represent Datasets before they have been written. Most interactive analysis code will interact primarily with DatasetLabels, as these provide the simplest, least-structured way to use the Butler interface.

The next class, DatasetRef itself, provides access to the associated DataUnit instances and the DatasetType. A DatasetRef instance cannot be constructed without complete DataUnits and a complete DatasetType, making it somewhat more cumbersome to use in interactive contexts. The SuperTask pattern hides those extra construction steps from both SuperTask authors and operators, however, and DatasetRef is the class SuperTask authors will use most.

Instances of the final class in the hierarchy, DatasetHandle, always correspond to Datasets that have already been stored in a Datastore. A DatasetHandle instance cannot be constructed without interacting directly with a Registry. In addition to the DataUnits and DatasetType exposed by DatasetRef, a DatasetHandle also provides access to its URI and component Datasets. The additional functionality provided by DatasetHandle is rarely needed unless one is interacting directly with a Registry or Datastore (instead of a Butler), but the DatasetRef instances that appear in SuperTask code may actually be DatasetHandle instances (in a language other than Python, this would have been handled as a DatasetRef pointer to a DatasetHandle, ensuring that the user sees only the DatasetRef interface, but Python has no such concept).
All three classes are immutable.
class DatasetLabel¶

    __init__(self, name, **units)¶

        Construct a DatasetLabel from the name of a DatasetType and keyword arguments that describe DataUnits, with DataUnit type names as keys and DataUnit “values” as values.

    name¶

        Name of the DatasetType associated with the Dataset.

class DatasetRef(DatasetLabel)¶

    __init__(self, type, units)¶

        Construct a DatasetRef from a DatasetType and a complete tuple of DataUnits.

    type¶

        Read-only instance attribute.

        The DatasetType associated with the Dataset the DatasetRef points to.

    units¶

        Read-only instance attribute.

        A tuple of DataUnit instances that label the DatasetRef within a Collection.

    makeStorageHint(run, template=None) → StorageHint¶

        Construct the StorageHint part of a URI by filling in template with the Collection and the values in the units tuple.

        This is often just a storage hint, since the Datastore will likely have to deviate from the provided storageHint (in the case of an object store, for instance).

        Although a Dataset may belong to multiple Collections, only the first Collection it is added to is used in its StorageHint.

        Returns: a str StorageHint

    producer¶

        The Quantum instance that produced (or will produce) the Dataset.

        Read-only; update via Registry.addDataset(), QuantumGraph.addDataset(), or Butler.put().

        May be None if no provenance information is available.

    predictedConsumers¶

        A sequence of Quantum instances that list this Dataset in their predictedInputs attributes.

        Read-only; update via Quantum.addPredictedInput().

        May be an empty list if no provenance information is available.

    actualConsumers¶

        A sequence of Quantum instances that list this Dataset in their actualInputs attributes.

        Read-only; update via Registry.markInputUsed().

        May be an empty list if no provenance information is available.
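A brief sketch of the two construction paths described above; the classes defined here are toy stand-ins so the example runs on its own, and every concrete name and value (calexp, visit, sensor, the SimpleNamespace placeholders) is an illustrative assumption:

```python
from types import SimpleNamespace

class DatasetLabel:
    """Toy stand-in for the opaque label class described above."""
    def __init__(self, name, **units):
        self.name = name        # DatasetType name
        self._units = units     # plain Python values; opaque to the user

class DatasetRef(DatasetLabel):
    """Toy stand-in: requires a complete DatasetType and DataUnits."""
    def __init__(self, type, units):
        super().__init__(type.name)
        self.type = type        # a complete DatasetType
        self.units = units      # complete DataUnit instances

# Interactive code can build a DatasetLabel from PODs with no Registry at all.
label = DatasetLabel("calexp", visit=903334, sensor="1_14")

# SuperTask-facing code sees DatasetRefs, whose construction needs the full
# DatasetType and DataUnit instances (normally supplied by the Registry and
# SuperTask machinery rather than written out by hand like this).
calexpType = SimpleNamespace(name="calexp")
visitUnit = SimpleNamespace(value=903334)
sensorUnit = SimpleNamespace(value="1_14")
ref = DatasetRef(calexpType, (visitUnit, sensorUnit))
```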
SQL Representation¶
As discussed in the description of the Dataset SQL representation, the DataUnits in a DatasetRef are related to Datasets by a set of join tables. Each of these connects the Dataset table’s dataset_id to the primary key of a concrete DataUnit table.
InMemoryDataset¶
The in-memory manifestation of a Dataset.

Example: an afw.image.Exposure instance with the contents of a particular calexp.
Transition¶
The “python” and “persistable” entries in v14 Butler dataset policy files refer to Python and C++ InMemoryDataset types, respectively.
StorageHint¶
A storage hint provided to aid in constructing a URI.
Frequently (e.g. in filesystem-based Datastores) the storageHint will be used as the full filename within a Datastore, and hence each Dataset in a Registry must have a unique storageHint (even if the Datasets are in different Collections). StorageHints can only be guaranteed to be unique within a Datastore if a single Registry manages all writes to that Datastore. Having a single Registry responsible for writes to a Datastore (even if multiple Registries are permitted to read from it) is thus probably the easiest (but by no means the only) way to guarantee storageHint uniqueness in a filesystem-based Datastore.
StorageHints are generated from string templates, which are expanded using the DataUnits associated with a Dataset, its DatasetType name, and the Collection the Dataset was originally added to. Because a Dataset may ultimately be associated with multiple Collections, one cannot infer the storageHint for a Dataset that has already been added to a Registry from its template. That means it is impossible to reconstruct a URI from the template, even if a particular Datastore guarantees a relationship between storageHints and URIs. Instead, the original URI must be obtained by querying the Registry.
The actual URI used for storage is not required to respect the storageHint (e.g. for object stores).
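A minimal sketch of the kind of template expansion described above, using str.format; the template string and the placeholder names (collection, datasetType, visit, sensor) are illustrative assumptions:

```python
# Hypothetical template, of the kind stored in the DatasetType table's
# template field.
template = "{collection}/{datasetType}/{visit:06d}-{sensor}.fits"

# Expanded with the Dataset's originating Collection, its DatasetType name,
# and the values of its associated DataUnits.
storageHint = template.format(
    collection="run-2024-01",
    datasetType="calexp",
    visit=903334,
    sensor="1_14",
)
# -> "run-2024-01/calexp/903334-1_14.fits"
```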
Todo
Use Runs instead of Collections to define StorageHints.
Transition¶
The filled-in templates provided in Mapper policy files in the v14 Butler play the same role as the new StorageHint concept when writing Datasets. Mapper templates were also used in reading files in the v14 Butler, however, and StorageHints are not.
Python API¶
StorageHints are represented by simple Python strings.
SQL Representation¶
StorageHints do not appear in SQL at all, but the default templates that generate them are stored in a field in the DatasetType table.
URI¶
A standard Uniform Resource Identifier pointing to a Dataset in a Datastore.
The Dataset pointed to may be primary or a component of a composite, but should always be serializable on its own. When supported by the Datastore, the query part of the URI (i.e. the part following the optional question mark) may be used for slices (e.g. a region in an image).
Todo
Datastore.get also accepts parameters for slices; is the above still true?
Transition¶
No similar concept exists in the v14 Butler.
Python API¶
We can probably assume a URI will be represented as a simple string initially.
It may be useful to create a class type to enforce grammar and/or provide convenience operations in the future.
SQL Representation¶
URIs are stored as a field in the Dataset table.