Granularity of datasets

Datasets can be stored and described by metadata in a variety of ways. Two examples: one big dataset with a single DOI and associated metadata, or a list of dataset files that are grouped together in a collection but each have a separate persistent identifier and metadata. What is needed to provide more clarity?


Thanks for following up on this. I addressed this issue a couple of years ago; see this Dataverse GitHub issue.

Currently, dataset-level DOI records are conflated with file-level DOI records in DataCite. This is unsatisfactory because, among other things, file metadata records proliferate in DataCite Search results and in ORCID record search results.

As a first step to mitigate this issue, I’d suggest introducing a new Resource Type General value for DOI records about files in a dataset. We might call this type Dataset file, Dataset part, or Part of Dataset. Records could then be filtered by this type in DataCite Search, Fabrica, and other (DataCite) web pages and services.

Thanks Philipp. One challenge I see is that data repositories are implementing this in different ways. For example:

  • one dataset (with one DOI) that holds everything
  • one dataset with multiple data files, where only the dataset has a DOI and DataCite metadata
  • one dataset with multiple data files, each with a DOI and metadata
  • multiple datasets (each with multiple data files), aggregated in one or more collections

My specific question is then whether the first item is a dataset or a dataset file.

An alternative way to distinguish (and filter out) dataset files might be to look at IsPartOf/HasPart in the metadata.

It helps to identify the use cases we want to address. I'll start:

  • in DataCite Search, return only datasets and not all their associated files, to reduce noise
  • in an ORCID record, include only the dataset and not its associated files, to keep the focus on the overall record
  • aggregate all citations, views, and downloads by dataset, even when they are associated with individual dataset files
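
To make the third use case concrete, here is a minimal Python sketch of rolling file-level metrics up to the parent dataset. It assumes each record is a plain dict with a `doi`, optional `citations`, `views`, and `downloads` counts, and a `relatedIdentifiers` list shaped like DataCite's JSON (`relationType`, `relatedIdentifierType`, `relatedIdentifier`). The metric field names, the helper functions, and the DOIs are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def parent_doi(record):
    """Return the DOI this record is part of, or its own DOI if it has no parent."""
    for rel in record.get("relatedIdentifiers", []):
        if (rel.get("relationType") == "IsPartOf"
                and rel.get("relatedIdentifierType") == "DOI"):
            return rel["relatedIdentifier"].lower()
    return record["doi"].lower()

def aggregate_by_dataset(records):
    """Sum citations, views and downloads per dataset-level DOI."""
    totals = defaultdict(lambda: {"citations": 0, "views": 0, "downloads": 0})
    for record in records:
        bucket = totals[parent_doi(record)]
        for metric in bucket:
            bucket[metric] += record.get(metric, 0)
    return dict(totals)

# Hypothetical records: one dataset and one file that is part of it.
records = [
    {"doi": "10.1234/dataset", "citations": 2, "views": 10, "downloads": 1},
    {"doi": "10.1234/dataset/file1", "citations": 1, "views": 5, "downloads": 7,
     "relatedIdentifiers": [{"relationType": "IsPartOf",
                             "relatedIdentifierType": "DOI",
                             "relatedIdentifier": "10.1234/dataset"}]},
]
totals = aggregate_by_dataset(records)
```

This only follows one level of IsPartOf; a dataset that is itself part of a collection would need a decision about which level to aggregate at.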

As far as I can see from the metadata records in DataCite Fabrica, filtering on IsPartOf in the Related Identifiers > Relation Type field may work; see, e.g., this screenshot of a DataCite file-level DOI record from a Dataverse repository:

And should we do this automatically in DataCite Search? I am fine with that, but only after we have implemented solid navigation from the dataset to the dataset files, similar to how this is implemented in Dataverse.

For ORCID claiming via auto-update, we have long excluded all DataCite DOIs that have an IsIdenticalTo, IsPartOf, or IsVersionOf relationship.
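
As a rough illustration of that exclusion rule (not the actual auto-update code), a filter could look like the Python sketch below. It assumes records are dicts with a `relatedIdentifiers` list in the shape of DataCite's JSON; the function name `is_claimable` and the sample DOIs are hypothetical.

```python
# Relation types that mark a DOI as a copy, a part, or a version of another record.
EXCLUDED_RELATION_TYPES = {"IsIdenticalTo", "IsPartOf", "IsVersionOf"}

def is_claimable(record, excluded=EXCLUDED_RELATION_TYPES):
    """Return True if the record has none of the excluded relation types."""
    return not any(
        rel.get("relationType") in excluded
        for rel in record.get("relatedIdentifiers", [])
    )

# Hypothetical records: a dataset that lists its parts, and one of its files.
dataset = {
    "doi": "10.1234/dataset",
    "relatedIdentifiers": [{"relationType": "HasPart",
                            "relatedIdentifier": "10.1234/dataset/file1"}],
}
data_file = {
    "doi": "10.1234/dataset/file1",
    "relatedIdentifiers": [{"relationType": "IsPartOf",
                            "relatedIdentifier": "10.1234/dataset"}],
}
claimable = [r["doi"] for r in (dataset, data_file) if is_claimable(r)]
```

The same predicate could in principle be reused for filtering search results, since the dataset-level record with HasPart passes and the file-level record with IsPartOf does not.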


@Philipp In schema.org (and in DCAT, where this comes from), we have the concepts of distribution and DataDownload, which I think align nicely with what you are proposing.
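
For readers less familiar with that model, here is a small Python sketch that builds schema.org JSON-LD for one dataset whose files are expressed as `distribution` entries of type `DataDownload` rather than as separate top-level records. The helper name, the input shape (`name`, `url`, `format`), and all DOIs and URLs are hypothetical.

```python
import json

def dataset_jsonld(doi, name, files):
    """Build a schema.org Dataset with one DataDownload per file."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": f"https://doi.org/{doi}",
        "name": name,
        "distribution": [
            {
                "@type": "DataDownload",
                "name": file_info["name"],
                "contentUrl": file_info["url"],
                "encodingFormat": file_info["format"],
            }
            for file_info in files
        ],
    }

# Hypothetical dataset with two files.
doc = dataset_jsonld(
    "10.1234/dataset",
    "Example dataset",
    [
        {"name": "data.csv", "url": "https://example.org/data.csv", "format": "text/csv"},
        {"name": "readme.txt", "url": "https://example.org/readme.txt", "format": "text/plain"},
    ],
)
print(json.dumps(doc, indent=2))
```

In this shape the files are properties of the dataset, so a consumer that lists datasets never sees them as separate results.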