Billions of records

GBIF.org has 1.3B individual records coming from tens of thousands of datasets. Each dataset has a DOI and we currently track citations at that level, which works well. It would be useful to offer a consistent record-level, handle-based identifier so that people can link to records consistently, e.g. to cite records individually in a paper or to provide annotation services.

There are several ways I foresee we could achieve this with some initial pro/con commentary:

  1. Templated DOIs. Pros: cheap, easy, scalable for DataCite. Cons: can only track citations at dataset level.
  2. DataCite DOI for each specimen. Pros: easy for everyone. Cons: lots of unnecessary DOIs created, scalability challenge and bottleneck in DataCite.
  3. GBIF become a DOI authority (or other handle authority). Pros: raise visibility of GBIF, infrastructure already built with mature metadata standards. Cons: Misalignment with DataCite, not portable to other domains, additional governance for GBIF (currently a DataCite member, but not a DOI Foundation member) and costs involved.
  4. Mint DOIs on demand. Pros: Only the records needed are created - less wasteful. Cons: burden for users, no consistent resolution for records achieved, anyone looking to link programmatically then needs to mint first rather than just link.

I am interested in wider discussion around this topic, both for GBIF (biodiversity observation / specimen records) and with wider communities who may have sophisticated systems in place for dealing with large quantities of data and are looking to connect with DataCite.

Tim, thanks a lot for posting this here. This discussion started in this GitHub issue more than a year ago on the specific issue of template handles (related to option 1).

The issue you describe is a common one, although few organizations will have the number of records that need identifiers that go into the billions.

The DataCite view on this issue is as follows:

  1. Template handles work fine to generate large numbers of handle records cheaply, and they work similarly to what Identifiers.org is doing, generating redirect rules based on regular expressions.
  2. Template handles don’t work well for DOIs (including DataCite DOIs), as DOIs use central infrastructure to store metadata in addition to providing DOI resolution (forwarding to a URL) via the handle system.
  3. GBIF becoming a DOI registration agency would be an alternative in terms of governance, but comes with the same technical challenges.
  4. DOI registration on demand is the proposed solution of the RDA Data Citation Working Group for evolving data, and aligns well with how DataCite DOI registration works.

My primary question would be about the data maturity model, or what Treloar et al. call the data curation continuum:

  1. What kind of metadata do you want to register for datasets, and for records?
  2. Are datasets and records persistent, or can records and their metadata also be deleted?
  3. As DOIs are mainly citation identifiers linking datasets to citations, funding, authors, etc., are there similar use cases for records?

One solution would be to use handles for records and DOIs for datasets, and to provide either no metadata for records in a centralized system or only a very limited set of automatically generated provenance information. This would be somewhat similar to how the life sciences manage large databases with machine-generated datasets, e.g. the European Nucleotide Archive, except that they don’t use handles but identifiers.org. The datasets can use DOIs, have more metadata, link to other resources (e.g. citations) and link to individual records using their handle identifier. As this is an important use case, DataCite would help with this, and I am particularly interested in how to include automated machine-initiated data downloads in this workflow.

This makes good sense and could be relatively easily achieved (we hold that metadata already).

Great. Happy to help discuss the details, here or elsewhere.

What is a template(d) DOI/Handle? I see this term used by both Tim and Martin but haven’t come across it before.

It is equivalent to an anchor on a webpage.

See http://www.doi.org/doi_handbook/5_Applications.html#5.8 and Page 58, “11 Template Handles” on
http://www.handle.net/tech_manual/HN_Tech_Manual_8.pdf#h.1opuj5n

Edited to add: DataCite don’t support them, and we believe no DOI agency does.

I would describe template handles as handles that are not registered individually, but generated automatically using a regular expression. This is much easier to implement for very large numbers of identifiers, but does not support some important functionalities. It is for example very hard to generate a list of all identifiers, so that you can for example check whether any one of them doesn’t resolve properly, or for a systematic analysis of metadata.

[Identifiers.org](https://identifiers.org) doesn’t use handles, but has taken a very similar approach with namespaces and regular expressions. DataCite DOIs don’t support template handles, as metadata have to be registered individually for each DOI, and several functionalities and services of the DataCite system, e.g. automatic link checking, depend on registering each DOI individually in the DOI system.
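
To make the idea concrete, here is a minimal Python sketch of template-style resolution: a single regular expression stands in for every identifier under a prefix. It is not how the Handle System or Identifiers.org is actually implemented; the prefix `20.500.12345`, the `example.org` target and the function name `resolve` are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical template: one pattern covers every record handle under a prefix.
# Prefix and target URL are made up for illustration only.
TEMPLATE = re.compile(r"^20\.500\.12345/occurrence/(\d+)$")
TARGET = "https://example.org/occurrence/{record_id}"


def resolve(handle: str) -> Optional[str]:
    """Map a handle to a URL using the template, or return None if it does not match."""
    match = TEMPLATE.match(handle)
    if match is None:
        return None
    return TARGET.format(record_id=match.group(1))


# Any numeric suffix resolves, whether or not a record with that number exists.
# Nothing is registered individually, so there is no list of identifiers
# to enumerate, link-check or analyse.
print(resolve("20.500.12345/occurrence/42"))
print(resolve("20.500.12345/dataset/42"))
```

The second call returns `None` because the handle does not match the template. The first call succeeds for any number at all, which is exactly why listing or checking "all" identifiers is hard with this approach.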

I’ve just deployed a test server that allows GBIF records to be resolved using handles. This demonstrates that the 1.3 billion records in GBIF can be issued resolvable handles easily without significant new infrastructure, and that handles can be minted at a few tens of thousands per second when new data comes in.

As an example, you can enter 20.500.12472/occurrence/2147528474 into the CNRI handle resolver on http://hdl.handle.net/ and it will resolve to the GBIF page. Similarly, you can use doi.org as a resolver: http://doi.org/20.500.12472/occurrence/2147528474.

In this test I do not do any content negotiation, so it will only resolve to the HTML (no JSON etc.), but this was really just a test. What format the identifier should take (e.g. a GBIF integer ID or a more native version of a specimen identifier such as a CETAF one) remains open, but all options are easily possible already with GBIF infrastructure.

This can also be accessed through an API directly or using a root server such as CNRI.
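
For anyone who wants to try this programmatically, here is a minimal Python sketch that builds the two URLs for a handle on the hdl.handle.net proxy: the plain redirect URL and the proxy's JSON REST endpoint under `/api/handles/`. The function names are my own, and since the test server may be offline, the network call is kept in a separate function that is not invoked here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PROXY = "https://hdl.handle.net"  # public handle proxy


def redirect_url(handle: str, proxy: str = PROXY) -> str:
    """URL that redirects a browser to the handle's target page."""
    return f"{proxy}/{quote(handle, safe='/')}"


def api_url(handle: str, proxy: str = PROXY) -> str:
    """URL of the proxy's JSON REST endpoint for the handle record."""
    return f"{proxy}/api/handles/{quote(handle, safe='/')}"


def fetch_record(handle: str) -> dict:
    """Fetch the raw handle record as JSON (requires network access)."""
    with urlopen(api_url(handle), timeout=10) as response:
        return json.load(response)


handle = "20.500.12472/occurrence/2147528474"
print(redirect_url(handle))
print(api_url(handle))
```

Calling `fetch_record(handle)` should return the stored handle values while the test server is up; the exact fields depend on what was registered for the handle.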

Being a test server this may go down, be restarted, go offline at any point and without notice. Please contact me if you need any help in this case (trobertson@gbif.org).

Thanks go to Jane and Robert of CNRI for answering my newbie questions.

Very nice! Two small comments:

  • while http://doi.org/20.500.12472/occurrence/2147528474 resolves, it is very confusing to users, and may not work in the future.
  • consider embedding the metadata as schema.org JSON-LD if the extra effort is reasonable. That is a low effort approach to distribute metadata that scales well - main challenge is the sitemaps file for such a large number of identifiers.
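
As a rough illustration of the JSON-LD suggestion, here is a minimal Python sketch that builds a schema.org description for one record and wraps it in the `<script>` tag a landing page would embed. The choice of `Dataset` as the type for a single record, the property selection, the record name and the dataset DOI `10.1234/example` are all assumptions for illustration, not an agreed GBIF or DataCite mapping.

```python
import json


def record_jsonld(handle: str, name: str, dataset_doi: str) -> dict:
    """Build a small schema.org description for one record (illustrative mapping)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",  # assumption: the best type for a single record is open
        "@id": f"https://hdl.handle.net/{handle}",
        "identifier": f"hdl:{handle}",
        "name": name,
        # Link the record to its parent dataset, which carries the DOI.
        "isPartOf": {"@type": "Dataset", "@id": f"https://doi.org/{dataset_doi}"},
    }


def script_tag(doc: dict) -> str:
    """Serialize the description for embedding in an HTML page."""
    # Escape "</" so the JSON cannot close the script element early.
    payload = json.dumps(doc, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


doc = record_jsonld(
    "20.500.12472/occurrence/2147528474",
    "Occurrence 2147528474",  # placeholder name
    "10.1234/example",  # placeholder dataset DOI
)
print(script_tag(doc))
```

Since each page carries its own metadata, nothing has to be registered centrally per record; the sitemap for billions of pages remains the hard part.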