eBank project
Workpackage Number: 6
Workpackage Title: Supporting studies
Study 2

This study will consider the description of (multimedia) datasets within the chemistry domain, particularly with regard to the perceived hierarchy of data and metadata from raw data up to “published results”. A variety of issues will be addressed, including identifying common attributes of a dataset and relating these to domain-specific characteristics, managing legacy data, metadata created at source by laboratory equipment and the relationship to data curation activities. The CombeChem project will be used as a case study and metadata from three sources (e-Lab book, crystallography data and physical chemistry data) will be investigated. Outcomes of the study will be a report and a draft schema for describing chemistry datasets.

Feasibility Report on Data Set Description and Schema

The eBank project has produced a prototype demonstrator of a service based on Eprints software to provide access to the detailed results of scientific experiments in chemistry, and in particular crystallography. To present this complex data in a retrievable and meaningful way requires that it is described through metadata and appropriate metadata schema that allow the information to be harvested and re-published by other services through alternative interfaces. The challenge faced by the project is the complexity and volume of data that are to be made accessible from the principal points in the network dissemination chain - institutional archives, aggregators, service providers, portals, and prospectively other data providers such as publishers and digital libraries. In both respects, the design of the metadata containers and schema is critical, and is perhaps the key contribution of the project to date. This report describes the metadata and schema adopted, and shows how the records so-described are presented in the demonstrators. The advantages and limitations of the approach are briefly evaluated with a view to generalising the schema for the presentation of experimental data from other science disciplines through other service providers.
A journal publication describing the results of scientific work is typically a distillation of experimental data. This description is aimed at a wider audience than the immediate peers of the authors, so placing the work in its primary context and reducing the data to the most significant results is critical in making the work more widely known. Those immediate peers, however, may require access to more of the original data produced in the work, for example to verify reproducibility or to build on those data. Modern science can produce large volumes of data as computational tools enable experiments to be performed more frequently and more efficiently. [MH's quote about increased productivity] As long as publication has been detached from the means of production and format of these data, managing and providing access to full experimental data has not been simple. Journals, especially those based on print formats, do not have the space for such data. In crystallography just 300,000 crystal structures are documented in database archives, against an estimated 1.5 million known structures: less than 20% of data generated in crystallographic work is reaching the public domain due to publication bottlenecks.

The task is now assisted by the emergence of electronic networks. Experimental data are produced electronically, so are immediately amenable to network distribution. What needs to be done is to mark up the data so that they can be discovered and made available to both machine and human readers. This is the process of creating metadata. While the Internet and the World Wide Web offer standard protocols for distribution, now being supplemented for the type of scientific data sources described here by e-science and grid technologies, particular subjects require specialised metadata and means of discovery. Dublin Core (DC) is a metadata standard that has emerged to describe the 'core', or essential, elements of a bibliographic record, say of an item that might be found in an academic library. A mechanism designed to improve discovery of such records is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). If the library is considered to be the example data provider, the OAI-PMH allows independent data services to 'harvest' the DC records into a database and enable these records to be searched among records from other selected data providers. Cross-searching techniques generally send specific search requests in parallel to different sources (by some specified protocol) and combine the various responses into a result for the cross-search. In contrast, search services built on harvested metadata carry out local searches on the pre-harvested metadata. DC and OAI provide a minimum level of interoperability between data providers and diverse service providers.
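
The harvesting model can be illustrated with a short sketch. This is not the eBank implementation: the base URL and the sample record are invented for illustration, and the response is an inline fragment rather than the result of a live request. Only the `verb` and `metadataPrefix` request parameters and the XML namespaces are taken from OAI-PMH 2.0 and Dublin Core. The sketch builds a ListRecords request, extracts DC titles from a response, and then searches the harvested records locally, as a service provider built on harvested metadata would.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical data provider endpoint, for illustration only.
BASE_URL = "http://archive.example.org/oai"

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Invented response fragment standing in for what a provider might return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure report</dc:title>
          <dc:creator>A. Researcher</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract the identifier and DC titles of each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        titles = [t.text for t in rec.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    return records


def search_local(records, term):
    """Search the pre-harvested records locally; no remote request is made."""
    term = term.lower()
    return [r for r in records if any(term in t.lower() for t in r["titles"])]


harvested = parse_records(SAMPLE_RESPONSE)
matches = search_local(harvested, "crystal")
```

A cross-searching service would instead forward each query to every source at search time; here the only contact with the data provider is the harvest itself.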

Basic DC does not describe specialised subject terms, but can be extended by means of 'qualified' DC. In this case a schema is devised to describe the extended terms. If a schema definition, an XML document describing the schema terms, is linked from a DC record, then readers can make sense of the extended terminology.

In crystallography there are a number of different ways to describe the subject of the data sets. Experiments revolve around a single molecule which can be thought of as the ‘topic’ of the experiments. There are a number of established ways of identifying molecules, which include internationally recognised methods of specifying their formulas or names. These different vocabularies have been incorporated into the schema through the encoding schemes facility of qualified DC.
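
As a rough illustration of how an encoding scheme can qualify a DC element, the sketch below attaches an `xsi:type` attribute to `dc:subject`, which is the convention used in DCMI's XML guidelines. The `ex` namespace, the scheme names `EmpiricalFormula` and `ChemicalName`, and the bare `record` wrapper element are hypothetical stand-ins and are not the terms defined in the eBank schema.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
EX = "http://example.org/terms/"  # hypothetical namespace for the encoding schemes

ET.register_namespace("dc", DC)
ET.register_namespace("xsi", XSI)
ET.register_namespace("ex", EX)


def add_subject(parent, value, scheme):
    """Add a dc:subject whose xsi:type names the vocabulary the value comes from."""
    el = ET.SubElement(parent, f"{{{DC}}}subject")
    # Wrapping the value in QName makes the serialiser declare the 'ex' prefix.
    el.set(f"{{{XSI}}}type", ET.QName(EX, scheme))
    el.text = value
    return el


def subjects_by_scheme(xml_text):
    """Read a record back and map each encoding scheme to its value."""
    root = ET.fromstring(xml_text)
    return {
        el.get(f"{{{XSI}}}type"): el.text
        for el in root.iter(f"{{{DC}}}subject")
    }


# The same molecule identified through two different vocabularies.
record = ET.Element("record")  # hypothetical wrapper element
add_subject(record, "C6H6", "EmpiricalFormula")
add_subject(record, "benzene", "ChemicalName")

xml_text = ET.tostring(record, encoding="unicode")
schemes = subjects_by_scheme(xml_text)
```

A harvester that does not recognise a scheme can still treat the value as a plain subject string, which is what makes this kind of extension compatible with basic DC.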

This report describes the schema implemented by the eBank project to disseminate e-data reports describing crystallographic structures (eCrystallographyDataReports). Schemas are not new, but what this report aims to do is justify the design of the schema in terms of underlying experimental processes and the network of services it is designed to serve. Some of the services envisaged are new and not widely known in the chemistry community; others are better known and offer standards-based services but are not well connected within the network of services; then there are those that do not yet exist in practice, so are replicated experimentally by the project. The schema is thus pivotal in making connections between current and anticipated network services, and the report assesses the feasibility of the initial implementation in terms of its use by the respective services, with recommendations for future improvements.

In the eBank project the actual and prospective service partners include:

The underlying science process; crystallography workflow

A data archive has been created at the University of Southampton, built on Eprints, OAI-compliant software that was designed to manage author deposit of papers. The user interface has been adapted to manage the deposit of data sets produced by crystallographers from the National Crystallography Service (NCS) at Southampton. During the deposit process, metadata about the data sets is entered or generated automatically. Since there are inherent relationships between data sets, the metadata is designed to reflect the scheme of the experimental procedure, outlined in Figure 1. In the case of crystallography, datasets are related by sequence since they are generated (by measurement or analysis) from a series of sequential stages in the experimental process.

Figure 1. Generalised workflow for crystallography experiments

Data sets do not need to be stored at a single location such as the Eprints archive at NCS. Because the archives expose DC metadata through OAI-PMH, data sets stored at different locations can be presented to users as though they came from a single 'virtual' archive, depending on the OAI service provider used.

Metadata schema

'Local' institutional archive

The metadata schema resulting from analysis of the workflow (Figure 2) captures the files generated during the course of the experiment. Each of the files is stored in the host, or local, archive for access by users, mediated by a single record, the e-data report. This report links to the individual data files as well as other relevant sources, such as eprints, and possibly external structure databases, and presents an interactive visualisation of the derived structure (Figure 3).

Figure 2. Representation of the crystallography experiment schema, indicating all the files generated during the course of the experiment.

Figure 3. eCrystallographyDataReport presented to a user via the adapted Eprints archive interface

Distribution to aggregators and portals

To enhance the visibility of archived data sets, e-data reports can be harvested by independent service providers such as aggregators and portals. In this project there are two demonstrator services: the eBank UK aggregator service at the University of Bath, in effect a specialist aggregator of e-data reports, and PSIgate, the physical sciences hub of the JISC Resource Discovery Network, which offers search results in a broader science context.

Only the e-data reports need to be harvested, rather than the full data sets, as the reports link to the constituent data files in the original archive. For this purpose the e-data report is represented by a DC schema designed for dissemination via an OAI interface. Figure 4 shows the schema elements presented to the OAI interface for the exchange of eBank data between data provider and service provider. Explanations of the elements and how they map to user requirements are given in the Appendix.

Figure 4. Schema elements for eCrystallographyDataReports presented to the OAI interface within a data archive (data provider)

The search interface presented by the eBank UK demo is shown in Figure 5a. A similar search interface offered by PSIgate is shown in Figure 5b. The PSIgate search uses an RDN-include type mechanism: search requests run scripts on the eBank UK server. Although a stylesheet is used to reformat the data, the portal has no control over what data are passed across. Service providers can re-present records, such as the one shown in Figure 3, ideally supplemented with additional information, for example links to other relevant sources such as published papers and library holdings, or other information on which the provider holds data.

Figure 5. eBank demo search interfaces: a, presented through the eBank UK; b, from PSIgate

E-data reports are represented as records in an XML format, defined and constrained by the adopted schema. At its core an eBank record conforms to the schema described above, although additional 'layers' can be added as entry points for more service providers. For example, while the current eBank data might be harvested by specialist crystallography services, more general providers of digital library services might require additional information to be able to handle such data. An eCrystallographyDataReport might not be commonly encountered by a digital library OAI harvester, which therefore needs additional information to understand its contents.

More particularly, with increasingly complex digital objects becoming available for harvesting, such as objects with multiple components and multiple metadata components, 'containers' are needed to transport not just the core data but the additional components too. The Metadata Encoding and Transmission Standard (METS) is such a container and provides an XML document format for encoding metadata necessary for both management of digital library objects within a repository and exchange of such objects between repositories. METS recognises that describing digital objects requires an increasingly complex series of metadata descriptions: administrative, structural and technical metadata, for example. Other proposed schemas for describing complex objects include MPEG-21 Digital Item Declaration Language (DIDL) and content packaging standards from e-learning organisations such as the IMS Global Learning Consortium.
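
The container idea can be sketched as follows. This is a generic, minimal METS document and not the eBank OAI-METS export: the title and file URLs are invented, and only the element and attribute names (`dmdSec`, `mdWrap`, `xmlData`, `fileSec`, `fileGrp`, `file`, `FLocat`, `structMap`, `div`, `fptr`) come from the METS schema. It wraps a DC description and points to the component files by URL rather than embedding them.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC = "http://purl.org/dc/elements/1.1/"
XLINK = "http://www.w3.org/1999/xlink"

ET.register_namespace("mets", METS)
ET.register_namespace("dc", DC)
ET.register_namespace("xlink", XLINK)


def build_mets(title, file_urls):
    """Wrap a DC title and a list of component file URLs in a METS document."""
    mets = ET.Element(f"{{{METS}}}mets")

    # Descriptive metadata: a Dublin Core description wrapped inside METS.
    dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="dmd1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    ET.SubElement(xml_data, f"{{{DC}}}title").text = title

    # File section lists the components; the structural map ties them together.
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp")
    struct = ET.SubElement(mets, f"{{{METS}}}structMap")
    div = ET.SubElement(struct, f"{{{METS}}}div", DMDID="dmd1")

    for n, url in enumerate(file_urls, start=1):
        file_id = f"file{n}"
        file_el = ET.SubElement(group, f"{{{METS}}}file", ID=file_id)
        # Each file is referenced by URL, so the data stay in the source archive.
        ET.SubElement(
            file_el,
            f"{{{METS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK}}}href": url},
        )
        ET.SubElement(div, f"{{{METS}}}fptr", FILEID=file_id)
    return mets


# Hypothetical report with two component files.
doc = build_mets(
    "Example report",
    [
        "http://archive.example.org/1/raw.dat",
        "http://archive.example.org/1/result.dat",
    ],
)
xml_text = ET.tostring(doc, encoding="unicode")
```

Administrative and technical metadata would go in further sections (`amdSec`) of the same document, which is where the added complexity noted above comes from.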

The example eBank record shown in Figure 6a includes a METS layer. This approach can produce complex and specialised records, however. One effort to reduce, simplify and generalise an e-data report record to an OAI-DC format is shown in Figure 6b. A schematic view of metadata exchanged in the eBank UK project using OAI-PMH with METS wrapper elements is shown in Figure 7.

This record contains declarations linking to the eBank XML schema definitions (.xsd). Two .xsd documents have been created for the eBank demo, defining

The latter includes the key definitions. The terms used will be replaced by official types from the bodies concerned, e.g. IUPAC, CCDC, when such types become available.

Assessment

The eBank project was funded by JISC for one year from September 2003. In that time eBank has demonstrated how new infrastructure can be built on existing and emerging services to integrate and disseminate new sources of data, in this case research data generated directly from experimental equipment. Concerns have been expressed, however, at the limited range of application of the current schema, which applies only to crystallography.

A metadata schema should not be considered stable for release until it has been tested against harvesting requirements and shown to support a search interface that meets user requirements. Within the limited confines of the project infrastructure, it has been shown that data can be produced and structured for effective dissemination from the data producer to a local archive for storage and then on to aggregator and discovery services.

However, the records produced for the project are effectively placeholders, not intended for wider usage beyond the project. To avoid inadvertent harvesting of these records by real service providers, the OAI interface has not been made public and usage has been monitored to prevent unauthorised harvesting.

The schema is complicated and specialised. The eBank UK aggregator, set up specifically to support the project, has harvested a small sample of records from a single source. The OAI-METS export described above is only intended to convey data between the archive and the experimental service provider and is not recommended for use after the project. The PSIgate search interface has not been integrated with the portal's other services. If a longer-term, larger-scale view had been possible, it would have been interesting to investigate whether a more general harvesting or re-harvesting approach would have benefits compared with a server-include mechanism, since relations between data and service providers are unlikely to be as well specified as between the partners in this case. So far there has been no test harvesting by other services such as CCDC and IUCr. None of this has been systematically tested against user requirements.

Creating more generic scientific schema is the next goal. The schema must be designed with expansion and interoperability in mind. This is technically possible but requires, as in the case investigated, an intimate understanding of the underlying experimental processes that are to be represented, and active involvement in the different science communities to agree standards for the respective schema.

The services that can be built on aggregated metadata can only be as good as the metadata that is available to them. To achieve interoperability between data set and publication metadata, there must be some consensus on the data models and the metadata schemas being used to exchange data. Furthermore, to support discovery between data sets from different communities, there must be some agreed commonality in the models and schemas if cross-disciplinary services are to be built. For example, efforts are ongoing in the chemistry community to agree on namespaces for these vocabularies. Until standard recommendations for these namespaces emerge, the eBank namespace designations can be substituted temporarily.

The principal strengths and weaknesses of the current eBank approach, as revealed by its application of a data set description and schema, and future requirements, are listed below.

Strengths

Weaknesses

Future work

In terms of alternative data models, within the UK research council communities the Council for the Central Laboratory of the Research Councils (CCLRC) has developed a data model that attempts to describe the relationship between experiments, investigators, data holdings, data sets, data files, logical and physical locations.

Future plans include working closely with IUCr and CCDC to integrate the eBank approach into chemistry-related publications so that it becomes the globally accepted route for publishing crystal structures. Initial discussions with chemistry publishers such as the American Chemical Society (ACS) and Taylor and Francis, a learned society and commercial publisher respectively, indicate that the eBank open access, OAI-based approach to accessing crystal structures is one solution to the current publication bottleneck problem.

Conclusion

E-data reports convey more information to users than was previously possible by journal papers alone, but such sources are not isolated entities in the scholarly communication chain. Increasingly they will be seen as part of the continuum that network technologies make possible. E-data reports of all sorts are set to benefit from the emergence of network services based on grid and e-science technologies, and from the growth of open access based on institutional archives. Combined, these services can deliver the kind of data volumes required together with the means for search, discovery and access, all linked to associated data, such as the refereed papers, that are outputs of the same experimental work. The eBank project has shown that a framework built on standard metadata components will make it feasible for different services in this chain to cooperate. The initial implementation is limited to crystallography, but more generic applications are anticipated.

eBank is not just about chemistry, or even crystallography, although these disciplines provide a very good exemplar. It is about how you structure e-data reports, but mostly it is about how you use this structure to make these data accessible from the principal points in the network dissemination chain - institutional archives, aggregators, service providers, portals, and prospectively other data providers such as CCDC, publishers and digital libraries. In a world that will soon be awash with e-science data, that is what is distinctive about eBank.

Acknowledgements

The following people have contributed to the work described in this report:
(Southampton) Leslie Carr, Simon Coles, Jeremy Frey, Chris Gutteridge, Steve Hitchcock, Mike Hursthouse; (UKOLN, Bath) Michael Day, Monica Duke, Rachel Heery, Liz Lyon, Andy Powell; (PSIgate, Manchester) John Blunden-Ellis, Paul Meehan

Links

Appendix

To include
Elements in the eBank Schema – Version 3 (modified 21st September 2004), Duke, Heery
Part 1