Paper for Digital Libraries 2000

San Antonio, Texas
Submission deadline, 1 December

Version history of this paper

This paper may be revised further up to the final copy date above
 

Developing services for open eprint archives: globalisation, integration and the impact of links

Introduction

The process of scholarly communication, in particular the aggregation of formal academic papers in journals, is probably about to enter its fastest period of change since 1994-6. The immediate platform for that earlier change was the emergence of the World Wide Web as a popular medium in 1994 and the subsequent conversion of most established journals to electronic facsimiles delivered over the Web, a process which began to build momentum in 1996. While growth in the number of e-journals continues to accelerate towards an estimated 10,000 (ref. McLennan, Hunter), it is earlier developments, such as the establishment of free electronic archives, or eprint archives, that are now beginning to influence wider changes.

The importance of the Los Alamos physics eprint archive, launched in 1991 and the first and preeminent archive of its kind (ref. Ginsparg), is widely recognised, but its practical ramifications have so far been largely confined to its home community in physics. Many have questioned, because of cultural differences between academic disciplines (ref. Kling), whether the e-print model will be accepted beyond physics. That contention will be challenged by the most significant new e-print projects to have emerged since 1991: PubMed Central, launched at the beginning of this year and sponsored by the NIH, covering all fields in biomedical and life sciences (ref. Varmus); and the Computing Research Repository (CoRR), sponsored by the ACM and the BCS (Halpern). An immediate effect of PubMed Central has been the announcement by some biomedical publishers, notably the British Medical Journal and Electronic Press, of freely accessible archives of electronic copies of current and past papers, both pre- and post-publication. As well as discipline-based archives, institutionally based initiatives such as Scholars Forum are planned.

The essence of the e-print archive model, as established by the Los Alamos archive, is a free-to-archive, free-to-access service. Compared with journals, e-print archives provide a largely automated and highly efficient organisational framework and distribution mechanism based on the Internet, but without many of the additional services that journals provide, such as peer review and other services that mostly require human intervention. As more archives attract more papers, the test for journals is how they respond to the effective loss of the exclusivity that many depend on, and how they cope with the new economics of journal publishing in which every facet of traditional value-adding is re-evaluated against the for-free services.

There is another dimension to consider too: globalisation. With no geographical or financial barriers the next inevitable step is, if not universalisation where archives all adopt the same technical infrastructure, then integration of the archives based on new services. As part of this agenda archives are not only freely accessible to users but are becoming open to computational processes on which these services will be built. This approach is being formalised by the Open Archives initiative, a group that includes archive managers, potential service developers and academic librarians (UPS press release).

It is widely expected that links on citations within scholarly papers will be one of the primary new services driving integration between archives and other scholarly sources. Linking services cannot be implemented piecemeal on such a scale, and a number of organisations recognise this (NISO). A large group of journal publishers has announced a collaboration to use the Digital Object Identifier (DOI) system to form citation links between their separately maintained journal contents. Most traditional journal service providers - aggregators, subscription agents (Blackwell) and secondary publishers (Ovid, Ebsco) - as well as new producers (Highwire) highlight the important role of citation links in their services.

This paper considers the implications of the new wave of eprint archives and the development of open archives. Its main focus is an example of an open archive service being developed by the Open Citation (OpCit) project, funded by the joint NSF-JISC international digital libraries programme, that will initially build citation links uniting large, high-profile and distributed archives but which is planned to be extensible to other services that provide access to scholarly papers.

The real role of eprint archives

When is an eprint not an eprint? When it's a preprint, reprint or a working paper. These are not merely pedantic distinctions, for the words have real meanings and the differences highlight many of the misconceptions and misunderstandings that hinder the development of freely accessible, permanent and complete archives in many disciplines.

Since, for economic reasons as much as utility, such archives are invariably electronic, papers deposited in them are reasonably described as 'eprints', electronic forms of papers that otherwise conform to the conventions and structures of documents that were in the past printed but which now may or may not ever appear in printed form.

The term preprint has long been used to describe a document circulated in some, possibly draft, form to a limited audience prior to its confirmation by publication, in final form, typically appearing in a printed journal. Preprints have had a long and useful history, especially in physics where they paved the way for the success of electronic archives, but the term persists today with a more mendacious use, effectively denoting a document waiting to be printed, or at least to be published in some authoritative, refereed form. Used by publishers the term implies the impermanence of the pre-published version, the uprooting of a document from its original location, say in an electronic archive, to its final and exclusive published state.

Another long-used publishing term, reprint, has been appropriated to suggest the opposite process, where the author of an eprint or preprint replaces it, or at least makes available alongside it, the formally published version of the paper. Multiple copies and versions of a document may exist, but ideally all versions of the text if not the presentation can be traced and accessed from one original source.

This is in conflict with the copyright terms offered, but not always imposed, by many journal publishers which do not permit multiple versions and which have led to another previously well understood term, the working paper, being contrived to suggest additional meaning. The working paper is typically an unfinished work, warning the reader that they should be careful in interpreting the contents. The subtler meaning is that the work is 'unpublished', i.e. it has not appeared in a formal, peer reviewed source, even though a version of the paper may have been notified and publicly-accessible on an electronic service for some time.

The largest example of this is the various archives of working papers in economics, otherwise exemplary services such as the EconWPA archive and the collected WoPEc archive pointer service. The aim is to satisfy the eligibility requirements commonly imposed by journals that submitted works are previously unpublished. The strategy is partially successful with publishers such as Elsevier agreeing that working papers can be made available on the Internet but using the ultimately untenable distinction of the 'working paper', created by the archives themselves, to refuse permission for final versions to be posted in the same way. The muddle is highlighted by the hypothetical question:
Q "May I claim that the finished version accepted for publication is a working paper and make it available on the Internet?"
A "No."
Thus these archives are forcing authors into the position where they effectively agree that the working paper posted for public comment is a version that will be voluntarily given up on publication. There is surely little future for archives that require their authors to mortgage their contents in a way that ensures certain repossession.

In the electronic environment it becomes clear that the fundamental requirements of archiving, permanence and accessibility, which are as important to authors as to formal archivists, are less a problem of rapidly changing technology, as is popularly perceived, than of our ability to manage and organise the underlying information in a way that is independent of commercial interests. The Los Alamos physics eprint archive has endured for nearly a decade without any of its contents becoming unavailable through technological change. Admittedly, this is not a timescale to impress archivists of its ultimate longevity, but it has been achieved at a time when the rate of technological change is perceived to be exponential (Negroponte).

True eprint archives are not repositories of unfinished works, of early versions of papers that are only updated in less accessible places elsewhere - the 'grey literature' as it is commonly and mistakenly referred to. In true eprint archives the grey is more likely to be decomposed into its two distinct components, black and white, representing both the early drafts supplemented and/or replaced by the final published versions. By holding to these distinctions eprint archives will not only grow in number but in acceptability and respectability within their target communities, the researchers and authors who will post to and use these archives.

Eprint services: separating content from services

Eprint archives are noticeably entering a new phase. Not only have significant new archives opened, but new services are being developed to complement the traditional content management functions of the archives. The key feature is that many of these new services will be independent of the underlying archive contents. There are a number of reasons for this. Growth in the number of archives and greater use create opportunities for third-party developers, and such developments thus do not impact on the low-cost base of most archives and can be developed either commercially or non-commercially.

More importantly, as the use of archives crosses new boundaries between disciplines, the role of new services will be to enable the user to view an integrated, complete set, or selected subset, of all archives, with user customisation features to set the scope of a service likely to become increasingly common. For the archives predicated on free and open access it is recognised that capitalising on this feature means not constraining users within the boundaries of a single discipline or archive but supporting cross-disciplinary navigation. The aim is that navigation support such as indexing, searching and linking should provide a consistent interface and seamless service regardless of which archives are accessed by the user, who need have no knowledge of the structure of an archive or, apart from perhaps noticing the archive source and publisher identities, from where a viewed document originated.

Integration of free e-print archives through independent services is currently the preferred view of the way in which such archives may pave the way for a unified global scholarly literature that will encompass not just the archives but journals and all other contributing literatures. (UPS)

The way in which this level of service will be achieved will be by supporting interoperability between the archives, for example the protocols through which services can communicate with the archives and metadata forms that will expose the data structures and the contents of archives. Methods will be open and published to encourage widespread adoption, but also to ensure conformity of safe and trusted practices that do not compromise the integrity of the archives.

A well established protocol for data communication in digital library applications is Dienst, developed at Cornell University. An implementation of part of the Dienst protocol has been adopted by the Open Archives initiative for data harvesting. The OpCit project, in which the Dienst research group is one of the principal partners, is similarly building on this core technology for its linking applications.
 

Linking services and information environments

Linking is an apparently simple concept, especially the model implemented by the Web in which a unidirectional point-and-click event presents the user with the page from the location pointed at by a locator, the URL, that is authored into the linking page. This simplified form of hypertext linking, it has long been argued in the hypertext community of researchers, is inadequate to support robust services required for large-scale information environments, such as might be represented by the contents of scholarly communications and organised via digital libraries.

Such environments continue to be part of the Web but are distinguished by services that apply to contents that are deemed to be within them, determined not by physical location but by the nature of those contents. Examples of these environments might include the contents of libraries as in the UK eLib funded DNER, single-publisher collections such as the ACM Digital Library, larger collections of published journal papers accessed via services based on DOIs or SLinkS, or distributed archives accessed via NCSTRL (see Figure 1). In essence, in these environments the Web is transformed from a document delivery service into a dynamic, computational framework.

In terms of linking services, an early pioneer and predecessor of the OpCit project was the Open Journal project (Hitchcock et al. DLib 1998). The information environment envisaged in that project centred on contributed journal contents but was essentially unbounded, allowing links to take users to other types of materials, such as abstracting and indexing services, dictionaries and biological databases. In practice different implementations of Open Journals had to be bounded, principally to manage the user interface more effectively.

Originally supported by four publishers, by its close 12 publishers were involved. It is safe to assume that many more scholarly publishers now recognise the importance and power of links in electronic documents, especially links which implement the long-established, non-electronic form of linking inherent to the scholarly or scientific paper, the reference.

With this wider participation has come the recognition that linking may not be as simple as originally envisaged (Caplan and Arms). There are a number of reasons for this. Expectations are high, of links from almost every reference in every electronically-accessible paper, both backwards in time as provided by a typical reference list in a paper, and forwards in time in a manner made familiar by the Science Citation Indexes (Hitchcock and ISI). Constraining these expectations are accuracy, reliability and availability of the necessary contents in electronic form (Hitchcock and Quek). There may be multiple, but not identical, versions of the same document at multiple locations (Flecker and Caplan). Finally, there are the financial and authorisation barriers imposed by commercial journals and services. Access requirements can differ for each user, for each access location, for each document and for each document location. Competing publishers may be becoming allies to support cross-publisher reference linking (Wired news), but although there are various possible solutions to the problem of linking across distributed collections, there is as yet no convincing demonstration or detail of how this might be achieved in this exclusively commercial environment (Atkins).
 

Linking the archives: the OpCit project

In this context linking is more than just a technical process and must be viewed as part of the social and business phenomena that are shaping the new information environments. This is recognised in the scope and partnerships that form the OpCit project. Ironically, in the first environments the project will explore (selected open archives), so much content is freely accessible that the project could almost reduce the problem to its technical issues, but the longer-term commitment is to collaborating with others in the wider environments shown in Figure 1.

Figure 1. Information environments: defining OpCit and beyond

Three principal objectives elaborated by the project (see original proposal) concern scale-compatibility-universality:

Primary partners in the project are Paul Ginsparg from Los Alamos, Southampton University with its expertise in linking applications, also home of the Cogprints e-print archive for cognitive sciences and a mirror site for the Los Alamos archives, and Cornell University which is a major player in building NCSTRL based on Dienst. In related work the Cornell group will apply the linking technology used in OpCit to the ACM Digital Library.

When the Open Journal project began in 1995 the linking tools used then were all developed in-house. Now there are a range of tools created to suit different application requirements:

An early objective is to examine the structure of these tools at a software level and elaborate how they might be used to work together, in technical terms exposing the respective application programming interfaces (APIs). By publishing these details it is hoped to establish a generic foundation for emerging digital library linking applications, recognising that these research tools will constantly change and evolve and that new tools will be developed. This is a novel approach in that linking applications of all types have typically been tool based, and many such tools, although often highly functional, have tended to be used independently.
 

OpCit: early implementations and results

The process of adding citation links to documents retrieved from an archive involves parsing the document during download to identify and read citations. The data are compared with a precompiled link or citation database, and a link to the cited work added where an exact match is found. For more detail, one method for doing this was described by Hitchcock et al. 1997.
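
The lookup step described above can be sketched as follows. This is a minimal illustration and not the project's implementation: the database contents, the journal name, the archive identifier, the base URL and the function names are all hypothetical, and the sketch assumes the citation fields have already been extracted from the reference text.

```python
from html import escape

# Hypothetical precompiled link database:
# (journal, volume, start page) -> archive identifier. All values are placeholders.
LINK_DB = {
    ("j. example phys.", "12", "345"): "hep-th/0000001",
}

# Placeholder base URL for the archive's abstract pages.
ARCHIVE_URL = "https://archive.example.org/abs/"


def citation_key(journal, volume, page):
    """Normalise citation fields so that trivial spacing and case
    differences do not prevent an exact match."""
    return (" ".join(journal.lower().split()), str(volume), str(page))


def link_reference(text, journal, volume, page):
    """Return the reference text wrapped in a link if the citation is
    found in the database, otherwise the (escaped) text unchanged."""
    archive_id = LINK_DB.get(citation_key(journal, volume, page))
    if archive_id is None:
        return escape(text)
    return '<a href="{}{}">{}</a>'.format(ARCHIVE_URL, archive_id, escape(text))
```

Only exact matches on the normalised key produce a link, which mirrors the conservative rule stated above; fuzzy matching would link more citations at the cost of some wrong links.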

In a very preliminary implementation of this model in OpCit, in which one of the objectives was to integrate services provided by some of the linking tools described above, a successfully linked citation directs the user to an intermediate page offering a choice: either download the cited full text from the archive or look up some contextual information on the citation. In this example the links in the original document are added by the DLS, and the intermediate page is produced from an SFX database which maintains some knowledge of the user's privileges and can offer all versions of the cited paper that are accessible to the user. In this case only the archive version is available, but in principle, if the user or a library subscribes to the journal in which the cited paper was published, that version could be linked too, as could other versions of the paper, in abstracting services for example. The different stages of retrieval are shown in Figures 2-4. Contextual information is retrieved from a database compiled by Citeseer (not shown here, but another application of this service can be tested at   ).
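
The idea of an intermediate page that offers only the versions a given user can reach might be sketched like this. It is an illustration under stated assumptions and does not reproduce the SFX interface: the Version record, the menu_for function, the subscription names and the URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One available copy of a cited paper (hypothetical record layout)."""
    label: str
    url: str
    requires: str = ""  # empty means freely accessible; otherwise a subscription name


def menu_for(versions, subscriptions):
    """Return the versions this user can access: free copies plus any
    copy whose required subscription the user (or their library) holds."""
    return [v for v in versions if not v.requires or v.requires in subscriptions]


# Example: one free archive copy and one subscription-only journal copy.
versions = [
    Version("Archive copy", "https://archive.example.org/abs/0000001"),
    Version("Journal copy", "https://journal.example.org/12/345",
            requires="J. Example Phys."),
]
```

A user with no subscriptions would be offered only the archive copy, while a user whose library subscribes to the (hypothetical) journal would see both entries.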

Figure 2.  Reference section of article hep-th/9907001 with added links indicated by coloured boxes

Figure 3. Activating a link returns an SFX page offering the user a choice of sources

Figure 4. Choosing "Download from Universal Preprint Archive" produces the master page from the xxx archive
 

Citation link analysis

Working initially just within the Los Alamos physics eprint archive, citations were analysed from a subset of papers from one section of the archive, hep-th (theoretical high-energy physics), submitted during 1999 to the end of October, a total of 2170 papers. These papers contained over 65 000 citations.

Citations in physics are notoriously terse. Typically they include author names, and then either an archive identifier number or a standard abbreviation for the journal title followed by some undifferentiated numbers, usually volume number, start page number and the year of publication (in brackets). Sometimes both the archive ID and journal data are included.
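
As an illustration of what recognising such a citation involves, the sketch below extracts an archive identifier and a journal reference from a single citation string. The patterns are assumptions made for this example: they expect the identifier in the form section/seven digits (as in hep-th/9907001) and a journal reference laid out as abbreviation, volume, year in brackets, start page. The journal name in the usage is invented, and real citations vary far more than this.

```python
import re

# Assumed identifier form: section name, slash, seven digits (e.g. hep-th/9907001).
ARCHIVE_ID = re.compile(r"\b([a-z]+(?:-[a-z]+)?/\d{7})\b")

# Assumed journal layout: "Abbrev. Title VOLUME (YEAR) PAGE".
JOURNAL_REF = re.compile(
    r"(?P<journal>[A-Z][A-Za-z. ]*?)\s*"
    r"(?P<volume>[A-Z]?\d+)\s*"
    r"\((?P<year>\d{4})\)\s*"
    r"(?P<page>\d+)"
)


def parse_citation(text):
    """Return whichever of the archive ID and journal fields can be
    recognised in the citation; an empty dict means nothing was recognised."""
    result = {}
    id_match = ARCHIVE_ID.search(text)
    if id_match:
        result["archive_id"] = id_match.group(1)
    ref_match = JOURNAL_REF.search(text)
    if ref_match:
        result["journal"] = ref_match.group("journal").strip()
        result["volume"] = ref_match.group("volume")
        result["year"] = int(ref_match.group("year"))
        result["page"] = ref_match.group("page")
    return result
```

Because the numbers in a citation are undifferentiated, the pattern has to rely on their position and on the brackets around the year, which is why badly formatted citations are hard to recognise.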

For our subset of documents the relative success in recognising and resolving to the corresponding document in the XXX bibliographic database can be gauged from Figure 5. The colours in this chart correspond to the link colours in Figure 2. Some links were simply derived from explicitly cited IDs for the archive documents, others were derived purely from the bibliographic data in the citation. Where XXX archive IDs are not included directly, they can alternatively be derived from bibliographic data in the XXX journal-ref metadata or from other more intensively-maintained, overlapping bibliographic databases in physics such as SPIRES. (SPIRES is maintained by the Stanford Linear Accelerator Center (SLAC), another partner in the OpCit project.)

Figure 5. Resolving and linking citations in a subset of hep-th papers: what could be linked, what could not and why (where known). For an example of a single reference list showing these results see Figure 2

Where resolution of reference data against the database, and therefore linking, was unsuccessful there could be a number of reasons, and the relative occurrence of these problems is also shown in Figure 5. In some cases a citation was correctly recognised, but the reference is too old to feature in XXX, or a citation was recognised but not found in the database (the cited paper is not in the archive). The remaining citations could not be recognised, possibly due to poor formatting, incorrect data, etc. It can be seen that just over half of the references were successfully linked within the archive. The number of successful links could be increased significantly if the archives were supplemented with other, older sources, online archival journals say. About 16 per cent of citations, the unrecognised citations, may never be resolvable.
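
The breakdown into outcomes can be expressed as a simple classification and tally. The sketch below is illustrative only: the record layout, the function names and the use of the archive's 1991 launch year as the cut-off for "too old" are assumptions, and it does not reproduce the figures reported here.

```python
from collections import Counter

# Assumed cut-off: citations to work before the archive's launch year
# are treated as too old to feature in it.
ARCHIVE_START_YEAR = 1991


def classify(citation, known_keys):
    """Assign one outcome to a citation. `citation` is None when the
    reference could not be recognised at all, otherwise a dict with a
    lookup 'key' and a publication 'year' (hypothetical layout)."""
    if citation is None:
        return "unrecognised"
    if citation["key"] in known_keys:
        return "linked"
    if citation["year"] < ARCHIVE_START_YEAR:
        return "too old"
    return "not in archive"


def tally(citations, known_keys):
    """Return the percentage of citations falling into each outcome."""
    counts = Counter(classify(c, known_keys) for c in citations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: 100.0 * n / total for status, n in counts.items()}
```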

These percentages cannot be generalised to other applications, although it is interesting to compare broad measures of success in citation linking such as this example (52 per cent of citations linked) with that reported by Electronic Press (Hitchcock and Quek), which suggested an upper limit of 60 per cent of linked citations in a large biomedical archive and database. Factors that control this figure include the size and accessibility of the archive and other document sources, and the accuracy, quality and completeness of the reference data.
 

OpCit design and evaluation

There is another way to tackle potentially unresolvable citations for new and future submissions to the archive: at source, when the papers are deposited by authors. As well as developing linking services, important and interesting tasks for the project are citation analysis, interface design, and user testing and evaluation. Designing user interfaces for the linked archives concerns not just navigation but author deposit too. It is argued that a barrier to wider participation in eprint archives beyond physics is the need for more user-friendly procedures for authors. Given that the content of archives is unedited and unmoderated, the responsibility for the quality and accuracy of a given paper lies wholly with its authors, so the challenge is to make the deposit process both easier and more intuitive but with a greater degree of responsiveness and immediate feedback to encourage high standards of correctness and completeness.

Citation linking obviously depends strongly on the accuracy of data provided by authors and so improving correctness in this area is of particular interest to the project. One idea is to provide dynamic checking for references in newly submitted papers, using the same process as used to produce citation links, and inviting the author to respond. In this case correct, linkable citations would appear as standard links, but other citations could be highlighted by different link shades or colours, say, indicating possible problems with a citation, suggesting a reason for the problem and highlighting which ones the author could usefully amend. Such a scheme might look like that shown in Figures 2 and 5 (although the model implementation was not designed for this purpose).

While each of the components developed by the project will be subjected to user evaluation by standard means, in citation analysis there is an inherent means of evaluating common practice and usage of the archive by its most important constituency: its authors. For example, explicitly citing either archive IDs or journal sources or both in citations

Since the project is experimenting with stored data from the archive rather than real live data there are other effects that can be used to monitor user behaviour. The stored data is a snapshot of the archive contents at that time and using simple difference computation can be compared with later datasets to reveal the extent of changes, minor or major. Of particular interest is the proportion of pre-publication papers in the archive that are replaced by the final published version. It is only the authors of a paper that can change or update it in the archive, so there is no standard procedure for replacing pre- with post-publication copies.
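
The difference computation between two snapshots could be as simple as the following sketch. The snapshot representation (a mapping from paper identifier to some stored form of the paper, such as its text or a checksum), the function name and the identifiers in the usage are assumptions for illustration.

```python
def diff_snapshots(old, new):
    """Compare two snapshots of an archive, each a dict mapping a paper
    identifier to its stored content (or a checksum of it), and report
    which papers were added, removed or changed between them."""
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "changed": sorted(i for i in old_ids & new_ids if old[i] != new[i]),
    }


# Example with placeholder identifiers and contents.
earlier = {"paper-1": "draft text", "paper-2": "draft text"}
later = {"paper-1": "final text", "paper-2": "draft text", "paper-3": "draft text"}
changes = diff_snapshots(earlier, later)
```

A comparison of this kind shows only that a paper changed; deciding whether the change is a replacement by the final published version would need further evidence, such as the appearance of journal reference data.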

Conclusion

It is common for the Internet to be viewed as a paradigm shift (ref. First Mon?). With major new free-to-use e-print archives joining the well established Los Alamos archives, and with progress towards integration and globalisation of these archives motivated by the Open Archives initiative, the paradigm shift beckons for scholarly communication. These developments bring closer the critical and long heralded (Dyson) transition promised by the Internet in which primary value accrues to services rather than just to content. For any such transition the main issue is access. The archives provide access to content, and projects such as OpCit and others are demonstrating how that can be exploited and new value added through new services.

For commercial scholarly publishing, which by its nature imposes financial barriers to access, the picture is less clear. It is ironic that as some major publishers agree to explore possible mechanisms for managing links between their journals, they are failing to respond to the issue of improving access. For example, users can access primary journal papers via abstracting, indexing and aggregation services now transformed into electronic forms and sporting new brand names but mostly supported by familiar corporate interests. A typical result was described by Morrow (1999):

"Sites who have signed up for the ScienceDirect trial through BIDS now have access through four different routes ...
* Selecting the Elsevier option for the BIDS route to ScienceDirect does NOT preclude access to ScienceDirect via Web of Science through MIMAS. Even after July 2000, Web of Science users (with the appropriate linking licence) at sites who have selected the Elsevier licensing option will continue to be able to access ScienceDirect material (in exactly the same way as those who sign a NESLI/Swets agreement)."
Individually these are perfectly good services that are legitimately exploring new solutions at a critical moment of change. The question this raises, however, is whether it helps users. This is not a paradigm shift, but a muddle that results from self-preservation, driven by service providers rather than users. It arises from a failure to recognise the shift in value from content to services. Some publishers that maintain close links with the research community, typically professional societies, have been required to pay close attention to the development of the archives. They understand the distinction and have acted (e.g. BMJ, EP), or are prepared to act (Doyle, SeptForum), to free their electronic archives.

This paper has shown the real paradigm shift in scholarly communication, towards more open and accessible information. This benefits linking services, but many other digital library services can benefit too. We are entering a new phase of convergence in digital library services driven not by conspiracy or anti-commercial prejudice, but by embracing pragmatic, widely shared interests throughout the scholarly community.
 

References

Morrow, Terry (1999) BIDS services now linked to Elsevier's ScienceDirect. Posted on lis-scitech@mailbase.ac.uk, 27 Sep

Lawrence et al. (1999) Digital Libraries and Autonomous Citation Indexing. IEEE Computer
Hitchcock et al. (1997) Citation Linking: Improving Access to Online Journals. Second ACM International Conference on Digital Libraries
Van de Sompel and Hochstenbach (1999) Reference Linking in a Hybrid Library Environment, Part 2: SFX, a Generic Linking Solution. D-Lib Magazine
Hellman (1999) Scholarly Link Specification Framework