Open Citation Linking: the way forward

A paper being prepared for publication in D-Lib Magazine, October 2002

Steve Hitchcock, Donna Bergmark*, Tim Brody, Chris Gutteridge, Les Carr, Wendy Hall, Carl Lagoze* and Stevan Harnad

IAM Group, Department of Electronics and Computer Science, University of Southampton, SO17 1BJ, United Kingdom
* Digital Library Research Group, Department of Computer Science, Cornell University, Ithaca, NY 14853-7501, USA
Contact for correspondence: Steve Hitchcock sh94r@ecs.soton.ac.uk

This paper is produced by the Open Citation project, funded by the Joint NSF - JISC International Digital Libraries Research Programme. It is based on a presentation, available as Powerpoint slides, given to the JISC/NSF Digital Libraries Initiative (DLI) All Projects Meeting in Edinburgh during June 2002.

Abstract

The speed of scientific communication – the rate of ideas affecting other researchers' ideas – is increasing dramatically. The factor driving this is free, unrestricted access to research papers. Measurements of user activity in mature eprint archives of research papers such as arXiv have shown, for the first time, the degree to which such services support an evolving network of texts commenting on, citing, classifying, abstracting, listing and revising other texts. The Open Citation project has built tools to measure this activity and to build new archives, and has been closely involved with the development of the infrastructure to support open access on which these new services depend. This is the story of the project, intertwined with the concurrent emergence of the Open Archives Initiative (OAI). The paper describes the broad scope of the project's work, showing how it has progressed from early demonstrators of reference linking to produce Citebase, a Web-based citation and impact-ranked search service, and supported the development of EPrints.org software to build OAI-compliant archives. The work has been underpinned by analysis and experiments on the semantics of documents (digital objects) to determine the features required for formally perfect linking - instantiated as an application programming interface (API) for reference linking - that will enable other applications to build on this work in broader digital library information environments.

Introduction: Exploiting Open Access

Imagine, as a researcher, the prospect of free, instant access, at any time, anywhere, to all peer reviewed papers and data that might affect your work. How much better would that be than the present situation? It's achievable through the process of author self-archiving in Open Archives. The power of this idea is permeating the scholarly publishing establishment. More libraries are beginning to host Open Archives to present research papers produced by their institutions (Crow 2002). Progressive publishers are providing free online versions of journals, sometimes before, sometimes after, formal publication; new business models for open access journals are at last emerging. Even those that remain unconvinced by open access recognise the move to electronic publication must be accompanied by improved access. Publishers are collaborating as never before, among themselves and with digital libraries, to support new electronic services such as reference linking and mediated access based on powerful databases and new systems of identifiers and rights management. It's a serious business. In fact, only one group in the scholarly communication chain isn't yet wholly embracing open access: authors (Pinfield 2002).

Curious, because authors stand to gain most in the switch to open access. Some fear damaging prestigious peer reviewed journals, but as is already apparent, journals are getting better because open access and self-archiving do not exclude other forms of publication and, focused by competition, journals will enhance their core values.

Authors are well aware of the potential benefits of open access, but how can they be persuaded to act in pursuit of these benefits? The key requirements that scholarly authors demand of publication are visibility and impact. The key to impact is the ability to measure citations.

The Open Citation Project grew out of an early demonstration of tools to add links, post-authoring, to references contained in scholarly papers in Web-deliverable formats. The basic idea was to extend the application to very large numbers of papers freely available on the Web. Linking on that scale would require automatic recognition and collection of references contained in these papers. If the references are stored in a database, it is possible to do more than link references: for a given paper, the number of times it has been referenced can be determined, and from this emerges the ability to measure impact.

There is nothing new in this, except that impact has always been associated with journals, and has typically been measured by expensive secondary services. Could it be possible that papers freely available on the Web might also have a measurable impact? And might this measurement be provided by a service that, like the papers it acts on, is free and could give authors (and research assessment agencies) an instant indication of the impact of their papers (Harnad 2001)?

This is the story of the Open Citation Project, intertwined with the concurrent emergence of the Open Archives Initiative (OAI), which has become a focal point for open access to metadata describing all sorts of digital objects held by libraries hosting Open Archives. Open access, Open Archives, reference linking and citation analysis are all connected, we contend, in creating a managed digital library framework in which peer reviewed scholarly papers can be made freely accessible to all in the most efficient manner possible.

The story begins with the transition from backwards-in-time reference linking to forward-in-time citation analysis on the Web, and the consequent potential to transform open access. There may have been wild projections for open access, but the scenarios described above, involving publishers and libraries, are real and are an integral part of this story.

From Reference Linking to Citation Analysis on the Web

Reference linking has become the de facto added value for electronic journals (Hunter 1998). In recent years there have been important reference linking initiatives. Journal publishers have converged on Digital Object Identifiers (DOIs) and CrossRef (Pentz 2001), described by Hellman (2001) as a 'miracle'. The library community, which wants to solve the perennial 'appropriate copy' problem - getting the right resource to the right user at the right time - for the digital world (Caplan and Flecker 1999), appears to have selected the ingenious OpenURL (Van de Sompel and Beit-Arie 2001), a proposal for 'context-sensitive' linking (i.e. a service that knows which resources are available to a user) currently being fast-tracked towards standardisation by NISO.

Web linking is not easy, raising social and cultural problems, for example, the farcical misunderstanding of, and resistance to, deep linking by some Web commercial content providers. Reference linking similarly raises commercial as well as technical issues (Hitchcock et al. 1998b). Hellman was referring to the 'unprecedented' cooperation between all the major science publishers through CrossRef, rather than to any implementation, but tensions remain (Quint 2002). Demonstration systems embracing these various linking components have raised hopes that heterogeneous and diverse information environments can be viewed by users as though they are a single delivery system (Beit-Arie et al. 2001), although some remain sceptical (Pace 2002).

From the user perspective, reference links are remarkably useful, but in essence all the link does is save the user time. A formal reference given in a paper is an address to the cited work. Even without the link the referenced work ought to be retrievable. A link might save the user minutes or even weeks in retrieving the work - currently we can only speculate on the cognitive impact on scholarly research of instant and universal online retrievability, which Harnad calls 'scholarly skywriting', and which he predicts will 'increase individual scholars' productivity by an order of magnitude' (Harnad 1996).

The real value in collected reference data is not in producing links that point to works in the past, the authored links, but in creating links that transport the user forward in time. For a given paper, what later works have cited it? Unlike the reference list, this cannot be an authored part of the original paper and cannot be determined by the reader independently. Citation analysis requires an additional service. It is possible to build a simple citation database by storing bibliographic records that contain the reference lists from papers. Hundreds of thousands of users of citation manager programs such as EndNote and ProCite recognise the utility of citation analysis for building personalised bibliographies (Simbol and Zhang 2002).

Citation analysis is not new. The technique was first identified by Garfield and has since been exploited in information products from ISI, the company that Garfield formed. Garfield's brilliant insight was to recognise that references in journal papers can be used to form an intellectual index across the whole of a chosen literature. Such an index would be impossibly complex and costly to compile without author references: ‘by using authors’ references in compiling the citation index, we are in reality utilizing an army of indexers’ (Garfield 1955).

More than that, the index can be used to measure the 'impact' of cited works. The more often a paper is cited, the more highly regarded the work is likely to be within the peer community. This factor has become a widely used, if contentious, measure of the importance of papers, authors and journals. This knowledge can in turn be used by scholars new to a field to find starting points to explore the literature.

ISI has found a lucrative market for its products, indicating the high value that the research community places on tools that measure citation impact. Other abstract and indexing database services, such as the American Chemical Society's Chemical Abstracts Service and the American Mathematical Society's MathSciNet, have belatedly noticed the potential of including citing reference lists, which have also crept into papers in the electronic versions of high-profile journals such as Science and Nature, drawing on secondary sources such as ISI (Simbol and Zhang 2002).

The advent of the Web has seen dramatic growth in the availability of journal papers online, many free through services such as arXiv (http://arxiv.org/), and has opened new possibilities for citation analysis. Network access to works has made it possible to develop software to automate data collection from very large resources at relatively low cost, making it feasible for Web-based citation services to be offered free to users. NEC's ResearchIndex (Lawrence et al. 1999) and Citebase, a citation and impact-ranked search service produced by the Open Citation Project, are two examples. In contrast to ISI's established subscription services covering a self-selecting corpus of 6500 of the highest impact journals, these automated services are in their infancy, covering diverse collections, having to work with inconsistent data formats and trying to identify user preferences to optimise their features. Progress is being made. ResearchIndex currently indexes over half a million computer science papers. Citebase is linked from over 200k arXiv records (currently on a trial basis), introducing the service to tens of thousands of prospective users.

Links to Citebase sit below links to the Stanford Linear Accelerator Center (SLAC) SPIRES citation database in a typical arXiv abstract page (see foot of Figure 1). The SLAC-SPIRES service involves more manual labour in data collection and checking than the software approach of Citebase, and has been compiled over a longer period, since 1974 (O'Connell 2000). SLAC-SPIRES covers only high-energy physics, a large subset of arXiv, whereas Citebase indexes all papers in arXiv. The two are thus not directly comparable, but both emphasise the contentious nature of citation data with prominent warnings about coverage and interpretation.

Figure 1. Example arXiv abstract, showing links to SLAC-SPIRES and Citebase citation services

The Open Journal (OJ) Project produced some of the first demonstrators of Web-based reference linking and citation analysis, but depended on data supplied from journal publishers and ISI (Hitchcock et al. 1998a). Soon after this collaboration ISI introduced Web of Science, making its citation indexes available on the Web (Atkins 1999). Starting in 1999, as the successor to the OJ project, the three-year Open Citation Project aimed to apply the tools and techniques from the earlier OJ work to open and freely accessible Web data, in particular to now mature eprint archives such as arXiv. The project combined the experience of reference linking specialists in Southampton University's IAM group with the expertise in digital library data management of the Digital Library Research Group at Cornell University. The third partner was arXiv, then based at Los Alamos and now hosted at Cornell.

As the Open Citation Project completes its funding period, this paper describes the broad scope of its work, showing how it has progressed from early demonstrators of reference linking to produce Citebase. This work was underpinned by analysis and experiments on the semantics of documents (digital objects) to determine the features required for formally perfect linking: an application programming interface (API) for reference linking. Along the route the project helped launch the OAI, with project principals leading the development of metadata and protocol schemes on which OAI is founded (Lagoze and Van de Sompel 2001), and supported the development of EPrints.org software to build OAI-compliant archives.

Reference linking: Opcit in the DL environment

One original objective of the Open Citation project, described by Hitchcock et al. (2000), was to 'hyperlink', or produce reference links, for all the papers in the arXiv physics archives. The extension of that work to build a citation database could be seen to be one of the primary contributors to the objective of promoting this new way of navigating the scientific journal literature based on free access and free services.

At that time OAI was in its infancy. In terms of numbers of papers, access to eprints was, and still is, dominated by the centralised disciplinary-based arXiv, but there was also a distributed collection on which OAI based its model and technical infrastructure. NCSTRL (Networked Computer Science Technical Reference Library) provided an index, now being revived within an OAI framework (Anan et al. 2002), for browsing and searching papers from participating computer science departments. Thus the project could foresee a distributed information environment in which digital libraries are distinguished by services that apply to various types of content. Mediating services would provide managed and enhanced access to free content (OpCit) or paid-for content (the established journal secondary services supplemented by CrossRef and DOIs) or in some cases both (resolver services such as SFX (Van de Sompel and Hochstenbach 1999); OpenURL was motivated by the need to standardise the way metadata describing cited resources is packaged within a URL so this information can be passed to resolvers such as SFX).

There are two ways of presenting digital services to users. One is to modify the original content. An example is the project's early experiments with reference linking, illustrated by Hitchcock et al. (2000). References were linked, indicated by boxes surrounding the linked text, from PDF versions of original papers. Overlaying services on content in this way is effective if it is offered at the place and moment the user needs it most. Otherwise this approach can appear intrusive and faces cultural resistance. Further, it can be difficult, not to say inappropriate, to add new information to the originally authored text. A more universally accepted way is to create information interfaces.

Citebase: a new interface to the scholarly literature

As the volume of networked metadata and content grows, interfaces become a powerful and flexible means of enabling users to explore this content. Interfaces in the digital environment are analogous to packaging in the physical world, embracing selection as well as access. What makes digital services so powerful is the degree of automation that can be implemented behind the user interface (Arms 2000). At its most effective this processing must be transparent yet responsive to user demands, providing scope for user input and, for more advanced services, control. The resulting output must be organised optimally for user response.

Search is the most familiar service on the Web, yet because most search engines compete to offer the most comprehensive coverage of the Web the concept of selection is not immediately obvious. Instead, bare search services that have not evolved into portals are characterised by a simple user interface, a text box, and compete on the ability to provide fast processing and the most relevant results. In other words, the most successful search engines provide the desired result with minimal input and effort from the user by delegating almost all choices and almost the entire task to a highly sophisticated underlying algorithm and processor.

In one case the underlying algorithm provides citation analysis with perhaps the ultimate accolade: a mass audience service, although it is unlikely many users are aware of the connection with citation analysis. The search service in question is Google, inevitably. Google has become enormously popular for the quality of its results - the ability to rank Web pages that satisfy the user's query at the top of the results (Brin and Page 1998). As well as indexing content, Google analyses links to Web pages. The technique works because links, like citations, are not offered lightly and represent intellectual connections between works. The number of links pointing to a page can be used to determine its relative importance among pages on similar topics and is the basis of Google's ranked results.

The growth of OAI archives has motivated the emergence of search services, such as Arc (http://arc.cs.odu.edu/) (Liu et al. 2001) and OAIster (http://oaister.umdl.umich.edu/cgi/b/bib/bib-idx?c=oaister;page=simple), which cover all registered OAI-compliant data providers (DPs) rather than the Web (most OAI data providers are hidden to Web search engines, although software such as DP9 (http://www.cs.odu.edu/~dlibuser/dp9/) can be used to build a gateway service for crawlers that require persistent URLs and HTML rather than XML for all OAI records). These services harvest and store OAI metadata records from OAI archives, so user search is based on these data rather than the full texts of the archived objects.

Citebase - “Google for the refereed literature”, because it ranks results based on references to designated papers - exercises more selective coverage, harvesting from the larger OAI disciplinary archives - currently arXiv, CogPrints (http://cogprints.soton.ac.uk/) and BioMed Central (http://www.biomedcentral.com/) - that (with permission) allow texts as well as metadata to be downloaded via an automated machine interface. Unlike the earlier OpCit reference linking demonstrator, Citebase does not store full documents but extracts the references, which are associated with the OAI metadata record for the document in which they are identified. This association between document records and references is the basis for a classical citation database, matching a cited document with the record for that document (reference linking), and matching a record with instances of its citation (forward citation analysis), i.e.:

In this case the citation database explicitly contains records for documents A and B. A record can be treated as a surrogate for the full text because it contains a direction (typically a URL) to the text. Although the existence of document C is known through its citation by A and B, it may not be possible to link to C if there is no harvested record for it. Whether C is known simply by citation or as a harvested record, it will always be possible to link from a citation of C to A and B, illustrating another benefit of linking forward in time to citing documents.
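The two matching operations can be illustrated with a minimal sketch. It assumes a toy in-memory store rather than the Citebase database itself; the record layout, the identifiers and the example.org URLs are hypothetical and chosen only to mirror documents A, B and C above.

```python
from collections import defaultdict

# Hypothetical harvested records: A and B are held, and both cite C.
# No record has been harvested for C itself.
records = {
    "A": {"url": "http://example.org/A", "references": ["C"]},
    "B": {"url": "http://example.org/B", "references": ["C"]},
}


def build_cited_by(records):
    """Invert the reference lists: cited identifier -> sorted citing identifiers."""
    cited_by = defaultdict(set)
    for doc_id, record in records.items():
        for ref in record["references"]:
            cited_by[ref].add(doc_id)
    return {ref: sorted(citers) for ref, citers in cited_by.items()}


def resolve_reference(records, ref):
    """Reference linking: URL of the cited document, or None if no record is held."""
    record = records.get(ref)
    return record["url"] if record else None


cited_by = build_cited_by(records)
print(cited_by["C"])                     # the documents that cite C
print(resolve_reference(records, "C"))   # no harvested record for C
```

Even though C cannot be resolved to a text, the inverted index still leads from C to the documents that cite it, which is the forward-in-time link described above.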

The Citebase Web interface (Figure 2) shows how the user can classify the search query terms (typical of an advanced search interface) based on metadata in the harvested record (title, author, publication, date). In separate interfaces, users can search by archive identifier or by citation. What differentiates Citebase is that it also allows users to select the criterion for ranking results by Citebase processed data (citation impact, author impact) or based on terms in the records identified by the search, e.g. date (see drop-down list in Figure 2). It is also possible to rank results by the number of 'hits', a measure of the number of downloads and therefore a rough measure of the popularity of a paper. This is an experimental feature to determine if hit data might be correlated with or independent of citation impact, and is based on limited data on download frequencies from the UK arXiv mirror at Southampton. Its retention in the full Citebase service is subject to further analysis and discussion.

Figure 2. Citebase search interface, showing results for the most-cited paper on string theory in arXiv (on 25/11/02)

The results shown in Figure 2 are ranked by citation impact: Maldacena's paper, the most-cited paper on string theory in arXiv at the time, has been cited by 1576 other papers in arXiv.
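User-selectable ranking of this kind amounts to sorting the same result set on different keys. The following is a rough sketch only: the field names and all the values are invented for illustration and do not reflect Citebase's internal data or real papers.

```python
# Hypothetical search hits; every value here is illustrative.
hits = [
    {"title": "Paper X", "citations": 12, "downloads": 340, "date": "1999-03-01"},
    {"title": "Paper Y", "citations": 45, "downloads": 120, "date": "1997-11-20"},
    {"title": "Paper Z", "citations": 3, "downloads": 510, "date": "2001-06-05"},
]

# Assumed mapping from the ranking choice offered to the user to a record field.
RANKING_KEYS = {
    "citation impact": "citations",
    "hits": "downloads",
    "date": "date",
}


def rank(hits, criterion):
    """Return the hits ordered from highest to lowest on the chosen criterion."""
    key = RANKING_KEYS[criterion]
    return sorted(hits, key=lambda hit: hit[key], reverse=True)


for criterion in RANKING_KEYS:
    print(criterion, [hit["title"] for hit in rank(hits, criterion)])
```

In this invented data the most-cited paper is not the most-downloaded one, which is the kind of divergence the experimental 'hits' ranking is intended to expose.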

Citebase is based on classical citation principles adopted by other successful services and widely used in the community, but does this implementation work for users? There are a number of variables that need to be tested, and Citebase has been evaluated by arXiv users and by others who use or maintain bibliographic services to access the refereed journal literature. Results of that evaluation are being processed and will be reported first on the project Web site (http://opcit.eprints.org/).

The aims of the evaluation, which was based on two Web forms (URL), were to:

Development of Citebase will continue beyond the OpCit project. Of widest significance is the emergence of Citebase as a data provider as well as an OAI service provider. Citebase records will be available to automated harvesters just as though they were OAI records, although they are more complex, containing reference data (Figure 3). Researchers at Old Dominion University have harvested Citebase data as part of their Archon (http://archon.cs.odu.edu/) federated digital library on physics (Liu et al. 2002), and arXiv is a possible (re)harvester of Citebase data too.

Experiments are being performed with various metadata formats and XML schema for exporting reference data. One format designed for this purpose is the Academic Metadata Format (Krichel and Warner 2001). This is a 'local profile', i.e. nonstandard, format. Other possibilities are encoding citations in the OpenURL format, or using the structured-value set containing the sub-elements for citation proposed by the Dublin Core Citation Working Group (http://www.dublincore.org/groups/citation/) which can be mapped to OpenURL attributes (Powell and Apps 2001). The difficulties of producing an agreed schema and format for citation metadata were highlighted on the OAI-implementers discussion list (http://www.openarchives.org/pipermail/oai-implementers/2002-June/000518.html, thread XSD file for qualified DC).

Figure 3. Example Citebase record encoded in DC-Citation-like format for potential re-harvesting by other service and data providers

Other planned enhancements include making Citebase reference links OpenURL-enabled, so pointing the links at library and journal services. This feature is being investigated by directing OpenURL links at a target resolver service (typically users should be able to select their preferred resolver, likely to be based in their institutional library). In this case the target resolver should ideally include Citebase data, so results presented to the user following a Citebase link might include a link back to Citebase as well as to other sources that might contain a referenced item. Citebase is a new, non-commercial service and so is unlikely to be included in resolvers supplied as part of library information systems (Hellman 2001).
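To make the idea of an OpenURL-enabled link concrete, the sketch below packages citation metadata as a query string appended to a resolver address. It is an illustration under stated assumptions, not the Citebase implementation: the resolver base URL and the citation values are hypothetical, and the key names (genre, aulast, date, volume, issue, spage) follow our reading of the draft OpenURL syntax and should be checked against the specification.

```python
from urllib.parse import urlencode


def make_openurl(resolver_base, citation):
    """Append citation metadata to a resolver base URL as a query string."""
    # Drop empty fields so the link only carries metadata that is actually held.
    params = {key: value for key, value in citation.items() if value}
    return resolver_base + "?" + urlencode(params)


# Hypothetical citation extracted from a reference list.
citation = {
    "genre": "article",
    "aulast": "Smith",
    "date": "2001",
    "volume": "7",
    "issue": "",      # unknown, so omitted from the link
    "spage": "15",
}

# Hypothetical resolver; in practice this would be the user's preferred resolver.
url = make_openurl("http://resolver.example.org/menu", citation)
print(url)
```

Because only the base address changes from one institution to another, the same citation metadata can be pointed at whichever resolver the user selects.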

Citebase has a DP9 interface, principally to enable it to be indexed by Google and other Web search engines. It has been discovered that this needs to be optimised to enable Google to index the whole of Citebase: it is believed Google takes longer to index dynamically generated cgi-based services than static pages. This limited coverage of Citebase in Google has become less important now that arXiv is indexed by the search service (arXiv has a long-standing policy blocking access to Web crawling software used by search engines), and now that Citebase is linked from records for arXiv papers. Ironically, the static arXiv links should ensure that Google indexes all of Citebase. Other OAI data and service providers may still need DP9 to assist indexing by Web search engines.

API for Reference Linking

Recap
Surrogates in the API
API evaluation
RefLinking demo
XGQuery reference-citation graph viewer - ResearchIndex demo
Here are some good references to demo:
BCST95 at "Major theoretical interactive..."
Ell89  at "Information Seeking Studies..."
Ell97  at "IR researchers are also beginning..."
Some Figs here or, given it's an e-paper, a link to a tested demo?

Filling the Archives: EPrints.org Software

Reference linking and citation analysis only truly become effective when there is a critical mass of related, linkable content, whether that content is in Open Archives or journals. For Open Archives, even when aggregated, other than from those larger subject-focused archives covered by Citebase there is as yet insufficient content for linking. It is possible the example of Citebase and arXiv will motivate authors in other areas to self-archive their papers, but the OpCit project hasn't just promoted the benefit of contributing to Open Archives by proxy example. It has supported the development of software to build and manage Open Archives, known as EPrints.org software.

EPrints software is undoubtedly the better known product of the OpCit project. It could be argued that Citebase or similar services will ultimately have more impact with users, but EPrints is necessary now and plays a critical role in enabling open archives to be filled.

EPrints originated from software used to manage arXiv. That software was first adapted to run the CogPrints cognitive science eprint archive. With the emergence of OAI and the consequent emphasis on institutional archives, it was evident there would be a need for large numbers of smaller archives than arXiv, but which would need to operate on similar principles - low cost, largely automated deposit, indexing and dissemination of author-archived content. EPrints was developed within the remit of the Open Citation project to generalise the author and management interfaces for Open Archives.

Of most significance, EPrints builds archives that are compliant with the OAI Protocol for Metadata Harvesting (PMH). This means that any content deposited within an EPrints-based archive will become visible to users of OAI services, such as the search services mentioned above, immediately enhancing the chances of discovery. Authors depositing papers in an EPrints archive are not required to have any knowledge of OAI metadata: it is generated automatically.
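What a harvester sees of such an archive can be sketched briefly. The example assumes a hypothetical base URL and a hand-written sample response rather than output from a real EPrints archive; the verb and parameter names (ListRecords, metadataPrefix, from) and the namespaces are those of OAI-PMH version 2 and unqualified Dublin Core as we understand them.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request, optionally limited to records changed since a date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hand-written sample of the kind of response an archive might return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2002-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example eprint</dc:title>
          <dc:creator>Author, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract the identifier and Dublin Core title from each harvested record."""
    root = ET.fromstring(xml_text)
    parsed = []
    for record in root.iter("{%s}record" % OAI_NS):
        identifier = record.findtext("{%s}header/{%s}identifier" % (OAI_NS, OAI_NS))
        title = record.findtext(".//{%s}title" % DC_NS)
        parsed.append({"identifier": identifier, "title": title})
    return parsed


request = list_records_url("http://archive.example.org/oai", from_date="2002-06-01")
harvested = parse_records(SAMPLE_RESPONSE)
print(request)
print(harvested)
```

The depositing author sees none of this: the archive software generates the metadata record, and any service provider can collect it with requests of this form.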

EPrints is aimed at institutions and special-interest communities. In its current incarnation, the name GNU EPrints (http://software.eprints.org/) reflects its new status as open source software, available free under the GNU General Public License. The last major release of EPrints, version 2.0, appeared in February 2002, although it has been updated (now on version 2.1.1) to conform with the latest OAI-PMH (also version 2) announced in June. Features of EPrints version 2 include:

The practicalities of building an EPrints-based archive are described by Nixon (2002). Meanwhile, EPrints has new features that extend its focus on institutional research papers. It is now configurable for adoption as a journal-archive for new open access journals or established journals converting to open access, e.g. psycprints (http://psycprints.ecs.soton.ac.uk/).

OpCit and OAI

There has been a surge of activity based on OAI, reflected in research programs and projects, tools, data and service providers (Van de Sompel and Lagoze 2002). The faith of early adopters has proved well founded, but many repository administrators had their fingers crossed:
“As we have introduced our repository to our faculty and staff, we have emphasized the point that because they would be depositing their material in an OAI-compliant archive, it would automatically and painlessly be discoverable from various other points around the globe. Luckily, we were right.”
(Roy Tennant, eScholarship, California Digital Library, on American Scientist September-98 Forum, June 2002 http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2085.html)

A primary motivator for adoption of OAI has been its promotion by funding agencies such as JISC in the UK (see Beyond the Project below), the NSF, Digital Library Federation (http://www.diglib.org/architectures/testbed.htm) and the Mellon Foundation (Waters 2001) in the USA, as well as funding from new programmes such as the Budapest Open Access Initiative (http://www.soros.org/openaccess/read.shtml) sponsored by George Soros' Open Society Institute. The results of these recent initiatives, and the momentum they have provided for eprints, have been documented by Suber (2002).

The Open Citation project has contributed to OAI not just as a data and service provider, but in other, less noticeable ways concerned with enhancing the efficiency of OAI through aggregation, registration and validation services, and building infrastructure.

OAI is founded on the idea of interoperability, that if objects in an Open Archive are described by a defined protocol and metadata format then the presence or availability of a work can be advertised to other, independent services. At its simplest, with the OAI-PMH based on unqualified Dublin Core metadata, say, interoperability ought to be straightforward in principle. In practice, unqualified DC is not mandated, and there are various reasons why the quality of OAI data for harvesting can be compromised. Liu et al. (2001) discovered that not all archives strictly follow the OAI protocol, many have XML syntax and encoding problems, and some data providers are periodically unavailable.

One solution is for data providers to be validated for protocol compliance, but not all data providers register. The registration and validation service provided by OAI, and managed by Donna Bergmark at Cornell, has other benefits. Registered archives become accessible by service providers, and validation helps improve repository maintenance. To simplify registration, EPrints feeds repository URLs straight into the OAI registration process (if so desired by the EPrints administrator). A scan of the list of registered sites (http://www.openarchives.org/Register/BrowseSites.pl) shows many have used EPrints to build repositories.

To improve interoperability, scalability and reliability of OAI services, OpCit has worked with the Old Dominion University team on infrastructure components such as proxies and caches (Liu et al. 2002). Proxies, transparent layers acting between data providers and harvesters, can be used to fix simpler encoding errors as part of the delivery process. More serious errors in the data require an intermediate storage approach: caching and aggregation. In this case a few large service providers might harvest and cache metadata from registered OAI repositories, reducing the load on those archives and serving many smaller harvesters. An OAI aggregator (OAIA) must in principle be an active cache as it requests new records from known repositories in advance so it is always up-to-date. An example OAIA known as ‘Celestial’ (http://celestial.eprints.org), which mirrors OAI repositories, has been built by Tim Brody from the OpCit team.
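The difference between a passive cache and an active one can be sketched in a few lines. This is a toy illustration of the principle and not a description of how Celestial works: the class, its method names and the stand-in repository function are all hypothetical, and a real aggregator would fetch over OAI-PMH rather than call a local function.

```python
class Aggregator:
    """Toy active cache: refreshes from repositories ahead of harvester requests."""

    def __init__(self, fetchers):
        # fetchers: repository name -> callable(since) returning a list of
        # (identifier, datestamp, metadata) tuples changed on or after `since`.
        self.fetchers = fetchers
        self.cache = {}       # identifier -> (datestamp, metadata)
        self.last_seen = {}   # repository name -> latest datestamp harvested

    def refresh(self):
        """Pull changed records from every known repository into the cache."""
        for name, fetch in self.fetchers.items():
            since = self.last_seen.get(name)
            for identifier, datestamp, metadata in fetch(since):
                self.cache[identifier] = (datestamp, metadata)
                if since is None or datestamp > since:
                    since = datestamp
            if since is not None:
                self.last_seen[name] = since

    def list_records(self, since=None):
        """Serve a downstream harvester from the cache, not from the repositories."""
        selected = [
            (identifier, datestamp, metadata)
            for identifier, (datestamp, metadata) in self.cache.items()
            if since is None or datestamp >= since
        ]
        return sorted(selected, key=lambda item: item[0])


def fake_repository(since):
    """Stand-in for a remote archive; the records are invented for illustration."""
    stored = [
        ("oai:example.org:1", "2002-05-01", {"title": "First"}),
        ("oai:example.org:2", "2002-06-15", {"title": "Second"}),
    ]
    return [item for item in stored if since is None or item[1] >= since]


aggregator = Aggregator({"example": fake_repository})
aggregator.refresh()   # run on a schedule, before any harvester asks
print(aggregator.list_records(since="2002-06-01"))
```

Because refresh is driven by the aggregator's own schedule rather than by incoming requests, the source repositories see one harvester instead of many.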

Access and Impact: OpCit Data Mining

OAI is winning support from repository administrators because it has a simple and, mostly, effective infrastructure. This feature alone will be insufficient to attract authors to deposit works in Open Archives. Many authors perceive, incorrectly, that Open Archives are competing with other sources, such as journals, for submissions. The role of Open Archives is to complement journals while establishing distinctive benefits for authors.

The most compelling attraction a source can offer authors is impact, the ability to confer recognition and prestige on submitted works. Open Archives, because they are free to authors and users, maximise access to works and will therefore maximise impact. While the latter part is still speculative, it can begin to be substantiated. Through its work with arXiv, the project has access to over 10 years of submitted papers and can identify how citation patterns have changed over that time. Correlations have been made with, admittedly limited, data on usage of arXiv taken from the arXiv mirror at Southampton since August 1999. The raw results of this work can be found in Mining the Social Life of an Eprint Archive (http://opcit.eprints.org/tdb198/opcit/ and http://opcit.eprints.org/ijh198/). Interpretation is difficult, but we can present at least two results which support claims that open access improves impact (Figure 4).
 

Figure 4 (panels a and b). Maximising access: maximising impact. Data on downloads and citations for papers in arXiv

Figure 4a shows how, over a period of eight years to 1999, the peak of citations occurs higher and sooner for papers deposited in each succeeding year. The citation peaks for 1999 and 1998 can be seen after approximately 3-4 months. This is remarkable because it implies that the speed of scientific communication – the rate of ideas affecting other researchers' ideas – is increasing dramatically.

As with any large collection of papers, there is a wide variation in the likelihood of any individual paper being cited. Analysis of citations identified papers in arXiv that might be categorised as high, medium and low impact papers. From 132218 papers in arXiv at the time of the analysis, 595698 internal citations were extracted, an average of 4.51 citations per paper. The papers were split so that approximately 1/3 of the citations were to each category of impact. Papers with no citations to them are referred to as 'unknown'. The number of papers in each category is shown in Table 1 and graphically in Figure 5.
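One way such a split could be computed is sketched below. This is an assumed reconstruction, not the project's actual analysis code: the function name and the toy citation counts are invented for illustration. Cited papers are ranked by citation count, the counts are accumulated from the most cited downwards, and the tier boundaries fall where the running total passes one third and two thirds of all citations. Papers with the same citation count always land in the same tier, and uncited papers are labelled 'unknown'.

```python
from collections import Counter

TIERS = ("high", "medium", "low")


def impact_tiers(citations):
    """Map each paper to a tier, given a dict of paper id -> citation count."""
    total = sum(citations.values())
    # Number of papers at each non-zero citation count.
    per_count = Counter(c for c in citations.values() if c > 0)

    tier_for_count = {}
    running = 0  # citations already allocated to more highly cited papers
    for count in sorted(per_count, reverse=True):
        # Tier index 0, 1 or 2 according to which third of the citation
        # total has been reached before this group of papers is added.
        tier_for_count[count] = TIERS[min(2, 3 * running // total)]
        running += count * per_count[count]

    return {
        paper: tier_for_count.get(count, "unknown")
        for paper, count in citations.items()
    }


# Invented toy data: nine papers receiving 30 citations in total.
toy = {"p1": 10, "p2": 5, "p3": 5, "p4": 3, "p5": 3,
       "p6": 2, "p7": 1, "p8": 1, "p9": 0}
tiers = impact_tiers(toy)

# Average quoted in the text: 595698 citations over 132218 papers.
average = round(595698 / 132218, 2)
```

With the toy data each tier receives exactly 10 of the 30 citations. With real data the split can only be approximate, because all papers sharing a citation count are kept together.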

Returning to Figure 4b, which shows accesses to papers in each category, it becomes clear that high impact papers are accessed more often and over a more sustained period than other types of paper. What is not clear from this analysis alone is whether higher accesses are due to higher citations, or higher citations due to higher accesses, but either way the result is dependent on unrestricted, free access. The relationship between access and impact is worthy of further study, but what can be said is that a clear hierarchy of papers emerges, based entirely on previously unrecognised usage patterns within arXiv. Brody et al. (2002) explore further results from this work, showing how arXiv supports an evolving network of texts commenting on, citing, classifying, abstracting, listing and revising other texts. Archives are becoming a network of texts rather than simply a classified collection of texts.
 
Impact    No. of papers   No. of citations per paper
High       2698           40+
Medium    10122           13 - 39
Low       61518           1 - 12
Unknown   57881           0

Table 1. Spectrum of high, medium and low impact papers in arXiv

Figure 5. Graphical representation of data from Table 1

It has to be recognised that impact depends on more than access; another factor is association, with an established journal title, say. Journal reputations are founded on peer review. Figure 4 shows that informed authors can have all three benefits - peer review, access and impact - simply by depositing a paper in an Open Archive at the same time as submitting to a peer reviewed journal. Revised versions can similarly be submitted to both sources simultaneously. For a given paper, publication in a peer reviewed journal is recognised in the updated OAI record.

Intuitively, authors, and journal publishers too, know that unrestricted access enhances impact. The biomedical field, which has the largest number of high-impact journals (Garfield 1996), has least reason to alter its publishing practices, yet initiatives such as NIH's PubMed Central (http://www.pubmedcentral.nih.gov/) and the Public Library of Science (http://www.publiclibraryofscience.org/) are evidence that authors now demand more. Publishers may not have warmly embraced NIH's demand for deposit of published papers in its freely-accessible archive, yet those journals that contribute to PubMed Central do so without compulsion and are clearly sensitive to their authors' demands as reflected by PLoS. It is no coincidence that a biomedical journal publisher, BioMed Central, has produced the most convincing publishing model so far for open access journals (Velterop 2002).

Recognition is dawning of the complementary roles of Open Archives and journals in scholarly communication and publication. Electronic journals will inherit one critical service from their print ancestors: peer review. Meanwhile, Open Archives facilitate access; open services such as Citebase will measure impact.

Beyond the OpCit Project

The ideas and efforts that have characterised OpCit will be taken forward not just in the obvious products of the project, such as Citebase and GNU EPrints, but in new environments as well. Specifically, the JISC FAIR programme (http://www.jisc.ac.uk/dner/development/programmes/fair.html), which is just beginning, includes major projects that will seek to extend the culture of EPrints-based archives in UK universities through the provision and targeting of new archives and supplementary services:

Conclusion: What we have Learned

The Open Citation project has produced tools to help OAI data providers and service providers. The project has been fortunate in being able to contribute to the broadly-based activities, focused on OAI, that have emerged since 1999 to support improved scholarly communication through open access to research papers. We are clear this is the beginning of a transformation towards more open access, not its end. The longer-term future is thus exciting, yet uncertain. The legacy of a project should be born of experience rather than speculation, so we offer some concluding thoughts which, although stated before, collectively create a clear picture of the way forward:

References

Anan, H. et al. (2002) "Preservation and Transition of NCSTRL Using an OAI-Based Architecture". Proceedings of the Second ACM/IEEE Joint Conference on Digital Libraries, Portland, Oregon, July
http://128.82.7.99/ncstrl/p183-anan.doc
http://mln.larc.nasa.gov/~mln/pubs/ncstrl-oai.pdf

Arms, W. Y. (2000) "Automated Digital Libraries: How Effectively Can Computers Be Used for the Skilled Tasks of Professional Librarianship?" D-Lib Magazine, Vol. 6, No. 7/8, July/August
http://www.dlib.org/dlib/july00/arms/07arms.html

Atkins, H. (1999) "The ISI Web of Science - Links and Electronic Journals". D-Lib Magazine, Vol. 5 No. 9, September
http://www.dlib.org/dlib/september99/atkins/09atkins.html

Beit-Arie, O. et al. (2001) "Linking to the Appropriate Copy: Report of a DOI-Based Prototype". D-Lib Magazine, Vol. 7, No. 9, September
http://www.dlib.org/dlib/september01/caplan/09caplan.html

Brin, S. and Page, L. (1998) "The Anatomy of a Large-Scale Hypertextual Web Search Engine". Seventh International World Wide Web Conference, Brisbane
http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

Brody, T., Carr, L. and Harnad, S. (2002) "Evidence of Hypertext in the Scholarly Archive". Proceedings of HT'02, the 13th ACM Conference on Hypertext, University of Maryland, June
http://opcit.eprints.org/ht02-short/archiveht-ht02.pdf

Caplan, P. and Flecker, D. (1999) "Choosing the Appropriate Copy". NISO News, September
http://www.niso.org/DLFarch.html

Crow, R. (2002) "The Case for Institutional Repositories: A SPARC Position Paper". Scholarly Publishing & Academic Resources Coalition, Washington, D.C., July
http://www.arl.org/sparc/IR/ir.html

Garfield, E. (1955) "Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas". Science, Vol. 122, No. 3159, July 15, 108-111
http://www.garfield.library.upenn.edu/papers/science_v122(3159)p108y1955.html

Garfield, E. (1996) "The Significant Scientific Literature Appears in a Small Core of Journals". The Scientist, Vol. 10, No. 17, September 2nd, 13, 16
http://www.the-scientist.com/yr1996/sept/research_960902.html

Harnad, S. (1996) "Implementing Peer Review on the Net: Scientific Quality Control in Scholarly Electronic Journals". In Scholarly Publication: The Electronic Frontier, edited by Peek, R. and Newby, G (Cambridge, MA: MIT Press), pp. 103-108
http://cogsci.soton.ac.uk/~harnad/Papers/Harnad/harnad96.peer.review.html

Harnad, S. (2001) "Why I think research access, impact and assessment are linked". Times Higher Education Supplement, Vol. 1487, 18 May, p. 16
http://www.cogsci.soton.ac.uk/~harnad/Tp/thes1.html (extended version)

Hellman, E. (2001) "Building a database for e-journals". Web4Lib Electronic Discussion, 17th October
http://sunsite.berkeley.edu/Web4Lib/archive/0110/0175.html

Hitchcock, S. et al. (1998a) "Webs of Research: Putting the User in Control". Internet Research and Information for Social Scientists (IRISS) Conference, Bristol, March
http://sosig.ac.uk/iriss/papers/paper42.htm

Hitchcock, S. et al. (1998b) "Linking Electronic Journals: Lessons from the Open Journal Project". D-Lib Magazine, December
http://www.dlib.org/dlib/december98/12hitchcock.html

Hitchcock, S. et al. (2000) "Developing Services for Open Eprint Archives: Globalisation, Integration and the Impact of Links". Proceedings of the Fifth ACM Conference on Digital Libraries, June (ACM: New York), pp. 143-151
http://opcit.eprints.org/dl00/dl00.html

Hunter, K. (1998) "Adding Value by Adding Links". Journal of Electronic Publishing, Vol. 3, No. 3, March
http://www.press.umich.edu/jep/03-03/hunter.html

Krichel, T. and Warner, S. (2001) "A metadata framework to support scholarly communication". International Conference on Dublin Core and Metadata Applications 2001, Tokyo, October
http://openlib.org/home/krichel/papers/kanda.html

Lagoze, C. and Van de Sompel, H. (2001) "The Open Archives Initiative: Building a Low-Barrier Interoperability Framework". Joint Conference on Digital Libraries, Roanoke, VA, June
http://www.cs.cornell.edu/lagoze/papers/oai-final.pdf

Lawrence, S., Giles, C. L. and Bollacker, K. (1999) "Digital Libraries and Autonomous Citation Indexing". IEEE Computer, Vol. 32, No. 6, 67-71
http://www.neci.nj.nec.com/~lawrence/papers/aci-computer98/

Liu, X. et al. (2001) "Arc - An OAI Service Provider for Digital Library Federation". D-Lib Magazine, Vol. 7, No. 4, April
http://www.dlib.org/dlib/april01/liu/04liu.html

Liu, X. et al. (2002) "A Scalable Architecture for Harvest-Based Digital Libraries - The ODU/Southampton Experiments". arXiv.org, Computer Science cs.DL/0205071, May
http://arxiv.org/abs/cs.DL/0205071

Nixon, W. (2002) "The evolution of an institutional e-prints archive at the University of Glasgow". Ariadne, issue 32, July
http://www.ariadne.ac.uk/issue32/eprint-archives/

O'Connell, H. B. (2000) "Physicists Thriving with Paperless Publishing". arXiv.org, Physics/0007040, February
http://arxiv.org/abs/physics/0007040

Pace, A. K. (2002) "'Standard' Issue: Defining Standards and Protocols". Computers in Libraries, Vol. 22, No.8, September
http://www.infotoday.com/cilmag/sep02/Pace.htm

Pentz, E. (2001) "CrossRef: A Collaborative Linking Network". Issues in Science and Technology Librarianship, Winter
http://www.library.ucsb.edu/istl/01-winter/article1.html

Pinfield, S., Gardner, M. and MacColl, J. (2002) "Setting up an institutional e-print archive". Ariadne, issue 31, April
http://www.ariadne.ac.uk/issue31/eprint-archives/

Powell, A. and Apps, A. (2001) "Encoding OpenURLs in Dublin Core Metadata". Ariadne, issue 27, March
http://www.ariadne.ac.uk/issue27/metadata/

Quint, B. (2002) "The Digital Library of the Future: CrossRef Search and QuestionPoint offer challenges to traditional services". Information Today, Vol. 19, No. 7, July/August
http://www.infotoday.com/it/jul02/quint.htm

Simbol, B. and Zhang, M. (2002) "Citation Managers and Citing-Cited Data". Issues in Science and Technology Librarianship, Summer
http://www.istl.org/02-summer/article4.html

Suber, P. (2002) "Momentum for eprint archiving". Free Online Scholarship Newsletter, 8th August
http://www.topica.com/lists/suber-fos/read/message.html?mid=1607391538&sort=d&start=38

Van de Sompel, H. and Beit-Arie, O. (2001) "Open Linking in the Scholarly Information Environment Using the OpenURL Framework". D-Lib Magazine, Vol. 7, No. 3, March
http://www.dlib.org/dlib/march01/vandesompel/03vandesompel.html

Van de Sompel, H. and Hochstenbach, P. (1999) "Reference Linking in a Hybrid Library Environment, Part 2: SFX, a Generic Linking Solution". D-Lib Magazine, Vol. 5, No. 4, April
http://www.dlib.org/dlib/april99/van_de_sompel/04van_de_sompel-pt2.html

Van de Sompel, H. and Lagoze, C. (2002) "Notes from the Interoperability Front: A Progress Report from the Open Archives Initiative". 6th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Rome, September
http://lib-www.lanl.gov/%7Eherbertv/papers/ecdl-submitted-draft.pdf

Velterop, J. (2002) "BioMed Central. What we do and what we don't do". American-Scientist-E-PRINT-Forum, August 14th
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2228.html

Waters, D. J. (2001) "The Metadata Harvesting Initiative of the Mellon Foundation". ARL Bimonthly Report, No. 217, August
http://www.arl.org/newsltr/217/waters.html

URLs given in this paper

Arc (http://arc.cs.odu.edu/)
Archon (http://archon.cs.odu.edu/)
arXiv (http://arxiv.org/)
BioMed Central (http://www.biomedcentral.com/)
Budapest Open Access Initiative (http://www.soros.org/openaccess/read.shtml)
CogPrints (http://cogprints.soton.ac.uk/)
Digital Library Federation (http://www.diglib.org/architectures/testbed.htm)
DP9 (http://www.cs.odu.edu/~dlibuser/dp9/)
Dublin Core Citation Working Group (http://www.dublincore.org/groups/citation/)
E-Prints UK (http://www.rdn.ac.uk/projects/eprints-uk/)
GNU EPrints (http://software.eprints.org/)
JISC FAIR programme (http://www.jisc.ac.uk/dner/development/programmes/fair.html)
Mining the Social Life of an Eprint Archive (http://opcit.eprints.org/tdb198/opcit/ and http://opcit.eprints.org/ijh198/)
OAI-implementers discussion list, thread: XSD file for qualified DC (http://www.openarchives.org/pipermail/oai-implementers/2002-June/000518.html)
OAI Aggregator ‘Celestial’ (http://celestial.eprints.org)
OAI registered sites (http://www.openarchives.org/Register/BrowseSites.pl)
OAIster (http://oaister.umdl.umich.edu/cgi/b/bib/bib-idx?c=oaister;page=simple)
Public Library of Science (http://www.publiclibraryofscience.org/)
PubMed Central (http://www.pubmedcentral.nih.gov/)
Romeo project (http://www.lboro.ac.uk/departments/ls/disresearch/romeo/index.html)
SHERPA project (http://www.sherpa.ac.uk/index.html)
TARDIS project (http://tardis.eprints.org/)

OpCit Web site

Results of the Citebase evaluation, final reports to funding agencies and other concluding works will appear first on the Open Citation Project Web site (http://opcit.eprints.org/), which also leads to Citebase, EPrints, Cornell's Digital Library services and full details of all OpCit research.