Early Opcit: The Les Carr archives

Report on early link demos and directions

Date: Mon, 02 Aug 1999 15:57:40 +0100
From: Leslie Carr <lac@ecs.soton.ac.uk>
To: harnad@ecs.soton.ac.uk, sh94r@ecs.soton.ac.uk, wh@ecs.soton.ac.uk,
        ginsparg@qfwfq.lanl.gov, halpern@cs.cornell.edu, lagoze@cs.cornell.edu,
        wya@cs.cornell.edu, ijones@bcs.org.uk, eric@hellman.net,
        lawrence@research.nj.nec.com, vandesompel@rug.ac.be,
        kurt@research.nj.nec.com, friedman@highwire.stanford.edu,
        giles@research.nj.nec.com, hoc@SLAC.Stanford.EDU,
Subject: EprintLinks (Opcit) project: report on some early work and directions

During July 1999 some proof-of-concept work was undertaken on the EprintLinks (Opcit) project. This document is a brief report on the state of this initial work. (For a simple diagram showing the kind of result that the project is trying to achieve, see
http://www.ecs.soton.ac.uk/~lac/EPrintLinkdemo.gif .)

XXX (ArXiv physics) Citation Strategies
There are a number of ways to provide links between the citations and the cited preprints. The essential part of this process is the ability to map between a journal citation (e.g. Phys. Rev. D. 56, 6336) and the appropriate XXX reference (e.g. astro-ph/907075). The responsibility for declaring the relationship between a citation and the archive holdings can be borne by

(a) the author of the article directly citing the XXX reference code
(b) an external agency maintaining a manually entered database i.e. (SLAC) SPIRES
(c) an external software agency maintaining a citation database derived from the article contents
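
Whichever of (a), (b) or (c) supplies the data, the mapping itself amounts to a lookup from a normalised journal citation to an archive identifier. The following is a minimal sketch of that idea in Python, not the project's software: the table `CITATION_TO_EPRINT`, the helper names and the "journal volume, page" citation shape are all assumptions made for illustration, and the only table entry reuses the example pair quoted above.

```python
import re

# Hypothetical lookup table: (journal, volume, page) -> archive identifier.
# The single entry reuses the example citation and identifier from the text.
CITATION_TO_EPRINT = {
    ("phys rev d", "56", "6336"): "astro-ph/907075",
}

# Assumed citation shape: "<journal name> <volume>, <page>".
CITATION_RE = re.compile(
    r"^\s*(?P<journal>.+?)\s+(?P<volume>\d+)\s*,\s*(?P<page>\d+)\s*$"
)


def normalise_journal(name):
    """Lower-case the journal name and drop punctuation and extra spaces."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", name.lower()).split())


def parse_citation(text):
    """Return a (journal, volume, page) key, or None if the text does not fit."""
    match = CITATION_RE.match(text)
    if match is None:
        return None
    return (
        normalise_journal(match.group("journal")),
        match.group("volume"),
        match.group("page"),
    )


def lookup_eprint(text, table=CITATION_TO_EPRINT):
    """Map a journal citation string to an archive identifier, if one is known."""
    key = parse_citation(text)
    if key is None:
        return None
    return table.get(key)
```

Normalising the journal name before the lookup means that variants such as "Phys. Rev. D. 56, 6336" and "Phys Rev D 56,6336" resolve to the same key; a real table would need a much richer treatment of journal abbreviations.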

Once a process performing this mapping is defined, the appropriate hypertext links can be embedded directly into a suitable (viewable) version of the article by the Open Journal linking software or by adding hyperdvi commands to the original source.

Some initial work has been undertaken on the XXX archive using the processed articles in the postscript cache. This has allowed us to develop a prototype of (c), the software that reads the articles, parses the references and deduces the journal citations which are capable of being linked to eprints. An example of this can be seen at
(references section begins on p11). The five links on p11 have been automatically added to the PDF document under the control of the software reference reader.

A similar procedure can be followed using (b), software that reads the WWW page of the SPIRES database and uses that data to automatically add links. Identical results are obtained from these two methods for this
example article. In practice SPIRES will be more accurate, at least compared with our current prototype; however (c) can be used in areas outside High Energy Physics and can be usefully applied to new eprint submissions.
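
To make approach (c) more concrete, here is a rough Python sketch of the "read the article, find the references, pick out the citations" steps. It is an illustration under simplifying assumptions, not the prototype's actual scripts: it assumes the article has already been converted to plain text, that the references follow a heading line reading "References" or "Bibliography", and that entries are numbered "[1]", "[2]" and so on. The regular expressions and function names are hypothetical, and the author name in the sample is a placeholder.

```python
import re

# Assumed layout: a heading line, then entries numbered "[1]", "[2]", ...
HEADING_RE = re.compile(
    r"^\s*(?:references|bibliography)\s*$", re.IGNORECASE | re.MULTILINE
)
ENTRY_RE = re.compile(r"^\s*\[\d+\]\s*", re.MULTILINE)

# Journal citation: capitalised (possibly abbreviated) words, volume, comma, page.
JOURNAL_RE = re.compile(
    r"([A-Z][A-Za-z]*\.?(?:\s+[A-Z][A-Za-z]*\.?)*)\s+(\d+)\s*,\s*(\d+)"
)
# Direct archive citation such as "astro-ph/907075".
EPRINT_RE = re.compile(r"\b([a-z]+(?:-[a-z]+)?/\d{6,7})\b")


def split_references(article_text):
    """Return the entries of the references section, or [] if none is found."""
    heading = HEADING_RE.search(article_text)
    if heading is None:
        return []
    body = article_text[heading.end():]
    # Collapse internal whitespace so multi-line entries become one line each.
    parts = (" ".join(part.split()) for part in ENTRY_RE.split(body))
    return [part for part in parts if part]


def extract_citations(entry):
    """Return (journal citations, direct archive identifiers) for one entry."""
    journals = [
        (" ".join(journal.split()), volume, page)
        for journal, volume, page in JOURNAL_RE.findall(entry)
    ]
    return journals, EPRINT_RE.findall(entry)


# Placeholder text; only the citation and identifier come from this report.
SAMPLE = """Some body text.

References
[1] A. N. Author, Phys. Rev. D. 56, 6336.
[2] astro-ph/907075.
"""

entries = split_references(SAMPLE)
citations = [extract_citations(entry) for entry in entries]
```

Patterns this simple will miss many real reference styles (unnumbered entries, two-column layouts, year-first citations), which is the kind of limitation that makes a curated source such as SPIRES more accurate.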

Next Stages
Previous experience indicates that it may be easier to work with dvi/ps/pdf than with the range of TeX input formats. (Most of the problems come from the fact that TeX is a programming language, not a
document description language.) If this is truly the case then attention must be paid to the fact that the postscript/dvi cache constitutes only some 3% of the total archive. The EprintLink (Opcit) software would need to independently process the articles to work with the reference sections of the full database.

An important premise of the EprintLink (Opcit) project is that the archive readers should be able to navigate directly between viewable articles using citation links. To accomplish this it will be necessary to provide PDF format as standard. The results of the processing required above should be made available directly to the users, and not just held as part of the internal linking process.

Although (b) can be performed interactively when the user requests a particular article (as part of an Open Linking Service), it would be more helpful for developmental and user-testing purposes to use a mirrored copy of the SPIRES data.

The major part of the work will be in bringing (c) up to better levels of performance. At the moment it is based on some very simple scripts which do not deal correctly with all the complexities of
the article formats (e.g. two columns). Of the 2523 articles that we are using from the cache, 80% have a recognisable references section. Of those, 96% are successfully parsed into a database of references but
only 48% seem to yield citations to XXX (arXiv). We have a list of 6243 citation links generated by (c) together with a list of some 11000 direct XXX (arXiv) citations provided by (a).
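
As a rough guide to what these percentages mean in article counts, the stages can be chained as below. This is a back-of-envelope sketch only: it assumes that each percentage applies to the articles surviving the previous stage (the report does not state the base for the 48% figure), and because the percentages are rounded the counts are approximate. The function and stage labels are illustrative names.

```python
def funnel(total, stage_rates):
    """Apply each success rate to the survivors of the previous stage."""
    counts = []
    current = float(total)
    for label, rate in stage_rates:
        current *= rate
        counts.append((label, round(current)))
    return counts


# Figures quoted in the report; the chaining of the last stage is an assumption.
STAGES = [
    ("recognisable references section", 0.80),
    ("parsed into a database of references", 0.96),
    ("yield citations to XXX (arXiv)", 0.48),
]

for label, count in funnel(2523, STAGES):
    print(f"{label}: about {count} articles")
```

Under that reading, roughly 2018 of the 2523 cached articles have a recognisable references section, about 1938 are parsed, and about 930 yield citations to the archive.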

Currently (c) is intended to work directly within the confines of the LANL archive. Applying SLinkS would also allow us to provide links to publishers' holdings.

Other bibliometric work is in very early stages. About half of the archive is declared as having been subsequently published in a journal (according to the presence of a Journal-ref: field in the listing).
Initial analysis of the rate of change of the article meta-data indicates that there is an average lag of about 11 months between submitting a preprint and declaring it published. Exactly what is happening in an eprint's life-cycle is not yet clear, and we would like to do a lot more work in this area along with the development of new bibliographic measures.
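
The two measures mentioned here (the proportion of articles with a Journal-ref: field, and the lag before that field appears) could be computed along the following lines. This is a sketch under assumptions, not the analysis actually performed: it assumes that each article can be reduced to a pair of dates (submission, and first appearance of the Journal-ref: field, or `None` if there is none), and the sample records are invented purely to exercise the functions.

```python
from datetime import date


def months_between(start, end):
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def publication_stats(records):
    """Return (fraction declared published, mean lag in months or None).

    records: list of (submitted, journal_ref_added) date pairs, where
    journal_ref_added is None if no Journal-ref: field has appeared.
    """
    lags = [
        months_between(submitted, declared)
        for submitted, declared in records
        if declared is not None
    ]
    fraction = len(lags) / len(records) if records else 0.0
    mean_lag = sum(lags) / len(lags) if lags else None
    return fraction, mean_lag


# Invented records for illustration only; they are not archive data.
SAMPLE_RECORDS = [
    (date(1997, 1, 15), date(1997, 10, 1)),   # 9 months
    (date(1997, 3, 1), None),                 # not declared published
    (date(1998, 2, 10), date(1999, 4, 20)),   # 14 months
    (date(1998, 6, 5), None),                 # not declared published
]

fraction, mean_lag = publication_stats(SAMPLE_RECORDS)
```

Counting in whole calendar months keeps the sketch simple; a finer-grained study of an eprint's life-cycle would more likely work with the full revision history of each record.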

Leslie Carr
Tel: +44 1703 594479            Fax: +44 1703 592865
Email: L.Carr@ecs.soton.ac.uk   URL: http://www.ecs.soton.ac.uk/~lac
ACM Member: 5135934             IEEE Member: 40323275
Dept of Electronics and Computer Science, University of Southampton SO17

Follow-up: Current link demos, presented for evaluation