Metalist of open access eprint archives: the genesis of institutional archives and independent services

Steve Hitchcock
Intelligence, Agents, Multimedia Group, Southampton University

Open access eprint archives, where authors of published research papers can self-archive their work for all to see, pose a challenge to journal publishers. Researchers want to improve access to papers while preserving the recognised quality control established by journals. Open access archives will cause journals to review their business models and focus on adding new, digital features.

What is the scale of this challenge currently? Despite the rhetoric there are no quantitative studies. It can't be that difficult to produce a list of open access archives, surely? Actually, it is harder than might be imagined, not just because of the growing scale and sheer number of open access archives, but because of the evolving structure of distributed archives and independent services. Unlike journals, which are by design distinct and bounded entities (a collection of papers bounded by an editorial framework enforced by peer review standards), Web-based open access archives are built not simply as collections for browsing but also as open data sources for powerful, automated independent services such as search, aggregation and impact measurement. For this reason open access archives do not need a user interface, although most do have one. From a prospective reader's viewpoint (or that of someone surveying these archives), an archive may have no independent presence other than through a service interface.

The critical infrastructure required to support distributed archives and independent data services was introduced by the Open Archives Initiative (OAI) with its Protocol for Metadata Harvesting (PMH) in January 2001 (Lynch 2001). Tomaiuolo and Packer (2000) provided a checklist of disciplinary 'preprint' archives that, because OAI was then in its infancy, recognised the likely influence of cross-archive services such as search but could not have detected the growth in institutional archives that OAI has subsequently motivated.

So a new checklist is warranted, but a list of open access eprint archives, and examination of their contents, is insufficient as a measure of the challenge. It is important to look through the lens of archive service providers too.

Thus, this is not a list of individual open access archives of full-text research papers, but instead lists and comments on other lists of individual archives. This list and its categorisation give a broad overview of the structure, size and progress of full-text open access archives, and are intended to be useful for further quantitative research on the open access archive phenomenon.

Overview

There are many different types of archive. One principal distinction is between subject-based, disciplinary archives and institutional archives. Both disciplinary and institutional archives can be preprint (pre- journal publication versions of papers) or eprint archives (which can include successive versions of papers pre- and post-publication, but are primarily distinguished by inclusion of post-publication versions). This difference is often ignored or incorrectly glossed over. Many of the archives of interest in this study are characterised by containing papers that have been self-archived (i.e. deposited) by their authors.
All of these types of archive can be found in the General lists of open access eprint (full-text) archives.

Until 1999 many institutionally-based archives would have had a departmental bias and contained technical reports (TRs), the Guild Model identified by Kling et al. (2002). Since then the Open Archives Initiative (OAI) has given momentum to a new type of institutional archive that contains eprints of published (refereed) journal papers produced within research and educational institutions. OAI archives can be disciplinary or institutional, but the OAI's primary contribution has been to motivate new institutional archives. Not all OAI archives serve full-text papers, and it is definitely not a pre-condition of compliance with OAI that the items described by OAI metadata are openly or freely accessible.
Both types of archive, full-text and non-full-text, can be found in OAI archives.

Where TR archives were essentially separate archives that could be indexed (see for example the Unified Computer Science Technical Report Index (UCSTRI) list of sites, one of the first TR indexes on the Web) but had to be accessed and searched separately for each institution or department, the OAI-PMH enables independent services to provide common search and browse interfaces covering many archives. To give users an idea of scope and coverage, these automated services typically provide useful details of the indexed archives.
Find more details of OAI archives in OAI services-based lists of archives.
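
How an independent service might harvest from many archives through one common interface can be illustrated with a minimal sketch. It assumes version 2.0 of the OAI-PMH with unqualified Dublin Core (oai_dc) metadata. The base URL, the sample response and the function names are hypothetical, chosen only for illustration, and a real harvester would also need to fetch the URL, follow resumption tokens and handle protocol errors.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces used by OAI-PMH 2.0 responses and by Dublin Core elements.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request for an archive's OAI base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("{%s}record" % OAI_NS):
        ident = rec.findtext("{%s}header/{%s}identifier" % (OAI_NS, OAI_NS))
        # Take the first dc:title, if the record has one.
        title = next((e.text for e in rec.iter("{%s}title" % DC_NS)), None)
        records.append((ident, title))
    return records


# Hypothetical response fragment, standing in for what an archive returns.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical eprint</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://archive.example.org/oai"))
    print(parse_records(SAMPLE))
```

Because every compliant archive answers the same request in the same format, a service only has to loop over a list of base URLs to build a cross-archive index, which is what separates this model from the per-site access required by the earlier TR archives.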

Some lists focus on institutional archives as the most likely area for growth of open access, OAI-based eprint archives.
See Lists of institutional archives

Institutional archives can be distinguished by the type of software used to build the archives. As can be deduced from the lists of institutional archives, the software most widely used for this is produced by Eprints.org (also known as GNU EPrints from version 2 of the software, to indicate its availability as open source software under the GNU licence). Eprints.org-based archives are mostly institutional, but not exclusively so. The Cogprints disciplinary archive was built with software that evolved to become Eprints.org. Other types of archive software are becoming available, and no doubt there will soon be lists of archives supported by these packages. Whichever software is chosen, these packages invariably produce archives that comply with the OAI, so this list will overlap with the OAI list above.
For now see Eprints.org archives.

It is not the intent in this paper to list individual institutional archives extensively, although a few are chosen to highlight different implementation models, described by Tennant (2002), adopted within institutions to motivate the uptake of archive services across the range of cultures and disciplines found within academic institutions.
See Institutional archives.

OAI services were not the first to introduce unified search and browse interfaces for archives. Various gateway services preceded these. While not archives in their own right, these services are important for the way in which they have enabled the structure of different archives to evolve. Some gateways are based on the largest archives, in this case the physics, maths and computer science archives at arXiv. For example, a number of previously independent maths archives merged with arXiv without loss of functionality or focus due to interfaces such as the Front for the Mathematics ArXiv. Other services combine searches on high-energy physics and astronomy in arXiv with bibliographic sources.
See Centralising subject-based archive gateways.

Gateways have not exerted solely a centralising influence, and in two notable examples, RePEc (Research Papers in Economics) and NCSTRL (Networked Computer Science Technical Reference Library), can be found forerunners of the distributed OAI model: independent archives, indexes and databases. RePEc is a large database of papers, an "Open Library", open to contributions and providing open data for user services (Krichel 2000). Interpretations vary on the proportion of material available as full texts from the constituent archives of 'working papers', but RePEc is claimed to be the "second-largest source of freely downloadable scientific preprints" after arXiv. The growth and appeal of NCSTRL appears to have been limited by the large administrative, maintenance and metadata overhead imposed on participating institutional archives, a lesson learnt by the OAI designers who wanted a simpler, more widely accepted standard metadata format describing the contents of archives. NCSTRL is being converted into an OAI-compliant index.
See Decentralising archive gateways: the Economics network (RePEc) example.

Perhaps one of the more surprising developments in the wider context of full-text archives is the growth of open access journal archives. Papers in these archives are not deposited by authors but by journal publishers. Mostly this is focussed on biomedical journals, and was initiated by PubMed Central, the US National Library of Medicine's site, which has grown significantly after a slow start, and makes copies of subscription-based journals available some time after publication. HighWire Press, a large producer of biomedical e-journals, similarly makes delayed copies of journal papers available free. Unlike PubMed Central and HighWire, the publisher BioMed Central has pioneered a new business model of original open access journals funded through author and institutional payments for review and publication. For some in this field the progress represented by these examples is not enough, and they will be joined by new open access journals from the Public Library of Science (PLoS). The model adopted by BioMed Central and PLoS has been endorsed by the Budapest Open Access Initiative (BOAI), which by supporting both open access archives and journals has reinvigorated the cause and adoption of services providing open access to full-text research papers. There are other distinctive and successful journal-archive models, such as Advances in Theoretical and Mathematical Physics, a journal 'overlay' of some arXiv physics archives that has published high-impact papers. Open access journals per se, without an archive connection, are not included here.
See Open access journal archives.

Although it is not intended to list individual archives, some disciplinary archives are significant enough to be included in their own right. These archives demonstrate a wide range of types, from the ubiquitous arXiv, to the large Citeseer autonomously indexed collection of computer science papers mostly cached from authors' personal Web pages, to publisher-sponsored preprint collections, as well as smaller, specialised archives.
See Disciplinary archives (not exhaustive).

For a chronological view of the development of open access institutional archives in the wider context of free online scholarship (FOS), including many of the services and archives listed here, see Suber's Timeline of the FOS Movement.

This commented version of the archives metalist is just a snapshot of an emerging new phenomenon, of distributed institutional archives with real and growing open access content including published research papers. The engine for growth of these archives is the recognition by researchers and policy-makers that the improved impact achieved through open access, demonstrated by Lawrence (2001), is not only desirable but entirely compatible with peer-reviewed publication. The core metalist will be maintained and updated on the Explore Open Archives section of the Open Citation Project Web site.

This list includes sources that were considered to be either current or recently updated at the time of the investigation in March 2003.

General lists of open access eprint (full-text) archives

Open Directory Project, Free Access Online Archives (60 archives listed, last update 16 March 2003)
Electronic Archives "providing free and unrestricted access to peer reviewed scientific papers and academic publications" http://dmoz.org/Science/Publications/Archives/Free_Access_Online_Archives/

HighWire Press, Earth's Largest Free Full-Text Science Archives (20 archives), list produced to highlight HighWire's Free Online Full-text Articles (see Open access journal archives) as the largest such archive
http://highwire.stanford.edu/lists/largest.dtl

University of Maryland Libraries, Virtual Technical Reports Center: EPrints, Preprints, & Technical Reports on the Web, "Institutions listed here provide either full-text reports, or searchable extended abstracts of their technical reports". Alphabetical by institution name (last updated March 05, 2003)
http://www.lib.umd.edu/ENGIN/TechReports/Virtual-TechReports.html

University of Virginia Science and Engineering Libraries, Preprint Servers and Databases (33 archives, last modified: January 13, 2003) pointers to a variety of electronic pre-print sources in all areas of science and engineering http://viva.lib.virginia.edu/science/guides/s-preprn.htm

Tardis (JISC FAIR project 2002- ), E-print and Related Archives with Subject and Institutional Categories Identified (first posted January 2003). Institution, Multi-institution, Subject and Multidisciplinary archives
http://tardis.eprints.org/discussion/eprintarchivessubjecttable9103.htm

Aardvark, Asian Resources for Libraries, Free preprint and full text science archives (115 archives, viewed 20 March 2003)
http://www.aardvarknet.info/user/subject19/index.cfm?all=All

American Mathematical Society (AMS), Directory of Mathematics Preprint and e-Print Servers http://www.ams.org/global-preprints/

Astronomy Preprints & Abstracts, linked list of sites, includes institutional preprint servers (56 archives, viewed 20 March 2003) http://www.cv.nrao.edu/fits/www/yp_preprint.html
 

OAI archives

Open Archives Initiative, registered data providers, "conforming repositories" (75 archives, viewed 20 March 2003). Sites found still to be using OAI 1.1 on 2002/12/01 were purged from this list
http://www.openarchives.org/Register/BrowseSites.pl

Open Archives Forum, List of Repositories (20 archives, viewed 20 March 2003). No reasons for selection given
http://www.oaforum.org/oaf_db/list_db/list_repositories.php

OAI services-based lists of archives

Celestial, Open Archives gateway that harvests and caches metadata from OAI-PMH repositories and makes these data available for other services to harvest, includes number of records in repository and metadata namespace
http://celestial.eprints.org/cgi-bin/status

OAIster, serving 1,093,169 records from 144 institutions (updated 21 February 2003) http://oaister.umdl.umich.edu/o/oaister/viewcolls.html

Arc, an experimental cross-archive search service, List of Existing Archives
http://arc.cs.odu.edu:8080/oai/admin.jsp

my.OAI, user customisable search engine covering selected metadata databases from the OAI, see forms-based list of databases in guest search interface
http://www.myoai.com/search/Search.cgi/LoginForm?Login=guest&Password=guest

Open Archives Initiative - Repository Explorer, Virginia Tech interface to test archives interactively for compliance with the OAI-PMH, see forms-based predefined archive list in Explorer interface http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai

Public Knowledge Project, Open Archives Harvester (12 archives, viewed 20 March 2003). Listed archives have to request harvesting
http://www.pkp.ubc.ca/harvester/archives.php

Lists of institutional archives

SPARC Select List of Institutional Repositories, international, lists type of content (mostly preprints, published papers), software used (13 of 26 repositories listed use EPrints.org), URL of repositories
http://www.arl.org/sparc/core/index.asp?page=m1

Eprints.org archives

GNU EPrints supports the development of institutional repositories. All the repositories known to have been built using the first two version releases of this software are listed in these two lists:
EPrints 2 Archives http://software.eprints.org/#ep2 (36)
EPrints 1 Archives http://software.eprints.org/#ep1 (28)
 

Institutional archives

University of California eScholarship Repository, offers faculty a central location for depositing any research or scholarly output deemed appropriate by their participating research unit, center, or department, including working papers and pre-publication scholarship, a service of the California Digital Library http://repositories.cdlib.org/escholarship/

Caltech Collection of Open Digital Archives (CODA), includes more than 10 repositories in production or in development http://library.caltech.edu/digital/

The Information Bridge provides the open source to full-text and bibliographic records of Department of Energy (DOE) research and development reports in physics, chemistry, materials, biology, environmental sciences, energy technologies, engineering, computer and information science, renewable energy, and other topics. The Information Bridge consists of full-text documents produced and made available by the Department of Energy National Laboratories and grantees from 1995 forward. Additional legacy documents are also included as they become available in electronic format http://www.osti.gov/bridge/
see also PrePRINT Network

Gateways (indexes, unified search and browse of covered sites)

Centralising subject-based archive gateways

ArXiv search interfaces
NASA ADS ArXiv Preprints Query Form http://adsabs.harvard.edu/preprint_service.html
Front for the Mathematics ArXiv, alternative arXiv interface http://front.math.ucdavis.edu/

SLAC SPIRES HEP literature database contains more than 500,000 high-energy physics related articles indexed by the SLAC and DESY libraries since 1974 http://www.slac.stanford.edu/spires/hep/

Citebase, citation-ranked search and impact discovery for arXiv (also covers CogPrints and BioMed Central) http://citebase.eprints.org/help/coverage.php

NASA ADS
ArXiv Preprints Query Form http://adsabs.harvard.edu/preprint_service.html
Harvard-Smithsonian Center for Astrophysics Preprints (CfA) Preprints Query Form http://adsabs.harvard.edu/cfa/preprints.html

CERN Document Server (CDS), searchable Web interface to over 550,000 bibliographic records, including 220,000 fulltext documents in particle physics and related areas, covers preprints, articles, books, journals, photographs ... http://weblib.cern.ch/
Results include reference links (including journal links to publisher site, abstract, summary only, not OpenURL) and cited by, but cannot search or rank by citations
CDS services include:

PhysDoc - Physics Documents Worldwide - offers lists of links to document sources, such as preprints, research reports, annual reports, and list of publications of worldwide distributed physics institutions and individual physicists, ordered by continent, country and town http://de.physnet.net/PhysNet/physdoc.html

MPRESS, The Mathematics Preprint Search System, a searchable index of preprints from 10 servers, mostly covering geographical servers, but also disciplinary servers including Topology Atlas, Algebraic Number Theory Archives (frozen since Jan 2003) and K-theory Preprint Archives, as well as the mathematics part of the arXiv mirror at Augsburg http://mathnet.preprints.org/

PrePRINT Network, Department of Energy's searchable gateway to preprint servers that deal with scientific and technical disciplines of concern to DOE: physics, materials, and chemistry, as well as portions of biology, environmental sciences and nuclear medicine. Browse sites at http://www.osti.gov/preprints/ppnbrowse.html
see also Information Bridge

NTRS, NASA Technical Reports Server, search interface for 18 databases http://techreports.larc.nasa.gov/cgi-bin/NTRS
 

Decentralising archive gateways: the Economics network (RePEc) example

RePEc is a decentralized database of working papers, journal articles and software components, holding over 177,000 items of interest, over 86,000 of which are available online (27 Feb 2003) http://repec.org/

The following services provide access to all or part of the RePEc database for browse or search:

RePEc Archives

Current archive providers to RePEc http://ideas.repec.org/archives.html
Participating institutions provide over 1000 RePEc series (many of the top series are journal series or smaller databases). LogEc list of the top 25 RePEc series last month http://logec.hhs.se/scripts/seriesstat.pl

Working Papers in Economics

WoPEc, all papers in WoPEc are downloadable but not necessarily free (contains over 80,000 documents in electronic format: 52130 Working Papers, 41741 Journal Articles, 27 Feb 2003) http://netec.mcc.ac.uk/WoPEc.html

Among the largest contributing RePEc archives are the following working paper archives:

Based on RePEc, Documents in Information Science (DoIS) is a database of articles and conference proceedings published in electronic format in the area of Library and Information Science, holds about 10042 articles and 3045 conference proceedings, 6928 of which are downloadable (28th February 2003) http://dois.mimas.ac.uk/

A more broadly based database, rclis (Research in Computing, Library and Information Science) is in development

Networked Computer Science Technical Reference Library (NCSTRL) is being developed into a sustainable OAI conformant framework in a collaborative project involving NASA Langley, Old Dominion University, University of Virginia and Virginia Tech http://www.ncstrl.org/

Networked Digital Library Of Theses And Dissertations (NDLTD) http://www.ndltd.org/
 

Open access journal archives

BioMed Central (120 journals at 20 Feb 2003) http://www.biomedcentral.com/start.asp

PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of life sciences journal literature (52 participating journals at 20 Feb 2003) http://pubmedcentral.nih.gov/

HighWire Press Free Online Full-text Articles (list limited to journals published online with the assistance of HighWire Press). As of 2/28/03, 472,871 free full-text articles are included in the Free Online Full-text Articles from 1,358,713 total articles http://highwire.stanford.edu/lists/freeart.dtl

Advances in Theoretical and Mathematical Physics is an overlay of the arXiv archives. All papers are archived at LANL and its mirror sites. ATMP maintains only links to the above archive thus realising the first e-journal as an overlay to the global e-print archives http://www.intlpress.com/journals/ATMP/

BBS Prints Interactive Archive of the journal Behavioral and Brain Sciences containing original refereed 'target' papers, open peer commentary and responses (OAI compliant, Eprints.org journal archive) http://www.bbsonline.org/

Psycoloquy, articles and peer commentary in all areas of psychology as well as cognitive science, neuroscience, behavioral biology, artificial intelligence, robotics/vision, linguistics and philosophy (Eprints.org archive) http://psycprints.ecs.soton.ac.uk/

Disciplinary archives (not exhaustive)

arXiv http://www.arXiv.org/ abstracts include links to citation analysis for the paper by SLAC SPIRES and Citebase. Stores over 230,000 papers

Citeseer (aka ResearchIndex), indexes Postscript and PDF research articles on computer science on the Web, and provides autonomous citation indexing, caches copies of freely available papers. Developed by NEC Research Institute, it is claimed to index over 500,000 papers. Not yet OAI compliant, but planned to become so http://citeseer.nj.nec.com/cs

eBizSearch, based on Citeseer, autonomously creates citation indexes of e-commerce literature. The search engine crawls Web sites of universities, commercial organizations, research institutes and government departments to retrieve academic articles, working papers, white papers, consulting reports, magazine articles, and published statistics and facts. For certain documents, the database only stores the hyperlinks to those documents. eBizSearch performs a citation analysis of all the academic articles accessed http://gunther.smeal.psu.edu/

Cognitive Science

Maths

Library and Information Science (LIS)

Publisher supported preprint archives

Other disciplinary archives

References and links used in the commentary

Budapest Open Access Initiative http://www.soros.org/openaccess/

Kling, Rob, Lisa Spector and Geoff McKim (2002) "Locally Controlled Scholarly Publishing via the Internet: The Guild Model". SLIS Indiana University, Center for Social Informatics, Working Paper No. WP-02-01 http://www.slis.indiana.edu/csi/WP/WP02-01B.html
also in Proceedings of the 2002 Annual Meeting of the American Society for Information Science and Technology, Philadelphia, PA, November, and Journal of Electronic Publishing, Vol. 8, No. 1, August http://www.press.umich.edu/jep/08-01/kling.html

Krichel, Thomas (2000) "RePEc, an Open Library for Economics". March
http://openlib.org/home/krichel/papers/salisbury.html

Lawrence, Steve (2001) "Free Online Availability Substantially Increases a Paper's Impact". Nature Web Debate on e-access, May
http://www.nature.com/nature/debates/e-access/Articles/lawrence.html

Lynch, Clifford A (2001) "Metadata Harvesting and the Open Archives Initiative". ARL Bimonthly Report, No. 217, August
http://www.arl.org/newsltr/217/mhp.html

Open Citation Project, Explore Open Archives http://opcit.eprints.org/explorearchives.shtml

Public Library of Science, Journals http://www.publiclibraryofscience.org/journals.htm

Suber, Peter (2002) Timeline of the Free Online Scholarship Movement http://www.earlham.edu/~peters/fos/timeline.htm

Tennant, Roy (2002) "Institutional Repositories". Library Journal, 15 September 2002  http://libraryjournal.reviewsnews.com/index.asp?layout=article&articleid=CA242297&display=Digital+LibrariesNews&industry=Digital+Libraries&industryid=3760&verticalid=151

Tomaiuolo, Nicholas G. and Packer, Joan G. (2000) "Preprint Servers: Pushing the Envelope of Electronic Scholarly Publishing". Searcher, Vol. 8, No. 9, October
http://www.infotoday.com/searcher/oct00/tomaiuolo&packer.htm

Unified Computer Science Technical Report Index (UCSTRI)  http://www.cs.indiana.edu/ucstri/sitelist.html