
Comparison Between When Articles Were Published And When They Were Submitted To The Archive


Written by Tim Brody, last updated on August 31 2000 10:09:19.

What data was analysed:

The test data is initially extracted from the directory hierarchy of the Los Alamos archive, e.g.:

/math-ph/papers/0001/0001027.abs

In addition to the directory hierarchy, the individual abstracts were opened and relevant data extracted from them.


How the data was extracted:

A directory listing was taken of the XXX directory hierarchy:

dir -R -l > d_listing

This created a file listing all the files in the archive.

This was parsed into another file, d_papers, which contains just the abstract and article file names. During parsing, the file type, the article reference, the date the article was first submitted and the file location are extracted:

date_submission type paper_ref first_submit filename
ftp abs physics/0006007 200006 ftp/physics/papers/0006/0006007.abs

  • Root directory [used when analysing XXX updates]
  • File format, e.g. abs = abstract
  • Paper reference
  • Date of first submission (yyyymm)
  • File location

The date of first submission is stored in the directory hierarchy in "yymm" format. This is converted to "yyyymm" by adding 1900 if the year is greater than 50, or 2000 if the year is less than 50.
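
As an illustration of the step above, the following is a minimal sketch in Python (not the script actually used, which is not shown on this page). The path layout is inferred from the single example row, the names PATH_RE, expand_yymm and parse_path are hypothetical, and treating a two-digit year of exactly 50 as 2050 is an assumption, since the text does not cover that case.

```python
import re

# Assumed layout, inferred from the example row:
#   <root>/<archive>/papers/<yymm>/<yymm><number>.<type>
PATH_RE = re.compile(
    r"^(?P<root>[^/]+)/(?P<archive>[^/]+)/papers/"
    r"(?P<yymm>\d{4})/(?P<name>\d+)\.(?P<type>\w+)$"
)

def expand_yymm(yymm):
    """Convert 'yymm' to 'yyyymm' using the pivot year described in the text."""
    yy = int(yymm[:2])
    # Assumption: a year of exactly 50 is treated as 20xx.
    century = 1900 if yy > 50 else 2000
    return f"{century + yy}{yymm[2:]}"

def parse_path(path):
    """Return the five d_papers fields for one path, or None if it does not match."""
    m = PATH_RE.match(path)
    if m is None:
        return None
    paper_ref = f"{m['archive']}/{m['name']}"
    return (m["root"], m["type"], paper_ref, expand_yymm(m["yymm"]), path)

print(" ".join(parse_path("ftp/physics/papers/0006/0006007.abs")))
# ftp abs physics/0006007 200006 ftp/physics/papers/0006/0006007.abs
```

Paths that do not fit the assumed layout return None, so they can be skipped rather than guessed at.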

The d_papers file is then parsed into d_publishedcount. This contains only the abstract files.

abs cond-mat/9702146 199702 ftp/cond-mat/papers/9702/9702146.abs 1997 3 0 1 1997,1997,1997,1998,

  • File format
  • Paper reference
  • First submission date
  • File name
  • Publication date
  • Submissions before publication date
  • Submissions in same year as publication date
  • Submissions following publication date
  • List of years of submissions [mainly for debugging purposes]
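
The three submission counts can be derived from the publication year and the list of submission years. The sketch below, in Python, shows one way this might be done; the function name count_submissions and the sample data are hypothetical, and the order of the returned values simply follows the field list above.

```python
def count_submissions(publication_year, submission_years):
    """Return (before, same_year, after) counts relative to the publication year."""
    before = sum(1 for y in submission_years if y < publication_year)
    same_year = sum(1 for y in submission_years if y == publication_year)
    after = sum(1 for y in submission_years if y > publication_year)
    return before, same_year, after

# Hypothetical article published in 1998 with four submissions.
print(count_submissions(1998, [1996, 1997, 1998, 1999]))  # (2, 1, 1)
```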

For information on the algorithm used to extract the date from the Journal-ref field please see below.

To generate the graphs and other analytical data this file is parsed in a number of ways, for example summing together all the publications in a single year.
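
As an example of such a pass, the following Python sketch sums the records in d_publishedcount by publication year. It assumes the fields are separated by whitespace and that the publication year is the fifth field, as in the example row; the function name publications_per_year is hypothetical, and two of the three sample rows are made up for illustration.

```python
from collections import Counter

def publications_per_year(lines):
    """Count records per publication year (fifth whitespace-separated field)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or not fields[4].isdigit():
            continue  # skip blank or malformed lines
        counts[int(fields[4])] += 1
    return counts

sample = [
    "abs cond-mat/9702146 199702 ftp/cond-mat/papers/9702/9702146.abs 1997 3 0 1 1997,1997,1997,1998,",
    # The two rows below are invented for illustration only.
    "abs example/9712001 199712 ftp/example/papers/9712/9712001.abs 1998 1 1 0 1997,1998,",
    "abs example/9703002 199703 ftp/example/papers/9703/9703002.abs 1997 0 1 0 1997,",
]
for year, n in sorted(publications_per_year(sample).items()):
    print(year, n)
# 1997 2
# 1998 1
```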


Accuracy:

For an individual article the data can be accurate to one year at best, because authors predominantly give only the year of the journal's publication (along with the journal name, issue number and so on). If the author has not recorded that the article has been published, it is impossible to ascertain whether it has been; any such records are ignored.

The data analysed covers a period of 10 years and all the submissions in that period; earlier articles are therefore likely to have had more updates than those later in the period.


Parsing the Journal-ref Field

In Perl:

# Return the most likely publication year found in a Journal-ref
# string, or undef if no plausible year is present.
sub extract_year {
    local $_ = shift;
    my @years;
    # Push in years in brackets, e.g. (1999) or [1999]
    while ( s/[\(\[](\d\d\d\d)[\)\]]// ) {
        push(@years, $1);
    }
    # Push in any years at beginning or end of line
    s/\D(\d\d\d\d)$// && push(@years, $1);
    s/^(\d\d\d\d)\D// && push(@years, $1);
    # Push in remaining four-digit values
    while ( s/\D(\d\d\d\d)\D// ) {
        push(@years, $1);
    }
    # Push in two-digit years, e.g. '99 or (99)
    while ( s/\'(\d\d)\D// || s/\((\d\d)\)// || s/\'(\d\d)$// ) {
        if ( $1 > 50 ) {
            push(@years, 1900 + $1);
        } else {
            push(@years, 2000 + $1);
        }
    }
    # Go through the candidates until we find a likely year
    foreach my $year (@years) {
        return $year if ( $year > 1950 ) && ( $year < 2050 );
    }
    return;
}
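
To show what the steps above yield on sample input, here is a hedged line-by-line port to Python. It is a sketch for experimentation, not the code used for the analysis; the helper names _take and _take_all and the sample Journal-ref strings are hypothetical.

```python
import re

def _take(pattern, text):
    """Remove the first match of pattern from text.

    Returns (captured group, remaining text), or (None, text) if no match.
    """
    m = re.search(pattern, text)
    if m is None:
        return None, text
    return m.group(1), text[:m.start()] + text[m.end():]

def _take_all(pattern, text, years):
    """Repeatedly remove matches of pattern, appending each value found."""
    while True:
        found, text = _take(pattern, text)
        if found is None:
            return text
        years.append(int(found))

def extract_year(journal_ref):
    """Return the first plausible publication year, or None."""
    years = []
    # Years in round or square brackets, e.g. (1999) or [1999]
    text = _take_all(r"[(\[](\d{4})[)\]]", journal_ref, years)
    # A four-digit value at the end, then at the beginning, of the line
    for pattern in (r"\D(\d{4})$", r"^(\d{4})\D"):
        found, text = _take(pattern, text)
        if found is not None:
            years.append(int(found))
    # Remaining four-digit values
    text = _take_all(r"\D(\d{4})\D", text, years)
    # Two-digit years such as '99 or (99), tried in the same order as the Perl
    while True:
        for pattern in (r"'(\d\d)\D", r"\((\d\d)\)", r"'(\d\d)$"):
            found, text = _take(pattern, text)
            if found is not None:
                break
        else:
            break  # none of the three patterns matched
        yy = int(found)
        years.append(1900 + yy if yy > 50 else 2000 + yy)
    # The first candidate in the plausible range wins
    for year in years:
        if 1950 < year < 2050:
            return year
    return None

print(extract_year("Phys. Rev. D 59 (1999) 123456"))  # 1999
print(extract_year("1998, Phys Rev 57, 1234"))         # 1998
print(extract_year("Phys. Lett. B '99"))               # 1999
print(extract_year("to appear"))                       # None
```

In the second example the trailing 1234 is collected first but rejected by the range check, so the leading 1998 is returned.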

 

Results:

Submissions to the archive have grown steadily since 1991. The total number of papers is now about 130000.

It is useful to know how the submissions to the archive compare to when those articles were published. The following two graphs show how submission rates for published articles have compared to the rate at which those articles were published.

We can analyse when authors submitted their documents to the archive relative to publication: did they submit before, in the same year as, or after the article was published?

The following graph shows that fewer than half the articles in the archive have metadata stating when the article was published. The remaining articles have either not been published or the author has not updated the information to show that the article has been published. Of the 132218 articles in the archive, 82998 had no metadata pertaining to their publication and 478 had publication information but no indication of the year of publication.
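
The shares implied by these figures can be worked out directly. The short Python calculation below assumes that the 478 articles without a year are counted separately from the 82998 without any publication metadata, which is how the sentence above reads.

```python
total = 132218        # articles in the archive
no_metadata = 82998   # no publication metadata
no_year = 478         # publication information but no year

# Assumption: the two groups above do not overlap.
with_year = total - no_metadata - no_year

rows = [
    ("publication year found", with_year),
    ("publication info, no year", no_year),
    ("no publication metadata", no_metadata),
]
for label, n in rows:
    print(f"{label:28s} {n:7d} {100 * n / total:5.1f}%")
```

On these figures, 48742 articles, a little over a third of the archive, carry a usable publication year.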

This graph refines the one above by splitting the "unpublished" articles into those with a "report-no" field, those with "submitted" or "published" in their comment, those with "thesis" in their comment and those with "accepted" in their comment.

The following graph shows at what point submissions were made to published articles. For example, an article might receive one submission before it was published, another in the year it was published and a final one after it was published.

The following graph analyses when submissions have been made to articles (this differs from the previous graph, as this graph uses the date of the "first submissions" only). These figures include the "initial update" of first submitting the article to the archive. The 7% of articles, as indicated by "Other", are articles that have been updated before, at and after their publication (or before and at, and so on).

The following three graphs show how many of these changes there have been. One change means the author has only submitted the article; they have made no further alterations and resubmitted it.

Using SLAC/SPIRES

find du 1999 not (eprint hepth or hepph or grqc or math or mathag or condmat or quantph)

...66198 records found...of 77802 total in 1999
