OpCit

"Articles do not get changed very much following publication"

Definitions:

"Changed" - A change, or update, to the article is defined as being a new submission to the archive. The dates of these submissions are recorded in the abstract's "Date" field. The dates are in standard date string format, with the year in either two-digit or four-digit form.
"Publication" - Publication is defined as being when the article is published in a journal. Whether the article is published is recorded by the author in the abstract's "Journal-ref" field. There is no fixed format for this field.
"Abstract" - Each article in the Los Alamos archive has an associated abstract file, containing meta data about the article as well as the abstract text itself. Some of this data is entered by the author, and some is generated automatically during the submission process.
"Meta Data" - Meta data describes information about a document, for example its creation date, or size.
"Parsing" - Parsing is when a data set is filtered or changed so that the output is more useful.

What data was analysed:

The test data is initially extracted from the directory hierarchy of the Los Alamos archive, e.g.:

/math-ph/papers/0001/0001027.abs

In addition to the directory hierarchy, the individual abstracts were opened and relevant data extracted from them.


How the data was extracted:

A directory listing was taken of the XXX directory hierarchy:

dir -R -l > d_listing

This created a file listing all the files in the archive.
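
As a rough illustration of how such a listing can be turned back into file paths, here is a minimal Perl sketch. It assumes the listing has the usual recursive long-listing layout (a "directory:" header line followed by one entry per line, file name last). The helper name listing_paths and the sample listing text are hypothetical, not taken from the actual d_listing file.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild full paths from recursive long-listing text.
# Assumes "directory:" header lines followed by one entry per line, with
# the file name as the last whitespace-separated field.
sub listing_paths {
    my ($listing) = @_;
    my ($dir, @paths) = ('');
    for my $line (split /\n/, $listing) {
        if ($line =~ /^(.+):$/) {
            # Header line naming the directory that follows
            $dir = $1;
        } elsif ($line =~ /^-.*\s(\S+)$/) {
            # Regular file entries start with "-"; directories are skipped
            push @paths, "$dir/$1";
        }
    }
    return @paths;
}

# Made-up sample in the assumed layout
my $sample = <<'END';
ftp/physics/papers/0006:
total 8
-rw-r--r-- 1 owner group 1024 Jun  2  2000 0006007.abs
drwxr-xr-x 2 owner group 4096 Jun  2  2000 subdir
END

print "$_\n" for listing_paths($sample);
```

With the sample above this prints the single path ftp/physics/papers/0006/0006007.abs.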

The listing was parsed into another file, d_papers, which contains just the abstract and article file names. During parsing, the file type, the article reference, the date the article was first submitted and the file location are extracted:

date_submission type    paper_ref       first_submit   filename
ftp             abs     physics/0006007 200006         ftp/physics/papers/0006/0006007.abs

The creation date in the directory hierarchy is stored in "yymm" format. This is converted to "yyyymm" by adding 1900 to the year if it is greater than 50, or 2000 if it is less than 50.
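
The conversion, and the way the other d_papers fields follow from a file path, can be sketched in Perl as below. This is an illustration under stated assumptions, not the script that was actually used: the helper names expand_yymm and paper_fields are hypothetical, paths are assumed to look like the example above, and a year of exactly 50 (which the rule does not cover) is assumed to fall into the 2000s.

```perl
use strict;
use warnings;

# Hypothetical helper: expand a "yymm" directory name to "yyyymm".
sub expand_yymm {
    my ($yymm) = @_;
    my ($yy, $mm) = $yymm =~ /^(\d\d)(\d\d)$/ or return;
    # Years above 50 belong to the 1900s, the rest to the 2000s.
    # Assumption: a year of exactly 50 falls into the 2000s branch.
    my $century = $yy > 50 ? 1900 : 2000;
    return sprintf("%04d%02d", $century + $yy, $mm);
}

# Hypothetical helper: derive the d_papers fields from one file path.
# Assumes paths shaped like <root>/<archive>/papers/<yymm>/<number>.<type>
sub paper_fields {
    my ($path) = @_;
    return unless $path =~ m{([^/]+)/papers/(\d{4})/(\d+)\.(\w+)$};
    my ($archive, $yymm, $number, $type) = ($1, $2, $3, $4);
    return ($type, "$archive/$number", expand_yymm($yymm), $path);
}

print join("\t", paper_fields("ftp/physics/papers/0006/0006007.abs")), "\n";
```

For the example path this yields the fields abs, physics/0006007, 200006 and the path itself, matching the sample record.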

The d_papers file is then parsed into d_publishedchanges. This contains only the abstract files. Additional data is added to the records:

  • The date of publication, as extracted from the Journal-ref field in the abstract meta data.
  • The number of changes before publication, as extracted from the date(s) fields in the abstract meta data.
  • The number of changes after publication, as extracted from the date(s) fields in the abstract meta data.
  • The years that changes were made, as extracted from the date(s) fields in the abstract meta data.
For example:
abs     cond-mat/9702146        199702  ftp/cond-mat/papers/9702/9702146.abs    1997    3       0       1997,1997,1997,

The two values following the publication year (3 and 0 in the example) are the number of changes before publication (change year less than or equal to the publication year), and the number of changes after publication.
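
The change years in the final column come from the Date fields, whose year may have two or four digits. A minimal Perl sketch of reading the year out of one such date string follows. The helper name date_year, the sample date strings and the assumed layout (day, three-letter month name, then year) are hypothetical, and the pivot of 50 for two-digit years is borrowed from the directory date rule rather than confirmed for this field.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the year out of one Date string.
# Assumes the year is the 2- or 4-digit number that directly follows a
# three-letter month name, e.g. "3 Feb 1997" or "5 Jan 94".
sub date_year {
    my ($date) = @_;
    return unless $date =~ /\b[A-Z][a-z]{2}\s+(\d{4}|\d{2})\b/;
    my $year = $1;
    # Two-digit years: above 50 means 19xx, otherwise 20xx (assumed pivot).
    $year += ($year > 50 ? 1900 : 2000) if length($year) == 2;
    return $year;
}

print date_year("Mon, 3 Feb 1997 10:15:00 GMT"), "\n";
print date_year("Wed, 5 Jan 94 09:30:00 EST"), "\n";
```

The two sample strings give 1997 and 1994 respectively.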

For information on the algorithm used to extract the date from the Journal-ref field please see below.

The "number of changes after publication" is the number of date fields (one field per submission by the author) whose year is greater than the publication year.
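
The two counts can be sketched in Perl as follows, using the values from the cond-mat/9702146 record above. The helper name count_changes is hypothetical and the code is only an illustration of the rule as described, not the script that produced d_publishedchanges.

```perl
use strict;
use warnings;

# Hypothetical helper: split the change years around the publication year.
# Returns (changes up to and including that year, changes after it).
sub count_changes {
    my ($pub_year, @change_years) = @_;
    my $before = grep { $_ <= $pub_year } @change_years;
    my $after  = grep { $_ >  $pub_year } @change_years;
    return ($before, $after);
}

# Values from the cond-mat/9702146 example record
my ($before, $after) = count_changes(1997, split /,/, "1997,1997,1997,");
print "$before\t$after\n";
```

For the example record this gives 3 changes before publication and 0 after, as in the sample line.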


Accuracy:

For an individual article the data can be accurate to one year at best, as authors predominantly give only the year of the journal's publication (along with the journal name, issue number and so on). If the author has not filled in the Journal-ref field it is impossible to ascertain whether the article has been published. Any such records are ignored.

If an article was published before it was first submitted to the archive and has not been updated since, the initial submission is still counted as a change after publication (as the submission date will be greater than the publication date).

The data analysed covers a period of 10 years, and all the changes that have been made to papers over that period. Therefore papers submitted earlier in the period are likely to have had more updates than those submitted later.


Parsing the Journal-ref Field

In Perl:

        # Return the most likely publication year in a Journal-ref string,
        # or nothing if there is no candidate. The sub wrapper and its name
        # are illustrative; the regular expressions are unchanged.
        sub journal_ref_year {
                local $_ = shift;
                my @years;
                # Push in years in brackets
                while( s/[\(\[](\d\d\d\d)[\)\]]// ) {
                        push(@years, $1);
                }
                # Push in any years at beginning or end of line
                (s/\D(\d\d\d\d)$//g) && push(@years, $1);
                (s/^(\d\d\d\d)\D//g) && push(@years, $1);
                # Push in remaining four-digit values
                while( s/\D(\d\d\d\d)\D// ) {
                        push(@years, $1);
                }
                # Push in two-digit years e.g. '99, or (99)
                while( s/\'(\d\d)\D// || s/\((\d\d)\)// || s/\'(\d\d)$// ) {
                        if( $1 > 50 ) {
                                push(@years, 1900+$1);
                        } else {
                                push(@years, 2000+$1);
                        }
                }
                # Go through the 4 digit values until we find a likely candidate
                while( defined(my $year = shift(@years)) ) {
                        if( ($year > 1950) && ($year < 2050) ) {
                                return $year;
                        }
                }
                # No plausible year found
                return;
        }