From harnad@coglit.ecs.soton.ac.uk Tue Jul 4 14:06:22 2000
Date: Mon, 3 Jul 2000 16:17:19 +0100 (BST)
From: Stevan Harnad
To: Ian Hickman
Cc: Stevan Harnad, lac@ecs.soton.ac.uk, zj@ecs.soton.ac.uk,
    S.Hitchcock@ecs.soton.ac.uk, tdb198@ecs.soton.ac.uk
Subject: Re: Questions arising from our Work Presentation.

On Mon, 3 Jul 2000, Ian Hickman wrote:

>
> Web Traffic
> -------------
>
> - Where does the web traffic come from on our mirror?
> - Where does the traffic come from on the US site?
> - How do the two compare?

And remember it was also suggested to contact the managers of the other
13 mirror sites. Some may have ready stats, others may even be
conducting related analyses of their own. Proportions of national
domains at each site (and also their overall usage rates) would be a
terrific supplement to what we have.

> - Can the session lengths distribution be used to eliminate non-human
>   users? (i.e. do Spiders have long session lengths?)
> - What percentage of papers are hit in their first month on the archive?
> - How often do users have a session on the site?
> - Is the ratio of published to unpublished papers significant (using
>   statistical methods)?

The ratio or the difference, or even (P-U)/(P+U).

One suggestion for display is to show the standard error bars for your
averages (standard error is the root mean square deviation from the
mean, divided by the square root of N: it could be derived by computing
monthly averages and calculating their variance).

> - What is the reading life-cycle of a paper?
> - How does the Impact of the paper affect the life-cycle?
> - Is there a correlation between High Impact papers and High Download
>   rates?

Divide the papers into hi/med/lo impact using several measures:

(1) Total hits per paper in the archive
(2) Citations of the paper (cannot be done in early embryo stages)
    in the Archive
(3) Citations of the paper, ISI statistics (tell me what you need and
    I'll contact ISI about access to their database)
(4) Impact factor (citation ratio) of the AUTHOR (rather than the
    paper): easier to get (both from the Archive, using Zhuoan's tools,
    and from ISI).

Further subdivide by hi/med/lo on each of the above measures and
sector: HEP/ASTRO/COND/other

The experimental design is then Impact (3 levels) by sector (4 levels)

> Defining High Impact
> ----------------------
>
> - Quality of Author(s) - How good is an author?
> - Number of Citations (using SPIRES archive for High Energy Physics)
> - Number of downloads

Author's hit-rate; author's citation-ratio ("impact factor")

> Validity
> ----------
>
> - Is a paper deposited in a certain field really a paper about that field?

Important. Here we can use other forms of analysis: Latent Semantic
Analysis (I can contact Tom Landauer about the LSA software for
research purposes), Shimon Edelman's similarity metric, shared
keywords, co-citation

> - Could we cross reference with the ISI database to ensure this?

Contact each of the other mirror sites (compose a letter and send it to
me: I could edit and send for you).

> - Could we use LSA techniques on the paper abstract to ensure this?

LSA and other techniques

> Deposition patterns

Depositing

> ---------------------
>
> - What fraction of updates are links to changes as opposed to paper
>   re-writes?

Re-writes of text-body (how big), re-writes of abstract, and
front-matter, journal reference insertion

> - Can we cross reference with the SLACK database to confirm publication
>   figures/dates?

SLAC/SPIRES

Again, draft a text and I can liaise for you:

Heath O'Connell hoc@SLAC.Stanford.EDU

They have all the validated biblio data for all of HEP and many other
areas of physics. We MUST use that info to cross-check whether those
papers in XXX whose authors have not given journal-refs are indeed in
journals.
Those stats are essential -- again subdivided by impact-level
(hi/med/lo) and sector (HEP/astro/cond/other)

Get the same info for Astro from the Astro database (pboyce@aas.org)

> - What is the deposition/submission behavior in each field of the archive?

and at each impact level -- and compare across the years as XXX grew
and practice evolved...

> - Are published versions submitted or are papers updated with Journal
>   references? (look to SPIRES)

and AAS and maybe even ISI

> - Is there a correlation between average number of changes and Impact of
>   paper?

This is one of many variables you will want to correlate with impact
(which can be measured the 4 ways mentioned above): latency (how soon
the hits occur); whether journal ref is given; sector; etc.

> - Throughout the different fields - what proportion of pre-prints are
>   replaced by peer reviewed reprints?
>   i) Is a paper published?
>   ii) Does the author say that it is published?
>   iii) Did the version number change?
>   iv) If it is not updated what is the "diff" between submitted and
>       deposited papers?

Good!

Chrs, Stevan
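
A minimal sketch of the two statistics suggested in the reply on the
published/unpublished comparison: the standard error of a set of
monthly averages (root mean square deviation from the mean, divided by
the square root of N, as defined in the message) and the normalized
difference (P-U)/(P+U). The function names and the sample figures are
hypothetical and only illustrate the arithmetic; they are not data from
the archive. Note that many statistics packages divide by N-1 rather
than N when estimating the deviation, which gives a slightly larger
error bar for small N.

```python
import math


def standard_error(values):
    """Standard error as described in the message:
    RMS deviation from the mean, divided by sqrt(N)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    # Root mean square deviation from the mean (divides by N, not N-1).
    rms_deviation = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return rms_deviation / math.sqrt(n)


def normalized_difference(p, u):
    """(P-U)/(P+U): ranges from -1 (all unpublished) to +1 (all published)."""
    if p + u == 0:
        raise ValueError("P + U must be non-zero")
    return (p - u) / (p + u)


if __name__ == "__main__":
    # Hypothetical monthly average hits per paper (illustration only).
    monthly_published = [12.0, 14.0, 10.0, 16.0]
    monthly_unpublished = [8.0, 6.0, 10.0, 8.0]

    mean_p = sum(monthly_published) / len(monthly_published)
    mean_u = sum(monthly_unpublished) / len(monthly_unpublished)

    print("published:   mean", mean_p, "+/-", standard_error(monthly_published))
    print("unpublished: mean", mean_u, "+/-", standard_error(monthly_unpublished))
    print("(P-U)/(P+U):", normalized_difference(mean_p, mean_u))
```

The same two functions could be applied within each cell of the Impact
(3 levels) by sector (4 levels) design, so that every cell average is
plotted with its own error bar.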