bioRxiv homepage

Analyzing Preprints: The challenges of working with metadata from bioRxiv

By Mario Malički, Maria Janina Sarol, and Juan Pablo Alperin

This blog post is the third in a four part series documenting the methodological challenges we faced during our project investigating preprint growth and uptake. Stay tuned for our final post, in which we’ll dive into the arXiv metadata.

bioRxiv homepage

Following our analysis of the challenges working with SHARE and OSF metadata (read about them in our previous blogs here and here), we explored the metadata of bioRxiv—a preprint server focused on biology research. Since its inception in 2013, bioRxiv has received significant attention,1 and became the largest biology-focused preprint server (Figure 1, below). Studies showed that posting a preprint on bioRxiv was associated with more citations and higher Altmetric scores than articles that were not posted as preprints.2-4 Additionally, recent research (itself posted on bioRxiv) has shown slight differences between preprints posted on bioRxiv and postprints published in journals indexed in PubMed, with peer reviewed articles having better reporting of reagents (i.e., drug suppliers and antibody validation) and experimental animals (i.e., reporting of strain, sex, supplier and randomization), but bioRxiv articles having better reporting of unit-level data, completeness of statistical results, and exact p-values.5

Unlike OSF servers we previously analysed, bioRxiv only accepts unpublished/unrefereed manuscripts that have not been peer reviewed or accepted for publication in a journal (Figure 1, below). Furthermore, bioRxiv specifically states it does not accept theses, dissertations, or protocols. It accepts only preprints written in English and uploaded as either Word documents or PDFs. Moreover, all uploaded preprints, including their updated versions, undergo a basic screening process for content that is offensive, non-scientific, or that might pose a health or biosecurity risk, and undergo iThenthicate plagiarism check and duplicate search (information provided by bioRxiv). To encourage use of preprints, bioRxiv allows users to directly transmit uploaded documents and their metadata to a number of journals, and journals to deposit received manuscripts as preprints to bioRxiv.

Monthly submissions to bioRxiv are increasing
Figure 1. Monthly submission to bioRxiv. Graph available at Rxivist under CC-BY 4.0 licence.

Despite not yet releasing their own public API, bioRxiv metadata can be retrieved from Crossref and through the Rxivist site. BioRxiv deposits metadata to Crossref as part of the official scholarly record; while the metadata available from Rxivist is scraped from the bioRxiv website.6

For our analysis, we downloaded metadata from June 18 to 20, 2019 for all posted content in Crossref—that is, all preprints, eprints, working papers, reports, dissertations, and many other types of content that has been posted but not formally published”and filtered it for records from bioRxiv. Guided by a description of Crossref metadata fields, we then set out to explore the same basic metadata fields as we did for SHARE and OSF. Here’s what we found:

1) Records

Our final database contained 52,531 preprints. Although bioRxiv users have to categorize their preprints into one of three categories (New ResultsConfirmatory Results, or Contradictory Results), this information is only available through the biorXiv search engine—and not in the Crossref or Rxivist metadata. On August 28, 405 (0.7%) preprints were listed as Contradictory and 1,090 (1.9%) as Confirmatory; all the rest were classified as New Results.

Although records can be removed in cases of discovered plagiarism or breach of relevant ethical standards (see the bioRxiv Submission Guide for details), there was no metadata field for withdrawn status in either Crossref or Rxivist. However, the bioRxiv’s search engine includes a filter option that on August 28 listed 49 preprints as withdrawn. Unlike for published papers, for which the Committee on Publication Ethics recommends keeping the full text of the record available after indicating its retracted or withdrawn status, some uploaded documents of withdrawn preprints in bioRxiv, are removed from the server, and their abstracts replaced with removal reason statements. This is similar to the practice we found in OSF.

There is also no metadata information available for authors of the removal statements, although, in some cases, the statements themselves contain information about the author (e.g., as in example one, a preprint removed by the bioRxiv moderators, or example two, a preprint removed at the request of the authors). However, in several cases the reason for removal is not clear (e.g., example three, Figure 2 below).

an example of a withdrawn bioRxiv preprint.
Figure 2. Example of a withdrawn bioRxiv preprint.

Using Crossref Metadata, we also identified 10 duplicate cases, where the title and author(s) of one preprint was an exact match to that of  one or more other preprints (e.g., example one, preprints one and two are identical; example two, preprints one, two, and three are different versions of the same preprint). It is worth nothing that bioRxiv has recently implemented a duplicate search.

2) Subjects

On upload, bioRxiv users have to assign their preprint into one of 25 subject areas. Previously users could assign preprints to two additional subject areas – Clinical Trials and Epidemiology – until content in these areas were redirected to medRxiv, a preprint server for clinical sciences launched in June. Taking these two discontinued subject areas into account, we originally expected the bioRxiv metadata to include 27 subject area classifications. Instead, the Crossref metadata listed 29 areas, including two areas not listed on the bioRxiv website: Scientific Communication (n=16) and Pharmacology (n=26), which existed alongside listed Scientific Communication and Education (n=322), and Pharmacology and Toxicology (n=400) areas. (A recent update of bioRxiv metadata has corrected this.)

There were only eight records (0.02%) which did not have a subject information in the Crossref metadata. Of these eight, only two listed subject information on the bioRxiv website and in Rxivist (i.e., records one and two); the remaining six did not (i.e., records one, two, three, four, five, and six).

3) Dates

There are six date fields in the Crossref metadata for bioRxiv preprints: indexed, posted, issued, accepted, created, and deposited. All six of these fields were populated for each of the bioRxiv preprints in our database.

Crossref defines these fields as follows:

  1. Date indexed: when the metadata was most recently processed by Crossref;
  2. Date posted: when the content was made available online;
  3. Date issued: the earliest known publication date. This was equal to date posted in all the cases we studied;
  4. Date accepted: the date on which a work was accepted for publication. For 37,730 (72%) records, this was equal to the date posted/issued, for 13,517 (26%), it preceded the date posted, and for 1290 (2%) it succeeded it. A description of why this is so, would be a welcome addition to the metadata documentation;
  5. Date created: when the DOI was first registered. For all but one preprint, Figure 3 below, this date was the same or succeeded the date posted/issued—likely because most DOIs are assigned on the same day or a few days after a preprint is created on the bioRxiv server.
  6. Date deposited: when the metadata was last updated (by bioRxiv).


An example of a preprint with mismatched dates in the metadata
Figure 3. The metadata for this preprint indicates that its DOI was created before the record was uploaded to the server (top). According to Rxivist metadata and the record’s URL, a version of this preprint was previously posted on bioRxiv, but is no longer available (bottom).

The Rxivist metadata contains only one date field, the date when a preprint was first posted. Metadata regarding version numbers or update dates is not available from either Crossref or Rxivist. However, it is possible to discern the version number by looking at the last two characters of the biorxiv_url field in Rxivist (e.g., example one, whose biorxiv_url ends with v7).

4) Contributors

When entering author information during preprint upload at bioRxiv, users are required to fill in each author’s given and family name and institutional affiliation. They can also optionally fill in the author’s middle name, middle name initials, name suffix, home page, and ORCID iD. Alternatively, users can choose to add a Collaborative Group/Consortium as an author, by filling in a free text field. This information, alongside the uploaded document, is then used to create a final pdf version of the preprints (which users have to approve), and from which the final metadata is extracted.

Only three records in the Crossref metadata lacked any author information (specifically, records one, two, and three—this third one has since been corrected). All other records included, at minimum, at least one consortium or family name. We found 24 (0.05%) records with an empty given name field. In some cases this was because a consortium name had been entered as a family name (e.g., example one), in others because only a family name had been entered (e.g., examples two and three). Additionally, we identified 33 records where an author’s full name (given and family name together) exactly matched that of another author on the same record (these errors have since been corrected). For 859 (2%) preprints, we found a Collaborative Group/Consortium as one of the authors. Two of these (records one and two) were exclusively authored by a consortium. Of preprints with consortia authors, 156 (18%) listed more than one Consortium. One preprint (see Figure 4, below), listed the same consortium name twice.

This preprint has errors in the contributor field
Figure 4. Example of a same consortium name appearing twice in the list of authors.

We also identified cases where authors names were entered as consortia (e.g., example three), as well as cases where the consortia field was used to enter “zero”, “No”, “None”, or “N/A” (many of these issues have since been corrected). In a handful of cases, different records used different spellings or acronyms for the same consortia (e.g., the Karolinska Schizophrenia Project (KASP) was also spelled as Karolinska Schizophrenia Project KaSP, KaSP, and KaSP Consortium). All of these issues make it difficult to accurately determine the number of unique consortia listed as co-authors.

In the Crossref metadata, each author is assigned a Sequence field identifying them as either the first or an additional author on a preprint. As expected, in almost all cases, the additional authors appeared in the metadata in the same order as on the bioRxiv webpage. However, in Crossref JSON metadata format consortia always occupy the last spot of the author list (e.g. example one, where a consortia is the middle position of the author byline, but last in the metadata), which  makes it impossible to use Crossref JSON metadata format to determine byline order for preprints co-authored with consortia. Crossref XML metadata format and Rxivist metadata retains the correct author order (note: they do not contain a field indicating exact byline position).

For 28,663 (55%) preprints in Crossref metadata, at least one author had an ORCID iD; for 4,177 (8%), all of the authors did. For 13 preprints, two or more authors shared the same ORCID iD (e.g., example one and two).

Lastly, although bioRxiv requires uploaders to add affiliation information for all authors during upload, this information isn’t captured in the Crossref metadata (though it is available in Rxivist).

5) Linking preprints to postprints

As Indicated in our previous blogposts, it is important to identify if a record on a preprint server is actually a preprint. BioRxiv monitors title and authors of records in Crossref and PubMed. On identifying a potential match, they send an email to the corresponding author for confirmation, and then add that information to the preprint metadata. Additionally, users can also manually alert bioRxiv of postprint publications.3

In the Crossref metadata, 20,465 (39%) records had a link to a DOI of their published version. Of those, one record contained its own DOI instead of that of the published postprint (at the time of writing this error had been corrected on the bioRxiv site but not in the metadata deposited to Crossref). We found an additional 29 pairs of preprints that shared an identical postprint DOI with each other. In some cases, this was caused by different versions of the same preprint being uploaded to bioRxiv as separate preprints (e.g., example one: preprints one and two, example two: preprints three and four). In others, the same version had been uploaded twice as two separate preprints (e.g., example three: preprints five and six, Figure 5, below). Additionally, despite accepting only original research preprints, both review papers and published papers have been found among bioRxiv records6-7 (discussion of the preprint community and bioRxiv moderators about these issues can be found here).

Finally, five preprints contained links to reviews and endorsements received on F1000 Prime in Crossref’s relation metadata field (e.g., example one).

Example of doi-related metadata issues
Figure 5. This record was uploaded twice on bioRxiv and was assigned different DOIs.

Metadata continues to improve

In the two months that have passed since we first downloaded the bioRxiv Crossref metadata and writing of this post, the bioRxiv team has been working and improving the metadata they provide. We hope that they similarly address the errors and omissions we’ve identified above, and enrich the metadata with information about author affiliations, withdrawn records, article categorization, versions, and comments. Overall, we identified fewer issues with bioRxiv metadata than we did for SHARE and OSF. We hope researchers using bioRxiv metadata from either Crossref or Rxivist will report if and how they overcame the challenges we described.


  1. Kaiser J. Are preprints the future of biology? A survival guide for scientists. Science. 2017;397.
  2. Fu DY, Hughey JJ. Releasing a preprint is associated with more attention and citations. bioRxiv. 2019:699652.
  3. Fraser N, Momeni F, Mayr P, Peters I. The effect of bioRxiv preprints on citations and altmetrics. bioRxiv. 2019:673665.
  4. Serghiou S, Ioannidis JP. Altmetric Scores, Citations, and Publication of Studies Posted as Preprints. JAMA: The Journal of the American Medical Association. 2018;319(4):402-4.
  5. Carneiro CFD, Queiroz VGS, Moulin TC, Carvalho CAM, Haas CB, Rayêe D, et al. Comparing quality of reporting between preprints and peer-reviewed articles in the biomedical literature. bioRxiv. 2019:581892.
  6. Abdill RJ, Blekhman R. Meta-Research: Tracking the popularity and outcomes of all bioRxiv preprints. eLife. 2019;8:e45133.
  7. Anaya J. 2018. [cited 2019]. Available from:

Comments? Questions? Drop us a line on Twitter (tag #scholcommlab). We have shared the issues outlined above with the bioRxiv team, and incorporated their feedback into this post. Additionally, recently bioRxiv has released a beta version of their own API. We welcome your insights and experiences with working with preprint metadata. Our source code is available on GitHub.

Large stack of papers on a white background

Analyzing preprints: The challenges of working with OSF metadata

By Mario Malički, Maria Janina Sarol, and Juan Pablo Alperin

This blog post is the second in a four part series documenting the methodological challenges we faced during our project investigating preprint growth and uptake. Stay tuned for our next post, in which we’ll describe our experiences with the bioRxiv metadata.

Our analysis of the SHARE database, described in our previous blog post, revealed many challenges of working with metadata that was merged from different preprint servers. To continue with our research, we needed to proceed with a detailed analysis of one server (provider) at a time. We decided to start with preprint servers created by, or in partnership with, The Center for Open Science (COS). Using their Open Science Framework (OSF) preprints infrastructure, COS launched its first interdisciplinary preprint server, titled OSF Preprints, in 2016, and in the years following helped launch additional 25 servers (Figure 1 below), each tailored for a specific discipline or country. We downloaded all 29,430 records available through the OSF API on June 18, 2019 and explored the main metadata fields, just as we had done for SHARE (documentation on OSF server metadata fields is available here).

Figure 1. The 25 servers backed by the Open Science Framework

1) Sources and records per source

While we could easily calculate how many records existed on each of the 25 servers, we soon realized that records do not necessarily correspond to preprints. Identifying preprints among the records ended up being more challenging than we had anticipated for a number of reasons.

First, although servers employing the OSF infrastructure can accept different types of records, there is no metadata field (or filter on the OSF search engine) to distinguish between them (SHARE metadata also lacks such a filed). Our exploration revealed that several of the records were postprints (i.e., published, peer reviewed papers (e.g., example one), full conference proceedings containing several papers (e.g., example two), PowerPoint presentations (e.g., example three), or Excel files (e.g., example four). Some of the records were also theses, with one of the servers, Thesis Commons, created specifically for this document type.

Second, we identified a significant number of duplicate records which, if left accounted for, overestimate the number of preprints or records available. By using a strict exact matching of titles and full author names, we found 1,810 (6%) duplicate OSF records. These records corresponded to 778 unique title and author pairs that had been reposted at least once on either the same OSF preprint server (e.g., example one: records one and two, Figure 2, below) or on different servers (example two: records one, two, three, four, and five, or example three: records one and two).

a comparison of two different preprint records
Figure 2. The same document uploaded as two different records on an OSF Preprints server, one created on May 6, the other on May 30 2019.

This number, however, underestimates the actual number of duplicates, as users creating these records on OSF servers sometimes enter slightly different information. For example:

  • different authors for the same record, as in example four: records one and two;
  • different spellings of an author’s name or omission of an author’s middle name, as in example five: records one and two; or
  • uploading the same records in two different languages, as in example six: records one and two.

To reliably identify all of the duplicates, we’d need a more detailed comparison of titles, names abstracts, and linked published papers (outlined later in this post).

Third, 96 (0.3%) records we identified had been withdrawn at some point. But unlike for published papers, for which the Committee on Publication Ethics recommends keeping the full text of the record available after indicating its retracted or withdrawn status, withdrawn documents on OSF are removed, leaving behind only the webpage and none of the content. This makes it impossible to determine if withdrawn records were indeed preprints or some other kind of document. Reasons for withdrawal were listed on the webpage and available in the metadata for 70 (73%) of records. But while some withdrawal reasons were written by the server moderators (e.g., example one), some by the authors (e.g., example two), there were also cases where it was not clear who wrote them (e.g., example three, Figure 3). No (metadata) information was available on the individuals who provided the reasons for withdrawal.

Additionally, we found six  orphaned records, for which the metadata existed in the database, but the webpage was no longer accessible. We could not find explanations on the differences between orphaned and withdrawn records, except for OSF’s orphaned metadata definition, which reads: A preprint can be orphaned if it's primary file was removed from the preprint node. This field may be deprecated in future versions."

Withdrawn preprint record
Figure 3. An example of a withdrawn record from the MinRxiv preprint server. No information is provided on who withdrew the record (authors or server moderators).

2) Subjects (i.e., scientific discipline or subdiscipline classification)

OSF servers use the bepress discipline taxonomy for classifying records to scientific fields, albeit with slight modifications. For example, bodoArXiv, a server for medieval studies, introduced Medieval Studies as a top level discipline, while the bepress taxonomy has Medieval Studies under the Arts and Humanities top-level discipline. Similarly Meta-Science and Neuroscience were introduced as a top-level disciplines in PsyArXiv, while in bepress Meta-Science does not exist as a separate sub-discipline, and Neuroscience falls under the top-level Life Sciences discipline. Only two of the records we identified did not have subject classification.

3) Dates

OSF metadata contains seven date fields, two of which we planned to use for our analyses: date_published (defined as “the time at which the preprint was published”) and original_publication_date (defined as: user-entered, the date when the preprint was originally published”). The definition of the other five date fields (date_created, preprint_doi_created, date_withdrawn, date_modified, and date_last_transitioned) is available here. The date_published field appears to be automatically populated, and was therefore present for all records. Original_publication_date, on the other hand, was entered by users for only 9,894 (35%) records.

We also found instances where the original_publication_date differed from the date listed in Crossref metadata (e.g., example one, the user entered date is February 13, 2018, but Crossref lists February 14; example two, Figure 4 below, the user entered December 31, 2008, but Crossref lists December 3, 2009). These kinds of discrepancies make it difficult to identify which records correspond to preprints (i.e., by comparing the date when a record was created on a preprint server with the date it was published as a peer reviewed paper). They also pose challenges for determining at what stage of the dissemination process authors uploaded the preprint (i.e., on the same date they submitted a paper to a journal, or at the time of the paper’s acceptance).

Two records associated with a preprint
Figure 4. Example of a mismatch between user entered original publication date (above) and the date listed in Crossref, and at the record’s journal website (below).

4) Contributors

In our previous blog post, we explained how in SHARE, Contributors were divided into Creators (authors) or Contributors (uploaders). In OSF, Contributors are also divided into two similar categories: bibliographic, those that are the authors of the record, and non-bibliographic, those that can upload (new) versions of the records. We found 2,278 (8%) records that had at least one non-bibliographic Contributor (uploader), but only 14 (1%) of those records also listed one of those uploaders as an author. This confirms, like that data in SHARE, that uploaders are rarely also the authors of the records they are uploading, and indicates users are likely misunderstanding the entry fields during preprint uploading.

We identified many of the same issues that plague the Contributor field in SHARE, also in the OSF metadata. When taken together, these issues make the author metadata unreliable for authorship network analysis and other scientific inquiries. For example, we identified cases where the correct number of authors in the metadata differed from the number of authors listed in the uploaded documents (e.g., example one: only one author is listed on the record, but three can be found in the uploaded PDF).

We also encountered cases where Contributors that were specified as non-bibliographic should have been bibliographic (e.g., example two: the metadata lists one bibliographic and five non-bibliographic authors, while the uploaded PDF lists all six as bibliographic authors), as well as cases where the authors listed in the metadata were not among the authors listed in the uploaded PDF (e.g., example three, Figure 5 below: the single author in the metadata differs from the three listed in the uploaded PDF). Additionally, there were 31 (0.1%) records where the same author (using simple exact full name matching) was listed more than once as an author on the same record (e.g., example four).

A preprint on the engrXiv server
Figure 5. Example of authors listed in the uploaded document differing from those entered as authors on the engrXiv preprint server.

When we dug deeper into the analyses of authors’ names, we also saw that 137 (0.5%) records were missing an author’s first (given) name or initials (e.g., example one), and 2,091 (7%) were missing a last name (e.g., example two). We also found instances where users listed the name of a journal as the name of the author (e.g., example three, Figure 6 below), the name of a proceeding (e.g., example four) or of an institute (e.g., example five), and cases where the professional titles were included in names (e.g., example six).

A preprint posted on the AgriXiv server
Figure 6. Example of a journal entered as an author on the AgriXiv preprint server.

Additionally, like in SHARE, we found inconsistencies regarding the position of authors in the byline order (i.e., their place in the list of authors). We identified 67 cases where two or more authors had the same position in the byline order according to the metadata information, and to the best of our knowledge none of these cases where due to the practice of equal or shared contributorship (e.g., example one: the last two authors share the byline order position according to the metadata, but not according to the uploaded PDF; example two: the first two authors share the byline order position according to the metadata, but not according to the uploaded PDF). Finally, we also identified cases where the byline order in the metadata was completely different than the one in the uploaded document (e.g., example three).

As a final remark on the Contributor field, we noted that information on authors’ affiliations is not (directly) included in the record’s metadata; it is available only for authors who have created an OSF account and have taken the time to fill in their employment information. That is, uploaders who post preprints on the OSF servers are not asked to fill in affiliation information or professional identifiers (e.g., an ORCID iD) for themselves or for their co-authors at the time of the preprint upload. Instead, co-authors are sent an email by the system asking them to create an account, and if they choose to do so, they are then prompted to include their identifiers, education, and professional affiliations (but none of these fields are mandatory). Moreover, because this metadata is linked to the user and not the preprint, it can easily introduce errors and inconsistencies, as affiliations change over time. There were 9,741 (33%) records for which at least one of the co-authors did not create their profile on OSF. Of the 17,510 unique users that did create their profiles, only 3,218 (18%) provided their affiliation information.

Additional exploration of OSF servers

Despite all of the issues with the basic OSF metadata we described above, we still wanted to include OSF servers in our analysis of preprint growth and uptake. To help estimate the number of unique preprints (not records) on the servers, we explored an additional metadata field that we did not explore in SHARE: a record’s link to its published peer reviewed version.

Like many other preprint servers, OSF servers offer a free text field to allow users to link their preprints to their published (likely peer reviewed) version. This can be done at the same time the preprint is uploaded (if the published version already exists), or at any time after it becomes available. However, on OSF, the field is labelled “Peer-reviewed publication DOI”, which means that preprints cannot be linked to papers published in venues that do not assign a DOI. We found 4,965 (17%) records with a peer-reviewed publication DOI.

However, after exploring the data, it became apparent that this field alone cannot be used as a clear indicator of whether or not the record corresponds to a preprint (e.g., example one, where the peer-reviewed publication DOI points to an identical PDF that was uploaded as a preprint). In 491 (10%) of cases, the “peer reviewed” DOI pointed to another OSF preprint (e.g., example two, where the DOI points to a duplicate of the same record). Furthermore, 70 (14%) of the records used the record’s own DOI as the peer-reviewed publication DOI (e.g., example three, Figure 7 below). We also identified cases where the DOI entered could not be resolved (i.e., link does not work, as in example four). In addition, we found 362 (7%) cases where DOIs pointed to preprints or data deposited at the Zenodo repository and 9 (0.2%) cases were they pointed to items deposited at the Figshare repository. Finally, we found 305 (6%) records that shared the same Peer-reviewed publication DOI with at least one other record in an OSF repository. In many cases, this was due to the preprints being duplicates of each other, but in others, the error seemed to be a data entry issue (e.g., example five: records one and two are different preprints that point to the same published paper).

A preprint with incorrect DOI information
Figure 7. Example of a record where the DOI of the preprint (Paper DOI) and the Peer Review Publication DOI were identical.

Proceed with caution

As can be seen from above, using OSF metadata to answer even the most basic questions about preprints growth and uptake is problematic. The contributor metadata, in particular, was very unreliable, making it difficult to conduct any analyses that relied on author information, including searching for possible postprints using the authors names. We therefore caution researchers and stakeholders citing the number of preprints as reported on the OSF website or in recently published scientific articles1,2,3. To the best of our knowledge, no analyses conducted so far have fully addressed the issues we encountered.

Note: We communicated many of the issues described above with the OSF team at the Centre for Open Science who have indicated they are taking steps to address them. The most recent version of the OSF roadmap is available here.


  1. Narock T, Goldstein EB. Quantifying the Growth of Preprint Services Hosted by the Center for Open Science. Publications. 2019;7(2):44.
  2. Rahim R, Irawan DE, Zulfikar A, Hardi R, Gultom E, Ginting G, et al., editors. INA-Rxiv: The Missing Puzzle in Indonesia’s Scientific Publishing Workflow. Journal of Physics: Conference Series; 2018: IOP Publishing.
  3. Riegelman A. OSF Preprints. The Charleston Advisor. 2018;19(3):35-8.

Comments? Questions? Drop us a line on Twitter (tag #scholcommlab). We have shared the issues outlined above with the COS team, and welcome your insights and experiences with working with preprint metadata. Our source code is available on GitHub.  

a stack of preprints

Analyzing preprints: The challenges of working with SHARE metadata

By Mario Malički, Maria Janina Sarol, and Juan Pablo Alperin

As can be seen from a series of recent publications,1-6 there is great interest surrounding preprints—scholarly manuscripts that are published ahead of peer review.

a stack of preprints
"97688449"by pttgogofish is licensed under CC BY-NC-SA 2.0

And while these studies show a rise in the number of documents posted on preprint servers, they fail to offer a clear picture on how the growing number of preprint servers are being used, and how different communities are engaging with preprints. That’s why ASAPbio teamed up with the Scholarly Communications Lab on a project investigating preprint uptake and use across life and other sciences.

As we will describe in this four-part blog series, we encountered many challenges in working with preprint metadata, including lack of documentation, missing values, and incompatible and erroneous data. We have documented these challenges, in the hopes of helping those, who like ourselves, would like to better understand what is happening with preprints.

We begin this series with an analysis of the resource that was supposed to aggregate metadata of multiple preprint sources—the SHARE database—while the other three blogs will focus on the Open Science Framework (OSF) preprint servers, bioRxiv, and arXiv.

A deep dive into the Center for Open Science’s SHARE database

The Center for Open Science (COS)—the same group behind the OSF Preprint servers—partnered with the Association of Research Libraries to build the SHARE Infrastructure in order to harvest and normalize metadata from more than 100 sources (preprint servers). Within the 2,222,495 records contained in SHARE as of January 2019 (full dataset is available here, 1.46 GB uncompressed, provided to us by the SHARE team), we uncovered many discrepancies, omissions, and errors, which ultimately prevented us in using it as a source for our planned analyses. We present or findings for each of the major metadata fields available in SHARE:

1) Sources and records per source

We could not find information on SHARE’s website or GitHub pages specifying to what extent records are harvested from the different sources available in SHARE. In the records database they provided us, we identified 103 unique sources (servers). But we also found some that did not contain all of records that were available at the sources themselves (e.g., there were only 24 preprints from SSRN, out of the more than 700,000 preprints housed there today). In addition, although SHARE attempts to consolidate information from multiple sources for each record (e.g., see Figure 1, below, for a record with information available from four sources), we also found instances where information from different records was consolidated as if it all corresponded to the same record (e.g., example, where information on two different arXiv records—records one and two—was merged, most likely due to the fact that both of these records link to the same postprint publication). A clear description on how information is harvested from each source could help researchers attempting to use this data.

Figure 1. SHARE aggregates preprint information from multiple sources

2) Subjects (i.e. scientific discipline or subdiscipline classification).

We were also unable to find documentation on how SHARE handles subject/discipline classifications from various sources, nor on whether there were plans to use a single classification system for all records. We did confirm that, of the records that had a subject field value, 98% used the bepress classification of disciplines. However, this was only 1 of 21 different classifications we identified in the database.

Most records though, 1,517,874 (68%), did not have subject information at all. We felt more could have been done to fill in the subject classifications for these records. For example, records from the RePEc preprint server—which deals exclusively with Economics—could be augmented under the subject Social and Behavioural Sciences, as per their bepress classification, and similarly records from the various arXiv repositories could be augmented based on their respective fields.

3) Dates

Dates are a complicated metadata field, as research documents can be time stamped at many different moments: when they are uploaded, when their DOI was minted, when they were made public, updated, retracted, etc. However, the SHARE metadata contains only two date fields for each record that are populated using the source’s metadata (an additional three date fields exist for timestamps that occur within the SHARE database itself). These two source-based dates are: date_published (defined as: “date when the record was first published, issued, or made publicly available in any form”) and date_updated (defined as: “date when the record was last updated in the source”).

However, 2,085,123 (94%) of the records we examined didn’t have a date_published in SHARE, and only had the date_updated. This made it impossible to distinguish whether the date_updated referred to the date that the record was first created, when its DOI was minted, or when it was published in a printed or online version of a journal. In other words, we couldn’t use this data to determine the elapsed time between a preprint and its published paper (postprint), or to confirm whether the record was a preprint at all (we’ll expand on this issue of preprint/postprint identification in our next blog post). For researchers and users who rely on this database, a clear indication of which dates are harvested from each source would be greatly welcome.

4) Contributors

In the SHARE database, a Contributor is categorized as either a Creator (defined as: “the agent who authored or was a major contributor to the creation of the work”) or as a Contributor (defined as: “the people or organizations responsible for making contributions to an object”). In more scholarly familiar terms, Creators are authors and Contributors are uploaders, i.e. those who uploaded initial or revised versions of the preprints. Any Creator (author) or Contributor (uploader) is further classified as being either a person, institution, organization, or consortium.

This classification system poses several challenges. First, because any preprint record can have multiple Creators and multiple Contributors, SHARE metadata cannot be used to determine who initially uploaded the preprint (and filled the information on the authors, keywords, and other preprint details). Instead, the metadata only offers information about who can upload new versions. This, however, is a common issue in most bibliographic databases; information on who initially submitted a manuscript to a journal is seldom provided.

Second, information on the Contributors (uploaders) was available for only 86,579 (4%) of records (most of which were preprints hosted of the OSF preprint servers, arXiv or HAL). While it could be assumed that the Contributors (uploaders) would most often also be Creators (authors) of those records, we found—using a simple exact name matching—that this occurred in only in 2,737 (3%) of the 86,579 records. And just as we were initially confused by the two terms, we also found examples where users uploading preprints might have misunderstood them as well. Such is the case shown in Figure 2, where one person is listed as a Creator (author) and five are listed as Contributors (uploaders), when in fact all six are authors.

Figure 2. SHARE lists only one of six authors as the “Creator” of this preprint

Affiliation information was almost non-existent for either authors or uploaders. We found no affiliation information for 2,153,579 (97%) of records.

Even more problematic were records where the number of authors in the metadata did not match the number of authors in the uploaded documents (e.g., this example lists one person as the only Creator and Contributor of the record, but the uploaded PDF has two additional authors, see below; and this example lists one Creator in the metadata, but three authors in the PDF—none of whom is the supposed Creator).

Figure 3. We identified several records in which the contributor names in the metadata did not match those in the uploaded documents

These errors make it very difficult to answer even the most basic preprint authorship questions reliably: How many authors does a specific preprint have? And what is the average (or median) number of authors per preprint? These issues also constrain more detailed investigations into authorship networks; author affiliations, countries or regions of origin, seniority; and identifying all preprints published by a single author.

“These errors make it very difficult to answer even the most basic preprint authorship questions reliably: How many authors does a specific preprint have? And what is the average number of authors per preprint?”

Comparing the metadata of each record with the information in the uploaded documents seems to be the only way to detect all of these errors. Therefore, we can only presume (and hope) that the percentage of records with an incorrect number of authors is small and will not distort our analyses. Preprint servers, however, should apply better methods or checks to ensure information on authors is correctly entered into the metadata when preprints are submitted or posted.

Unfortunately, these were not the only issues that would make analysis of preprint authorship challenging. In SHARE, the byline order position (i.e. place in the list of authors) is provided in the metadata field order_cited for each author or Creator. We found that 454,866 (20%) of records assigned the same byline order position to several different Creators (authors). To the best of our knowledge, these cases do not represent instances where authors contributed equally to a paper or preprint. Instead, they appear to be metadata errors.

One such error can be found for this preprint, where the metadata lists 4 authors—3 with the same byline order position—while the source’s website shows only two authors and the uploaded PDF only one. Other errors include listing all of the Creators (authors) twice in the metadata (see Figure 4, where 4 authors are included in the uploaded document and 8 in the metadata) and duplicating just one of the names (see Figure 5, which lists 6 authors on the paper, but 7 in the metadata).

Figure 4. Eight authors are included in the metadata, but only four are in the original record
Figure 5. Seven authors are listed in the metadata, but only six in the preprint itself

We presume these errors most likely occur when the record information is harvested from two different sources, and each source contains a slight variation in the spelling of authors’ names or initials. Additionally, even though we could not find documentation behind the order_cited for Contributors (uploaders), we found examples, as in case of this preprint, where the metadata listed the same byline order position for two uploaders.

To complicate things even further, we also found cases where the byline order listed in the metadata did not match the order in the posted PDF document (e.g., Figure 6 and Figure 7):

Figure 6. The byline order position for the second and fourth authors in the uploaded document is reversed in the metadata
Figure 7. The byline order position for the second and third authors in the uploaded document is reversed in the metadata

All these Contributor errors we detected mean that if researchers rely solely on the SHARE metadata, any authorship analyses they conduct will likely be very unreliable. We feel more work is needed to improve the quality of authorship information during a creation of a centralized database.

Still a long way to go

As can be seen from the many examples we describe above, even the most basic metadata in SHARE is too problematic to provide a reliable analysis on preprint uptake and use across life or other sciences. The issues we encountered are further compounded by the lack of clear and complete documentation about metadata fields, harvested sources, and the way records from those sources are integrated into the SHARE database. While we cannot attest to the value of SHARE as a service for preprint discovery, we find that, in its current form, it cannot be considered a valuable source for any rigorous analysis of preprints.

In the end, we decided to discard the SHARE database as a starting point for our analyses, and we turned toward the original servers that host that information. Stay tuned for the next post in our series, in which we’ll explore metadata from 24 OSF Preprint servers.

Acknowledgements: We would like to thank Naomi Penfold, Associate Director of ASAPbio, for comments on an earlier version of this post.

Note: We shared a draft of our blog post with the SHARE team at the Centre for Open Science prior to its publication, but did not receive a response at the time of publishing.



  1. Fu DY, Hughey JJ. Releasing a preprint is associated with more attention and citations. bioRxiv. 2019:699652.
  2. Chiarelli A, Johnson R, Pinfield S, Richens E. Preprints and Scholarly Communication: Adoption, Practices, Drivers and Barriers [version 1; peer review: 2 approved with reservations]. F1000Research. 2019;8(971).
  3. Fraser N, Momeni F, Mayr P, Peters I. The effect of bioRxiv preprints on citations and altmetrics. bioRxiv. 2019:673665.
  4. Serghiou S, Ioannidis JP. Altmetric Scores, Citations, and Publication of Studies Posted as Preprints. JAMA: The Journal of the American Medical Association. 2018;319(4):402-4.
  5. Narock T, Goldstein EB. Quantifying the Growth of Preprint Services Hosted by the Center for Open Science. Publications. 2019;7(2):44.
  6. Abdill RJ, Blekhman R. Meta-Research: Tracking the popularity and outcomes of all bioRxiv preprints. eLife. 2019;8:e45133.

Comments? Questions? Drop us a line on Twitter (tag #scholcommlab). We have shared the issues outlined above with the COS team, and welcome your insights and experiences with working with preprint metadata. Our source code is available on GitHub.