Three Questions With Olivia Aguiar

Three questions with… Olivia!

Our lab is growing! In our Three Questions series, we’re profiling each of our members and the amazing work they’re doing.

Our latest post features Olivia Aguiar, a lab manager at the ScholCommLab and a doctoral student at Simon Fraser University’s Injury Prevention and Mobility Lab (IPML). In this Q&A, she tells us about creating science comics, interviewing "super seniors," setting boundaries, and more.


Olivia enjoying the view from Everest Base Camp in Nepal | Photo by Dawa Lama

Q#1 What are you working on at the lab? 

I’m so excited to be joining the ScholCommLab as a Lab Manager. My tasks include coordinating lab and project team meetings; updating the blog, website, and newsletter; collaborating on grant proposals; and supporting the lab directors in communication and outreach. I look forward to meeting the entire team and embarking on this new journey!

Q#2 Tell us about a recent paper, presentation, or project you’re proud of. 

I recently completed my certificate in Biomedical Visualization and Communication (BMVC) at The University of British Columbia. During the program, I learned that I love to illustrate scientific and health comics. Currently, I'm working on a short comic about concussions in ice hockey. My goal is to finish the final version by the end of Summer 2022. Stay tuned on social media (@livsaguiar) to see the result!

Olivia (right) on her way to catch some waves amid the morning haze in Tofino, B.C. | Photo by Keenan Bush

Q#3 What’s the best (or worst) piece of advice you’ve ever received? 

I interviewed “Super Seniors”—individuals aged 85 years or older who have never been diagnosed with a major medical concern (e.g., heart disease, cancer, dementia)—for the Healthy Aging Study at BC Cancer. My colleague and I summarized their advice on how to live a long, healthy life. My personal favourites include: “Kindness doesn’t cost anything,” “Consider life as an adventure,” and “Do good for others but also look after yourself.” Their words provide me with perspective and inspiration—especially during uncertain and challenging times.

As for professional advice, the best and worst I've received: “If you want something done right, do it yourself.” This mantra got me through university. It helped me build resilience, solve problems, and achieve success in volunteer and leadership positions. But this advice also led to many, many bouts of stress and burnout. I learned (the hard way) that sometimes there is a better person for the job. Now, I know there is a difference between pushing out of your comfort zone and pushing beyond your limits. I embrace teamwork and collaboration in the classroom and workplace. It is brave to ask for help!

Find Olivia on Twitter at @livsaguiar


colourful illustration of people doing research

A bridge to access research: Reflections on the Community Scholars Program developmental evaluation

By Andrea Whiteley, with editorial support by Alice Fleerackers

What happens when you provide the community with free access to academic research? 

Simon Fraser University (SFU)’s Community Scholars Program (CSP) is a unique initiative that aims to do just that. Established in 2016 in collaboration with the United Way of the Lower Mainland and Mindset Social Innovation Foundation, the program connects more than 500 people working in nonprofits and community organizations across BC with the latest scholarly literature, providing free access as well as research training and support. 

While the program has grown considerably since its launch, it hasn’t always been clear how participating in CSP benefits community members. Curious to find out more, I met with program librarian Heather De Forest during the final stages of my PhD on Open Access and the Public Good. With support from the United Way, Mitacs, and Dr. Juan Pablo Alperin at the ScholCommLab, my casual curiosity quickly developed into a fully-fledged postdoctoral research project evaluating the success of the CSP.

In this blog post, I summarize key findings and recommendations from the final report (available to download here), drawing on survey data, group interviews, and one-on-one conversations with more than 100 participating Community Scholars.  


The Community Scholars Program (CSP) provides community members with free access to academic research. Image by elenabsl/Shutterstock.com

1. Wide-ranging impact

Overwhelmingly, interviews with and survey responses from program participants confirmed that the CSP was having both direct and indirect impacts on their lives, their organizations, and the constituencies they serve. Scholars reported appreciating personal educational impacts, like improved presentation skills or program development abilities, which they had developed through CSP’s professional workshops and training sessions. They also reported benefiting from access to the research itself: 

Last year I was asked to participate in an international course as a subject matter expert on child rights. Although that is my area of work and focus I was anxious to ensure that I ‘knew enough,’ was current and was able to provide knowledge to a broad range of participants. Through CSP I was able to deepen my knowledge and gain different perspectives. Not only did this improve my confidence, it also ensured that I was able to coach 100+ participants.

The impact of the CSP on individual Scholars also reverberated through the communities and nonprofit organizations where they worked. By contributing to the professional development of employees, the program enabled organizations to tap into a research base for successful grant funding applications, participate in legal challenges, enhance their ability to deliver evidence-based programs, create research and new knowledge in their fields, and more. 


Community Scholars reported a range of experiences and impacts from using the knowledge gained from being a part of the CSP, with enhanced expertise by far the most common response (n=64)

2. Professional development and lifelong learning

Although over 60% of Community Scholars in the study had graduate-level education (and by extension some research background), many expressed an interest in receiving further research-related education. The CSP currently provides participants with various opportunities for training, which one Scholar described as opportunities for “group study” and “self-study.” 

In surveys and interviews, many Community Scholars said that these opportunities filled a gap in their professional development, allowing them to enhance their understanding of their field, support in-house research, and learn more about issues in their profession. In a couple of interviews, Community Scholars pointed to a need for adult education programs, like the CSP, that could support people who do not have formal educational credentials in entering post-secondary studies. At least one participant said that participating in the program encouraged them to pursue further education.  


Community Scholars said they valued the educational opportunities CSP provided | Photo by COD Newsroom (CC BY 2.0)

3. The power of communities of practice

The CSP brings together individuals working in the same field through journal club meetings and workshops, offered in person or online. Community Scholars who participated in journal club meetings—where participants choose research articles to read and discuss together—were very positive about their experiences and enjoyed getting to know others working in their field. Sitting in on these sessions, I was struck by how similar they were to graduate student seminars, but where each “student” possessed vast lived experience surrounding the issues under discussion. 

Although many Community Scholars expressed interest in participating in more of these “communities of practice” activities, fitting networking and learning opportunities into their busy lives was a challenge. Three in five (60%) survey respondents said that time was a barrier to full participation. 


Participants reported several barriers to fully engaging with the CSP’s offerings, of which lack of time was by far the most common (n = 108).

4. Librarians as essential supports

Feedback from Community Scholars confirmed that the program could not have achieved its level of success without the administrative, intellectual, and relationship-management support provided by the CSP librarians. During interviews, participants indicated that they received valuable assistance from the CSP librarians and were thankful to have them as a resource. Especially for Community Scholars in remote locations, CSP librarians were the human faces of the online research portal. It was clear that librarians play an instrumental role in the transformation of community members into “Community Scholars.”

The CSP librarian in my area has been consistently warm in reaching out to offer support, both online and through workshops and the journal club.

Importantly, CSP librarians wear many hats and possess diverse skills. Employed at university libraries, they have extensive research skills. But they must also provide other kinds of support to help community members access research through the portal. Librarians demonstrated a range of innovative approaches for recruiting, training, and educating Community Scholars, including using local libraries to host information sessions, offering curated training sessions for organizations, and enlisting colleagues to present workshops on research-related topics. 


CSP participants spoke to the important role librarians played in the program’s success | Photo by Tor Andreas Torhaug (CC BY-NC-ND 2.0)

5. A bridge between disconnected worlds 

Finally, my research revealed that the CSP acts as a bridge between disconnected worlds, bringing together traditional, for-profit scholarly publishing models with a more “public good”-oriented approach to knowledge access. As the success of the program depends on the willingness of scholarly publishers to allow community access to scholarship, its very existence is a living compromise within a publishing ecosystem where access to research has become a hotly contested topic. During my evaluation, I encountered countless compelling examples of the research impact made possible by bringing these seemingly conflicting realities together. 

It's all bridge building, I actually consider myself a bridge builder. And right now, it's moving ... between indigenous and non-indigenous communit[ies], but there are so many bridges between communit[ies] and institutions, all of these bridges that we've been talking about. And I think Community Scholars is a flash point for that.

In a similar vein, my research revealed the potential of the program to act as a liaison between academics and their greater communities. The CSP encourages the flow of knowledge among professional scholars, community scholars, frontline service deliverers, community and nonprofit staff, policymakers, and citizens. Still, some noted that this flow of knowledge was often unidirectional. These comments point to an opportunity for the research community as a whole to develop avenues for two-way communication between scholars and their professional counterparts working in the field, and ultimately, better mobilize the knowledge created beyond university walls. 


Photo of a mossy forest path, fading in the distance

Recommendations for a path forward 

While overall support for the CSP was high, several broad recommendations emerged from my evaluation that could improve current initiatives or guide the program into new territory. For example, some Community Scholars expressed interest in the idea of a “research hub” where they could be linked up with graduate students, postdocs, or other researchers to tackle specific research needs within their organizations. As one of the CSP librarians, Kate Shuttleworth, put it:

What most of our nonprofits really want is a researcher. Access is great, [the] portal is great, but we need someone to do the work. And that’s the primary thing that would help them.

The CSP already has a relationship with Simon Fraser University’s Community Engaged Research Initiative (CERI), making this organization an obvious candidate for further promoting community-based research initiatives and engagement with Community Scholars. 

The evaluation also uncovered a need to develop research literacy more broadly, as nonprofit organizations are increasingly challenged by funders to implement evidence-based practices, evaluate program impacts, and understand the research landscape of their respective disciplines. The CSP could provide Scholars with guidance in this area by showing them how to conduct literature reviews, write grant proposals, or find funding opportunities. Large funders, like the United Way, could support such initiatives by clearly communicating expectations for impact reporting and evidence-based programming from a funder perspective. 

One grant proposal required statistics and sources to support our request for funding. Using data and journal sources found through your portal helped us secure $255,000 in funding.

Finally, a key motivation behind this evaluation was to find ways to sustain the program in its next phase of development. Providing access to research is a cornerstone of the CSP and, although Community Scholars clearly value that access, few if any participants I spoke with felt they could afford to pay for it without the program’s support. Although my evaluation did not include a targeted analysis of possible funding and administrative supports for the CSP, my consultation with users and discussions with program leaders indicate this is an area that urgently requires further investigation. For example, the CSP could explore offering an adult education component that would create funding opportunities, particularly if expanded to include a formal “research certificate.” The CSP would also qualify for funding initiatives aimed at improving knowledge mobilization and research outreach to the public, especially in remote Indigenous and Northern BC communities. 


Above all, Community Scholars valued the welcoming and supportive environment the program provided. 

Final thoughts: Fostering connection and culture

Above all, this evaluation made it clear that, while the CSP provides access to research, Scholars also value the “culture of scholarship” that accompanies that access. The CSP fosters a culture of collegiality and sharing, allowing participants to engage with scholarly research while also taking into account the unique features of the nonprofit environment. The program offers many kinds of support to its members, is inclusive of all kinds of “scholarship,” and aims to be accessible to those who lack time, resources, or formal research training. This culture of scholarship means that Scholars have a support network when they need it, empowering them in more ways than a single blog post can ever capture. 

This research was made possible with funding and support from Mitacs, The United Way, SFU Library, and the ScholCommLab. To find out more about the Community Scholars Program Evaluation, you can read the full report here and access the CSP librarians' response here.


ScholCommLab co-director receives funding to analyze patterns of how research data is cited and reused 

Stefanie Haustein wins Alfred P. Sloan Foundation research grant of $200K US

Have you ever wondered what motivates researchers to reuse open data and what makes them cite (or not cite) datasets in their work? Or how sharing, reusing and citing open data differs between research areas or changes during a researcher’s career?

ScholCommLab co-director Stefanie Haustein was awarded $199,929 US ($281,660 CDN) by the Alfred P. Sloan Foundation to fund Meaningful Data Counts, an interdisciplinary project exploring a range of questions about scholarly data use and citation. Together with co-PI Isabella Peters, Professor of Web Science at the ZBW Leibniz Information Centre for Economics and CAU Kiel University (Germany), Stefanie and her team will conduct the basic research necessary to understand how datasets are viewed, used, cited, and remixed.

“Many researchers have to publish a paper to receive credit for collecting, cleaning, and curating scientific data, because datasets don’t count when applying for a tenure-track job or promotion.”

“Many researchers have to publish a paper to receive credit for collecting, cleaning, and curating scientific data, because datasets don’t count when applying for a tenure-track job or promotion,” Stefanie explains. While an increasing number of journals and funding agencies now require data to be open, research datasets are still considered second-class research contributions. By documenting the uptake and reuse of data across disciplines, she hopes the project will help overcome this perception and incentivize open research practices. “Data metrics that capture dataset views, downloads and citations might help to showcase the value of open data and contribute to elevate its status to first-class scholarly output,” she says.

On the other hand, Stefanie acknowledges that metrics could harm the scientific system in the long run, replicating some of the adverse effects associated with current bibliometric indicators. Therefore, a primary goal of the project is to support the creation of meaningful multidimensional metrics instead of promoting raw numbers and crude rankings. “We want to support the evidence-based development of metrics that are appropriate and well thought-out instead of repeating mistakes of the past by creating more flawed indicators, like the h-index and impact factor,” she explains.

The data sharing feedback loop: a lack of incentives has led to a lack of research, best practices, and meaningful metrics about data sharing and citation.

The project team—which, in addition to Stefanie and Isabella, includes collaborators Daniella Lowenberg, Felicity Tayler, Rodrigo Costas, Peter Kraker, Philippe Mongeon, and Nicolas Robinson-Garcia—will therefore take a mixed-methods approach to explore how datasets are (re)used and cited across disciplines. Using metadata provided by DataCite, it will build on ScholCommLab member Asura Enkhbayar’s doctoral research to develop a theory of data citations. As part of the larger Make Data Count (MDC) team, the researchers will work in close collaboration with MDC's infrastructure and outreach specialists. MDC, which is also funded by Sloan, aims to build a bibliometrics dashboard and drive adoption of standardized practices.

“We are thrilled to expand our Make Data Count team and scope to include the bibliometrics community,” says Daniella Lowenberg, the project’s PI and Data Publishing Product Manager at the California Digital Library. Martin Fenner, Technical Director at DataCite and MDC technical lead, agrees: “Going forward, Stefanie and her colleagues will provide us with important insights into researcher behavior towards open data,” he says. “I am excited to see Make Data Count transition into the next phase.”

During the two-year funding period, the grant will support two postdoctoral researchers—one at the University of Ottawa (quantitative) and one at Isabella’s Web Science research group at the ZBW. “I am very happy to join forces with the ScholCommLab and strengthen the German-Canadian research connection,” she says. “After having worked together with Stefanie for more than a decade—we studied together in Düsseldorf—this is our first shared research grant, which is really exciting.”

Stefanie agrees: "There's a lot of research to be done, but I'm hopeful that, together with MDC, this project will make a real difference in how the scholarly community values open data."

Interested in learning more? For updates and job postings about the Meaningful Data Counts project, sign up for the ScholCommLab’s newsletter.


Analyzing preprints: The challenges of working with metadata from arXiv’s Quantitative Biology section

By Mario Malički, Maria Janina Sarol, and Juan Pablo Alperin

This blog post is the last of a four-part series documenting the methodological challenges we faced during our project investigating preprint growth and uptake. Stay tuned for our next post, in which we’ll share general recommendations and hopes for the future of preprint metadata. 

Following our analyses of SHARE, OSF, and bioRxiv metadata (which you can read about here, here, and here), we went on to explore the metadata of arXiv’s Quantitative Biology section.

Started in 1991 as a server for theoretical high-energy physics preprints, arXiv has since expanded to cover many other disciplines and has become the largest repository of preprints in science.1 Its growth has made it an important part of research communication in many fields. Research has shown that usage of arXiv has increased dramatically in the last 10 years in computer science,2-3 as well as in library and information science.4 Research has also shown that, across all records and disciplines, 64% of arXiv preprints are eventually published in journals indexed in Web of Science (WoS), and, in turn, journals in WoS are increasingly citing arXiv records.5 The arXiv subsection dedicated to Quantitative Biology (q-bio) started in 2003 (Figure 1, below).

Figure 1. Yearly submissions to the Quantitative Biology section of arXiv have been growing since it was first started in 2003. Image obtained with permission from arXiv.

How does q-bio work? 

ArXiv differs from other servers we’ve analyzed in some key ways. First, unlike submissions uploaded to OSF and bioRxiv, arXiv preprints are not assigned a DOI, but an arXiv identifier. In addition, all uploaders on arXiv (whether they are authors or third-party submitters) must assign their submission to a primary section (e.g., q-bio), and they can subsequently “cross-list” it to other arXiv sections to increase their audience.

On September 3, 2019, we downloaded the metadata for all arXiv records that were listed or cross-listed to q-bio using arXiv’s Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) compatible API. Unlike the metadata of SHARE, OSF, bioRxiv, and Rxivist, which is distributed only in JSON, arXiv metadata can be obtained in four XML formats. We based our analyses on arXivRaw — the format that most closely matches arXiv’s internal format. We then explored the same metadata fields we investigated for SHARE, OSF, and bioRxiv (see our previous posts). Additional details of the various metadata fields available can be found on arXiv’s help pages.
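For readers who want to reproduce this kind of harvest, below is a minimal sketch of pulling q-bio records through arXiv's OAI-PMH interface in the arXivRaw format. The endpoint, set name, and throttling interval reflect our reading of arXiv's harvesting documentation and should be checked against the current help pages; the sketch also omits the retry handling a production harvester would need.

```python
import time
import requests
import xml.etree.ElementTree as ET

# Assumed OAI-PMH endpoint and set name for Quantitative Biology;
# check arXiv's help pages before relying on these values.
BASE_URL = "http://export.arxiv.org/oai2"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_qbio(metadata_prefix="arXivRaw", set_spec="q-bio"):
    """Yield raw <record> elements for all records listed or cross-listed to q-bio."""
    params = {"verb": "ListRecords", "set": set_spec,
              "metadataPrefix": metadata_prefix}
    while True:
        response = requests.get(BASE_URL, params=params, timeout=60)
        response.raise_for_status()
        root = ET.fromstring(response.text)
        for record in root.iter(f"{OAI_NS}record"):
            yield record
        # OAI-PMH paginates with resumption tokens; stop when none is returned.
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
        time.sleep(10)  # be polite: throttle consecutive harvesting requests


if __name__ == "__main__":
    records = list(harvest_qbio())
    print(f"Harvested {len(records)} q-bio records")
```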

1) Records 

There were 28,104 records categorized as belonging to q-bio in our dataset. As with OSF, not all records correspond to preprints, as arXiv allows uploading of both published papers and preprints, with no metadata field to distinguish between the two. And as with other preprint servers, arXiv allows users to upload new versions of their records. In the database we analysed, more than two thirds of records (n=19,109, 68%) had one version, roughly one fifth had two (n=6,187, 22%), and 10% had three or more versions (n=2,808).

However, after exploring the metadata in more detail, we learned that not every record in our dataset represents a unique preprint or published paper on arXiv.

Using exact title and author name matching, we identified two sets of duplicate records (0.01%). One of these pairs was a true duplicate (i.e., records one and two), which also contained an admin note for text overlap (more on this below). The other pair (i.e., records three and four) did not contain such a note. In this case, the preprint and its published version had been posted as separate records rather than as different versions of the same record.

Admin notes such as the one described above are added by arXiv moderators to a record’s comments section. As in the example above, they can contain statements about textual overlaps that have been detected between records. Of the 317 (1%) arXiv records in our dataset with admin notes, all but one mentioned a (substantial) text overlap with other records (e.g., example one). The outlier note indicated “inappropriate text reuse from external sources” (see Figure 2, below).

Figure 2. Example of a record with an admin note in the Comments section on arXiv.

According to arXiv documentation, admin notes are not intended to suggest misconduct on the part of the author or that an article does not contain original work. But, as in the example above, they are sometimes used to identify duplicate records. In other cases, we found they were applied to two different publications that build upon the authors’ previous work (e.g., example two – records one and two).

Additionally, as with bioRxiv and the OSF servers, there were no metadata fields for identifying whether a record had been withdrawn or who had posted the withdrawal notice. Instead, we used the comments section (available in the metadata and through this search query) to identify 89 records that used the term “withdrawn” (e.g., examples one and two), and two records that used the term “removed” (e.g., examples three and four). In some cases, however, we also found information about a withdrawal in the abstract text field (e.g., example five, Figure 3 below). In two cases, withdrawn records also contained an admin note (examples six and seven).

Figure 3. In this example, the abstract was replaced with a withdrawal notice.
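The keyword screen described above amounts to a simple text search over the comments and abstract fields. Here is a rough sketch, assuming the harvested records have already been flattened into dictionaries with (hypothetical) `id`, `comments`, and `abstract` keys:

```python
WITHDRAWAL_TERMS = ("withdrawn", "removed")


def flag_withdrawals(records):
    """Return record IDs whose comments or abstract mention a withdrawal term.

    `records` is assumed to be an iterable of dicts with (possibly empty)
    'id', 'comments', and 'abstract' fields extracted from the arXivRaw metadata.
    """
    flagged = []
    for rec in records:
        text = " ".join([rec.get("comments") or "", rec.get("abstract") or ""]).lower()
        hits = [term for term in WITHDRAWAL_TERMS if term in text]
        if hits:
            flagged.append((rec.get("id"), hits))
    return flagged
```

Keyword matches like these still need manual review, since a comment can mention withdrawal of an earlier version or of a related record rather than of the record itself.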

Finally, by manually reviewing the comment and abstract fields, we found two additional examples of records that were not unique. In the first case, the comments section of one record stated: “This paper, together with paper q-bio/0607043 are replaced by q-bio/0607043”. In the second, the abstract field had been used to indicate that the record was a supplement to another record.

2) Subjects 

As mentioned above, uploaders on arXiv must first specify a section (subject area classification) for their submission, and can later “cross-list” it to additional sections. The arXiv documentation states that “It is rarely appropriate to add more than one or two cross-lists” and that “bad cross-lists will be removed”.

We found that over a quarter of records (n=7,907, 28%) were assigned to only one section (not cross-listed). Just over a third of records were cross-listed to one additional arXiv section (n=10,346, 37%), almost a quarter to two (n=6,720, 24%), and eleven percent (n=3,131) to three or more. Of the records that were listed or cross-listed in Quantitative Biology, the most common cross-list was Biological Physics (physics.bio-ph; n=5,113, 18%; see an example of a cross-listed record in Figure 4, below).

Figure 4. Example of a record classified to q-bio’s Tissues and Organs category and cross-listed to Medical Physics.

3) Dates 

The arXivRaw metadata format contains two date fields:

  • A datestamp, which indicates when the bulk metadata was last updated.
  • A submission date, which indicates the date a record was submitted to the server, and is available for each version of a record.

Of the 8,995 records with multiple versions available, 39 had different versions posted on the same day (e.g., versions one and two; in this case, a revision was made to adjust the formatting of references). At the other extreme, 235 (0.8%) records included or linked to a revised version posted more than two years later (e.g., versions one and two, with 15 years between them; the only difference between the two versions is that one contains a DOI in the header of the document).

4) Contributors 

When uploading records, users enter author names into a free text field, with a recommendation to separate authors with a comma or an “and”. Because of this, the arXivRaw metadata contains only one contributor field, in which all authors are listed. Affiliation information is not part of the submission process, nor does it exist as a separate metadata field. We found, however, that some authors used the free text format of the author field to include their affiliations (examples one and two, Figure 5, below).

Figure 5. In this example, the uploader of the record used the author name free text field to add affiliation information to their arXiv record.
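Because the contributor information arrives as a single free-text string, any author-level analysis starts with a heuristic split. The sketch below assumes names are separated by commas and/or the word "and", as arXiv recommends, and it will inevitably mis-handle affiliations or consortium names like those discussed in the following paragraphs; the example string is made up for illustration.

```python
import re


def split_authors(author_string):
    """Heuristically split a free-text arXiv author string into individual names.

    Splits on commas and on the standalone word 'and'; parenthesized
    affiliation notes are stripped first. Consortium names and unusual
    formats will still need manual checking.
    """
    # Drop parenthesized affiliation notes, e.g. "Jane Doe (Some University)"
    cleaned = re.sub(r"\([^)]*\)", "", author_string)
    parts = re.split(r",|\band\b", cleaned)
    return [p.strip() for p in parts if p.strip()]


print(split_authors("Jane Doe (Some University), John Smith and The Example Consortium"))
# ['Jane Doe', 'John Smith', 'The Example Consortium']
```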

Since 2005, arXiv has implemented authority records as a means for (co-)authors to confirm or claim authorship and add additional information (e.g., affiliation information, ORCID iDs). In the long term, this should help to identify and map all unique authors of arXiv records. The API for authority records is, however, still in development, and the information it contains is not currently part of the arXivRaw metadata format.

When a consortium is listed as one of the authors, users add the consortium name using the same author free-text field. While this was not the case with the arXivRaw format we used, we noticed that some of the other formats split consortium names into first and last name fields. This introduced errors, as the last word in a consortium’s name was treated as a surname and the rest as forenames (e.g., example one: the arXiv metadata format lists “Consortium” as a last name and “The International Schizophrenia” as forenames).

Finally, the arXivRaw metadata format also contains a single text field capturing who submitted a record. Using simple exact matching, we identified 12,768 unique submitters, with 4,775 (37%) uploading more than one preprint and 338 (3%) uploading more than ten. In some of these cases, the name of the uploader differed in spelling or title designation from that used in the record’s author information (e.g., example one, Figure 6, below).

Figure 6. The submitter of this record is also its first author, but the spelling of their name and title is not the same in the two fields.

5) Linking preprints to postprints

As with the other preprint servers we studied, arXiv users can link any submission to its published version. They can do so either by inserting a text reference (in the journal reference field), or by adding a DOI (in a DOI field) for the final publication.

Of the records in the database, 1,517 (5%) included information in a journal reference field, 4,430 (16%) in the DOI field, and 7,991 (28%) in both. Half of the records did not include links to published versions.

However, although links to published versions could suggest that half of the records in the database were preprints that had later been published, it’s difficult to know for certain because arXiv also allows users to submit published papers, which could likewise include journal reference or DOI information. With no metadata field or search filter to distinguish preprints from published papers, the numbers above should not be taken as representative of the number of preprints on arXiv. To get such an estimate, researchers should instead query Crossref or other metadata providers with the DOI and journal reference information for each record, and compare the publication dates of the published records with the arXiv submission dates. Of course, this method has its caveats, as some preprints are only uploaded after the paper has been published.
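As a rough illustration of that lookup, the sketch below fetches a record's Crossref metadata by DOI and compares the publication year with the arXiv submission date. The choice of Crossref date fields ("issued", with "created" as a fallback) is our assumption about which dates best approximate the publication date, and a year-level comparison is only a coarse screen.

```python
import datetime
import requests


def crossref_dates(doi):
    """Return the (year, month, day) date parts Crossref reports for a DOI."""
    url = f"https://api.crossref.org/works/{doi}"
    message = requests.get(url, timeout=30).json()["message"]
    # 'issued' is Crossref's earliest known publication date; 'created'
    # (DOI registration) is used as a fallback when 'issued' is empty.
    date_parts = message.get("issued", {}).get("date-parts", [[None]])[0]
    if date_parts == [None]:
        date_parts = message["created"]["date-parts"][0]
    return tuple(date_parts)


def published_after_preprint(doi, arxiv_submission_date):
    """Coarse check: was the linked publication issued in or after the submission year?

    `arxiv_submission_date` is assumed to be a datetime.date parsed from the
    arXivRaw submission metadata.
    """
    year = crossref_dates(doi)[0]
    return year is not None and year >= arxiv_submission_date.year


# Example usage with a hypothetical DOI and submission date:
# published_after_preprint("10.1234/example.doi", datetime.date(2017, 3, 1))
```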

Additionally, the journal reference and DOI metadata are not without errors. Despite documentation specifically asking users to provide only a bibliographic reference (without links) in the journal reference field, and despite there being a separate field for DOIs, some users nevertheless entered links (e.g., example one) and DOIs (e.g., example two, Figure 7, below) in the journal reference field.

We found a handful of records that contained duplicate data. A total of 30 (0.1%) records contained a journal reference that corresponded exactly to that of another record, and 38 (0.1%) listed an identical DOI (12 records had both a matching DOI and a matching journal reference). Among those, we found one case where two different preprints had been combined into a single publication (i.e., records one and two), as well as several instances where the version that had been submitted to a journal was posted as a separate record from the preprint version (e.g., records three and four).

Figure 7. The person who submitted this record entered the DOI in three places: the textual journal reference field, the DOI field, and the report number field.

Preprint metadata is dependent on users

So where does this leave us? We found the challenges of working with metadata from arXiv’s q-bio section to be similar to those we encountered with OSF and bioRxiv. However, most of the errors in the data seemed to stem from the way the metadata entry fields were used during preprint upload. To reduce errors like this in the future, we recommend that preprint servers support their users by checking and validating the manual inputs used to generate metadata information.

References

  1. Ginsparg P. ArXiv at 20. Nature. 2011;476(7359):145.
  2. Feldman S, Lo K, Ammar W. Citation Count Analysis for Papers with Preprints. arXiv preprint arXiv:180505238. 2018.
  3. Sutton C, Gong L. Popularity of arXiv.org within Computer Science. arXiv preprint arXiv:171005225. 2017.
  4. Wang Z, Glänzel W, Chen Y, editors. How self-archiving influences the citation impact of a paper. Proceedings of the 23rd International Conference on Science and Technology Indicators; 2018: CWTS, Leiden University; Leiden.
  5. Larivière V, Sugimoto CR, Macaluso B, Milojević S, Cronin B, Thelwall M. arXiv E‐prints and the journal of record: An analysis of roles and relationships. Journal of the Association for Information Science and Technology. 2014;65(6):1157-69.

Note: We shared a draft of our blog post with the arXiv team prior to its publication and incorporated their feedback. Comments? Questions? Drop us a line on Twitter (tag #scholcommlab). 


scholcommlab co-director Stefanie Haustein

OAEvolution: Trends in Open Access and Scholarly Communications

This week marks the 10-year anniversary of Open Access Week—a global event that strives "to make Open Access a new norm in scholarship and research." In celebration of this momentous occasion, the University of Ottawa Library asked ScholCommLab co-director Stefanie Haustein about recent trends in open access and scholarly communication. In this interview with the Library's Communication Advisor Meryem El Marzouki, Stefanie reflects on the past, present, and future of OA in our changing scholarly communication landscape.

uOttawa Library: How do you see the evolution of open access over the last 5 years?

SH: In the last year, Plan S has certainly shaken up academic publishing. A group of Research Funding and Research Performing Organizations in Europe introduced an ambitious roadmap to make all funded research outputs OA. The fact that Plan S does not allow hybrid OA has been seen by many as a (too) radical change. Since publishers make the scholarly community pay twice when they charge hybrid APCs for OA publication in subscription journals, I think it is a very convincing and very much needed regulation, considering that large publishing houses like Elsevier continue to make profits higher than Apple and Big Pharma on the shoulders of academia.

The extent to which scholarly articles are published OA has continually increased over the last few years. A major study conducted for the European Union reported that, since 2007, more than 50% of peer-reviewed journal articles have been made available OA. More conservative estimates report that 45% of articles published in 2015 are OA.

Percent of articles by OA status, Crossref-DOIs sample vs Unpaywall-DOIs sample. Graph by Piwowar et al. (2018)

However, what is considered OA beyond green and gold depends on who you ask: Should copies of papers on academic social networks, which often infringe copyright, be considered OA? Should documents on illegal platforms such as SciHub, so-called "black OA," be included? Or are only articles with a specific free to read and free to reuse license truly OA? A recent study added another colour to the spectrum: bronze OA, for articles that are “free to read on the publisher page, but without any clearly identifiable license”.

uOttawa Library: What is new/special to highlight about open access?

SH: When, in October 2015, the editorial board of the Elsevier journal Lingua resigned after being unable to reach an agreement with the publisher to switch to an OA model and subsequently founded the new OA journal Glossa, a precedent was set for other editorial boards to consider changing to OA. Since then, the boards of at least two journals have followed suit and left their publisher to flip to OA. I am proud to be part of the former editorial board of the Journal of Informetrics (Elsevier), which stepped down unanimously in January 2019 to found the new OA journal Quantitative Science Studies (MIT Press).

But not all publishers seem resistant to change. One of the latest trends is so-called "transformative agreements," where publishers agree to make journals OA by shifting from a reader-pays model (library subscriptions) to an author-pays model (gold OA with APCs) within an agreed-upon time frame. Many of these agreements to flip journals are also based on Plan S requirements, which stipulate that subscription journals need to fully transition to OA by 2023 in order to be compliant.

uOttawa Library: Do you have any idea of what the future might hold for open access?

SH: I would hope that the scholarly community takes back ownership and control over scholarly publishing by flipping journals and demanding OA publication with affordable and transparent APCs. Taking back control of scholarly communication infrastructure does not stop at OA publishing but extends to open data and open citations.

"I would hope that the scholarly community takes back ownership and control over scholarly publishing by flipping journals and demanding OA publication with affordable and transparent APCs."

Scholars need to understand that without the scholarly community submitting, reviewing, and editing manuscripts, a journal is nothing but a fancy website!

This interview was conducted and edited by Meryem El Marzouki, Communication Advisor at the University of Ottawa Library. If you are in Ottawa, check out and participate in uOttawa Library’s OA week events.


bioRxiv homepage

Analyzing Preprints: The challenges of working with metadata from bioRxiv

By Mario Malički, Maria Janina Sarol, and Juan Pablo Alperin

This blog post is the third in a four-part series documenting the methodological challenges we faced during our project investigating preprint growth and uptake. Stay tuned for our final post, in which we’ll dive into the arXiv metadata.


Following our analysis of the challenges of working with SHARE and OSF metadata (read about them in our previous blogs here and here), we explored the metadata of bioRxiv—a preprint server focused on biology research. Since its inception in 2013, bioRxiv has received significant attention1 and has become the largest biology-focused preprint server (Figure 1, below). Studies have shown that posting a preprint on bioRxiv is associated with more citations and higher Altmetric scores than for articles that were not posted as preprints.2-4 Additionally, recent research (itself posted on bioRxiv) has shown slight differences between preprints posted on bioRxiv and postprints published in journals indexed in PubMed: peer-reviewed articles had better reporting of reagents (i.e., drug suppliers and antibody validation) and experimental animals (i.e., reporting of strain, sex, supplier, and randomization), while bioRxiv articles had better reporting of unit-level data, completeness of statistical results, and exact p-values.5

Unlike the OSF servers we previously analysed, bioRxiv accepts only unpublished/unrefereed manuscripts that have not been peer reviewed or accepted for publication in a journal. Furthermore, bioRxiv specifically states it does not accept theses, dissertations, or protocols. It accepts only preprints written in English and uploaded as either Word documents or PDFs. Moreover, all uploaded preprints, including their updated versions, undergo a basic screening process for content that is offensive, non-scientific, or that might pose a health or biosecurity risk, as well as an iThenticate plagiarism check and a duplicate search (information provided by bioRxiv). To encourage the use of preprints, bioRxiv allows users to directly transmit uploaded documents and their metadata to a number of journals, and allows journals to deposit received manuscripts as preprints to bioRxiv.

Figure 1. Monthly submissions to bioRxiv. Graph available at Rxivist under a CC BY 4.0 licence.

Although bioRxiv has not yet released its own public API, its metadata can be retrieved from Crossref and through the Rxivist site. BioRxiv deposits metadata to Crossref as part of the official scholarly record, while the metadata available from Rxivist is scraped from the bioRxiv website.6

For our analysis, we downloaded metadata from June 18 to 20, 2019 for all posted content in Crossref—that is, all “preprints, eprints, working papers, reports, dissertations, and many other types of content that has been posted but not formally published”—and filtered it for records from bioRxiv. Guided by a description of Crossref metadata fields, we then set out to explore the same basic metadata fields as we did for SHARE and OSF. Here’s what we found:
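For readers who want to retrace that step, here is a minimal sketch of paging through Crossref's posted-content records. Restricting the query to the 10.1101 DOI prefix (which bioRxiv uses, and which it now shares with medRxiv) is one way to narrow the results, but it is an assumption about how to approximate our filter rather than the exact query we ran.

```python
import requests

# 10.1101 is the DOI prefix used by bioRxiv (and, since 2019, medRxiv), so
# records still need to be checked against the server/institution metadata.
BASE = "https://api.crossref.org/prefixes/10.1101/works"


def fetch_posted_content(rows=500, mailto="you@example.org"):
    """Page through Crossref posted-content records under the 10.1101 prefix."""
    cursor = "*"
    while True:
        params = {"filter": "type:posted-content", "rows": rows,
                  "cursor": cursor, "mailto": mailto}  # mailto: polite-pool etiquette
        message = requests.get(BASE, params=params, timeout=60).json()["message"]
        items = message.get("items", [])
        if not items:
            break
        yield from items
        cursor = message.get("next-cursor")
        if not cursor:
            break


if __name__ == "__main__":
    sample = next(fetch_posted_content())
    print(sample.get("DOI"), sample.get("type"))
```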

1) Records

Our final database contained 52,531 preprints. Although bioRxiv users have to categorize their preprints into one of three categories (New Results, Confirmatory Results, or Contradictory Results), this information is only available through the bioRxiv search engine—not in the Crossref or Rxivist metadata. On August 28, 405 (0.7%) preprints were listed as Contradictory and 1,090 (1.9%) as Confirmatory; all the rest were classified as New Results.

Although records can be removed in cases of discovered plagiarism or breach of relevant ethical standards (see the bioRxiv Submission Guide for details), there was no metadata field for withdrawn status in either Crossref or Rxivist. However, bioRxiv’s search engine includes a filter option that, on August 28, listed 49 preprints as withdrawn. Unlike for published papers, for which the Committee on Publication Ethics recommends keeping the full text of the record available after indicating its retracted or withdrawn status, some uploaded documents of withdrawn preprints on bioRxiv are removed from the server and their abstracts replaced with removal reason statements. This is similar to the practice we found on OSF.

There is also no metadata information available about the authors of the removal statements, although, in some cases, the statements themselves indicate who initiated the removal (e.g., example one, a preprint removed by the bioRxiv moderators, or example two, a preprint removed at the request of the authors). In several cases, however, the reason for removal is not clear (e.g., example three, Figure 2 below).

Figure 2. Example of a withdrawn bioRxiv preprint.

Using the Crossref metadata, we also identified 10 duplicate cases, where the title and author(s) of one preprint were an exact match to those of one or more other preprints (e.g., example one: preprints one and two are identical; example two: preprints one, two, and three are different versions of the same preprint). It is worth noting that bioRxiv has recently implemented a duplicate search.

2) Subjects

On upload, bioRxiv users have to assign their preprint to one of 25 subject areas. Previously, users could assign preprints to two additional subject areas – Clinical Trials and Epidemiology – until content in these areas was redirected to medRxiv, a preprint server for the clinical sciences launched in June 2019. Taking these two discontinued subject areas into account, we originally expected the bioRxiv metadata to include 27 subject area classifications. Instead, the Crossref metadata listed 29 areas, including two not listed on the bioRxiv website: Scientific Communication (n=16) and Pharmacology (n=26), which existed alongside the listed Scientific Communication and Education (n=322) and Pharmacology and Toxicology (n=400) areas. (A recent update of bioRxiv metadata has corrected this.)

There were only eight records (0.02%) that did not have subject information in the Crossref metadata. Of these eight, only two listed subject information on the bioRxiv website and in Rxivist (i.e., records one and two); the remaining six did not (i.e., records one, two, three, four, five, and six).

3) Dates

There are six date fields in the Crossref metadata for bioRxiv preprints: indexed, posted, issued, accepted, created, and deposited. All six of these fields were populated for each of the bioRxiv preprints in our database.

Crossref defines these fields as follows:

  1. Date indexed: when the metadata was most recently processed by Crossref;
  2. Date posted: when the content was made available online;
  3. Date issued: the earliest known publication date. This was equal to the date posted in all the cases we studied;
  4. Date accepted: the date on which a work was accepted for publication. For 37,730 (72%) records, this was equal to the date posted/issued; for 13,517 (26%), it preceded the date posted; and for 1,290 (2%), it succeeded it. A description of why this is so would be a welcome addition to the metadata documentation;
  5. Date created: when the DOI was first registered. For all but one preprint (Figure 3, below), this date was the same as or later than the date posted/issued—likely because most DOIs are assigned on the same day or a few days after a preprint is created on the bioRxiv server;
  6. Date deposited: when the metadata was last updated (by bioRxiv).

Figure 3. The metadata for this preprint indicates that its DOI was created before the record was uploaded to the server (top). According to Rxivist metadata and the record’s URL, a version of this preprint was previously posted on bioRxiv but is no longer available (bottom).

The Rxivist metadata contains only one date field: the date when a preprint was first posted. Metadata regarding version numbers or update dates is not available from either Crossref or Rxivist. However, it is possible to discern the version number by looking at the end of the biorxiv_url field in Rxivist (e.g., example one, whose biorxiv_url ends with v7).
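A small sketch of that version extraction, assuming the Rxivist biorxiv_url field ends in a "v" followed by the version number (the example URL is made up for illustration):

```python
import re


def version_from_url(biorxiv_url):
    """Extract the version number from a Rxivist biorxiv_url (e.g. '.../123456v7' -> 7).

    Returns None if the URL does not end in a recognizable version suffix.
    """
    match = re.search(r"v(\d+)$", biorxiv_url.rstrip("/"))
    return int(match.group(1)) if match else None


print(version_from_url("https://www.biorxiv.org/content/10.1101/123456v7"))  # 7
```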

4) Contributors

When entering author information during preprint upload to bioRxiv, users are required to fill in each author’s given name, family name, and institutional affiliation. They can also optionally fill in the author’s middle name, middle name initials, name suffix, home page, and ORCID iD. Alternatively, users can choose to add a Collaborative Group/Consortium as an author by filling in a free text field. This information, alongside the uploaded document, is then used to create a final PDF version of the preprint (which users have to approve), from which the final metadata is extracted.

Only three records in the Crossref metadata lacked any author information (specifically, records one, two, and three—the third has since been corrected). All other records included, at minimum, at least one consortium or family name. We found 24 (0.05%) records with an empty given name field. In some cases this was because a consortium name had been entered as a family name (e.g., example one); in others, because only a family name had been entered (e.g., examples two and three). Additionally, we identified 33 records where an author’s full name (given and family name together) exactly matched that of another author on the same record (these errors have since been corrected). For 859 (2%) preprints, we found a Collaborative Group/Consortium listed as one of the authors. Two of these (records one and two) were authored exclusively by a consortium. Of the preprints with consortia authors, 156 (18%) listed more than one consortium. One preprint (see Figure 4, below) listed the same consortium name twice.

Figure 4. Example of the same consortium name appearing twice in the list of authors.

We also identified cases where author names were entered as consortia (e.g., example three), as well as cases where the consortium field was used to enter “zero”, “No”, “None”, or “N/A” (many of these issues have since been corrected). In a handful of cases, different records used different spellings or acronyms for the same consortium (e.g., the Karolinska Schizophrenia Project (KASP) was also spelled as Karolinska Schizophrenia Project KaSP, KaSP, and KaSP Consortium). All of these issues make it difficult to accurately determine the number of unique consortia listed as co-authors.

In the Crossref metadata, each author is assigned a Sequence field identifying them as either the first or an additional author on a preprint. As expected, in almost all cases, the additional authors appeared in the metadata in the same order as on the bioRxiv webpage. However, in the Crossref JSON metadata format, consortia always occupy the last spot of the author list (e.g., example one, where a consortium is in the middle position of the author byline, but last in the metadata), which makes it impossible to use the Crossref JSON format to determine byline order for preprints co-authored with consortia. The Crossref XML metadata format and the Rxivist metadata retain the correct author order (note: they do not contain a field indicating exact byline position).

For 28,663 (55%) preprints in Crossref metadata, at least one author had an ORCID iD; for 4,177 (8%), all of the authors did. For 13 preprints, two or more authors shared the same ORCID iD (e.g., example one and two).

Lastly, although bioRxiv requires uploaders to add affiliation information for all authors during upload, this information isn’t captured in the Crossref metadata (though it is available in Rxivist).

5) Linking preprints to postprints

As indicated in our previous blog posts, it is important to identify whether a record on a preprint server is actually a preprint. bioRxiv monitors the titles and authors of records in Crossref and PubMed. On identifying a potential match, they send an email to the corresponding author for confirmation, and then add that information to the preprint metadata. Additionally, users can manually alert bioRxiv of postprint publications.3

In the Crossref metadata, 20,465 (39%) records had a link to the DOI of their published version. Of those, one record contained its own DOI instead of that of the published postprint (at the time of writing, this error had been corrected on the bioRxiv site but not in the metadata deposited to Crossref). We found an additional 29 pairs of preprints that shared an identical postprint DOI with each other. In some cases, this was caused by different versions of the same preprint being uploaded to bioRxiv as separate preprints (e.g., example one: preprints one and two; example two: preprints three and four). In others, the same version had been uploaded twice as two separate preprints (e.g., example three: preprints five and six, Figure 5, below). Additionally, despite bioRxiv accepting only original research preprints, both review papers and published papers have been found among its records6-7 (a discussion of these issues between the preprint community and bioRxiv moderators can be found here).

Finally, five preprints contained links to reviews and endorsements received on F1000 Prime in Crossref’s relation metadata field (e.g., example one).

Figure 5. This record was uploaded twice on bioRxiv and was assigned different DOIs.

Metadata continues to improve

In the two months between our first download of the bioRxiv Crossref metadata and the writing of this post, the bioRxiv team has been working to improve the metadata they provide. We hope that they will similarly address the errors and omissions we’ve identified above, and enrich the metadata with information about author affiliations, withdrawn records, article categorization, versions, and comments. Overall, we identified fewer issues with bioRxiv metadata than we did with SHARE and OSF. We hope researchers using bioRxiv metadata from either Crossref or Rxivist will report whether and how they overcame the challenges we described.

References

  1. Kaiser J. Are preprints the future of biology? A survival guide for scientists. Science. 2017;397.
  2. Fu DY, Hughey JJ. Releasing a preprint is associated with more attention and citations. bioRxiv. 2019:699652.
  3. Fraser N, Momeni F, Mayr P, Peters I. The effect of bioRxiv preprints on citations and altmetrics. bioRxiv. 2019:673665.
  4. Serghiou S, Ioannidis JP. Altmetric Scores, Citations, and Publication of Studies Posted as Preprints. JAMA: The Journal of the American Medical Association. 2018;319(4):402-4.
  5. Carneiro CFD, Queiroz VGS, Moulin TC, Carvalho CAM, Haas CB, Rayêe D, et al. Comparing quality of reporting between preprints and peer-reviewed articles in the biomedical literature. bioRxiv. 2019:581892.
  6. Abdill RJ, Blekhman R. Meta-Research: Tracking the popularity and outcomes of all bioRxiv preprints. eLife. 2019;8:e45133.
  7. Anaya J. 2018. [cited 2019]. Available from: https://medium.com/@OmnesRes/the-biorxiv-wall-of-shame-aa3d9cfc4cd7.

Comments? Questions? Drop us a line on Twitter (tag #scholcommlab). We have shared the issues outlined above with the bioRxiv team, and incorporated their feedback into this post. Additionally, recently bioRxiv has released a beta version of their own API. We welcome your insights and experiences with working with preprint metadata. Our source code is available on GitHub.


Large stack of papers on a white background

Analyzing preprints: The challenges of working with OSF metadata

By Mario Malički, Maria Janina Sarol, and Juan Pablo Alperin

This blog post is the second in a four part series documenting the methodological challenges we faced during our project investigating preprint growth and uptake. Stay tuned for our next post, in which we’ll describe our experiences with the bioRxiv metadata.

Our analysis of the SHARE database, described in our previous blog post, revealed many challenges of working with metadata merged from different preprint servers. To continue with our research, we needed to proceed with a detailed analysis of one server (provider) at a time. We decided to start with preprint servers created by, or in partnership with, the Center for Open Science (COS). Using its Open Science Framework (OSF) preprints infrastructure, COS launched its first interdisciplinary preprint server, titled OSF Preprints, in 2016, and in the years following helped launch an additional 25 servers (Figure 1, below), each tailored to a specific discipline or country. We downloaded all 29,430 records available through the OSF API on June 18, 2019 and explored the main metadata fields, just as we had done for SHARE (documentation on OSF server metadata fields is available here).

Figure 1. The 25 servers backed by the Open Science Framework
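For reference, here is a minimal sketch of how such a download can be paged through the OSF API. The `https://api.osf.io/v2/preprints/` endpoint and its `links.next` pagination follow our reading of the OSF API documentation; the attribute names in the example should be checked against the current docs.

```python
import requests

OSF_PREPRINTS = "https://api.osf.io/v2/preprints/"


def fetch_osf_preprints(page_size=100):
    """Yield preprint records from the OSF API, following 'next' links."""
    url = OSF_PREPRINTS
    params = {"page[size]": page_size}
    while url:
        payload = requests.get(url, params=params, timeout=60).json()
        yield from payload.get("data", [])
        # Subsequent pages come back as fully formed URLs in links.next.
        url = payload.get("links", {}).get("next")
        params = None  # the next-link already carries the query string


if __name__ == "__main__":
    first = next(fetch_osf_preprints())
    print(first.get("id"), first.get("attributes", {}).get("title"))
```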

1) Sources and records per source

While we could easily calculate how many records existed on each of the 25 servers, we soon realized that records do not necessarily correspond to preprints. Identifying preprints among the records ended up being more challenging than we had anticipated for a number of reasons.

First, although servers employing the OSF infrastructure can accept different types of records, there is no metadata field (or filter on the OSF search engine) to distinguish between them (SHARE metadata also lacks such a field). Our exploration revealed that several of the records were postprints, i.e., published, peer-reviewed papers (e.g., example one); full conference proceedings containing several papers (e.g., example two); PowerPoint presentations (e.g., example three); or Excel files (e.g., example four). Some of the records were also theses, with one of the servers, Thesis Commons, created specifically for this document type.

Second, we identified a significant number of duplicate records which, if left unaccounted for, would overestimate the number of preprints or records available. Using strict exact matching of titles and full author names, we found 1,810 (6%) duplicate OSF records. These records corresponded to 778 unique title and author pairs that had been reposted at least once, either on the same OSF preprint server (e.g., example one: records one and two, Figure 2, below) or on different servers (example two: records one, two, three, four, and five, or example three: records one and two).

a comparison of two different preprint records
Figure 2. The same document uploaded as two different records on an OSF Preprints server, one created on May 6 and the other on May 30, 2019.

This number, however, underestimates the actual number of duplicates, as users creating these records on OSF servers sometimes enter slightly different information. For example:

  • different authors for the same record, as in example four: records one and two;
  • different spellings of an author’s name or omission of an author’s middle name, as in example five: records one and two; or
  • the same record uploaded in two different languages, as in example six: records one and two.

To reliably identify all of the duplicates, we'd need a more detailed comparison of titles, names, abstracts, and linked published papers (outlined later in this post).
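To illustrate, here is a minimal sketch of the kind of strict exact-match check described above. It assumes the records have already been reduced to simple dictionaries with an id, a title, and a list of full author names, and it applies light whitespace and case normalization, which is a simplification of our actual matching.

```python
# Minimal sketch: group records by (title, author list); groups larger than one are duplicates.
from collections import defaultdict


def duplicate_groups(records):
    """Return {(title, authors): [record ids]} for every key shared by 2+ records."""
    groups = defaultdict(list)
    for rec in records:
        key = (
            rec["title"].strip().lower(),
            tuple(name.strip().lower() for name in rec["authors"]),
        )
        groups[key].append(rec["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical example: the first two records share a title and author list.
sample = [
    {"id": "abc12", "title": "A study of X", "authors": ["Ana Perez", "Bo Li"]},
    {"id": "def34", "title": "A Study of X ", "authors": ["Ana Perez", "Bo Li"]},
    {"id": "ghi56", "title": "Something else", "authors": ["Ana Perez"]},
]
print(duplicate_groups(sample))
```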

Third, 96 (0.3%) of the records we identified had been withdrawn at some point. But unlike for published papers, for which the Committee on Publication Ethics recommends keeping the full text of the record available after indicating its retracted or withdrawn status, withdrawn documents on OSF are removed, leaving behind only the webpage and none of the content. This makes it impossible to determine whether withdrawn records were indeed preprints or some other kind of document. Reasons for withdrawal were listed on the webpage and available in the metadata for 70 (73%) of these records. But while some withdrawal reasons were written by the server moderators (e.g., example one) and some by the authors (e.g., example two), there were also cases where it was not clear who wrote them (e.g., example three, Figure 3). No (metadata) information was available on the individuals who provided the reasons for withdrawal.

Additionally, we found six orphaned records, for which the metadata existed in the database but the webpage was no longer accessible. We could not find an explanation of the difference between orphaned and withdrawn records, except for OSF's definition of orphaned in the metadata documentation, which reads: "A preprint can be orphaned if it's primary file was removed from the preprint node. This field may be deprecated in future versions."

Withdrawn preprint record
Figure 3. An example of a withdrawn record from the MinRxiv preprint server. No information is provided on who withdrew the record (authors or server moderators).

2) Subjects (i.e., scientific discipline or subdiscipline classification)

OSF servers use the bepress discipline taxonomy for classifying records into scientific fields, albeit with slight modifications. For example, bodoArXiv, a server for medieval studies, introduced Medieval Studies as a top-level discipline, while the bepress taxonomy places Medieval Studies under the Arts and Humanities top-level discipline. Similarly, Meta-Science and Neuroscience were introduced as top-level disciplines in PsyArXiv, while in bepress Meta-Science does not exist as a separate sub-discipline and Neuroscience falls under the top-level Life Sciences discipline. Only two of the records we identified did not have a subject classification.

3) Dates

OSF metadata contains seven date fields, two of which we planned to use for our analyses: date_published (defined as "the time at which the preprint was published") and original_publication_date (defined as "user-entered, the date when the preprint was originally published"). The definitions of the other five date fields (date_created, preprint_doi_created, date_withdrawn, date_modified, and date_last_transitioned) are available here. The date_published field appears to be automatically populated, and was therefore present for all records. Original_publication_date, on the other hand, was entered by users for only 9,894 (35%) records.

We also found instances where the original_publication_date differed from the date listed in the Crossref metadata (e.g., example one, where the user-entered date is February 13, 2018, but Crossref lists February 14; or example two, Figure 4 below, where the user entered December 31, 2008, but Crossref lists December 3, 2009). These kinds of discrepancies make it difficult to identify which records correspond to preprints (i.e., by comparing the date when a record was created on a preprint server with the date it was published as a peer-reviewed paper). They also pose challenges for determining at what stage of the dissemination process authors uploaded the preprint (i.e., on the same date they submitted the paper to a journal, or at the time of the paper's acceptance).
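The sketch below illustrates how such a cross-check against Crossref could be made using Crossref's public REST API. Note that the choice of Crossref's "issued" date is only one of several possible date fields, and the DOI in the usage comment is hypothetical.

```python
# Minimal sketch: compare a user-entered publication date with Crossref's "issued" date.
import requests


def crossref_issued_date(doi):
    """Return the (year, month, day) 'issued' date-parts for a DOI, padding missing parts with None."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    parts = resp.json()["message"]["issued"]["date-parts"][0]
    return tuple(parts + [None] * (3 - len(parts)))


def dates_disagree(user_date, doi):
    """True when any date component present in both sources differs."""
    crossref_date = crossref_issued_date(doi)
    return any(
        u is not None and c is not None and u != c
        for u, c in zip(user_date, crossref_date)
    )


# e.g. dates_disagree((2018, 2, 13), "10.xxxx/hypothetical-doi")
```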

Two records associated with a preprint
Figure 4. Example of a mismatch between the user-entered original publication date (above) and the date listed in Crossref and on the record's journal website (below).

4) Contributors

In our previous blog post, we explained how, in SHARE, Contributors were divided into Creators (authors) and Contributors (uploaders). In OSF, Contributors are also divided into two similar categories: bibliographic, those who are the authors of the record, and non-bibliographic, those who can upload (new) versions of the record. We found 2,278 (8%) records that had at least one non-bibliographic Contributor (uploader), but only 14 (1%) of those records also listed one of those uploaders as an author. This confirms, as with the data in SHARE, that uploaders are rarely also the authors of the records they upload, and indicates that users likely misunderstand these entry fields when uploading preprints.
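As an illustration, the following sketch splits a preprint's contributors into the two categories, assuming the OSF API v2 per-preprint contributors endpoint and its bibliographic attribute behave as we have described; it is not our exact analysis code.

```python
# Minimal sketch: separate a preprint's bibliographic (author) and
# non-bibliographic (uploader) contributors using the OSF API v2.
import requests


def split_contributors(preprint_id):
    """Return (bibliographic_ids, non_bibliographic_ids) for one preprint."""
    url = f"https://api.osf.io/v2/preprints/{preprint_id}/contributors/"
    bibliographic, non_bibliographic = [], []
    while url:
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        payload = resp.json()
        for contrib in payload["data"]:
            if contrib["attributes"]["bibliographic"]:
                bibliographic.append(contrib["id"])
            else:
                non_bibliographic.append(contrib["id"])
        url = payload["links"].get("next")  # follow pagination, if any
    return bibliographic, non_bibliographic


# A record with at least one non-bibliographic contributor is one where the
# uploader is (usually) not listed among the authors.
```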

We identified many of the same issues that plague the Contributor field in SHARE in the OSF metadata as well. Taken together, these issues make the author metadata unreliable for authorship network analysis and other scientific inquiries. For example, we identified cases where the number of authors in the metadata differed from the number of authors listed in the uploaded documents (e.g., example one: only one author is listed on the record, but three can be found in the uploaded PDF).

We also encountered cases where Contributors that were specified as non-bibliographic should have been bibliographic (e.g., example two: the metadata lists one bibliographic and five non-bibliographic authors, while the uploaded PDF lists all six as bibliographic authors), as well as cases where the authors listed in the metadata were not among the authors listed in the uploaded PDF (e.g., example three, Figure 5 below: the single author in the metadata differs from the three listed in the uploaded PDF). Additionally, there were 31 (0.1%) records where the same author (using simple exact full name matching) was listed more than once as an author on the same record (e.g., example four).

A preprint on the engrXiv server
Figure 5. Example of authors listed in the uploaded document differing from those entered as authors on the engrXiv preprint server.

When we dug deeper into the analysis of authors' names, we also saw that 137 (0.5%) records were missing an author's first (given) name or initials (e.g., example one), and 2,091 (7%) were missing a last name (e.g., example two). We also found instances where users listed the name of a journal as the name of an author (e.g., example three, Figure 6 below), the name of a proceeding (e.g., example four) or of an institute (e.g., example five), and cases where professional titles were included in names (e.g., example six).

A preprint posted on the AgriXiv server
Figure 6. Example of a journal entered as an author on the AgriXiv preprint server.

Additionally, as in SHARE, we found inconsistencies regarding the position of authors in the byline order (i.e., their place in the list of authors). We identified 67 cases where two or more authors had the same position in the byline order according to the metadata, and to the best of our knowledge none of these cases were due to the practice of equal or shared contributorship (e.g., example one: the last two authors share a byline order position according to the metadata, but not according to the uploaded PDF; example two: the first two authors share a byline order position according to the metadata, but not according to the uploaded PDF). Finally, we also identified cases where the byline order in the metadata was completely different from the one in the uploaded document (e.g., example three).

As a final remark on the Contributor field, we noted that information on authors' affiliations is not (directly) included in a record's metadata; it is available only for authors who have created an OSF account and have taken the time to fill in their employment information. That is, uploaders who post preprints on the OSF servers are not asked to fill in affiliation information or professional identifiers (e.g., an ORCID iD) for themselves or for their co-authors at the time of the preprint upload. Instead, co-authors are sent an email by the system asking them to create an account, and if they choose to do so, they are then prompted to include their identifiers, education, and professional affiliations (none of these fields are mandatory). Moreover, because this metadata is linked to the user and not the preprint, it can easily introduce errors and inconsistencies, as affiliations change over time. There were 9,741 (33%) records for which at least one of the co-authors did not create a profile on OSF. Of the 17,510 unique users who did create profiles, only 3,218 (18%) provided affiliation information.

Additional exploration of OSF servers

Despite all of the issues with the basic OSF metadata described above, we still wanted to include OSF servers in our analysis of preprint growth and uptake. To help estimate the number of unique preprints (not records) on the servers, we explored an additional metadata field that we had not explored in SHARE: a record's link to its published, peer-reviewed version.

Like many other preprint servers, OSF servers offer a free-text field that allows users to link their preprints to their published (likely peer-reviewed) versions. This can be done at the time the preprint is uploaded (if the published version already exists), or at any time after it becomes available. However, on OSF, the field is labelled "Peer-reviewed publication DOI", which means that preprints cannot be linked to papers published in venues that do not assign a DOI. We found 4,965 (17%) records with a peer-reviewed publication DOI.

However, after exploring the data, it became apparent that this field alone cannot be used as a clear indicator of whether or not a record corresponds to a preprint (e.g., example one, where the peer-reviewed publication DOI points to an identical PDF that was uploaded as a preprint). In 491 (10%) of cases, the "peer reviewed" DOI pointed to another OSF preprint (e.g., example two, where the DOI points to a duplicate of the same record). Furthermore, 70 (14%) of these records used the record's own DOI as the peer-reviewed publication DOI (e.g., example three, Figure 7 below). We also identified cases where the DOI entered could not be resolved (i.e., the link does not work, as in example four). In addition, we found 362 (7%) cases where DOIs pointed to preprints or data deposited in the Zenodo repository and 9 (0.2%) cases where they pointed to items deposited in the Figshare repository. Finally, we found 305 (6%) records that shared the same peer-reviewed publication DOI with at least one other record in an OSF repository. In many cases, this was because the preprints were duplicates of each other, but in others, the error seemed to be a data entry issue (e.g., example five: records one and two are different preprints that point to the same published paper).
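The sketch below illustrates the kinds of checks involved: whether the linked DOI is the record's own DOI, whether it falls under a known repository DOI prefix, and whether it resolves at all. The prefix lists are illustrative rather than exhaustive, and resolving DOIs with a HEAD request is a simplification (some publishers block such requests).

```python
# Minimal sketch: coarsely classify a record's "Peer-reviewed publication DOI".
import requests

# Illustrative prefixes only, not a complete list.
OSF_PREFIXES = ("10.31219", "10.31234", "10.31235")  # e.g., OSF Preprints, PsyArXiv, SocArXiv
ZENODO_PREFIX = "10.5281"
FIGSHARE_PREFIX = "10.6084"


def classify_linked_doi(record_doi, linked_doi):
    """Return a coarse label for the peer-reviewed publication DOI of a record."""
    if not linked_doi:
        return "missing"
    if record_doi and linked_doi.lower() == record_doi.lower():
        return "self-referencing"
    if linked_doi.startswith(OSF_PREFIXES):
        return "osf-hosted record"
    if linked_doi.startswith(ZENODO_PREFIX):
        return "zenodo"
    if linked_doi.startswith(FIGSHARE_PREFIX):
        return "figshare"
    return "other"


def doi_resolves(doi):
    """Check whether a DOI resolves via doi.org (a rough proxy for 'the link works')."""
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
    return resp.status_code < 400
```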

A preprint with incorrect DOI information
Figure 7. Example of a record where the DOI of the preprint (Paper DOI) and the Peer Review Publication DOI were identical.

Proceed with caution

As can be seen from the above, using OSF metadata to answer even the most basic questions about preprint growth and uptake is problematic. The contributor metadata, in particular, was very unreliable, making it difficult to conduct any analyses that rely on author information, including searching for possible postprints using the authors' names. We therefore urge caution among researchers and stakeholders citing the number of preprints reported on the OSF website or in recently published scientific articles1,2,3. To the best of our knowledge, no analyses conducted so far have fully addressed the issues we encountered.

Note: We communicated many of the issues described above to the OSF team at the Center for Open Science, who have indicated they are taking steps to address them. The most recent version of the OSF roadmap is available here.

References

  1. Narock T, Goldstein EB. Quantifying the Growth of Preprint Services Hosted by the Center for Open Science. Publications. 2019;7(2):44.
  2. Rahim R, Irawan DE, Zulfikar A, Hardi R, Gultom E, Ginting G, et al., editors. INA-Rxiv: The Missing Puzzle in Indonesia’s Scientific Publishing Workflow. Journal of Physics: Conference Series; 2018: IOP Publishing.
  3. Riegelman A. OSF Preprints. The Charleston Advisor. 2018;19(3):35-8.

Comments? Questions? Drop us a line on Twitter (tag #scholcommlab). We have shared the issues outlined above with the COS team, and welcome your insights and experiences with working with preprint metadata. Our source code is available on GitHub.  


a stack of preprints

Analyzing preprints: The challenges of working with SHARE metadata

By Mario Malički, Maria Janina Sarol, and Juan Pablo Alperin

As can be seen from a series of recent publications,1-6 there is great interest surrounding preprints—scholarly manuscripts that are published ahead of peer review.

a stack of preprints
"97688449"by pttgogofish is licensed under CC BY-NC-SA 2.0

And while these studies show a rise in the number of documents posted on preprint servers, they fail to offer a clear picture of how the growing number of preprint servers is being used, or of how different communities are engaging with preprints. That's why ASAPbio teamed up with the Scholarly Communications Lab on a project investigating preprint uptake and use across the life and other sciences.

As we will describe in this four-part blog series, we encountered many challenges in working with preprint metadata, including a lack of documentation, missing values, and incompatible and erroneous data. We have documented these challenges in the hope of helping others who, like us, would like to better understand what is happening with preprints.

We begin this series with an analysis of the resource that was supposed to aggregate metadata from multiple preprint sources, the SHARE database; the other three posts will focus on the Open Science Framework (OSF) preprint servers, bioRxiv, and arXiv.

A deep dive into the Center for Open Science’s SHARE database

The Center for Open Science (COS), the same group behind the OSF preprint servers, partnered with the Association of Research Libraries to build the SHARE infrastructure in order to harvest and normalize metadata from more than 100 sources (preprint servers). Within the 2,222,495 records contained in SHARE as of January 2019 (the full dataset is available here, 1.46 GB uncompressed, provided to us by the SHARE team), we uncovered many discrepancies, omissions, and errors, which ultimately prevented us from using it as a source for our planned analyses. We present our findings for each of the major metadata fields available in SHARE:

1) Sources and records per source

We could not find information on SHARE's website or GitHub pages specifying to what extent records are harvested from the different sources available in SHARE. In the records database they provided us, we identified 103 unique sources (servers). But we also found some that did not contain all of the records available at the sources themselves (e.g., there were only 24 preprints from SSRN, out of the more than 700,000 preprints housed there today). In addition, although SHARE attempts to consolidate information from multiple sources for each record (e.g., see Figure 1, below, for a record with information available from four sources), we also found instances where information from different records was consolidated as if it all corresponded to the same record (e.g., one example where information on two different arXiv records, records one and two, was merged, most likely because both of these records link to the same postprint publication). A clear description of how information is harvested from each source would help researchers attempting to use this data.
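As a simple starting point, the sketch below counts records per source. It assumes the SHARE export is newline-delimited JSON with one record per line and a sources list on each record; both are assumptions about the dump format rather than a documented schema, and the filename is hypothetical.

```python
# Minimal sketch: count how many records each source contributes to a SHARE dump.
import json
from collections import Counter


def records_per_source(path):
    """Tally records by source, assuming one JSON record per line with a 'sources' list."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            for source in record.get("sources", []):
                counts[source] += 1
    return counts


# counts = records_per_source("share_dump.jsonl")  # hypothetical filename
# for source, n in counts.most_common(10):
#     print(source, n)
```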

Figure 1. SHARE aggregates preprint information from multiple sources

2) Subjects (i.e., scientific discipline or subdiscipline classification)

We were also unable to find documentation on how SHARE handles subject/discipline classifications from various sources, nor on whether there were plans to use a single classification system for all records. We did confirm that, of the records that had a subject field value, 98% used the bepress classification of disciplines. However, this was only 1 of 21 different classifications we identified in the database.

Most records, however (1,517,874, or 68%), did not have subject information at all. We felt more could have been done to fill in the subject classifications for these records. For example, records from the RePEc preprint server, which deals exclusively with economics, could be assigned the subject Social and Behavioural Sciences, as per the bepress classification; similarly, records from the various arXiv repositories could be classified based on their respective fields.
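The sketch below shows the kind of source-based augmentation we have in mind: when a record from a single-discipline source has no subject, assign the source's discipline. The mapping and field names are illustrative, not part of the SHARE schema.

```python
# Minimal sketch: fill in missing subjects based on a single-discipline source.
SOURCE_TO_SUBJECT = {
    "RePEc": "Social and Behavioural Sciences",  # economics preprints
    # additional single-discipline sources could be mapped here
}


def augment_subject(record):
    """Assign a subject from the record's source when the subject field is empty."""
    if not record.get("subjects"):
        subject = SOURCE_TO_SUBJECT.get(record.get("source"))
        if subject:
            record["subjects"] = [subject]
    return record


# Example: a subject-less RePEc record picks up the mapped discipline.
print(augment_subject({"source": "RePEc", "subjects": []}))
```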

3) Dates

Dates are a complicated metadata field, as research documents can be time-stamped at many different moments: when they are uploaded, when their DOI is minted, when they are made public, updated, retracted, and so on. However, the SHARE metadata contains only two date fields per record that are populated using the source's metadata (an additional three date fields exist for timestamps that occur within the SHARE database itself). These two source-based dates are: date_published (defined as "date when the record was first published, issued, or made publicly available in any form") and date_updated (defined as "date when the record was last updated in the source").

However, 2,085,123 (94%) of the records we examined didn't have a date_published in SHARE, and only had the date_updated. This made it impossible to distinguish whether the date_updated referred to the date the record was first created, when its DOI was minted, or when it was published in a printed or online version of a journal. In other words, we couldn't use this data to determine the elapsed time between a preprint and its published paper (postprint), or to confirm whether a record was a preprint at all (we'll expand on this issue of preprint/postprint identification in our next blog post). For researchers and users who rely on this database, a clear indication of which dates are harvested from each source would be very welcome.

4) Contributors

In the SHARE database, a Contributor is categorized as either a Creator (defined as "the agent who authored or was a major contributor to the creation of the work") or as a Contributor (defined as "the people or organizations responsible for making contributions to an object"). In more familiar scholarly terms, Creators are authors and Contributors are uploaders, i.e., those who uploaded initial or revised versions of the preprints. Any Creator (author) or Contributor (uploader) is further classified as being either a person, institution, organization, or consortium.

This classification system poses several challenges. First, because any preprint record can have multiple Creators and multiple Contributors, SHARE metadata cannot be used to determine who initially uploaded the preprint (and filled in the information on the authors, keywords, and other preprint details). Instead, the metadata only offers information about who can upload new versions. This, however, is a common issue in most bibliographic databases; information on who initially submitted a manuscript to a journal is seldom provided.

Second, information on the Contributors (uploaders) was available for only 86,579 (4%) of records (most of which were preprints hosted on the OSF preprint servers, arXiv, or HAL). While it could be assumed that the Contributors (uploaders) would most often also be Creators (authors) of those records, we found, using simple exact name matching, that this was the case in only 2,737 (3%) of the 86,579 records. And just as we were initially confused by the two terms, we also found examples where users uploading preprints might have misunderstood them as well. Such is the case shown in Figure 2, where one person is listed as a Creator (author) and five are listed as Contributors (uploaders), when in fact all six are authors.
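The sketch below shows the kind of simple exact-name overlap check we describe, assuming each record has been reduced to two lists of full names (Creators and Contributors); the light whitespace and case normalization is a simplification of strict exact matching.

```python
# Minimal sketch: does any Contributor (uploader) of a record also appear among its Creators (authors)?


def uploader_is_author(record):
    """True when at least one name appears in both the creators and contributors lists."""
    creators = {name.strip().lower() for name in record.get("creators", [])}
    contributors = {name.strip().lower() for name in record.get("contributors", [])}
    return bool(creators & contributors)


# share_records = [...]  # parsed SHARE records reduced to name lists
# overlap_count = sum(uploader_is_author(rec) for rec in share_records)
```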

Figure 2. SHARE lists only one of six authors as the “Creator” of this preprint

Affiliation information was almost non-existent for both authors and uploaders: we found no affiliation information for 2,153,579 (97%) of records.

Even more problematic were records where the number of authors in the metadata did not match the number of authors in the uploaded documents (e.g., this example lists one person as the only Creator and Contributor of the record, but the uploaded PDF has two additional authors, see below; and this example lists one Creator in the metadata, but three authors in the PDF—none of whom is the supposed Creator).

Figure 3. We identified several records in which the contributor names in the metadata did not match those in the uploaded documents

These errors make it very difficult to answer even the most basic preprint authorship questions reliably: How many authors does a specific preprint have? And what is the average (or median) number of authors per preprint? These issues also constrain more detailed investigations into authorship networks; author affiliations, countries or regions of origin, seniority; and identifying all preprints published by a single author.

“These errors make it very difficult to answer even the most basic preprint authorship questions reliably: How many authors does a specific preprint have? And what is the average number of authors per preprint?”

Comparing the metadata of each record with the information in the uploaded documents seems to be the only way to detect all of these errors. Therefore, we can only presume (and hope) that the percentage of records with an incorrect number of authors is small and will not distort our analyses. Preprint servers, however, should apply better methods or checks to ensure information on authors is correctly entered into the metadata when preprints are submitted or posted.

Unfortunately, these were not the only issues that would make analyses of preprint authorship challenging. In SHARE, the byline order position (i.e., place in the list of authors) is provided in the metadata field order_cited for each author or Creator. We found that 454,866 (20%) of records assigned the same byline order position to several different Creators (authors). To the best of our knowledge, these cases do not represent instances where authors contributed equally to a paper or preprint. Instead, they appear to be metadata errors.
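A check for this kind of error is straightforward; the sketch below flags records in which two or more Creators share an order_cited value, assuming each record carries a list of creators with that field.

```python
# Minimal sketch: flag records where two or more creators share the same order_cited value.
from collections import Counter


def has_duplicate_byline_positions(record):
    """True when at least two creators of a record share an order_cited position."""
    positions = [creator["order_cited"] for creator in record.get("creators", [])]
    return any(count > 1 for count in Counter(positions).values())


# flagged_ids = [rec["id"] for rec in share_records if has_duplicate_byline_positions(rec)]
```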

One such error can be found in this preprint, where the metadata lists 4 authors, 3 of them with the same byline order position, while the source's website shows only two authors and the uploaded PDF only one. Other errors include listing all of the Creators (authors) twice in the metadata (see Figure 4, where 4 authors are included in the uploaded document and 8 in the metadata) and duplicating just one of the names (see Figure 5, which lists 6 authors on the paper, but 7 in the metadata).

Figure 4. Eight authors are included in the metadata, but only four are in the original record
Figure 5. Seven authors are listed in the metadata, but only six in the preprint itself

We presume these errors most likely occur when the record information is harvested from two different sources, and each source contains a slight variation in the spelling of authors' names or initials. Additionally, even though we could not find documentation on the order_cited field for Contributors (uploaders), we found examples, as in the case of this preprint, where the metadata listed the same byline order position for two uploaders.

To complicate things even further, we also found cases where the byline order listed in the metadata did not match the order in the posted PDF document (e.g., Figure 6 and Figure 7):

Figure 6. The byline order position for the second and fourth authors in the uploaded document is reversed in the metadata
Figure 7. The byline order position for the second and third authors in the uploaded document is reversed in the metadata

All of the Contributor errors we detected mean that if researchers rely solely on the SHARE metadata, any authorship analyses they conduct will likely be very unreliable. We feel more work is needed to improve the quality of authorship information during the creation of a centralized database.

Still a long way to go

As can be seen from the many examples described above, even the most basic metadata in SHARE is too problematic to support a reliable analysis of preprint uptake and use across the life or other sciences. The issues we encountered are further compounded by the lack of clear and complete documentation about metadata fields, harvested sources, and the way records from those sources are integrated into the SHARE database. While we cannot attest to the value of SHARE as a service for preprint discovery, we find that, in its current form, it cannot be considered a valuable source for any rigorous analysis of preprints.

In the end, we decided to discard the SHARE database as a starting point for our analyses and turned instead to the original servers that host the information. Stay tuned for the next post in our series, in which we'll explore metadata from 24 OSF Preprint servers.

Acknowledgements: We would like to thank Naomi Penfold, Associate Director of ASAPbio, for comments on an earlier version of this post.

Note: We shared a draft of our blog post with the SHARE team at the Center for Open Science prior to its publication, but did not receive a response at the time of publishing.

 

References

  1. Fu DY, Hughey JJ. Releasing a preprint is associated with more attention and citations. bioRxiv. 2019:699652.
  2. Chiarelli A, Johnson R, Pinfield S, Richens E. Preprints and Scholarly Communication: Adoption, Practices, Drivers and Barriers [version 1; peer review: 2 approved with reservations]. F1000Research. 2019;8(971).
  3. Fraser N, Momeni F, Mayr P, Peters I. The effect of bioRxiv preprints on citations and altmetrics. bioRxiv. 2019:673665.
  4. Serghiou S, Ioannidis JP. Altmetric Scores, Citations, and Publication of Studies Posted as Preprints. JAMA: The Journal of the American Medical Association. 2018;319(4):402-4.
  5. Narock T, Goldstein EB. Quantifying the Growth of Preprint Services Hosted by the Center for Open Science. Publications. 2019;7(2):44.
  6. Abdill RJ, Blekhman R. Meta-Research: Tracking the popularity and outcomes of all bioRxiv preprints. eLife. 2019;8:e45133.

Comments? Questions? Drop us a line on Twitter (tag #scholcommlab). We have shared the issues outlined above with the COS team, and welcome your insights and experiences with working with preprint metadata. Our source code is available on GitHub.