New preprint explores tracing data reuse and citations

April 20, 2023


In our digital era, scientists are certainly sharing and reusing open data. Yet it remains unclear how widespread data reuse and citation practices are within academic disciplines, and why scientists cite—or do not cite—data in their research work.

In a recent preprint from the Meaningful Data Counts project, Kathleen Gregory (postdoctoral researcher at the University of Vienna and University of Ottawa) and fellow ScholCommLab members—Anton Boudreau Ninkov, Chantal Ripp, Emma Roblin, Isabella Peters, and Stefanie Haustein—surveyed nearly 2,500 academic authors to explore their practices, preferences, and motivations for reusing and citing data, and how these practices vary by discipline.

In this interview, we ask Kathleen about how she got involved in the study, why some researchers cite and reuse data while others do not, and how her work informs data citation policies and standards in the scholarly community.

Person sits at a desk staring into their computer filled with graphs

Q1. What drew you into studying data citations, sharing, and reuse?

This project combines two of my long-standing interests: research data practices and bibliometrics. I’ve been interested in data sharing since my early days as an academic librarian, when open access to data from federally funded research was first being discussed. This interest eventually led me to do a PhD exploring how researchers discover, make sense of, and reuse data. Before doing my PhD, I spent some time focusing on bibliometric services in libraries and thinking about the responsible use of metrics in academic assessments. Exploring data citation practices, and their repercussions, provided a nice opportunity to merge these interests.

Q2. What questions or challenges were you setting out to address when you started this work?

One of our big questions in the project as a whole is to learn more about how researchers cite data in their work and for which purposes, and to think about how this varies by academic discipline. We started out by looking at this question bibliometrically in work led by Anton with DataCite. The survey which we report on in the preprint gave us a chance to look at this question from a different angle and to ask researchers not only about how they cite data (spoiler – it isn’t always with a citation in a reference list) and what they actually cite (another spoiler – it isn’t always a dataset) but also to learn more about why they do so.

One of the strengths of the survey is the care that we took with constructing a sample which, to the best of our efforts, is representative of academic authors by discipline, as represented in the Web of Science. This took many discussions to think about how best to do this, and much planning, but it also allows our findings to be stronger and perhaps more generalizable to the broader population.

Q3. What motivates researchers who cite data to do so?

Across the entire sample, we found that the majority of reasons might be construed as motivations which reflect ‘ideal’ research practice (e.g. to show intellectual debt), to help others to find data, or to support the validity of their research claims. Few respondents indicated that external factors, namely being advised to do so by journals or publishers, were a motivating factor in their decision to cite data.

We did find some significant disciplinary differences in responses to this question, though. For example, social science and humanities (SSH) researchers selected that they cite data to acknowledge intellectual debt more frequently than expected. One possible explanation for this could be tied to common purposes for which our SSH respondents reuse data (e.g. to serve as the basis for a new study or to integrate sources to build an argument).

Q4. How do researchers prefer others to cite their work?

Around 84% of all respondents indicated that they would like others to refer to a related article or publication, followed by the data source (55%), and then the data themselves (46%). I found it super interesting, though, that nearly three-quarters of all respondents selected more than one option for this question. For example, about 950 respondents selected wanting other people to cite or mention both a related research article as well as the data themselves. This complicates the existing narratives suggesting that data citations alone may incentivise sharing data and suggests that a combination of citing different ‘data objects’ may be preferred by researchers.

Q5. In the survey, more than 450 respondents said that they do not reuse data. Why do some researchers not reuse data?

We asked people to self-identify as being a person who reuses data or one who does not. This classification itself is perhaps a bit problematic, as we know from past work in Science and Technology Studies that ‘usage’ of something—be that a car or the Internet or research data—exists on a spectrum. A researcher may reuse data to teach a class once per year, or they may be involved in a years-long comparative study and be reusing data daily, or they may not reuse data at all.

That said, the most frequently selected options were that reusing data was not relevant to their research methods or in their research communities. Other less frequently selected reasons—such as a lack of relevant data or difficulties finding data—indicate that perhaps some people may have tried to reuse data previously. I bring this up as I think it is important to remember that this variation in practice exists, that not everyone is a re-user of data, and that other ways of conducting research are also valuable.

Q6. What are your most interesting or surprising findings?

Perhaps most surprising to me at a high level was how much our data had to say about the practices of SSH researchers. We found that SSH researchers are the only disciplinary group that prefers that others cite or mention their own data, as opposed to other data objects (e.g. an article or a collection of data). This contrasts with past scientometric work showing that SSH scholars cite ‘data studies,’ rather than individual data files.

We were also able to identify unique differences between social scientists and humanities scholars, such as the tendency for humanities scholars to integrate data more than most other disciplinary groups or the pattern of social scientists to refer to related publications (rather than data) less than other groups.

A quote stating "It's an important reminder that we need to take our time and involve research communities in the development of policies and systems" by Kathleen Gregory, a member of the Scholarly Communications Lab.

Q7. Why is your study important?

Our study brings up what I think is a key question when designing policies and standards in the research data space in general. Our data show that data citation is rooted in longstanding disciplinary practices (e.g. the use of footnotes in the humanities), choices about standards (e.g. the use of APA citation guidelines in the social sciences) and choices about what to cite (e.g. the common practice of citing publications in addition to or instead of data).

Our data also show that most researchers who are citing data (in any way) report doing so for reasons in line with responsibly conducting research. This suggests that their current citation practices may be meeting both the norms of their communities and their own ethical principles. If this is the case, the questions I see which we need to consider are: When (and how) do we meet researchers where they are? When should their research practice be adapted to current technical requirements and our own recommendations for data citation, and when should those requirements and recommendations be adapted to reflect researchers’ actual practice?

It’s an important reminder that we need to take our time and involve research communities in the development of policies and systems.

To stay up to date with the Meaningful Data Counts project, check out Zenodo for all research outputs and sign up for our newsletter.
