Different Methods of Identifying Preprint Matches Yield Diverging Estimates of Rates of Preprinting

Jun 24, 2025


By Katie Corker (ASAPbio) and Ludo Waltman (Centre for Science and Technology Studies, Leiden University)

Each year, more research papers are published as preprints. However, this growth is occurring at a time when the number of journal articles is also increasing rapidly. For advocates of preprinting like us, it is important to understand the growth of preprints in the context of this broader growth of journal articles.

One approach to measuring the rate of preprinting is to look at the percentage of journal articles that have an associated, or matched, preprint. In the diagram below, this value is represented by the light pink area divided by the light grey circle. This approach normalizes the quantity of preprinting relative to the quantity of article publishing in journals. It’s also possible to look at the percentage of preprints published in journals relative to all preprints (light pink area divided by the red circle), but here we focus on the first quantity.
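To make the two ratios concrete, here is a minimal sketch in Python. It assumes each article and preprint is identified by a unique ID and that matches are available as (article ID, preprint ID) pairs. The function names and the toy IDs are ours, for illustration only, and are not taken from any of the data sources discussed here.

```python
def preprinting_rate(article_ids, matches):
    """Share of journal articles that have at least one matched preprint
    (light pink area divided by the light grey circle)."""
    articles = set(article_ids)
    if not articles:
        raise ValueError("need at least one journal article")
    # Only count matches that point at an article in the set under study.
    matched_articles = {article for article, _ in matches if article in articles}
    return len(matched_articles) / len(articles)


def published_share(preprint_ids, matches):
    """Share of preprints later published in a journal
    (light pink area divided by the red circle)."""
    preprints = set(preprint_ids)
    if not preprints:
        raise ValueError("need at least one preprint")
    matched_preprints = {preprint for _, preprint in matches if preprint in preprints}
    return len(matched_preprints) / len(preprints)


# Toy data: five articles, four preprints, two matches.
articles = ["a1", "a2", "a3", "a4", "a5"]
preprints = ["p1", "p2", "p3", "p4"]
matches = [("a2", "p1"), ("a5", "p3")]

print(f"Rate of preprinting: {preprinting_rate(articles, matches):.1%}")
print(f"Preprints published: {published_share(preprints, matches):.1%}")
```

The same set of matches gives two different percentages because the denominators differ, which is why it matters to state which ratio is being reported.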

A Venn diagram showing the relationship between journal articles (depicted with a gray circle) and preprints (depicted with a red circle). The overlapping area represents preprints later published in a journal (depicted in pink).

Several platforms provide data on preprint:journal matches, but it turns out that these different data sources provide diverging estimates of rates of preprinting. The question then arises as to which platform gives the most accurate estimates. If the algorithm that the platform uses is too liberal, false positive matches will cause the estimated rate of preprinting to be higher than it should be. On the other hand, if the algorithm is too conservative, matches will be missed, causing the estimated rate to be too low.

Comparing Different Sources of Preprint Match Data

Here, we compare the estimated rate of preprinting (i.e., the percentage of journal articles that have an associated preprint) based on matches derived from four open sources: Crossref, Europe PMC, Open Science Indicators (OSI) from PLOS, and Rzayeva, Pinfield, and Waltman’s (2025) open data. We then compare the accuracy of matches in these sources.

Crossref, a non-profit membership organization that stewards the scholarly metadata record, has provided a detailed explanation of the strategy it has developed to make preprint:journal matches. In addition to matches declared by preprint servers and journals, its algorithm uses titles and author lists to make matches. Matches identified by the algorithm are provided in periodic data dumps, but Crossref has expressed an intention to eventually include these matches in the full database. The most recent data dump was published in April 2025, but our values here stem from an earlier deposit (November 2023), because that was what Rzayeva et al. (2025) used.

Europe PMC, an open bibliographic database containing preprints from over 30 different life science preprint servers, uses Crossref-provided preprint:journal matches when they are available in the Crossref API. It also augments the Crossref data with its own matches based on “matching titles and first author surnames” (Levchenko et al., 2024).
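To illustrate what matching on titles and first author surnames can look like, here is a minimal sketch in Python. It is not Europe PMC's or Crossref's actual algorithm: the normalization steps, the record fields (`title`, `first_author_surname`), and the example records are all our own assumptions.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    letters_only = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(letters_only.split())


def is_match(preprint, article):
    """Strict rule: normalized title AND first author surname must both agree."""
    same_title = normalize(preprint["title"]) == normalize(article["title"])
    same_author = normalize(preprint["first_author_surname"]) == normalize(
        article["first_author_surname"]
    )
    return same_title and same_author


# Hypothetical records, invented for this example.
preprint = {"title": "Gene Expression in Zebrafish: A Survey",
            "first_author_surname": "Müller"}
article = {"title": "Gene expression in zebrafish - a survey",
           "first_author_surname": "Muller"}
retitled = {"title": "A survey of zebrafish gene expression",
            "first_author_surname": "Muller"}

print(is_match(preprint, article))   # differences in case, accents, punctuation are ignored
print(is_match(preprint, retitled))  # a reworded title is not matched by this strict rule
```

A rule this strict will rarely produce false positives, but it misses any paper whose title changed between preprint and publication. Loosening the rule (for example, by accepting similar rather than identical titles) trades misses for false positives.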

For the last few years, the non-profit publisher PLOS has maintained a longitudinal dataset of Open Science Indicators (OSI). The dataset, developed in collaboration with DataSeer, provides a quarterly snapshot from 2018 to 2024 of the rate of uptake of several open science practices – including preprinting – in both PLOS and comparator journals.

The figure and the table below compare the estimates provided by these three sources alongside data from Rzayeva et al. (2025), who combined matches from OpenAlex, Dimensions, and Crossref to yield a comprehensive set of preprint matches.

One complication of comparing these different data sources is that their coverage of journals differs, making it hard to compare preprinting rates obtained from the different sources. To address this issue, we focus here on articles published in two journals (PLOS One and PLOS Biology) that are present in all of the sources. This strategy makes the total number of articles the same for all sources. The only difference is whether a given source catalogs a given matched preprint.

Europe PMC indexes 90,061 PLOS One and PLOS Biology articles spread across 2018-2023 that also appear in the OSI corpus. The figure and table below show the percentage of these articles that have a matched preprint according to the different sources.

Line graph with time on the x-axis and percentage on the y-axis. The title is percent of PLOS One and PLOS Biology articles with preprints over time. There are four lines showing the time trend for Europe PMC, Crossref, Rzayeva et al., and OSI.

Estimates of the rate of preprinting vary widely across the different sources, even when restricting our focus to a common set of articles. For instance, Europe PMC estimates the rate of preprinting in 2022 for PLOS Biology and PLOS One to be 9.8%, whereas Crossref estimates it to be 13.9%, Rzayeva et al. estimate it to be 14.5%, and PLOS’s own OSI data estimates it to be 18.4%.

PLOS One and PLOS Biology	2018	2019	2020	2021	2022	2023
Europe PMC	3.3%	9.1%	11.8%	11.6%	9.8%	7.7%
Crossref	3.7%	10.0%	13.4%	14.3%	13.9%	9.1%
Rzayeva et al. (2025)	4.6%	10.9%	14.6%	15.6%	14.5%	11.8%
OSI (v9)	4.6%	11.0%	15.6%	18.2%	18.4%	18.4%
Number of articles	18,086	14,125	14,722	14,547	14,554	14,046
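
As a rough illustration of how far apart the sources are, the sketch below computes the gap in percentage points between the lowest and highest estimate for each year. The percentages are copied from the table above; the variable and function names are ours.

```python
YEARS = [2018, 2019, 2020, 2021, 2022, 2023]

# Percent of PLOS One and PLOS Biology articles with a matched preprint.
RATES = {
    "Europe PMC": [3.3, 9.1, 11.8, 11.6, 9.8, 7.7],
    "Crossref": [3.7, 10.0, 13.4, 14.3, 13.9, 9.1],
    "Rzayeva et al. (2025)": [4.6, 10.9, 14.6, 15.6, 14.5, 11.8],
    "OSI (v9)": [4.6, 11.0, 15.6, 18.2, 18.4, 18.4],
}


def spread_by_year(rates, years):
    """Gap between the highest and lowest estimate, in percentage points."""
    spreads = {}
    for i, year in enumerate(years):
        values = [series[i] for series in rates.values()]
        spreads[year] = round(max(values) - min(values), 1)
    return spreads


for year, gap in spread_by_year(RATES, YEARS).items():
    print(f"{year}: {gap} percentage points between lowest and highest estimate")
```

The gap is smallest in 2018 and largest in 2023, where the OSI estimate is more than double the Europe PMC estimate.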

Assessing the Accuracy of Matches from Different Sources

One way to unpack the discrepancies reported above is to look at “disputed matches,” that is, papers that are alleged to match according to one data source, but not another. Those matches can then be checked manually to see whether they are errant or not. If false positive matches are identified, this finding implies that the estimated rate of preprinting is too high. If false negative matches are found, the estimated rate of preprinting is too low.
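Finding disputed matches amounts to taking set differences between two sources and then drawing a random sample for manual checking. Here is a minimal sketch in Python, assuming each source can be reduced to a set of (article ID, preprint ID) pairs. The function names, the fixed random seed, and the toy data are our own choices for illustration.

```python
import random


def disputed_matches(source_a, source_b):
    """Return (pairs only in A, pairs only in B)."""
    a, b = set(source_a), set(source_b)
    return a - b, b - a


def sample_for_review(pairs, k, seed=0):
    """Draw up to k disputed pairs for manual checking, reproducibly."""
    pool = sorted(pairs)  # sort first so the same seed gives the same sample
    return random.Random(seed).sample(pool, min(k, len(pool)))


# Toy data, invented for this example.
source_a = {("art1", "pre1"), ("art2", "pre2"), ("art3", "pre3")}
source_b = {("art1", "pre1"), ("art4", "pre4")}

only_a, only_b = disputed_matches(source_a, source_b)
print("Only in A:", sorted(only_a))
print("Only in B:", sorted(only_b))
print("To check by hand:", sample_for_review(only_a, k=2))
```

Each sampled pair is then inspected by a person. A pair that turns out to be wrong is a false positive for the source that reported it, and a pair that turns out to be right is a miss for the source that did not.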

We examined a randomly selected sample of 20 disputed matches that appear in the OSI data but not in Rzayeva et al.’s data. Of these matches, 17 were false positives (85%), suggesting that OSI’s matching algorithm is too liberal. Of the remaining three matches, two were correctly linked to a preprint, meaning they were missed by Rzayeva et al. The final article was matched to a conference presentation (though the match looked correct).

We also examined a random sample of 25 disputed matches detected by Rzayeva et al., but not present in Europe PMC. We found a high rate of misses, with 21 of 25 matches being correctly identified in Rzayeva et al. but missed by Europe PMC (84%), suggesting that Europe PMC’s algorithm is too conservative. The remaining 4 of 25 matches in Rzayeva et al. were false positives (16%).
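Samples this small leave a lot of uncertainty around the observed proportions. One common way to express that uncertainty is a Wilson score interval, sketched below in Python. This calculation is our illustration and was not part of the original analysis. It assumes simple random sampling and a 95% confidence level (z = 1.96).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


samples = {
    "False positives in OSI-only matches": (17, 20),
    "Europe PMC misses among Rzayeva-only matches": (21, 25),
}
for label, (k, n) in samples.items():
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.0%}, interval {low:.0%} to {high:.0%}")
```

Under these assumptions, both intervals run from roughly 65% to roughly 95%. The point estimates are imprecise, but even the lower bounds indicate that most disputed matches in each sample fall on the same side.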

Limitations and Caveats

We chose to focus here on PLOS One and PLOS Biology to more easily compare our data sources, but it is unlikely our results are restricted to these journals. We suspect that Europe PMC’s algorithm is too conservative in most contexts and likewise that the OSI algorithm is too liberal.

It’s worth noting that each data source brings some unique matches to the collection. Of the total 90,061 articles considered here, 85.3% have no indicated matches in Europe PMC, OSI, or Rzayeva et al. (which subsumes Crossref). Another 8.3% have a match in all three of these sources. The remaining 6.3% have a match in only one or two of the sources. Of course, all purported matches can be either false or true positives. It seems likely that matches on which the sources agree are more likely to be true positives than false positives, but we did not evaluate this claim.
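A breakdown like this one can be produced by counting, for each article, how many sources report a match. Below is a minimal sketch in Python, assuming each source is reduced to the set of article IDs it flags as having a preprint. The function name and the toy data are ours, and the toy numbers are unrelated to the real percentages.

```python
from collections import Counter


def agreement_breakdown(all_articles, sources):
    """Share of articles flagged by 0, 1, ..., len(sources) sources."""
    articles = set(all_articles)
    if not articles:
        raise ValueError("need at least one article")
    flagged = [set(ids) for ids in sources.values()]
    votes = Counter(sum(article in ids for ids in flagged) for article in articles)
    return {k: votes[k] / len(articles) for k in range(len(flagged) + 1)}


# Toy data: ten articles, three sources.
articles = [f"a{i}" for i in range(1, 11)]
sources = {
    "Europe PMC": {"a1"},
    "OSI": {"a1", "a2", "a3"},
    "Rzayeva et al.": {"a1", "a2"},
}

for k, share in agreement_breakdown(articles, sources).items():
    print(f"Matched in {k} source(s): {share:.0%}")
```

Articles flagged by only one or two sources are the disputed cases, and they are the natural place to focus manual checking.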

The small samples here mean that the margin of error of these estimates is quite large. Nonetheless, the apparently high rates of false positives (for OSI) and misses (for Europe PMC) both warrant follow-up. ASAPbio has been in conversation with Crossref, Europe PMC, and PLOS to encourage improvements to matching algorithms. Each of these organizations has been very receptive to the outreach, and they plan to adjust their approaches in the coming months to provide more accurate matches. PLOS is already adjusting its approach in the upcoming version (V10) of the OSI data, which is expected to be released soon.

Reflections on How to Best Assess the Rate of Preprinting

Advocates and the researcher community alike need accurate information about matches between papers published in journals and preprints. For advocates like ASAPbio, having accurate data helps us understand whether our advocacy efforts are improving uptake. For researchers, being able to review different versions of a paper has benefits for understanding the story behind a piece of research.

Returning to the question of how high the rate of life science preprinting is and whether it is growing over time, the answer depends on which data source is consulted. The statistics reported above show a higher rate of journal articles with matched preprints in 2022 than in 2018 and 2019, but most sources show somewhat lower levels in 2023. For Crossref and Rzayeva et al., this apparent decrease might be due to limitations in the 2023 data (see discussion in Rzayeva et al.). Europe PMC and OSI disagree on whether a decrease is occurring, but given the potential for false positives and misses in those datasets, more investigation is needed to arrive at more accurate estimates.

Additionally, the rate of articles with associated preprints is somewhat separate from the question of the total number of preprints (red circle above), or from the ratio of preprints to articles. Preprints may not always be published as journal articles for many reasons, one of which is that preprints are dynamic and reflect the natural evolution of research projects over time. This is a desirable feature of preprints, rather than a downside. Considering different statistics (raw counts, percentage of journal articles, percentage of preprints) provides different views on the overarching question of how much life science research is preprinted. Taking all of these factors into account, our current best guess is that somewhere around 13-14% of life science journal articles in 2024 are preprinted – a number we’re working to increase substantially.
