By Katie Corker (ASAPbio) and Ludo Waltman (Centre for Science and Technology Studies, Leiden University)
Each year, more research papers are published as preprints. However, this growth is occurring at a time when the number of journal articles is also increasing rapidly. For advocates of preprinting like us, it is important to understand the growth of preprints in the context of this broader growth of journal articles.
One approach to measuring the rate of preprinting is to look at the percentage of journal articles that have an associated, or matched, preprint. In the diagram below, this value is represented by the light pink area divided by the light grey circle. This approach normalizes the quantity of preprinting relative to the quantity of article publishing in journals. It’s also possible to look at the percentage of preprints published in journals relative to all preprints (light pink area divided by the red circle), but here we focus on the first quantity.
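As a minimal sketch of the first quantity, the rate can be computed as the number of journal articles with a matched preprint divided by the total number of journal articles. The function name and the identifiers below are hypothetical and used only for illustration.

```python
def preprinting_rate(journal_articles, articles_with_preprint):
    """Share of journal articles that have a matched preprint."""
    journal = set(journal_articles)
    if not journal:
        raise ValueError("need at least one journal article")
    # Count only matches that belong to the journal article set (the overlap).
    matched = journal & set(articles_with_preprint)
    return len(matched) / len(journal)


# Hypothetical article identifiers, for illustration only.
articles = {"a1", "a2", "a3", "a4", "a5"}
with_preprint = {"a2", "a5", "x9"}  # "x9" is not in the article set and is ignored

rate = preprinting_rate(articles, with_preprint)
print(f"{rate:.1%}")  # prints "40.0%"
```

Dividing the same overlap by the number of preprints instead would give the second quantity, the percentage of preprints that were later published in journals.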
Several platforms provide data on preprint:journal matches, but it turns out that these different data sources provide diverging estimates of rates of preprinting. The question then arises as to which platform gives the most accurate estimates. If the algorithm that the platform uses is too liberal, false positive matches will cause the estimated rate of preprinting to be higher than it should be. On the other hand, if the algorithm is too conservative, matches will be missed, causing the estimated rate to be too low.
Comparing Different Sources of Preprint Match Data
Here, we compare the estimated rate of preprinting (i.e., the percentage of journal articles that have an associated preprint) based on matches derived from four open sources: Crossref, Europe PMC, Open Science Indicators (OSI) from PLOS, and Rzayeva, Pinfield, and Waltman’s (2025) open data. We then compare the accuracy of matches in these sources.
Crossref, a non-profit membership organization that stewards the scholarly metadata record, has provided a detailed explanation of the strategy it has developed to make preprint:journal matches. In addition to matches declared by preprint servers and journals, its algorithm uses titles and author lists to make matches. Matches identified by the algorithm are provided in periodic data dumps, but Crossref has expressed an intention to eventually include these matches in the full database. The most recent data dump was published in April 2025, but our values here stem from an earlier deposit (November 2023), because that is the one Rzayeva et al. (2025) used.
Europe PMC, an open bibliographic database containing preprints from over 30 different life science preprint servers, uses Crossref-provided preprint:journal matches when they are available in the Crossref API. It also augments the Crossref data with its own matches based on “matching titles and first author surnames” (Levchenko et al., 2024).
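To illustrate what matching on titles and first author surnames can look like, here is a rough sketch. It is not Europe PMC's actual algorithm: the normalization rules, the exact-match requirement, and all record fields and names below are assumptions made for the example.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def match_key(record):
    # Assumed record layout: {"id": ..., "title": ..., "surname": ...}
    return (normalize(record["title"]), normalize(record["surname"]))


def find_matches(preprints, articles):
    """Return (preprint_id, article_id) pairs with equal normalized keys."""
    index = {}
    for preprint in preprints:
        index.setdefault(match_key(preprint), []).append(preprint["id"])
    pairs = []
    for article in articles:
        for preprint_id in index.get(match_key(article), []):
            pairs.append((preprint_id, article["id"]))
    return pairs


# Hypothetical records, for illustration only.
preprints = [{"id": "p1", "title": "Effects of X on Y: a cohort study", "surname": "Müller"}]
articles = [
    {"id": "a1", "title": "Effects of X on Y – A Cohort Study", "surname": "Muller"},
    {"id": "a2", "title": "How X shapes Y in a large cohort", "surname": "Muller"},
]

pairs = find_matches(preprints, articles)
print(pairs)  # prints [('p1', 'a1')]
```

In this sketch, a retitled article such as "a2" would never be matched, even if it were a later version of the same preprint. That is one way a strict rule of this kind could miss true matches.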
For the last few years, the non-profit publisher PLOS has maintained a longitudinal dataset of Open Science Indicators (OSI). The dataset, developed in collaboration with DataSeer, provides a quarterly snapshot from 2018 to 2024 of the rate of uptake of several open science practices – including preprinting – in both PLOS and comparator journals.
The figure and the table below compare the estimates provided by these three sources alongside data from Rzayeva et al. (2025), who combined matches from OpenAlex, Dimensions, and Crossref to yield a comprehensive set of preprint matches.
One complication of comparing these different data sources is that their coverage of journals differs, making it hard to compare preprinting rates obtained from the different sources. To address this issue, we focus here on articles published in two journals (PLOS One and PLOS Biology) that are present in all of the sources. This strategy makes the total number of articles the same for all sources. The only difference is whether a given source catalogs a given matched preprint.
Europe PMC indexes 90,061 PLOS One and PLOS Biology articles spread across 2018-2023 that also appear in the OSI corpus. The figure and table below show the percent of these articles that have a matched preprint according to the different sources.
Estimates of the rate of preprinting vary widely across the different sources, even when restricting our focus to a common set of articles. For instance, Europe PMC estimates the rate of preprinting in 2022 for PLOS Biology and PLOS One to be 9.8%, whereas Crossref estimates it to be 13.9%, Rzayeva et al. estimate it to be 14.5%, and PLOS’s own OSI data estimates it to be 18.4%.
| PLOS One and PLOS Biology | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 |
| --- | --- | --- | --- | --- | --- | --- |
| Europe PMC | 3.3% | 9.1% | 11.8% | 11.6% | 9.8% | 7.7% |
| Crossref | 3.7% | 10.0% | 13.4% | 14.3% | 13.9% | 9.1% |
| Rzayeva et al. (2025) | 4.6% | 10.9% | 14.6% | 15.6% | 14.5% | 11.8% |
| OSI (v9) | 4.6% | 11.0% | 15.6% | 18.2% | 18.4% | 18.4% |
| Number of articles | 18,086 | 14,125 | 14,722 | 14,547 | 14,554 | 14,046 |
Assessing the Accuracy of Matches from Different Sources
One way to unpack the discrepancies reported above is to look at “disputed matches,” that is, papers that are alleged to match according to one data source, but not another. Those matches can then be checked manually to see whether they are errant or not. If false positive matches are identified, this finding implies that the estimated rate of preprinting is too high. If false negative matches are found, the estimated rate of preprinting is too low.
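The idea of a disputed match can be sketched as a set difference between two sources, followed by a random draw for manual checking. The function names and the match pairs below are hypothetical; this is an illustration of the approach, not the code we used.

```python
import random


def disputed_matches(source_a, source_b):
    """Matches (article_id, preprint_id) asserted by source_a but absent from source_b."""
    return sorted(set(source_a) - set(source_b))


def sample_for_review(disputed, k, seed=0):
    """Draw up to k disputed matches at random for manual checking."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(disputed, min(k, len(disputed)))


# Hypothetical match sets, for illustration only.
source_one = {("a1", "p1"), ("a2", "p2"), ("a3", "p3")}
source_two = {("a1", "p1"), ("a4", "p4")}

disputed = disputed_matches(source_one, source_two)
print(disputed)  # prints [('a2', 'p2'), ('a3', 'p3')]

to_check = sample_for_review(disputed, k=1)
```

Swapping the two arguments gives the disputed matches in the other direction, so each pair of sources yields two sets to inspect.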
We examined a randomly selected sample of 20 disputed matches that appear in the OSI data but not in Rzayeva et al.’s data. Of these matches, 17 were false positives (85%), suggesting that OSI’s matching algorithm is too liberal. Of the remaining three matches, two were correctly linked to a preprint, meaning they were missed by Rzayeva et al. The final article was matched to a conference presentation (though the match looked correct).
We also examined a random sample of 25 disputed matches detected by Rzayeva et al., but not present in Europe PMC. We found a high rate of misses, with 21 of 25 matches being correctly identified in Rzayeva et al. but missed by Europe PMC (84%), suggesting that Europe PMC’s algorithm is too conservative. The remaining 4 of 25 matches in Rzayeva et al. were false positives (16%).
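Because these samples are small, the percentages above carry wide uncertainty. One way to gauge that uncertainty is a Wilson score interval for a proportion. The sketch below is our illustration only: the choice of the Wilson method and the 95% level (z = 1.96) are assumptions, not part of the original analysis.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate confidence interval for a proportion (Wilson score method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts reported above: 17 of 20 false positives and 21 of 25 misses.
osi_low, osi_high = wilson_interval(17, 20)
epmc_low, epmc_high = wilson_interval(21, 25)

print(f"False positives among sampled OSI-only matches: {osi_low:.0%} to {osi_high:.0%}")
print(f"Misses among sampled matches absent from Europe PMC: {epmc_low:.0%} to {epmc_high:.0%}")
```

Under these assumptions, the intervals run from roughly 64% to 95% for the 17 of 20 result and from roughly 65% to 94% for the 21 of 25 result. Both are wide, but both sit well above one half.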
Limitations and Caveats
We chose to focus here on PLOS One and PLOS Biology to more easily compare our data sources, but it is unlikely our results are restricted to these journals. We suspect that Europe PMC’s algorithm is too conservative in most contexts and likewise that the OSI algorithm is too liberal.
It’s worth noting that each data source brings some unique matches to the collection. Of the total 90,061 articles considered here, 85.3% have no indicated matches in Europe PMC, OSI, or Rzayeva et al. (which subsumes Crossref). 8.3% have a match in all three of these sources. The remaining 6.3% are unique to either one or two of the sources. Of course, all purported matches can be either false or true positives. It seems likely that matches that reflect agreement across sources are more likely to be true positives than false positives, but we did not evaluate this claim.
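A breakdown like this can be produced by counting, for each article, how many sources report a match. The sketch below uses made-up identifiers and hypothetical names; it shows the bookkeeping only and does not reproduce the figures above.

```python
from collections import Counter


def agreement_breakdown(all_articles, matches_by_source):
    """Fraction of articles matched by exactly 0, 1, 2, ... sources."""
    counts = Counter()
    for article in all_articles:
        n_sources = sum(article in matched for matched in matches_by_source.values())
        counts[n_sources] += 1
    total = len(all_articles)
    return {n: counts[n] / total for n in range(len(matches_by_source) + 1)}


# Hypothetical data: ten articles and three sources, for illustration only.
all_articles = [f"a{i}" for i in range(1, 11)]
matches_by_source = {
    "source_one": {"a1"},
    "source_two": {"a1", "a2", "a3"},
    "source_three": {"a1", "a2"},
}

breakdown = agreement_breakdown(all_articles, matches_by_source)
print(breakdown)  # prints {0: 0.7, 1: 0.1, 2: 0.1, 3: 0.1}
```

Here key 0 corresponds to articles with no indicated match, key 3 to articles matched in all three sources, and keys 1 and 2 together to matches found in only one or two sources.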
The small samples here mean that the margin of error of these estimates is quite large. Nonetheless, the apparently high rates of false positives (for OSI) and misses (for Europe PMC) are both cause for follow-up. ASAPbio has been in conversation with Crossref, Europe PMC, and PLOS to encourage improvements to matching algorithms. Each of these organizations has been very receptive to the outreach, and they plan to adjust their approaches in the coming months to provide more accurate matches. PLOS is already adjusting its approach in the upcoming version (V10) of the OSI data, which is expected to be released soon.
Reflections on How to Best Assess the Rate of Preprinting
Advocates and the researcher community alike need accurate information about matches between papers published in journals and preprints. For advocates like ASAPbio, having accurate data helps us understand whether our advocacy efforts are improving uptake. For researchers, being able to review different versions of a paper has benefits for understanding the story behind a piece of research.
Returning to the question of how high the rate of life science preprinting is and whether it is growing over time, the answer depends on what question is asked. The statistics reported above show a higher rate of journal articles with matched preprints in 2022 than in 2018 and 2019, but most sources show somewhat lower levels in 2023. For Crossref and Rzayeva et al., this apparent decrease might be due to limitations in the 2023 data (see discussion in Rzayeva et al.). Europe PMC and OSI disagree on whether or not a decrease is occurring, but given the potential for false positives and misses in those datasets, more investigation is needed to calculate more accurate estimates.
Additionally, the rate of articles with associated preprints is somewhat separate from the question of the total number of preprints (red circle above), or from the ratio of preprints to articles. Preprints may not always be published as journal articles for many reasons, one of which is that preprints are dynamic and reflect the natural evolution of research projects over time. This is a desirable feature of preprints, rather than a downside. Considering different statistics (raw counts, percentage of journal articles, percentage of preprints) provides different views on the overarching question of how much life science research is preprinted. Taking all of these factors into account, our current best guess is that somewhere around 13-14% of life science journal articles in 2024 are preprinted – a number we’re working to increase substantially.