{"id":3333,"date":"2021-07-02T00:00:00","date_gmt":"2021-07-02T00:00:00","guid":{"rendered":"http:\/\/pl-asapbio.local\/addressing-information-overload-in-scholarly-literature\/"},"modified":"2025-03-28T21:36:33","modified_gmt":"2025-03-28T21:36:33","slug":"addressing-information-overload-in-scholarly-literature","status":"publish","type":"post","link":"https:\/\/asapbio.org\/addressing-information-overload-in-scholarly-literature\/","title":{"rendered":"Addressing information overload in scholarly literature"},"content":{"rendered":"<p class=\"has-medium-font-size\"><em>Blog post by <a href=\"https:\/\/orcid.org\/0000-0002-9317-6819\" data-type=\"URL\" data-id=\"https:\/\/orcid.org\/0000-0002-9317-6819\" target=\"_blank\" rel=\"noopener\">Christine Ferguson<\/a> and <a href=\"https:\/\/orcid.org\/0000-0003-1419-2405\" data-type=\"URL\" data-id=\"https:\/\/orcid.org\/0000-0003-1419-2405\" target=\"_blank\" rel=\"noopener\">Martin Fenner<\/a><\/em><\/p>\n<div style=\"height:29px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<p class=\"has-text-align-center\"><em><em>Information overload is the difficulty in understanding an issue and effectively making decisions when one has too much information about that issue, and is generally associated with the excessive quantity of daily information. \u2013 Wikipedia [1]<\/em><\/em><\/p>\n<div style=\"height:29px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<p>Information overload is a common problem, and it is an old problem. It is not a problem of the internet age, and it is not specific to scholarly literature, but the growth of preprints in the last five years presents us with a proximal example of the challenge. <\/p>\n<p>We want to tackle this information overload problem and have some ideas on how to do this \u2013 presented at the end of this post. Are you willing to help? 
This post tells some of the back story of how preprints solve part of the problem \u2013 speedy access to academic information \u2013&nbsp; yet add to the growing information that we need to filter to find results that we can build on. It is written to inspire the problem solvers in our community to step forward and help us to realise some practical solutions.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Using journals to find relevant information<\/strong><\/h3>\n<p>In a classic presentation in 2008 [2], the writer Clay Shirky argued that while information overload might be a problem as old as the 15th century when the printing press was invented by Gutenberg, the rise of the internet for the first time had radically changed how we address this problem. Publishing used to be expensive, complicated and therefore risky, and this was addressed by only publishing content that was selected by the publisher to be \u201cworth publishing\u201d. Scientific publishing worked \u2013 and still works \u2013 in similar ways. One important change occurred with the dramatic growth of scientific publishing after World War II, when filtering by staff editors became unsustainable, and external peer review by academic experts slowly became the norm from the 1960s to the 1990s (e.g. <em>Nature<\/em> in 1973 and <em>The Lancet<\/em> in 1976) [3].<\/p>\n<p>Clay Shirky coined the phrase \u201cIt\u2019s Not Information Overload. It\u2019s Filter Failure\u201d in his 2008 presentation and made the point that publishing in the internet age has become so cheap that publication no longer needs to be the critical filtering step, rather that filtering can happen <em>after<\/em> publication. 
We can see this pattern in many mainstream industries, from movies to online shopping, with organizations such as Netflix and Amazon [4] investing heavily in recommender systems that substantially contribute to their revenues.<\/p>\n<p>Cameron Neylon applied these considerations to scholarly communication and found the scholarly communication community at an early stage in the transition to \u201cpublish first, filter later\u201d [5]. Ten years later, his findings largely still hold true: scholarly discovery services still focus mostly on publications that have gone through a \u201cfilter\u201d step by a scholarly publisher.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Preprints: an alternative to the \u2018journal as a filter\u2019<\/strong><\/h3>\n<p>Preprints are the most visible implementation of the \u201cpublish first, filter later\u201d approach. In some disciplines, including high-energy physics, astrophysics, mathematics, and computer science, preprints have increasingly become the norm over the last 25 years, and currently the majority of high-energy physics papers are first published as preprints on the arXiv. In the life sciences, the preprint server E-Biomed [6] was proposed by NIH director Harold Varmus in 1999, but the project was killed after a few months, not least because of strong and vocal opposition by biomedical publishers and societies. Instead, PubMed Central launched in 2000 to host open access journal publications rather than preprints [7]<strong>. 
<\/strong>After a delay of more than 15 years, preprints in the life sciences finally took off, and although they have grown considerably in number in the last five years, preprints still only represent a small fraction (6.4% in <strong>Figure 1<\/strong>) of all publications in biology:<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><a href=\"https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-01-at-10.03.25-2.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-01-at-10.03.25-2-1024x664.png\" alt=\"Yearly preprints\/all-papers in Microsoft Academic Graph, trend by domain, reproduced from Xie B, Shen Z, and Wang K 2021 [8]\" class=\"wp-image-6383\" width=\"633\" height=\"411\" srcset=\"https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-01-at-10.03.25-2-1024x664.png 1024w, https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-01-at-10.03.25-2-300x195.png 300w, https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-01-at-10.03.25-2-768x498.png 768w, https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-01-at-10.03.25-2.png 1182w\" sizes=\"auto, (max-width: 633px) 100vw, 633px\"><\/a><figcaption>Figure 1. Yearly preprints\/all-papers in Microsoft Academic Graph, trend by domain, reproduced from Xie B, Shen Z, and Wang K 2021 [8]<\/figcaption><\/figure>\n<\/div>\n<p>The notion of \u201cpublish first, filter later\u201d is now being promoted by a range of publishers who no longer penalise authors for publicising their submissions as preprints, but rather encourage submitting authors to post their manuscripts as preprints whilst these are being put through peer review by the journal. 
Some publishers are even more wedded to preprints as the publication of the future [9,10].<\/p>\n<p>Coming back to the original problem, preprints now add to the journal articles that researchers are tasked with filtering. That information overload poses a problem was recognised in a <a href=\"https:\/\/asapbio.org\/biopreprints2020-survey-initial-results;\">survey of stakeholders<\/a> (such as librarians, journalists, publishers, funders, research administrators, students, clinicians, and more) conducted last year by ASAPbio. The problem is exacerbated by the number of servers hosting relevant preprints \u2013&nbsp;<a href=\"https:\/\/asapbio.org\/preprint-servers\">ASAPbio\u2019s preprint platform directory<\/a> lists 56 preprint servers that host potentially relevant material.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Filtering for relevant content (to get around the information overload issue)<\/strong><\/h3>\n<p>While preprints in general, and specifically in the life sciences, are lowering the cost of and delays to sharing information, filtering for relevant content is still at a relatively early stage. To go deeper into the details of how relevant preprints can be discovered, it is important to distinguish between three categories:<\/p>\n<ol class=\"wp-block-list\">\n<li>Discovering relevant preprints at any point in time independent of peer review status<\/li>\n<li>Discovering relevant preprints that have undergone peer review<\/li>\n<li>Discovering relevant preprints immediately (days) after posting<\/li>\n<\/ol>\n<p>The first category covers discovery services that include preprints as part of their content, for example Europe PMC [11] and Meta [12]. Discovery strategies relevant for journal content can also be applied to preprints, e.g. 
search by keyword and\/or author.<\/p>\n<p>The second category focuses on peer-reviewed preprints, and is covered extensively elsewhere [13].&nbsp;<\/p>\n<p>The third category is the focus of this post \u2013 discovery of relevant preprints of interest to a researcher right after their posting, which rules out traditional peer review. The following filter strategies are possible:<\/p>\n<ol class=\"wp-block-list\">\n<li>Filter by subject area, keyword or author name<\/li>\n<li>Filter by personal publication history<\/li>\n<li>Filter by attention immediately after publication: social media (Twitter, Mendeley, etc.) and usage stats<\/li>\n<li>Filter by recommendations, e.g. from subject matter experts<\/li>\n<\/ol>\n<p>These filters can of course also be combined. The particular challenge is that they must work almost immediately (within days) after the preprint has been posted. This assumes a high level of automation, and a focus on immediacy. A combination of filters 1 and 3 works well with this approach: the information required for filter 1 is available in the metadata (e.g. via Crossref) when the content is posted, and attention (filter 3) can be determined immediately after the preprint is posted \u2013 Twitter is widely used for sharing links to bioRxiv\/medRxiv preprints, see examples in <strong>Figure 2<\/strong>. 
The Crossref Event Data service found 15,598 tweets for bioRxiv\/medRxiv preprints in the week starting June 7, 2021.<\/p>\n<div style=\"height:45px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><a href=\"https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-02-at-11.18.40-2.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-02-at-11.18.40-2.png\" alt=\"\" class=\"wp-image-6408\" width=\"545\" height=\"624\" srcset=\"https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-02-at-11.18.40-2.png 881w, https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-02-at-11.18.40-2-262x300.png 262w, https:\/\/asapbio.org\/wp-content\/uploads\/2025\/03\/Screenshot-2021-07-02-at-11.18.40-2-768x880.png 768w\" sizes=\"auto, (max-width: 545px) 100vw, 545px\"><\/a><figcaption>Figure 2. The use of Twitter to bring attention to newly posted preprints<\/figcaption><\/figure>\n<\/div>\n<p>For filter 3, we\u2019ve considered bookmarks of preprints in Mendeley, but these cannot currently be tracked in open APIs such as the Crossref Event Data service. Usage stats are another option, but they are not currently available via API in the first days after publication.&nbsp;<\/p>\n<p>Another consideration is how best to inform researchers of these potentially relevant preprints. Given that cost and speed are the primary concerns, we consider the most appropriate approach to be dissemination of these filtering results via a regular (daily or weekly) RSS feed or newsletter.<\/p>\n<p>In summary, compiling a list of biomedical preprints, filtered by a minimum number of tweets in the days after posting and broken down by subject area, is a good initial strategy for identifying relevant preprints immediately after they have been posted. 
Interested researchers can access the filtered list via a newsletter.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Existing efforts that track discovery of relevant preprints right after their posting<\/strong><\/h3>\n<p>A few examples:<\/p>\n<ul class=\"wp-block-list\">\n<li>https:\/\/twitter.com\/PreprintBot \u2013 new this year, \u201ca bot that tweets preprints and comments from BioRxiv and MedRxiv\u201d [14]<\/li>\n<li>https:\/\/twitter.com\/PromPreprint \u2013 this has been running for a while; \u201cA bot tweeting @biorxivpreprint publications reaching the top 10% Altmetric score within their first month after publication\u201d [15]<\/li>\n<li>http:\/\/arxiv-sanity.com\/toptwtr \u2013 this started as a new way to list all arXiv preprints, but they added social media data at some point [16]<\/li>\n<li>https:\/\/scirate.com \u2013 a free and open access scientific collaboration network that allows users to follow <a href=\"http:\/\/arxiv.org\/\" target=\"_blank\" rel=\"noopener\">arXiv.org<\/a> categories and see the highest ranked new papers [17]<\/li>\n<li>https:\/\/rxivist.org \u2013 a free and open website that enables users to identify preprints from bioRxiv and medRxiv based on download count or mentions on Twitter. One can, for example, pick the most tweeted preprints in the last 7 days, but this presents a list of preprints that may have been posted at any point since the servers began [18].<\/li>\n<\/ul>\n<p>Our strategy for filtering life science preprints builds on these existing efforts but picks up only those preprints posted <em>in the past week<\/em> that have received tweets, and proposes to use a newsletter as the primary communication channel. We propose to run this newsletter as a community experiment, where we iterate on the implementation based on researcher feedback on how helpful the newsletter is in addressing information overload. 
Other considerations: Can we focus more on who is tweeting rather than the number of tweets, or should we add an element of human curation? Can we filter life science preprints from additional servers?&nbsp;<\/p>\n<h2 class=\"wp-block-heading\">Call to action<\/h2>\n<p>If you want to help to tackle the information overload problem in the life sciences then leave a comment below or DM us. If enough folk are interested in working with us, we could form a community group under the auspices of ASAPbio to work on information overload.&nbsp;<\/p>\n<p><em>We thank Rich Abdill (University of Minnesota), Iratxe Puebla and Jessica Polka (both ASAPbio) for providing valuable feedback when writing this blog post.<\/em><\/p>\n<div style=\"height:44px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<p style=\"font-size:18px\"><strong>References<\/strong><\/p>\n<ol class=\"wp-block-list\">\n<li>Information overload. In: Wikipedia; 2021. Accessed May 18, 2021. https:\/\/en.wikipedia.org\/w\/index.php?title=Information_overload&amp;oldid=1023377809<\/li>\n<li>O\u2019Reilly. Web 2.0 Expo NY: Clay Shirky (Shirky.Com) It\u2019s Not Information Overload. It\u2019s Filter Failure. 2008. Accessed May 18, 2021. https:\/\/www.youtube.com\/watch?v=LabqeJEOQyI<\/li>\n<li>A brief history of peer review. F1000 Blogs. Published January 31, 2020. Accessed May 18, 2021. https:\/\/blog.f1000.com\/2020\/01\/31\/a-brief-history-of-peer-review\/<\/li>\n<li>Amazon\u2019s Recommendation Engine: The Secret To Selling More Online. Accessed May 18, 2021. https:\/\/rejoiner.com\/resources\/amazon-recommendations-secret-selling-online\/<\/li>\n<li>Neylon C. It\u2019s not filter failure, it\u2019s a discovery deficit. Serials. 2011;24(1):21-25. doi:10.1629\/2421<\/li>\n<li>Homan JM. E-biomed. Bull Med Libr Assoc. 1999;87(4):485-486. Accessed May 18, 2021. https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC226626\/<\/li>\n<li>Kling R, Spector LB, Fortuna J. 
The real stakes of virtual publishing: the transformation of E-biomed into PubMed Central. J Am Soc Inf Sci Technol. 2004;55(2):127-148. doi:10.1002\/asi.10352<\/li>\n<li>Xie B, Shen Z, Wang K. Is preprint the future of science? A thirty year journey of online preprint services. arXiv:2102.09066 [cs]. Published online February 17, 2021. Accessed May 18, 2021. http:\/\/arxiv.org\/abs\/2102.09066<\/li>\n<li>Eisen MB, Akhmanova A, Behrens TE, Harper DM, Weigel D, Zaidi M. Implementing a \u201cpublish, then review\u201d model of publishing. eLife. 2020;9:e64910. doi:10.7554\/eLife.64910<\/li>\n<li>About Wellcome Open Research | How It Works | Beyond A Research Journal. Accessed June 27, 2021. https:\/\/wellcomeopenresearch.org\/about#section-box-4<\/li>\n<li>Preprints \u2013 About \u2013 Europe PMC. Accessed May 18, 2021. https:\/\/europepmc.org\/Preprints#preprint-servers<\/li>\n<li>Meta | Expand Your Research. Accessed June 27, 2021. https:\/\/www.meta.org\/<\/li>\n<li>Polka J, Strasser C, Taraborelli D. Shared Technology Needs for Preprints. Zenodo; 2021. doi:10.5281\/ZENODO.4700570<\/li>\n<li>Preprint Bot (@PreprintBot) \/ Twitter. Accessed June 30, 2021. https:\/\/twitter.com\/PreprintBot<\/li>\n<li>PromisingPreprints (@PromPreprint) \/ Twitter. Accessed June 30, 2021. https:\/\/twitter.com\/PromPreprint<\/li>\n<li>ArXiv Sanity Preserver. Accessed June 30, 2021. http:\/\/arxiv-sanity.com\/<\/li>\n<li>Top arXiv papers. SciRate. Accessed June 30, 2021. https:\/\/scirate.com\/<\/li>\n<li>Meta-Research: Tracking the popularity and outcomes of all bioRxiv preprints | eLife. Accessed June 30, 2021. 
https:\/\/elifesciences.org\/articles\/45133<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Blog post by Christine Ferguson and Martin Fenner Information overload is the difficulty in understanding an issue and effectively making decisions when one has too much information about that issue, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":1974,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[47,42,45,44],"tags":[],"class_list":["post-3333","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-feedbackasap","category-guest-posts","category-preprint-review","category-preprints"],"acf":[],"_links":{"self":[{"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/posts\/3333","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/comments?post=3333"}],"version-history":[{"count":1,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/posts\/3333\/revisions"}],"predecessor-version":[{"id":3334,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/posts\/3333\/revisions\/3334"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/media\/1974"}],"wp:attachment":[{"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/media?parent=3333"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/categories?post=3333"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/tags?post=3333"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}