{"id":5434,"date":"2025-12-02T15:02:49","date_gmt":"2025-12-02T15:02:49","guid":{"rendered":"https:\/\/asapbio.org\/?p=5434"},"modified":"2025-12-02T15:02:51","modified_gmt":"2025-12-02T15:02:51","slug":"assessing-data-sharing-under-preprints-value-and-contribution-of-data-repositories","status":"publish","type":"post","link":"https:\/\/asapbio.org\/assessing-data-sharing-under-preprints-value-and-contribution-of-data-repositories\/","title":{"rendered":"Assessing Data Sharing under Preprints: Value and Contribution of Data Repositories"},"content":{"rendered":"\n<p>Written by Madeline Josephine Morrisson<sup>1,2<\/sup> &amp; Rachel Mtama<sup>3,4<\/sup><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, NH, USA.&nbsp;<\/li>\n\n\n\n<li>Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA.<\/li>\n\n\n\n<li>Ifakara Health Institute, Bagamoyo, Tanzania.<\/li>\n\n\n\n<li>Tanzania Human Genetic Organization-Communications Team, Dar es Salaam, Tanzania.<\/li>\n<\/ol>\n\n\n\n<h1 class=\"wp-block-heading\">Introduction<\/h1>\n\n\n\n<p>Open science involves sharing work early and openly. However, there is more to a manuscript than simply the words on the page. The underlying data is important to share as well, to build trust in research and further accelerate scholarly communication and advances in knowledge production. Yet, anecdotally, many researchers do not share their data when they post a preprint, choosing to hold it back until journal publication.<\/p>\n\n\n\n<p>Here, we survey the websites of four preprint servers and seven data repositories focused on the life sciences. 
The preprint servers (bioRxiv, medRxiv, Open Science Framework (OSF), and ResearchSquare) host manuscripts. The repositories house a variety of data types, from gene expression data for humans and non-human species (GEO, SRA, EGA) to protein structures (PDB) and other research outputs (Zenodo, figshare, Dryad).<\/p>\n\n\n\n<p>Our purpose was to determine what policies, if any, repositories have around hosting data that has not yet been published in a traditional academic journal, as well as what policies preprint servers have around inclusion of data.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Preprint server policies<\/h2>\n\n\n\n<p>We surveyed the websites of bioRxiv, medRxiv, OSF, and ResearchSquare to determine whether policies exist for deposition of data generated in preprints.&nbsp;<\/p>\n\n\n\n<p>Both <a href=\"https:\/\/www.biorxiv.org\/about\/FAQ\" target=\"_blank\" rel=\"noopener\">bioRxiv<\/a> and <a href=\"https:\/\/www.medrxiv.org\/about\/FAQ\" target=\"_blank\" rel=\"noopener\">medRxiv<\/a> encourage data availability statements. Additionally, both note that if data for a manuscript is hosted by a member of the International Nucleotide Sequence Database Consortium, the sequences will be released once the manuscript is posted. While supplemental materials can be included on both of these servers, large additional files need to be deposited in the appropriate database. medRxiv has additional, stricter policies around patient data. Pictures of individuals cannot be included in manuscripts, and patient data needs to be properly deidentified. 
Recently, bioRxiv has <a href=\"https:\/\/openrxiv.org\/dryad-integration\/\" target=\"_blank\" rel=\"noopener\">partnered<\/a> with Dryad to streamline data sharing.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.researchsquare.com\/legal\/editorial\" target=\"_blank\" rel=\"noopener\">ResearchSquare<\/a> highly recommends that authors make data generated within a manuscript publicly available and include a data availability statement.&nbsp;<\/p>\n\n\n\n<p>In addition to acting as a preprint server, OSF can also serve as a data repository. OSF is a platform that enables collaboration between scientists, whether private or public, by aggregating data and other work. The deposition of data related to preprints posted on OSF-affiliated servers is recommended, but not required. As of August 25, 2025, OSF has suspended submissions to its <a href=\"https:\/\/www.cos.io\/blog\/suspension-of-submissions-to-generalist-preprint-server-for-review-and-next-steps-community-servers-hosted-by-osf-preprints-remain-active\" target=\"_blank\" rel=\"noopener\">generalist preprint server<\/a>; however, the 14 community-run preprint servers that are affiliated with OSF are unaffected by this change.&nbsp;<\/p>\n\n\n\n<p>While the preprint servers surveyed mention the importance of data deposition, none seem to have clear guidelines that require it.&nbsp;<\/p>\n\n\n\n<p>We recommend that preprint servers adopt clear, easy-to-find guidelines around data deposition. Strict rules around data deposition could discourage some scientists from preprinting, so a balance needs to be struck. 
Greater education about preprints and dispelling of myths could also support this, as could a connection to the FAIR principles.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data repositories &#8211; what are your options?<\/h2>\n\n\n\n<p>Data sharing has become a norm in science communication and dissemination, both as a means to comply with journals&#8217; data sharing policies and with grant and project requirements, and simply to make data available to other scientists, which in turn enhances the credibility of the research.<\/p>\n\n\n\n<p>It is vital to understand the type of data your research has generated and in which repository this data can be stored. Data should be submitted to discipline-specific repositories; where a discipline-specific repository is unavailable, general data repositories like Zenodo and figshare can be used.&nbsp;<\/p>\n\n\n\n<p>When it comes to preprint publishing, many posted preprints do not have their data deposited in repositories, and where data has been deposited, the link or accession number is often not shared along with the preprint. We looked into repositories&#8217; data sharing policies to see how they support early data sharing before peer review, and whether they accept data associated with preprints.&nbsp;<\/p>\n\n\n\n<p>While most of the repositories allow submission of data and figures at any stage of the manuscript cycle, including before the peer review process, an exception was observed with the Protein Data Bank (PDB) repository. PDB only admits data from work that has undergone or is undergoing peer review, and only published data will be made public.<\/p>\n\n\n\n<p>Data repositories have varying policies regarding data release, access, and discoverability. Some repositories require explicit actions from authors to make data public, while others may have embargo periods or access restrictions. 
In most cases, an accession number or links for reviewers are provided for data verification during submission to peer-reviewed journals. GEO, for instance, will make data public immediately once the accession number is made public in any form, i.e., in a preprint or a published peer-reviewed article. For the SRA, on the other hand, data submitters need to determine if their data is suitable for public distribution or requires controlled access.<\/p>\n\n\n\n<p>Data repositories should follow the FAIR data-sharing guidelines, as doing so enhances the value and impact of data, promotes transparency, and supports the long-term sustainability of scientific knowledge. It&#8217;s therefore in the best interest of authors and researchers to deposit their data in repositories that adhere to these principles, ensuring accessibility, visibility, and reusability of their work.<\/p>\n\n\n\n<p>The FAIR principles entail the following: <strong>\u201cFindable\u201d<\/strong> means metadata should be issued a persistent and unique identifier, making it easy to locate. <strong>\u201cAccessible\u201d<\/strong> means metadata should be retrievable using standardized protocols, with clear usage conditions. <strong>\u201cInteroperable\u201d<\/strong> means data should be in standardized formats, vocabularies (including broadly applicable language), and structures that allow data to be integrated across different systems and disciplines. Finally, the <strong>\u201cReusable\u201d<\/strong> principle emphasizes that data should be clearly licensed, thoroughly documented, and aligned with domain-relevant community standards, ensuring that it can be confidently and effectively reused by others. 
In the era of making the research process more open and accessible, the FAIR principles of data sharing help create a more open, efficient, and collaborative research ecosystem.&nbsp;<\/p>\n\n\n\n<div class=\"post-module module__post-table \" id=\"\">\n\t\n\t\t<div class=\"table-row table-row--column-count-5 table-row--row-count-8\">\n\t\t\t<table class=\"table-row__table\">\n\t<tr class=\"table-row__table-row\"><th class=\"table-row__table-cell \" colspan=\"1\"><p>Repository<\/p>\n<\/th><th class=\"table-row__table-cell \" colspan=\"1\"><p>Policy on data state<\/p>\n<\/th><th class=\"table-row__table-cell \" colspan=\"1\"><p>Metadata availability<\/p>\n<\/th><th class=\"table-row__table-cell \" colspan=\"1\"><p>FAIR principles adherence<\/p>\n<\/th><th class=\"table-row__table-cell \" colspan=\"1\"><p>Upload restrictions<\/p>\n<\/th><\/tr><tr class=\"table-row__table-row\"><td class=\"table-row__table-cell \" colspan=\"1\"><p>Zenodo<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Fully accepts any stage of research, including preprint. Any status of research data is accepted, from any stage of the research lifecycle.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Metadata openly available under CC0 license<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Committed to FAIR principles; supports easy citation (DOI), open APIs, and interoperability.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><\/td><\/tr><tr class=\"table-row__table-row\"><td class=\"table-row__table-cell \" colspan=\"1\"><p>GEO<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Supports submission of data prior to peer review. You can submit data privately and share a reviewer access link. 
However, once a GEO accession is cited anywhere (including in a preprint), GEO staff will automatically make it public.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Metadata is extensive, but accessing it can be challenging; processed tabular metadata is available via APIs.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Designed with FAIR principles in mind to promote data reuse.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Focused on gene expression and genomic hybridization datasets and related metadata.<\/p>\n<\/td><\/tr><tr class=\"table-row__table-row\"><td class=\"table-row__table-cell \" colspan=\"1\"><p>SRA (Sequence Read Archive)<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Accepts pre-publication data; it\u2019s common to upload your sequencing data before paper acceptance using embargo settings, or to keep it private with access links for reviewers.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Very specific metadata requirements (e.g., sequencing information); all of this appears to be available. 
They will only restrict access when there is a confidentiality issue.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>FAIR doesn\u2019t appear to be explicitly mentioned, but the repository does seem to follow it.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>\u201cData submitters need to determine if their data is suitable for public distribution or if it needs controlled access.\u201d<\/p>\n<\/td><\/tr><tr class=\"table-row__table-row\"><td class=\"table-row__table-cell \" colspan=\"1\"><p>European Genome-phenome Archive (EGA)<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Accepts raw data and provides an accession number to be used in manuscript submissions.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Metadata is public through a REST API, with unique accession IDs for studies, samples, datasets, etc.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><\/td><td class=\"table-row__table-cell \" colspan=\"1\"><\/td><\/tr><tr class=\"table-row__table-row\"><td class=\"table-row__table-cell \" colspan=\"1\"><p>Protein Data Bank (PDB)<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Typically requires structural data to be associated with a peer-reviewed publication or imminent acceptance.<br \/>\nThere\u2019s no explicit allowance for data from preprints; generally, PDB expects accompanying peer-reviewed or accepted manuscripts.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Raw data deposition is strongly encouraged.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Point 1 of their mission is adherence to the FAIR principles.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Specific data types; may discuss \u201cproblem structures\u201d with authors before release; specific information about the protein, amino acid sequence, etc.<\/p>\n<\/td><\/tr><tr class=\"table-row__table-row\"><td class=\"table-row__table-cell \" 
colspan=\"1\"><p>Dryad<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Generally focused on data accompanying peer-reviewed publications; not explicitly known for accepting preprint-only data.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Ensures metadata quality, stores and includes<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Strong support for FAIR, explicitly mentioned in their best practices.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Curators for submitted data; any field and any format welcome; restrictions concern subject privacy and not including requested metadata.<\/p>\n<\/td><\/tr><tr class=\"table-row__table-row\"><td class=\"table-row__table-cell \" colspan=\"1\"><p>figshare<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>No searchable policy about preprints. Mostly used for sharing data and figures at any stage, but preprint handling is unclear.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Encourages metadata sharing where appropriate (i.e., human data).<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Explicitly mentions FAIR.<\/p>\n<\/td><td class=\"table-row__table-cell \" colspan=\"1\"><p>Restrictions around size and privacy.<\/p>\n<\/td><\/tr>\n\t\t\t<\/table>\n\t\t<\/div>\n\t<\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Call To Action<\/h2>\n\n\n\n<p>Through a survey of preprint servers and data repositories, we found general support for deposition of data and preprinting of manuscripts. Most data repositories mention the FAIR principles on their websites as well.&nbsp;<\/p>\n\n\n\n<p>However, clear guidance is lacking as to when data generated during research activity should be deposited. We recommend that both preprint servers and data repositories adopt clear, easy-to-find guidelines that align with the FAIR principles. 
This will benefit all researchers and ensure reusability of data.&nbsp;<\/p>\n\n\n\n<p>And as always, we encourage greater education on the importance and validity of preprinting.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Resources<\/h2>\n\n\n\n<p>A larger table with additional data about the repositories is available on Zenodo (<a href=\"https:\/\/doi.org\/10.5281\/zenodo.17791518\" target=\"_blank\" rel=\"noopener\">https:\/\/doi.org\/10.5281\/zenodo.17791518<\/a>).<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Written by Madeline Josephine Morrisson1,2 &amp; Rachel Mtama,3,4 Introduction Open science involves sharing work early and openly. However, there is more to a manuscript than simply the words on the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":5457,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[42],"tags":[70],"class_list":["post-5434","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-guest-posts","tag-fellows"],"acf":[],"_links":{"self":[{"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/posts\/5434","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/comments?post=5434"}],"version-history":[{"count":5,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/posts\/5434\/revisions"}],"predecessor-version":[{"id":5468,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/posts\/5434\/revisions\/5468"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/media\/5457"}],"wp:attachment":[{"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/media?parent=5434"}],"wp:t
erm":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/categories?post=5434"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/asapbio.org\/wp-json\/wp\/v2\/tags?post=5434"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}