The archives are half-empty: a field-wide assessment of the availability of microbial community sequencing data

Stephanie D. Jurburg, Maximilian Konzack, Nico Eisenhauer, Anna Heintz-Buschart

Received Date: 13th April 20

The sequencing revolution has resulted in the explosive growth of public genetic repositories. These repositories now hold invaluable collections of 16S rRNA gene amplicon sequences, but the extent to which the currently archived data is findable, accessible, and reusable has not been evaluated. We conducted a field-wide assessment of the availability and state of publicly archived 16S rRNA gene amplicon sequencing data. Using custom-built pattern-based text extraction algorithms, we searched 26,927 publications in 17 microbiology or microbial ecology journals and identified 2,015 studies that performed 16S rRNA gene amplicon sequencing. We found, for example, that the data from 7.2% of these studies had not been made public at the time of analysis, a proportion that increased over time. Of the 635 studies targeting the V3-V4 region of the 16S rRNA gene, 40.3% contained data that was not available or not reusable, and for a further 25.5% of the studies, faults in data formatting or data labelling were likely to create obstacles to data reuse. Taken together, only 34% of these datasets had potentially reusable data. Our study reveals significant gaps in the availability of currently deposited community sequencing data, identifies major contributors to data loss, and offers suggestions for improving data archiving practices in the future.
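The abstract does not describe how the pattern-based text extraction works. As a rough illustration only, and not the authors' algorithm, the following sketch shows one way such a search could look for sequence archive accession numbers in article text. The function name `find_accessions`, the regular expression, and the sample sentence with its accession numbers are all assumptions made up for this example; the pattern covers only a few common INSDC accession styles (BioProject, study, and run identifiers).

```python
import re

# Hypothetical pattern, not taken from the study: matches BioProject IDs
# (PRJNA/PRJEB/PRJDB + digits) and study/run IDs (SRP, ERP, DRP, SRR,
# ERR, DRR followed by six or more digits).
ACCESSION_PATTERN = re.compile(
    r"\b(?:PRJ(?:NA|EB|DB)\d+|[SED]R[PR]\d{6,})\b"
)


def find_accessions(text):
    """Return accession-like strings in order of first appearance, without duplicates."""
    seen = []
    for match in ACCESSION_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Invented sentence with invented accession numbers, for illustration only.
sample = (
    "Raw reads were deposited in the NCBI Sequence Read Archive "
    "under BioProject PRJNA123456 (runs SRR1234567 and SRR1234568)."
)
print(find_accessions(sample))
```

A publication in which no such identifier is found would then be a candidate for manual checking rather than proof that the data is missing, since data may also be deposited in other repositories or cited in other formats.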

This is an abstract of a preprint hosted on an independent third-party site. It has not been peer reviewed but is currently under consideration at Nature Communications.

