Linking Data Management and Sharing with Reproducibility

Lisa Federer
Researchers today are generating more data than ever before – in genomics alone, researchers are expected to generate up to 40 exabytes (or 40 billion gigabytes) of data by 2025. If they ever hope to make any sense out of this mountain of data, researchers will need to start thinking about how to manage their data and share them with others who can help analyze them. In fact, many funders have started to require that researchers think about these issues.

The National Science Foundation (NSF), for example, has required researchers to submit data management plans (DMPs) since 2011.  The National Institutes of Health (NIH) will soon require similar plans, outlining how data will be both managed and shared.  Many private funders, too, have started to require that proposals include DMPs.

For busy researchers, these new requirements can feel like just another frustrating hurdle in an already lengthy process of obtaining grant funding. Most funders evaluate the DMP as part of proposal review, and not sharing data as required may have consequences for future funding, so these requirements aren’t something researchers can afford to ignore. However, there are also plenty of good reasons for asking researchers to think about how they will manage and share their data. Librarians can play an important role in helping researchers understand how and why they should be managing and sharing their data.  In addition, librarians can help take some of the frustration out of data management and sharing by providing researchers with the training and tools they will need to implement best practices.

Notable among the reasons to think about management and sharing is that good data practices can help enhance scientific reproducibility.  Irreproducible research is widely recognized as a significant problem in a variety of scientific fields – estimates suggest that up to 90% of published research findings cannot be reproduced, making it difficult to have confidence in the validity of these findings. While there are no easy fixes to the reproducibility crisis, data management and sharing are important ways of helping researchers improve reproducibility.  

Most of us have unfortunately had the experience of losing our data when a hard drive crashes or a computer is damaged.  Losing our favorite family pictures or our important documents is frustrating, but when researchers’ data are lost, the consequences for reproducibility can be serious.  In a study that aimed to track down research data that supported published articles, the investigators were only able to obtain 20% of the datasets they requested.  Among authors who responded to the investigators’ request, nearly 80% reported that the data from their 20-year-old papers no longer existed.

When data are no longer available, reproducibility of the original research is impossible.  Thus, it’s important for researchers to make careful plans about how to properly store and manage their data over an appropriate period of time.  Many libraries offer training and support for researchers who need to write a DMP.  For librarians who are themselves new to writing DMPs, the DMPTool can be helpful, with its interactive DMP-writing features and its examples of DMPs for many different funders.

Sharing, too, can play an important role in enhancing reproducibility.  In fact, reproducibility, by definition, means getting the same results as the original researchers, using the same data (unlike replicability, in which similar methods are used, but new data are collected). Therefore, reproducibility hinges on researchers’ ability and willingness to share data. In some research communities, data sharing is already accepted as part of the scientific culture (for example, in genomics). In others, work still needs to be done to incentivize data sharing and encourage researchers to move toward a culture of sharing. Librarians can help by providing training for researchers on how to prepare and share their data.

A variety of methods exist for sharing research data, including thousands of repositories, many of which will accept data for free.  Sometimes researchers can deposit their data in a subject-specific repository that accepts a particular type of data, like the Gene Expression Omnibus (GEO), which accepts genomics data, or ImmPort, which accepts immunology data. If no subject-specific repository is appropriate, researchers can use more general repositories, like figshare or Dryad, which will accept almost any type of data. With thousands of repositories available, researchers may be unsure about where they should deposit their data; librarians can assist by familiarizing themselves with repositories that accept data in the disciplines they support so that they can make recommendations to their researchers.

Librarians can play an important role in helping increase reproducibility by offering services that facilitate data management and sharing.  In fact, many libraries have already started providing support in these areas. However, researchers often don’t realize that libraries offer these types of services – or if they do, they may not realize the importance of data management and sharing to reproducibility.  Making the link from data management and sharing to reproducibility more explicit may be a way to encourage researchers to take advantage of these library services.

Of course, data management and sharing aren’t the only issues researchers need to take into consideration to ensure reproducibility – methods documentation, pre-registration of research outcomes, and use of standards are all essential to reproducibility.  Librarians have many opportunities to help researchers enhance the reproducibility of their science, but data management and sharing can be a great place to start!