Katharine Y. Chen1, Maria Toro-Moreno1,†, Arvind Rasi Subramaniam1,†
1 Basic Sciences Division and Computational Biology Section of the Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, USA
† Corresponding authors: M.T.M.: mtoromor@fredhutch.org, A.R.S.: rasi@fredhutch.org
Abstract
Laboratory research is a complex, collaborative process that involves several stages, including hypothesis formulation, experimental design, data generation and analysis, and manuscript writing. Although reproducibility and data sharing are increasingly prioritized at the publication stage, integrating these principles at earlier stages of laboratory research has been hampered by the lack of broadly applicable solutions. Here, we propose that the workflow used in modern software development offers a robust framework for enhancing reproducibility and collaboration in laboratory research. In particular, we show that GitHub, a platform widely used for collaborative software projects, can be effectively adapted to organize and document all aspects of a research project’s lifecycle in a molecular biology laboratory. We outline a three-step approach for incorporating the GitHub ecosystem into laboratory research workflows: (1) designing and organizing experiments using issues and project boards, (2) documenting experiments and data analyses with a version control system, and (3) ensuring reproducible software environments for data analyses and writing tasks with containerized packages. The versatility, scalability, and affordability of this approach make it suitable for various scenarios, ranging from small research groups to large, cross-institutional collaborations. Adopting this framework from a project’s outset can increase the efficiency and fidelity of knowledge transfer within and across research laboratories. An example GitHub repository based on the above approach is available at https://github.com/rasilab/github_demo.
Introduction
Scientific progress is contingent on the ability to reproduce and build upon previous findings. To promote reproducibility of published studies, journals and funding agencies increasingly require researchers to make their data and analysis methods available in public repositories [1–3]. In parallel, data-intensive fields such as machine learning, computational biology, and ecology have developed workflows and tools that facilitate computational reproducibility [4–7]. Reproducibility is also a critical element of collaborative research, where multiple researchers need to share and build upon each other’s work. Indeed, many reproducibility standards and tools have been developed in the context of large-scale, cross-institutional collaborations [8–11].
Compared to computational research, reproducibility and collaboration are less emphasized during the course of laboratory research, especially in the context of small research groups. This is perhaps because projects within a laboratory are often executed by individuals, and are rarely framed as collaborative efforts. However, viewed from a broader perspective, collaboration is an integral and ubiquitous feature of even small laboratory research groups. Group members discuss ideas, exchange protocols, and share reagents and data on a daily basis. Trainees collaborate with group leaders in hypothesis formulation, data interpretation, literature review, and manuscript writing. Most importantly, individuals “collaborate” with their future selves by analyzing and building on their own results, and also with future lab members who extend their work after they have left the lab. Many publicized instances of post-publication irreproducibility arise from the inability of either the same scientist at a later time or a different scientist within the same laboratory to replicate previous results [12,13]. Thus, workflows that improve scientific documentation and collaboration within laboratory groups from the outset of a project can enhance reproducibility of published studies across the wider research community.
While much effort has been devoted to improving reproducibility during data analysis and post-publication stages of research, these efforts do not address earlier critical stages of laboratory research. Even in a ‘wet’ lab where experiments are physically performed, researchers spend significant time and resources on literature review, hypothesis formulation, and experimental design, in addition to the generation, visualization, and interpretation of data. Nevertheless, few tools and workflows exist to document and share these activities in a structured and reproducible manner. Lab notebooks are the most common media used to document laboratory research, but they are typically only used for recording methods and data. An NIH handbook on lab notebooks, for example, explicitly discourages the inclusion of speculative ideas or informal conversations [14], despite the recognition that these are often the source of scientific breakthroughs. Electronic lab notebooks (ELNs), despite their popularity, are stored in proprietary formats, incur a recurrent cost, tend to become defunct over time, and have poor interoperability with each other [15]. Cloud-based tools like Google Docs, Dropbox, and Sharepoint allow sharing of data and documents, but do not provide a structured way to track changes over time or record project-related communication. Email and messaging tools such as Slack and Microsoft Teams facilitate informal discussion of ideas and data, but these are poorly suited for organizing data and discussion in a reproducible manner. Thus, it is increasingly common for data, experiment details, analysis code, and project communication to be fragmented across multiple tools within laboratory research groups (Figure 1).
The process of software development bears several similarities to activities in laboratory research. Developing software involves understanding the state-of-the-art solutions to a problem, formulating a hypothesis for improving existing approaches, breaking down the hypothesis into smaller testable features, writing code to implement features, analyzing the outcome of each feature development, and troubleshooting and fixing errors when necessary. Software development occurs in groups of all sizes ranging from a single developer to thousands of contributors, often distributed across different time zones, working on a common project. The necessity to document and share all stages of software development between contributors has led to the emergence of mature tools and workflows that facilitate reproducibility and collaboration. These include concepts such as issue tracking [16], version control [17], and containerization [18] that have become integral to most software development projects.
Many of the common workflows associated with software development are implemented in GitHub, a popular cloud-based platform used by over 100 million developers worldwide and 90% of Fortune 100 companies [19,20]. GitHub repositories are highly scalable, hosting both small projects built by single developers and large projects with over 2,000 contributors [21]. In the scientific community, GitHub is used to share data analysis workflows after publication [22], develop and share computational tools [23], perform individual recordkeeping [10,24,25], and conduct open science projects and collaborations [11]. However, how the standard workflows and rich features of GitHub (Figure 1) can be adapted to improve reproducibility and collaboration within a traditional laboratory research group has not been explored.
Here, we aim to provide a practical demonstration of GitHub’s use in the context of a laboratory research group centered around molecular biology experiments. Our goal is to show how GitHub provides an intuitive and structured framework to organize and document all aspects of a research project’s lifecycle, including literature review, hypothesis formulation, experiment design, lab work, data analysis, and manuscript and grant writing (Figure 2). Like many academic researchers, we started using GitHub for version control of data analysis scripts. Over the past nine years, we have expanded our use of GitHub to document all aspects of research in our laboratory, and manage collaborations both within our institution and with external collaborators. None of us had any formal training in software development or the use of GitHub. We arrived at our current workflow through trial and error, tutorials on the internet, and by seeking help from more experienced users within and outside our lab. We find that starting a project with this framework enables us to take full advantage of GitHub’s many features, though adopting this workflow at any stage of a project can still be beneficial.
Set up GitHub for laboratory research
GitHub is centered around the concept of repositories, which can be thought of as cloud-based folders that contain all files and documentation related to a specific research project. While a single GitHub repository can be used to record all work by a single user, akin to a traditional lab notebook, organizing repositories based on projects provides better structure for collaborative research (Figure 3). Repositories have a unique name, which is often the project name, and a standardized URL in the format https://github.com/GROUP_NAME/PROJECT_NAME (for example, https://github.com/rasilab/ribosome_collisions_yeast). Each GitHub repository’s content can also be edited on a local computer, allowing users to work offline and synchronize with the cloud when they connect to the internet. To edit repository files locally, we typically use Visual Studio Code, a popular open source editor that works seamlessly with GitHub and has extensive features for writing and data analysis.
Adopting a GitHub-based workflow within a group or a team starts with a designated administrator creating a GitHub organization. Then, each member of the group creates a GitHub account for themselves, and is made a member of the GitHub organization by the administrator. Once the organization and user accounts are set up, the administrator or any group member can create a new repository for each project the group is working on. All work and communication related to each project will be recorded within the corresponding repository. A practically unlimited number of repositories can be created within a GitHub organization, and access to each repository can be controlled by the administrator. All functionalities that we describe in the following sections are currently available as part of the free GitHub plan. Research groups in educational and non-profit institutions also get free access to the GitHub Team plan, which can be useful for accessing more advanced features.
Use issues to organize and collaborate
Experiments are the fundamental units of laboratory research projects. Yet few standards exist to conduct the design, execution, documentation, and interpretation of experiments in a structured, reproducible, and collaborative manner. Even what constitutes a single experiment is often unclear from perusing lab notebooks. Notebooks are typically chronological records of all activities by a single researcher, making it difficult to isolate individual experiments. Further, the fragmented tools for laboratory research are reflected in the fragmented nature of the experiments themselves. Design and execution steps are recorded in lab notebooks, physical samples are stored in freezers or shelves, data is kept in centralized servers, analysis scripts are backed up in personal computers, and ideas, hypotheses, and interpretation are discussed in person or over electronic messaging apps. Tracking the full trajectory of an experiment across email discussions, physical samples, data, and analysis scripts becomes a challenge, particularly for future lab members who may need to build upon the work after the lead researcher leaves the project.
The GitHub ‘issues’ feature provides an intuitive and flexible interface to organize and collaborate on all aspects of a laboratory experiment. In software projects, issues were originally used to track bugs or problems (hence the name ‘issue’), but their utility has expanded to new feature proposals, maintenance tasks, and general discussion topics [16]. Similarly, in laboratory research, we use issues not just for troubleshooting, but for organizing and discussing all aspects of research, from hypothesis formulation and experiment design to data interpretation and manuscript writing. Each issue is limited to a single topic, which focuses the ensuing discussion and resolution. In GitHub, issues are given a unique number and a URL (https://github.com/GROUP_NAME/PROJECT_NAME/issues/ISSUE_NUMBER) that provide a centralized location to track all work and discussion related to that issue. Each issue has a description field and optional commenting fields, which can be used to write, attach files, and paste images.
In our research group, each experiment begins with the creation of a new issue in the corresponding project repository by any of the project members (Figure 4). The issue description is used to describe the rationale and background of the experiment and the strategy for performing the experiment. Project members can discuss aspects of experimental design, provide clarification in the comments section, and update the issue description as needed. Once an experiment is started, the comments section is used to discuss troubleshooting steps, intermediate data and figures, and interpretation of results. The issue number provides a convenient way to reference the experiment across physical samples, work logs, computer file names, and discussions in other issues. For instance, we include the issue number as a prefix on the labels of sample tubes along with suffixes denoting the sample type or condition, which enables succinct and unambiguous tracking of samples by all lab members. Once an experiment is completed, the issue description is updated with key conclusions, tables, and figures, and the issue is ‘closed’. Even if the experiment proposed in an issue is paused or ultimately not pursued, there is a record of the decision-making process and the issue can be re-opened by a project member at any time. Finally, we use issues not just for experiments, but also for discussing broader ideas for projects, reviewing specific literature topics, and for collaborating on grant proposals and manuscripts. For complex experiments with multi-stage design and analysis steps, we create separate issues to document the design and analysis steps.
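As an illustration of how an issue number can tie samples back to their discussion thread, the following is a minimal Python sketch. The label layout (hyphen-separated issue number, sample type, and replicate) and the function names are assumptions made for this example, not a description of the exact labels used in our group; only the issue URL pattern comes from the text above.

```python
def sample_label(issue_number: int, sample_type: str, replicate: int) -> str:
    """Build a tube label with the issue number as prefix.

    The hyphen-separated layout is a hypothetical convention.
    """
    if issue_number < 1 or replicate < 1:
        raise ValueError("issue_number and replicate must be positive")
    return f"{issue_number}-{sample_type}-{replicate}"


def issue_url(group: str, project: str, issue_number: int) -> str:
    """Return the URL of the issue that documents the experiment."""
    return f"https://github.com/{group}/{project}/issues/{issue_number}"


# Example: labels for three replicates of one condition in issue 42
labels = [sample_label(42, "wt", rep) for rep in (1, 2, 3)]
print(labels)
print(issue_url("GROUP_NAME", "PROJECT_NAME", 42))
```

Because the prefix is the issue number, anyone who finds a tube in the freezer can reconstruct the URL of the corresponding discussion without consulting the person who prepared it.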
GitHub provides a number of features to organize and prioritize issues within a project and across projects (Figure 4). One or more ‘assignees’ can be associated with each issue to ensure that they receive notifications about any work or discussion related to the issue, and to track responsibilities. Color-coded ‘labels’ can be used to distinguish between different issue types such as ‘experiment’, ‘data analysis’, ‘literature review’, or ‘project idea’, or to indicate the state of the issue such as ‘todo’, ‘ongoing’, ‘paused’, ‘completed’, or ‘abandoned’. Issues can be grouped together into ‘milestones’ to track progress towards a specific goal or deadline. For example, we use milestones to group issues that need to be completed for a figure in a manuscript or a grant. ‘Issue templates’ can also be created to standardize the format of common issue types across contributors. For example, an ‘Experiment’ issue template can include prompts to include relevant background, strategy, and conclusion in the description section and pre-populate the issue type label and common assignees such as the lead researcher on the project. Each time a group member creates a new issue, they have the option to use one of these issue templates, which is especially useful for new or junior group members. GitHub ‘project boards’ provide a higher-level visual interface to organize and prioritize issues across projects and repositories. For example, a group member can create a project board for themselves and add their own fields such as due date or priority to each issue. During project meetings, the project board can be used to quickly understand each group member’s priorities and deadlines, helping ensure that all collaborators are on the same page.
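To make the idea of an ‘Experiment’ issue template concrete, the sketch below renders a Markdown template with the front matter keys (`name`, `about`, `labels`, `assignees`) that GitHub reads from files placed in a repository's `.github/ISSUE_TEMPLATE/` folder. The section headings, label, and assignee shown here are placeholders chosen for illustration.

```python
def render_issue_template(name, about, labels, assignees, sections):
    """Return the text of a Markdown issue template with YAML front matter."""
    lines = [
        "---",
        f"name: {name}",
        f"about: {about}",
        f"labels: {', '.join(labels)}",        # pre-populated issue labels
        f"assignees: {', '.join(assignees)}",  # pre-populated assignees
        "---",
    ]
    for section in sections:
        # One heading per prompt that the issue author should fill in
        lines.extend(["", f"## {section}"])
    return "\n".join(lines) + "\n"


template = render_issue_template(
    name="Experiment",
    about="Plan and document a single experiment",
    labels=["experiment"],
    assignees=["LEAD_RESEARCHER"],  # placeholder GitHub username
    sections=["Background", "Strategy", "Conclusion"],
)
print(template)
```

Saving the printed text as, for example, `.github/ISSUE_TEMPLATE/experiment.md` would offer it as a choice whenever a new issue is opened in that repository.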
Resource | URL |
---|---|
GitHub | https://docs.github.com/get-started |
Markdown | https://www.markdownguide.org/getting-started |
Visual Studio Code | https://code.visualstudio.com/docs |
Docker | https://docs.docker.com |
Git | https://swcarpentry.github.io/git-novice/ |
Pandoc | https://pandoc.org/MANUAL.html |
Semantic versioning | https://semver.org |
In summary, issues, a widely used feature in software development, also provide an intuitive structure to organize and collaborate across every stage of laboratory-based projects from hypothesis formulation to manuscript writing. An issue-based workflow enables group members to access and contribute to all project-related documentation and communication, regardless of the stage in which they join the project. This can be particularly useful for new lab members, who can quickly get up to speed on a project by reading through the issue descriptions and comments. Closed issues can be reopened if needed, which can be useful for revisiting old experiments or ideas. Thus, by providing a centralized location for tracking information relevant to a specific experiment, analysis, or idea, GitHub issues facilitate reproducibility and knowledge transfer during all stages of laboratory research projects.
Use Git to store and track your work
When a researcher performs a specific experiment or data analysis task, they record the execution steps and results along the way. This record is often maintained in physical or electronic lab notebooks for wet lab experiments and as code files in computer folders for data analyses. Once a set of experiments or analyses is completed, the researcher may write a manuscript or grant proposal that summarizes the results and conclusions. Such manuscripts or grant proposals are typically written using software such as Microsoft Word or Google Docs, often in a collaborative manner with multiple authors. Since these steps can extend over several months or even years, it is frequently challenging to maintain an organized and chronological record of the contributions made by each group member, and to track the changes made to multiple files in different folders. It is common to have many copies of the same file with cryptic names like ‘manuscript_v3.docx’ or ‘annotations_ARS.xls’ to indicate their provenance. While electronic lab notebooks and ‘track changes’ features in word processing software can help maintain a record of changes to single files, these tools are neither designed to work across files of various types, nor to easily identify each author of overlapping changes.
Git is a version control system that records the history of file additions and modifications in a folder, and is used by over 90% of programmers worldwide to track changes to their code [20]. Git allows multiple copies of the folder to be asynchronously edited across computers, and GitHub repositories are essentially remote copies of a local folder tracked using Git. Anyone with access to a GitHub repository can download a local copy of the folder (‘clone’ in Git terminology), add or edit files in the folder, choose which files they want to track (‘stage’), create a snapshot of the changes (‘commit’), and synchronize with the GitHub repository (‘push’). These Git features are tightly integrated into popular text editors like Visual Studio Code, which allows users to make, stage, commit, and push changes to a GitHub repository without leaving the text editor. Each commit is accompanied by a ‘commit message’, which is a short description of what changes were made since the previous commit. Importantly, the commit history of a project serves as an audit trail, recording who did what, and when.
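The clone, stage, commit, and push steps described above correspond to four standard Git commands. The sketch below assembles them as argument lists in Python and, by default, only prints them; the file path and commit message are hypothetical, and the helper names are ours for this example.

```python
import subprocess


def clone_command(repo_url):
    """Command that downloads a local copy of a GitHub repository."""
    return ["git", "clone", repo_url]


def git_commands(files, message):
    """Commands that stage files, snapshot them, and sync with GitHub."""
    return [
        ["git", "add", "--", *files],      # 'stage' the chosen files
        ["git", "commit", "-m", message],  # 'commit' with a short message
        ["git", "push"],                   # 'push' to the GitHub repository
    ]


def run_all(commands, repo_dir=".", dry_run=True):
    """Print the commands, or execute them inside repo_dir if dry_run is False."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, cwd=repo_dir, check=True)


commands = git_commands(
    ["notebooks/issue12_plasmid_cloning.md"],  # hypothetical file path
    "Add cloning notes for issue #12",         # hypothetical commit message
)
run_all(commands)  # dry run: prints the three commands without running Git
```

Calling `run_all(commands, repo_dir, dry_run=False)` inside a cloned repository would execute the same steps that editors such as Visual Studio Code perform through their graphical interface.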
In our research group, we store all files relevant to a project within a single folder on our local computers. We use Git to track changes in that folder, and synchronize it with a cloud-based GitHub repository. We write documents in plain text with lightweight Markdown syntax as often as possible. Markdown enables focusing on content over formatting, enables all changes to be tracked by Git, and can be easily converted to other formats (PDF, DOCX, HTML) using open source software like Pandoc. Within each repository, we use standardized subfolder names for lab notebook entries, code, data, manuscripts, grants, and presentations (Figure 5). Within each of these subfolders, each project contributor creates a separate folder to record their work, even though every group member can contribute to all files in the repository. Lab notebook entries corresponding to distinct GitHub issues are stored in separate files. We record all work pertinent to an issue in lab notebook files, similar to traditional lab notebook entries. Each lab notebook file includes the corresponding issue number in its name and a link to the issue in its contents to enable easy cross-referencing. All group members can access each other’s lab notebooks across different project repositories and participate in discussion and troubleshooting steps by commenting on the corresponding issue.
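The naming convention described above (one notebook file per issue, stored in a per-contributor folder, with the issue number in the file name and a link to the issue in the contents) could be scripted roughly as follows. The subfolder name `notebooks`, the `issue<N>_<title>.md` pattern, and the contributor name are assumptions for illustration; the actual names shown in Figure 5 may differ.

```python
import tempfile
from pathlib import Path


def create_notebook_entry(root, group, project, contributor, issue_number, slug):
    """Create a Markdown notebook file for one issue and return its path."""
    folder = Path(root) / "notebooks" / contributor  # hypothetical layout
    folder.mkdir(parents=True, exist_ok=True)
    entry = folder / f"issue{issue_number}_{slug}.md"
    url = f"https://github.com/{group}/{project}/issues/{issue_number}"
    # Link back to the issue so the file and the discussion cross-reference
    entry.write_text(
        f"# Issue {issue_number}: {slug}\n\nIssue: <{url}>\n",
        encoding="utf-8",
    )
    return entry


# Demonstrate in a throwaway folder rather than a real repository
with tempfile.TemporaryDirectory() as root:
    entry = create_notebook_entry(
        root, "GROUP_NAME", "PROJECT_NAME", "contributor_a", 12, "plasmid_cloning"
    )
    print(entry.relative_to(root))
    print(entry.read_text(encoding="utf-8"))
```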
We also store data analysis materials within the GitHub project repository, which ensures that experiment logs (lab notebooks) and analysis scripts are tightly linked and easily referenced between documents. Data analysis scripts, summary tables, and visualization figures are stored in an ‘analysis’ subfolder of the parent repository, using the same hierarchical structure and naming convention as for lab notebook entries. Separate folders are created for each issue with subfolders for data, scripts, figures, tables, and sample metadata. While we store small datasets in the GitHub repository as comma- or tab-separated text files, larger datasets are stored either in private Amazon Web Services S3 folders, or in public repositories such as the Sequence Read Archive. We include short scripts to download data from their long-term storage location, which also serves as a record of the location of the data. A comma-delimited sample metadata file is created for every dataset, following tidy principles [26], to facilitate data analyses. Summary figures and tables are linked from the lab notebook page as a record of how the data was analyzed, and are linked from issue comments during data interpretation and troubleshooting discussions.
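A tidy sample metadata file has one row per sample and one column per variable. The sketch below writes such a comma-delimited file with Python's standard `csv` module; the column names and sample values are invented for this example.

```python
import csv
import io


def write_sample_metadata(rows, handle):
    """Write one row per sample and one column per variable as CSV."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical samples from issue 42; column names are illustrative only
samples = [
    {"sample_id": "42-wt-1", "genotype": "wt", "treatment": "none", "replicate": 1},
    {"sample_id": "42-wt-2", "genotype": "wt", "treatment": "none", "replicate": 2},
    {"sample_id": "42-ko-1", "genotype": "ko", "treatment": "none", "replicate": 1},
]

buffer = io.StringIO()
write_sample_metadata(samples, buffer)
print(buffer.getvalue())
```

In practice the handle would be a file opened inside the issue's metadata subfolder, so that analysis scripts can join measurements to sample annotations by `sample_id`.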
In addition to lab notebooks and data analyses, we use the GitHub repository to store manuscripts, grant proposals, and presentations related to the project. We write manuscripts and grant proposals as plain-text Markdown files, with one sentence per line, to enable easy tracking of changes across Git commits. Markdown files can be easily converted to DOCX, TeX, or PDF formats using Pandoc, which also provides a suite of useful features for scientific writing such as citation processing and template-based formatting to meet journal and funding agency requirements. Multiple project contributors often edit manuscript and grant proposal files in parallel, which can be readily combined using the native merge functionality of Git while preserving the full history of contributions. We prepare figures and presentation slides in the widely used text-based scalable vector graphics (SVG) format, using the powerful open-source software Inkscape. SVG files are rendered in GitHub Markdown files and web browsers, and can be processed by most commercial graphics design software. Presentations are written as Markdown files with SVG-based images and speaker notes, which can then be converted to slides in PPTX, PDF, or HTML formats using Pandoc.
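The benefit of writing one sentence per line can be illustrated with a line-based comparison, which is how Git reports changes. The sketch below uses Python's `difflib` as a stand-in for `git diff`; the example sentences are invented.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return only the removed (-) and added (+) lines between two versions."""
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm="", n=0
    )
    # Drop the '---'/'+++' file headers and '@@' hunk markers
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


old = (
    "Ribosomes stall on the reporter.\n"
    "Stalling reduces protein output.\n"
    "We next tested a mutant."
)
new = (
    "Ribosomes stall on the reporter.\n"
    "Stalling reduces protein output two-fold.\n"
    "We next tested a mutant."
)

for line in changed_lines(old, new):
    print(line)
```

Only the edited sentence appears in the comparison. Had the paragraph been stored as a single long line, the whole paragraph would be reported as changed, making it harder to see what a co-author actually modified.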
In summary, the version control functionality of Git and GitHub that is widely used to track changes to software code can also be readily used to track the work performed over the course of a project in a research laboratory, including molecular biology “wet” labs. Git repositories are drop-in replacements for lab notebooks while also providing a tightly integrated structure for data analyses, manuscript writing, grant preparation, and slide presentation tasks. By serving as a centralized location for all project-related materials, they enable contributors to reproduce and build on each other’s work. Crucially, even though we utilize GitHub for syncing repositories, project materials themselves are independent of GitHub or any other platform, which secures their long-term accessibility. Further, by providing a chronological and transparent audit of all project-related contributions, Git repositories incentivize collaboration between group members, and unambiguously indicate individual contributions during manuscript preparation and publication.
Use containers for coding and writing tasks
Reproducibility and collaboration within a group critically depend on the ability of members to run each other’s data analyses and obtain the same results. This allows members to build on each other’s work, troubleshoot issues, and reproduce results for manuscript writing and grant preparation. However, software environments are difficult to replicate, especially when they involve multiple programming languages, packages, and dependencies, as is common for complex data analysis workflows in molecular biology. Typical challenges that group members and future collaborators face when attempting to rerun months- or years-old analysis workflows include deprecated syntax, incompatible package versions, and broken dependencies. Additionally, collaborative writing tasks such as manuscript or grant preparation require a specific set of software tools to produce the final document, which can be difficult to replicate across different operating systems and software versions. The time and effort required to troubleshoot software incompatibility issues can be substantial, and can lead to the abandonment of the task altogether.
The problem of replicating data analysis workflows and software environments is a common challenge in software development as well as in laboratory research. Computational researchers have long recognized this problem and have relied on tools such as the Conda package manager [27], the Snakemake workflow manager [28], and Docker containers to address it [6]. Containers are encapsulated, self-sufficient units that contain all the software needed to run an analysis, and can be shared and run on any computer that supports the container runtime environment. Public container registries, like Docker Hub or Biocontainers [29], provide reproducibly-created containers, which can be used in data analysis workflows without the need to install any software. However, laboratory research groups have been slow to adopt these software reproducibility tools, which are often perceived as too complex or time-consuming to learn and use.
In our research group, we use software containers to perform all data analyses and writing tasks in reproducible software environments. We take advantage of the Packages feature of GitHub to host our containers in a centralized location (https://github.com/orgs/rasilab/packages) that is free to use and publicly accessible. Each container in our group’s GitHub Packages collection is linked to a dedicated GitHub repository to store the plain text recipe, called a Dockerfile, for creating that container (Figure 6). We have created a few general purpose containers with R, Python, and Pandoc software that we routinely use for data analysis and writing tasks in our group. Occasionally, we also create new containers from scratch, or modify an existing container from a public container registry to include a specific software package that is needed for a specialized analysis. We use semantic versioning to tag each container and its associated GitHub repository, which allows us to unambiguously identify its contents and use the same container in our data analysis workflows.
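Semantic versioning labels each release as MAJOR.MINOR.PATCH, which makes container tags comparable and their changes interpretable. A simplified Python sketch of parsing, comparing, and incrementing such tags follows; it deliberately ignores the pre-release and build-metadata suffixes that the full specification allows, and the tags shown are invented.

```python
def parse_version(tag):
    """Split a 'MAJOR.MINOR.PATCH' tag (optionally prefixed 'v') into integers."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def bump(tag, part):
    """Return the next version after incrementing the given part."""
    major, minor, patch = parse_version(tag)
    if part == "major":   # incompatible change to the environment
        return f"{major + 1}.0.0"
    if part == "minor":   # backward-compatible addition, e.g. a new package
        return f"{major}.{minor + 1}.0"
    if part == "patch":   # backward-compatible fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part}")


tags = ["1.9.3", "1.10.0", "1.2.0"]      # hypothetical container tags
latest = max(tags, key=parse_version)    # numeric, not alphabetical, order
print(latest, bump(latest, "minor"))
```

Comparing the parsed integer tuples rather than the raw strings matters: as text, "1.9.3" sorts after "1.10.0", whereas numerically 1.10.0 is the newer release.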
Our group uses containers in several ways for interactive data analyses, writing tasks, and complex bioinformatic workflows (Figure 6). Containers in our group’s GitHub Packages can also be used by external collaborators and readers of our published manuscripts to reproduce data analyses. For interactive analyses on a local computer with the Docker runtime environment, one of the general purpose containers from our group’s GitHub Packages registry can be copied (‘pulled’) to the local computer. Then the user can either access the container through the Remote-Containers extension in the Visual Studio Code editor, or run the container in a terminal window. Running containers can be used to convert Markdown files to other formats using Pandoc, or to run R or Python scripts for data analyses. In shared computing environments such as high-performance computing clusters, containers can be downloaded to a shared location and run using the Apptainer (Singularity) runtime environment. Apptainer containers in remote computing environments can be used in workflow management tools, like Snakemake, to run multi-step computational analyses, or accessed from a personal computer using the Remote-Tunnels extension in the Visual Studio Code editor for interactive analyses.
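For readers unfamiliar with running containers from a terminal, the sketch below assembles the corresponding Docker and Apptainer command lines in Python without executing them. The image name `ghcr.io/OWNER/IMAGE:1.0.0`, the `/workspace` mount point, and the file names are placeholders, not the names of our group's containers.

```python
def docker_run_command(image, project_dir, command):
    """Run a command inside a container with the project folder mounted."""
    return [
        "docker", "run", "--rm",            # remove the container on exit
        "-v", f"{project_dir}:/workspace",  # mount the project folder
        "-w", "/workspace",                 # start in the mounted folder
        image, *command,
    ]


def apptainer_exec_command(image, command):
    """Run the same image on a shared cluster through Apptainer."""
    return ["apptainer", "exec", f"docker://{image}", *command]


image = "ghcr.io/OWNER/IMAGE:1.0.0"  # placeholder image reference
convert = ["pandoc", "manuscript.md", "-o", "manuscript.docx"]

print(" ".join(docker_run_command(image, "/path/to/project", convert)))
print(" ".join(apptainer_exec_command(image, convert)))
```

Pinning the image to an explicit version tag, rather than a moving tag such as `latest`, is what lets the same command reproduce the same software environment months or years later.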
In summary, containers, which are widely used in software development and computational research, facilitate reproducible and collaborative data analysis and writing within laboratory research groups. GitHub Packages provide a centralized location for groups to store their frequently used software environments as Docker containers, and share them within the group and with the external scientific community. Once containers are set up and optimized for a group’s common workflows by an experienced group member or a bioinformatician, they can be used by all group members, far into the future, without change or detailed know-how. In our experience, containers are particularly useful for new members to get up to speed on our group’s data analysis and writing workflows without struggling to replicate the necessary software environments.
Discussion
In this practical guide, we have described our group’s approach for tracking all stages of laboratory research from idea generation and experimental design, to data analysis and manuscript writing. Our approach is motivated by the recognition that established software development practices provide a concrete framework for addressing the reproducibility and collaboration challenges faced by laboratory research groups. We have adopted widely used features from software development workflows, such as issues, version control, and containers, and adapted them to the specific needs of a molecular biology laboratory. We have illustrated our approach using the GitHub platform, but other platforms such as GitLab and Bitbucket offer similar functionalities.
We recognize that adopting the approach outlined here can involve a steep learning curve, especially for laboratory research groups with limited computational experience. However, there are several benefits to using GitHub for laboratory research that we believe outweigh the initial investment of time and effort. First, Git and GitHub are widely used in both academia and industry, and thus the organization and documentation practices we describe are highly transferable skills for trainees. Second, Git and GitHub have comprehensive and user-friendly documentation (Table 1), and a number of tutorials and forums are available online to help new users troubleshoot any issues that arise. Furthermore, these tools are so widely used that virtually any bioinformatician or bioinformatic core at an institution can help new teams set up and troubleshoot their GitHub workflow. Third, the workflow and features described here are highly modular. Therefore, teams can incrementally adopt them, while still deriving benefits to their overall research productivity. Finally, the approach described here costs nothing to implement, and can be used by any research group regardless of their size, funding level, or institutional affiliation.
While this guide covers the core functionalities of Git and GitHub for laboratory research, there are additional features that can further enhance collaboration and productivity. For instance, Git branching allows finer control over collaborative data analysis and writing across large teams, by allowing parallel development of different threads of ideas while retaining the history. GitHub Actions can enable the creation of automated workflows for repetitive tasks, such as updating a lab website and documentation when changes are pushed to a GitHub repository. Cloud-based containers, such as GitHub Codespaces, can enable groups to perform most of their analysis and writing tasks from within a web browser without the need to install any software on their local computer. Wiki and Discussions features in GitHub allow documentation of protocols and open-ended conversations that are outside the scope of specific projects. These features, while beyond the scope of this introductory guide, can be adopted by laboratory research groups as they become more comfortable with Git and GitHub.
The organizational approach described here is tailored to the lifecycle of a conventional laboratory research project from idea generation to manuscript writing. Nevertheless, this workflow offers rich possibilities for a more reproducible and collaborative research enterprise at the institutional and community levels. For instance, GitHub issues can be used after manuscript publication to handle reagent requests and answer protocol-related questions, thus providing a centralized location for community feedback and engagement. Institutions can provide backup and support for GitHub repositories, thereby ensuring that the research record is preserved even if the original research group is no longer active or associated with the institution. With public GitHub repositories, community experts can contribute ideas and feedback during the research process, and their contributions will be visible in the repository history and issue comments. The GitHub repository itself can serve as a living manuscript, with GitHub releases or tags constituting different versions of the manuscript as it evolves over time. Thus, the approach outlined here could potentially accelerate the pace of scientific discovery by enabling faster dissemination of results and fostering more collaboration opportunities.
Author Contributions
K.Y.C., M.T.M., and A.R.S. wrote the manuscript. A.R.S. acquired funding.
Acknowledgements
We thank members of the Subramaniam lab, the Basic Sciences Division, and the Computational Biology Program at Fred Hutch for discussions, and Rechel Geiger, Pravrutha Raman, and Jamie Yelland for feedback on the manuscript. This research was funded by NIH R35 GM119835 (A.R.S.), NSF MCB 1846521 (A.R.S.), NIH R01 AT012826 (A.R.S.), and the Hanna H. Gray Fellowship GT16007 (M.T.M.). The funders had no role in the decision to publish or the preparation of the manuscript.
Competing interests
None
References
1. Hrynaszkiewicz, I., Simons, N., Hussain, A., Grant, R. & Goudie, S. Developing a Research Data Policy Framework for All Journals and Publishers. Data Science Journal 19, 5–5 (2020).
2. Tedersoo, L. et al. Data sharing practices and data availability upon request differ across scientific disciplines. Sci Data 8, 192 (2021).
3. Data Management and Sharing Policy | Data Sharing. https://sharing.nih.gov/data-management-and-sharing-policy.
4. Heil, B. J. et al. Reproducibility standards for machine learning in the life sciences. Nat Methods 18, 1132–1135 (2021).
5. Noble, W. S. A Quick Guide to Organizing Computational Biology Projects. PLOS Computational Biology 5, e1000424 (2009).
6. Grüning, B. et al. Practical Computational Reproducibility in the Life Sciences. Cell Syst 6, 631–635 (2018).
7. Jenkins, G. B. et al. Reproducibility in ecology and evolution: Minimum standards for data and code. Ecol Evol 13, e9961 (2023).
8. Diaba-Nuhoho, P. & Amponsah-Offeh, M. Reproducibility and research integrity: the role of scientists and institutions. BMC Research Notes 14, 451 (2021).
9. Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).
10. Ram, K. Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine 8, 7 (2013).
11. Lowndes, J. S. S. et al. Our path to better science in less time using open data science tools. Nat Ecol Evol 1, 1–7 (2017).
12. Berg, J. Editorial retraction. Science 358, 458–458 (2017).
13. Wosen, J. Genentech review of Tessier-Lavigne paper finds no evidence of fraud — but hints at a different misconduct case. STAT https://www.statnews.com/2023/04/06/genentech-marc-tessier-lavigne-stanford-misconduct-investigation/ (2023).
14. Ryan, P. Keeping a Lab Notebook: Basic Principles and Best Practices. Office of Intramural Training and Education, National Institutes of Health (2010).
15. Higgins, S. G., Nogiwa-Valdez, A. A. & Stevens, M. M. Considerations for implementing electronic laboratory notebooks in an academic research environment. Nat Protoc 17, 179–189 (2022).
16. Johnson, J. N. & Dubois, P. F. Issue Tracking. Computing in Science and Engg. 5, 71–77 (2003).
17. Blischak, J. D., Davenport, E. R. & Wilson, G. A Quick Introduction to Version Control with Git and GitHub. PLOS Computational Biology 12, e1004668 (2016).
18. Moreau, D., Wiebels, K. & Boettiger, C. Containers for computational reproducibility. Nat Rev Methods Primers 3, 1–16 (2023).
19. Key GitHub Statistics in 2024 (Users, Employees, and Trends). Kinsta® https://kinsta.com/blog/github-statistics/ (2023).
20. Stack Overflow Developer Survey 2022. Stack Overflow https://survey.stackoverflow.co/2022.
21. The state of open source software. The State of the Octoverse https://octoverse.github.com/2022/state-of-open-source.
22. Cadwallader, L. & Hrynaszkiewicz, I. A survey of researchers’ code sharing and code reuse practices, and assessment of interactive notebook prototypes. PeerJ 10, e13933 (2022).
23. Perez-Riverol, Y. et al. Ten Simple Rules for Taking Advantage of Git and GitHub. PLoS Comput Biol 12, e1004947 (2016).
24. Stanisic, L., Legrand, A. & Danjean, V. An Effective Git And Org-Mode Based Workflow For Reproducible Research. Operating Systems Review 49, 61 (2015).
25. Chure, G. Be Prospective, Not Retrospective: A Philosophy for Advancing Reproducibility in Modern Biological Research. (2022) doi:10.48550/arXiv.2210.02593.
26. Wickham, H. Tidy Data. J. Stat. Soft. 59, (2014).
27. Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15, 475–476 (2018).
28. Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).
29. Da Veiga Leprevost, F. et al. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics 33, 2580–2582 (2017).