Complexity vs Quality: The Bumpy Relation of Scientific Software
Scientific software is used daily in the physical, environmental, earth, and life sciences to make important discoveries. Due to its highly specialized nature, it is frequently developed by scientists with deep domain knowledge but not necessarily deep knowledge of the technologies and tools used by the software engineers and developers who build more mainstream applications. As a result, scientific software tends to be highly customized, less flexible, complex, poorly tested, sparsely documented, and even less maintained in the long run.
Computational science: Error… why scientific programming does not compute (Zeeya Merali, Nature: 2010)
Reproducible Computational Research
Many issues plaguing scientific software have been discussed in the literature, but the ability to reproduce computational discoveries has taken center stage in recent years. The term "reproducible computational research" was coined as an umbrella concept for identifying, and proposing solutions to, issues that affect the reproducibility of computational scientific research.
Scientific Reproducibility through Computational Workflows and Shared Provenance Representations (Yolanda Gil, NSF Workshop: 2010)
Some Proposed Solutions
Although the challenge of reproducible computational research is multi-dimensional, some of the proposed solutions are rooted in existing, well-established, and robust software engineering practices such as:
- Source code management (SCM)
- Computational workflow engines
- Scalable and distributed compute platforms
- Compute and storage hardware virtualization
- Centralized repositories of digital collections of scientific data
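To illustrate how these building blocks combine in practice, the sketch below shows one way a workflow step could record its own provenance by checksumming inputs and outputs, so a later run can verify that it reproduced the same data. This is a minimal illustration, not code from any of the tools above; the function names and the `provenance.json` log format are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_provenance(step_name, inputs, outputs, log_path="provenance.json"):
    """Append a provenance record (step name plus input/output checksums)
    to a JSON log so a later run can check it reproduced the same data."""
    record = {
        "step": step_name,
        "inputs": {p: sha256_of(p) for p in inputs},
        "outputs": {p: sha256_of(p) for p in outputs},
    }
    log = Path(log_path)
    records = json.loads(log.read_text()) if log.exists() else []
    records.append(record)
    log.write_text(json.dumps(records, indent=2))
    return record
```

Real workflow engines and provenance standards capture far richer information (parameters, environments, tool versions), but the core idea is the same: tie each result to a verifiable fingerprint of the data that produced it.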
In addition, the organized and consistent tagging of scientific data with metadata (data about data) has long been a foundation of information retrieval and discovery. Developing consistent metadata and controlled vocabularies is another important component of searching, finding, and using scientific data in a manner consistent with reproducible research.
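A toy sketch of what metadata tagging with a controlled vocabulary can look like: a record is only accepted if its organism term comes from an agreed vocabulary. The field names and vocabulary below are made up for illustration; real projects would use an established standard (e.g. Dublin Core or a domain ontology) rather than this ad hoc set.

```python
# Hypothetical controlled vocabulary of organism terms; a real project
# would draw these from a published ontology, not a hard-coded set.
CONTROLLED_ORGANISMS = {"homo_sapiens", "mus_musculus", "escherichia_coli"}


def tag_dataset(path, organism, assay, created):
    """Build a metadata record for a data file, rejecting organism
    terms that are not in the controlled vocabulary."""
    if organism not in CONTROLLED_ORGANISMS:
        raise ValueError(f"unknown organism term: {organism!r}")
    return {
        "path": path,
        "organism": organism,
        "assay": assay,
        "created": created,
    }
```

The payoff of the vocabulary check is that every record is searchable by the same terms: a query for `homo_sapiens` cannot miss datasets tagged `human` or `H. sapiens`, because those spellings are rejected at tagging time.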
Finally, and perhaps most obviously, reproducible computational research depends on the ability of other scientists and research experts to freely access the source code and scientific data used in generating new computational discoveries. These free and open access principles have long been championed within the software development community under the umbrella of open source. Open-source code is a collaborative effort in which programmers improve the source code and share their changes with the community.
The BioUno open source project seeks to improve scientific application automation, performance, reproducibility, usability, and management by applying and extending software engineering (SE) best practices in the field of scientific research applications. Deliverables from the project have found a variety of applications in life science research (bioinformatics, genetics, drug discovery).
- We explore and apply software engineering best practices in support of the project mission
- We develop extensions to established SE tools, frameworks and technologies that directly support or indirectly enhance scientific applications.
- We develop APIs and integration points that empower scientific applications
- We promote collaboration and reuse through contributing to existing open source projects
- We educate users through blog posts, wiki pages, and presentations on applying SE best practices in scientific applications
- We advocate among software engineers for making SE tools and frameworks usable by scientists
Check out our roadmap for a list of short and long term specific objectives.
BioUno has pioneered the use of continuous integration tools and techniques to create reproducible computational pipelines and to manage computer clusters in support of scientific research applications.
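One concrete check a continuous integration job can perform is to rerun a pipeline step on identical inputs and verify that the output is byte-identical, flagging hidden nondeterminism (unseeded randomness, timestamps, unstable ordering). The helper below is a hedged sketch of that idea, not code from BioUno or Jenkins; `is_reproducible` and its calling convention are assumptions for the example.

```python
import hashlib


def is_reproducible(step, *inputs):
    """Run a pipeline step twice on identical inputs and report whether
    both runs produced byte-identical output, as a CI job might check
    before publishing results. `step` must return bytes."""
    first = hashlib.sha256(step(*inputs)).hexdigest()
    second = hashlib.sha256(step(*inputs)).hexdigest()
    return first == second
```

In a CI setup, a job failing this check would fail the build, so irreproducibility is caught as early as a compile error would be.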
In addition, BioUno has adopted a variety of software engineering best practices to achieve its objectives:
- Revision control (e.g. Git, Subversion, branching strategies)
- Continuous integration (e.g. Jenkins CI, SonarQube, code metrics, reproducible builds)
- Software testing (e.g. Nestor-QA, TestLink, TDD, code coverage)
- Virtualization (e.g. Docker, Vagrant, VirtualBox)
Together, these tools and techniques let BioUno build powerful pipelines and manage computer clusters.
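As a small illustration of the testing practice above, here is what TDD-style unit tests look like for a typical scientific helper function. The `gc_content` function and its tests are invented for this example and are not part of any BioUno deliverable.

```python
import unittest


def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence; a toy example of
    the kind of function scientific code should cover with tests."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in sequence) / len(sequence)


class TestGcContent(unittest.TestCase):
    def test_all_gc(self):
        self.assertEqual(gc_content("ggcc"), 1.0)

    def test_mixed(self):
        self.assertEqual(gc_content("ATGC"), 0.5)

    def test_empty_raises(self):
        with self.assertRaises(ValueError):
            gc_content("")
```

Run under a CI server with a coverage tool, tests like these catch regressions automatically each time the code changes, which is precisely what makes a computational result trustworthy over the life of a project.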
Finally, BioUno strives to minimize open-source proliferation: although the project covers a broad range of technologies and tools, it actively contributes to existing open-source projects rather than starting new ones.