Computational notebooks for Open Science




Hugh Shanahan (Royal Holloway University of London) gave a presentation at the DSNE seminar on 10 March 2022 on the use of computational notebooks for Open Science.

In the first part of the talk, Hugh gave an informative introduction to computational notebooks[1] such as those provided by Jupyter[2], R Markdown[3] and MATLAB[4]. A computational notebook[5] is a virtual notebook environment used for literate programming, which consists of cells of documentation, executable code, and code output (Fig. 1)[6]. Hugh argued that the use of computational notebooks has become a hugely successful mechanism for sharing the analysis of data and that this represents an important step forward for reproducibility. At the time of the talk, there were 8 million Jupyter notebooks on GitHub.
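To make the idea of cells concrete, here is a minimal sketch of how a Jupyter notebook is laid out on disk: a JSON document holding a list of cells, where a code cell keeps its source and its stored output together. This is our own illustration rather than something shown in the talk; the cell contents are invented, and the helper `summarise` is a hypothetical name.

```python
import json

# A minimal notebook in the Jupyter on-disk layout (nbformat version 4).
# The cell contents are illustrative only and were not part of the talk.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # Documentation cell: free text written in Markdown.
            "cell_type": "markdown",
            "metadata": {},
            "source": "# Mean temperature\nA short note explaining the analysis.",
        },
        {
            # Code cell: the executable source plus the output saved from its last run.
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": "print(sum([11.5, 12.0, 12.5]) / 3)",
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": "12.0\n"}
            ],
        },
    ],
}


def summarise(nb):
    """Return (cell type, number of stored outputs) for each cell (hypothetical helper)."""
    return [(cell["cell_type"], len(cell.get("outputs", []))) for cell in nb["cells"]]


# Round-trip through JSON, as happens when a notebook file is saved and reopened.
serialised = json.dumps(notebook, indent=1)
summary = summarise(json.loads(serialised))
print(summary)
```

Because documentation, code and output travel together in one file, sharing the file shares the whole analysis, which is part of why notebooks lend themselves to reproducibility.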

Why are computational notebooks so successful? Hugh suggested these factors:

- Reproducibility: the executable code boxes invite users to click them and see what happens

- Understanding: notebooks provide an environment to experiment with each part of the analysis alongside its documentation

- Integration: notebooks provide an easy mechanism to swap datasets and re-run the analysis

Hugh also shared with us the results of a survey his team carried out last year on notebook use cases, which will be published in due course. He also pointed out the limitations of notebooks as software: for instance, they do not work well for writing large software packages. However, they are an excellent tool for prototyping and for writing examples to showcase a piece of software. It was also highlighted that notebooks have become an increasingly integral way to document research workflows. For the third year in a row, EarthCube has issued a call for notebooks as peer-reviewed submissions[7].

Towards the end of the talk, Hugh looked ahead and highlighted some open issues around the long-term preservation of notebooks and making them FAIR[8]. Notebooks are software, and it can be challenging and costly to make sure they can still run decades later. A long-term preservation plan is needed because they contain so much research information. There is also a need to prevent issues such as broken web links, and better use of metadata is needed. Finally, notebooks need to be easier to find. Solutions may include making them easier to find on search engines, including them in journal indices, and building a mechanism for citing notebooks.
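As a rough illustration of what a first preservation check might look like, the sketch below lists the web links found in a notebook's documentation cells (so they can later be tested for breakage) and reports which descriptive metadata fields are absent. It is our own sketch, not a tool presented in the talk: the function name `audit_notebook`, the choice of `title` and `authors` as required fields, and the example URLs are all assumptions made for illustration.

```python
import re

# Simple pattern for http(s) links; stops at whitespace and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def audit_notebook(nb, required_metadata=("title", "authors")):
    """List links in Markdown cells and metadata fields that are missing.

    `required_metadata` is an assumed, illustrative choice of fields,
    not a standard taken from the talk.
    """
    links = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # Notebook files may store cell source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        links.extend(URL_PATTERN.findall(source))
    metadata = nb.get("metadata", {})
    missing = [key for key in required_metadata if key not in metadata]
    return {"links": links, "missing_metadata": missing}


# Invented example notebook (already loaded from JSON into a dict).
example = {
    "metadata": {"title": "Rainfall trends"},
    "cells": [
        {
            "cell_type": "markdown",
            "source": [
                "Data from https://example.org/rainfall.csv\n",
                "See the [docs](https://example.org/docs).",
            ],
        },
        {"cell_type": "code", "source": "print('hello')", "outputs": []},
    ],
}

report = audit_notebook(example)
print(report)
```

The sketch deliberately makes no network requests; actually testing whether each link still resolves, and deciding which metadata fields matter, would be the next steps.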

At DSNE, the use of notebooks has become commonplace in the past few years. Notebooks are particularly useful for exploring environmental data and experimenting with different machine learning methods. Under the “Virtual Lab development” theme of DSNE[9] and other initiatives, the cloud collaborative virtual research environment DataLabs[10] has been developed to help address a number of computing challenges in environmental science. Computational notebooks underpin the DataLabs platform, as scientists can collaborate online via the use of notebooks to integrate and analyse data, apply novel data science methods, and share their output. DataLabs furthers and facilitates the step forward for reproducibility that computational notebooks represent.

We strive to promote this to the environmental science community.

Fig.1: Screenshot of a Jupyter notebook[11].

Fig.2: A DataLabs screenshot of the project notebooks landing page

[1] Or just ‘notebooks’ for short

[2] https://jupyter.org/

[3] https://rmarkdown.rstudio.com/lesson-10.html

[4] https://uk.mathworks.com/products/matlab/live-editor.html

[5] https://en.wikipedia.org/wiki/Notebook_interface

[6] It is important not to confuse a computational notebook with an electronic lab notebook (e.g. LabStep, OneNote), which records the procedures and steps taken in a wet lab or in the field.

[7] https://www.earthcube.org/post/call-for-notebooks-cfn-22

[8] https://en.wikipedia.org/wiki/FAIR_data; Wilkinson et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 3 (1): 160018. https://doi.org/10.1038/SDATA.2016.18

[9] https://www.lancaster.ac.uk/data-science-of-the-natural-environment/blogs/virtual-labs-breaking-down-barriers-in-environmental-data-science

[10] https://eds.ukri.org/tools/datalabs

[11] Modified from https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/



Disclaimer

The opinions expressed by our bloggers and those providing comments are personal, and may not necessarily reflect the opinions of Lancaster University. Responsibility for the accuracy of any of the information contained within blog posts belongs to the blogger.

