Python and Open Science Technologies

Vaclav (Vashek) Petras

NCSU GeoForAll Lab at the Center for Geospatial Analytics
North Carolina State University

GIS 710: Geospatial Analytics for Grand Challenges
November 18, 2024

https://bit.ly/2oR9owy

Learning Objectives

  • bits and pieces needed for re-running code
  • overview of concrete technologies for computational reproducibility
  • hands-on practice with some of these technologies (Python, Jupyter, Binder)

Code: Python

  • readable syntax, close to ad-hoc pseudo-code
  • scripting, interpreted
  • general purpose
  • open source implementations
  • alternatives for data science:
    • R
    • Julia
    • Matlab (proprietary), GNU Octave (open source)
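
A minimal sketch of the "close to pseudo-code" point. The elevation values and the function name are made up for illustration and do not come from the course material.

```python
# Hypothetical example data: elevations in meters (values made up for illustration).
elevations = [112.4, 98.7, 130.2, 101.5]


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Keep only the points that lie above the mean elevation.
average = mean(elevations)
above_average = [value for value in elevations if value > average]

print(f"mean elevation: {average:.1f} m")
print(f"points above the mean: {above_average}")
```

The code reads almost like the sentence describing it: compute the mean, then keep the values above it.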


Interaction: Jupyter

  • Julia, Python, R (and many others)
  • JupyterLab, Jupyter Notebooks, IPython
  • combination of text, images, computational results, code, ...
  • alternatives:
    • RStudio, R Markdown
    • Quarto
    • other notebooks and literate programming tools

Jupyter Notebook with GRASS GIS

Code Management: Git and GitHub

  • GitHub is not the same thing as Git!
  • Git is a version-control system.
  • GitHub is a service providing Git repository hosting and related services.
  • Git is open source. GitHub is a proprietary, freemium service.
  • Alternatives:
    • Mercurial, Bazaar, Fossil, Subversion, ...
    • GitLab, Bitbucket, ...


Running and Sharing: Binder

  • mybinder.org
  • Service turning (Git) repositories and (ZIP) archives into computational environments
  • Jupyter Notebooks, JupyterLab, RStudio, RShiny, Jupyter Appmode, presentation slides, ...
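
A hedged sketch of how a Binder launch link for a GitHub repository can be assembled. It assumes the URL pattern that the mybinder.org front page generates for GitHub repositories (`/v2/gh/<user>/<repo>/<ref>`) and the `urlpath` query parameter for choosing the interface. The helper name `binder_url` and the repository names are hypothetical.

```python
from urllib.parse import quote


def binder_url(user, repo, ref="HEAD", urlpath=None):
    """Build a mybinder.org launch URL for a GitHub repository.

    urlpath optionally selects what opens first, e.g. "lab" for JupyterLab.
    """
    url = (
        "https://mybinder.org/v2/gh/"
        f"{quote(user)}/{quote(repo)}/{quote(ref, safe='')}"
    )
    if urlpath:
        url += f"?urlpath={quote(urlpath, safe='')}"
    return url


# "example-user" and "example-repo" are placeholders, not a real repository.
print(binder_url("example-user", "example-repo", urlpath="lab"))
```

Sharing such a link is enough for others to start the environment in a browser; in practice the mybinder.org form can generate the link for you.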


Binder Alternatives

  • Code Ocean (Jupyter, RStudio, preservation)
  • CoCalc (Jupyter, ..., desktop applications)
  • Posit Cloud, formerly RStudio Cloud (R focused)
  • Google Colab (Jupyter Notebooks, Google Cloud Platform)
    • Good for collaboration and interaction with Google Earth Engine (EE), but not for open science.
    • Requires signing in; a notebook can access your Google Drive and credentials.
    • Setup of the environment is part of the notebook.
  • The Whole Tale (funded by NSF)
  • HydroShare (by CUAHSI, funded by NSF)
  • Pangeo (funded by NSF, NASA, ...)
  • ...

Why Binder?

  • Community-driven, long-term project (Project Jupyter at NumFOCUS and cloud providers)
  • Python, R, ...
  • Similar concepts to RStudio Cloud, Code Ocean, ...
  • Open source (building blocks of Binder: JupyterHub, BinderHub, repo2docker)
    • More than one instance exists
  • Connects to Git, GitHub, GitLab, Zenodo, figshare, HydroShare, ...
  • No account needed to view, edit, and run


Computational Environment: Docker

  • “virtualization” tool that wraps software with all its dependencies into one package
  • an image is the packaged software (the binary); a container is a running instance of an image
  • makes it possible to run many isolated computational environments
  • open source software and related freemium platform
  • alternatives:
    • virtual machines (local or in the cloud)
    • Vagrant (configuration like Docker, but for virtual machines)
    • Singularity (similar to Docker, but for HPC clusters)
    • ...


Computational Environment: Specifications

Dockerfile: plain-text specification of what the computational environment should look like in Docker

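FROM ubuntu:22.04
# Ubuntu 22.04 ships Python 3 only, so the packages are python3 and python3-numpy.
# Update the package index and install in one layer so the index is never stale.
RUN apt-get update && apt-get install -y \
        g++ \
        python3 \
        python3-numpy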

requirements.txt: list of Python packages (and optionally their versions) consumed by the package installer (pip)

pyunpack
GDAL==3.0.1
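
To make the format concrete, here is a minimal sketch that reads such a list and reports which packages are pinned to a version. The function `parse_requirements` is a hypothetical helper written for this illustration, not part of pip, and it handles only bare names and `name==version` lines.

```python
def parse_requirements(text):
    """Map each package name to its pinned version (None when unpinned).

    Only handles bare names and "name==version" lines; real requirements
    files allow more syntax (other operators, extras, URLs, options).
    """
    packages = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, _, version = line.partition("==")
        packages[name.strip()] = version.strip() or None
    return packages


requirements = """\
pyunpack
GDAL==3.0.1
"""

print(parse_requirements(requirements))
```

Pinned versions (such as the GDAL line) make a rerun more likely to reproduce the original results, while unpinned packages resolve to whatever version is current at build time.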

Other configuration files accepted by Binder: environment.yml, runtime.txt, apt.txt, DESCRIPTION, default.nix, postBuild, start, ...

Summary

  • GitHub hosts Git repositories and allows editing the content.
  • Binder uses Docker to create customized computational environments.
  • People use JupyterLab to run and edit scripts and Jupyter Notebooks.


JupyterLab with GRASS GIS

Hands-on

https://bit.ly/2oR9owy
Section: Exercise