Reproducible computational environments

Vaclav (Vashek) Petras

NCSU GeoForAll Lab at the Center for Geospatial Analytics
North Carolina State University

NCSU GIS 595-601: Tools for open geospatial science
November 8, 2017

Motivation

Dependencies

  • unknown software dependencies
    • code does not specify its dependencies
  • many software dependencies
    • the list of dependencies is long
  • chain of dependencies
    • the specified dependencies require further dependencies of their own

Dependency hell

  • a software package depends on a number of other software packages with specific versions
  • a different software package depends on the same software packages but requires different (incompatible) versions
  • DLL hell (MS Windows)
  • JAR hell (Java)

Data

  • data for computations
  • software/system configuration

Code rot

  • bit rot, software rot, software erosion, software decay, software entropy
  • dependencies change
  • code breaks
  • code gives different results in different environments

Documentation

  • imprecise documentation
  • lack of documentation
  • completeness of documentation
  • how to put all things together?

Solutions

Python Virtual Environments

  • python -m venv
  • isolated installation of Python packages
  • each environment has a separate set of packages
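
A minimal sketch of this workflow on Linux or macOS, assuming python3 with the venv module is installed. The directory name demo-env and the file name requirements.txt are illustrative choices, not taken from the slides.

```shell
# Create an isolated environment in the directory demo-env (hypothetical name)
python3 -m venv demo-env

# Use the environment's own interpreter; sys.prefix points into demo-env
demo-env/bin/python -c 'import sys; print(sys.prefix)'

# Record the exact package versions installed in this environment
# (the list stays empty until packages are installed into demo-env)
demo-env/bin/python -m pip freeze > requirements.txt
```

Running demo-env/bin/python -m pip install -r requirements.txt in a fresh environment installs the same package versions again. This covers Python packages only, not system libraries.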

Virtual Machines

  • guest operating system is running on a virtual machine
  • virtual machine disk/image contains (installed) operating system, libraries, and executable programs (and data)
  • open source: QEMU (Quick Emulator), VirtualBox (Oracle VM VirtualBox)
  • proprietary: VMware Workstation, Parallels Desktop for Mac

Docker

Docker vs Virtual Machine

Docker:

  • containers running on a host system (kernel)
  • shares resources with the host system
  • operating system and dependencies in layers
  • lightweight, fast

Virtual Machine:

  • host system runs virtual machine with guest system
  • needs a lot of computational resources
  • heavyweight, slow
  • completely isolated

Docker running non-Linux OS

  • possible to run MS Windows, but you or people using your code will likely break the license agreement sooner or later
  • MS Windows (more or less) requires MS Windows host operating system

Problems in scientific use

  • operating system may get outdated, base image may disappear
  • operating system package repository is decommissioned
  • convenient advanced images may disappear
  • base image gets updated
  • too much focus on clouds and industry (e.g. new versions for MS Windows require professional edition)

Many of these problems appear only when just the Dockerfile is published. Publishing the image removes some of them, but brings disadvantages such as large binary files.
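
One possible way to publish an image as a file is sketched below. It assumes Docker is installed and that an image named test1 (the name used in the build example in these slides) already exists. The archive name test1.tar is a hypothetical choice.

```shell
# Export the image, with all its layers, into a single tar archive
docker save -o test1.tar test1

# Later, or on another machine: restore the image from the archive
docker load -i test1.tar
```

The archive contains the operating system layers as well as the installed dependencies, which is why it tends to be large.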

Alternatives

  • Open Container Initiative (OCI), CoreOS rkt (Rocket), Vagrant, ...

Using Docker

Dockerfile

FROM ubuntu:16.04
# update the package index and install in one step so that a stale cached index is never reused
RUN apt-get update && apt-get install -y \
        g++ \
        python \
        python-numpy \
        ...

Building an image

  • build an image from a Dockerfile
  • run in the directory (repository) that contains the Dockerfile
docker build -t test1 .

Running a container

  • run a container based on an image
  • runs the image's default command or the command specified on the command line
docker run -t test1 /code/computation.sh

Linking directories

  • link a local (host) directory with a directory in a container
  • useful for getting large data into and out of a container
docker run -v /home/.../test1-data:/data -t test1 /code/computation.sh

Linking ports

  • link a host port to a port in a container
  • needed for web applications such as Jupyter Notebook
docker run -p 8888:8888 -t test1 /code/computation.sh

Setting environment variables

  • set an environment variable in the container
  • useful for program settings and parameters
docker run -e VARIABLE_NAME=42 -t test1 /code/computation.sh

Passing parameters to the process

docker run -t test1 /code/computation.sh param1 param2
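
The slides do not show the contents of /code/computation.sh. The following is a hypothetical stand-in that sketches how a script inside the container could read the variable set with -e and the parameters passed after the script name. The function name describe_inputs and the fallback value are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for /code/computation.sh (the real script is not shown)

describe_inputs() {
    # Fall back to a placeholder when the variable was not set with `docker run -e`
    value="${VARIABLE_NAME:-unset}"
    echo "VARIABLE_NAME is $value"

    # "$@" holds the parameters given after the script name on the docker run line
    echo "received $# parameter(s)"
    for param in "$@"; do
        echo "parameter: $param"
    done
}

describe_inputs "$@"
```

With a script like this, the -e and parameter examples above can be combined in a single docker run call.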