Reproducible computational environments
Vaclav (Vashek) Petras
NCSU GIS 595-601: Tools for open geospatial science
November 8, 2017
Dependencies
- unknown software dependencies
- code does not specify its dependencies
- many software dependencies
- the list of dependencies is long
- chain of dependencies
- the specified dependencies require a new set of dependencies
Dependency hell
- a software package depends on a number of other software packages
with specific versions
- a different software package depends on the same software packages
but requires different (incompatible) versions
- DLL hell (MS Windows)
- JAR hell (Java)
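The conflict described above can be illustrated with a minimal sketch. The package names and version pins are hypothetical and only exact pins are compared; real package managers resolve version ranges, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical packages and exact version pins, for illustration only.
requirements = {
    "package_a": {"libfoo": "1.2", "libbar": "3.0"},
    "package_b": {"libfoo": "2.0", "libbar": "3.0"},
}


def find_conflicts(requirements):
    """Return dependencies that are pinned to more than one version."""
    pins = defaultdict(dict)
    for package, deps in requirements.items():
        for dep, version in deps.items():
            # Group the requesting packages by the version they ask for.
            pins[dep].setdefault(version, []).append(package)
    return {dep: versions for dep, versions in pins.items() if len(versions) > 1}


conflicts = find_conflicts(requirements)
for dep, versions in conflicts.items():
    print(dep, "is required in incompatible versions:", versions)
```

Here only libfoo is reported, because both packages agree on the version of libbar.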
Data
- data for computations
- software/system configuration
Code rot
- bit rot, software rot, software erosion, software decay, software entropy
- dependencies change
- code breaks
- code gives different results in different environments
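One way to notice that an environment has changed is to record it next to the results. The following is a minimal sketch using only the Python standard library (Python 3.8 or newer is assumed); the function name and the layout of the snapshot are illustrative choices, not an established format.

```python
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Collect interpreter, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installations without a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()
print(json.dumps(snapshot, indent=2))
```

Comparing two such snapshots shows which dependencies changed between two runs of the same code.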
Documentation
- imprecise documentation
- lack of documentation
- completeness of documentation
- how to put all things together?
Python Virtual Environments
python -m venv
- isolated installation of Python packages
- each environment has a separate set of packages
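The same module can also be used from Python code. The sketch below creates an environment in a temporary directory (an assumption made only to keep the example self-cleaning) and skips pip so that it runs quickly and offline.

```python
import tempfile
import venv
from pathlib import Path

# Temporary location used only for this illustration.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "env"
    # Similar in spirit to `python -m venv --without-pip env`.
    venv.create(env_dir, with_pip=False)
    # Every environment gets its own pyvenv.cfg configuration file.
    config_exists = (env_dir / "pyvenv.cfg").is_file()
    print("environment created:", config_exists)
```

In practice the environment would live in a permanent directory and include pip, so that packages can be installed into it separately from the system Python.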
Virtual Machines
- guest operating system is running on a virtual machine
- virtual machine disk/image contains (installed) operating system,
libraries, and executable programs (and data)
- open source: QEMU (Quick Emulator), VirtualBox (Oracle VM VirtualBox)
- proprietary: VMware Workstation, Parallels Desktop for Mac
Docker vs Virtual Machine
Docker:
- containers running on a host system (kernel)
- shares resources
- operating system and dependencies in layers
- lightweight, fast
Virtual Machine:
- host system runs virtual machine with guest system
- needs a lot of computational resources
- heavyweight, slow
- completely isolated
Docker running non-Linux OS
- possible to run MS Windows, but you or people using your code
will likely break the license agreement sooner or later
- MS Windows in a container (more or less) requires an MS Windows host operating system
Problems in scientific use
- operating system may get outdated, base image may disappear
- operating system package repository is decommissioned
- convenient advanced images may disappear
- base image gets updated
- Docker focuses too much on clouds and industry
(e.g. new versions for MS Windows require a professional edition)
Many of these problems appear only when just a Dockerfile is published.
Publishing the image removes some of them, but brings
disadvantages such as large binary files.
Alternatives
- Open Container Initiative (OCI), CoreOS rkt (Rocket), Vagrant, ...
Dockerfile
FROM ubuntu:16.04
# update and install in one RUN so a stale cached package index is not reused
RUN apt-get update && apt-get install -y \
    g++ \
    python \
    python-numpy \
    ...
Building an image
- build an image using a Dockerfile
- run the command in the directory (repository) that contains the Dockerfile
docker build -t test1 .
Running a container
- run a container based on an image
- runs the command associated with the image or the one specified on the command line
docker run -t test1 /code/computation.sh
Linking directories
- link a local (host) directory with a directory in a container
- useful for getting large data in and out of a container
docker run -v /home/.../test1-data:/data -t test1 /code/computation.sh
Linking ports
- link a host port to a port in a container
- needed for web applications such as Jupyter Notebook
docker run -p 8888:8888 -t test1 /code/computation.sh
Setting environment variables
- set a system environment variable in the container
- useful for program settings and parameters
- useful for program settings and parameters
docker run -e VARIABLE_NAME=42 -t test1 /code/computation.sh
Passing parameters to the process
docker run -t test1 /code/computation.sh param1 param2
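The options above can be combined in one command. As a minimal sketch, the hypothetical helper below assembles such a command line from Python without running it; the function name, its parameters, and the host directory /tmp/test1-data are assumptions for illustration, while the flags are the ones shown on the previous slides.

```python
import shlex


def docker_run_command(image, command, params=(), volumes=None, ports=None, env=None):
    """Assemble a `docker run` command line as a list of arguments."""
    args = ["docker", "run"]
    for host_dir, container_dir in (volumes or {}).items():
        args += ["-v", f"{host_dir}:{container_dir}"]  # linked directory
    for host_port, container_port in (ports or {}).items():
        args += ["-p", f"{host_port}:{container_port}"]  # linked port
    for name, value in (env or {}).items():
        args += ["-e", f"{name}={value}"]  # environment variable
    # Parameters after the command are passed to the process itself.
    args += ["-t", image, command, *params]
    return args


cmd = docker_run_command(
    "test1",
    "/code/computation.sh",
    params=["param1", "param2"],
    volumes={"/tmp/test1-data": "/data"},  # hypothetical host directory
    ports={8888: 8888},
    env={"VARIABLE_NAME": 42},
)
print(shlex.join(cmd))
```

On a machine with Docker installed, the resulting list could be passed to subprocess.run(cmd, check=True) to start the container.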