NCSU GIS 714:
Geospatial Computing and Simulations

Urban growth modeling with FUTURES on Hazel

Here we demonstrate how to run parallel jobs in GRASS using an urban growth modeling case study.

We will all share the PERMANENT mapset of the Location albers, which contains the prepared data, and each student will create their own mapsets for computations in their own folder, named after their user name (see the description of this setup). Typically, the PERMANENT mapset may reside in research storage to which you do not have write access. Please do not modify anything in the PERMANENT mapset. Create your own Location directory and link the shared PERMANENT mapset into it:

mkdir /share/gis714s23/$USER/grassdata/albers
ln -s /share/gis714s23/grassdata/albers/PERMANENT/ /share/gis714s23/$USER/grassdata/albers/
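You can check the setup by listing your new Location directory; PERMANENT should appear as a symbolic link pointing to the shared mapset:
ls -l /share/gis714s23/$USER/grassdata/albers/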
Install pynodelauncher:
module use --append /usr/local/usrapps/geospatial/modulefiles
module load gcc
module load grass
module load PrgEnv-intel
pip install git+https://github.com/ncsu-landscape-dynamics/pynodelauncher.git
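To confirm that the package was installed into your environment, you can run:
pip show pynodelauncher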
Start GRASS GIS in a temporary mapset:
grass --tmp-mapset /share/gis714s23/$USER/grassdata/albers/
Check that you can access the data:
g.list type=raster,vector  -mt
Install FUTURES extension:
g.extension r.futures
Exit the GRASS session (the temporary mapset will be removed):
exit
We won't go through the entire workflow; instead we demonstrate selected parallel computation techniques on parts of it. The next three sections (computing slope, distance to features, and development pressure) are independent of each other, so you can submit all of their jobs at the same time.

Computing slope

The r.slope.aspect tool can take advantage of multiple cores because it uses the OpenMP library to automatically distribute work among a given number of threads. OpenMP uses shared memory, so the job needs to run on a single node. We will submit a parallel job that requests 4 cores and specifies that those cores need to be on one compute node (span[hosts=1]). The number of cores specified in the header should match the number in the nprocs parameter. We will compute slope in a separate mapset called slope. Note that we don't have to set the computational region for the new mapset, because new mapsets use the default region set in PERMANENT.
#!/bin/bash
#BSUB -n 4
#BSUB -W 15
#BSUB -R "rusage[mem=1GB]"
#BSUB -R span[hosts=1] 
#BSUB -oo slope_out
#BSUB -eo slope_err
#BSUB -J slope

module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass

# remove this mapset which may exist from a previous run of this job
rm -rf /share/gis714s23/$USER/grassdata/albers/slope

# create a new mapset with -c flag
grass -c /share/gis714s23/$USER/grassdata/albers/slope --exec r.slope.aspect elevation=DEM slope=slope nprocs=4
Submit this job with a different number of cores and compare the run times. Change the name of the mapset if you want to run both jobs at the same time.
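The job scripts in this tutorial are submitted to LSF with bsub. A minimal sketch, assuming you saved the script above as slope.sh (the file name is your choice):
# submit the job script; the #BSUB directives are read from the file
bsub < slope.sh
# check whether the job is pending or running
bjobs
# after the job finishes, slope_out contains the tool output and the LSF job report
cat slope_out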

Computing distance

Here we compute distances to roads, water, and forest, which will serve as predictors. Since the r.grow.distance tool is not internally parallelized, we will execute the distance computations for roads, water, and forest so that they run concurrently using 3 cores. This can be achieved simply by executing them in the background. Alternatively, you could split the work into 3 separate jobs.
#!/bin/bash
#BSUB -n 3
#BSUB -W 3:00
#BSUB -R "rusage[mem=1GB]"
#BSUB -oo dist_out
#BSUB -eo dist_err
#BSUB -J dist

module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass

rm -rf /share/gis714s23/$USER/grassdata/albers/roads
rm -rf /share/gis714s23/$USER/grassdata/albers/water
rm -rf /share/gis714s23/$USER/grassdata/albers/forest

# run in background by appending & after the tool
grass -c /share/gis714s23/$USER/grassdata/albers/roads --exec r.grow.distance input=roads distance=dist_to_roads -m &
grass -c /share/gis714s23/$USER/grassdata/albers/water --exec r.grow.distance input=water distance=dist_to_water -m &
grass -c /share/gis714s23/$USER/grassdata/albers/forest --exec r.grow.distance input=forest distance=dist_to_forest -m &
# this will wait for the processes to finish before starting the next set of processes
wait
grass /share/gis714s23/$USER/grassdata/albers/roads --exec r.mapcalc "log_dist_to_roads = log(dist_to_roads + 1)" &
grass /share/gis714s23/$USER/grassdata/albers/water --exec r.mapcalc "log_dist_to_water = log(dist_to_water + 1)" &
grass /share/gis714s23/$USER/grassdata/albers/forest --exec r.mapcalc "log_dist_to_forest = log(dist_to_forest + 1)" &
wait
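After the job finishes, you can check that the distance rasters were created in their respective mapsets, for example:
grass --tmp-mapset /share/gis714s23/$USER/grassdata/albers --exec \
    g.list type=raster pattern="*dist_to_*" mapset="*" -m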

Computing development pressure

Here we combine the two previous approaches to compute development pressure for 2001 and 2019. The r.futures.devpressure tool is internally parallelized, so we can run it for both years concurrently and request a certain number of cores for each run. Notice that we then need to request twice as many cores (on a single node).
#!/bin/bash
#BSUB -n 16
#BSUB -W 10:00
#BSUB -R span[hosts=1]
#BSUB -R "rusage[mem=2GB]"
#BSUB -oo devpressure_out
#BSUB -eo devpressure_err
#BSUB -J devpressure

module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass

rm -rf /share/gis714s23/$USER/grassdata/albers/devpressure_2001
rm -rf /share/gis714s23/$USER/grassdata/albers/devpressure_2019

# run in background by appending & after the tool
grass -c /share/gis714s23/$USER/grassdata/albers/devpressure_2001 --exec r.futures.devpressure input=urban_no_roads_2001 output=devpressure_2001 size=20 gamma=1 nprocs=8 &
grass -c /share/gis714s23/$USER/grassdata/albers/devpressure_2019 --exec r.futures.devpressure input=urban_no_roads_2019 output=devpressure_2019 size=20 gamma=1 nprocs=8 &

# this will wait for the processes to finish
wait
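Once the job finishes, you can compare the memory actually used with the amount requested. Assuming the standard LSF job report, the resource usage summary in the output file shows the maximum memory used:
grep -A 8 "Resource usage summary" devpressure_out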

Running the simulation

Once all the previous steps have finished, we can run the FUTURES simulation. We will use precomputed input CSV files. First, in your home directory create a new file named "pga.sh" with the following content:
#!/bin/bash
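# ${1} is the state identifier and ${2} is the random seed, passed as arguments to this script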

g.region raster=state_${1}
r.mask raster=masking
r.futures.pga developed=urban_2019 development_pressure=devpressure_2019@devpressure_2019 \
    compactness_mean=0.5 compactness_range=0.1 discount_factor=1 \
    predictors=log_dist_to_forest@forest,log_dist_to_roads@roads,log_dist_to_water@water,slope@slope \
    n_dev_neighbourhood=20 devpot_params=/share/gis714s23/input_csv/best_model.csv num_neighbors=4 seed_search=probability \
    development_pressure_approach=gravity gamma=1 scaling_factor=1 num_steps=31 \
    subregions=state_${1} demand=/share/gis714s23/input_csv/demand_${1}.csv \
    output=out_state_${1}_seed_${2} patch_sizes=/share/gis714s23/input_csv/patch_sizes_${1}.csv random_seed=${2}
Then create a script "bsub_futures.sh" with the content below:
#!/bin/bash
#BSUB -n 21
#BSUB -R "rusage[mem=150GB]"
#BSUB -R span[ptile=10]
#BSUB -W 40:00
#BSUB -oo futures_out
#BSUB -eo futures_err
#BSUB -J futures

module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass
module load PrgEnv-intel


rm -f /share/gis714s23/${USER}/pga_jobs.sh
for SEED in `seq 1 10`
do
    for STATE in 1 12 13 37 45 47
    do
        rm -rf /share/gis714s23/$USER/grassdata/albers/pga_${STATE}_${SEED}
        echo grass -c /share/gis714s23/$USER/grassdata/albers/pga_${STATE}_${SEED} --exec bash /home/${USER}/pga.sh ${STATE} ${SEED} >> /share/gis714s23/${USER}/pga_jobs.sh
    done
done

mpiexec python -m mpi4py -m pynodelauncher /share/gis714s23/${USER}/pga_jobs.sh
This submission script first generates a file, which looks like this:
grass -c /share/gis714s23/akratoc/grassdata/albers/pga_1_1 --exec bash /home/akratoc/pga.sh 1 1
grass -c /share/gis714s23/akratoc/grassdata/albers/pga_12_1 --exec bash /home/akratoc/pga.sh 12 1
...
This file contains commands that will be distributed for computation across multiple nodes, each of them computing the simulation for a single state and random seed. Notice that the submission parameters are slightly more complex. We estimate memory use to be 10-15 GB per state (it will differ based on the state size), so we can fit around 10 concurrent simulations on a node when we request 150 GB; therefore, we specify rusage[mem=150GB] and span[ptile=10]. Requesting 20 cores would then result in requesting 2 nodes; we request 21 instead of 20 to include one extra core for managing the MPI tasks. This is not necessarily the most efficient approach, since we could also submit separate jobs, but a large number of job submissions is generally discouraged. There will be a total of 60 (6 states x 10 seeds) FUTURES computations distributed over the worker cores; once a computation from the list is finished, the next one is started. When you submit your job, it may be pending for a while, because you are requesting a lot of resources for a long time. For more complex submissions, you can consult HPC IT.
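Submit the job with bsub < bsub_futures.sh. While it runs (or after it finishes), you can check how many of the 60 simulations have produced an output raster so far, for example:
grass --tmp-mapset /share/gis714s23/$USER/grassdata/albers --exec \
    g.list type=raster pattern="out_state_*_seed_*" mapset="*" | wc -l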

Postprocessing

We need to patch together the results for the individual states. For that we will use the r.patch tool, which is internally parallelized. First, create a bash script patch.sh that accepts 2 parameters (the seed number and the number of cores), loads GRASS GIS, lists all FUTURES output rasters belonging to the specified seed, and executes r.patch in a newly created mapset.
#!/bin/bash

# get parameters: seed number and number of cores
SEED=${1}
N=${2}

module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass
module load PrgEnv-intel

# list results for each state for a single seed and save to a variable
MAPS=`grass --tmp-mapset /share/gis714s23/$USER/grassdata/albers --exec \
     g.list type=raster pattern="out_state_*_seed_${SEED}" mapset="*" -m separator=comma`
rm -rf /share/gis714s23/$USER/grassdata/albers/results_${SEED}
grass -c /share/gis714s23/$USER/grassdata/albers/results_${SEED} --exec r.patch input=${MAPS} output="out_seed_${SEED}" nprocs=${N}
grass /share/gis714s23/$USER/grassdata/albers/results_${SEED} --exec r.colors map=out_seed_${SEED} raster=out_state_1_seed_${SEED}@pga_1_${SEED}
Then create another script that submits the patching jobs for each seed (submit_patch.sh):
N=4
for SEED in {1..10}
do
    bsub -n ${N} -W 1:00 -R span[hosts=1] -J patch_${SEED} -oo patch_${SEED}_out -eo patch_${SEED}_err bash patch.sh ${SEED} ${N}
done
Run the script with bash submit_patch.sh. This will submit 10 jobs. The results are rasters out_seed_1@results_1, out_seed_2@results_2, etc. Report the run times for all your computations and include a figure with the results.
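One way to prepare the figure is to export a patched result from GRASS GIS to GeoTIFF and visualize it with your preferred tool; a sketch for seed 1 (adjust the seed and the output path as needed):
grass /share/gis714s23/$USER/grassdata/albers/results_1 --exec \
    r.out.gdal input=out_seed_1 output=/share/gis714s23/$USER/out_seed_1.tif format=GTiff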