Urban growth modeling with FUTURES on Hazel
Here we demonstrate how to run parallel jobs in GRASS using an urban growth modeling case study.
We will all share the PERMANENT mapset of the Location albers
with prepared data, and each student will create
their own mapset (named after their user name) for computations in their own folder
(see the description of this setup).
Typically, the PERMANENT mapset may reside on research storage to which you do not have write access.
Please do not modify anything in the PERMANENT mapset.
mkdir /share/gis714s23/$USER/grassdata/albers
ln -s /share/gis714s23/grassdata/albers/PERMANENT/ /share/gis714s23/$USER/grassdata/albers/
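As a quick optional check, list the contents of your new folder; the symlinked PERMANENT should point to the shared location:
ls -l /share/gis714s23/$USER/grassdata/albers/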
Install pynodelauncher:
module use --append /usr/local/usrapps/geospatial/modulefiles
module load gcc
module load grass
module load PrgEnv-intel
pip install git+https://github.com/ncsu-landscape-dynamics/pynodelauncher.git
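To confirm that the installation succeeded, you can optionally check that pip can find the package:
pip show pynodelauncher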
Start GRASS GIS in a temporary mapset:
grass --tmp-mapset /share/gis714s23/$USER/grassdata/albers/
Check that you can access the data:
g.list type=raster,vector -mt
Install FUTURES extension:
g.extension r.futures
Exit GRASS session (temporary mapset will be removed):
exit
We won't be going through the entire workflow, but rather demonstrating some parallel computation techniques on selected
parts of the workflow. The next three sections (computing slope, distance to features, and development pressure)
are independent, so you can submit all the jobs at the same time.
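For example, once you have saved the three job scripts from the following sections, you can submit them all at once (the file names slope.sh, dist.sh, and devpressure.sh are just examples, use whatever names you choose) and monitor them with bjobs:
bsub < slope.sh
bsub < dist.sh
bsub < devpressure.sh
bjobs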
Computing slope
The tool r.slope.aspect
can take advantage of multiple cores, because it uses the OpenMP library to automatically distribute
work among a given number of threads. OpenMP uses shared memory and therefore the job needs to run on a single node.
We will submit a parallel job
that requests 4 cores and specifies that those cores need to be on one compute node (span[hosts=1]).
The number of cores specified in the header should match the number in the nprocs parameter.
We will compute slope in a separate mapset called slope.
Note that we don't have to set the computational region for the new mapset, because all new mapsets use the default region
that is set in PERMANENT.
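If you want to see the default region that new mapsets inherit, one optional way (with the grass module loaded) is to print it from a temporary mapset:
grass --tmp-mapset /share/gis714s23/$USER/grassdata/albers/ --exec g.region -p
The job script for computing slope follows: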
#!/bin/bash
#BSUB -n 4
#BSUB -W 15
#BSUB -R "rusage[mem=1GB]"
#BSUB -R span[hosts=1]
#BSUB -oo slope_out
#BSUB -eo slope_err
#BSUB -J slope
module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass
# remove this mapset which may exist from a previous run of this job
rm -rf /share/gis714s23/$USER/grassdata/albers/slope
# create a new mapset with -c flag
grass -c /share/gis714s23/$USER/grassdata/albers/slope --exec r.slope.aspect elevation=DEM slope=slope nprocs=4
Submit this job with a different number of cores and compare the run times. Change the name of the mapset if you want to run both jobs at the same time.
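One simple way to compare run times is to look at the resource usage summary that LSF appends to the job's output file once the job finishes (the exact wording of the report may vary slightly):
grep -A 10 "Resource usage summary" slope_out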
Computing distance
Here we compute distance to roads, water, and forest, which will serve as predictors. Since the tool r.grow.distance
is not internally parallelized, we will execute the distance computations for roads, water, and forest
so that they run concurrently on 3 cores. This can be achieved simply by executing them in the background.
Alternatively, you can split the computation and execute it as 3 separate jobs (a sketch of this alternative follows the job script below).
#!/bin/bash
#BSUB -n 3
#BSUB -W 3:00
#BSUB -R "rusage[mem=1GB]"
#BSUB -oo dist_out
#BSUB -eo dist_err
#BSUB -J dist
module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass
rm -rf /share/gis714s23/$USER/grassdata/albers/roads
rm -rf /share/gis714s23/$USER/grassdata/albers/water
rm -rf /share/gis714s23/$USER/grassdata/albers/forest
# run in background by appending & after the tool
grass -c /share/gis714s23/$USER/grassdata/albers/roads --exec r.grow.distance input=roads distance=dist_to_roads -m &
grass -c /share/gis714s23/$USER/grassdata/albers/water --exec r.grow.distance input=water distance=dist_to_water -m &
grass -c /share/gis714s23/$USER/grassdata/albers/forest --exec r.grow.distance input=forest distance=dist_to_forest -m &
# this will wait for the processes to finish before starting the next set of processes
wait
grass /share/gis714s23/$USER/grassdata/albers/roads --exec r.mapcalc "log_dist_to_roads = log(dist_to_roads + 1)" &
grass /share/gis714s23/$USER/grassdata/albers/water --exec r.mapcalc "log_dist_to_water = log(dist_to_water + 1)" &
grass /share/gis714s23/$USER/grassdata/albers/forest --exec r.mapcalc "log_dist_to_forest = log(dist_to_forest + 1)" &
wait
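As a sketch of the alternative mentioned above, each distance computation could also be submitted as its own single-core job directly with bsub. This assumes the modules are loaded in your submission shell (bsub propagates the environment by default); the job names and output file names are illustrative:
for FEATURE in roads water forest
do
    rm -rf /share/gis714s23/$USER/grassdata/albers/${FEATURE}
    bsub -n 1 -W 3:00 -R "rusage[mem=1GB]" -J dist_${FEATURE} -oo dist_${FEATURE}_out -eo dist_${FEATURE}_err \
        "grass -c /share/gis714s23/$USER/grassdata/albers/${FEATURE} --exec r.grow.distance input=${FEATURE} distance=dist_to_${FEATURE} -m"
done
The r.mapcalc step for the log-transformed distances would then be submitted the same way once these jobs finish.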
Computing development pressure
Here we combine the two previous approaches to compute development pressure for 2001 and 2019. The tool r.futures.devpressure is internally parallelized, so we can run it for both years concurrently and request a certain number of cores for each of them. Notice that we then need to request twice as many cores (on a single node).
#!/bin/bash
#BSUB -n 16
#BSUB -W 10:00
#BSUB -R span[hosts=1]
#BSUB -R "rusage[mem=2GB]"
#BSUB -oo devpressure_out
#BSUB -eo devpressure_err
#BSUB -J devpressure
module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass
rm -rf /share/gis714s23/$USER/grassdata/albers/devpressure_2001
rm -rf /share/gis714s23/$USER/grassdata/albers/devpressure_2019
# run in background by appending & after the tool
grass -c /share/gis714s23/$USER/grassdata/albers/devpressure_2001 --exec r.futures.devpressure input=urban_no_roads_2001 output=devpressure_2001 size=20 gamma=1 nprocs=8 &
grass -c /share/gis714s23/$USER/grassdata/albers/devpressure_2019 --exec r.futures.devpressure input=urban_no_roads_2019 output=devpressure_2019 size=20 gamma=1 nprocs=8 &
# this will wait for the processes to finish
wait
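Once the job finishes, you can optionally confirm that both development pressure rasters were created:
grass --tmp-mapset /share/gis714s23/$USER/grassdata/albers/ --exec g.list type=raster pattern="devpressure_*" mapset="*"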
Running the simulation
Once all the previous steps are finished, we can run the FUTURES simulation. We will use already precomputed input CSV files. First, in your home directory create a new file named "pga.sh" with the following content:
#!/bin/bash
g.region raster=state_${1}
r.mask raster=masking
r.futures.pga developed=urban_2019 development_pressure=devpressure_2019@devpressure_2019 \
compactness_mean=0.5 compactness_range=0.1 discount_factor=1 \
predictors=log_dist_to_forest@forest,log_dist_to_roads@roads,log_dist_to_water@water,slope@slope \
n_dev_neighbourhood=20 devpot_params=/share/gis714s23/input_csv/best_model.csv num_neighbors=4 seed_search=probability \
development_pressure_approach=gravity gamma=1 scaling_factor=1 num_steps=31 \
subregions=state_${1} demand=/share/gis714s23/input_csv/demand_${1}.csv \
output=out_state_${1}_seed_${2} patch_sizes=/share/gis714s23/input_csv/patch_sizes_${1}.csv random_seed=${2}
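Before submitting the full job, you may want to test pga.sh for a single state and seed as a small separate job; the state, seed, memory, and time limits below are only example values:
module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass
rm -rf /share/gis714s23/$USER/grassdata/albers/pga_37_1
bsub -n 1 -W 4:00 -R "rusage[mem=16GB]" -J pga_test -oo pga_test_out -eo pga_test_err \
    "grass -c /share/gis714s23/$USER/grassdata/albers/pga_37_1 --exec bash /home/$USER/pga.sh 37 1"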
Then create a script "bsub_futures.sh" with the content below:
#!/bin/bash
#BSUB -n 21
#BSUB -R "rusage[mem=150GB]"
#BSUB -R span[ptile=10]
#BSUB -W 40:00
#BSUB -oo futures_out
#BSUB -eo futures_err
#BSUB -J futures
module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass
module load PrgEnv-intel
rm -f /share/gis714s23/${USER}/pga_jobs.sh
for SEED in `seq 1 10`
do
    for STATE in 1 12 13 37 45 47
    do
        rm -rf /share/gis714s23/$USER/grassdata/albers/pga_${STATE}_${SEED}
        echo grass -c /share/gis714s23/$USER/grassdata/albers/pga_${STATE}_${SEED} --exec bash /home/${USER}/pga.sh ${STATE} ${SEED} >> /share/gis714s23/${USER}/pga_jobs.sh
    done
done
mpiexec python -m mpi4py -m pynodelauncher /share/gis714s23/${USER}/pga_jobs.sh
This submission script first generates a file, which looks like this:
grass -c /share/gis714s23/akratoc/grassdata/albers/pga_1_1 --exec bash /home/akratoc/pga.sh 1 1
grass -c /share/gis714s23/akratoc/grassdata/albers/pga_12_1 --exec bash /home/akratoc/pga.sh 12 1
...
This file contains commands that will be distributed for computation across multiple nodes, each of them computing the simulation for a single state and random seed. Notice that the submission parameters are slightly more complex. We estimate the memory to be 10-15 GB per state (it will differ based on the state size), so we can fit around 10 concurrent simulations on a node when we request 150 GB. Therefore, we specify that with rusage[mem=150GB] and span[ptile=10].
When we request 20 cores, that results in requesting 2 nodes.
We request 21 instead of 20 to include one extra core for MPI thread management. This is not necessarily the most efficient way;
we could also submit separate jobs, but a large number of job submissions is generally discouraged. There will be a total of 60 (6 states x 10 seeds) FUTURES computations on 18 cores; once a computation from the list finishes,
the next one is started. When you submit your job, it may be pending for a while, since you are requesting a lot of resources for a long time.
For more complex submissions, you can consult HPC IT.
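While the job is running (or after it finishes), one way to check progress is to count how many of the 60 expected output rasters already exist:
grass --tmp-mapset /share/gis714s23/$USER/grassdata/albers/ --exec g.list type=raster pattern="out_state_*_seed_*" mapset="*" | wc -l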
Postprocessing
We need to patch together the results for the individual states. For that we will use the tool r.patch,
which is internally parallelized.
First, create a bash script patch.sh
that accepts 2 parameters (seed number and number of cores),
loads GRASS, lists all FUTURES output rasters belonging to the specified seed,
and executes r.patch in a newly created mapset.
#!/bin/bash
# get parameters
SEED=${1}
N=${2}
module use --append /usr/local/usrapps/geospatial/modulefiles
module load grass
module load PrgEnv-intel
# list results for each state for a single seed and save to a variable
MAPS=`grass --tmp-mapset /share/gis714s23/$USER/grassdata/albers --exec \
g.list type=raster pattern="out_state_*_seed_${SEED}" mapset="*" -m separator=comma`
rm -rf /share/gis714s23/$USER/grassdata/albers/results_${SEED}
grass -c /share/gis714s23/$USER/grassdata/albers/results_${SEED} --exec r.patch input=${MAPS} output="out_seed_${SEED}" nprocs=${N}
grass /share/gis714s23/$USER/grassdata/albers/results_${SEED} --exec r.colors map=out_seed_${SEED} raster=out_state_1_seed_${SEED}@pga_1_${SEED}
Then create another script that submits the patching jobs for each seed (submit_patch.sh):
N=4
for SEED in {1..10}
do
bsub -n ${N} -W 1:00 -R span[hosts=1] -J patch_${SEED} -oo patch_${SEED}_out -eo patch_${SEED}_err bash patch.sh ${SEED} ${N}
done
Run the script with bash submit_patch.sh. This will submit 10 jobs.
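After the jobs finish, a quick way to confirm that all 10 completed successfully is to check their LSF output files (the exact wording of the report may vary slightly); this should print 10:
grep -l "Successfully completed" patch_*_out | wc -l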
The results are rasters out_seed_1@results_1, out_seed_2@results_2, etc.
Report the run times for all your computations and include a figure with the results.
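One possible way to create a figure for the report is to export one of the patched results to PNG with r.out.png (the output path below is just an example; you could also render the maps in the GRASS GUI instead):
grass /share/gis714s23/$USER/grassdata/albers/results_1 --exec r.out.png input=out_seed_1 output=/share/gis714s23/$USER/out_seed_1.png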