Running Experiments

Running experiments in Raster Vision is done with the rastervision run command. This command searches the locations given to it for ExperimentSet classes and calls their methods to collect ExperimentConfig objects. These are fed into the ExperimentRunner chosen as a command line argument, which determines how the commands derived from the experiments are executed.
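
For orientation, here is a rough sketch of an ExperimentSet, assuming the pre-1.0 rastervision API (the experiment id, root URI, and elided builder calls are illustrative, not prescribed):

    import rastervision as rv

    class MyExperimentSet(rv.ExperimentSet):
        # `rastervision run` calls every method prefixed with exp_ and
        # collects the ExperimentConfig (or list of them) each returns.
        def exp_main(self):
            # Task, backend, and dataset builders elided for brevity.
            task = ...
            backend = ...
            dataset = ...
            return rv.ExperimentConfig.builder() \
                     .with_id('my-experiment') \
                     .with_root_uri('s3://my-bucket/rv-root') \
                     .with_task(task) \
                     .with_backend(backend) \
                     .with_dataset(dataset) \
                     .build()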

ExperimentRunners

An ExperimentRunner takes a collection of ExperimentConfig objects and executes commands derived from those configurations. Which commands it runs depends on which commands the user requested, which commands have already been run, and which commands are common between ExperimentConfigs.

Note

Raster Vision considers two commands to be equal if their inputs, outputs, and command types (e.g. rv.CHIP, rv.TRAIN, etc.) are the same. With sameness defined this way, Raster Vision avoids running duplicate commands within a single run.
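
To make the sameness rule concrete, here is a minimal sketch (not Raster Vision's actual code; the command_type, inputs, and outputs attribute names are hypothetical):

    # Two commands are duplicates if their type, inputs, and outputs match.
    def command_key(command):
        return (command.command_type,
                frozenset(command.inputs),
                frozenset(command.outputs))

    def deduplicate(commands):
        unique = {}
        for command in commands:
            # Keep only the first command seen for each key.
            unique.setdefault(command_key(command), command)
        return list(unique.values())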

During the process of deriving commands from the ExperimentConfigs, each Config object in the experiment has the chance to update itself for a specific command (using the update_for_command method) and to report its inputs and outputs (using the report_io method). This is an internal mechanism, so you won't have to dive too deeply into it unless you are a contributor or a plugin author. However, it's good to know that this is when some implicit values are set on the configuration. For instance, the model_uri property can be set on a rv.BackendConfig using the with_model_uri method on the builder; the more standard practice, though, is to let Raster Vision set this property during the update_for_command process described above, which it does based on the root_uri of the ExperimentConfig as well as other factors.
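
A rough sketch of these two hooks, for plugin authors (the signatures vary between Raster Vision versions, and the model_uri and train_uri details here are assumptions, not the library's exact behavior):

    import os
    import rastervision as rv

    class MyBackendConfig(rv.BackendConfig):
        def update_for_command(self, command_type, experiment_config, context=None):
            # Fill in implicit values, e.g. derive model_uri from the
            # experiment's train_uri when the user did not set it with
            # with_model_uri on the builder.
            if command_type == rv.TRAIN and self.model_uri is None:
                self.model_uri = os.path.join(experiment_config.train_uri, 'model')

        def report_io(self, command_type, io_def):
            # Declare inputs and outputs so the runner can wire up the DAG.
            if command_type == rv.TRAIN:
                io_def.add_output(self.model_uri)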

The base ExperimentRunner class constructs a Directed Acyclic Graph (DAG) of the commands, based on which commands consume other commands' outputs as input, and passes it off to the implementation to be executed. The specific implementation chooses how to actually execute each command.

When an ExperimentSet is executed by an ExperimentRunner, it is first converted into a CommandDAG representing a DAG of commands. In this graph, there is a node for each command, and an edge from X to Y if X produces one of Y's inputs. The commands are then executed according to a topological sort of the graph, so as to respect dependencies between commands.
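
The wiring rule can be illustrated in a few lines of plain Python (this is not Raster Vision's internal code; it just demonstrates the input/output rule using the standard-library graphlib, available in Python 3.9+):

    from graphlib import TopologicalSorter

    commands = {
        'ANALYZE': {'inputs': {'scenes'}, 'outputs': {'stats.json'}},
        'CHIP': {'inputs': {'scenes', 'stats.json'}, 'outputs': {'chips'}},
        'TRAIN': {'inputs': {'chips'}, 'outputs': {'model'}},
    }

    # Y depends on X when X produces one of Y's inputs.
    graph = {
        name: {other for other, o in commands.items()
               if other != name and o['outputs'] & cmd['inputs']}
        for name, cmd in commands.items()
    }

    print(list(TopologicalSorter(graph).static_order()))
    # ['ANALYZE', 'CHIP', 'TRAIN']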

Two optimizations are performed to eliminate duplicated computation. The first is to only execute commands whose outputs don't already exist. The second is to eliminate duplicate nodes that appear when experiments partially overlap, for example when an ExperimentSet is created with multiple experiments that generate the same chips:

[Figure: commands-tree-workflow.png, a command DAG in which duplicate commands shared by overlapping experiments have been merged]

Running locally

A rastervision run local ... command uses the LocalExperimentRunner, which builds a Makefile based on the DAG and then executes it on the host machine. This runs commands from multiple experiments in parallel.
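
The idea behind the Makefile translation can be sketched like this (a hypothetical helper, not the runner's actual code): each command becomes a numbered target whose prerequisites are its upstream commands, so make can schedule independent commands in parallel.

    # dag maps a numeric command id to (shell_command, ids_of_upstream_commands).
    def dag_to_makefile(dag):
        ids = ' '.join(str(i) for i in dag)
        lines = ['.PHONY: all ' + ids, 'all: ' + ids]
        for cmd_id, (shell_cmd, upstream_ids) in dag.items():
            deps = ' '.join(str(d) for d in upstream_ids)
            # A target only runs after all of its prerequisites finish.
            lines.append('{}: {}'.format(cmd_id, deps))
            lines.append('\t' + shell_cmd)
        return '\n'.join(lines) + '\n'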

Running on AWS Batch

rastervision run aws_batch ... will execute the commands on AWS Batch. This provides a powerful mechanism for running Raster Vision experiment workflows: it allows queues of CPU and GPU instances to sit at 0 instances when not in use. You kick off a run with a single command on your own machine; AWS Batch then scales the instance count to meet the workload with low-cost spot instances, and terminates the instances when the queue of commands is finished. It can also run some commands on CPU instances (like CHIP) and others on GPU instances (like TRAIN), and will run multiple experiments in parallel.

The AWSBatchExperimentRunner executes each command by submitting a job to Batch, which runs rastervision run_command inside the Docker image configured in the Batch job definition. A command that depends on an upstream command is submitted as a job after the upstream command's job, with the jobId of the upstream job as its parent jobId. This way AWS Batch knows to wait to execute each command until all of its upstream commands have finished, and will fail the command if any upstream command fails.
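
The dependency mechanism is standard AWS Batch job chaining. A rough boto3 sketch of the pattern (the queue and job-definition names are made up, and the actual run_command arguments are elided):

    import boto3

    batch = boto3.client('batch')

    # Upstream job, e.g. a CHIP command on the CPU queue.
    chip_job = batch.submit_job(
        jobName='chip',
        jobQueue='raster-vision-cpu',           # hypothetical queue
        jobDefinition='raster-vision-cpu-def',  # hypothetical job definition
        containerOverrides={
            'command': ['rastervision', 'run_command', '...']})

    # Downstream job, e.g. TRAIN on the GPU queue. Batch waits for the
    # parent job and fails this job if the parent fails.
    train_job = batch.submit_job(
        jobName='train',
        jobQueue='raster-vision-gpu',
        jobDefinition='raster-vision-gpu-def',
        dependsOn=[{'jobId': chip_job['jobId']}],
        containerOverrides={
            'command': ['rastervision', 'run_command', '...']})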

If you are running on AWS Batch or any other remote runner, you will not be able to use your local file system to store any of the data associated with an experiment; this includes plugin files.

Note

To run on AWS Batch, you’ll need the proper setup. See Setting up AWS Batch for instructions.

Running commands in Parallel

Raster Vision can run certain commands in parallel, such as the CHIP and PREDICT commands. To do so, use the --splits option in the run command of the CLI.

Commands implement a split method that returns either the original command, if it cannot be split (e.g. training), or a sequence of commands that each do a subset of the work. For instance, using --splits 5 on a CHIP command over 50 training scenes and 25 validation scenes results in 5 CHIP commands that can be run in parallel, each of which creates chips for 15 scenes.
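
A minimal sketch of the splitting idea (a hypothetical command class; the real split signature and scene-partitioning logic may differ):

    class ChipCommand:
        def __init__(self, scenes):
            self.scenes = scenes

        def split(self, num_parts):
            if num_parts <= 1:
                return [self]  # unsplittable requests return the original
            # Deal scenes round-robin into num_parts near-equal groups.
            groups = [self.scenes[i::num_parts] for i in range(num_parts)]
            return [ChipCommand(g) for g in groups if g]

    # 75 scenes split 5 ways -> five commands of 15 scenes each.
    parts = ChipCommand(list(range(75))).split(5)
    assert [len(p.scenes) for p in parts] == [15] * 5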

The command DAG given to the experiment runner is constructed so that each split command can be run in parallel if the runner supports parallelization, and so that any command dependent on the output of the split command depends on each of the splits. In the above example, this means a TRAIN command that depended on a single CHIP command before the split will depend on each of the 5 individual CHIP commands after the split.

Each runner handles parallelization differently. For instance, the local runner runs all of the splits simultaneously, so be sure the number of splits is appropriate for the number of CPUs available. The AWS Batch runner submits a job for each command split, and the Batch Compute Environment dictates how many resources are available to run Batch jobs simultaneously.