Architecture and Customization#

Codebase Overview#

The Raster Vision codebase is designed with modularity and flexibility in mind. There is one required package, rastervision.pipeline, which contains functionality for defining and configuring computational pipelines, running them in different environments using parallelism and GPUs, reading from and writing to different file systems, and adding and customizing pipelines via a plugin mechanism. In contrast, the “domain logic” of geospatial deep learning using PyTorch and running on AWS is contained in a set of optional plugin packages. All plugin packages must be under the rastervision native namespace package.

Each of these packages is contained in a separate setuptools/pip package with its own dependencies, including dependencies on other Raster Vision packages. This means that it’s possible to install and use subsets of the functionality in Raster Vision. A short summary of the packages is as follows:

The figure below shows the packages, the dependencies between them, and important base classes within each package.

Figure: The dependencies between Python packages in Raster Vision
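
Because each package is installed separately, a script may want to check which ones are available before importing them. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of Raster Vision.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if module_name can be located.

    The module itself is not imported, although find_spec does import
    any parent packages of a dotted name.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "rastervision") is missing.
        return False


# rastervision.pipeline is the required package named above; add the
# names of whichever optional plugin packages your project relies on.
for name in ['rastervision.pipeline']:
    status = 'installed' if is_installed(name) else 'not installed'
    print('{}: {}'.format(name, status))
```

A check like this lets a script fail early with a clear message instead of hitting an ImportError halfway through a run.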

Writing pipelines and plugins#

In this section, we explain the most important aspects of the rastervision.pipeline package through a series of examples which incrementally build on one another. These examples show how to write custom pipelines and configuration schemas, how to customize an existing pipeline, and how to package the code as a plugin.

The full source code for Examples 1 and 2 is in rastervision.pipeline_example_plugin1, and the code for Example 3 is in rastervision.pipeline_example_plugin2. All three can be run from inside the Raster Vision Docker image. Note, however, that new plugins are typically created in a separate repo and Docker image; Bootstrap new projects with a template shows how to do this.

Example 1: a simple pipeline#

A Pipeline in Raster Vision is a class which represents a sequence of commands with a shared configuration in the form of a PipelineConfig. Here is a toy example of these two classes that saves a set of messages to disk, and then prints them all out.

rastervision.pipeline_example_plugin1.sample_pipeline#
from typing import List, Optional
from os.path import join

from rastervision.pipeline.pipeline import Pipeline
from rastervision.pipeline.file_system import str_to_file, file_to_str
from rastervision.pipeline.pipeline_config import PipelineConfig
from rastervision.pipeline.config import register_config
from rastervision.pipeline.utils import split_into_groups


# Each Config needs to be registered with a type hint which is used for
# serializing and deserializing to JSON.
@register_config('pipeline_example_plugin1.sample_pipeline')
class SamplePipelineConfig(PipelineConfig):
    # Config classes are configuration schemas. Each field is an attribute
    # with a type and optional default value.
    names: List[str] = ['alice', 'bob']
    message_uris: Optional[List[str]] = None

    def build(self, tmp_dir):
        # The build method is used to instantiate the corresponding object
        # using this configuration.
        return SamplePipeline(self, tmp_dir)

    def update(self):
        # The update method is used to set default values as a function of
        # other values.
        if self.message_uris is None:
            self.message_uris = [
                join(self.root_uri, '{}.txt'.format(name))
                for name in self.names
            ]


class SamplePipeline(Pipeline):
    # The order in which commands run. Each command corresponds to a method.
    commands: List[str] = ['save_messages', 'print_messages']

    # Split commands can be split up and run in parallel.
    split_commands = ['save_messages']

    # GPU commands are run using GPUs if available. There are no commands worth running
    # on a GPU in this pipeline.
    gpu_commands = []

    def save_messages(self, split_ind=0, num_splits=1):
        # Save a file for each name with a message.

        # The num_splits is the number of parallel jobs to use and
        # split_ind tracks the index of the parallel job. In this case
        # we are splitting on the names/message_uris.
        split_groups = split_into_groups(
            list(zip(self.config.names, self.config.message_uris)), num_splits)
        split_group = split_groups[split_ind]

        for name, message_uri in split_group:
            message = 'hello {}!'.format(name)
            # str_to_file and most functions in the file_system package can
            # read and write transparently to different file systems based on
            # the URI pattern.
            str_to_file(message, message_uri)
            print('Saved message to {}'.format(message_uri))

    def print_messages(self):
        # Read all the message files and print them.
        for message_uri in self.config.message_uris:
            message = file_to_str(message_uri)
            print(message)

In order to run this, we need a separate Python file with a get_config() function which provides an instantiation of the SamplePipelineConfig.

rastervision.pipeline_example_plugin1.config1#
from rastervision.pipeline_example_plugin1.sample_pipeline import (
    SamplePipelineConfig)


def get_config(runner, root_uri):
    # The get_config function returns an instantiated PipelineConfig and
    # plays a similar role as a typical "config file" used in other systems.
    # It's different in that it can have loops, conditionals, local variables,
    # etc. The runner argument is the name of the runner used to run the
    # pipeline (eg. local or batch). Any other arguments are passed from the
    # CLI using the -a option.
    names = ['alice', 'bob', 'susan']

    # Note that root_uri is a field that is inherited from PipelineConfig,
    # the parent class of SamplePipelineConfig, and specifies the root URI
    # where any output files are saved.
    return SamplePipelineConfig(root_uri=root_uri, names=names)
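
To make the role of the -a option more concrete, here is a simplified sketch of how key/value pairs from the command line could be forwarded to get_config as keyword arguments. This is not the Raster Vision CLI implementation: the helper call_get_config is hypothetical, and the stand-in get_config returns a plain dict rather than a PipelineConfig so that the snippet runs without Raster Vision installed.

```python
import inspect


def call_get_config(get_config, runner, cli_args):
    """Call get_config with the runner name plus any -a key/value pairs.

    Hypothetical helper: it rejects keys that get_config does not accept
    so that typos on the command line fail loudly.
    """
    accepted = inspect.signature(get_config).parameters
    unknown = set(cli_args) - set(accepted)
    if unknown:
        raise ValueError('Unexpected arguments: {}'.format(sorted(unknown)))
    return get_config(runner, **cli_args)


def get_config(runner, root_uri, names='alice,bob'):
    # Stand-in for a real get_config. Being ordinary Python, it can use
    # local variables, conditionals, and loops to build the configuration.
    name_list = names.split(',')
    return {'runner': runner, 'root_uri': root_uri, 'names': name_list}


cfg = call_get_config(
    get_config, 'inprocess', {'root_uri': '/opt/data/pipeline-example/1/'})
print(cfg)
```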

Finally, in order to package this code as a plugin, and make it usable within the Raster Vision framework, it needs to be in a package directly under the rastervision namespace package, and have a top-level __init__.py file with a certain structure.

rastervision.pipeline_example_plugin1.__init__#

def register_plugin(registry):
    """Each plugin must register itself and FileSystems, Runners it defines.
    
    The version number helps ensure backward compatibility of configs across
    versions. If you change the fields of a config but want it to remain
    backward-compatible you can increment the version below and define a
    config-upgrader function that makes the old version of the config dict
    compatible with the new version. This upgrader function should be passed to
    the :func:`.register_config` decorator of the config in question.
    """
    registry.set_plugin_version('rastervision.pipeline_example_plugin1', 0)


# Must import pipeline package first.
import rastervision.pipeline

# Then import any modules that add Configs so that the register_config decorators
# get called.
import rastervision.pipeline_example_plugin1.sample_pipeline
import rastervision.pipeline_example_plugin1.sample_pipeline2
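
The docstring above mentions config-upgrader functions. The sketch below illustrates the general idea with plain dicts. It assumes an upgrader takes the serialized config dict and the version it was saved with, and returns a dict valid for the next version; that signature, the renamed field, and the upgrade helper are all assumptions made for illustration, so check the register_config documentation before relying on them.

```python
CURRENT_VERSION = 1  # hypothetical: the value passed to set_plugin_version


def sample_pipeline_upgrader(cfg_dict, version):
    """Upgrade cfg_dict from `version` to `version + 1` (assumed signature)."""
    if version == 0:
        # Hypothetical schema change: 'names' used to be called 'people'.
        cfg_dict = dict(cfg_dict)  # copy so the caller's dict is untouched
        cfg_dict['names'] = cfg_dict.pop('people')
    return cfg_dict


def upgrade(cfg_dict, saved_version, upgrader=None):
    """Apply the upgrader once per version step up to CURRENT_VERSION."""
    for version in range(saved_version, CURRENT_VERSION):
        if upgrader is not None:
            cfg_dict = upgrader(cfg_dict, version)
    return cfg_dict


old_cfg = {'people': ['alice', 'bob']}
new_cfg = upgrade(old_cfg, saved_version=0, upgrader=sample_pipeline_upgrader)
print(new_cfg)
```

Applying the upgrader one version at a time means each step only needs to know about a single schema change.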

We can invoke the Raster Vision CLI to run the pipeline using:

> rastervision run inprocess rastervision.pipeline_example_plugin1.config1 -a root_uri /opt/data/pipeline-example/1/ -s 2

Running save_messages command split 1/2...
Saved message to /opt/data/pipeline-example/1/alice.txt
Saved message to /opt/data/pipeline-example/1/bob.txt
Running save_messages command split 2/2...
Saved message to /opt/data/pipeline-example/1/susan.txt
Running print_messages command...
hello alice!
hello bob!
hello susan!

This uses the inprocess runner, which executes all the commands in a single process locally (which is good for debugging), and uses the LocalFileSystem to read and write files. The -s 2 option says to use two splits for splittable commands, and the -a root_uri /opt/data/pipeline-example/1/ option says to pass the root_uri argument to the get_config function.
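
To see why split 1/2 handled alice and bob while split 2/2 handled susan, it helps to look at how a list can be chunked into consecutive groups. The function below is a stand-in written for illustration, not the library's split_into_groups; it assumes simple ceiling-based chunking, which reproduces the grouping shown in the output above.

```python
import math


def split_into_groups_sketch(items, num_groups):
    """Chunk items into at most num_groups consecutive groups."""
    group_size = max(math.ceil(len(items) / num_groups), 1)
    return [
        items[i:i + group_size] for i in range(0, len(items), group_size)
    ]


names = ['alice', 'bob', 'susan']
groups = split_into_groups_sketch(names, 2)
print(groups)  # [['alice', 'bob'], ['susan']]
```

Each parallel job then picks its own group by index, as save_messages does with split_groups[split_ind].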

Example 2: hierarchical config#

This example makes some small changes to the previous example, and shows how configurations can be built up hierarchically. However, the main purpose here is to lay the foundation for Example 3: customizing an existing pipeline, which shows how to customize the configuration schema and behavior of this pipeline using a plugin. The changes to the previous example are highlighted with comments, but the overall effect is to delegate making messages to a MessageMaker class with its own MessageMakerConfig, which includes a greeting field.

rastervision.pipeline_example_plugin1.sample_pipeline2#
from typing import List, Optional
from os.path import join

from rastervision.pipeline.pipeline import Pipeline
from rastervision.pipeline.file_system import str_to_file, file_to_str
from rastervision.pipeline.pipeline_config import PipelineConfig
from rastervision.pipeline.config import register_config, Config
from rastervision.pipeline.utils import split_into_groups


@register_config('pipeline_example_plugin1.message_maker')
class MessageMakerConfig(Config):
    greeting: str = 'hello'

    def build(self):
        return MessageMaker(self)


class MessageMaker():
    def __init__(self, config):
        self.config = config

    def make_message(self, name):
        # Use the greeting field to make the message.
        return '{} {}!'.format(self.config.greeting, name)


@register_config('pipeline_example_plugin1.sample_pipeline2')
class SamplePipeline2Config(PipelineConfig):
    names: List[str] = ['alice', 'bob']
    message_uris: Optional[List[str]] = None
    # Fields can have other Configs as types.
    message_maker: MessageMakerConfig = MessageMakerConfig()

    def build(self, tmp_dir):
        return SamplePipeline2(self, tmp_dir)

    def update(self):
        if self.message_uris is None:
            self.message_uris = [
                join(self.root_uri, '{}.txt'.format(name))
                for name in self.names
            ]


class SamplePipeline2(Pipeline):
    commands: List[str] = ['save_messages', 'print_messages']
    split_commands = ['save_messages']
    gpu_commands = []

    def save_messages(self, split_ind=0, num_splits=1):
        message_maker = self.config.message_maker.build()

        split_groups = split_into_groups(
            list(zip(self.config.names, self.config.message_uris)), num_splits)
        split_group = split_groups[split_ind]

        for name, message_uri in split_group:
            # Unlike before, we use the message_maker to make the message.
            message = message_maker.make_message(name)
            str_to_file(message, message_uri)
            print('Saved message to {}'.format(message_uri))

    def print_messages(self):
        for message_uri in self.config.message_uris:
            message = file_to_str(message_uri)
            print(message)

We can configure the pipeline using:

rastervision.pipeline_example_plugin1.config2#
from rastervision.pipeline_example_plugin1.sample_pipeline2 import (
    SamplePipeline2Config, MessageMakerConfig)


def get_config(runner, root_uri):
    names = ['alice', 'bob', 'susan']
    # Same as before except we can set the greeting to be
    # 'hola' instead of 'hello'.
    message_maker = MessageMakerConfig(greeting='hola')
    return SamplePipeline2Config(
        root_uri=root_uri, names=names, message_maker=message_maker)

The pipeline can then be run with the above configuration using:

> rastervision run inprocess rastervision.pipeline_example_plugin1.config2 -a root_uri /opt/data/pipeline-example/2/ -s 2

Running save_messages command split 1/2...
Saved message to /opt/data/pipeline-example/2/alice.txt
Saved message to /opt/data/pipeline-example/2/bob.txt
Running save_messages command split 2/2...
Saved message to /opt/data/pipeline-example/2/susan.txt
Running print_messages command...
hola alice!
hola bob!
hola susan!

Example 3: customizing an existing pipeline#

This example shows how to customize the behavior of an existing pipeline, namely the SamplePipeline2 developed in Example 2: hierarchical config. That pipeline delegates making messages to a MessageMaker class which is configured by MessageMakerConfig. Our goal here is to make it possible to control the number of exclamation points at the end of the message.

By writing a plugin (i.e. a plugin to the existing plugin that was developed in the previous two examples), we can add new behavior without modifying any of the original source code from Example 2: hierarchical config. This mimics the situation plugin writers will be in when they want to modify the behavior of one of the geospatial deep learning pipelines without modifying the source code in the main Raster Vision repo.

The code to implement the new configuration and behavior, and a sample configuration are below. (We omit the __init__.py file since it is similar to the one in the previous plugin.) Note that the new DeluxeMessageMakerConfig uses inheritance to extend the configuration schema.

rastervision.pipeline_example_plugin2.deluxe_message_maker#
from rastervision.pipeline.config import register_config
from rastervision.pipeline_example_plugin1.sample_pipeline2 import (
    MessageMakerConfig, MessageMaker)


# You always need to use the register_config decorator.
@register_config('pipeline_example_plugin2.deluxe_message_maker')
class DeluxeMessageMakerConfig(MessageMakerConfig):
    # Note that this inherits the greeting field from MessageMakerConfig.
    level: int = 1

    def build(self):
        return DeluxeMessageMaker(self)


class DeluxeMessageMaker(MessageMaker):
    def make_message(self, name):
        # Uses the level field to determine the number of exclamation marks.
        exclamation_marks = '!' * self.config.level
        return '{} {}{}'.format(self.config.greeting, name, exclamation_marks)

rastervision.pipeline_example_plugin2.config3#
from rastervision.pipeline_example_plugin1.sample_pipeline2 import (
    SamplePipeline2Config)
from rastervision.pipeline_example_plugin2.deluxe_message_maker import (
    DeluxeMessageMakerConfig)


def get_config(runner, root_uri):
    names = ['alice', 'bob', 'susan']
    # Note that we use the DeluxeMessageMakerConfig and set the level to 3.
    message_maker = DeluxeMessageMakerConfig(greeting='hola', level=3)
    return SamplePipeline2Config(
        root_uri=root_uri, names=names, message_maker=message_maker)

We can run the pipeline as follows:

> rastervision run inprocess rastervision.pipeline_example_plugin2.config3 -a root_uri /opt/data/pipeline-example/3/ -s 2

Running save_messages command split 1/2...
Saved message to /opt/data/pipeline-example/3/alice.txt
Saved message to /opt/data/pipeline-example/3/bob.txt
Running save_messages command split 2/2...
Saved message to /opt/data/pipeline-example/3/susan.txt
Running print_messages command...
hola alice!!!
hola bob!!!
hola susan!!!
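
The reason SamplePipeline2 needs no changes is that it only ever calls build() on whatever message maker config it is given, and the subclass overrides build(). The following standalone sketch shows that mechanism with plain Python classes; they mirror the names used above but are simplified stand-ins that do not depend on Raster Vision, and make_messages is a hypothetical helper playing the role of save_messages.

```python
class MessageMakerConfig:
    def __init__(self, greeting='hello'):
        self.greeting = greeting

    def build(self):
        return MessageMaker(self)


class MessageMaker:
    def __init__(self, config):
        self.config = config

    def make_message(self, name):
        return '{} {}!'.format(self.config.greeting, name)


class DeluxeMessageMakerConfig(MessageMakerConfig):
    def __init__(self, greeting='hello', level=1):
        super().__init__(greeting)
        self.level = level

    def build(self):
        # Overriding build() is what swaps in the new behavior.
        return DeluxeMessageMaker(self)


class DeluxeMessageMaker(MessageMaker):
    def make_message(self, name):
        marks = '!' * self.config.level
        return '{} {}{}'.format(self.config.greeting, name, marks)


def make_messages(message_maker_config, names):
    # Like save_messages, this code is unaware of which config subclass
    # it was handed; it just builds the maker and uses it.
    maker = message_maker_config.build()
    return [maker.make_message(name) for name in names]


plain = make_messages(MessageMakerConfig(greeting='hola'), ['alice'])
deluxe = make_messages(
    DeluxeMessageMakerConfig(greeting='hola', level=3), ['alice'])
print(plain, deluxe)
```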

The output directory (the root_uri passed on the command line) contains a pipeline-config.json file, which is the serialized version of the SamplePipeline2Config created in config3.py. The serialized configuration is used to transmit the configuration when running a pipeline remotely. It is also a programming-language-independent record of the fully instantiated configuration that was generated by the run command in conjunction with any command line arguments. Below are the partial contents of this file. The interesting thing to note here is the type_hint field, which appears twice. This is what allows the JSON to be deserialized back into the Python classes that were originally used. (Recall that the register_config decorator is what tells the Registry the type hint for each Config class.)

{
    "root_uri": "/opt/data/pipeline-example/3/",
    "type_hint": "pipeline_example_plugin1.sample_pipeline2",
    "names": [
        "alice",
        "bob",
        "susan"
    ],
    "message_uris": [
        "/opt/data/pipeline-example/3/alice.txt",
        "/opt/data/pipeline-example/3/bob.txt",
        "/opt/data/pipeline-example/3/susan.txt"
    ],
    "message_maker": {
        "greeting": "hola",
        "type_hint": "pipeline_example_plugin2.deluxe_message_maker",
        "level": 3
    }
}
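
The role of type_hint in deserialization can be illustrated with a toy registry. This is only a sketch of the general technique, not Raster Vision's Registry: the register decorator, the REGISTRY dict, the to_dict and from_dict helpers, and the Toy classes are all hypothetical names introduced for this example.

```python
import json

REGISTRY = {}  # maps type hint strings to config classes


def register(type_hint):
    """Toy decorator that records which class a type hint refers to."""
    def decorator(cls):
        cls.type_hint = type_hint
        REGISTRY[type_hint] = cls
        return cls
    return decorator


@register('toy.message_maker')
class ToyMessageMakerConfig:
    def __init__(self, greeting='hello'):
        self.greeting = greeting

    def to_dict(self):
        return {'type_hint': self.type_hint, 'greeting': self.greeting}


@register('toy.deluxe_message_maker')
class ToyDeluxeMessageMakerConfig(ToyMessageMakerConfig):
    def __init__(self, greeting='hello', level=1):
        super().__init__(greeting)
        self.level = level

    def to_dict(self):
        cfg_dict = super().to_dict()
        cfg_dict['level'] = self.level
        return cfg_dict


def from_dict(cfg_dict):
    # The type hint selects the class; the other keys become its fields.
    fields = dict(cfg_dict)
    cls = REGISTRY[fields.pop('type_hint')]
    return cls(**fields)


serialized = json.dumps(ToyDeluxeMessageMakerConfig('hola', 3).to_dict())
restored = from_dict(json.loads(serialized))
print(type(restored).__name__, restored.greeting, restored.level)
```

Without the type hint, the JSON alone would not say whether the message maker should be rebuilt as the base config or the deluxe subclass.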

We now have a plugin that customizes an existing pipeline! Being a toy example, this may all seem like overkill. Hopefully, the real power of the pipeline package becomes more apparent when considering the standard set of plugins distributed with Raster Vision, and how this functionality can be customized with user-created plugins.

Customizing Raster Vision#

When approaching a new problem or dataset with Raster Vision, you may get lucky and be able to apply Raster Vision “off-the-shelf”. In other cases, Raster Vision can be used after writing scripts to convert data into the appropriate format.

However, sometimes you will need to modify the functionality of Raster Vision to suit your problem. In this case, you could modify the Raster Vision source code (i.e. any of the code in the packages in the main Raster Vision repo). In some cases, this may be necessary, as the right extension points don’t exist. In other cases, the functionality may be very widely applicable, and you would like to contribute it to the main repo. Most of the time, however, the functionality will be problem-specific, or in an embryonic stage of development, and should be implemented in a plugin that resides outside the main repo.

General information about plugins can be found in Bootstrap new projects with a template and Writing pipelines and plugins. The following are some brief pointers on how to write plugins for different scenarios. In the future, we would like to enhance this section.