Making satellite imagery easy-to-use: speeding up computations
In our previous post, we examined how satellite imagery can be used in the social sector and how the MOSAIKS algorithm enables us to draw out “features” from these images without needing complex image-processing models. But the story doesn’t end with the algorithm.
Satellite images are large files. This means the retrieval from storage, processing through the MOSAIKS pipeline, and storing the resultant features is quite slow (though still computationally efficient relative to other options). The default MOSAIKS pipeline we employ generates approximately 4,000 features, i.e. a vector of 4,000 numbers, for each satellite image. When attempting to extract features from a large number of images, storage, both in memory during processing and on disk after features have been generated, is a problem.
We solved this problem in 2 steps. First, we used Dask Distributed (Dask, for brevity), a Python library that helps parallelize the fetching and processing of images. Second, we set up a scalable cluster in a cloud computing platform.
Distributed computing with Dask
Dask is a Python library for distributed computing. It assesses available computing resources and the tasks that need to be run, and then intelligently allocates these tasks across the resources.
In our case, the “tasks” involve fetching satellite images and converting them into features for each GPS location a user inputs. A single task involves these steps: (1) read a GPS location, (2) fetch satellite images for this location, considering the appropriate year, season, resolution, etc., (3) apply MOSAIKS to the image to extract features, and (4) save these features. The total number of tasks equals the number of GPS locations. It’s easy to see why extracting MOSAIKS features for thousands of GPS locations would be incredibly slow if performed sequentially.
With Dask, we can perform many of these tasks in parallel. Dask creates a “cluster” filled with “workers” that each receive a portion of the total computational resources (like CPU nodes) we have. We can then organize our tasks into batches and run them on the Dask cluster, where each task is assigned to a different worker. So, if we have 10 workers, we can process features for 10 GPS coordinates at a time, giving us a 10-fold speed increase (minus any Dask overheads).
With Dask, we can perform many of these tasks in parallel. Dask creates a “cluster” with multiple “workers” that each get assigned a fraction of the total computational resources (e.g. CPU nodes) available to us. We can then organize our tasks into batches and run them on the Dask cluster, where each task is assigned to a free worker. This means that if we have 10 workers, we can process features for 10 GPS coordinates at a time, giving us a 10-fold speed increase (minus any Dask overheads).
You can find a description of how Dask assigns tasks to workers here and in the Dask Distributed documentation. The main takeaways from our experience with Dask are that:
- Dask has clear documentation and is straightforward to use. The developer doesn’t need to struggle with figuring out the optimal way to parallelize their code across the resources.
- Once we adapt our pipeline to run through Dask, we can execute it on a local cluster (on one machine) or even a cluster spread across multiple machines. This flexibility is possible because Dask can establish clusters on a variety of setups and needs minimal code adjustments to interface with each.
- The most effective way to use Dask is by defining a task (something that can be executed on a machine without specialized parallel processing) and layering Dask functionality on top of it. This makes spotting bugs and pinpointing errors easier.
Azure Cloud Computing Infrastructure
Dask indeed accelerates the feature extraction process, enabling us to obtain MOSAIKS features in mere hours instead of days. However, running this algorithm on a single machine with limited computing power still poses a challenge, especially when we’re dealing with thousands of GPS coordinates. This led us to using Microsoft’s Azure Virtual Machine service to scale our computational power1.
Azure offers us the ability to set up a remote machine tailored to our specific computational needs. The service ensures the optimal combination of hardware and software to meet our demands and maintains them for as long as we require. Crucially, Azure provides the flexibility to scale these resources up or down based on our task load.
In our specific case, we integrated a Dask workload manager on the Azure platform. Based on the number of GPS coordinates a user enters, the workload manager creates workers. Now, each worker is a virtual machine (VM) within Azure, equipped with its own operating system and computational resources determined by our chosen VM type and size. The workers are assigned tasks similarly to the process described above. When all tasks are completed, we simply shut down the workers and close the Dask workload manager.
For an in-depth understanding of this setup, you can check out this link. Here are the key insights from our experience with Azure VMs:
- Provided there’s a stable internet connection, we can initiate the feature extraction process for thousands of GPS coordinates from a laptop, let it run on the Azure platform, and save the resultant features to cloud storage (we’ve tested this with Amazon Web Services’ S3 storage).
- The choice of VM size and type significantly affects the speed of the feature extraction process, but we need to optimize our computations in order to balance this against the costs associated with an arbitrarily big and powerful VM
- You can create and terminate the cluster and underlying Azure infrastructure, in code using Dask. This simplifies the cluster setup and management. This also helps control costs, as we can dismantle the cluster and all its resources through code once the job is done.
The library is now open-source
Our mosaiks
package offers users the flexibility to run the MOSAIKS algorithm on a local machine, either with or without Dask, or with Dask on a cloud platform. This allows the user to select the degree of computational complexity that suits their specific use-case. We believe this triad of tools—MOSAIKS, Dask, and Azure—significantly enhances the accessibility of satellite imagery analysis for development sector participants, and particularly so for our work.
We specifically chose to go with Microsoft’s cloud computing service since we were already using their satellite imagery platform Microsoft Planetary Computer ↩︎