Creating our on-ramp: how to train a hungry AI model
The ramp project is an ambitious endeavor for many reasons, not least the curation and open release of high-quality training data: tens of thousands of high-resolution imagery chips with accompanying labels that "teach" our machine learning model what a building looks like under various conditions. When we talk about training data, we mean satellite and drone images overlaid with vector labels covering the extent of buildings. These labels are broken into 256 × 256 pixel training "chips" that are fed into our model so it can learn what a building looks like and ultimately detect buildings in fresh imagery. The labels can be captured in any standard geospatial format, including Shapefile, GeoJSON, and GeoPackage. By feeding a machine learning model thousands of examples of buildings across the globe, that model eventually learns to process a completely new image and identify all the buildings in the scene.

An ML model will only be as good as the examples used to train it; a common phrase in the AI/ML community is "garbage in, garbage out." In practice, this means that if you show your model many examples of buildings whose polygon outlines do not capture the entire building, or that include nearby features like a backyard, animal pen, or roadway, the model will start to get confused about what is and is not a building. To maximize the quality of our training data, and of our eventual model, we have researched past efforts to identify what is usable and created fresh training data both in-house and through our partner Radiant Earth Foundation.
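To make the chipping idea more concrete, here is a minimal sketch of how a georeferenced image and its building polygons could be turned into 256 × 256 pixel chips with matching binary masks. It assumes the rasterio and geopandas libraries and hypothetical file names (scene.tif, buildings.geojson with at least one polygon); the actual ramp data pipeline may differ.

```python
import rasterio
from rasterio.windows import Window, transform as window_transform
from rasterio.features import rasterize
import geopandas as gpd

CHIP_SIZE = 256  # pixels per side, as described above

# Hypothetical inputs: one georeferenced image and its building polygon labels.
buildings = gpd.read_file("buildings.geojson")

with rasterio.open("scene.tif") as src:
    buildings = buildings.to_crs(src.crs)  # align label CRS with the imagery
    shapes = [(geom, 1) for geom in buildings.geometry if geom is not None and not geom.is_empty]

    for row in range(0, src.height, CHIP_SIZE):
        for col in range(0, src.width, CHIP_SIZE):
            window = Window(col, row, CHIP_SIZE, CHIP_SIZE)
            # boundless read pads edge chips with zeros so every chip is 256 x 256
            chip = src.read(window=window, boundless=True, fill_value=0)
            chip_transform = window_transform(window, src.transform)
            mask = rasterize(
                shapes,
                out_shape=(CHIP_SIZE, CHIP_SIZE),
                transform=chip_transform,
                fill=0,
                dtype="uint8",
            )  # binary building mask aligned to this chip
            # chip + mask would then be written out as one training example
```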
Foundational Training Data Requirements
To train the model, we prioritized access to publicly released high-resolution satellite/drone imagery and building rooftop labels. In this scenario, high-resolution means better than 60 cm ground sample distance, typically less than 15 degrees off-nadir, and nearly cloud-free. While satisfying these requirements may seem straightforward, there are nuances to what makes the best training data set: devising a way to review what is available in a timely fashion, defining what makes a set "good enough," and ensuring consistent label quality across a variety of inputs are just some pieces of the puzzle. The ramp model requires training data over geographically diverse AOIs, which will be used to build our "baseline" model, as well as concentrated training data over Bangladesh, which will be used to "fine-tune" the baseline model to perform, in theory, better than the generalized baseline. There will be much more detail on these models in forthcoming blog posts, so stay tuned!
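To make those screening criteria concrete, here is a minimal sketch of how candidate scenes could be filtered against them. The metadata field names, the example scenes, and the 5% cutoff for "nearly cloud-free" are assumptions for illustration, not the project's actual review tooling.

```python
# Hypothetical per-scene metadata; field names and values are illustrative only.
candidate_scenes = [
    {"scene_id": "maxar_dhaka_001", "gsd_cm": 46, "off_nadir_deg": 11.2, "cloud_cover_pct": 0.5},
    {"scene_id": "maxar_accra_014", "gsd_cm": 58, "off_nadir_deg": 21.0, "cloud_cover_pct": 3.0},
]

def meets_ramp_imagery_requirements(scene: dict) -> bool:
    """Screen one scene against the imagery criteria described above."""
    return (
        scene["gsd_cm"] <= 60              # better than 60 cm resolution
        and scene["off_nadir_deg"] <= 15   # typically less than 15 degrees off-nadir
        and scene["cloud_cover_pct"] <= 5  # "nearly cloud-free"; threshold is an assumption
    )

usable = [s for s in candidate_scenes if meets_ramp_imagery_requirements(s)]
print([s["scene_id"] for s in usable])  # -> ['maxar_dhaka_001']
```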
Imagery Needs
Open high-resolution satellite imagery is a critical component of the ramp workflow. Luckily, there are numerous options for open imagery, including the Maxar Open Data Program, Mapbox, and imagery releases from past AI/ML competitions, all of which meet our requirements for high-resolution imagery with Creative Commons licensing. The table below details the specific license associated with each of our training data inputs. The Maxar Open Data Program provides pre- and post-disaster imagery across the globe, which we have downloaded and reviewed to assess spatial resolution, off-nadir angle, presence of clouds/haze, and presence of different community types such as dense urban, peri-urban, and rural areas. Across these thousands of image chips, it is important to capture the diversity of buildings, environments, and conditions that are representative of the larger areas the model will map.

Because the diversity needed across a training dataset is difficult to forecast, we took a balanced approach for the baseline model and are currently using the following inputs: SpaceNet, the Open Cities AI Challenge, Mapbox imagery with OSM labels, and Maxar Open Data imagery labeled by DevGlobal and Radiant Earth Foundation (REF). A balanced training data set means the model will not be biased toward identifying buildings in specific geographies or under certain climatic conditions. By teaching our model what buildings look like in a dense urban area like Dhaka as well as in rural regions like Kushalnagar, India, and by feeding it examples of both satellite and drone imagery, the model gains a robust ability to detect buildings around the globe. Across our sources, the team has curated imagery and labels over Dhaka, Shanghai, Paris, Accra, Kinshasa, Kampala, Oman, India, and the Philippines. This initial batch of diverse labels accounts for nearly 50,000 labeled chips, and additional labels will be created to improve the model.
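One simple way to keep a multi-source training set balanced during training is to draw each batch with equal probability from every source, rather than in proportion to raw chip counts. The sketch below illustrates that idea; the source names come from this post, but the chip inventories, counts, and sampling scheme are illustrative assumptions rather than the ramp training code.

```python
import random

# Hypothetical chip inventories per source; real counts and file names differ.
chips_by_source = {
    "spacenet":        [f"spacenet_{i}.tif" for i in range(20000)],
    "open_cities_ai":  [f"opencities_{i}.tif" for i in range(12000)],
    "mapbox_osm":      [f"mapbox_{i}.tif" for i in range(10000)],
    "maxar_open_data": [f"maxar_{i}.tif" for i in range(8000)],
}

def sample_balanced_batch(batch_size: int = 16) -> list[str]:
    """Pick a source uniformly for each slot, then a random chip from that source,
    so no single geography or sensor dominates a training batch."""
    sources = list(chips_by_source)
    return [random.choice(chips_by_source[random.choice(sources)]) for _ in range(batch_size)]

print(sample_balanced_batch(4))
```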
Benefitting from the Open Community
There have been numerous machine learning competitions in the last few years from which the ramp project has benefited greatly, including SpaceNet and the Open Cities AI Challenge, but those only get us so far. Mapbox offers Creative Commons licensing through its Mapbox Tiling Service API, and we can pull down accompanying OSM labels with the imagery, which can in turn be used as training data. By leveraging everything we can from the open community, we hope to further emphasize the importance of open efforts such as ramp, as they will serve as foundational inputs for future endeavors. Without access to open imagery from organizations like Maxar, this effort would be much more challenging, and an additional goal of ramp is to champion open releases of imagery, and of imagery plus labels, that benefit low- and middle-income countries and the global development community. While having access to existing labels is a fantastic head start, label quality varies widely across these sources. In the case of SpaceNet, many of the labels were shifted to better represent the building footprint rather than the rooftop, which we need to adjust for our use. For the Mapbox/OSM labels, it can be hit or miss to find OSM labels that were generated against the same imagery layer we pull down from the API, and even where coincident imagery exists there will still be missing labels.
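For readers curious how OSM building labels can be pulled for an area of interest, the following sketch queries the public Overpass API for building footprints inside a bounding box and writes them out as GeoJSON. The endpoint and query syntax are standard Overpass usage, but the bounding box, output handling, and any alignment with a specific imagery layer are illustrative assumptions, not the ramp team's exact workflow.

```python
import json
import requests

# Hypothetical bounding box over part of Dhaka: (south, west, north, east).
bbox = (23.78, 90.38, 23.80, 90.40)

# Overpass QL: fetch all ways tagged as buildings inside the bbox, with geometry.
query = f"""
[out:json][timeout:60];
way["building"]({bbox[0]},{bbox[1]},{bbox[2]},{bbox[3]});
out geom;
"""

resp = requests.post("https://overpass-api.de/api/interpreter", data={"data": query})
resp.raise_for_status()
elements = resp.json()["elements"]

# Convert each way's node geometry into a GeoJSON-style polygon ring.
features = []
for way in elements:
    ring = [[pt["lon"], pt["lat"]] for pt in way.get("geometry", [])]
    if len(ring) >= 4:  # closed building outlines need at least 4 points
        features.append({
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [ring]},
            "properties": {"osm_id": way["id"]},
        })

print(f"Fetched {len(features)} building footprints")
with open("osm_buildings.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)
```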