DIY: Synthetic data generators for your model's next satellite mission
Create and index imagery for improving onboard model performance
Consider a GeoAI model developer who is not sure whether their model will work on their upcoming satellite mission (especially if it runs onboard). The sensor is very different from the one used for model training, and more synthetic data might be needed. The conversation with them goes like this:
Limiting ourselves to multispectral sensors for this blog, there is enough evidence that model performance degrades when the spatial resolution, spectral response, preprocessing or sensor specifications of the test data differ substantially from those of the training data. Extremely sophisticated but costly tools exist to simulate remote sensing physics and sensors and generate synthetic imagery for testing models. But if you want to get started on model testing quickly and can accept a minor accuracy hit, it might be a good idea to build something simpler yourself.
The blog covers the following tools to build your own synthetic data generator:
Review of the PhiSat-2 Sensor and level of processing modeling notebook
Evaluation notebook for spectral mapping between two sensors
Evaluation notebook of the CLIP-RSICD model for captioning
Requirements for a simple data generator
What is required is a tool that can take existing open satellite imagery and turn it into imagery of a desired satellite sensor. It does so by mapping the spectral response, spatial resolution and viewing geometry (let's say nadir-facing) between the two sensors, and by modelling the sensor specifications (e.g. noise characteristics) and the preprocessing applied to the target sensor image. Captioning each generated image then makes it possible to search for images of a certain class. For model testing, say, you could fetch all coastal images for your coastline monitoring model.
Review of the PhiSat-2 simulator
The European Space Agency (ESA) launched the OrbitalAI Φsat-2 challenge as a global competition last year. In this competition, participants used a simulated dataset to create a working prototype model adapted to run onboard Φsat-2 (an upcoming mission at the time of the challenge). The challenge provided participants with a Φsat-2 simulator Jupyter notebook to generate representative L1C onboard imagery from open-access Sentinel-2 L1C imagery.
L1C imagery here refers to top-of-atmosphere reflectance in sensor geometry, with fine geo-referencing and fine band-to-band alignment (<10 m RMSE). The PhiSat-2 simulator's essential steps are shown in the workflow above. So that any area of interest can be simulated without cost constraints, the L1C reflectance product of the Sentinel-2 multispectral sensor is chosen as the input to the simulator. The simulator takes the 10 m Ground Sampling Distance (GSD) input and applies bicubic interpolation to reach the representative 4.75 m GSD output for all eight bands. The next step is a band misalignment simulation. Because of the pushbroom acquisition mode, the bands over a given area are not acquired at the same time and are affected by variations in platform attitude between the acquisition times. Keeping a specific band as reference, each of the other bands is shifted by a misalignment whose direction and magnitude are drawn at random from a normal distribution. To account for the system Modulation Transfer Function (payload + platform), the next step convolves the input image with a filter function representing the system-specific Point Spread Function (PSF). Based on the solar irradiance, Sun-zenith angle and Earth-Sun distance, the multiplicative factor needed to convert radiance to reflectance is computed to simulate the L1C product. The last step tiles the image into 4096 x 4096 grids, which form the final product available for processing onboard. For more technical details, look at the scientific notice.
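To make the steps above concrete, here is a minimal NumPy sketch of four of them: band misalignment, PSF convolution, radiance-to-reflectance conversion and tiling. It is not the simulator's code. The function names, the Gaussian shape of the PSF, the whole-pixel shifts and all parameter values are assumptions chosen for illustration. The reflectance factor follows the standard top-of-atmosphere relation, reflectance = pi * L * d^2 / (E_sun * cos(theta_s)).

```python
import numpy as np

def shift_bands(cube, sigma_px, ref_band=0, rng=None):
    """Shift every band except the reference by a random whole-pixel offset.

    cube has shape (bands, rows, cols). Offsets are drawn from a normal
    distribution and rounded; np.roll wraps around at the image edges.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty_like(cube)
    for b in range(cube.shape[0]):
        if b == ref_band:
            out[b] = cube[b]
            continue
        dy, dx = np.rint(rng.normal(0.0, sigma_px, size=2)).astype(int)
        out[b] = np.roll(cube[b], shift=(int(dy), int(dx)), axis=(0, 1))
    return out

def gaussian_kernel(sigma_px):
    """1-D Gaussian kernel, normalised to sum to 1 (assumed PSF shape)."""
    radius = int(np.ceil(3 * sigma_px))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma_px) ** 2)
    return k / k.sum()

def apply_psf(band, sigma_px):
    """Blur one 2-D band with a separable Gaussian PSF, edge-padded."""
    k = gaussian_kernel(sigma_px)
    pad = len(k) // 2
    padded = np.pad(band, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def radiance_to_toa_reflectance(radiance, esun, sun_zenith_deg, earth_sun_dist_au):
    """Apply the multiplicative radiance-to-reflectance factor for one band."""
    factor = np.pi * earth_sun_dist_au ** 2 / (esun * np.cos(np.deg2rad(sun_zenith_deg)))
    return radiance * factor

def tile(cube, size):
    """Cut the cube into size x size tiles, dropping partial tiles at the edges."""
    _, h, w = cube.shape
    return [cube[:, r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]
```

In a real generator you would replace the Gaussian with the measured PSF of your system, use sub-pixel shifts, and call `tile(cube, 4096)` on the full scene.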
If you know the sensor specifications, you can use functions (and add your own) in the notebook to build a data generator of your own. You can add preprocessing steps to simulate the level of processing at which your model expects the imagery input to be.
Relies on open data (through SentinelHub)
Spatial resolution simulation involves bicubic interpolation. This can be improved by superresolution (like this) which can retain shape integrity and band ratios.
Changing view geometry is not a feature here.
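The bicubic resampling step mentioned in the points above can be sketched without any imaging library. The code below is an illustrative NumPy implementation of cubic convolution (Keys kernel, with the commonly used parameter a = -0.5) that brings a 10 m band to 4.75 m. The function names, the kernel parameter and the edge clamping are assumptions; library resamplers may differ in these details, and the notebook's own implementation is not reproduced here.

```python
import numpy as np

def keys_weights(t, a=-0.5):
    """Cubic convolution weights for the 4 neighbours at offsets -1, 0, 1, 2.

    t is the fractional position (0 <= t < 1) between neighbours 0 and 1.
    """
    d = np.abs(np.stack([t + 1, t, 1 - t, 2 - t]))        # distance to each neighbour
    near = (a + 2) * d**3 - (a + 3) * d**2 + 1            # kernel for d <= 1
    far = a * d**3 - 5 * a * d**2 + 8 * a * d - 4 * a     # kernel for 1 < d <= 2
    return np.where(d <= 1, near, far)

def resample_axis(img, scale, axis):
    """Resample one axis of img by the given scale factor."""
    n_in = img.shape[axis]
    n_out = int(round(n_in * scale))
    src = (np.arange(n_out) + 0.5) / scale - 0.5          # output centres in input coordinates
    base = np.floor(src).astype(int)
    w = keys_weights(src - base)                          # shape (4, n_out)
    img = np.moveaxis(img, axis, -1)
    out = np.zeros(img.shape[:-1] + (n_out,), dtype=float)
    for k, offset in enumerate((-1, 0, 1, 2)):
        idx = np.clip(base + offset, 0, n_in - 1)         # clamp indices at the border
        out += w[k] * img[..., idx]
    return np.moveaxis(out, -1, axis)

def bicubic_resample(band, gsd_in, gsd_out):
    """Resample a 2-D band from gsd_in to gsd_out (same units, e.g. metres)."""
    scale = gsd_in / gsd_out
    return resample_axis(resample_axis(band, scale, 0), scale, 1)
```

For example, `bicubic_resample(band, 10.0, 4.75)` turns a 19 x 19 patch into a 40 x 40 patch. Interpolation of this kind adds pixels but no new detail, which is why a super-resolution model is suggested above as an improvement.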
Spectral mapping between VENµS and Sentinel 2 sensors
Before a mission, there may not be any directly comparable imagery available to derive a spectral relation, but post-flight there might be just enough data to build a spectral mapping against open data like Sentinel-2. As covered in my previous blogpost, VENµS datasets at 5 m resolution are openly available (consider that mission the equivalent of your target mission) and can be spectrally mapped to Sentinel-2 MSI images.
We provide a workflow and associated code in a Colab notebook and sample images for your reference.
For a VENµS (L1-equivalent) scene that is derived onboard, select a reference Sentinel-2 L1C (pre-atmospheric-correction) image that just exceeds the area of the VENµS scene and is closest in time to the raw image captured onboard for the VENµS scene.
Get keypoints for both images using LightGlue's extractor and matcher (which performs better than SIFT+RANSAC) and find the homography.
Reproject the Sentinel-2 L1C scene onto the L2A scene.
Extract the R, G, B pixels that correspond to the Sentinel-2 L1C scene's R, G, B pixels, after removing cloud pixels and zero-brightness pixels.
Build a linear regressor that maps the R, G, B pixels, channel by channel, between the L2A scene and the Sentinel-2 scene.
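The last two steps above can be sketched as follows, assuming the two scenes have already been co-registered (for example with the homography from the earlier steps). This is an illustrative NumPy version, not the Colab notebook's code. The function names, the direction of the mapping (Sentinel-2 to target sensor) and the optional boolean cloud mask are assumptions.

```python
import numpy as np

def fit_channel_maps(s2_rgb, target_rgb, cloud_mask=None):
    """Fit target ~ gain * s2 + offset separately for R, G and B.

    Both images have shape (rows, cols, 3) and are pixel-aligned.
    cloud_mask is an optional boolean (rows, cols) array, True where cloudy.
    """
    # Keep only pixels with non-zero brightness in both scenes.
    valid = (s2_rgb.sum(axis=-1) > 0) & (target_rgb.sum(axis=-1) > 0)
    if cloud_mask is not None:
        valid &= ~cloud_mask
    maps = []
    for c in range(3):
        x = s2_rgb[..., c][valid].astype(float)
        y = target_rgb[..., c][valid].astype(float)
        gain, offset = np.polyfit(x, y, deg=1)            # least-squares line
        residual = y - (gain * x + offset)
        r2 = 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)
        maps.append({"gain": gain, "offset": offset, "r2": r2})
    return maps

def apply_channel_maps(s2_rgb, maps):
    """Apply the fitted per-channel lines to a Sentinel-2 RGB image."""
    out = np.empty(s2_rgb.shape, dtype=float)
    for c, m in enumerate(maps):
        out[..., c] = m["gain"] * s2_rgb[..., c] + m["offset"]
    return out
```

The `r2` value per channel gives a quick read on fit quality. A low value is a hint that clouds, shadows or misregistration are still contaminating the pixel pairs.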
Notice the distinct variation in water and ground colour between the two images. Upon closer inspection, there are many sensor artefacts in the VENµS imagery that don't seem to appear in the Sentinel-2 imagery. If you follow the steps provided in the notebook, it should be easy to replicate the red-to-red spectral map between the two sensors and build a linear regressor. As can be seen, there are several outlier points, which can be due to different colour patterns in the two images. Clouds, cloud shadows and atmospheric effects all play a role in the fit quality.
If you know the view time and geometry and radiance values, you can modify the notebook to build a spectral map for your own mission.
LightGlue makes the extraction and matching of keypoints reliable for two different sensors.
Relies on open data (through SentinelHub) and some target mission imagery.
Expect to improve model performance as soon as you have better spectral response in your simulated imagery dataset using spectral maps for your mission.
Requires a larger sample size of data and cloud/shadow-rejection models to build more robust classifiers. Read this for more ideas.
Captioning: Evaluation of the CLIP-RSICD model
The Transformer architecture has gained immense popularity in computer vision as well, and one such application in remote sensing is image captioning, i.e. assigning labels to remote sensing images. Technically speaking, this is zero-shot image classification: the task of classifying images into one of several classes without any prior training on, or knowledge of, those classes. Zero-shot image classification works by transferring knowledge learnt during the training of a model to classify novel classes that were not present in the training data. One of the most popular models for zero-shot image classification is the CLIP model from OpenAI, which uses transformer-based image and text encoders. A version of this model called CLIP-RSICD is fine-tuned on image-caption pairs, mainly from the RSICD dataset, which contains more than 10k remote sensing images from Google Earth, Baidu Map, MapABC and Tianditu, with 5 sentence descriptions per image; each image is 224 x 224 pixels, at various spatial resolutions. The other datasets used in training are the UCM dataset and the Sydney dataset, which also contain images with 5 captions each. This model makes it possible to quickly search a large database of such images for specific features, i.e. text-to-image retrieval. Our larger goal in running inference with this model is to evaluate whether model developers can quickly find images of interest in our database through some relevant search prompts.
The CLIP-RSICD model was used to perform inference on Sentinel2GlobalLULC, a dataset of georeferenced Sentinel-2 RGB imagery annotated for global land use/land cover mapping with deep learning. It contains over 190k images from 29 LULC classes, and each image is 224 x 224 pixels at 10 m spatial resolution. During inference, each image is passed to the model together with the captions corresponding to the classes of interest, and the model assigns a probability score to each caption (i.e. class) that indicates how related or similar that class is to the input image according to the model. The evaluation Colab notebook is available for details.
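To show how a CLIP-style model turns one image and several captions into per-caption probabilities, here is a minimal NumPy sketch of the scoring step only. In practice the embeddings come from the model's image and text encoders; here they are plain arrays passed in by the caller. The function name and the default `logit_scale` of 100 are assumptions, and the fine-tuned model's actual scale may differ.

```python
import numpy as np

def caption_probabilities(image_emb, text_embs, logit_scale=100.0):
    """Score captions against one image, CLIP-style.

    image_emb: (dim,) image embedding.
    text_embs: (n_captions, dim) caption embeddings.
    Returns one probability per caption; the probabilities sum to 1.
    """
    # Normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (text_embs @ image_emb)
    logits = logits - logits.max()                        # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

Because the softmax runs over the captions supplied in one call, the probabilities are relative to that caption set. The same image can score differently for a class depending on which other captions it competes with.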
Learnings and takeaways
Some learnings from using this model for the inference process:
Provide full sentences, i.e. some visual description of the object, as input captions for the image, and not just labels/classes as prompts. Sometimes an image includes features of more than one class, like a forest (major class) with a waterbody such as a river in it (minor class), but we want the model to classify it by the major class. In those cases the prompts can be framed like ‘An aerial photograph depicting a forest predominantly’.
Providing too many captions at once can confuse the model and cause it to underrate the probability of the classes most relevant to the image. As a rule of thumb, consider providing up to 10 captions at a time. Since the above dataset has 29 classes, the classes were grouped into 8 main classes, i.e. super classes, according to their high-level similarity with each other. Inference for each image was carried out in a 2-step hierarchical manner: in the first step only the super classes are input for each image, and in the second step all the classes under the super class with the highest probability from the first step are input. The class with the highest probability obtained this way is considered the most similar to the input image. The 8 super classes used were ‘Urban and built-up area’, ‘Grassland or Barren land’, ‘Snow or Lichen land’, ‘Shrubland’, ‘Wetland’, ‘Cropland’, ‘Forest’, ‘Water body’.
Understand the image pre-processing used during fine-tuning of the CLIP-RSICD model and make sure the same pre-processing is applied to images before passing them to the model for inference. Additionally, applying histogram equalization such as CLAHE (Contrast Limited Adaptive Histogram Equalization) to improve image contrast prior to the pre-processing helps improve performance.
Inferencing speed: the average GPU inference time (from loading an image to getting the probabilities on the CPU) was 83.9 ms, which is impressive.
Measure the performance of the model mainly as top-k accuracy and create a plot of k (from 1 to 29) vs top-k accuracy. Use the top 10 (or the equivalent for your case) labels for captioning and indexing every synthetic image for searching.
The ultimate way to improve model performance would be to fine-tune the CLIP model on the specific dataset and test the inference performance on a held-out test set of that dataset.
The blog attempts to cover some open tools and data that help you build a synthetic data generator for free and caption its output so you can retrieve all the data useful for your model testing. It is recommended that you build your own tools to suit your mission needs.
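The top-k accuracy curve and the top-k labels for indexing can be computed with a few lines of NumPy. This is a minimal sketch. The function names are illustrative, and it assumes you have one row of class probabilities per image along with the true class index for each image.

```python
import numpy as np

def top_k_accuracy_curve(probs, true_idx):
    """Top-k accuracy for every k from 1 to the number of classes.

    probs: (n_images, n_classes) scores; true_idx: (n_images,) true class ids.
    """
    order = np.argsort(-probs, axis=1)                    # best class first
    rank = np.argmax(order == true_idx[:, None], axis=1)  # 0-based rank of the true class
    n_classes = probs.shape[1]
    return np.array([(rank < k).mean() for k in range(1, n_classes + 1)])

def top_k_labels(prob_row, class_names, k=10):
    """Names of the k highest-scoring classes, to store as search labels."""
    idx = np.argsort(-prob_row)[:k]
    return [class_names[i] for i in idx]
```

Plotting the returned array against k = 1..29 gives the curve described above, and `top_k_labels` produces the labels to store alongside each synthetic image.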
So let the pixels fall, and spectra sing, in every byte, a new discovery spring,
For in the quest to see, to know, to share, the greatest reward is the journey, and care.