While everyone was sheltering in place in 2020, a group of citizen scientists decided to tackle the problem of automatically detecting interesting weather patterns in Earth imagery collected by NASA satellites. The challenge was scale: we were dealing with more data than we had ever seen before, 20 years’ worth of Earth imagery collected continuously not just by NASA but also by other private and public space agencies across the world, and the archive was still growing exponentially. We wanted to build a reverse image search engine on this massive unlabelled dataset and automatically detect interesting phenomena such as hurricanes, polar vortices, and melting ice caps. NASA’s scientists had performed extensive research to solve this problem in theory, but no one had attempted to build a production-quality system to put it into practice.
SpaceML was started in collaboration with NASA’s Frontier Development Lab and Google Cloud, and is built entirely by industry professionals and student mentees around the world who donate their time.
In this presentation we will talk about how we solved the problem of applying deep learning to continuously search for interesting weather patterns in petabytes of Earth imagery. We will cover the challenges involved in continuous data processing, indexing, and running distributed search while providing a low-latency, highly available search API. We will also show how we used Google Cloud offerings such as Dataflow, Cloud Functions, and App Engine, along with PyTorch and nearest neighbor search libraries such as ScaNN, Faiss, and Annoy, to make it happen. We will detail the end-to-end self-supervised learning system that we built with an eye on cost-constrained usage of cloud resources while maintaining extensibility for other space science endeavors. Finally, we will touch upon the organizational challenges of building this system with a highly distributed team, including how we employed fast prototyping to build confidence in the system while gradually increasing the scale to petabytes of data.
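To make the idea of reverse image search concrete, here is a minimal sketch of the nearest neighbor step in plain Python. It is an illustration under stated assumptions, not the SpaceML implementation: the tile names and three-dimensional embeddings are made up, and the exhaustive cosine-similarity scan stands in for the approximate indexes that dedicated nearest neighbor libraries provide at scale. In a real pipeline the vectors would come from a trained model rather than being written by hand.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(embeddings: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    """Store one unit-length embedding per image tile."""
    return {tile_id: normalize(vec) for tile_id, vec in embeddings.items()}


def search(
    index: Dict[str, List[float]], query: Sequence[float], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k tiles most similar to the query, best match first."""
    q = normalize(query)
    scored = (
        (tile_id, sum(a * b for a, b in zip(q, vec)))
        for tile_id, vec in index.items()
    )
    # Exhaustive scan: fine for a toy example, too slow for a large archive.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical tiles with hand-written toy embeddings (assumptions for illustration).
toy_embeddings = {
    "tile_hurricane_a": [0.9, 0.1, 0.0],
    "tile_hurricane_b": [0.8, 0.2, 0.1],
    "tile_ice_cap": [0.0, 0.1, 0.9],
    "tile_clear_ocean": [0.1, 0.9, 0.1],
}

index = build_index(toy_embeddings)
# The query plays the role of the embedding of an example image.
results = search(index, [1.0, 0.1, 0.0], k=2)
for tile_id, score in results:
    print(f"{tile_id}: {score:.3f}")
```

Normalizing once at indexing time keeps each query to a single dot product per tile. The same interface, build an index and then query it for the top k, is what an approximate nearest neighbor library would replace when the number of tiles becomes too large to scan.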
We built this system with the goal of open-sourcing its components to expand the project’s applicability beyond space science. In this talk we will describe the architecture of the individual components so that you can leverage them to enable deep learning on any type of dataset in your field.