🪣 Cloud Data Access

NASA’s migration from “on-premise” to cloud#

image src: https://asf.alaska.edu/about-asf-daac/

NASA has 12 Distributed Active Archive Centers (DAACs). Each DAAC is associated with a few sub-disciplines of Earth science, and those specialties determine which missions and data products it is in charge of. For example, LP DAAC is the land processes DAAC and is in charge of the Harmonized Landsat and Sentinel-2 (HLS) product, which is often used for land classification. Up until about 6 years ago (which is about when I started working with NASA), all NASA Earth Observation archives resided “on-premise” in data centers that the DAACs manage at their physical locations.

NASA, anticipating the exponential growth in their Earth Observation data archives, started the Earthdata Cloud initiative. Now, NASA DAACs are in the process of migrating their collections to cloud storage. Existing missions are growing their collections as well, but new missions such as NISAR and SWOT are or will be the most significant contributors to NASA’s archival volume growth.

image src: https://www.earthdata.nasa.gov/esds/esds-highlights/2022-esds-highlights

Now, high-priority and new datasets are being stored on cloud object storage.

What is cloud object storage?#

Cloud object storage stores unstructured data in a flat structure, called a bucket in AWS, where each object is identified with a unique key. The simple design of cloud object storage enables near-infinite scalability. Object storage differs from a database, which requires database management system software to store data and often has connection limits. It also differs from file storage, which keeps files in a hierarchy of directories and does not always require a network. Read more about cloud object storage and how it is different from other types of storage in the AWS docs.
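To make the "flat structure" idea concrete, here is a minimal sketch that models a bucket as a plain Python dictionary. The keys and the helper names `list_keys` and `list_folders` are made up for illustration; this is a toy model, not how S3 is implemented. It shows that "folders" are only a convention built on key prefixes.

```python
# Toy in-memory model of a bucket: a flat mapping from key to object bytes.
# All keys below are hypothetical examples.
bucket = {
    "hls/2023/tile-a.tif": b"...",
    "hls/2023/tile-b.tif": b"...",
    "hls/2024/tile-a.tif": b"...",
    "readme.txt": b"hello",
}


def list_keys(bucket, prefix=""):
    """Return every key that starts with `prefix`, sorted."""
    return sorted(key for key in bucket if key.startswith(prefix))


def list_folders(bucket, prefix="", delimiter="/"):
    """Emulate folders by grouping keys under `prefix` up to the next delimiter."""
    folders = set()
    for key in list_keys(bucket, prefix):
        rest = key[len(prefix):]
        if delimiter in rest:
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(folders)


print(list_keys(bucket, "hls/2023/"))  # the two 2023 tiles
print(list_folders(bucket, "hls/"))    # the "hls/2023/" and "hls/2024/" prefixes
```

There is no directory tree to walk here: listing a "folder" is just filtering keys by a shared prefix.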

../../_images/s3-bucket-with-objects.png

Cloud object storage is accessible over the internet. If the data is public, you can use an HTTP link to access data in cloud object storage, but more typically you will use the cloud object storage protocol, such as s3://path/to/file.text, along with some credentials to access the data. Using the s3 protocol to access the data is commonly referred to as direct access. Access over the network is critical because it means many servers can read the data in parallel, and these storage systems are designed to be highly scalable and highly available.
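The relationship between an s3:// URL and an HTTP link can be sketched as follows. The bucket name, key, and default region are hypothetical, and the resulting HTTPS URL only resolves without credentials if the object is public. The helpers `split_s3_url` and `https_url` are illustrative names, not part of any library.

```python
from urllib.parse import urlsplit


def split_s3_url(url):
    """Split 's3://bucket/key' into a (bucket, key) pair."""
    parts = urlsplit(url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    # Keys containing '?' or '#' are not handled by this simple sketch.
    return parts.netloc, parts.path.lstrip("/")


def https_url(url, region="us-west-2"):
    """Build a virtual-hosted-style HTTPS URL for the same object.

    The default region is an assumption; use the region the bucket lives in.
    """
    bucket, key = split_s3_url(url)
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


print(split_s3_url("s3://example-bucket/path/to/file.txt"))
print(https_url("s3://example-bucket/path/to/file.txt"))
```

Both forms point at the same bucket and key; what differs is the protocol and whether requests are signed with credentials.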

Popular libraries to access data on S3 are boto3 and s3fs.
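As a rough illustration of direct access with s3fs, the sketch below opens an object and reads its first bytes. The URL in the usage comment and the credential dictionary keys (`access_key_id`, `secret_access_key`, `session_token`) are assumptions made for this example; how you obtain temporary credentials depends on the data provider.

```python
def s3fs_kwargs(credentials=None):
    """Translate optional temporary credentials into s3fs.S3FileSystem arguments.

    The dictionary keys read from `credentials` are hypothetical names.
    """
    if credentials is None:
        return {"anon": True}  # public bucket: send unsigned requests
    return {
        "key": credentials["access_key_id"],
        "secret": credentials["secret_access_key"],
        "token": credentials["session_token"],
    }


def read_first_bytes(url, credentials=None, nbytes=16):
    """Open an object over the s3 protocol and return its first `nbytes` bytes."""
    import s3fs  # third-party package: pip install s3fs

    fs = s3fs.S3FileSystem(**s3fs_kwargs(credentials))
    with fs.open(url, "rb") as f:
        return f.read(nbytes)


# Hypothetical usage (needs network access and a real, readable object):
# header = read_first_bytes("s3://example-bucket/path/to/file.txt")
```

With boto3, a comparable read would go through `boto3.client("s3").get_object(Bucket=..., Key=...)`, which takes the bucket and key separately rather than a single URL.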

There are different access patterns, it can be confusing! 🤯#

Download data from a DAAC to your personal machine.
Download data from cloud storage, say using HTTP, to your personal machine.
Log in to a virtual machine in the cloud, like CryoCloud, and download data from a DAAC.
Log in to a virtual machine in the cloud and access data directly using a cloud object protocol, like s3.

../../_images/different-modes-of-access.png

Cloud vs Local Storage#

Feature	Local	Cloud
Scalability	❌ limited by physical hardware	✅ highly scalable
Accessibility	❌ access is limited to local network or complex setup for remote access	✅ accessible from anywhere with an internet connection
Collaboration	❌ sharing is hard	✅ sharing is possible with tools for access control
Data backup	❌ risk of data loss due to hardware failure or human error	✅ typically includes redundancy (read more)
Performance	✅ faster since it does not depend on any network	❌ performance depends on internet speed or proximity to the data

Takeaways

NASA datasets are still managed by DAACs, even though many datasets are moving to the cloud.
Users are encouraged to access the data directly in the cloud through services running on AWS (like CryoCloud!).