🧊 Cloud-Optimized ICESat-2#

Cloud-Optimized vs Cloud-Native#

Recall from 03-cloud-optimized-data-access.ipynb that we can make any HDF5 file cloud-optimized by restructuring it so that all the metadata is in one place and chunks are “not too big” and “not too small”. However, as users of the data rather than archivers, we don’t control how the file is generated and distributed. If we’re going to restructure the data anyway, we might as well go with something even better: a “cloud-native” format.
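As a rough sketch of that restructuring, the HDF5 `h5repack` tool can consolidate metadata into pages and rewrite chunk sizes in one pass. The file names, page size, and chunk size below are illustrative, not tuned values for ATL08:

```shell
# Sketch only: names and sizes are illustrative, not tuned values.
# -S PAGE        : paged file-space strategy, which groups metadata into pages
# -G 4194304     : 4 MiB file-space page size
# -l CHUNK=10000 : rewrite dataset chunking ("not too big, not too small")
h5repack -S PAGE -G 4194304 -l CHUNK=10000 input.h5 cloud_optimized.h5
```

The right chunk size depends on the dataset's shape and how it will be read, so in practice this takes some experimentation.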

Cloud-Native Formats#

Cloud-native formats are formats designed specifically for use in a cloud environment. This usually means that metadata and indexes are stored separately from the data itself, in a way that supports logical dataset access: because data and metadata need not live in the same object or file, readers can lazily load and query only the pieces they need. Two examples of cloud-native formats are Zarr and GeoParquet; the latter is discussed below.

Warning

Generating cloud-native formats is non-trivial.

GeoParquet#

Apache Parquet is a powerful column-oriented data format, built from the ground up to be a modern alternative to CSV files. GeoParquet is an incubating Open Geospatial Consortium (OGC) standard that adds interoperable geospatial types (Point, Line, Polygon) to Parquet.

From https://geoparquet.org/

To demonstrate one such cloud-native format, we have generated a GeoParquet store (see atl08_parquet.ipynb) for a subset of the ATL08 dataset. We will visualize it with lonboard, a highly performant geospatial vector visualization library.

See also

Resources on GeoParquet

Demo#

import geopandas as gpd
import pyarrow.parquet as pq
from pyarrow import fs
from shapely import wkb

# Read the hive-partitioned GeoParquet store anonymously from S3,
# using the year=/month= partitions to select only November 2021.
s3 = fs.S3FileSystem(region="us-west-2", anonymous=True)
dataset = pq.ParquetDataset("eodc-public/atl08_parquet/", filesystem=s3,
                            partitioning="hive",
                            filters=[("year", "=", 2021), ("month", "=", 11)])
table = dataset.read(columns=["h_canopy", "geometry"])
df = table.to_pandas()
# Geometries are stored as WKB bytes; decode them into shapely objects.
df['geometry'] = df['geometry'].apply(wkb.loads)


gdf = gpd.GeoDataFrame(df, geometry='geometry')
# The maximum h_canopy value serves as a stand-in fill (no-data) value
# here; drop those rows before plotting.
null_value = gdf['h_canopy'].max()
gdf_filtered = gdf.loc[gdf['h_canopy'] != null_value]
gdf_filtered
gdf_filtered['h_canopy'].hist()
%%time
from lonboard import Map, ScatterplotLayer
from lonboard.colormap import apply_continuous_cmap
from palettable.colorbrewer.diverging import BrBG_10

# Normalize canopy heights into [0, 1] for the continuous colormap.
min_bound = 0
max_bound = 60
h_canopy = gdf_filtered['h_canopy']
h_canopy_normalized = (h_canopy - min_bound) / (max_bound - min_bound)

# From https://developmentseed.org/lonboard/latest/api/layers/scatterplot-layer/#lonboard.ScatterplotLayer.radius_min_pixels:
# radius_min_pixels is "the minimum radius in pixels. This can be used to prevent the circle from getting too small when zoomed out."
layer = ScatterplotLayer.from_geopandas(gdf_filtered, radius_min_pixels=0.5)
layer.get_fill_color = apply_continuous_cmap(h_canopy_normalized, BrBG_10, alpha=0.7)

m = Map(layer)
m.set_view_state(zoom=2)
m

The End!#

What did you think? Have more questions? Come find me in slack (Aimee Barciauskas) or by email at aimee@ds.io.