Data Compression-Aware Performance Analysis of Dask and Spark for Earth Observation Data Processing

Authors

  • Arthur G. Lalayan, Institute for Informatics and Automation Problems of NAS RA; National Polytechnic University of Armenia

DOI:

https://doi.org/10.51408/1963-0100

Keywords:

Earth observation, HPC, Spark, Dask, Distributed computing, Data compression

Abstract

High-performance computing is well suited to handling big Earth observation data, because in-memory computing frameworks allow the data to be processed in a distributed and performance-efficient way. Data compression reduces storage requirements and network transfer time and improves processing performance. This article investigates the effectiveness of widely used distributed data processing frameworks in conjunction with lossless data compression techniques, with the aim of finding the optimal compression method and processing framework for specific Earth observation workflows. To compare the performance of the in-memory Dask and Spark frameworks, the Normalized Difference Vegetation Index (NDVI) was evaluated for the territory of Armenia, using data from the Sentinel satellite and considering the supported compression methods. The experiments show that the Zstandard compression method and the Dask framework are the best choices for such workflows.
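For readers unfamiliar with the index used as the benchmark workload, NDVI is conventionally defined per pixel as (NIR - Red) / (NIR + Red), where NIR and Red are the near-infrared and red reflectance values. The following is a minimal pure-Python sketch of that formula only. It is not the article's Dask or Spark pipeline, and the function name `ndvi` and the sample reflectance values are hypothetical.

```python
def ndvi(red, nir):
    """Return per-pixel NDVI = (NIR - Red) / (NIR + Red) for two equally shaped 2-D grids."""
    result = []
    for red_row, nir_row in zip(red, nir):
        row = []
        for r, n in zip(red_row, nir_row):
            total = n + r
            # A pixel with zero signal in both bands has no defined NDVI; mark it as NaN.
            row.append((n - r) / total if total != 0 else float("nan"))
        result.append(row)
    return result


# Hypothetical reflectance values for a 2x2 tile (not taken from the article).
red_band = [[0.10, 0.20], [0.30, 0.0]]
nir_band = [[0.50, 0.20], [0.10, 0.0]]

ndvi_grid = ndvi(red_band, nir_band)
for row in ndvi_grid:
    print(row)
```

Values fall between -1 and 1, with higher values generally indicating denser green vegetation. In a distributed setting the same element-wise arithmetic would be applied to chunks of the raster in parallel, which is what makes the workload a convenient basis for comparing frameworks.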

Published

2023-05-31

How to Cite

Lalayan, A. G. (2023). Data Compression-Aware Performance Analysis of Dask and Spark for Earth Observation Data Processing. Mathematical Problems of Computer Science, 59, 35–44. https://doi.org/10.51408/1963-0100