"Spooq: A Software Library for ETL Processes in Data Lakes"
, in Master's thesis at the Institute of Business Informatics - Data & Knowledge Engineering, supervisor: o. Univ.-Prof. Dr. Michael Schrefl, under the guidance of Dr. Bernd Neumayr, 1-2021
Implementing ETL processes in data lakes is complex because of heterogeneous open-source software environments, the use of unstructured data, and the schema-on-read principle. As a result, developing data pipelines takes more effort than in traditional data warehouses, which can rely on years of established standards and best practices. This additional effort affects the duration and quality of data integration projects and can even lead to missed business opportunities. This master's thesis describes the implementation of the software library Spooq, which supports data engineers in designing ETL data pipelines in data lakes. The package is based on Apache Spark, which is included in most data lake environments, such as a local Cloudera Hadoop distribution or the cloud-based Azure HDInsight service. Spooq facilitates testing and documentation and thus enhances the quality of data pipelines. By abstracting Spark's low-level functions, the library allows data engineers to focus on business logic rather than software code. Using Spooq reduces the development effort for data pipelines.