How upsertable data lakes can simplify the data lake journey

Author: Eric Valenzuela

Migrating data from traditional SQL databases to a data lake presents multiple advantages for organizations.

Data lakes offer the flexibility and scalability to manage growing volumes of data. They enable organizations to take advantage of disconnected, structured and unstructured data streams such as customer data, IoT sensor readings, and clickstreams. Organizations can more cost-effectively store unused data and leverage it down the road for advanced analytics as opportunities arise and as new technologies become available. Furthermore, because data lakes separate storage from compute, each can be scaled independently of the other.

At the same time, migrating to a data lake architecture can be daunting for organizations whose database administrators (DBAs) are accustomed to SQL databases and transactions characterized by atomicity, consistency, isolation, and durability (ACID). Because data lakes are not ACID-compliant, simple DBA tasks such as updating or deleting records are not as straightforward, since data lake files cannot be changed or updated directly. Instead, these tasks require complex scripting and file handling.

The good news is that traditional DBAs can perform these operations with a similar level of skill and effort using what we’re calling “upsertable data lakes”—essentially ACID-compliant, modern table formats that combine the best features of data lakes and data warehouses. In this article, we’ll explore how upsertable data lakes such as Delta Lake and Hudi can ease your organization’s transition to a data lake architecture, all while leveraging the experience and knowledge of your existing DBAs.

Some challenges of data lakes

Managing a data lake is different from managing a relational database and requires different skillsets. Data lakes require data validations in multiple places, and preventing failures necessitates the handling of files and complex scripting that is unfamiliar to traditional, SQL-centric DBAs.

Then there are the challenges of cleansing or deleting data to comply with data retention and privacy laws such as GDPR. In a relational database, these tasks can be accomplished easily through SQL commands. But in a data lake, they can be a complex endeavor, requiring skillsets that fall outside a traditional DBA’s wheelhouse. And because files are immutable, every change means rewriting files, so the data can be left in an inconsistent state partway through a change if the operation is not ACID-compliant.
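To make the file handling concrete, here is a minimal sketch of a privacy erasure against a plain, file-based data lake. It is an illustration under stated assumptions, not a production script: the JSON Lines layout, the file names, the `customer_id` field, and the `erase_customer` helper are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def erase_customer(lake_dir: Path, customer_id: int) -> int:
    """Rewrite every data file that holds the customer; return rows removed."""
    removed = 0
    for data_file in sorted(lake_dir.glob("*.jsonl")):
        rows = [json.loads(line) for line in data_file.read_text().splitlines() if line]
        kept = [row for row in rows if row["customer_id"] != customer_id]
        if len(kept) == len(rows):
            continue  # nothing to erase in this file
        # Files are treated as immutable: write a replacement, then swap it in.
        tmp = data_file.with_suffix(".tmp")
        tmp.write_text("".join(json.dumps(row) + "\n" for row in kept))
        os.replace(tmp, data_file)
        removed += len(rows) - len(kept)
    return removed


# Demo with a throwaway directory and made-up records (hypothetical data).
lake = Path(tempfile.mkdtemp())
(lake / "part-0.jsonl").write_text(
    '{"customer_id": 1, "page": "/home"}\n{"customer_id": 2, "page": "/cart"}\n'
)
(lake / "part-1.jsonl").write_text('{"customer_id": 1, "page": "/checkout"}\n')
rows_removed = erase_customer(lake, customer_id=1)
print(rows_removed)
```

Each file is swapped individually, so a failure between two files would leave the customer erased from some files but not others. That gap is what the ACID guarantees discussed in the next section are meant to close.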

How upsertable data lakes reduce complexity

Because they fuse the best of data lakes and data warehouses, upsertable data lakes such as Delta Lake and Hudi provide a more familiar interface for DBAs. Upsertable data lakes take what DBAs love about data warehouses and apply it to a data lake setting, which can make the transition to a data lake architecture easier. Thanks to a rapidly closing feature gap between upsertable data lakes and relational databases, the move to a data lake is a much less foreign journey.

To illustrate this feature familiarity, let’s say a DBA needs to delete or update a record. Performing this simple task in a traditional data lake can require the creation of complex scripts. But with an upsertable data lake, it’s a simple matter of running a delete statement or an update statement, as would happen in a relational database.
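As a reminder of what that familiar interface looks like, the sketch below runs an update and a delete as plain SQL statements. SQLite is used here only as a self-contained stand-in for the relational side; the `customers` table and its rows are made up. Delta Lake and Hudi accept `UPDATE` and `DELETE FROM` statements of the same general shape through Spark SQL, though the exact syntax and setup depend on the engine and version, so treat this as an analogy rather than Delta Lake or Hudi code.

```python
import sqlite3

# In-memory database with a hypothetical customers table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers (id, email) VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (3, "c@example.com")],
)

# One transaction: both statements are committed together or not at all.
with conn:
    conn.execute("UPDATE customers SET email = ? WHERE id = ?", ("new@example.com", 2))
    conn.execute("DELETE FROM customers WHERE id = ?", (3,))

remaining = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
print(remaining)
```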

Furthermore, due to the immutable nature of data lake files, changing a record requires creating a new file that contains the change and deleting the old one. During this process, the data is often in an unstable state. This is where an upsertable data lake comes in like a hero, bringing ACID compliance to transactions to guarantee data validity. An upsertable data lake also abstracts all the file handling under the hood, which means DBAs no longer have to think about it. From the reader's point of view, the data is never in an unstable state.
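One way to picture how a change can become visible all at once is a manifest that points at the live data file and is swapped in a single atomic step. The sketch below is a toy under stated assumptions, not how Delta Lake or Hudi are implemented: those formats use their own, more elaborate metadata and concurrency control. The `manifest.json` layout and the `commit` and `current_rows` helpers are hypothetical, and the atomicity of `os.replace` is assumed to hold on the local file system being used.

```python
import json
import os
import tempfile
from pathlib import Path


def current_rows(table_dir: Path) -> list:
    """Read the table through the manifest, which names the live data file."""
    manifest = json.loads((table_dir / "manifest.json").read_text())
    return json.loads((table_dir / manifest["data_file"]).read_text())


def commit(table_dir: Path, rows: list, version: int) -> None:
    """Write a new immutable data file, then publish it with one rename."""
    data_name = f"data-v{version}.json"
    (table_dir / data_name).write_text(json.dumps(rows))
    tmp_manifest = table_dir / "manifest.json.tmp"
    tmp_manifest.write_text(json.dumps({"version": version, "data_file": data_name}))
    # The commit point: readers see either the old manifest or the new one.
    os.replace(tmp_manifest, table_dir / "manifest.json")


# Demo with a throwaway directory and made-up rows (hypothetical data).
table = Path(tempfile.mkdtemp())
commit(table, [{"id": 1, "status": "active"}, {"id": 2, "status": "active"}], version=1)

# "Update" record 2 by writing a whole new version of the data file.
updated = [dict(r, status="closed") if r["id"] == 2 else r for r in current_rows(table)]
commit(table, updated, version=2)
rows_after = current_rows(table)
print(rows_after)
```

The old data file is left in place rather than overwritten, so a reader that opened version 1 before the swap still sees a complete, consistent table.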

Contact Beyondsoft for a data architecture health check

Where are you on your data lake journey? Beyondsoft has performed hundreds of data migrations and big data projects for large enterprise customers. Our certified practitioners have hands-on, best-practice knowledge of all the major platforms, including AWS and Azure.

We can partner with you to analyze your data architecture and business requirements to help you determine if an upsertable data lake is the right fit and determine the best migration strategy. To learn more about how Beyondsoft can help you with your data lake journey, contact us today.

How we do it

Our track record over the years is a testament to our focus on driving your return on investment. Our global head office is in Singapore, and we have 15 regional offices around the world.

Nearly 3 decades of strong IT consulting and services

40+ global delivery networks across four continents

Certifications* in CMMI 5, ISO 9001, ISO 45001, and ISO 27001

~30,000 global experts

Microsoft Azure Expert MSP

*ISO 9001 and 45001 (certificates issued to Beyondsoft International (Singapore) Pte Ltd). ISO 27001 (certificates issued to Beyondsoft International (Singapore) Pte Ltd, Beyondsoft (Malaysia) Sdn. Bhd., and Beyondsoft Consulting Inc., Bellevue, WA, USA)