Seamlessly Migrate Your Apache Parquet Data Lake to Delta Lake

Apache Parquet is one of the most popular open source file formats in the big data world today. Being column-oriented, Apache Parquet allows for efficient data storage and retrieval, which has led many organizations over the past decade to adopt it as an essential way to store data in data lakes. Some of these companies went one step further and decided to use Apache Parquet files as 'database tables', performing CRUD operations on them. However, Apache Parquet files are just data files; without transaction logging, statistics collection, and indexing capabilities, they are not good candidates for ACID-compliant database operations. Building such tooling is a monumental task that would require a huge development team to build and maintain on its own. The result was the Apache Parquet data lake. It was a makeshift solution at best, plagued by issues such as accidental corruption of tables arising from brittle ACID compliance.

The solution came in the form of the Delta Lake format. It was designed to solve the exact problems that Apache Parquet data lakes were riddled with. Apache Parquet was adopted as the base data storage format for Delta Lake, and the missing transaction logging, statistics collection, and indexing capabilities were built in, providing the much needed ACID compliance and guarantees. Open source Delta Lake, under the Linux Foundation, has been going from strength to strength, finding broad usage in the industry.

Over time, organizations have realized significant benefits from moving their Apache Parquet data lake to Delta Lake, but it takes planning and selection of the right approach to migrate the data. There may also be situations where a business needs the Apache Parquet data lake to co-exist even after migrating the data to Delta Lake. For example, you might have an ETL pipeline that writes data to tables stored in an Apache Parquet data lake, and you need to perform a detailed impact analysis before gradually migrating the data to Delta Lake. Until then, you need to keep the Apache Parquet data lake and Delta Lake in sync. In this blog we will discuss a few such use cases and show you how to handle them.

Advantages of moving from Apache Parquet to Delta Lake

  • Any dataset composed solely of Apache Parquet files, without a transaction log tracking what has changed, leads to brittle behavior with respect to ACID transactions. Such behavior may cause inconsistent reads while existing data is being appended to or modified. If write jobs fail mid-way, they may leave partial writes behind. These inconsistencies can make stakeholders lose trust in the data in regulated environments that require reproducibility, auditing, and governance. The Delta Lake format, in contrast, is a fully ACID-compliant data storage format.
  • The time-travel feature of Delta Lake enables teams to track versions and the evolution of data sets. If there are any issues with the data, the rollback feature gives teams the ability to return to a prior version. You can then replay the data pipelines after implementing corrective measures.
  • Delta Lake, owing to its bookkeeping in the form of transaction logs, file metadata, data statistics, and clustering techniques, delivers a significant query performance improvement over an Apache Parquet based data lake.
  • Schema enforcement rejects any new columns or other schema changes that aren't compatible with your table. By setting and upholding these high standards, analysts and engineers can trust that their data has the highest levels of integrity and can reason about it with clarity, allowing them to make better business decisions. With Delta Lake, users have access to simple semantics to control the schema, including schema enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data.
    Schema evolution complements enforcement by making it easy for intended schema changes to take place automatically. Delta Lake makes it simple to automatically add new columns of rich data when those columns belong.

You can refer to this Databricks blog series to understand Delta Lake's internal functionality.

Considerations before migrating to Delta Lake

The methodology to adopt for migrating an Apache Parquet data lake to Delta Lake depends on one or more of the migration requirements documented in the matrix below.

| Methods ⇩ \ Requirements ⇨ | Full overwrite at source | Incremental with append at source | Duplicates data | Maintains data structure | Backfill data | Ease of use |
|---|---|---|---|---|---|---|
| Deep CLONE Apache Parquet | Yes | Yes | Yes | Yes | Yes | Easy |
| Shallow CLONE Apache Parquet | Yes | Yes | No | Yes | Yes | Easy |
| CONVERT TO DELTA | Yes | No | No | Yes | No | Easy |
| Auto Loader | Yes | Yes | Yes | No | Optional | Some configuration |
| Batch Apache Spark job | Custom logic | Custom logic | Yes | No | Custom logic | Custom logic |
| COPY INTO | Yes | Yes | Yes | No | Optional | Some configuration |

Table 1 – Matrix showing options for migration

Now let's discuss the migration requirements and how they affect the choice of migration methodology.

Requirements

  • Full overwrite at source: This requirement specifies that the data processing program completely refreshes the data in the source Apache Parquet data lake every time it runs, and the data needs to be completely refreshed in the target Delta Lake after the conversion has begun.
  • Incremental with append at source: This requirement specifies that the data processing program refreshes the data in the source Apache Parquet data lake by using UPSERT (INSERT, UPDATE or DELETE) every time it runs, and the data needs to be incrementally refreshed in the target Delta Lake after the conversion has begun.
  • Duplicates data: This requirement specifies that data is written to a new location when moving from the Apache Parquet data lake to Delta Lake. If data duplication is not preferred and there is no impact on existing applications, then the Apache Parquet data lake is converted to Delta Lake in place.
  • Maintains data structure: This requirement specifies whether the data partitioning strategy at the source is maintained during conversion.
  • Backfill data: Data backfilling involves filling in missing or outdated data from the past on a new system, or updating old records. This process is typically done after a data anomaly or quality issue has resulted in incorrect data being entered into the data warehouse. In the context of this blog, the 'backfill data' requirement specifies the functionality that supports backfilling data added to the conversion source after the conversion has begun.
  • Ease of use: This requirement specifies the level of user effort needed to configure and run the data conversion.

Methodologies in Detail

Deep CLONE Apache Parquet

You can use Databricks deep clone functionality to incrementally convert data from the Apache Parquet data lake to Delta Lake. Use this approach when all of the criteria below are satisfied:

  • you need to either completely refresh or incrementally refresh the target Delta Lake table from a source Apache Parquet table
  • an in-place upgrade to Delta Lake is not possible
  • data duplication (maintaining multiple copies) is acceptable
  • the target schema needs to match the source schema
  • you have a need for data backfill. In this context, it means additional data may arrive in the source table in the future. Through a subsequent Deep Clone operation, such new data gets copied into and synchronized with the target Delta Lake table.
Fig 1: Deep CLONE Apache Parquet table
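
As a minimal sketch of what this could look like in a Databricks notebook: the source path and target table name below are hypothetical placeholders, and `spark` is assumed to be the notebook's SparkSession.

```python
# Minimal sketch: deep clone a Parquet directory into a Delta table.
# The source path and target table name are hypothetical placeholders.
source_path = "/mnt/datalake/sales_parquet"   # existing Apache Parquet data
target_table = "sales_delta"                  # Delta Lake table to create/refresh

spark.sql(f"""
  CREATE OR REPLACE TABLE {target_table}
  DEEP CLONE parquet.`{source_path}`
""")

# Re-running the same statement later picks up files added to the source,
# keeping the Delta table synchronized (the 'backfill' scenario above).
```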

Shallow CLONE Apache Parquet

You can use Databricks shallow clone functionality to incrementally convert data from the Apache Parquet data lake to Delta Lake when you:

  • want to either completely refresh or incrementally refresh the target Delta Lake table from a source Apache Parquet table
  • don't want the data to be duplicated (or copied)
  • want the same schema between the source and target
  • also have a need for data backfilling. It means additional data may arrive on the source side in the future. Through a subsequent Shallow Clone operation, such new data gets recognized (but not copied) in the target Delta Lake table.
Fig 2: Shallow CLONE Apache Parquet table
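
A shallow clone looks almost identical in a notebook; only Delta metadata is created in the target, while the data files stay where they are. Again, the path and table name below are hypothetical.

```python
# Minimal sketch: shallow clone a Parquet directory into a Delta table.
# Only Delta metadata is created; the underlying Parquet files are not copied.
source_path = "/mnt/datalake/sales_parquet"   # hypothetical source path
target_table = "sales_delta_shallow"          # hypothetical target table

spark.sql(f"""
  CREATE OR REPLACE TABLE {target_table}
  SHALLOW CLONE parquet.`{source_path}`
""")
```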

CONVERT TO DELTA

You can use the Convert to Delta Lake feature if you have requirements for:

  • only a full refresh (and not an incremental refresh) of the target Delta Lake table
  • no multiple copies of the data, i.e. the data needs to be converted in place
  • the source and target tables to have the same schema
  • no backfill of data. In this context, it means that data written to the source directory after the conversion has started may not be reflected in the resulting target Delta table.

Since the source is transformed into a target Delta Lake table in place, all future CRUD operations on the target table need to happen through Delta Lake ACID transactions.

Note – Please refer to the caveats before using the CONVERT TO DELTA option. You should avoid updating or appending data files during the conversion process. After the table is converted, make sure all writes go through Delta Lake.

Fig 3: Convert to Delta
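
For illustration, an in-place conversion might look like the sketch below, assuming a hypothetical partitioned Parquet directory at /mnt/datalake/events_parquet with a date partition column.

```python
# Minimal sketch: convert a partitioned Parquet directory to Delta Lake in place.
# The path and partition column are hypothetical placeholders.
spark.sql("""
  CONVERT TO DELTA parquet.`/mnt/datalake/events_parquet`
  PARTITIONED BY (event_date DATE)
""")

# From this point on, read and write the data as a Delta table, e.g.:
df = spark.read.format("delta").load("/mnt/datalake/events_parquet")
```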

Auto Loader

You can use Auto Loader to incrementally copy all data from a given cloud storage directory to a target Delta table. This approach can be used in the circumstances below (see the sketch after the list):

  • you have requirements for either a full refresh or an incremental refresh of the Delta Lake table from Apache Parquet files stored in cloud object storage
  • an in-place upgrade to a Delta Lake table is not possible
  • data duplication (multiple copies of files) is allowed
  • maintaining the data structure (schema) between source and target after the migration is not a requirement
  • you don't have a specific need for data backfilling, but still want it as an option if the need arises in the future
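
A minimal Auto Loader sketch, assuming a Databricks notebook and hypothetical source path, checkpoint location, and target table name:

```python
# Minimal sketch: incrementally load Parquet files into a Delta table with Auto Loader.
# All paths and the table name are hypothetical placeholders.
source_path = "s3://my-bucket/datalake/events_parquet/"
checkpoint_path = "s3://my-bucket/_checkpoints/events_delta/"

(spark.readStream
    .format("cloudFiles")                                 # Auto Loader source
    .option("cloudFiles.format", "parquet")               # source files are Parquet
    .option("cloudFiles.schemaLocation", checkpoint_path) # where inferred schema is tracked
    .load(source_path)
  .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                           # process available files, then stop
    .toTable("events_delta"))
```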

COPY INTO

You can use the COPY INTO SQL command to incrementally copy all data from a given cloud storage directory to a target Delta table. This approach can be used in the circumstances below (see the sketch after the list):

  • you have requirements for either a full refresh or an incremental refresh of the Delta Lake table from Apache Parquet files stored in cloud object storage
  • an in-place upgrade to a Delta Lake table is not possible
  • data duplication (multiple copies of files) is allowed
  • adhering to the same schema between source and target after migration is not a requirement
  • you don't have a specific need for data backfill
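
A minimal COPY INTO sketch, again with a hypothetical source path and an existing Delta table name:

```python
# Minimal sketch: idempotently load Parquet files into an existing Delta table with COPY INTO.
# The table name and source path are hypothetical placeholders.
spark.sql("""
  COPY INTO events_delta
  FROM 's3://my-bucket/datalake/events_parquet/'
  FILEFORMAT = PARQUET
  COPY_OPTIONS ('mergeSchema' = 'true')
""")
# Re-running the command only loads files that have not been ingested yet.
```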

Both Auto Loader and COPY INTO give users plenty of options to configure the data movement process. Refer to this link when you need to decide between COPY INTO and Auto Loader.

Batch Apache Spark job

Finally, you can use custom Apache Spark logic to migrate to Delta Lake. It gives you great flexibility in controlling how and when different data from your source system is migrated, but may require extensive configuration and customization to provide capabilities already built into the other methodologies discussed here.

To perform backfills or incremental migration, you might be able to rely on the partitioning structure of your data source, but you might also need to write custom logic to track which files have been added since you last loaded data from the source. While you can use Delta Lake merge capabilities to avoid writing duplicate records, comparing all records from a large Parquet source table to the contents of a large Delta table is a complex and computationally expensive task.
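
As a simple illustration, a batch job that rewrites one partition at a time might look like the sketch below; the paths, table layout, and partition column are hypothetical, and real pipelines usually add their own bookkeeping of what has already been migrated.

```python
# Minimal sketch: batch-migrate selected Parquet partitions into a Delta table.
# Paths and the partition column are hypothetical placeholders.
source_path = "/mnt/datalake/events_parquet"
target_path = "/mnt/deltalake/events_delta"

for event_date in ["2023-01-01", "2023-01-02"]:   # partitions to migrate or backfill
    df = (spark.read
            .format("parquet")
            .load(source_path)                     # partition discovery adds event_date column
            .filter(f"event_date = '{event_date}'"))
    (df.write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", f"event_date = '{event_date}'")  # overwrite only this partition
        .partitionBy("event_date")
        .save(target_path))
```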

Refer to this link for more information on the methodologies for migrating your Apache Parquet data lake to Delta Lake.

Conclusion

In this blog, we have described various options for migrating your Apache Parquet data lake to Delta Lake and discussed how you can determine the right methodology based on your requirements. To learn more about the Apache Parquet to Delta Lake migration and how to get started, please visit the guides (AWS, Azure, GCP). In those notebooks we have provided a few examples for you to get started and try different options for migration. It is also always advisable to follow optimization best practices on Databricks after you migrate to Delta Lake.
