From Hive Tables to Iceberg Tables: Trouble-Free


Introduction

For more than a decade now, the Hive table format has been a ubiquitous presence in the big data ecosystem, managing petabytes of data with remarkable efficiency and scale. But as data volumes, data variety, and data usage grow, users face many challenges with Hive tables because of its antiquated directory-based table format. Some of the common issues include constrained schema evolution, static partitioning of data, and long planning times caused by S3 directory listings.

Apache Iceberg is a modern table format that not only addresses these problems but also offers additional features like time travel, partition evolution, table versioning, schema evolution, strong consistency guarantees, object store file layout (the ability to distribute the files of a single logical partition across many prefixes to avoid object store throttling), hidden partitioning (users don’t need to be intimately aware of partitioning), and more. Therefore, the Apache Iceberg table format is poised to replace the traditional Hive table format in the coming years.

However, as there are already 25 million terabytes of data stored in the Hive table format, migrating existing tables in the Hive table format to the Iceberg table format is necessary for performance and cost. Depending on the size and usage patterns of the data, several different strategies can be pursued to achieve a successful migration. In this blog, I will describe a few strategies one could undertake for various use cases. While these instructions are carried out for Cloudera Data Platform (CDP), Cloudera Data Engineering, and Cloudera Data Warehouse, one can extrapolate them easily to other services and other use cases as well.

There are a few scenarios that one might encounter. One or more of these use cases might fit your workload, and you might be able to mix and match the potential solutions provided to suit your needs. They are meant to be a general guide. In all the use cases we are trying to migrate a table named “events.”

Approach 1

You have the ability to stop your clients from writing to the respective Hive table for the duration of your migration. This is ideal because it might mean that you don’t have to alter any of your client code. Sometimes this is the only choice available if you have hundreds of clients that can potentially write to a table. It could be much easier to simply stop all those jobs rather than allowing them to continue during the migration process.

In-place table migration

Solution 1A: using Spark’s migrate procedure

Iceberg’s Spark extensions provide a built-in procedure called “migrate” to migrate an existing table from the Hive table format to the Iceberg table format. They also provide a “snapshot” procedure that creates an Iceberg table with a different name backed by the same underlying data. You could first create a snapshot table, run sanity checks on it, and ensure that everything is in order.

Once you are satisfied you can drop the snapshot table and proceed with the migration using the migrate procedure. Keep in mind that the migrate procedure creates a backup table named “events__BACKUP__.” As of this writing, the “__BACKUP__” suffix is hardcoded. There is an effort underway to let the user pass a custom backup suffix in the future.
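As a rough illustration, both procedures can be invoked through Spark SQL along the following lines (the catalog and database names here, spark_catalog and db, are placeholders that depend on your Spark catalog configuration):

    -- Create a side-by-side Iceberg copy of the Hive table for validation
    CALL spark_catalog.system.snapshot(
      source_table => 'db.events',
      table => 'db.events_snapshot'
    );

    -- Once the sanity checks pass, drop the snapshot and migrate in place
    DROP TABLE db.events_snapshot;
    CALL spark_catalog.system.migrate('db.events');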

Keep in mind that both the migrate and snapshot procedures do not modify the underlying data: they perform an in-place migration. They simply read the underlying data (not even a full read, they just read the parquet headers) and create the corresponding Iceberg metadata files. Since the underlying data files are not changed, you may not be able to take full advantage of the benefits offered by Iceberg right away. You can optimize your table now or at a later stage using the “rewrite_data_files” procedure, which will be discussed in a later blog. Now let’s discuss the pros and cons of this approach.

PROS:

  • Can do the migration in stages: first do the migration and then carry out the optimization later using the rewrite_data_files procedure (blog to follow).
  • Relatively fast as the underlying data files are kept in place. You don’t have to worry about creating a temporary table and swapping it later. The procedure does that for you atomically once the migration is finished.
  • Since a Hive backup is available, one can revert the change entirely by dropping the newly created Iceberg table and renaming the Hive backup table (__BACKUP__) to its original name.

CONS:

  • If the underlying data is not optimized, or has a lot of small files, those disadvantages can be carried forward to the Iceberg table as well. Query engines (Impala, Hive, Spark) might mitigate some of these problems by using Iceberg’s metadata files, but the underlying data file locations will not change. So if the prefixes of the file paths are common across multiple files, we may continue to suffer from S3 throttling (see Object Store File Layout for how to configure it properly, and the sketch below).
  • In CDP we only support migrating external tables. Hive managed tables cannot be migrated.
  • The underlying file format of the table has to be one of avro, orc, or parquet.
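For reference, Iceberg’s Object Store File Layout is switched on through table properties; a minimal sketch might look like the following (this only affects files written after the property is set, so files migrated in place keep their original paths):

    -- Spread newly written data files across randomized prefixes to avoid S3 throttling
    ALTER TABLE events SET TBLPROPERTIES (
      'write.object-storage.enabled' = 'true'
    );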

Note: There is also a SparkAction in the Java API.

Solution 1B: using Hive’s “ALTER TABLE” command

Cloudera implemented an easy way to do the migration in Hive. All you have to do is alter the table properties to set the storage handler to “HiveIcebergStorageHandler.”
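In other words, the Hive in-place migration boils down to a single statement along these lines:

    ALTER TABLE events
    SET TBLPROPERTIES (
      'storage_handler' = 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
    );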

The pros and cons of this approach are essentially the same as Solution 1A. The migration is done in place and the underlying data files are not modified. Hive creates Iceberg’s metadata files for the same exact table.

Shadow table migration

Solution 1C: using the CTAS statement

This solution is the most generic and it could potentially be used with any processing engine (Spark/Hive/Impala) that supports SQL-like syntax.
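A sketch of the shadow-table CTAS in Spark SQL is shown below; the event_ts partition column is hypothetical, and in Hive or Impala you would use their STORED BY ICEBERG / STORED AS ICEBERG syntax instead:

    CREATE TABLE iceberg_events
    USING iceberg
    PARTITIONED BY (days(event_ts))                      -- hypothetical partition column
    LOCATION 's3://my-bucket/warehouse/iceberg_events'   -- explicit location; see the rename discussion below
    AS SELECT * FROM events;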

You can run basic sanity checks on the data to see if the newly created table is sound.

Once you are satisfied with your sanity checking, you could rename your “events” table to a “backup_events” table and then rename your “iceberg_events” table to “events.” Keep in mind that in some cases the rename operation can trigger a rename of the underlying data directory. If that is the case and your underlying data store is an object store like S3, it will trigger a full copy of your data and could be very expensive. If the location clause is specified when creating the Iceberg table, then renaming the Iceberg table will not cause the underlying data files to move; the name will change only in the Hive metastore. The same applies to Hive tables as well. If your original Hive table was not created with the location clause specified, then the rename to backup will trigger a directory rename. In that case, if your filesystem is object store based, it might be best to drop it altogether. Given the nuances around table renames, it is important to test with dummy tables in your system and check that you are seeing your desired behavior before you perform these operations on critical tables.
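Assuming the location caveats above are addressed, the swap itself is just two metadata renames:

    ALTER TABLE events RENAME TO backup_events;
    ALTER TABLE iceberg_events RENAME TO events;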

You can drop your “backup_events” table if you wish.

Your clients can now resume their read/write operations on the “events” table, and they don’t even need to know that the underlying table format has changed. Now let’s discuss the pros and cons of this approach.

PROS:

  • The newly created data is well optimized for Iceberg and will be distributed well.
  • Any existing small files will be coalesced automatically.
  • Common procedure across all the engines.
  • The newly created data files can take advantage of Iceberg’s Object Store File Layout, so that the file paths have different prefixes, thus reducing object store throttling. Please see the linked documentation for how to utilize this feature.
  • This approach is not necessarily limited to migrating a Hive table. The same approach can be used to migrate tables available in any processing engine, like Delta, Hudi, etc.
  • You can change the data format, say from “orc” to “parquet.”

CONS:

  • This will trigger a full read and write of the data, and it could be an expensive operation.
  • Your entire data set will be duplicated. You need to have sufficient storage space available. This shouldn’t be a problem in a public cloud backed by an object store.

Approach 2

You don’t have the luxury of a long downtime to do your migration. You want to let your clients or jobs continue writing data to the table. This requires some planning and testing, but is possible with some caveats. Here is one way you can do it with Spark. You can potentially extrapolate the ideas presented to other engines.

  1. Create an Iceberg table with the desired properties. Keep in mind that you have to keep the partitioning scheme the same for this to work correctly, as sketched below.
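For example, if the Hive table were partitioned by a hypothetical event_date string column, the new table might be declared like this in Spark SQL:

    CREATE TABLE iceberg_events (
      id         BIGINT,
      payload    STRING,
      event_date STRING          -- same partition column as the Hive table
    )
    USING iceberg
    PARTITIONED BY (event_date);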

 

  2. Modify your clients or jobs to write to both tables, so that they write to the “iceberg_events” table and the “events” table, but for now only read from the “events” table. Capture the timestamp from which your clients started writing to both of the tables.
  3. Programmatically list all the files in the Hive table that were inserted before the timestamp you captured in step 2.
  4. Add all the files captured in step 3 to the Iceberg table using the “add_files” procedure, which simply adds the files to your Iceberg table. You also might be able to take advantage of your table’s partitioning scheme to skip step 3 entirely and add the files to your newly created Iceberg table with the “add_files” procedure, as sketched below.
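A sketch of the add_files call, reusing the hypothetical event_date partitioning from step 1 (the partition_filter argument restricts the call to the partitions written before your cutoff):

    CALL spark_catalog.system.add_files(
      table => 'db.iceberg_events',
      source_table => 'db.events',
      partition_filter => map('event_date', '2023-01-01')
    );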

  5. If you don’t have access to Spark, you might simply read each of the files listed in step 3 and insert them into “iceberg_events” (see the sketch after this list).
  6. Once you successfully add all the data files, you can stop your clients from reading/writing to the old “events” table and use the new “iceberg_events” table.
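Rather than going file by file, one hedged sketch of the step 5 fallback uses a filtered INSERT … SELECT from the old table in Hive or Impala, assuming a hypothetical event_ts column and the cutoff captured in step 2:

    INSERT INTO iceberg_events
    SELECT * FROM events
    WHERE event_ts < TIMESTAMP '2023-06-01 00:00:00';   -- cutoff captured in step 2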

Some caveats and notes

  • In step 2, you can control which tables your clients/jobs write to using a flag that can be fetched from an external source like an environment variable, a database (like Redis) pointer, a properties file, etc. That way you only need to modify your client/job code once and don’t need to keep modifying it for each step.
  • In step 2, you are capturing a timestamp that will be used to determine the files needed for step 3; this could be affected by clock drift on your nodes. So you might want to sync all your nodes before you start the migration process.
  • If your table is partitioned by date and time (as most real-world data is partitioned), meaning all new data always lands in a new partition, then you might program your clients to start writing to both tables from a specific date and time. That way you just need to worry about adding the data from the old table (“events”) to the new table (“iceberg_events”) from that date and time, and you can take advantage of your partitioning scheme and skip step 3 entirely. This is the approach that should be used whenever possible.

Conclusion

Any large migration is tough and needs to be thought through carefully. Thankfully, as discussed above, there are multiple strategies at our disposal to do it effectively depending on your use case. If you have the ability to stop all your jobs while the migration is happening, it is relatively straightforward, but if you want to migrate with minimal to no downtime, then that requires some planning and careful thinking about your data layout. You can use a combination of the above approaches to best suit your needs.

To learn more:

  1. For more on table migration, please refer to the respective online documentation for Cloudera Data Warehouse (CDW) and Cloudera Data Engineering (CDE).
  2. Watch our webinar Supercharge Your Analytics with Open Data Lakehouse Powered by Apache Iceberg. It includes a live demo recording of Iceberg capabilities.
  3. Try Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML) by signing up for a 60-day trial, or test drive CDP. You can also schedule a demo by clicking here, or if you are interested in chatting about Apache Iceberg in CDP, contact your account team.
