How Cargotec uses metadata replication to enable cross-account data sharing



This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland.

Cargotec (Nasdaq Helsinki: CGCBV) is a Finnish company that specializes in cargo handling solutions and services. They are headquartered in Helsinki, Finland, and operate globally in over 100 countries. With their leading cargo handling solutions and services, they are pioneers in their field. Through their unique position in ports, at sea, and on roads, they optimize global cargo flows and create sustainable customer value.

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. For this, Cargotec built an Amazon Simple Storage Service (Amazon S3) data lake and cataloged the data assets in the AWS Glue Data Catalog. They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed.

In this blog, we discuss the technical challenges faced by Cargotec in replicating their AWS Glue metadata across AWS accounts, and how they navigated these challenges successfully to enable cross-account data sharing. By sharing their story, we hope to inspire readers facing similar challenges and provide insights into how our services can be customized to meet your specific needs.


Challenges

Like many customers, Cargotec's data lake is distributed across multiple AWS accounts that are owned by different teams. Cargotec wanted to find a solution to share datasets across accounts and use Amazon Athena to query them. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views. Cargotec's use cases also required them to create views that span tables and views across catalogs. Cargotec's implementation covers three discrete AWS accounts, 25 databases, 150 tables, and 10 views.

Solution overview

Cargotec required a single catalog per account that contained metadata from their other AWS accounts. The solution that best fit their needs was to replicate metadata using an in-house version of a publicly available utility called Metastore Migration utility. Cargotec extended the utility by changing the overall orchestration layer, adding an Amazon SNS notification and an AWS Lambda function. The approach was to programmatically copy and make available each catalog entity (databases, tables, and views) to all consumer accounts. This makes the tables and views local to the account where the query is run, while the data still remains in its source S3 bucket.

Cargotec’s solution architecture

The following diagram summarizes the architecture and overall flow of events in Cargotec’s design.

Solution Architecture

Catalog entries from a source account are programmatically replicated to multiple target accounts using the following sequence of steps.

  1. An AWS Glue job (metadata exporter) runs daily on the source account. It reads the table and partition information from the source AWS Glue Data Catalog. Because the target account is used for analytical purposes and doesn’t require real-time schema changes, the metadata exporter runs only once a day. Cargotec uses partition projection, which ensures that the new partitions are available in real time.
  2. The job then writes the metadata to an S3 bucket in the same account. Note that the solution doesn’t involve movement of the data across accounts. The target accounts read data from the source account S3 buckets. For guidance on setting up the right permissions, see the Amazon Athena User Guide.
  3. After the metadata export has been completed, the AWS Glue job pushes a notification to an Amazon Simple Notification Service (Amazon SNS) topic. This message contains the S3 path to the latest metadata export. The SNS notification is Cargotec’s customization to the existing open-source utility.
  4. Every target account runs an AWS Lambda function that is notified when the source account SNS topic receives a push. In short, there are multiple subscriber Lambda functions (one per target account) for the source account SNS topics that get triggered when an export job is completed.
  5. Once triggered, the Lambda function initiates an AWS Glue job (metadata importer) on the respective target account. The job receives as input the source account’s S3 path to the metadata that has been recently exported.
  6. Based on the path provided, the metadata importer reads the exported metadata from the source S3 bucket.
  7. The metadata importer then uses this information to create or update the corresponding catalog information in the target account.

Along the way, any errors are published to a separate SNS topic for logging and monitoring purposes. With this approach, Cargotec was able to create and consume views that span tables and views from multiple catalogs spread across different AWS accounts.
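The notification customization in step 3 is not part of the open-source utility, and the blog does not list its code. The following is a minimal sketch of what the SNS publish could look like with the AWS SDK for Python (Boto3); the JSON message shape and helper names are our own assumptions, not Cargotec's actual implementation.

```python
import json


def build_export_message(bucket: str, prefix: str) -> str:
    """Build the JSON payload announcing where the latest metadata
    export lives. The message shape here is an assumption."""
    return json.dumps({"export_path": f"s3://{bucket}/{prefix}"})


def notify_export_complete(topic_arn: str, bucket: str, prefix: str) -> None:
    """Publish the export location to the SNS topic that the
    target-account Lambda functions subscribe to."""
    import boto3  # imported lazily so the message helper has no AWS dependency

    sns = boto3.client("sns")
    sns.publish(
        TopicArn=topic_arn,
        Subject="Glue metadata export completed",
        Message=build_export_message(bucket, prefix),
    )
```

The exporter job would call `notify_export_complete` as its final step, after the metadata files have been written to S3.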


The core of the catalog replication utility is two AWS Glue scripts:

  • Metadata exporter – An AWS Glue job that reads the source data catalog and creates an export of the databases, tables, and partitions in an S3 bucket in the source account.
  • Metadata importer – An AWS Glue job that reads the export that was created by the metadata exporter and applies the metadata to target databases. This code is triggered by a Lambda function once files are written to S3. The job runs in the target account.

Metadata exporter

This section provides details on the AWS Glue job that exports the AWS Glue Data Catalog into an S3 location. The source code for the application is hosted on the AWS Glue GitHub. Though this may need to be customized to suit your needs, we will go over the core components of the code in this blog.

Metadata exporter inputs

The application takes a few job input parameters as described below:

  • --mode key accepts either to-s3 or to-jdbc. The latter is used when the code is moving the metadata directly into a JDBC Hive Metastore. In the case of Cargotec, since the metadata is being moved to files on S3, the value for --mode remains to-s3.
  • --output-path accepts an S3 location to which the exported metadata should be written. The code creates subdirectories corresponding to databases, tables, and partitions.
  • --database-names accepts a semicolon-separated list of databases on the source catalog that need to be replicated to the target.
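To illustrate how these parameters fit together, the following sketch starts the exporter with Boto3. The job name and helper names are placeholders, not Cargotec's actual configuration; AWS Glue passes the `Arguments` map to the script as `--key value` pairs.

```python
def exporter_job_arguments(output_path: str, database_names: list) -> dict:
    """Assemble the exporter's job arguments described above."""
    return {
        "--mode": "to-s3",
        "--output-path": output_path,
        "--database-names": ";".join(database_names),
    }


def start_metadata_exporter(job_name: str, output_path: str, database_names: list) -> str:
    """Kick off a run of the exporter job; the job name is a placeholder."""
    import boto3  # imported lazily so the argument helper has no AWS dependency

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,
        Arguments=exporter_job_arguments(output_path, database_names),
    )
    return response["JobRunId"]
```

In Cargotec's setup the job runs on a daily schedule rather than being started manually, but the arguments are the same.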

Reading the catalog

The metadata about the databases, tables, and partitions is read from the AWS Glue catalog.

dyf = glue_context.create_dynamic_frame.from_options(
            connection_type=DATACATALOG_CONNECTION_TYPE,  # defined in the utility's source on GitHub
            connection_options={
                            'catalog.name': 'datacatalog',
                            'catalog.database': database,
                            'catalog.region': region
            })

The above code snippet reads the metadata into an AWS Glue DynamicFrame. The frame is then converted to a Spark DataFrame. It is filtered into individual DataFrames based on whether it is part of a database, table, or partition. A schema is attached to the data frame using one of the following:

        DATACATALOG_DATABASE_SCHEMA = StructType([
            StructField('items', ArrayType(DATACATALOG_DATABASE_ITEM_SCHEMA, False), True),
            StructField('type', StringType(), False)
        ])

        DATACATALOG_TABLE_SCHEMA = StructType([
            StructField('database', StringType(), False),
            StructField('type', StringType(), False),
            StructField('items', ArrayType(DATACATALOG_TABLE_ITEM_SCHEMA, False), True)
        ])

        DATACATALOG_PARTITION_SCHEMA = StructType([
            StructField('database', StringType(), False),
            StructField('table', StringType(), False),
            StructField('items', ArrayType(DATACATALOG_PARTITION_ITEM_SCHEMA, False), True),
            StructField('type', StringType(), False)
        ])

For details on the individual item schemas, refer to the schema definition on GitHub.

Persisting the metadata

After converting to a DataFrame with schema, it is persisted to the S3 location marked by the output-path parameter.

databases.write.format('json').mode('overwrite').save(output_path + 'databases')
tables.write.format('json').mode('overwrite').save(output_path + 'tables')
partitions.write.format('json').mode('overwrite').save(output_path + 'partitions')

Exploring the output

Navigate to the S3 bucket that contains the output location, and you should be able to see the output metadata in JSON format. An example export for a table would look like the following code snippet.

    "database": "default",
    "kind": "desk",
    "merchandise": {
        "createTime": "1651241372000",
        "lastAccessTime": "0",
        "proprietor": "spark",
        "retention": 0,
        "title": "an_example_table",
        "tableType": "EXTERNAL_TABLE",
        "parameters": {
            "totalSize": "2734148",
            "EXTERNAL": "TRUE",
            "last_commit_time_sync": "20220429140907",
            "spark.sql.sources.schema.half.0": "{redacted_schema}",
            "numFiles": "1",
            "transient_lastDdlTime": "1651241371",
            "spark.sql.sources.schema.numParts": "1",
            "spark.sql.sources.supplier": "hudi"
        "partitionKeys": [],
        "storageDescriptor": {
            "inputFormat": "org.apache.hudi.hadoop.HoodieParquetInputFormat",
            "compressed": false,
            "storedAsSubDirectories": false,
            "location": "s3://redacted_bucket_name/desk/an_example_table",
            "numberOfBuckets": -1,
            "outputFormat": "",
            "bucketColumns": [],
            "columns": [{
                    "name": "_hoodie_commit_time",
                    "type": "string"
                    "name": "_hoodie_commit_seqno",
                    "type": "string"
            "parameters": {},
            "serdeInfo": {
                "serializationLibrary": "",
                "parameters": {
                    "": "false",
                    "path": "s3://redacted_bucket_name/desk/an_example_table",
                    "serialization.format": "1"
            "skewedInfo": {
                "skewedColumnNames": [],
                "skewedColumnValueLocationMaps": {},
                "skewedColumnValues": []
            "sortColumns": []

Once the export job is complete, the output S3 path is pushed to an SNS topic. A Lambda function on the target account processes this message and invokes the import AWS Glue job by passing the S3 import location.
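The Lambda function itself is not part of the open-source utility. A minimal sketch, assuming the SNS message carries the export path as a simple JSON field and using a placeholder job name, could look like the following.

```python
import json


def extract_export_path(event: dict) -> str:
    """Read the exported-metadata S3 path from the SNS event that
    invokes the Lambda function (the message shape is an assumption)."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["export_path"]


def lambda_handler(event, context):
    """Start the metadata importer Glue job in this target account,
    pointing it at the source account's latest export."""
    import boto3  # imported lazily so extract_export_path has no AWS dependency

    glue = boto3.client("glue")
    export_path = extract_export_path(event).rstrip("/")
    run = glue.start_job_run(
        JobName="metadata-importer",  # placeholder job name
        Arguments={
            "--mode": "from-s3",
            "--database-input-path": export_path + "/databases",
            "--table-input-path": export_path + "/tables",
            "--partition-input-path": export_path + "/partitions",
        },
    )
    return run["JobRunId"]
```

One such subscriber function is deployed per target account, each subscribed to the source account's SNS topic.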

Metadata importer

The import job runs on the target account. The code for the job is available on GitHub. As with the exporter, you may have to customize it to suit your specific requirements, but the code as-is should work for most scenarios.

Metadata importer inputs

The inputs to the application are provided as job parameters. Below is a list of parameters that are used for the import process:

  • --mode key accepts either from-s3 or from-jdbc. The latter is used when migration is from a JDBC source to the AWS Glue Data Catalog. At Cargotec, the metadata is already written to Amazon S3, and hence the value for this key is always set to from-s3.
  • --region key accepts a valid AWS Region for the AWS Glue Catalog. The target Region is specified using this key.
  • --database-input-path key accepts the path to the file containing the database metadata. This is the output of the previous export job.
  • --table-input-path key accepts the path to the file containing the table metadata. This is the output of the previous export job.
  • --partition-input-path key accepts the path to the file containing the partition metadata. This is the output of the previous export job.
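As a rough illustration of how such parameters are consumed inside a job script, the following stand-alone sketch parses them with Python's argparse. The utility on GitHub handles its arguments in a similar style, so treat this as an approximation rather than the exact code.

```python
import argparse


def parse_importer_args(argv):
    """Sketch of parsing the importer's job parameters; argparse maps
    dashed option names to underscore attributes (e.g. args.mode,
    args.database_input_path)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", required=True, choices=["from-s3", "from-jdbc"])
    parser.add_argument("--region", required=True)
    parser.add_argument("--database-input-path", required=True)
    parser.add_argument("--table-input-path", required=True)
    parser.add_argument("--partition-input-path", required=True)
    return parser.parse_args(argv)
```

When the job runs on AWS Glue, these values arrive on the command line as the `--key value` pairs supplied by the triggering Lambda function.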

Reading the metadata

The metadata, as previously discussed, are files on Amazon S3. They are read into individual Spark DataFrames with their respective schema information.

databases = sql_context.read.json(path=db_input_dir, schema=METASTORE_DATABASE_SCHEMA)
tables = sql_context.read.json(path=tbl_input_dir, schema=METASTORE_TABLE_SCHEMA)
partitions = sql_context.read.json(path=parts_input_dir, schema=METASTORE_PARTITION_SCHEMA)

Loading the catalog

Once the Spark DataFrames are read, they are converted to AWS Glue DynamicFrames and then loaded into the catalog, as shown in the following snippet.

               for frame, name in ((databases, 'databases'), (tables, 'tables'), (partitions, 'partitions')):
                   dyf = DynamicFrame.fromDF(frame, glue_context, name)
                   glue_context.write_dynamic_frame.from_options(
                       frame=dyf, connection_type=DATACATALOG_CONNECTION_TYPE,
                       connection_options={'catalog.name': datacatalog_name,
                                           'catalog.region': region})

Once the job concludes, you can query the target AWS Glue catalog to ensure that the tables from the source have been synced with the destination. To keep things simple and easy to manage, instead of implementing a mechanism to identify tables that change over time, Cargotec updates the catalog information of all databases and tables that are configured in the export job.
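One simple way to verify the sync is to run an Athena query against a replicated table from the target account. The following sketch uses Boto3; the database, table, and result-location values are placeholders.

```python
def verification_query(database: str, table: str) -> str:
    """Build a minimal Athena query against a replicated table."""
    return f'SELECT * FROM "{database}"."{table}" LIMIT 10'


def run_verification_query(database: str, table: str, output_location: str) -> str:
    """Submit the query with Athena and return its execution ID; the
    caller can poll get_query_execution for the result."""
    import boto3  # imported lazily so verification_query has no AWS dependency

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=verification_query(database, table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Because the replicated entries are local to the target account's catalog, the query runs as if the table were native to that account, while the data is still read from the source account's S3 bucket.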


Drawbacks

Though the setup works effectively for Cargotec's current business requirements, there are a few drawbacks to this approach, which are highlighted below:

  1. The solution involves code. Customizations were made to the existing open-source utility to publish an SNS notification once an export is complete and to add a Lambda function to trigger the import process.
  2. The export process on the source account is a scheduled job, so there is no real-time sync between the source and target accounts. This was not a requirement for Cargotec’s business process.
  3. For tables that don’t use Athena partition projection, query results may be outdated until the new partitions are added to the metastore through MSCK REPAIR TABLE, ALTER TABLE ADD PARTITION, an AWS Glue crawler, and so on.
  4. The current approach requires syncing all the tables across the source and target. If the requirement is to capture only the ones that changed instead of a scheduled daily export, the design needs to change and could benefit from the Amazon EventBridge integration with AWS Glue. An example implementation of using AWS Glue APIs to identify changes is shown in Identify source schema changes using AWS Glue.


Conclusion

In this blog post, we have explored a solution for cross-account sharing of data and tables that makes it possible for Cargotec to create views that combine data from multiple AWS accounts. We’re excited to share Cargotec’s success and believe the post has provided you with valuable insights and inspiration for your own projects.

We encourage you to explore our range of services and see how they can help you achieve your goals. Finally, for more data and analytics blogs, feel free to bookmark the AWS Blogs.

About the Authors

Sumesh M R is a Full Stack Machine Learning Architect at Cargotec. He has several years of software engineering and ML background. Sumesh is an expert in SageMaker and other AWS ML/Analytics services. He is passionate about data science and loves to explore the latest ML libraries and techniques. Before joining Cargotec, he worked as a Solution Architect at TCS. In his spare time, he loves to play cricket and badminton.

Tero Karttunen is a Senior Cloud Architect at Knowit Finland. He advises clients on architecting and adopting data architectures that best serve their data analytics and machine learning needs. He has helped Cargotec in their data journey for more than two years. Outside of work, he enjoys running, winter sports, and role-playing games.

Arun A K is a Big Data Specialist Solutions Architect at AWS. He works with customers to provide architectural guidance for running analytics solutions on AWS Glue, AWS Lake Formation, Amazon Athena, and Amazon EMR. In his free time, he likes to spend time with his friends and family.