Apache Iceberg

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, Hive and Impala using a high-performance table format that works just like a SQL table.

Apache Iceberg manages the relationship between an event timestamp column and its date, so the partitioning is handled by Iceberg rather than by the user. Additional levels of partitioning can be added, and these are tracked per snapshot in metadata files. For reads, the as-of-timestamp option selects the snapshot that was current at a given timestamp, in milliseconds.
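The two ideas in this paragraph can be illustrated with a small, self-contained Python sketch. It is not Iceberg code: the function names and the snapshot dictionaries are hypothetical, and UTC is assumed when deriving the date. It only shows how a date partition value can be derived from an event timestamp, and how a snapshot can be chosen for a given as-of-timestamp.

```python
from datetime import datetime, timezone


def day_partition(event_ts_ms: int) -> str:
    """Derive the date partition value (UTC) from an event timestamp in milliseconds."""
    seconds = event_ts_ms // 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def snapshot_as_of(snapshots, as_of_ts_ms):
    """Return the snapshot that was current at as_of_ts_ms, or None if none existed yet."""
    eligible = [s for s in snapshots if s["timestamp_ms"] <= as_of_ts_ms]
    if not eligible:
        return None
    # The current snapshot at that time is the latest one committed at or before it.
    return max(eligible, key=lambda s: s["timestamp_ms"])


# Hypothetical snapshot log for one table.
snapshots = [
    {"snapshot_id": 1, "timestamp_ms": 1_600_000_000_000},
    {"snapshot_id": 2, "timestamp_ms": 1_600_000_600_000},
    {"snapshot_id": 3, "timestamp_ms": 1_600_001_200_000},
]

print(day_partition(1_600_000_000_000))
print(snapshot_as_of(snapshots, 1_600_000_700_000))
```

Because the partition value is computed from the timestamp column, a query can filter on the timestamp alone without referring to a separate date column.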

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

Reliability and performance

Iceberg was developed for huge tables. Iceberg is used in production where a single table can contain tens of petabytes of data and even these huge tables can be read without a distributed SQL engine.

Scan planning is fast – a distributed SQL engine isn’t needed to read a table or find files

Advanced filtering – data files are pruned with partition and column-level stats, using table metadata
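As a rough illustration of pruning with partition and column-level stats, the sketch below keeps per-file metadata in memory and discards files that cannot match a filter without opening them. The DataFile class, its fields and the sample paths are hypothetical and much simpler than real Iceberg manifests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition_date: str  # partition value for this file
    min_id: int          # lower bound of column "id" in this file
    max_id: int          # upper bound of column "id" in this file


def plan_scan(files, date, id_value):
    """Return paths of files that might contain rows with the given date and id."""
    selected = []
    for f in files:
        if f.partition_date != date:
            continue  # pruned by partition value
        if not (f.min_id <= id_value <= f.max_id):
            continue  # pruned by column-level min/max stats
        selected.append(f.path)
    return selected


files = [
    DataFile("data/a.parquet", "2020-09-13", 1, 100),
    DataFile("data/b.parquet", "2020-09-13", 101, 200),
    DataFile("data/c.parquet", "2020-09-14", 1, 200),
]

print(plan_scan(files, "2020-09-13", 150))
```

Planning here touches only metadata, which is why a single process can do it without a distributed engine.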

Iceberg was designed to solve correctness problems in eventually-consistent cloud object stores.

Works with any cloud store and reduces NameNode (NN) congestion when in HDFS, by avoiding listing and renames

Serializable isolation – table changes are atomic and readers never see partial or uncommitted changes

Multiple concurrent writers use optimistic concurrency and will retry to ensure that compatible updates succeed, even when writes conflict.
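The commit behaviour described above can be sketched as a compare-and-swap on a table's current metadata version. This is a simplified model, not Iceberg's implementation: the Catalog class, its methods and the file names are hypothetical, and the only operation modelled is an append, which is always compatible with other appends.

```python
import threading


class Catalog:
    """Hypothetical in-memory catalog holding one table's current state."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.files = frozenset()

    def load(self):
        """Read the current version and file set together."""
        with self._lock:
            return self.version, self.files

    def compare_and_swap(self, expected_version, new_files):
        """Atomically install new_files only if nobody committed since expected_version."""
        with self._lock:
            if self.version != expected_version:
                return False  # another writer won; caller must retry
            self.version += 1
            self.files = new_files
            return True


def append_with_retry(catalog, new_file, max_retries=5):
    """Append one file, re-reading the table state and retrying on conflict."""
    for attempt in range(1, max_retries + 1):
        version, files = catalog.load()
        if catalog.compare_and_swap(version, files | {new_file}):
            return attempt  # number of attempts it took
    raise RuntimeError("commit failed after retries")


catalog = Catalog()
stale_version, stale_files = catalog.load()
append_with_retry(catalog, "a.parquet")
# A writer still holding the old version is rejected and has to retry.
print(catalog.compare_and_swap(stale_version, stale_files | {"b.parquet"}))
print(append_with_retry(catalog, "b.parquet"))
```

Readers always see either the old file set or the new one, never a mix, because the swap is a single atomic step.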

Open standard

Iceberg has been designed and developed to be an open community standard with a specification to ensure compatibility across languages and implementations.

Apache Iceberg is open source, and is developed at the Apache Software Foundation.

The Apache Iceberg table format is now used and contributed to by many leading tech companies such as Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.

Like so many tech projects, Apache Iceberg, an open table format for huge analytic datasets, grew out of frustration. It's based on an all-or-nothing approach: an operation should complete entirely and commit at one point in time, or it should fail and make no changes to the table. Anything in between leaves a lot of clean-up work. The idea behind Hive tables was to keep data in directories and be able to prune out the directories you don't need. That allows Hive tables to have fast queries on really large amounts of data.

Netflix open-sourced the project in 2018 and donated it to the Apache Software Foundation. It emerged from the Incubator as a top-level project. Its contributors include Airbnb, Amazon Web Services, Alibaba, Expedia, Dremio and others.

The key problems Iceberg tries to address are:

  • using data lakes at scale (petabyte-scale tables)
  • data and schema evolution
  • consistent concurrent writes

AWS Integrations

Apache Iceberg has multiple AWS service integrations, covering query engines, catalogs and the infrastructure to run on.

AWS integrations are available for the following engines, along with support for setting up custom catalogs.

  • Spark – Spark 3.0 with AWS client version 2.15.40 supports integration with Apache Iceberg
  • Flink – the AWS Flink module supports creating Iceberg tables from the Flink SQL client
  • Apache Hive – the AWS module, with Hive included among its dependencies, enables creating Iceberg tables
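For the Spark case, a catalog is usually wired up through Spark configuration properties. The sketch below only builds those properties as text. It assumes the AWS Glue catalog, which this post does not name, and the property keys and class names are written as I understand them from the Iceberg AWS documentation, so they should be checked against the Iceberg version in use. The catalog name and S3 path are placeholders.

```python
def glue_catalog_conf(catalog_name: str, warehouse: str) -> dict:
    """Build Spark properties for an Iceberg catalog backed by AWS Glue (assumed setup)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def to_spark_submit_args(conf: dict) -> list:
    """Turn the properties into repeated '--conf key=value' command-line arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder catalog name and warehouse location.
conf = glue_catalog_conf("my_catalog", "s3://my-bucket/my/key/prefix")
print(" ".join(to_spark_submit_args(conf)))
```

The resulting arguments would be passed to spark-sql or spark-submit together with the Iceberg runtime and AWS SDK dependencies.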

Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.

Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Athena scales automatically, running queries in parallel, so results are fast, even with large datasets and complex queries.

Conclusion

As organizations move towards data-driven decision making, the importance of lakehouse-style architectures is increasing rapidly. As a new open table format that can scale and evolve seamlessly, Apache Iceberg provides key benefits over its predecessor, Apache Hive.

Apache Iceberg is best suited for batch and micro-batch processing of datasets. The growing open source community and integrations from multiple cloud providers make it easier to integrate Apache Iceberg into an existing architecture.

For more details contact info@vafion.com
