Apache Hudi is an open-source data lake platform that brings ACID transactional guarantees to the data lake. Our teams have had a great experience using Hudi in a high-volume, high-throughput scenario with real-time inserts and upserts. We particularly like the flexibility Hudi offers for customizing the compaction algorithm, which helps in dealing with the “small files” problem. Apache Hudi falls in the same category as Delta Lake and Apache Iceberg: all three support similar features, but each differs in its underlying implementation and detailed feature list.
Apache Hudi introduced easy updates and deletes to S3-based data lake architectures, along with native CDC ingestion patterns. It is a data lake technology that has been in use since 2016, originally built by Uber.
A short history of Apache Hudi
Hudi was created at Uber and entered production in 2016. To power and manage the 100PB of data that underpins essential business activities for trips, riders, and customers, Uber needed a cloud-based data lake. The data engineering team at Uber had a specific requirement: feeding changes from relational SQL databases into the data lake using binlogs, or change data capture (CDC).
Uber open-sourced Hudi in 2017 and built a community around it, then submitted the project to the Apache Software Foundation, where it entered incubation in 2019 and graduated as a top-level project in 2020. Since then, other sizable data-rich companies like Walmart and Disney have adopted it as their main data platform.
Challenges
Hudi originated from the requirement to feed changes from traditional relational SQL databases into a data lake platform for long-term retention and analytical queries at scale. Uber faced challenges with its ever-growing data platform requirements, such as incremental data processing, data versioning, and efficient data ingestion.
Nowadays, most commonly used query engines support querying Hudi tables – Trino, PrestoDB, Apache Spark, Apache Hive, and more.
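Reading a Hudi table looks like reading any other Spark data source. Below is a minimal PySpark sketch, assuming Spark was launched with the Hudi bundle on its classpath; the bucket path and table name are illustrative:

```python
from pyspark.sql import SparkSession

# Assumes Spark was started with the Hudi Spark bundle
# (e.g. via spark-submit --packages with the Hudi bundle jar).
spark = SparkSession.builder.appName("hudi-read").getOrCreate()

# Snapshot query: the latest committed view of the table.
customers = spark.read.format("hudi").load("s3://my-bucket/lake/customers")
customers.createOrReplaceTempView("customers")
spark.sql("SELECT COUNT(*) AS total FROM customers").show()
```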
Real-time Data Ingestion
Hudi doesn’t quite provide real-time data ingestion, but it comes closer than other data lake platforms. Users typically need a variety of tools to consume data efficiently from OLTP sources. Resource-hungry bulk reloads can be replaced by efficient Hudi operations such as upsert, which applies only the changed records from an RDBMS. Finally, Hudi’s DeltaStreamer toolset enables simple scale-out to include more and more sources.
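To make the upsert pattern concrete, here is a minimal PySpark sketch that applies a batch of changed rows instead of reloading the whole dataset. The option keys are standard Hudi write options; the table name, key fields, path, and `changes_df` are assumptions for the example:

```python
# Sketch: apply CDC-style changes with upsert instead of a bulk reload.
# `changes_df` is assumed to hold changed rows pulled from the source RDBMS;
# `spark` is the SparkSession from the earlier sketch.
hudi_options = {
    "hoodie.table.name": "customers",                          # target table
    "hoodie.datasource.write.recordkey.field": "customer_id",  # record key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest wins
    "hoodie.datasource.write.operation": "upsert",             # update-or-insert
}
(changes_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lake/customers"))
```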
New Batch and Streaming Architecture
Hudi introduces data streaming principles to data lake storage, which allows data to be ingested faster than with traditional architectures. It also allows for incremental processing pipelines that are hugely faster than traditional batch jobs, because each run processes only the records that changed since the last one. And because Hudi doesn’t require server resources up front, it provides greater analytics performance with less operational cost.
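The building block of such a pipeline is an incremental query, which returns only records committed after a given instant. A minimal sketch against the illustrative table from the earlier examples (the begin instant is a placeholder; in practice it comes from the previous run’s checkpoint):

```python
# Sketch: incremental pull – read only records committed after begin_instant.
# `spark` is the SparkSession from the first sketch.
begin_instant = "20240101000000"  # placeholder Hudi commit time
incremental = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_instant)
    .load("s3://my-bucket/lake/customers"))
incremental.show()
```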
Hudi allows two types of deletes (a sketch follows the list):
- Soft delete – The record key is retained while all other fields are set to null. Because the record itself is maintained, a soft delete can be rolled back.
- Hard delete – A permanent delete that erases any trace of the record from the table.
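A minimal sketch of a hard delete against the same illustrative table; `to_delete_df` is assumed to contain the keys of the records to remove:

```python
# Sketch: hard delete – the "delete" operation permanently removes the records.
(to_delete_df.write.format("hudi")
    .option("hoodie.table.name", "customers")
    .option("hoodie.datasource.write.recordkey.field", "customer_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save("s3://my-bucket/lake/customers"))
```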
Updates
Hudi offers two options when it comes to updating:
‘Insert’ is a fast operation that writes incoming records without first looking up existing keys in the index. It is exceptionally fast and is best used in circumstances where you can tolerate duplicates in a table. There is also a BULK_INSERT operation, a scalable variant of insert capable of handling initial loads of hundreds of terabytes.
‘Upsert’, the default in Hudi, is the alternative to insert. It is largely the same as the insert operation, but it performs an index lookup before writing, so existing records are updated in place. This is slower than insert but better for storage optimization and rightsizing of files, and it will not generate any duplicates.
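The choice between the two is a single write option. A brief sketch, with an assumed events table:

```python
# Sketch: the operation is chosen per write batch via one option:
#   "insert"      – no index lookup; fastest, duplicates possible
#   "bulk_insert" – scalable insert variant for large initial loads
#   "upsert"      – index lookup first; no duplicates (the default)
(events_df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "insert")  # or "upsert"
    .mode("append")
    .save("s3://my-bucket/lake/events"))
```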
Apache Hudi: A Stream-based approach
Hudi sets itself apart from other modern data platforms with DeltaStreamer, an ingestion utility that ships with the project. While sharing many features with its peers, Hudi has the unique ability to manage data in object storage at the individual record level, with DeltaStreamer managing ingestion from many sources. Data streaming and CDC (Change Data Capture) become straightforward as a result.
Apache Flink support
There’s no doubt that Flink has altered data processing and data architectures. It serves as both a processing engine and a framework for in-memory computations over bounded and unbounded data streams: unbounded streams lack a defined endpoint, while bounded streams have a predetermined start and end. Hudi supports Flink as a writer and reader alongside Spark, so streaming jobs can land data directly in Hudi tables.
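A minimal PyFlink sketch of that integration, assuming the Hudi Flink bundle is on the Flink classpath; the schema and path are illustrative:

```python
# Sketch: register a Hudi table with Flink's Hudi connector so a streaming
# job can write to it. Connector options follow the Hudi Flink integration;
# the S3 path and schema are assumptions for the example.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE customers_hudi (
        customer_id STRING PRIMARY KEY NOT ENFORCED,
        name        STRING,
        updated_at  TIMESTAMP(3)
    ) WITH (
        'connector'  = 'hudi',
        'path'       = 's3://my-bucket/lake/customers',
        'table.type' = 'MERGE_ON_READ'
    )
""")
```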
Conclusion
There’s no doubt that Hudi is a powerful tool that has set the scene for modern data lake and warehouse platforms. It is an open-source data management framework that simplifies incremental data processing and data pipeline development, manages business requirements like the data lifecycle more efficiently, and improves data quality.
For more details contact info@vafion.com
Follow us on social media: Twitter | Facebook | Instagram | LinkedIn