Alex Merced

Posted on Jun 16

Apache Data Lakehouse Weekly: June 9 to 16, 2026

#data #dataengineering #news #opensource

This was a week of votes that passed and arguments that opened. Iceberg shipped three formal decisions while debating how to cut read latency for V4. Polaris worked through the unglamorous plumbing of error codes, persistence, and retention. Parquet reopened the oldest question in its history, what a version number should mean, while shipping a new release at the same time. Arrow advanced variant support and welcomed a new language binding. DataFusion grew its leadership and set a roadmap. Underneath all of it ran one shared headache that touched four projects at once: who pays for the CI compute.

Read the five lists together and a theme jumps out. This was a maturation week, not a launch week. Almost nothing shipped that a marketing team would put on a banner. What shipped instead was the deep, careful work that decides whether these projects can hold up production workloads for the next several years. Error codes, retention policies, field-id semantics, version numbering, test backends, CI budgets. None of it is exciting. All of it is the difference between software you experiment with and software you bet a business on. If you run a lakehouse, weeks like this one are the weeks that earn your trust, even though they make for quiet headlines.

Apache Iceberg

Iceberg spent the week closing votes. Three separate decisions reached a result, each one a small commitment that shapes the format and its implementations for years.

The C++ implementation crossed a milestone. After a bumpy run through release candidates, Apache Iceberg C++ 0.3.0 RC3 passed its vote, with Junwang Zhao driving the release and Gang Wu thanking the voters once it cleared. Binding and non-binding plus-ones came in from Kevin Liu, Renjie Liu, Neelesh Salian, Alex Stephen, and others, several of whom ran the verify script on macOS before signing off. The road there ran through an earlier RC2 attempt that did not make it, which is the normal rhythm of a young codebase finding its footing. A native C++ Iceberg matters because it removes the JVM from the picture for engines and tools written in C++ and Rust, and it feeds directly into work like iceberg-cpp reaching V3 feature completeness.

The spec gained a small but real clarification. Kevin Liu ran a vote to clarify the day partition transform result type as date, and it passed with six binding and seven non-binding plus-ones. Fokko Driesprong, Szehon Ho, Amogh Jahagirdar, Gang Wu, and many others backed it. The change reads like a footnote, but ambiguity in a spec is where implementations drift apart. Pinning the day transform to produce a date type keeps every engine reading and writing the same thing. The same week, a parallel discussion between Andrei Tserakhau and Kevin Liu picked at the related Avro schema question for the day partition field, the kind of follow-on detail that surfaces once you settle the type.
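
To make the clarification concrete, here is a minimal sketch of a day transform that returns a date, with the equivalent integer (days since 1970-01-01) computed alongside it. The function names are illustrative, not taken from any Iceberg library, and the sketch assumes timezone-aware timestamps.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> date:
    """Map a timezone-aware timestamp to its day partition value, as a date."""
    # Normalize to UTC first so the same instant always lands on the same day.
    return ts.astimezone(timezone.utc).date()


def day_ordinal(d: date) -> int:
    """Days since 1970-01-01: the integer form of the same partition value."""
    return (d - EPOCH).days


# An evening timestamp at UTC-5 falls on the next calendar day in UTC.
ts = datetime(2026, 6, 16, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
partition = day_transform(ts)
print(partition, day_ordinal(partition))
```

An engine that treats the result as a date and one that treats it as a bare integer hold the same information, but they disagree on the type, which is the kind of drift the vote closes off.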

The most forward-looking vote set up deletion vectors for the next era. Ryan Blue's vote to add the draft bitmap spec to git passed with thirteen plus-ones, nine of them binding, from Daniel Weeks, Anoop Johnson, Szehon Ho, Fokko Driesprong, and more. Getting a draft into the repository does not finalize anything. It gives the community a shared artifact to argue over instead of scattered ideas. Bitmaps underpin the deletion vector approach that makes row-level deletes fast, and a written draft is how that work moves from concept to spec.
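
For readers new to the idea, the sketch below shows why a bitmap suits row-level deletes: marking and checking a deleted position is a single bit operation. This is a toy built on a Python integer, not the draft spec's format. The class and function names are hypothetical, and a real implementation would use a compressed bitmap.

```python
class DeleteBitmap:
    """Toy bitmap of deleted row positions, backed by one Python int."""

    def __init__(self) -> None:
        self._bits = 0

    def delete(self, position: int) -> None:
        # Set the bit for this row position.
        self._bits |= 1 << position

    def is_deleted(self, position: int) -> bool:
        # Test the bit for this row position.
        return (self._bits >> position) & 1 == 1

    def cardinality(self) -> int:
        # Number of deleted positions.
        return bin(self._bits).count("1")


def live_rows(rows, deletes: DeleteBitmap):
    """Return only the rows whose positions are not marked deleted."""
    return [row for pos, row in enumerate(rows) if not deletes.is_deleted(pos)]


dv = DeleteBitmap()
dv.delete(1)
dv.delete(3)
print(live_rows(["a", "b", "c", "d"], dv))
```

The reader filters by position while scanning, so the data file itself never has to be rewritten to apply a delete.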

While those votes closed, the V4 performance conversation got concrete. Varun Lakhyani opened a discussion about combining three GET calls for Parquet reads, the serial requests for root manifest, data file, and metadata that add latency on small files. Russell Spitzer called this critical for the Root Manifest and indexing work, noting that serial GETs on a small file are a latency killer when cutting latency is the whole point of V4. Daniel Weeks weighed in that the fix belongs at the FileIO layer rather than buried in Parquet-specific code, since encryption and metrics handling live there too. Spitzer added that an Iceberg-first solution is appealing, though Parquet Java itself could grow APIs to support the pattern the way the Rust and C++ implementations already do. This is the read-path work that decides whether V4 feels fast in practice.

Schema evolution raised a thornier design question. Sung Yun opened a discussion about a write-path gap for field-id-bound policy during schema evolution, and Prashant Singh laid out the crux. Engines and catalogs resolve column names to field IDs before they persist a policy-to-table mapping, so attaching a policy needs the column to exist. Singh drew the parallel to how Iceberg assigns field IDs at table creation and treats them as the source of truth across renames. The group had modeled this exact scenario when designing ReadRestrictions, choosing to return a name and leave the metadata representation to the catalog. The debate is about where governance metadata lives and whether the spec should say more, a question that grows louder as policies and labels move into the catalog.
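
The constraint is easier to see in a small model. The sketch below binds a policy to a field ID rather than a column name, so the policy survives a rename but cannot be attached before the column exists. `Table` and `PolicyStore` are hypothetical stand-ins, not Iceberg or catalog APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Column name -> field ID. IDs are assigned once and survive renames.
    columns: dict = field(default_factory=dict)
    next_id: int = 1

    def add_column(self, name: str) -> int:
        field_id = self.next_id
        self.columns[name] = field_id
        self.next_id += 1
        return field_id

    def rename_column(self, old: str, new: str) -> None:
        # Only the name changes; the field ID stays the same.
        self.columns[new] = self.columns.pop(old)


@dataclass
class PolicyStore:
    # Field ID -> policy name (hypothetical catalog-side mapping).
    by_field_id: dict = field(default_factory=dict)

    def attach(self, table: Table, column: str, policy: str) -> None:
        # The name must resolve to a field ID before the mapping is persisted.
        if column not in table.columns:
            raise KeyError(f"column {column!r} does not exist yet")
        self.by_field_id[table.columns[column]] = policy

    def policy_for(self, table: Table, column: str):
        return self.by_field_id.get(table.columns[column])


table = Table()
table.add_column("email")
policies = PolicyStore()
policies.attach(table, "email", "mask")
table.rename_column("email", "contact_email")
print(policies.policy_for(table, "contact_email"))
```

The write-path gap is the `KeyError` branch: a policy meant for a column that a pending schema change will add has nothing to bind to yet.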

Two REST catalog threads pushed in the same direction. Sung Yun and Alexandre Dutra discussed a REST spec change for passing arbitrary information to a request signer, and EJ Wang opened a thread on table and column label metadata in the REST catalog. Both reflect a catalog that keeps taking on more responsibility for security and governance, not just table location.

The Spark connector story got a planning thread. Anurag Mantripragada, Cheng Pan, and Szehon Ho worked through a Spark versioning strategy for accelerated Spark releases. The group leaned toward supporting the last Spark LTS plus the two latest minors, with a plan to merge Spark 4.2 support first and start a separate vote to drop Spark 4.0 after a community sync. Cheng Pan tied the cadence question to Iceberg's own release rhythm, since a roughly three-month cycle keeps the supported version range manageable. Keeping pace with Spark without carrying every old version forever is a steady maintenance tax, and the community is choosing how to pay it.
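
The rule the group leaned toward, last LTS plus the two latest minors, can be written as a small function. The release list and the choice of LTS below are illustrative inputs, not something the thread is quoted as fixing.

```python
def supported_versions(released, lts, latest_minors=2):
    """Return the LTS release plus the newest `latest_minors` releases.

    `released` is a list of (major, minor) tuples; `lts` is one such tuple.
    """
    newest = sorted(set(released))[-latest_minors:]
    return sorted(set(newest) | {lts})


# Hypothetical inputs: treat 3.5 as the LTS and 4.2 as the newest minor.
released = [(3, 5), (4, 0), (4, 1), (4, 2)]
print(supported_versions(released, lts=(3, 5)))
```

Under those assumed inputs, adding 4.2 pushes 4.0 out of the window, which matches the order of operations the thread describes: merge 4.2 support first, then vote on dropping 4.0.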

Maintenance and patches rounded out the week. Amogh Jahagirdar and others discussed 1.11.1 and 1.10.3 patch releases, with the 1.11.x branch created the prior week and several correctness fixes lined up as backport candidates. Matt Butrovich pointed to a green PR fixing manifest delete file size after a table rewrite, and Amogh added a fix for default value handling against Parquet metrics. These are the unglamorous correctness fixes that keep production tables trustworthy. The community also looked ahead socially, with a discussion about Iceberg Summit 2027 drawing input from Danica Fine, Jean-Baptiste Onofré, Bill Zhang, and Kevin Liu.

Step back and the Iceberg week tells a clear story about where the project sits. The Java implementation is in steady maintenance mode, shipping patch releases and trimming CI cost, while the energy moves to two frontiers. One frontier is V4, where read latency, the root manifest, deletion vectors, and indexing all aim to make large tables fast at query time. The other is the language frontier, where C++ and Rust implementations grow toward feature parity so engines outside the JVM can read and write Iceberg natively. The day-partition clarification, the bitmap draft, and the field-id policy debate all feed those frontiers. None of them grab headlines on their own. Together they decide whether Iceberg stays the default open table format as the workload mix shifts toward low-latency and AI-driven reads.

The catalog threads deserve a second look because they point at a bigger shift. The REST signer change and the label metadata thread both move responsibility into the catalog rather than the table files. A few years ago the table format held almost everything and the catalog was a thin pointer. The center of gravity is moving. Security policy, labels, request signing, and governance increasingly live at the catalog layer, which is exactly why Polaris had such a busy week. The two projects are growing into each other.

It helps to say plainly what V4 is chasing, since several threads orbit it. The goal is low-latency reads on large tables. Today an Iceberg read can mean several serial round trips: fetch the root metadata, fetch a manifest, fetch the data file, and so on. On a small file over object storage, those serial GETs dominate the time, because the network round trip costs more than the actual read. The call-combining discussion, the root manifest work, deletion vectors, and indexing all attack pieces of that problem. Combine the calls and you cut round trips. Get the root manifest right and you find the data you need faster. Use deletion vectors and you skip the slow merge-on-read path. The reason this matters now is that the workload mix is shifting. More queries are interactive, more are driven by AI agents firing many small reads, and more expect sub-second answers. Iceberg won the batch analytics world. V4 is the bet that it can win the low-latency world too, and the threads this week are the early, unglamorous moves in that game.
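
A back-of-the-envelope model shows why round trips dominate on small files. Every number here is an assumption chosen for illustration (50 ms per round trip, 100 MB/s throughput, a 1 MB read), not a measurement from any of the threads.

```python
def read_latency_ms(round_trips, rtt_ms, bytes_read, throughput_mb_s):
    """Estimate latency as serial round trips plus transfer time."""
    transfer_ms = bytes_read / (throughput_mb_s * 1_000_000) * 1000
    return round_trips * rtt_ms + transfer_ms


# Assumed figures: 3 serial GETs vs 1 combined call for the same 1 MB.
serial = read_latency_ms(round_trips=3, rtt_ms=50, bytes_read=1_000_000, throughput_mb_s=100)
combined = read_latency_ms(round_trips=1, rtt_ms=50, bytes_read=1_000_000, throughput_mb_s=100)
print(f"serial: {serial:.0f} ms, combined: {combined:.0f} ms")
```

With these inputs the transfer itself takes about 10 ms, so almost all of the difference between the two cases is waiting on the network rather than reading data.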

Apache Polaris

Polaris had the busiest week of the five projects, and almost all of it was the careful, detailed work of turning a young catalog into a dependable one. The threads were less about big new features and more about getting the hard parts right.

The most active debate was also the smallest in scope: what status code to return when a table or view rename conflicts. The thread on rename conflict status codes ran long because the choice carries real weight for clients. Dmitri Bourlatchkov pushed to start with a simple 503, and Yufei Gu agreed, leaning the same way after reading the RFCs. Nándor Kollár argued that both 429 and 503 are imperfect, since 429 signals a client sending too many requests while the rename conflict is more of a server-side condition. The group converged on 503 plus retry as the least-bad option, with server-side retries left as a later addition if users complain. A status code looks trivial until you remember that every client in the ecosystem has to handle whatever you pick.
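
What "503 plus retry" asks of a client can be sketched in a few lines. The `ServiceUnavailable` exception stands in for an HTTP 503 response, and `call_with_retry` is a hypothetical helper, not part of any Polaris or Iceberg client. The backoff numbers are arbitrary.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Stands in for an HTTP 503 response from the catalog."""


def call_with_retry(operation, max_attempts=4, base_delay_s=0.1, sleep=time.sleep):
    """Run `operation`, retrying on ServiceUnavailable with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # Exponential backoff with jitter so clients do not retry in lockstep.
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


# Usage: a fake rename that conflicts twice, then succeeds.
attempts = {"count": 0}


def flaky_rename():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ServiceUnavailable()
    return "renamed"


print(call_with_retry(flaky_rename, sleep=lambda seconds: None))
```

Every client that talks to the catalog ends up carrying some version of this loop, which is why the choice of status code drew so much discussion.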

A larger architecture thread tackled forwarding for Iceberg scan and commit operations. Alexandre Dutra and Romain Manni-Bucau worked through the design for forwarding use cases, including a side debate about GraalJS, which entered the picture because of Ranger and carries an 85-megabyte cost against an already large Polaris image. Dutra noted the long-term plan removes GraalJS once Ranger moves to a sidecar-style deployment. The detail matters because catalog image size and startup cost shape how cheaply Polaris runs in a container, and every dependency is a tradeoff between capability and weight.

Persistence drew two connected threads. Alexandre Dutra opened a discussion on supporting H2 in persistence and a related one on deprecating TreeMapMetaStore. Russell Spitzer gave the history: TreeMapMetaStore looked similar to FoundationDB from an API view, so it served as a test backend while the original backend was built. The group reached general agreement to deprecate it in favor of a JDBC plus H2 solution for tests, with Dutra flagging a few tricky spots in polaris-core where tests lean on TreeMapMetaStore as a convenience with no obvious replacement. Cleaning up test infrastructure is the kind of work that pays off invisibly, by making every future change easier to verify.

Retention and observability got attention too. Yong Zheng and Adnan Hemani discussed a mechanism to purge the events and metrics table, with Hemani stressing that retention boundaries are necessary for the events system to scale and that maintenance jobs must always respect pre-set limits so an admin does not accidentally delete data. Zheng planned to start with generic cronjob support in Helm so newer maintenance jobs plug in cleanly. These threads connect to a broader push around metrics reporting and REST endpoints for table metrics and events, where Dmitri Bourlatchkov, EJ Wang, and Yufei Gu have been shaping how Polaris exposes operational data.
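
One way to read "maintenance jobs must respect pre-set limits" is a floor on retention that no purge request can go below. The sketch below takes that reading. The seven-day floor, the event shape, and the function names are all assumptions for illustration, not Polaris behavior.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical floor: no purge may delete events newer than this.
MIN_RETENTION = timedelta(days=7)


def purge_cutoff(now: datetime, requested_retention: timedelta) -> datetime:
    """Clamp the requested retention to the floor, then compute the cutoff."""
    retention = max(requested_retention, MIN_RETENTION)
    return now - retention


def purge(events, now: datetime, requested_retention: timedelta):
    """Return the events that survive the purge (those at or after the cutoff)."""
    cutoff = purge_cutoff(now, requested_retention)
    return [event for event in events if event["timestamp"] >= cutoff]


now = datetime(2026, 6, 16, tzinfo=timezone.utc)
events = [
    {"id": 1, "timestamp": now - timedelta(days=1)},
    {"id": 2, "timestamp": now - timedelta(days=10)},
    {"id": 3, "timestamp": now - timedelta(days=30)},
]
# A mistaken request for zero retention still keeps the last seven days.
print([event["id"] for event in purge(events, now, timedelta(days=0))])
```

The clamp is the guardrail: a misconfigured job can delete less than intended, but never more than the floor allows.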

Two threads pointed at where Polaris wants to grow. Adam Christian and Adnan Hemani advanced a proposal for semantic layer support, working through how a Dataset model nests under a Table or View and how its descriptions relate to Iceberg table property comments. A semantic layer in the catalog is a notable ambition, since it moves Polaris past pure metadata toward business meaning. Separately, EJ Wang and Adnan Hemani continued the OpenLineage proposal, agreeing to preserve the endpoint shape OpenLineage clients expect while clarifying the provider boundary behind it. Wang framed the lineage work as one usable vertical slice rather than independently mergeable parts that do nothing on their own.

Scale and security threads filled out the week. A GitHub-sourced discussion on the feasibility of one realm per tenant at 10,000 tenants tested how Polaris multi-tenancy holds up at scale, and a thread on a GCP counterpart to AWS STS session tags worked through credential vending across clouds. The cloud-portability question keeps recurring because a catalog that only works well on one cloud is a catalog with a ceiling.

The metrics and events work ran deeper than one thread. Alongside the purge discussion, Dmitri Bourlatchkov, EJ Wang, and Yufei Gu shaped a proposal for REST endpoints for table metrics and events and a related discussion on filters for Iceberg metrics reporting. Read these together and a picture forms. Polaris is building a full observability story: collect metrics and events, filter what gets reported, expose it through REST, and purge it on a retention schedule. That is the difference between a catalog you can run as a hobby and a catalog you can run as production infrastructure with audit and capacity planning built in.

Governance of the project itself drew a long thread. The discussion about actions on the merge button ran past twenty messages, with Adnan Hemani, Alexandre Dutra, Jean-Baptiste Onofré, and Robert Stupp working through how the project handles its GitHub merge workflow. A connected thread on fine-grain branch and tag creation control sorted out who can do what in the repository. These process threads look like housekeeping, but a young top-level project has to write down its rules, and the time spent here saves friction later. The discussion on multiple StorageConfigurationInfos per catalog between Alexandre Dutra, Dmitri Bourlatchkov, and Robert Stupp rounded out the architecture work, tackling how one catalog handles more than one storage backend.

The throughline for Polaris is maturation. Almost none of this week's work was a flashy new feature. It was error codes, persistence cleanup, retention policy, observability endpoints, multi-cloud credentials, and repository governance. That is exactly the work a catalog has to do to earn production trust. The community is choosing depth over breadth right now, and for anyone planning to run Polaris as their lakehouse catalog, that is the right order of operations.

The semantic layer proposal is the one thread that points the other way, toward ambition rather than hardening, and it is worth watching for what it signals. A catalog that only tracks tables and their locations is a metadata store. A catalog that understands datasets, their descriptions, and how they nest under tables and views is starting to hold business meaning, not just physical layout. If Polaris follows that thread, it stops competing only with other catalogs and starts touching the territory of semantic layers and metrics stores. That is a big stretch for a young project, and the careful way Adam Christian and Adnan Hemani worked through how a Dataset model relates to Iceberg table property comments suggests the community knows it. Pair the semantic layer ambition with the lineage work, and you can see Polaris reaching to be the place teams ask not just where a table lives but what it means and where its data came from. Whether it gets there is a multi-quarter question. That it is trying tells you how much the catalog layer is heating up.

Apache Arrow

Arrow ran quieter than Iceberg or Polaris this week, but its threads carried weight that reaches across the ecosystem.

Variant type support was the headline. Gang Wu reported that several efforts to implement the variant type in Arrow C++ are underway, that his colleague Zehua has been working on it for a while, and that iceberg-cpp depends on it to reach V3 feature completeness. Neelesh Salian welcomed getting variant into Arrow C++ so downstream projects benefit, and Micah Kornfield joined the discussion. Gang made a careful point about AI-generated code: variant is a complex feature that demands full spec compliance and native C++ performance, and it takes time to meet the Arrow C++ bar even when models produce decent code. The thread also surfaced a coordination problem, since duplicate efforts were underway, and the goal was to collaborate rather than build the same thing twice. Variant is the connective tissue here, a semi-structured type that Parquet, Iceberg, and Arrow all need to agree on, so getting the Arrow C++ implementation right unblocks the whole chain.

Arrow also gained a new language binding. Sutou Kouhei confirmed that the Arrow Erlang repository transferred to apache/arrow-erlang, following the donation vote, with Benjamin Philip handling the transfer and Kou preparing the repository for its next steps. A new binding widens Arrow's reach into the Erlang and Elixir world, which carries a strong community in telecom and distributed systems. Every binding turns Arrow from a library into a lingua franca that more languages can share without copying data.

The third Arrow thread was an infrastructure one shared with the rest of the foundation, covering consumption of ASF shared GitHub-hosted runners. Antoine Pitrou, Robert Thomson, and Sutou Kouhei worked through how much CI compute Arrow draws. That theme appears again below, because it hit nearly every project at once.

Arrow's quieter week still carries outsized weight for the stack. Arrow is the in-memory format that lets these projects pass data without serializing and copying at every boundary. When an Iceberg reader hands columns to a query engine, Arrow is often the shape those columns take. So an Arrow decision about variant is not an Arrow-only decision. It sets the in-memory representation that Parquet readers, Iceberg engines, and DataFusion query plans all inherit. Gang Wu's caution about meeting the C++ performance bar reflects that responsibility. A slow or incorrect variant in Arrow C++ would ripple into every tool that depends on it. The Erlang binding points the other way, outward, growing the set of languages that can speak Arrow natively and share data with the rest of the ecosystem without a translation tax.

The point Gang Wu made about AI-generated code deserves a moment on its own, because it captures a real tension in open source right now. Models can produce code that looks correct and even passes a first read. But a feature like variant has to match a written spec exactly and run at native C++ speed, and clearing that bar takes human review, benchmarking, and iteration that a quick generation does not provide. Arrow holds that line because everything downstream inherits its mistakes. It is a useful reminder that the hardest part of adding a feature to foundational software is not writing the first version. It is making the version correct and fast enough that thousands of dependent projects can trust it without checking. That standard is why the variant work moves deliberately, and why moving deliberately is the right call.

Apache Parquet

Parquet had the single most active thread of any project this week, and it was a big one: the future of how Parquet versions itself. The format also shipped a release in the middle of the debate, which made for a fitting contrast between long-term design and near-term delivery.

The discussion on the future of Parquet versioning ran past sixty messages and pulled in Russell Spitzer, Micah Kornfield, Andrew Lamb, Antoine Pitrou, Daniel Weeks, and many more. The core tension is old and real. Parquet has added features faster than its version story can describe them, and readers need a clear way to know which features a file uses. Micah Kornfield floated splitting the idea into two notions: a primary specification version that risks using features not yet widely adopted, and presets that give users a different way to configure feature bundles. Russell Spitzer argued the group is overcomplicating it, since everyone understands versions, and urged picking something simple and moving rather than deliberating. His point was practical: the worst case is making a different choice later, which beats sitting stuck and blocking progress on new encodings and footers. This debate decides how the ecosystem talks about Parquet capability for years, so the heat is earned.

A connected thread tried to make the current state legible. Andrew Lamb and Antoine Pitrou worked on documenting which features land in which versions of Parquet, the kind of reference that turns tribal knowledge into something a new implementer can read. The Parquet Footer Working Group held its second session, with Antoine Pitrou and Jiayi Wang continuing work on footer design, which ties directly to the read-latency goals showing up over in Iceberg.

While the versioning debate raged, the format shipped. Gang Wu drove the vote for Parquet Format 2.13.0 RC0, which passed with three binding plus-ones from Micah Kornfield, Andrew Lamb, and Gang, plus three non-binding from Neelesh Salian, Ed Seidl, and Russell Spitzer. Ed Seidl's note that the release brings usable float statistics, something the community waited on, captured the practical payoff. The release was announced once it cleared. A release that lands mid-debate is a healthy sign, since it shows the project can ship incremental value while it argues about the larger structure.

New logical types kept coming. Burak Yavuz moved the File logical type forward, submitting reference implementation PRs against parquet-format, parquet-java, and arrow-rs after the design doc settled, with Daniel Weeks recapping the discussion around the metadata and content_type fields. Rok Mihevc opened a discussion to introduce a FIXED_SIZE_LIST logical type, useful for fixed-length vectors of the kind that show up everywhere in machine learning feature data. Will Edwards and Jiayi Wang also dug into a clarification on row-group and column-chunk layout, and an INT96 statistics discussion drew Ryan Blue, Ed Seidl, and others. The steady stream of logical types shows Parquet adapting to AI-era data shapes, where fixed-size vectors and richer file references are common.

Look closer at those logical types and you can read where the data world is heading. A File logical type lets a Parquet column point at an external file with a content type attached, which is how a table starts to hold images, audio, PDFs, and other unstructured payloads next to its scalar columns. A FIXED_SIZE_LIST type stores a vector of known length, which is exactly the shape of an embedding. Put the two together and Parquet is quietly growing the vocabulary it needs to store the inputs and outputs of machine learning, not just the rows of a sales report. The format that won the analytics world by being a fast columnar store is stretching to hold the messy, high-dimensional data that AI workloads run on. That is a deliberate direction, and the people doing the work, Burak Yavuz and Rok Mihevc among them, are building it one careful logical type at a time.
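
The appeal of a fixed-size list for embeddings is that every row has the same length, so values can sit in one flat buffer and a row is found by arithmetic instead of an offsets lookup. The toy column below illustrates that idea only. It is not the proposed Parquet encoding, and the class name is hypothetical.

```python
class FixedSizeListColumn:
    """Toy column of equal-length vectors stored in one flat buffer."""

    def __init__(self, list_size: int) -> None:
        self.list_size = list_size
        self.values = []  # Flat buffer: row i occupies values[i*n : (i+1)*n].

    def append(self, vector) -> None:
        # Reject any vector whose length differs from the declared size.
        if len(vector) != self.list_size:
            raise ValueError(f"expected {self.list_size} values, got {len(vector)}")
        self.values.extend(vector)

    def __len__(self) -> int:
        return len(self.values) // self.list_size

    def __getitem__(self, row: int):
        if not 0 <= row < len(self):
            raise IndexError(row)
        start = row * self.list_size
        return self.values[start:start + self.list_size]


embeddings = FixedSizeListColumn(list_size=3)
embeddings.append([0.1, 0.2, 0.3])
embeddings.append([1.0, 2.0, 3.0])
print(len(embeddings), embeddings[1])
```

A variable-length list needs per-row offsets to find where each row starts. A fixed-size list can drop them, which is one reason the type suits vector data.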

The versioning debate deserves a last word because it is really a debate about trust. A Parquet file written today might be read five years from now by a tool nobody has built yet. The version story is the promise that file makes to its future readers about which features they need to understand it. Micah Kornfield's split between a spec version and presets tries to separate two questions that got tangled: what a file can contain versus what a given writer chooses to turn on. Russell Spitzer's counter is that perfect is the enemy of shipped, and a clear-enough answer now beats a perfect answer that arrives after another year of new encodings pile up undescribed. Both are right, which is why the thread ran past sixty messages. The resolution will shape how every engine in the ecosystem advertises and detects Parquet capability, and that is worth getting close to right even under deadline pressure.

The contrast inside the Parquet week is the real lesson. The project argued for sixty-plus messages about a deep structural question while simultaneously shipping 2.13.0 with usable float statistics and moving new logical types forward, with File reaching reference implementation PRs and FIXED_SIZE_LIST entering discussion. A less healthy project would let the big debate freeze the small deliveries. Parquet kept both moving. That ability to ship incremental value while wrestling with long-term design is what separates a format people depend on from one they merely tolerate.

Apache DataFusion

DataFusion's week was about people and direction. The project grew its leadership and set its sights on the back half of the year.

The headline was leadership growth. Andrew Lamb announced that Matt Butrovich joined the DataFusion PMC, drawing congratulations from Bruce Ritchie, Andy Grove, and a long list of contributors. Lamb's playful aside, that this time he really did mean PMC, hinted at the usual good-natured confusion that comes with back-to-back committer and member announcements. Neil Conway also drew recognition across committer and PMC threads the same week. A project that keeps promoting active contributors is a project with a healthy pipeline, and DataFusion's steady cadence of new committers and members is a strong signal under the hood.

Direction came through two threads. Andrew Lamb filed a discussion to coordinate the 2026 Q3 to Q4 roadmap, inviting the community to weigh in on where to take the project through a GitHub tracking issue. Lamb also ran a crowdsourcing thread for the ASF board report, the routine governance work that keeps a top-level project accountable. The community even fielded a PlusOne.apache.org interview thread with Rich Bowen, the kind of outreach that tells the broader foundation what DataFusion is up to. For a query engine that more lakehouse tools build on every quarter, a clear roadmap is a gift to everyone downstream who plans around it.

If you have not tracked DataFusion closely, here is why its quiet governance week still matters. DataFusion is a query engine written in Rust, built on Arrow, that other projects embed to run SQL and DataFrame workloads without writing an execution engine from scratch. It is the engine inside a growing list of databases and tools, which means its roadmap is not an internal matter. When DataFusion decides what to build in the back half of 2026, it sets the menu of features that every downstream product inherits. A team shipping a new analytics database on top of DataFusion plans its own year around that roadmap. So the Q3-to-Q4 thread, dull as a planning document sounds, is one of the more widely felt decisions in the Rust data world.

The people story matters for the same reason. A query engine is only as healthy as the bench of maintainers who can review and merge the hard changes. Promoting Matt Butrovich to the PMC and recognizing Neil Conway across committer and member threads widens that bench. Each new maintainer is one more person who can shepherd a tricky optimizer change or a new operator without waiting on a single overloaded reviewer. For downstream projects betting their execution layer on DataFusion, the depth of that maintainer pool is a risk metric, and this week it got a little deeper. The unglamorous work of governance and promotion is how an embedded engine earns the trust to sit at the center of other people's products.

Cross-Project Themes

Three patterns connected the lists this week, and each tells you something you cannot see by reading any single project.

The first was a shared infrastructure squeeze. Iceberg, Arrow, DataFusion, and Polaris all ran threads about consumption of ASF shared GitHub-hosted runners in the same window, with Bob Thomson and Robert Thomson surfacing the question across projects. When four busy projects independently confront their CI compute budget at once, it is not four problems. It is one foundation-level constraint reaching every active community at the same time. Iceberg even ran a parallel thread on reducing CI runner time by running JDK 21 only on main and nightly, a direct response to the same pressure. The lesson for anyone running a large open source project is that CI cost is now a first-class governance topic, not an afterthought.

The second pattern was the variant type and read performance moving in lockstep across the stack. Arrow worked on variant in C++, Parquet shipped logical types and debated versioning, and Iceberg pushed on combining read calls and voted its draft bitmap spec into the repository. These are not separate efforts. A variant value written in Parquet, described by Arrow, and read by Iceberg has to mean the same thing at every layer, and iceberg-cpp reaching V3 feature completeness depends on Arrow C++ getting variant right. The read-latency work threads through too, since the Iceberg call-combining discussion and the Parquet footer working group both chase the same goal of fewer, faster reads. The lakehouse is one system wearing five project names, and weeks like this make the seams visible.

The third pattern was maturation showing up everywhere at once. Iceberg shipped patch releases and trimmed CI. Polaris wrote retention policies and cleaned up test backends. Parquet documented which features live in which versions. DataFusion promoted maintainers and crowdsourced a board report. Read in isolation, each looks like ordinary housekeeping. Read together, they show a whole ecosystem crossing the same threshold in the same quarter, from the phase where you add features to the phase where you harden them. That synchronization is not a coincidence. These projects share contributors, share a release cadence, and share the same production users pushing them toward reliability. When the lakehouse stack matures, it tends to mature all at once, because the pressure comes from the same place: real workloads that need it to not break.

A human pattern ran under all three. The same contributors show up across projects. Gang Wu drove an Iceberg C++ release, voted on a Parquet release, and weighed in on Arrow variant. Andrew Lamb led DataFusion governance and voted on the Parquet release. Russell Spitzer argued Parquet versioning, shaped Iceberg's read path, and recalled Polaris history. Matt Butrovich fixed Iceberg manifests and joined the DataFusion PMC. The lakehouse ecosystem is held together by people who treat all of it as one project, and that overlap is why the formats stay compatible.

What This Means If You Run a Lakehouse

Mailing list threads can feel far from a production system, so here is the practical read for anyone whose job depends on this stack.

If you run Iceberg in production, the patch-release work matters most to you this week. The 1.11.1 and 1.10.3 fixes target real correctness bugs around manifest delete file sizes after rewrites and default value handling against Parquet metrics. Those are the kinds of bugs that quietly produce wrong results or bloated metadata, so plan to pick up the patch releases when they land. The V4 read-path work is further out, but it tells you where performance gains will come from next year, which is worth knowing if you are sizing hardware or planning a migration.

If you are evaluating Polaris as your catalog, this week is reassuring. The observability work, retention policies, persistence cleanup, and multi-cloud credential threads are exactly the boxes a platform team checks before trusting a catalog with production tables. A year ago Polaris was a promising young project. The work landing now is the work that turns promising into dependable. If you held off because it felt early, the maturation curve is bending in the right direction.

If you build on Parquet, which is nearly everyone, the versioning debate is worth following even though it will not change your files tomorrow. The outcome decides how tools advertise and detect capability, and that affects whether a file written by one engine reads cleanly in another. The new logical types are a longer-horizon signal: Parquet is preparing to hold embeddings and file references, so if your roadmap includes AI features on top of your tables, the format is growing toward you.

If you embed DataFusion or use a tool that does, watch the Q3-to-Q4 roadmap thread. It is the clearest public statement of what the engine will gain in the back half of the year, and planning your own work against it saves you from building something the engine is about to provide for free.

Looking Ahead

Next week the open questions hang on the threads that did not close. Watch the Parquet versioning thread for a decision, since Russell Spitzer's push to pick something simple and move may finally break the deadlock. Watch the Iceberg V4 read-path work, where the call-combining discussion and the bitmap draft both feed the latency goal. Watch Polaris turn its persistence and retention discussions into merged PRs, and watch the semantic layer proposal for how far the catalog stretches beyond metadata. The Spark 4.0 removal vote in Iceberg should also appear after the community sync. None of these are flashy. All of them shape the lakehouse you build on next year.

The deeper thing to watch is whether the catalog keeps absorbing responsibility. This week Iceberg moved label metadata and request signing toward the REST catalog, while Polaris built observability endpoints, retention, and a semantic layer proposal. Both projects are pushing in the same direction: the catalog stops being a thin pointer to table locations and becomes the place where governance, security, lineage, and even business meaning live. If that trend holds, the catalog you pick will matter as much as the table format you pick, maybe more. That is a real shift from how teams thought about this stack two years ago, when the format was everything and the catalog was an afterthought. Keep an eye on it, because it changes how you should evaluate the whole lakehouse.

One more thing worth tracking is the CI compute question, dull as it sounds. Four projects hit it the same week, which means the foundation is feeling a real constraint. How the ASF and these communities resolve who pays for shared runners will shape how fast they can ship. Open source velocity is not free, and this week made the bill visible. The resolution will not make headlines, but it will quietly set the pace of everything else on this list.

