AEC firms are starting to recognise the value they could extract from their data, if only it weren’t scattered across countless systems, formats and project contributors. The answer for many is a shift away from proprietary files in favour of cloud-based databases that promise genuine ownership of data and better control over it
As companies in the AEC industry digitise, it’s increasingly recognised that their most valuable asset is not to be found in drawings, models, or even the tools used to produce them. Their most valuable asset is the data buried inside every project – data that captures geometry, relationships, parameters, costs, clashes, RFIs, site records and more.
This data represents the accumulation of knowledge that a firm may have spent decades developing and supporting. Yet much of it is scattered across different formats and systems, often locked behind proprietary file structures and metered cloud APIs.
This familiar situation is accompanied by an uncomfortable truth: employees at these firms can open their files, they can view their models, but the firm does not meaningfully own or control the data within them. If employees want to analyse it, run automation on it, or train an AI model using it, they must purchase additional subscriptions from their software provider, or even additional applications. As the industry chases meaningful digital transformation, this dependency has increasingly become a strategic liability.
Prefer a quick overview?
Jump to the executive summary at the end
But there is an alternative. AEC firms can benefit by shifting from files to their own in-house, cloud-based databases, gaining long-overdue control of their data and possibly freeing themselves from proprietary bottlenecks.
The answer may lie in data lakes and data lake houses, which offer an open data architecture where project data can live as governed, queryable, interoperable information, rather than isolated blobs inside Revit RVTs, AutoCAD DWGs or proprietary cloud databases. This is the landscape into which the industry is now moving.
RIP files
For thirty years, the AEC industry has revolved around files: RVTs, DWGs, IFCs, 3DMs, PDFs, COBie spreadsheets and thousands of others, all acting as containers for design intelligence. In the desktop era, this was entirely logical. Everything had to be saved, packaged, versioned and emailed. Files were the only workable abstraction.
But the reality of modern project delivery is vastly different from that of the 1990s. Today’s projects involve thousands of files, generated by hundreds of tools, frequently being used by people working across multiple firms. The result is a digital environment that is fragmented, brittle and slow, one in which coordination headaches, data duplication and time delays are not bugs in the system but features of the file-based architecture itself.
Meanwhile, cloud platforms, APIs and increasingly, artificial intelligence (AI) are becoming the defining technologies of modern workflows – and these technologies do not want files. They want structured, granular, persistent data that is streamable, queryable, validated and can be used by multiple systems simultaneously.
A monolithic RVT file cannot support real-time analysis, multimodal AI or firm-wide automation. It was never designed to perform that way.
In fact, firms are realising that their true competitive asset isn’t a file at all, but the data inside it, with its representations of objects, parameters, schemas and historical decisions. A file is simply a container that slows everything down, and because most file formats are proprietary, the data ends up trapped inside someone else’s business model.
Few voices have been more consistent, or more technically grounded, on this point than Greg Schleusner, principal and director of design technology at HOK, who also represents the Industry Data Consortium (IDC). For years, Schleusner has argued that the AEC industry must stop treating BIM as a file-based activity and start treating it as data infrastructure.
At NXT BLD, Schleusner laid out the problem plainly. Revit knows everything about an object the moment it is drawn, he told attendees, but that intelligence is locked inside a file until someone performs an export. That could be hours, days or weeks too late. As he put it: “It’s never been an issue getting the metadata out from Revit. It’s always just been the geometry that’s been the slow part.”
Schleusner began his presentation by analysing how the media & entertainment industry has solved similar problems. Pixar’s USD format became the standard for exchanging complex geometric and scene information across tools. Its depth far exceeds IFC, but it has no concept of BIM data. To tackle this, the Alliance for OpenUSD, whose members include Nvidia and Autodesk, aims to extend USD into AEC, but the work is still in progress and far from satisfying BIM data requirements.
HOK’s Greg Schleusner speaking at NXT BLD 2022.
The more radical step in Schleusner’s research is his call to stop thinking about BIM models as files at all. Instead of exporting federated models, or waiting for ‘Friday BIM drops’, he proposes streaming every BIM object – every wall, window, beam, annotation – into an open database the moment it is authored. Each element would have its own identity, lifecycle and change history and would be immediately available for clash detection, rule validation, energy checks or analytics while the model is still being authored.
This vision is now driving a broader conversation about how AEC firms should manage their project data, and why data lakes and lake houses are becoming unavoidable.
Data lakes 101
To understand why the ‘data lake house’ has become such an important issue in AEC, it’s worth stepping back and looking at how other industries have navigated the data problem. The first major attempt to manage organisational data at scale arrived in the late 1980s with the data warehouse. Warehouses were designed for one job: to pull clean, structured information out of operational databases through a strict ETL (extract, transform and load) pipeline and then serve it up as predefined reports.
The data warehouse did this well, but only within narrow boundaries. Data warehouses were expensive, rigid and completely unprepared for the tidal wave of semi-structured and unstructured content that would later define the digital world, including images, logs, documents, telemetry, and later, multimodal information.
By the early 2000s, the so-called Big Data era had arrived. Organisations in every sector began generating vast amounts of high-velocity, highly varied data that had no obvious schema, and traditional warehouses were overwhelmed. Their rigid structure was a poor fit for unpredictable information.
The tech industry response was the data lake, an architectural about-face: instead of structuring data before storage, firms dumped all data – structured, semi-structured, unstructured – into a cheap cloud object store such as Amazon’s S3, and only transformed it later as needed. This ELT approach (extract, load, transform) offered enormous flexibility and scale, giving rise to a marketing narrative that data lakes were the future.
But early data lakes soon developed severe problems. Without governance or schema control, they became data swamps – vast, murky repositories in which inconsistent, duplicated and unvalidated data accumulated in an uncontrolled manner. Querying could be painfully slow. Trust in data deteriorated. And critically, data lakes lacked transactional integrity: one system could be reading data as another was rewriting it, resulting in broken or inconsistent results.
In short, data lakes solved storage issues but broke reliability, governance and performance – three qualities that AEC firms need more than most.
The data lake house emerged as the solution to this tension. It combines the low cost, infinitely scalable storage of a data lake with the structure, reliability and transactional control of a warehouse. It does this by adding a relational-style metadata and indexing layer directly over open format files stored in cloud object storage.
HOK’s Greg Schleusner spoke at NXT BLD 2025.
Watch the full presentation here
This hybrid design is the critical step that can turn a loose collection of files into something that behaves like a ‘proper’ database. With this metadata layer in place, a lake house can guarantee ACID (atomicity, consistency, isolation, durability) transactions, meaning that multiple systems can read and write simultaneously. It enforces schema, so project data follows well-defined structures. It maintains full audit trails, so lineage and accountability are preserved. And it allows BI tools, analysis engines and AI models to run directly on the live dataset, rather than duplicating extracts and creating conflicting versions.
For AEC, this is not just convenient. It is foundational. AEC project data is inherently multimodal and includes solid models, meshes, drawings, schedules, reports, energy data, specifications, RFIs, documents, photos and point clouds, as well as the metadata that ties them together. A single Revit file can contain thousands of elements with their own parameters, relationships and behaviours. Trying to run AI, automation or cross-disciplinary analysis on this information using file-based workflows is like trying to do real-time navigation with a paper map that’s updated once a week.
In short, the lake house shifts the paradigm. A project dataset sits in one place, in open formats, behaving like a continually updated, queryable database. The file no longer defines the project; the data does.
At the base of most lake houses sits Apache Parquet, a columnar storage format that has become the industry standard. Parquet stores data by column rather than by row. That may sound minor, but it is transformative for analytical workloads. Most queries only need a few columns, so engines can read exactly what they need and skip everything else, reducing I/O dramatically. This is crucial in AEC, where models can contain thousands of parameters, but only a handful are needed for any given check.
Parquet’s openness is equally important. It avoids the proprietary format trap that has hamstrung AEC for decades. Once your project data is in Parquet, any tool in the open ecosystem, from Python and Rust libraries to cloud engines like Databricks, Snowflake or open-source query engines, can read it natively. You no longer need to negotiate with vendors or wait for APIs to mature. The data is yours.
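To make this concrete, here is a minimal sketch of the kind of selective read Parquet enables, using the open-source pyarrow library. The file name and column names are purely illustrative rather than any published AEC schema; the point is that only the requested columns are ever pulled from storage.

```python
# Minimal sketch: a wall fire-rating check that reads only the columns it
# needs from a hypothetical Parquet extract of model elements.
import pyarrow.parquet as pq

# Only these three columns are deserialised; every other parameter stored
# in the file is skipped entirely, keeping I/O low.
table = pq.read_table(
    "project_elements.parquet",
    columns=["element_id", "category", "fire_rating"],
)

elements = table.to_pandas()
unrated = elements[(elements["category"] == "Walls") & (elements["fire_rating"].isna())]
print(f"{len(unrated)} walls are missing a fire rating")
```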
If Parquet provides the necessary storage, then Apache Iceberg provides the intelligence. Iceberg is an open table format originally developed at Netflix to bring reliability, versioning and high performance to massive data lakes. It adds a metadata layer that tracks the state of a table using snapshots. Each snapshot refers to a manifest list, which itself points to a set of manifest files acting as indexes for the underlying Parquet data.
A manifest is effectively a catalogue: it lists which Parquet files belong to a table, how they are partitioned, what columns they contain and how the dataset has changed over time. Rather than scanning thousands of files to answer a query, Iceberg reads the manifests and instantly understands the structure.
This is an elegant solution with significant consequences. It delivers consistent views, because every query targets a specific snapshot, ensuring results remain complete and coherent even as new data is being written. It provides true transactional safety, where any change is either fully committed as a new snapshot or not committed at all. And it supports genuine schema evolution, allowing columns to be added, removed or renamed, without having to rewrite terabytes of Parquet files, an essential capability for long-lived AEC datasets. Together, Parquet and Iceberg deliver the reliable, unified project database that the AEC industry has never had before.
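As an illustration of how this behaves in practice, the sketch below uses the open-source pyiceberg library to read from the current snapshot and evolve a schema in place. The catalog configuration, table name and column are assumptions made for the example, not part of any IDC or vendor specification.

```python
# Hypothetical example: querying and evolving an Iceberg table of model
# elements with pyiceberg. Assumes a catalog is already configured
# (e.g. via a local .pyiceberg.yaml) and that the table exists.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("default")
table = catalog.load_table("projects.elements")

# Every scan targets a specific snapshot, so the result stays complete and
# coherent even if another process commits new data mid-query.
elements = table.scan(selected_fields=("element_id", "category")).to_arrow()
print(table.current_snapshot())

# Schema evolution: add a column without rewriting the underlying Parquet files.
with table.update_schema() as update:
    update.add_column("acoustic_rating", StringType())
```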
Big AEC benefits
Once AEC project data sits inside an open lake house – structured, governed and queryable – the benefits begin to accumulate quickly. The most immediate shift is that data finally becomes decoupled from the authoring tools that produced it. Instead of each BIM package jealously guarding its own silo of geometry and metadata, the project information lives in a neutral space where any tool can read it.
This single change unlocks capabilities that, until now, have been aspirational rather than practical. Firms can build their own QA systems that directly interrogate geometric and metadata standards across every project model, regardless of whether the source was Revit, Tekla Structures, Archicad or anything else. There is no export step, no format translation, no broken parameters. The data is just there, in a clean schema, ready to be queried.
With proper schema enforcement, elements, parameters, cost codes and property sets follow firm-wide standards. That finally makes automated reporting reliable, rather than a brittle sequence of half-working scripts – something the industry has talked about for a decade.
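As a hypothetical example of the kind of cross-tool QA check this enables, the query below runs directly over Parquet extracts using DuckDB. The file path and column names are invented for illustration; what matters is that the same rule applies to every project, whichever tool authored the model.

```python
# Hedged sketch: a firm-wide QA report run straight on Parquet extracts.
# The glob path and columns are assumptions, not a published schema.
import duckdb

missing = duckdb.sql("""
    SELECT project_id, source_tool, COUNT(*) AS doors_without_cost_code
    FROM read_parquet('lakehouse/elements/*.parquet')
    WHERE category = 'Doors' AND cost_code IS NULL
    GROUP BY project_id, source_tool
    ORDER BY doors_without_cost_code DESC
""").df()

print(missing)  # the same check, every project, regardless of authoring tool
```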
Once a dataset becomes trustworthy, AI and machine learning models become dramatically more effective. Instead of scraping data from a handful of projects, a firm can train predictive systems on its entire project history. Models can forecast costs from early design parameters, identify risky design patterns from past change orders, predict schedule risks or optimise building layouts for energy performance. These capabilities are not theoretical. They are exactly the sort of AI workloads that other industries have been running for years, but which have been hamstrung in AEC due to poor data foundations.
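A rough sketch of what such a workload might look like appears below: a simple cost model fitted on historical project records pulled from the lake house. The file path, feature names and target column are all assumptions, and a real predictive system would need far more data preparation and validation than this.

```python
# Illustrative only: fitting a simple cost predictor on (hypothetical)
# historical project records stored as Parquet in the lake house.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

history = pd.read_parquet(
    "lakehouse/project_history.parquet",
    columns=["gross_floor_area", "storeys", "facade_ratio", "final_cost_per_m2"],
).dropna()

X = history.drop(columns=["final_cost_per_m2"])
y = history["final_cost_per_m2"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor().fit(X_train, y_train)
print(f"R² on held-out projects: {model.score(X_test, y_test):.2f}")
```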
The same architecture also enables genuinely federated collaboration. Instead of exchanging bloated files, firms can give project partners secure, query-level access to precisely the objects or datasets they require, all drawn from a live single source of truth. A clash engine could read from the same table as a cost tool, which could read from the same table as an AI model or an internal search engine. That interoperability is the essence of BIM 2.0: a move away from document exchange and toward continuous data exchange.
In short, the lake house doesn’t just solve a technical problem. It opens a strategic opportunity: for firms to build their own intellectual property, automation tools and data-driven insights on top of a foundation that is finally theirs.
Automation penalties
As the industry accelerates toward automation-heavy workflows, the commercial incentives for large software vendors are beginning to shift in uncomfortable ways. Automation reduces the number of manual, named-user licences — the traditional revenue backbone of the design software business. And history suggests vendors rarely accept declining per-seat income without looking for compensatory levers elsewhere.
In today’s tokenised subscription world, that compensation mechanism may lead to higher token prices, steeper token consumption rates for automated processes, and an overall rebalancing designed to recover revenue lost to more efficient, machine-driven workflows. In effect, the more automation delivers value to practices, the more vendors will seek to recapture that value through metered usage.
This is precisely why the conversation around data lakes and lakehouse architectures matters so much. Owning your data is no longer a philosophical stance, it’s a strategic defence. If automation becomes a toll road, then firms need control of the highway. By centralising and owning their data, and by running automation on their own playing field rather than someone else’s, practices can decouple innovation from vendor metering and protect themselves from being priced out of the very efficiencies automation is meant to provide.
Data extraction
The next challenge is practical: how does the industry transition from thousands of RVTs, DWGs, IFCs and other formats to an environment in which project data lives as granular, structured, queryable information?
For past projects, there is no shortcut. Firms must extract their model archives – geometry, metadata, relationships – and convert them into open formats that the lake house can govern. This is labour-intensive but unavoidable if firms want their historical data to fuel analytics and AI.
But the real transformation begins with live projects. At HOK, Schleusner has no interest in continuing to export files forever. He is designing a future in which BIM data streams from authoring tools directly into the lake as it is created. Instead of waiting days or weeks for federated models or periodic exports, the goal is a steady flow of BIM objects, with each wall, room, door, system and annotation arriving in the lake house in real time.
This turns the lake house from an archive into a live, evolving representation of the project. Real-time clash detection stops being a dream and becomes standard practice. Rules-based validation can run continuously instead of catching issues once a week. Analytics and AI can operate on the dataset as it changes, not after the fact.
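The sketch below illustrates the general idea rather than any vendor’s API: each authored object arrives as a small, self-describing event, and validation rules run the moment it lands, instead of waiting for a weekly export. The event fields and rules are invented for the example.

```python
# Conceptual sketch of object-level streaming: a delta event per element,
# validated continuously as the model is authored. All fields are invented.
from datetime import datetime, timezone

def on_element_changed(event: dict) -> list[str]:
    """Run lightweight rule checks on a single object-level delta."""
    issues = []
    params = event["params"]
    if event["category"] == "Doors" and params.get("FireRating") is None:
        issues.append(f"{event['element_id']}: door has no fire rating")
    if params.get("Level") is None:
        issues.append(f"{event['element_id']}: element not assigned to a level")
    return issues

# One delta, as it might be streamed while the model is still being authored
delta = {
    "element_id": "door-000123",
    "category": "Doors",
    "action": "modified",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "params": {"Level": "L02", "Width": 910, "FireRating": None},
}
print(on_element_changed(delta))
```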
But there are practical barriers to all this. The first stumbling block for the live streaming of BIM objects – inevitably – is Revit. When Schleusner approached Autodesk to ask whether Revit could stream objects as they are created, as opposed to files being saved, the answer was an unambiguous ‘No’. Revit’s underlying architecture was simply not built for this. The geometry engine and much of the historic core remain predominantly single threaded, making real-time serialisation and broadcast of object-level changes impractical without significant performance penalties. In other words, the model cannot currently emit deltas as they occur.
Yet several vendors have found ways to work around these constraints. Rhino.Inside can interrogate Revit geometry dynamically. Motif claims to capture element-level deltas as they change. Speckle has developed its own incremental update mechanism capable of extracting small, structured updates, rather than monolithic payloads. Christopher Diggins, founder of Ara 3D and a contributor to the BIM Open Schema effort, has also demonstrated experimental object streaming from Revit and recently released a free Parquet exporter for Revit 2025.
A step in the right direction for Autodesk is its granular data access in Autodesk Construction Cloud (ACC), which generates separate feeds for geometry and metadata – but only after the file is uploaded and processed. This is useful for downstream analysis, but it is not true real-time streaming.
Meanwhile, Graphisoft is in the early stages of developing its own data lake infrastructure to support the many Nemetschek brands and their schemas. It’s a trend that is now pervasive among the core AEC software suppliers.
As Schleusner puts it: “We don’t want to do what we currently have to do, which is every design or analysis tool running its own export. That’s just dumb.”
What the industry needs, he argues, is single export with multi-use. Data should be extracted once into an open, authoritative environment from which every tool can read and act. By putting BIM data into a shared platform, every tool, internal or external, can consume that data dynamically, without half the industry re-serialising or rewriting half the model every time they need to run a calculation, test an option, or update a view.
His experiments have ranged from SAT and STL to IFC and mesh formats, but none have provided the fidelity and openness he needs. His preference today is for open BREP – rich, precise, and free from proprietary constraints.
This is where the next piece of the puzzle appears: the Industry Data Consortium (IDC). This group is emerging as the most significant collective data initiative the AEC sector has seen. It is a public-benefit corporation, comprising many of the largest AEC firms, primarily drawn from the American Institute of Architects (AIA) Large Firm Roundtable. These firms are pooling resources to solve shared data problems that no individual firm could tackle alone.
Schleusner joined the IDC’s executive committee nearly three years ago and brought with him a clear, technically grounded vision: to create a vendor-neutral foundation for project data that enables the streaming, storage and governance of BIM objects in an open, queryable architecture.
He is candid about why the industry hasn’t done this before: “The reason this has not been done or thought of being done today is because there’s no open schema that can actually hold drawing information, solid model representation and mesh representation.”
In his research, the closest fit he has found so far is Bentley Systems’ iModel, a schema and database wrapper that can store BIM geometry, metadata, drawings and meshes while supporting incremental updates. Crucially, iModel is now open source. That gives the IDC something the industry has never had: an adaptable, extensible schema that can act as the wrapper for all the lovely data.
There are caveats, of course. Solid models in iModel use Siemens’ Parasolid kernel, which is still proprietary. Some translation challenges remain. But as a starting point for an industry-wide intermediary layer, it is far further along than anything Autodesk, Trimble or Graphisoft have offered, although Bentley will still need to do some re-engineering.
The IDC is no longer in its prototype phase. It is actively building real tools for its member firms: extraction utilities, schema definitions, lake house integrations, and proof-of-concept pipelines that show Revit, Archicad, Tekla Structures and other tools publishing into a shared, vendor-neutral space.
The goal is not another file format. It is a live representation of BIM objects that can feed clash engines, QA systems, cost tools, search engines and AI pipelines without rewriting half the model each time.
The IDC also plans to support AI directly. “We’re going to start hosting an open-source chat interface that can connect to IDC, provide data and individual firms’ data, keeping them firewalled,” says Schleusner.
Another technology layer, LanceDB, built on the open Lance columnar format, is also being evaluated. Lance is emerging as one of the most compelling formats for a modern AEC lake house, because it solves a fundamental problem the industry is running into headlong: multimodal data at scale.
BIM models are only one part of a project’s digital footprint. The real world adds drawings, specifications, RFIs, emails, photos, drone footage, point clouds, and increasingly, AI-generated embeddings. Traditional columnar formats like Parquet handle tabular data well, but struggle when you need to store and version media, vectors and other non-tabular assets in the same unified system.
Lance was designed for this exact world. It brings the performance of a high-speed analytics engine, supports zero-copy data evolution, and treats images, video and embeddings as first-class citizens. Netflix built its media data lake on Lance because its data is inherently multimodal – but so is that of AEC firms. A lake house built on Lance can finally treat all project information, from geometry to documents and media, as one coherent, queryable dataset.
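As a hedged illustration, the snippet below uses the open-source LanceDB library to store image embeddings alongside ordinary project metadata and run a similarity search over them. The table, fields and vectors are made up; in practice the embeddings would come from an upstream AI model.

```python
# Hypothetical example: site photos, their metadata and their embeddings
# living together in one queryable table via LanceDB.
import lancedb

db = lancedb.connect("./aec_lakehouse")

photos = db.create_table(
    "site_photos",
    data=[
        {"photo_id": "P-001", "zone": "Level 03", "rfi": "RFI-112",
         "vector": [0.12, 0.48, 0.33, 0.91]},
        {"photo_id": "P-002", "zone": "Level 05", "rfi": None,
         "vector": [0.77, 0.10, 0.55, 0.20]},
    ],
)

# Vector similarity search, with normal metadata returned alongside the hit.
hits = photos.search([0.10, 0.45, 0.30, 0.90]).limit(1).to_list()
print(hits[0]["photo_id"], hits[0]["zone"])
```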
This is the first genuine attempt to build a shared AEC data infrastructure, one that is not driven by a vendor, but instead by firms who actually produce the work.
Another uncomfortable truth is that even if the industry succeeds in building a lake house for BIM data, model geometry and parameters alone are not enough to power meaningful, holistic AI across project data. A model on its own misses the context that lives in issues, approvals, RFIs, change orders, design intent and all those emails.
Worse, the data that does exist is scattered across owners, architects, engineers, specialists, CDEs and contractors. No single party holds the entire picture.
As Virginia Senf at Speckle explains: “It may be the large general contractors and top-tier multidisciplinary consultants who are best positioned to assemble project datasets, because owners are now demanding outcomes, not drawings.
“AECOM’s recent shift toward consultancy reflects this. But even if you gather everything, historical models are often inconsistent or simply wrong, a lot of legacy BIM data is unsuitable for analytics or AI at all.”
The way forward
The shift to a data lake house isn’t an IT upgrade. It is a re-platforming of the AEC business model. Firms have spent decades selling hours, yet the real value they have generated – the patterns, insights, decisions and accumulated knowledge encoded in their project data – remains locked inside proprietary files and vendor ecosystems.
A lakehouse finally gives firms a way to monetise what they actually know. Data stops being a dormant archive and becomes a living asset that can predict outcomes, guide design intelligence, improve bids and reduce risk.
What makes this moment significant is that the architecture is now proven. Open formats such as Parquet and Iceberg have stabilised. Cloud object storage is cheap and mature. Tools capable of extracting BIM data into open schemas exist. And, crucially, the first coordinated industry effort, the IDC, is bringing firms together to build a shared, vendor-neutral foundation for the next decade of digital practice.
Moving from siloed, proprietary files to an open, unified, AI-ready lake house is the clearest path to future-proofing an AEC firm, especially when software companies are increasingly looking to toll the ‘automated’ seat licences used to batch-process workflows through APIs that access the data.
The lakehouse replaces brittle integrations and repetitive exports with a single source of truth that every tool, and every emerging multimodal AI system, can build upon – and your practice will own its own data.
If BIM 1.0 was about authoring tools, BIM 2.0 is about the data itself — structured, queryable and controlled by the firms who produce it.
The IDC architecture is currently in development and will be adopted by its members when it’s ready. Wider distribution is being considered. For now, a number of very large firms are experimenting internally, deploying resources to build their own lakehouse stacks, mainly to own their IP, run their own applications and experiment with creating bespoke AI tools and agents.
AEC Magazine will continue to explore this topic from a number of different angles in 2026 and it will certainly be a hot topic at NXT BLD in London – 13-14 May.
To join in this development work, membership is available through the IDC.
Executive summary for normal humans
To many in the AEC industry, data lakes and lake houses sound very much like approaches about which only a CIO or software engineer would care. But the truth is far simpler: this is about finally getting control of your own project information, and stopping the madness of exporting, duplicating and re-formatting the same models over and over again.
Think of a BIM file as a shipping container. Everything you need is technically inside it, but you can only open it from one end and moving it around is slow and clumsy. If you need one box from the back, you still have to haul the entire container to take out its contents.
A lake house is the opposite. It behaves like a warehouse in which every object inside a project – every wall, room, door, parameter, schedule item, photo, scan or RFI – sits neatly on a shelf. It is indexed, searchable and instantly accessible. Nothing has to be unwrapped, exported or repackaged. Every tool, whether internal or external, sees the same live information at the same time. This immediately solves three of the most familiar pain points in BIM delivery.
The first is speed: in a file-based world, clashes are found tomorrow, or Friday, or at the coordination meeting. In a lake house, clashes appear while someone is still modelling the duct. Checking rules, validating properties, or running energy assessments can happen continuously, not in slow cycles defined by exports.
Second comes ownership: right now, firms only ‘own’ their data in theory. In practice, it sits inside proprietary formats and cloud silos, to which access is metered, limited or simply not available. A lake house flips that. The data sits in open formats you can control yourself. Vendors don’t get to decide what you can do with your own project information.
The third issue is AI deployment: every firm wants to apply AI to its project history, but almost no firm can, because the data is so scattered. A lake house finally puts all data in one well-governed venue, so that AI tools can use it. The future will be AI agents working on your project information: cost prediction, design optimisation, risk profiling, automated documentation – all the things that many professionals in the AEC industry would love to see.
This shift isn’t about technology for its own sake. It’s about reducing rework, stopping duplication, ending lock-in, improving quality and giving firms back control of the knowledge they already produce. For an industry built on coordination, clarity and shared understanding, it sounds like a transformation we need.
Explainer: Netflix’s lake house solution
Some years ago, Netflix faced almost exactly the same problem the AEC industry struggles with today: mountains of fragmented, multimodal data, scattered across incompatible systems. Its crisis wasn’t content creation. It was scale, complexity and chaos – the same forces reshaping digital AEC.
Netflix’s media estate spanned petabytes of wildly different content: video, audio, images, subtitles, descriptive text, logs, user-generated tags and, increasingly, AI-generated embeddings. Each data type lived in its own silo. Data scientists spent more time hunting and cleaning data than training models. Collaboration slowed. Infrastructure costs soared. No one had a unified view of the organisation’s own assets.
To solve this, Netflix built a ‘media data lake’, using an open multimodal lake house architecture centred on the Lance format. Lance allowed Netflix to store every data type in one system, with relational-style metadata, versioning and schema evolution layered directly over cloud object storage. Critically, Lance enabled zero-copy data evolution — teams could add new AI features, such as embeddings or captions, without rewriting the underlying petabytes of source video.
The parallel with AEC is obvious. AEC’s data is just as varied and, just as with pre-lake house Netflix, this information is locked inside single-purpose tools. Every application exports its own version of the truth. Every team maintains its own copy. Every AI initiative begins with cleaning up someone else’s chaos.
The Industry Data Consortium’s pursuit of “single export, multi-use” is effectively the AEC version of Netflix’s journey. By extracting project data into open Parquet tables and managing them via Iceberg, firms gain a single source of truth that supports transactional integrity, schema governance and reliable engineering workflows.
Suddenly, AI pipelines, energy report parsers, internal search tools and firm-wide assistants become possible — because the data is unified, structured and finally under the firm’s control.
Explainer: Parquet and Iceberg
Parquet and Iceberg form the backbone of the AEC data lake house. Why?
Apache Parquet has emerged as the AEC industry’s open file format of choice, storing data by column rather than by row – a simple shift that transforms performance. Most analysis only needs data from a handful of columns, so engines can read exactly what they need and skip everything else. When working with AEC data – thousands of BIM parameters, millions of elements – this reduction in I/O is essential.
Apache Iceberg, originally developed at Netflix, brings database-like intelligence to raw Parquet files. It adds a metadata layer that tracks tables using snapshots, each one referring to a set of manifest files. A manifest is essentially an index: a lightweight catalogue listing exactly which Parquet files belong to the table, what they contain, and how the dataset is partitioned.
Instead of scanning thousands of files to understand the table, Iceberg simply reads the manifests. This gives the lake house the qualities AEC desperately needs: consistent views, transactional safety, version control and the ability to evolve schemas without rewriting terabytes of data.