Using the Data Lakehouse for Vault CRM

Customers can access a Data Lakehouse containing a complete and up-to-date copy of their Vault CRM data. The Data Lakehouse retrieves all Vault CRM data using Vault Platform’s Direct Data API. Retrieved data is published as Apache Iceberg™ tables on S3, enabling customers to query Vault CRM data in place (Zero-Copy) or copy it out to an external data warehouse.

Every CRM Vault has its own Data Lakehouse. Data is read-only, and any metadata or data changes in Vault CRM are reflected in the Data Lakehouse within 30 minutes, ensuring queries return accurate, up-to-date results.

Interacting with the Data Lakehouse

Customers can interact with the Data Lakehouse in the following ways to access or move data.

Zero-Copy

The Zero-Copy model eliminates the need for customers to ingest or duplicate data, removing ETL pipelines while ensuring customers always have the most current Vault CRM dataset. The Data Lakehouse uses a Veeva-managed partitioning scheme to ensure consistency across all Vaults and compatibility with Iceberg-aware engines. Tables are partitioned by record ID, which supports efficient object-level updates and predictable refresh cycles. This partitioning strategy is optimized for universal consumption rather than workload-specific performance. Customers who require alternative partitioning for storage, analytics, or performance needs can use the Copy-Out model.
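
As an illustration of querying in place, an Iceberg-aware engine integrated with the Glue Catalog can read the tables directly from Veeva-managed S3. The catalog, database, and column names below are hypothetical placeholders; actual names depend on your engine integration and Vault configuration:

```sql
-- Query Vault CRM data in place; no extraction or ETL pipeline required.
-- "glue_catalog" and "vault_crm" are placeholder names for the catalog
-- and database configured during enablement.
SELECT id, name__v, modified_date__v
FROM glue_catalog.vault_crm.account__v
WHERE modified_date__v >= DATE '2025-01-01'
LIMIT 100;
```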

Copy-Out

The Copy-Out model enables customers to bring Vault CRM data into their own warehouses to support other use cases. Copy-Out allows customers to apply domain-specific partitioning for performance optimization, or to tightly integrate Vault CRM data with existing enterprise data. This is recommended for long-term data archiving scenarios, where customers may need extended retention or cost-efficient storage tiers. This approach provides customers full control while still using the Data Lakehouse as the authoritative Vault CRM data source.
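
A Copy-Out can be as simple as a create-table-as-select into a customer-owned warehouse with a workload-specific partitioning scheme. The sketch below uses Spark SQL syntax; the target schema and the partition column are hypothetical:

```sql
-- Copy account__v out of the Data Lakehouse into a customer-owned table,
-- repartitioned by a domain-specific column (hypothetical) for analytics.
CREATE TABLE analytics.account_copy
USING iceberg
PARTITIONED BY (country__v)
AS SELECT * FROM glue_catalog.vault_crm.account__v;
```

Scheduling a job like this keeps the copy refreshed while the Data Lakehouse remains the authoritative source.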

Viewing Deleted Records in the Data Lakehouse

For downstream data reconciliation, deleted records in Vault CRM are exposed in a single, consolidated table named z_deleted_object_records. Each row in the table represents a Delete event, covering the Vault's deletion history for up to the past four months. The table contains the following columns:

  • id – The ID of the deleted record
  • deleted_date – The timestamp when the delete occurred in Vault CRM
  • object – The API name of the corresponding object
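
For example, a reconciliation job could pull recent deletions for a single object and remove the matching rows downstream. The column and table names come from the schema above; the date arithmetic syntax varies by engine:

```sql
-- Find account__v records deleted in the last 7 days, newest first.
SELECT id, deleted_date
FROM z_deleted_object_records
WHERE object = 'account__v'
  AND deleted_date >= CURRENT_DATE - INTERVAL '7' DAY
ORDER BY deleted_date DESC;
```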

Time Travel

The Data Lakehouse uses Apache Iceberg tables, which support snapshot-based Time Travel. This allows customers to query Vault CRM data as it existed at a previous point in time by referencing a past snapshot or timestamp. Time Travel is useful for reproducing historical analyses, validating changes, and comparing object states over time without maintaining separate historical tables. Query syntax varies by engine, but typically involves specifying a snapshot ID or timestamp.
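
For example, in Spark SQL a past state can be queried with TIMESTAMP AS OF or VERSION AS OF; other engines such as Snowflake use their own equivalents. The table name and snapshot ID below are placeholders:

```sql
-- Query the table as it existed at a given point in time (Spark SQL syntax).
SELECT COUNT(*) FROM account__v TIMESTAMP AS OF '2025-06-01 00:00:00';

-- Or pin the query to a specific Iceberg snapshot ID (placeholder value).
SELECT COUNT(*) FROM account__v VERSION AS OF 1234567890123456789;
```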

Enabling and Connecting to the Data Lakehouse

Enabling and connecting to the Data Lakehouse for a Vault CRM instance is not a simple configuration change like other Vault CRM features. It requires cross-functional coordination across your Business, Security, IT, and Data Warehouse teams, as well as with Veeva Support. Because of this, the Data Lakehouse can only be enabled upon request and on a per-Vault basis. It is recommended to plan in advance and align the necessary teams before submitting a request.

Customers can query Vault CRM data directly from Veeva-managed S3 storage using any query engine that integrates with Apache Iceberg™ tables and supports AWS Glue Catalog and IAM Role-based authentication.

Currently supported examples include:

*See Connecting with an IAM User for more information.

The steps below describe the general enablement process for customers connecting to the CRM Data Lakehouse using the following:

  • Snowflake on AWS or Azure
  • Databricks on AWS
  • Databricks on Azure - Requires additional steps. See Databricks on Azure for more information.
  1. Open a Veeva Support ticket requesting the Data Lakehouse be enabled for a given Vault CRM instance. Provide the following information:

    • Vault Name(s)
    • Vault ID(s)
    • Iceberg-aware query engine(s)
    • Cloud Host(s) for your query engine(s)

    Multiple Vaults can be mapped to a single Databricks or Snowflake instance or can be assigned to individual instances. For example, three US Vaults could share the same Databricks instance, while another EU Vault might use a different one. Or, all four Vaults could use a single Databricks instance.

  2. Veeva follows up and shares additional information.
  3. Set up your Data Lakehouse query engine integration using Veeva-provided details. For example:

    • Snowflake – Create an External Volume, AWS Glue Catalog Integration and Iceberg Tables
    • Databricks (AWS) – Create Storage Credential, External Location, AWS Glue Connection, and Foreign Catalogs
    • Microsoft Fabric – Configure a Spark Notebook
  4. Send Veeva the generated Trust Policy.
  5. Veeva completes the setup using your Trust Policy.
  6. Validate that setup is complete by successfully running SQL queries against your Vault CRM Data Lakehouse.

    For example, SELECT * FROM ACCOUNT__V LIMIT 10;

Connecting with an IAM User

Some query engines do not support cross-account IAM Roles or native AWS Glue Catalog integration. In these cases, access to the Vault CRM Data Lakehouse is provided using Veeva-issued IAM User credentials and requires additional Spark-based configuration. The steps below describe the general enablement process for customers connecting to the CRM Data Lakehouse using an IAM User:

  1. Open a Veeva Support ticket requesting the Data Lakehouse be enabled for a given Vault CRM instance. Provide the following information:

    • Vault Name(s)
    • Vault ID(s)
    • Iceberg-aware query engine(s)
    • Cloud Host(s) for your query engine(s)

    Multiple Vaults can be mapped to a single compute instance or can be assigned to individual instances. For example, three US Vaults could share the same compute instance, while another EU Vault might use a different one. Or, all four Vaults could use a single compute instance.

  2. Veeva will send IAM User information and additional steps to complete the integration.
  3. Set up your Data Lakehouse integration using Veeva-provided configuration guidance.
  4. Validate that setup is complete by successfully running SQL queries against your Vault CRM Data Lakehouse.

    For example, SELECT * FROM ACCOUNT__V LIMIT 10;