Using the Data Lakehouse for Vault CRM

Customers can access a Data Lakehouse containing a complete and up-to-date copy of their Vault CRM data. The Data Lakehouse retrieves all Vault CRM data using Vault Platform’s Direct Data API. Retrieved data is published as Apache Iceberg™ tables on S3, enabling customers to easily query Vault CRM data in place or copy it out to an external data warehouse.

Every CRM Vault has its own Data Lakehouse. Data is read-only, and any metadata or data changes in Vault CRM are reflected in the Data Lakehouse within 30 minutes, ensuring queries are accurate and up-to-date.

Connecting and Running Queries

Admin users can connect their existing data warehouse platforms to the Data Lakehouse to run queries against their Vault CRM data, or use any query engine that supports Apache Iceberg™, such as Databricks or Snowflake.
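
For example, once a connection is in place, Vault CRM data can be queried where it lives or copied into your own warehouse. The catalog and schema names below (vault_crm, tenant_id, analytics) are placeholders; the actual names depend on how your query engine is configured:

  -- Query Vault CRM data in place
  SELECT * FROM vault_crm.tenant_id.account__v LIMIT 100;

  -- Copy a snapshot of an object into a table owned by your warehouse
  CREATE TABLE analytics.account__v_snapshot AS
  SELECT * FROM vault_crm.tenant_id.account__v;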

Viewing Deleted Records in the Data Lakehouse

To support downstream data reconciliation, deleted records in Vault CRM are exposed in a single, consolidated table named z_deleted_object_records. Each row in the table represents a delete event, covering the Vault’s deletion history for up to the past four months (see the example query below the column list). The table contains the following columns:

  • id – The ID of the deleted record
  • deleted_date – The timestamp when the delete occurred in Vault CRM
  • object – The API name of the corresponding object
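
For example, a downstream integration might pull recent delete events for a single object and use the returned IDs to remove or flag the corresponding rows in its own copy of the data. The query below is illustrative; depending on your query engine, the table may need to be qualified with a catalog and schema name:

  -- Delete events for account__v since a given point in time
  SELECT id, deleted_date
  FROM z_deleted_object_records
  WHERE object = 'account__v'
    AND deleted_date >= TIMESTAMP '2025-01-01 00:00:00'
  ORDER BY deleted_date;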

Enabling the Data Lakehouse

Enabling the Data Lakehouse for a Vault CRM instance is not a simple configuration change like other Vault CRM features. It requires cross-functional coordination across your Business, Security, IT, and Data Warehouse teams, as well as with Veeva Support. Because of this, the Data Lakehouse can only be enabled upon request and on a per-Vault basis. It is recommended to plan in advance and align the necessary teams before submitting a request.

The steps below describe the general enablement process for customers connecting to the CRM Data Lakehouse using one of the following:

  • Snowflake on AWS or Azure
  • Databricks on AWS
  • Databricks on Azure – Requires additional steps. See Databricks on Azure for more information.

  1. Open a Veeva Support ticket requesting the Data Lakehouse be enabled for a given Vault CRM instance. Provide the following information:

    • Vault Name(s)
    • Vault ID(s)
    • Iceberg-aware query engine(s)
    • Cloud Host(s) for your query engine(s)

    Multiple Vaults can be mapped to a single Databricks or Snowflake instance or can be assigned to individual instances. For example, three US Vaults could share the same Databricks instance, while another EU Vault might use a different one. Or, all four Vaults could use a single Databricks instance.

  2. Veeva follows up and shares additional information.
  3. Set up your Databricks or Snowflake integration:

    • Snowflake – Create an External Volume, Catalog Integration (AWS Glue), and Iceberg Tables using Veeva-provided SQL files (see the example following these steps)
    • Databricks on AWS – Create Storage Credential, External Location, AWS Glue Connection, and Foreign Catalog using Veeva-provided details
  4. Send Veeva the generated Trust Policy.
  5. Veeva completes the setup using your Trust Policy.
  6. Validate that setup is complete by successfully running SQL queries against your Vault CRM Data Lakehouse.

    For example, SELECT * FROM ACCOUNT__V LIMIT 10;
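
As an illustration of the Snowflake objects referenced in step 3, the sketch below shows the general shape of that setup. All names, ARNs, and locations are placeholders; use the actual Veeva-provided SQL files and values when performing the setup:

  -- External Volume pointing at the Data Lakehouse S3 location
  CREATE EXTERNAL VOLUME vault_crm_ext_vol
    STORAGE_LOCATIONS = (
      (
        NAME = 'vault-crm-lakehouse'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = 's3://<veeva-provided-bucket>/<path>/'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/<veeva-provided-role>'
      )
    );

  -- Catalog Integration for the AWS Glue catalog that tracks the Iceberg tables
  CREATE CATALOG INTEGRATION vault_crm_glue_int
    CATALOG_SOURCE = GLUE
    CATALOG_NAMESPACE = '<veeva-provided-glue-database>'
    TABLE_FORMAT = ICEBERG
    GLUE_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/<veeva-provided-role>'
    GLUE_CATALOG_ID = '<account-id>'
    ENABLED = TRUE;

  -- Externally managed Iceberg table, one per Vault CRM object
  CREATE ICEBERG TABLE account__v
    EXTERNAL_VOLUME = 'vault_crm_ext_vol'
    CATALOG = 'vault_crm_glue_int'
    CATALOG_TABLE_NAME = 'account__v';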

Databricks on Azure

Databricks on Azure requires customers to connect to the Data Lakehouse using a Spark session. Customers connect a Notebook to a configured Spark Cluster to run Spark SQL queries.

  1. Open a Veeva Support ticket requesting the Data Lakehouse be enabled for a given Vault CRM instance. Provide the following information:

    • Vault Name(s)
    • Vault ID(s)
    • Iceberg-aware query engine(s)
    • Cloud Host(s) for your query engine(s)

    Multiple Vaults can be mapped to a single Databricks instance or can be assigned to individual instances. For example, three US Vaults could share the same Databricks instance, while another EU Vault might use a different one. Or, all four Vaults could use a single Databricks instance.

  2. Veeva will send IAM User information and additional steps to complete the integration.
  3. Set up your Databricks integration by creating a Compute Spark Cluster and Notebook connection using the Veeva-provided Python script.
  4. Validate that setup is complete by successfully running SQL queries against your Vault CRM Data Lakehouse.

    For example, spark.sql("SELECT * FROM glue_catalog.tenant_id.account__v LIMIT 10")
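
As a further illustration, once the Notebook is attached to the cluster, other Spark SQL statements can be run the same way by passing them to spark.sql. The glue_catalog and tenant_id names follow the example above and are placeholders for the values Veeva provides:

  -- List the Vault CRM tables exposed through the Glue catalog
  SHOW TABLES IN glue_catalog.tenant_id;

  -- Inspect the schema of a Vault CRM object table
  DESCRIBE TABLE glue_catalog.tenant_id.account__v;

  -- Count delete events per object as a reconciliation check
  SELECT object, COUNT(*) AS delete_events
  FROM glue_catalog.tenant_id.z_deleted_object_records
  GROUP BY object;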