August 20, 2025
Special thanks to Josh Patterson and Patterson Consulting for their collaboration and insights as secondary authors on this piece. You can explore more of their work at pattersonconsultingtn.com.
All code for this post can be found here: Ingesting Insurance Claim Data from Gmail into Databricks
Insurance teams depend on timely, accurate claim data to make fast decisions. However, when Third-Party Administrators send claim data as spreadsheets by email, often with mismatched schemas, it creates bottlenecks that slow things down. Engineers end up cleaning and restructuring files instead of building reliable workflows, and business leaders wait longer for insights they need today.
In this post, we’ll show how Tabsdata and Databricks can empower a single engineer to automate an entire insurance claim workflow end-to-end: from collecting claim files straight out of Gmail, to standardizing and enriching them, to publishing clean, versioned data into Databricks for analysis. The result is a workflow that saves time, reduces errors, and gives your claims team the data they need when they need it.
Tabsdata follows a Pub/Sub approach that makes data integration more scalable and flexible than traditional pipelines. Instead of extracting and loading data directly into your data platform, you publish datasets into tables within the Tabsdata Server. From there, Databricks (or any other platform) can subscribe to those tables.
Between publishing and subscribing, you can add transformation steps to clean, enrich, or reshape the data as needed. Each step in this workflow is implemented as a Tabsdata Function, which is a simple Python function that either reads from or loads data into Tabsdata Tables.
Functions can be triggered manually or automatically when new data needs to be processed. Every run of a function creates a new version of its table, and all versions of all tables are stored in the server. That means you can sample past versions of your tables at any point in time, making time travel, auditing, and debugging straightforward.
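As a minimal sketch (assuming Tabsdata's Python decorators and a Polars-style TableFrame API; exact signatures are worth verifying against the docs), a function that reads one table and writes another looks like this:

```python
import tabsdata as td

# A minimal Tabsdata Function: read one Tabsdata Table, write another.
# It runs automatically whenever `claims_fact_today` receives a new version.
@td.transformer(input_tables="claims_fact_today", output_tables="claims_deduped")
def dedupe_claims(claims: td.TableFrame) -> td.TableFrame:
    # TableFrames expose a Polars-like API (our assumption); drop duplicate rows.
    return claims.unique()
```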
Tabsdata provides seamless ingestion and transformation capabilities while giving you significant advantages in lineage, governance, and orchestration.
Every Tabsdata Table is automatically versioned and stores the full history of both its data and schema. When a new version of a table is created, Tabsdata infers the schema directly from the data being written, removing the need to define or enforce a schema in advance. This approach preserves the metadata and semantics of your source data, gives you clear visibility into how that data evolves over time, and enables you to address schema drift proactively instead of managing it defensively. Together these features create a complete, meaningful timeline that shows how your data changes over time, when those changes occur, and what those changes mean.
These versioned tables form the backbone of how work is orchestrated in Tabsdata. Because Tabsdata Functions only interact with Tables, either by reading from them or writing to them, Tabsdata can infer execution order and orchestrate your workflow for you. When a Table gets new data, any functions that depend on it run automatically. This declarative orchestration removes the need to manually create and maintain fragile DAGs and gives you greater confidence in the reliability of your data.

To automate TPA claims processing, we'll need to:
- publish the latest policy records from MySQL and the day's claim files from Gmail into Tabsdata Tables,
- standardize, consolidate, and enrich the claim data, and
- subscribe the processed tables to Databricks for analysis.
The policy data we are ingesting will have the following schema:
Our claim data files will arrive with non-unified schemas. However, we will standardize all claim data to the following schema:
We first install Tabsdata on our local machine:
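```bash
# Tabsdata ships as a Python package on PyPI
pip install tabsdata
```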
Then we start the server:
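```bash
# Start the Tabsdata server daemon (command name per the Tabsdata CLI;
# verify against your installed version)
tdserver start
```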
Next, we log in to the server (the flags and default credentials below are assumptions; adjust for your deployment):
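```bash
# Log in to the local server; user and password are placeholders
td login localhost --user admin --password <your-password>
```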
Lastly, we create a collection called claim_processing, which serves as the logical container for our functions and tables:
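```bash
# Create the collection (exact flag names are an assumption;
# see `td collection --help`)
td collection create --name claim_processing
```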
Inside this collection, we’ll build six functions: two publishers, three transformers, and one subscriber.
To enrich claims data, we need the latest policy records. This publisher uses the Tabsdata MySQL connector to pull policy data from MySQL and load it into the policy_dim table.
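A sketch of this publisher is below, assuming Tabsdata's MySQL connector classes look roughly as shown (td.MySQLSource, td.UserPasswordCredentials); the connection URI, query, and environment variables are placeholders for your own setup.

```python
import os
import tabsdata as td

# Publisher sketch: pull the latest policy records from MySQL and publish
# them into the Tabsdata Table `policy_dim`. Connector and credential class
# names reflect our reading of the Tabsdata API; URI and query are placeholders.
@td.publisher(
    source=td.MySQLSource(
        uri="mysql://localhost:3306/insurance",
        query="SELECT * FROM policies",
        credentials=td.UserPasswordCredentials(
            os.environ["MYSQL_USER"], os.environ["MYSQL_PASSWORD"]
        ),
    ),
    tables="policy_dim",
)
def pub_policy_dim(policies: td.TableFrame) -> td.TableFrame:
    # Publish the query results as-is; light cleanup could happen here.
    return policies
```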
This function connects to Gmail and extracts all unread CSV attachments. Using the extracted data as input, the publisher standardizes each file’s schema, concatenates all files into a single dataset, deduplicates the results, and loads the final unified table into the Tabsdata Table claims_fact_today.
For this tutorial, we used our SourcePlugin class to define a connector for Gmail called GmailPlugin. You have the flexibility to define your own connectors or use our built-in connectors. A condensed sketch of the plugin and the publisher that uses it follows.
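The sketch below rests on a few stated assumptions: the SourcePlugin hook is named trigger_input and returns the names of the files it wrote into working_dir, td.concat behaves like its Polars counterpart, and the Gmail-fetching helper, inbox address, and target column names are all hypothetical stand-ins rather than the post's actual values.

```python
import os
import tabsdata as td


def fetch_unread_csv_attachments(user: str):
    """Hypothetical helper: pull unread CSV attachments from Gmail.

    A real implementation would use the Gmail API
    (google-api-python-client); it is stubbed here so the plugin and
    publisher shapes stay in focus. Returns (filename, payload_bytes) pairs.
    """
    return []


# Custom Gmail connector built on Tabsdata's SourcePlugin. The hook name
# `trigger_input` and its contract (write files into working_dir, return
# their names) reflect our reading of the plugin API; verify in the docs.
class GmailPlugin(td.SourcePlugin):
    def __init__(self, user: str):
        self.user = user

    def trigger_input(self, working_dir: str) -> list[str]:
        names = []
        for name, payload in fetch_unread_csv_attachments(self.user):
            with open(os.path.join(working_dir, name), "wb") as f:
                f.write(payload)
            names.append(name)
        return names


# Illustrative target schema -- placeholder column names, not the post's.
TARGET_COLUMNS = ["claim_id", "policy_id", "claim_date", "claim_amount", "status"]


# Publisher sketch: we assume the plugin delivers one TableFrame per
# attachment; each is aligned to the target schema, stacked with a
# Polars-style td.concat (an assumption), and deduplicated.
@td.publisher(
    source=GmailPlugin(user="claims-intake@example.com"),  # hypothetical inbox
    tables="claims_fact_today",
)
def pub_claims_today(*attachments: td.TableFrame) -> td.TableFrame:
    standardized = [a.select(TARGET_COLUMNS) for a in attachments]
    return td.concat(standardized).unique()
```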
The claims_fact_today table only holds the latest batch of claim data. This transformer takes the claims_fact_today table and appends it to a running master table called claims_fact_master.
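A sketch of the append step, assuming a transformer may list the table it writes as one of its inputs (resolved to that table's previous version) and that td.concat mirrors Polars; the claim_id dedupe key is a placeholder:

```python
import tabsdata as td

# Append today's batch to the running master table and drop any claims we
# have already seen. The self-referencing input and `td.concat` are
# assumptions to verify against the Tabsdata docs.
@td.transformer(
    input_tables=["claims_fact_today", "claims_fact_master"],
    output_tables="claims_fact_master",
)
def tfr_append_master(today: td.TableFrame, master: td.TableFrame) -> td.TableFrame:
    return td.concat([master, today]).unique(subset=["claim_id"])
```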
This transformer takes the claims_fact_master and policy_dim tables as input, joins policy_dim to claims_fact_master, coalesces overlapping fields, and outputs the result into a new table called claims_fact_master_enriched.
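A sketch of the enrichment step; the join key, column names, and the Polars-style expression API (td.col, fill_null) are assumptions:

```python
import tabsdata as td

# Left-join policy attributes onto the master claims table, then coalesce
# a field that exists on both sides (the `_right` suffix follows Polars'
# join-collision convention, which we assume TableFrames share).
@td.transformer(
    input_tables=["claims_fact_master", "policy_dim"],
    output_tables="claims_fact_master_enriched",
)
def tfr_enrich(claims: td.TableFrame, policies: td.TableFrame) -> td.TableFrame:
    joined = claims.join(policies, on="policy_id", how="left")
    return joined.with_columns(
        td.col("policyholder_name")
        .fill_null(td.col("policyholder_name_right"))
        .alias("policyholder_name")
    )
```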
Business units often need tailored claim subsets. This transformer takes the claims_fact_master_enriched table as input and creates three business-specific tables as output; the fan-out pattern is sketched below.
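In this sketch, the three output table names and the filter column are hypothetical stand-ins for the post's actual subsets:

```python
import tabsdata as td

# Split the enriched master table into three business-ready subsets.
# Table names and the `line_of_business` column are placeholders.
@td.transformer(
    input_tables="claims_fact_master_enriched",
    output_tables=["claims_auto", "claims_property", "claims_liability"],
)
def tfr_split(enriched: td.TableFrame):
    auto = enriched.filter(td.col("line_of_business") == "auto")
    prop = enriched.filter(td.col("line_of_business") == "property")
    liab = enriched.filter(td.col("line_of_business") == "liability")
    return auto, prop, liab
```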
Finally, we push our processed tables into Databricks for analysts and business users.
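A sketch of the subscriber. We assume Tabsdata ships a Databricks destination class roughly shaped like the one below; its exact name and parameters, and the three-level Unity Catalog table names, should be checked against the connector docs.

```python
import os
import tabsdata as td

# Push the three business-ready tables into Databricks. The destination
# class name and parameters are assumptions; workspace URL, token, catalog,
# and schema are placeholders.
@td.subscriber(
    tables=["claims_auto", "claims_property", "claims_liability"],
    destination=td.DatabricksDestination(
        host_url=os.environ["DATABRICKS_HOST"],
        token=os.environ["DATABRICKS_TOKEN"],
        tables=[
            "main.claims.claims_auto",
            "main.claims.claims_property",
            "main.claims.claims_liability",
        ],
    ),
)
def sub_databricks(auto: td.TableFrame, prop: td.TableFrame, liab: td.TableFrame):
    return auto, prop, liab
```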
With a few lines of code, we were able to create an ingestion and transformation workflow for insurance teams to use.
Registering these functions was all we had to do to make this workflow operational.
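Registration happens once from the CLI; the flag names below are our assumption of the Tabsdata CLI surface (check `td fn --help` for your version), and the file paths are placeholders:

```bash
td fn register --coll claim_processing --fn-path publishers.py::pub_policy_dim
td fn register --coll claim_processing --fn-path publishers.py::pub_claims_today
# ...repeat for the three transformers and the subscriber
```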
Due to Tabsdata's declarative architecture, many of the complex aspects of building data pipelines are handled automatically under the hood: Tables are created on function registration, table schema is inferred from the data committed at function invocation, schema is scoped to table versions, and every table version is automatically cached and available to sample at any time. This means that even if our source data experiences schema drift, Tabsdata will not only adapt to the new schema, it will also track that drift in the table's lineage history, giving us deeper context into when and how our data is changing.
Before any of this data even lands in Databricks, insurance teams can use the Tabsdata UI to view any of the Tabsdata Tables generated by the execution of our workflow. Within the UI, they can also run SQL queries on the data, view the schema, and access historical versions of the data from previous invocations of our workflow.
