September 9, 2025
All code for this post can be found here: Masking and Subscribing Customer Data with Tabsdata
When building reliable data integration workflows, not sharing certain data matters just as much as sharing it. Sometimes your team holds sensitive data that requires strict governance over how it is used and distributed.
In this simulated scenario, I am a system administrator for a B2C company, overseeing customer data in a MySQL database. We store a wide assortment of data on our customers, including name, email, address, IP address, and Social Security Number, along with non-PII fields such as account balance and loyalty points. I received a request from another team that wants to analyze loyalty points and account balances. I am more than happy to fulfill the request, but I don’t want to expose these customers’ PII, especially when the team doesn’t need that data.
Using Tabsdata, I’ll build a workflow that ingests customer data from MySQL, applies PII masking, and then publishes the masked data to AWS Glue-backed Iceberg tables for the requesting team.
In traditional data integration setups, executions run in isolated, stateless environments. Within each execution, you define steps, or nodes, that take inputs, do some type of work on those inputs, and produce outputs. These nodes are then manually chained together so that when one node finishes its execution, it both passes its output to the next node and triggers that node to run as the next step in the workflow. It’s kind of like a game of hot potato or telephone.
When this workflow runs, any data it generates persists only until the execution ends, unless it is explicitly persisted somewhere along the way. This makes the workflow a black box where observability is limited and debugging is difficult.
Tabsdata’s architecture is designed to solve this exact problem. When you spin up a Tabsdata server, it becomes a repository that permanently stores tabular data in objects called Tables. Inside this server, you still build nodes, called Functions, that take inputs and produce outputs, but those inputs and outputs are Tables, not other functions. When a function runs, it writes its output into a Table, and any function that depends on that Table detects the new data and runs automatically. This makes Tables the central hub through which all data flows.
Through this design, Tables serve a dual purpose. First, they retain full state and version history, so you can access the schema and data of a table at any point in time. Second, they act as breakpoints between functions. Because functions only interact with Tables, orchestration happens naturally: a function that depends on a Table listens for new commits and automatically triggers when new data is available. To create a new function, you don’t need to manually wire it into your workflow. Just specify the Table it should read from, and it will automatically run whenever new data is committed to that Table.
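To make that wiring concrete, here is a minimal sketch of a downstream function. The decorator, its parameter names, and the td.TableFrame type follow Tabsdata’s documented patterns but may differ between versions, and the Table names customer_events and customer_events_copy are purely hypothetical:

```python
import tabsdata as td

# A function is connected to the rest of the system only through Table names.
# Nothing here references an upstream function; whenever a new commit lands in
# "customer_events", Tabsdata runs this function automatically.
@td.transformer(
    input_tables="customer_events",        # Table this function listens to
    output_tables="customer_events_copy",  # Table this function commits to
)
def copy_events(events: td.TableFrame) -> td.TableFrame:
    # Pass the data through unchanged; the point is the Table-based wiring.
    return events
```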
The pipeline consists of three parts:
I start by registering a publisher that queries the raw_customer_data table from the tabsdata_db schema and loads it into the raw_customer_data Tabsdata Table.
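Below is a minimal sketch of what that publisher can look like. The connection URI, credential handling, and the exact parameter names on td.MySQLSource and td.publisher are assumptions based on Tabsdata’s documented patterns; adjust them to your environment and Tabsdata version:

```python
import tabsdata as td

# Hypothetical MySQL endpoint; credentials are read from environment variables.
@td.publisher(
    td.MySQLSource(
        uri="mysql://127.0.0.1:3306/tabsdata_db",
        query="SELECT * FROM raw_customer_data",
        credentials=td.UserPasswordCredentials(
            td.EnvironmentSecret("MYSQL_USERNAME"),
            td.EnvironmentSecret("MYSQL_PASSWORD"),
        ),
    ),
    tables="raw_customer_data",
)
def publish_customers(raw: td.TableFrame) -> td.TableFrame:
    # Load the queried rows into the raw_customer_data Tabsdata Table as-is.
    return raw
```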
Next, I create a transformer that masks personally identifiable information (PII). In this example, I identify the list of columns I want to mask and replace every character in those columns with an asterisk.
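A sketch of that transformer is shown below. The PII column names are assumed for illustration, and the masking expression is written in the Polars style that Tabsdata’s TableFrame mirrors; confirm that the string functions you rely on are in the supported subset for your version:

```python
import tabsdata as td

# Assumed PII column names; replace with the columns in your own schema.
PII_COLUMNS = ["name", "email", "address", "ip_address", "ssn"]

@td.transformer(
    input_tables="raw_customer_data",
    output_tables="masked_customer_data",
)
def mask_pii(customers: td.TableFrame) -> td.TableFrame:
    # Replace every character of each PII column with an asterisk, keeping the
    # original string length. "." is a regex wildcard matching any character.
    return customers.with_columns(
        [td.col(c).str.replace_all(r".", "*") for c in PII_COLUMNS]
    )
```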
Finally, I configure a subscriber to export the masked data to S3, integrated with an AWS Glue-backed Iceberg catalog.
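Here is a rough sketch of that subscriber. The bucket, region, and credential names are hypothetical, and the AWS Glue-backed Iceberg catalog is only indicated with a comment because its exact configuration options depend on the Tabsdata release:

```python
import tabsdata as td

@td.subscriber(
    tables="masked_customer_data",
    destination=td.S3Destination(
        uri="s3://my-company-bucket/masked_customer_data.parquet",  # hypothetical bucket
        credentials=td.S3AccessKeyCredentials(
            td.EnvironmentSecret("AWS_ACCESS_KEY_ID"),
            td.EnvironmentSecret("AWS_SECRET_ACCESS_KEY"),
        ),
        region="us-east-1",
        # The AWS Glue-backed Iceberg catalog is attached through the
        # destination's catalog configuration; its options vary by Tabsdata
        # release and are omitted here.
    ),
)
def export_masked_to_s3(masked: td.TableFrame) -> td.TableFrame:
    # Each new commit to masked_customer_data triggers a fresh export to S3.
    return masked
```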
My masked customer data Table is permanently available within Tabsdata for subscription, but just to have the option, I also write the masked data back into a new MySQL table called masked_customer_data.
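That write-back subscriber can look roughly like this; td.MySQLDestination and its parameters are again assumptions following Tabsdata’s documented patterns, and the connection details are hypothetical:

```python
import tabsdata as td

@td.subscriber(
    tables="masked_customer_data",
    destination=td.MySQLDestination(
        uri="mysql://127.0.0.1:3306/tabsdata_db",
        destination_table="masked_customer_data",
        credentials=td.UserPasswordCredentials(
            td.EnvironmentSecret("MYSQL_USERNAME"),
            td.EnvironmentSecret("MYSQL_PASSWORD"),
        ),
    ),
)
def write_back_to_mysql(masked: td.TableFrame) -> td.TableFrame:
    # Reads the same Tabsdata Table as the S3 subscriber; no extra wiring needed.
    return masked
```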
With just a few lines of code, I was able to build a workflow that masks and sends customer data over to analytics teams. Now I can be confident that my teams are maximizing our data without compromising on security.
One thing that really stood out to me was how easy it was to subscribe my masked data to additional destinations. Initially, when prototyping, I only wanted to push my data into AWS. However, for fun I added the MySQL subscriber to introduce a little more complexity. Because all functions run decoupled from one another, I just registered the MySQL subscriber to use the masked_customer_data Tabsdata Table as input, and it automatically ran whenever I ran my publisher.