September 9, 2025

Hands-on with Tabsdata: Masking and Subscribing Customer Data with Tabsdata

All code for this post can be found here: Masking and Subscribing Customer Data with Tabsdata

Building reliable data integration workflows is just as much about not sharing certain data as it is about sharing it. Sometimes your team has sensitive data that needs strict governance around how it is used and distributed.

In this simulated scenario, I am a system administrator for a B2C company that oversees customer data in a MySQL database. We store a wide assortment of data on our customers, ranging from PII like name, email, address, IP address, and Social Security Number to non-PII fields like account balance and loyalty points. I received a request from another team that wants to do some analysis on loyalty and account balances. I am more than happy to fulfill the request, but I don't want to expose PII on these customers, especially when the team doesn't need that data.

Using Tabsdata, I'll build a workflow that ingests customer data from MySQL, applies PII masking, and then publishes the masked data to AWS Glue Iceberg tables for the requesting team.

Why Tabsdata?

In traditional data integration setups, executions run in isolated, stateless environments. Within each execution, you define steps/nodes that take inputs, do some type of work on those inputs, and produce outputs. These nodes are then manually chained together so that when one node finishes its execution, it both passes its output to the next node and triggers that node to run as the next step in the workflow. It's kind of like a game of hot potato or telephone.

When this workflow runs, any data generated only persists until the execution ends, unless it is explicitly cached within the execution. This makes the workflow itself a black box where observability is limited and debugging is difficult.

Tabsdata’s architecture is designed to solve this exact problem. When you spin up a Tabsdata server, it becomes a repository that permanently stores tabular data in objects called Tables. Inside this server, you still build nodes called Functions that take inputs and produce outputs, but these inputs and outputs are executed against Tables, not other functions. When a function runs, it writes its output into a Table, and any function that depends on that Table will detect the new data and run automatically. This makes Tables the central hub where all data flows through.

Through this design, Tables serve a dual purpose. First, they retain full state and version history, so you can access the schema and data of a table at any point in time. Second, they act as breakpoints between functions. Because functions only interact with Tables, orchestration happens naturally: a function that depends on a Table listens for new commits and automatically triggers when new data is available. To create a new function, you don’t need to manually wire it into your workflow. Just specify the Table it should read from, and it will automatically run whenever new data is committed to that Table.
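
As a concrete illustration, here is a minimal sketch of what that looks like in code (the function and output table names are hypothetical; the decorator API matches the one used later in this post). The downstream function declares only the Table it reads from, with no explicit link to the function that produces that Table:

import tabsdata as td

# Hypothetical downstream function: it declares only the Table it reads from.
# Whenever a new commit lands in "raw_customer_data", Tabsdata runs this
# function automatically; there is no manual wiring to the upstream publisher.
@td.transformer(
    input_tables=["raw_customer_data"],
    output_tables=["raw_customer_data_copy"],
)
def copy_raw_customers(tf: td.TableFrame):
    # Pass the data through unchanged; the point is the Table-based dependency.
    return tf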

Overview

The pipeline consists of three parts:

  1. Publisher: Reads raw customer data from a MySQL table into Tabsdata.
  2. Transformer: Applies masking to sensitive fields such as names, emails, and SSNs.
  3. Subscribers: Write the masked data to AWS S3 with a Glue Iceberg catalog and, as a bonus, back to MySQL.

Publishing from MySQL

I start by registering a publisher that queries the raw_customer_data table from the tabsdata_db schema and loads it into a Tabsdata table, also named raw_customer_data.
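
For context, the publisher references a few connection values defined at the top of the script. Here is a minimal sketch of that setup, assuming the credentials come from environment variables; the URI and variable names below are placeholders, not values from my actual environment:

import os

import tabsdata as td

# Placeholder connection string for the MySQL instance holding tabsdata_db.
MYSQL_URI = "mysql://127.0.0.1:3306/tabsdata_db"

# Hypothetical environment variables holding the database credentials.
MYSQL_USERNAME = os.environ["MYSQL_USERNAME"]
MYSQL_PASSWORD = os.environ["MYSQL_PASSWORD"]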


@td.publisher(
    source=td.MySQLSource(
        uri=MYSQL_URI,
        query=["SELECT * FROM raw_customer_data"],
        credentials=td.UserPasswordCredentials(MYSQL_USERNAME, MYSQL_PASSWORD),
    ),
    tables=["raw_customer_data"],
)
def mysql_pub(tf1: td.TableFrame):
    # Pass the query result straight through into the "raw_customer_data" Table.
    return tf1
  

Masking PII Fields

Next, I create a transformer that masks personally identifiable information (PII). In this example, I identify the list of columns I want to mask and replace every character in those columns with an asterisk. 


@td.transformer(
    input_tables=["raw_customer_data"],
    output_tables=["masked_customer_data"],
)
def mask_trf(tf: td.TableFrame):
    # Columns that contain PII and should be fully masked.
    cols_to_mask = [
        "first_name",
        "last_name",
        "ip_address",
        "phone_number",
        "email",
        "date_of_birth",
        "SSN",
        "Address",
        "City",
        "Postal_Code",
        "notes_extra",
    ]
    for col_name in cols_to_mask:
        # "." is a regex matching any character, so every character becomes "*".
        tf = tf.with_columns(
            td.col(col_name).cast(td.String).str.replace_all(".", "*")
        )
    return tf
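
Because replace_all treats the pattern as a regular expression by default, the "." pattern matches every character, so each value becomes a run of asterisks of the same length as the original. Here is a quick local sanity check of that logic using plain polars, which Tabsdata's TableFrame API appears to mirror; the sample rows are made up:

import polars as pl

# Made-up sample rows purely to sanity-check the masking expression locally.
df = pl.DataFrame(
    {
        "email": ["jane.doe@example.com", "bob@example.org"],
        "SSN": ["123-45-6789", "987-65-4321"],
    }
)

masked = df.with_columns(
    [
        # "." is treated as a regex and matches any single character, so each
        # value becomes asterisks with the original string length preserved.
        pl.col(c).cast(pl.Utf8).str.replace_all(".", "*")
        for c in ["email", "SSN"]
    ]
)
print(masked)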

Exporting to S3 with Iceberg

Finally, I configure a subscriber to export the masked data to S3, integrated with an AWS Glue-backed Iceberg catalog.
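
The subscriber below references an s3_credentials object defined elsewhere in the script. A minimal sketch of how it might be defined, assuming an access-key style credentials helper (the class name td.S3AccessKeyCredentials, the environment variable names, and the bucket/database values are my assumptions; verify them against your Tabsdata version and environment):

import os

import tabsdata as td

# Placeholder bucket, region, and Glue database names.
AWS_S3_URI = "s3://my-masked-data-bucket"
AWS_REGION = "us-east-2"
AWS_GLUE_DATABASE = "my_glue_database"

# Assumption: an access-key style credentials helper; check the exact class
# name in the Tabsdata version you are running.
s3_credentials = td.S3AccessKeyCredentials(
    os.environ["AWS_ACCESS_KEY_ID"],
    os.environ["AWS_SECRET_ACCESS_KEY"],
)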


@td.subscriber(
    tables=["masked_customer_data"],
    destination=td.S3Destination(
        uri=[
            f"{AWS_S3_URI}/masked_customer_data/masked_customer_data-$EXPORT_TIMESTAMP.parquet"
        ],
        region=AWS_REGION,
        credentials=s3_credentials,
        catalog=td.AWSGlue(
            definition={
                "name": "default",
                "type": "glue",
                "client.region": "us-east-2",
            },
            tables=[f"{AWS_GLUE_DATABASE}.masked-customer-data"],
            auto_create_at=[AWS_S3_URI],
            if_table_exists="replace",
            credentials=s3_credentials,
        ),
    ),
)
def s3_sub(masked_customer_data: td.TableFrame):
    return masked_customer_data

[BONUS] Writing Masked Data Back to MySQL

My masked customer data table remains permanently available within Tabsdata for subscription, but just to have the option, I also write the masked data back into a new MySQL table called masked_customer_data.


@td.subscriber(
    tables=["masked_customer_data"],
    destination=td.MySQLDestination(
        uri=MYSQL_URI,
        destination_table=["masked_customer_data"],
        credentials=td.UserPasswordCredentials(MYSQL_USERNAME, MYSQL_PASSWORD),
        if_table_exists="replace",
    ),
)
def mysql_sub(tf1: td.TableFrame):
    return tf1

Results

With just a few lines of code, I built a workflow that masks customer data and delivers it to the analytics team. Now I can be confident that my teams are getting the most out of our data without compromising security.

Closing Thoughts

One thing that really stood out to me was how easy it was to subscribe my masked data into additional destinations. When I first prototyped this, I only planned to push my data into AWS; for fun, I added a MySQL subscriber to introduce a bit more complexity. Because functions are fully decoupled from one another, I just registered my MySQL subscriber with the masked_customer_data Tabsdata table as its input, and it automatically ran whenever my publisher committed new data.