April 7, 2025

Part 1: Simplifying Data Engineering — Freeing teams from pipeline firefighting

🔹 This article is part of the ongoing series: How Pub/Sub for Tables Fixes What Data Pipelines Broke.

The Problem: Data Engineers Are Drowning in Pipelines

For years, pipelines were the default answer to every integration need. Need to ingest a new source? Build a pipeline. Need to transform, enrich, or publish data? Add another one. But today, pipelines have become the problem.

They fail for many reasons. Bad data ingestion, transformation errors, infrastructure hiccups, evolving systems, or unmanaged schema changes can all bring things to a halt. Sometimes it’s a downstream dependency that wasn’t ready. Other times it’s a quiet change upstream that breaks everything without warning. The more pipelines you build, the more exposed you are to these risks. Each one adds complexity, tight coupling, and operational overhead.

We’ve thrown observability and orchestration at the symptoms, but the core issue remains. This isn’t a tooling problem. It’s a structural one. The foundation needs to change.

Why Pipelines Create Structural Fragility

Pipelines look clean in architecture diagrams, but in practice they create a web of brittle, opaque dependencies. What they move is not just data, but also the problems hidden within it.

When a source system ingests bad data, the pipeline spreads that issue downstream. Garbage in becomes garbage everywhere. Most of the time, this happens because assumptions break silently. A system gets upgraded. A patch changes a field. A business process shifts how an entity behaves. These changes are rarely coordinated, yet pipelines rely on everything staying the same.

The real issue is that pipelines are procedural. They bundle logic, timing, and handoffs into one fragile structure. Assumptions get embedded in code, hidden in jobs, and passed between teams. Instead of building, engineers spend time reacting to breakages and tracking down what changed.

This is exactly why so many architectures start with raw data layers. They try to collect unprocessed facts, run checks, and reconstruct what the source systems meant. But this process is slow and lossy. It’s not a real solution. It’s a workaround born from a broken integration model.

Tabsdata Replaces Pipelines with a Declarative Model

Pipelines are the wrong abstraction. Tabsdata replaces them with something fundamentally better.

At the heart of this model are publisher functions: free-form Python functions that operate on TableFrames, DataFrame-like representations of structured data. These functions can embed business logic, validation, or shaping as needed. You write what matters, not orchestration boilerplate. Connectivity to external systems is handled transparently by Tabsdata. When a publisher reads a TableFrame sourced from, say, Salesforce, it doesn’t need to know where the data came from. It simply receives the TableFrame and processes it.
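
To make this concrete, here is a minimal sketch of a publisher function. It uses a local CSV source for simplicity (a Salesforce connector would slot in the same way), and the connector, decorator parameters, and TableFrame method names are illustrative of Tabsdata’s decorator-based Python API rather than exact signatures:

```python
import tabsdata as td

# A publisher declares its external source and the table it produces;
# the function body contains only business and shaping logic.
# NOTE: LocalFileSource and the decorator parameters are illustrative.
@td.publisher(
    source=td.LocalFileSource("accounts.csv"),  # connectivity handled by Tabsdata
    tables="accounts",                          # name of the published table
)
def publish_accounts(accounts: td.TableFrame) -> td.TableFrame:
    # Validation and shaping: keep active accounts and the columns that matter.
    return accounts.filter(td.col("status") == "active").select(
        ["account_id", "name", "region"]
    )
```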

Tabsdata also includes transformer functions and subscriber functions, which extend this model end-to-end. Transformer functions can combine or reshape multiple published tables to create new ones. Subscriber functions produce output TableFrames, which Tabsdata maps transparently to destinations like data lakes, warehouses, or APIs. There’s no connector logic to manage, just composable functions that define what flows in and out.
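
Sketched in the same illustrative style (the decorator parameters and the destination connector are assumptions in the spirit of the API, not guaranteed signatures), a transformer and a subscriber might look like this:

```python
import tabsdata as td

# A transformer combines published tables into a new one.
@td.transformer(
    input_tables=["accounts", "orders"],
    output_tables="account_orders",
)
def account_orders(accounts: td.TableFrame, orders: td.TableFrame) -> td.TableFrame:
    # Join two published tables; Tabsdata resolves where they live.
    return accounts.join(orders, on="account_id")

# A subscriber maps an output table to an external destination.
@td.subscriber(
    tables="account_orders",
    destination=td.LocalFileDestination("account_orders.parquet"),
)
def export_account_orders(account_orders: td.TableFrame) -> td.TableFrame:
    return account_orders
```

Neither function opens a connection or schedules a job; the declarations around them tell Tabsdata what flows in and what flows out.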

The result is a model where domain teams own the responsibility of publishing clean, structured, and meaningful data — and downstream teams can build confidently on top of it.

What Simpler Looks Like

With Tabsdata, data no longer travels through a maze of pipelines and handoffs. It flows through declarative, versioned tables that are published, transformed, and subscribed to with minimal ceremony. The complexity isn’t hidden; it’s eliminated.

Here’s the difference:

Before (Pipelines):

  • Custom ETL jobs for every source
  • Schema drift tracked manually
  • Pipeline failures buried in orchestration logs
  • Constant Slack threads with domain teams

After (Tabsdata):

  • Domain teams publish clean, structured data as Tables
  • Engineers build transformer functions instead of stitching pipelines
  • All data is versioned, observable, and reproducible
  • Downstream consumers subscribe without creating new dependencies

This is what simplification actually looks like. No orchestration. No rework. No data lost in translation.

No More Reconstructing Reality

Most data architectures assume that raw data must be staged, cleaned, and reassembled to reflect how the business actually works. This is where Bronze layers and ingestion zones come in: they try to recreate the reality of upstream systems after the fact.

But that process is guesswork. It assumes downstream teams understand the context behind the data, which they usually don’t. Business processes evolve, entities shift, and systems change in ways that aren’t visible downstream. What you get is a patchwork of interpretations, not truth.

Tabsdata avoids this entirely. When a domain team publishes a table, they are publishing their own current reality. There’s no need to infer or reverse-engineer its meaning. The data comes structured, versioned, and ready to use. This reduces duplication, eliminates interpretation drift, and gives both producers and consumers the ability to move independently.

Pub/Sub for Tables Is Not Pub/Sub for Events

It’s easy to hear “Pub/Sub” and think of Kafka, Pulsar, or other event-driven systems. But Tabsdata applies the model differently. It does not operate on messages or streams. It operates on entire tables.

Traditional Pub/Sub systems rely on implicitly ordered events delivered through topics. Stream processors extend this with flows of events designed for real-time change propagation, not for delivering structured, versioned data that teams can build on.

Tabsdata works at the level of whole tables, not individual events. Each time a publisher function runs, it emits a complete version of a table. This is not a batch of messages. It is a fully materialized table with structure and metadata. There are no topics or streams. The model centers on tables and the functions that interact with them.

Functions can access the latest or any historical version of a table. This supports reproducibility, time-based logic, lineage, and provenance. When a published table changes, Tabsdata automatically updates dependent subscribers and transformation outputs.
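
As an illustration of version access, a transformer could diff the latest version of a table against the previous one. The `@HEAD^` selector below, meaning “one version back,” is an assumed notation for this sketch; the exact syntax is defined by Tabsdata:

```python
import tabsdata as td

# Illustrative sketch: compare the current version of a table with the
# prior one. The "@HEAD^" version selector is assumed notation.
@td.transformer(
    input_tables=["accounts", "accounts@HEAD^"],
    output_tables="new_accounts",
)
def new_accounts(current: td.TableFrame, previous: td.TableFrame) -> td.TableFrame:
    # Anti-join: rows present now that were absent in the prior version.
    return current.join(previous, on="account_id", how="anti")
```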

This isn’t just another spin on event streams. It’s a new foundation for how teams share and consume data.

Engineering Time, Redeemed

When the system is clean, engineering work gets better. Teams stop spending their time tracing failures and maintaining brittle pipelines. Instead, they focus on building durable, meaningful transformations that actually move the business forward.

Tabsdata gives engineers a simpler surface to work with. No orchestration logic. No hidden dependencies. No reverse-engineering what a data source might mean. Each function is a self-contained, testable unit that reads tables and writes tables, nothing more.
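
Because the body of a function is plain Python over TableFrames, shaping logic can be factored out and unit-tested without a server or any connectors. A sketch, assuming a TableFrame can be built from a dict of columns and exposes a Polars-style row count (both are assumptions):

```python
import tabsdata as td

def keep_active(accounts: td.TableFrame) -> td.TableFrame:
    # Pure shaping logic: no connectors, no orchestration, trivially testable.
    return accounts.filter(td.col("status") == "active")

def test_keep_active():
    # In-memory construction and .height are assumed to mirror the
    # Polars-style API that TableFrame resembles.
    tf = td.TableFrame({"account_id": [1, 2], "status": ["active", "closed"]})
    assert keep_active(tf).height == 1
```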

The result is a model that’s easier to reason about, faster to evolve, and more resilient to change. Engineers spend less time patching and more time delivering. And that shift shows up in speed, trust, and outcomes across the board.

A Cleaner Foundation

What Tabsdata offers is an entirely different foundation for data collaboration. One where data producers can publish what they know. One where consumers can rely on what they receive. And one where the system keeps itself consistent without constant engineering effort.

The complexity didn’t disappear. It was just moved to the right place. Instead of hiding it inside orchestration, Tabsdata surfaces it through clean abstractions that teams can understand and own.

This is only the beginning. In the next article, we’ll go deeper into how Tabsdata supports transparency, ownership, and evolution across teams — all without the usual coordination overhead.