Scale Your Python Workflows with Dataflow
SQL has become the go-to tool for data processing for a reason: it scales effortlessly, especially when combined with powerful SQL engines like Snowflake or BigQuery. You can query just a few rows or terabytes of data using the same familiar SQL code, letting the database handle the heavy lifting while you focus on your logic.
Python, on the other hand, has a scaling problem. While it’s beloved for its simplicity and flexibility, it doesn’t seamlessly handle large datasets. Once your data no longer fits in memory, scaling Python workflows means rewriting your code to use distributed frameworks like Spark or Beam and spending significant time setting up and maintaining infrastructure.
At Twirl, we believe the size of your data shouldn’t limit you. We want to empower data scientists, machine learning engineers, and data analysts to focus on what they do best, regardless of data volume. That’s why we’re launching Twirl’s Dataflow integration. It’s designed to make the switch from in-memory to distributed processing seamless: you keep writing familiar Python to process massive datasets, while Dataflow automatically scales your jobs across multiple workers. And, as always with Twirl, you still get a local development mode that works across languages and databases, flexible and reactive scheduling, and end-to-end lineage and monitoring.
How Twirl’s Dataflow integration works
Our new integration is designed to abstract away the complexity of distributed compute and save you from having to rewrite your code in a new framework. Twirl lets you write the code in normal, non-Beam Python and then uses Dataflow to distribute the work across workers. Here’s how it works:
1. Declare a Dataflow job
Start by defining a DataflowGroupByJob in your manifest.py. This step tells Twirl how to partition your data for distributed processing:
job=twirl.DataflowGroupByJob(
    update_method=twirl.UpdateMethod.APPEND,
    group_by_key=["tenant_id", "status"],
    order_by_key=["event_timestamp"],
)
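To make the grouping concrete, here is a small plain-pandas illustration of how these keys split a table into groups. This is not Twirl code and the rows are made up; it only shows which rows end up together:

import pandas as pd

# Made-up input rows, for illustration only.
events = pd.DataFrame({
    "tenant_id": ["acme", "acme", "globex"],
    "status": ["active", "active", "churned"],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-02"]),
})

# Each distinct (tenant_id, status) combination forms one group. In step 2
# below, your job() function receives one such group per call.
for (tenant_id, status), group in events.groupby(["tenant_id", "status"]):
    print(tenant_id, status, len(group))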
2. Write your job logic as usual
Next, write your job logic in job.py just as you would for an in-memory Python job. You can assume the input contains only one group of data at a time:
import pandas as pd


def job(input: pd.DataFrame) -> pd.DataFrame:
    # Bucket each event into its calendar week (the Monday that starts the week).
    input["event_week"] = input["event_timestamp"].dt.to_period("W").apply(lambda r: r.start_time)
    # Count how many events fall into each week within this group.
    input["event_count"] = input.groupby("event_week")["event_week"].transform("count")
    output = input[[
        "tenant_id",
        "status",
        "event_week",
        "event_count",
    ]]
    return output
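To sanity-check the function before any distributed machinery gets involved, you can call it directly on a small, hand-built DataFrame representing a single group. This is plain pandas with made-up values, and the import path assumes the job.py layout described above:

import pandas as pd

from job import job  # the function defined in job.py above (import path assumed)

# One group's worth of toy data: a single (tenant_id, status) combination.
sample = pd.DataFrame({
    "tenant_id": ["acme", "acme", "acme"],
    "status": ["active", "active", "active"],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-08"]),
})

print(job(sample))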
Twirl will wrap your code up as a Beam pipeline, using the group-by key you specified in your manifest.py, before running it.
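Conceptually, that wrapping follows the familiar Beam pattern: key each row by the group-by columns, group them, and hand each group to the per-group function. The sketch below is only an illustration of that pattern, not Twirl’s actual implementation, and the import path for job() is assumed:

import apache_beam as beam
import pandas as pd

from job import job  # the per-group function from step 2 (import path assumed)


def run_group(keyed_rows):
    (tenant_id, status), rows = keyed_rows
    group = pd.DataFrame(list(rows))
    # Hand one group at a time to the plain-pandas job() and emit its rows.
    return job(group).to_dict("records")


rows = [
    {"tenant_id": "acme", "status": "active", "event_timestamp": pd.Timestamp("2024-01-01")},
    {"tenant_id": "acme", "status": "active", "event_timestamp": pd.Timestamp("2024-01-03")},
]

with beam.Pipeline() as pipeline:  # uses the local DirectRunner by default
    (
        pipeline
        | beam.Create(rows)
        | beam.Map(lambda r: ((r["tenant_id"], r["status"]), r))  # key by the group-by columns
        | beam.GroupByKey()
        | beam.FlatMap(run_group)
        | beam.Map(print)
    )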
3. Configure your runtime
Finally, configure a Beam runtime in your project_config.py. This connects Twirl to Google Dataflow and provides the necessary runtime environment:
beam_runtime=twirl.GcpBeamRuntime(
    project="twirldata-demo",
    location="europe-west1",
    bucket="twirldata-demo",
    job_worker_account="twirl-runner@twirldata-demo.iam.gserviceaccount.com",
)
With this setup, Twirl automatically submits your job to Dataflow, where it runs at scale without additional effort from you.
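For comparison, these fields map roughly onto the standard pipeline options you would otherwise have to pass to Beam yourself when targeting Dataflow. The snippet below is a hand-written equivalent, not what Twirl generates under the hood, and the temp_location path is an assumption:

# Roughly equivalent raw Beam options for Dataflow (comparison only; not what
# Twirl generates, and the temp_location path is assumed).
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="twirldata-demo",
    region="europe-west1",
    temp_location="gs://twirldata-demo/tmp",
    service_account_email="twirl-runner@twirldata-demo.iam.gserviceaccount.com",
)

With Twirl, you never construct these options yourself; the GcpBeamRuntime declaration is all the configuration you write.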
Local development with Dataflow jobs
Twirl’s local development mode, which lets you run entire pipelines safely before deploying to production, also works for Dataflow jobs. This means you can iterate quickly and debug your code before deploying to a cloud environment.
To run a Dataflow job locally, simply use the normal twirl run command. The output will look something like this:
$ twirl run bigquery/aggregates/job_run_event_statistics_large_df/
[12:12:51] INFO 🌀 Twirl starting... (twirl_version=0.11.1)
INFO ▶️ Running command in 'Mode.DEV'
[12:12:56] INFO 🌱 Updating tables... selection=['bigquery/aggregates/job_run_event_statistics_large_df/'] mode=dev
INFO ⚙️ Updating: bigquery/aggregates/job_run_event_statistics_large_df (mode=dev)...
[12:13:50] INFO ✅ Updating: bigquery/aggregates/job_run_event_statistics_large_df (mode=dev) succeeded in 54s!
INFO 🌻 Updated all tables!
Twirl will automatically execute your job using Beam’s DirectRunner, which runs locally on your machine. Inputs will be read from your production environment, and outputs will be written to your dev schema, where you can inspect the results. We recommend using a dev_where or dev_sample clause on your inputs to Dataflow jobs to limit the amount of data processed when running locally.
Once you’re ready to deploy your job to production, Twirl will seamlessly switch to using Google Dataflow to execute your job at scale.
Conclusion
With Twirl’s Dataflow integration, you can seamlessly scale your Python-based data processing pipelines to handle massive datasets. By abstracting away the complexities of distributed computing, Twirl allows you to write clean, concise, and efficient code using familiar Python syntax.
Whether you’re a data scientist, machine learning engineer, or data engineer, Twirl’s Dataflow integration can help you:
- Simplify complex workflows: Focus on your core business logic, not infrastructure.
- Accelerate development: Iterate quickly with local development and testing.
- Scale effortlessly: Handle massive datasets with ease.
Ready to try a new way of developing data pipelines? Sign up for a demo at twirldata.com to talk to us and request a trial.