Data Infrastructure Engineer

Posted 07 Apr 2020

Stripe

San Francisco


As a platform company powering businesses all over the world, Stripe processes payments, runs marketplaces, detects fraud, and helps entrepreneurs start an internet business from anywhere in the world. Stripe’s Data Infrastructure Engineers build the platform, tooling, and pipelines that manage the data behind all of it.

At Stripe, decisions are driven by data. Because every record in our data warehouse can be vitally important for the businesses that use Stripe, we’re looking for people with a strong background in big data systems to help us build tools that scale while keeping our data correct and complete. You’ll be creating best-in-class libraries to help our users fully leverage open source frameworks like Spark. You’ll be working with a variety of teams, some engineering and some business, to provide tooling and guidance that solve their data needs. Your work will allow teams to move faster, and ultimately help Stripe serve our customers more effectively. You will:

Create libraries and tooling that make distributed batch computation easy to build and test for all users across Stripe
Become an expert in, and contribute to, open source frameworks such as Spark and Iceberg to address issues our users at Stripe encounter
Create APIs to help teams materialize data models from production services into readily consumable formats for all downstream data consumption
Create libraries that allow engineers to define the dependency structure of workloads to be scheduled with Airflow (see the sketch after this list)
Create libraries that enable engineers at Stripe to easily interact with various serialization frameworks (e.g. Thrift, BSON, Protocol Buffers)
Create observability tooling to help our users easily debug, understand, and tune their Spark jobs
Build and maintain our in-house job observability service as well as our workflow metadata management service
Leverage batch computation frameworks and our workflow management platform (Airflow) to help other teams build out their data pipelines
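
To give a concrete flavor of the Airflow-facing work, here is a minimal sketch of defining a workload dependency structure using only the stock open source Airflow 2.x API (the internal libraries described above are not public, and the DAG, task, and job names below are hypothetical):

```python
# Minimal sketch: expressing batch job dependencies as an Airflow DAG.
# Uses only stock Airflow 2.x operators; all names here are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_batch_pipeline",
    start_date=datetime(2020, 4, 7),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task wraps a batch job; the spark-submit calls are stubbed with echo.
    extract = BashOperator(
        task_id="extract_raw_events",
        bash_command="echo 'spark-submit extract_raw_events.py'",
    )
    transform = BashOperator(
        task_id="build_daily_aggregates",
        bash_command="echo 'spark-submit build_daily_aggregates.py'",
    )
    publish = BashOperator(
        task_id="publish_to_warehouse",
        bash_command="echo 'spark-submit publish_to_warehouse.py'",
    )

    # The dependency structure: extract, then transform, then publish.
    extract >> transform >> publish
```

A library in this space would typically wrap this kind of boilerplate so engineers only declare their jobs and the edges between them.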

We’re looking for someone who has:

A strong engineering background in data infrastructure, and experience working on large open source projects such as Apache Spark.
Experience developing and maintaining distributed systems built with open source tools.
Experience building libraries and tooling that provide beautiful abstractions to users
Experience optimizing the end-to-end performance of distributed systems.
Experience writing and debugging ETL jobs using a distributed data framework such as Spark

Nice to haves:

You’re a committer to Apache Spark, Apache Hadoop, or other open source data technologies
Experience with Airflow or other similar scheduling tools

It’s not expected that you’ll have deep expertise in every dimension above, but you should be interested in learning any of the areas that are less familiar. Some things you might work on:

Write a unified data access layer that abstracts away regionality, the underlying S3 layout, and the table format technology
Continue to lower the latency and bridge the gap between our production systems and our data warehouse by rethinking and optimizing our core data ingestion jobs
Create robust, easy-to-use unit testing infrastructure for batch processing pipelines (see the sketch after this list)
Build a framework and tools to re-architect data pipelines to run more incrementally
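
As a rough illustration of unit testing a batch pipeline, here is a minimal sketch assuming a PySpark-plus-pytest stack; the transformation, table, and column names are hypothetical and do not reflect Stripe's actual jobs or internal testing framework:

```python
# Minimal sketch: unit testing a batch transformation with a local Spark session.
# Assumed stack: PySpark + pytest; all job logic and names are illustrative.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def daily_totals(df):
    """Hypothetical job logic under test: sum charge amounts per merchant."""
    return df.groupBy("merchant_id").agg(F.sum("amount").alias("total_amount"))


@pytest.fixture(scope="session")
def spark():
    # A small local session is enough for pipeline unit tests.
    session = (
        SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    )
    yield session
    session.stop()


def test_daily_totals(spark):
    input_df = spark.createDataFrame(
        [("m_1", 100), ("m_1", 250), ("m_2", 40)],
        ["merchant_id", "amount"],
    )

    result = {
        row["merchant_id"]: row["total_amount"]
        for row in daily_totals(input_df).collect()
    }

    assert result == {"m_1": 350, "m_2": 40}
```

Testing infrastructure for batch pipelines usually builds on this pattern by providing shared session fixtures, fixture data builders, and assertions over DataFrame schemas and contents.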