About the Role
We are looking for a Senior Data Engineer to join our Data Platform team and take ownership of building and scaling our AWS-based data lakehouse. You will architect and deliver robust, production-grade data pipelines, work closely with data scientists, analytics engineers, and product teams, and set the technical direction for how data flows across the organization. This is a hands-on engineering role — you will write production code in Java and Python every day, while also contributing to platform design decisions, mentoring junior engineers, and driving best practices around data quality, reliability, and governance.
Responsibilities
Data Lakehouse Architecture & Development
- Design and build scalable medallion-architecture data lakehouses (Bronze / Silver / Gold) on AWS S3 using the Apache Iceberg table format.
- Develop and maintain high-throughput ETL/ELT pipelines using AWS Glue, EMR (Spark), and Lambda.
- Implement schema evolution, partitioning strategies, and compaction processes for Iceberg tables to optimize storage and query performance.
- Write production-quality pipeline code in both Java and Python, selecting the appropriate language per performance and maintainability requirements.
Real-Time & Batch Streaming
- Build and operate event-driven data pipelines using Amazon Kinesis Data Streams, Kinesis Firehose, or Apache Kafka (MSK).
- Design exactly-once and at-least-once processing semantics for streaming workloads using Apache Flink or Spark Structured Streaming on EMR.
AWS Platform Engineering
- Manage infrastructure as code using AWS CDK or Terraform for repeatable, auditable data platform deployments.
- Optimize cost and performance across AWS services including S3, Glue, Athena, Redshift Spectrum, EMR, Lambda, Step Functions, and EventBridge.
- Implement data security best practices: IAM least-privilege policies, KMS encryption, VPC networking, and Lake Formation fine-grained access control.
- Build and maintain CI/CD pipelines for data workloads using AWS CodePipeline, GitHub Actions, or equivalent.
Data Quality & Governance
- Implement data quality frameworks (e.g., Great Expectations, Deequ) and integrate validation steps into pipeline orchestration.
- Define and enforce data contracts between producing and consuming systems.
- Contribute to data cataloguing and lineage tracking using AWS Glue Data Catalog or Apache Atlas.
Collaboration & Technical Leadership
- Partner with data scientists, ML engineers, and analysts to understand data requirements and deliver performant, well-documented datasets.
- Mentor mid-level and junior engineers through code reviews, design discussions, and pair programming.
- Document architecture decisions (ADRs) and contribute to the internal engineering knowledge base.
Required Qualifications
Experience
- 5+ years of professional data engineering experience, with at least 3 years on AWS cloud platforms.
- Proven track record of delivering production data pipelines at scale (TB+ datasets, high-throughput SLAs).
- Experience with data lakehouse architectures — medallion pattern, open table formats (Iceberg preferred; Delta Lake or Hudi acceptable).
Programming Languages
- Java: Strong command of Java (8+) for Spark jobs, custom Iceberg connectors, and performance-critical pipeline components. Familiarity with Maven/Gradle build systems.
- Python: Proficient in Python 3 for AWS Glue scripts, orchestration logic, data quality checks, and automation tooling. Experience with pandas, PySpark, boto3, and packaging best practices.
AWS Core Services
- Storage & Compute: S3, Glue (jobs, crawlers, Data Catalog), EMR (Spark/Flink), Lambda, EC2.
- Streaming: Kinesis Data Streams, Kinesis Firehose, or MSK (Managed Kafka).
- Orchestration: Step Functions, MWAA (Managed Airflow), or EventBridge Scheduler.
- Querying: Athena, Redshift, or Redshift Spectrum.
- Security & Governance: IAM, KMS, Lake Formation, Secrets Manager, VPC.
- DevOps: AWS CDK or CloudFormation; CodePipeline or equivalent CI/CD tools.
Data Processing Frameworks
- Apache Spark (PySpark and/or Spark Java API) — distributed transformations, performance tuning, memory management.
- Apache Iceberg — table maintenance, time travel, snapshot management, partition evolution.
- SQL — advanced SQL for data transformation, window functions, CTEs, query optimization.
Preferred / Nice to Have
- AWS Certified Data Engineer – Associate or AWS Certified Solutions Architect certification.
- Experience with dbt for SQL-based transformation layers on top of the lakehouse.
- Familiarity with ML platform integration: feature stores (SageMaker Feature Store), model serving data needs, or MLflow experiment tracking.
- Experience with real-time OLAP engines such as Apache Druid or ClickHouse.
- Contributions to open-source data tooling or internal platform libraries.
- Exposure to data mesh or data product thinking — defining domain ownership and data contracts.
Tech Stack at a Glance
Languages: Java (8+), Python 3
Cloud Platform: AWS (S3, Glue, EMR, Kinesis, Athena, Lambda, Step Functions, Lake Formation, CDK)
Processing: Apache Spark, Apache Flink, Spark Structured Streaming
Table Format: Apache Iceberg (primary), Delta Lake / Hudi (familiarity)
Streaming: Amazon Kinesis, MSK (Kafka), Kinesis Firehose
Orchestration: Apache Airflow (MWAA), AWS Step Functions
IaC & CI/CD: AWS CDK / Terraform, GitHub Actions / CodePipeline
Pay: From $130,000.00 per year
Benefits:
- 401(k)
- 401(k) matching
- Dental insurance
- Health insurance
- Life insurance
- Paid time off
- Parental leave
- Retirement plan
- Vision insurance
Ability to Commute:
- Irvine, CA 92618 (Required)
Work Location: In person