Data Engineer | Dayhoff Labs

Dayhoff Labs

Member of Technical Staff - Data Engineer

Location: London, UK
Start Date: Immediate
Position Type: Full-time

About the Role

We're seeking a data engineer with exceptional expertise in chemistry and biochemistry data systems to drive breakthrough discoveries through robust data infrastructure. The ideal candidate is passionate about building scalable data solutions, iterates quickly on complex pipelines, and isn't afraid to challenge conventional data architectures. You'll design and implement data systems that integrate diverse chemical and biological datasets, working in a highly collaborative environment where data insights directly inform experimental design and computational research.

What You'll Do

Design and implement scalable data pipelines for processing large-scale chemical and biochemical datasets from diverse sources including experimental results, literature, and public databases
Build and maintain robust database architectures that integrate structured data from biochemistry databases (KEGG, Rhea, ChEBI, PubChem, UniProt) with unstructured experimental data
Deploy data infrastructure and APIs for internal research teams, and collaborate closely with software and product teams to deploy external data products
Work hand-in-hand with computational scientists and AI/ML researchers to ensure data pipelines support advanced analytics and machine learning workflows
Take ownership of challenging data integration problems with heterogeneous formats and incomplete metadata, delivering clean, accessible datasets under tight timelines
Pioneer novel approaches to chemical data standardization, annotation, and quality control that push the boundaries of biochemical data management

What We're Looking For

Essential Experience:

Education: MS/PhD in bioinformatics, computational biology, cheminformatics, or a related field, OR a Bachelor's degree with 4+ years of relevant experience in biochemical data engineering
Database Expertise: Strong hands-on experience with relational and NoSQL databases, particularly for biological data
Biochemistry Databases: Demonstrated experience working with major biochemistry databases and APIs (KEGG, Rhea, ChEBI, PubChem, UniProt, MetaCyc)
Data Pipeline Development: Proven track record building production-scale ETL/ELT pipelines for scientific data using tools like Apache Airflow, Prefect, or similar workflow orchestration systems
Programming: Proficiency in Python and SQL, with experience in scientific data libraries (pandas, Biopython, RDKit for cheminformatics)
Big Data Technologies: Experience with distributed computing frameworks (Spark, Dask) and cloud platforms (AWS, GCP, Azure) for processing terabyte-scale datasets

Highly Preferred:

Experience with chemical informatics tools and molecular representation formats (SMILES, InChI, SDF, MOL files)
Knowledge of ontologies and controlled vocabularies in chemistry/biology (Gene Ontology, ChEBI ontology, EC numbers)
Experience with graph databases and network analysis for metabolic pathways and chemical reaction networks
Background in data quality assessment, statistical validation, and automated anomaly detection for scientific datasets
Experience with containerization (Docker/Kubernetes) and CI/CD pipelines for data products
Track record of building APIs and data products used by computational scientists and experimentalists

Essential Qualities:

High Agency: You see data bottlenecks and solve them without waiting for detailed specifications. You own data quality and accessibility outcomes.
Systems Thinking: You design robust, scalable data architectures and think holistically about data lineage, governance, and reproducibility.
Scientific Mindset: You understand the nuances of scientific data, including experimental variability, metadata requirements, and domain-specific quality control needs.
Problem-Solving: You're resourceful and comfortable working with messy, incomplete datasets from diverse experimental sources and legacy systems.
Collaborative: You thrive in tight feedback loops with scientists, actively seek to understand research needs, and translate complex data requirements into elegant technical solutions.

Why This Role Is Different

This isn't a traditional bioinformatics or enterprise data engineering position. You'll be working at the intersection of cutting-edge biochemical research, large-scale data infrastructure, and practical deployment, with the freedom to architect novel data solutions. Our team moves quickly, fails fast, and iterates based on real scientific validation. If you're energized by the prospect of seeing your data pipelines enable breakthrough discoveries within weeks rather than months, this role is for you.

The data you manage will directly power AI/ML models for drug discovery, metabolic engineering, and fundamental biochemical research. You'll have the opportunity to work with some of the most comprehensive chemical and biological datasets in the world while building the infrastructure that accelerates scientific discovery.

Apply

Send your resume or CV, a brief cover letter highlighting your most relevant data engineering and biochemistry experience, and links to any relevant code repositories or data products you've built (e.g., GitHub, public APIs) to careers@dayhofflabs.com.