Software Engineer - Data & ML Infrastructure
Company Introduction
At Bot Auto, we are revolutionizing the transportation of goods with our cutting-edge autonomous trucks, enhancing the quality of life for communities around the globe. Combining the agility of a start-up with the judgment of seasoned experts, our team has achieved numerous world-firsts. United by a shared vision, we are propelling the future of transportation. Join us and turn your ambitions into reality.
We are seeking a highly skilled and motivated Software Engineer to architect, develop, and scale a robust hybrid-cloud data and machine learning platform from the ground up. This is a hands-on coding role that requires a deep understanding of distributed storage and compute systems, along with extensive experience designing and implementing large-scale data and ML infrastructure. The ideal candidate excels at building efficient pipelines for data ingestion, transformation, and data lake formation, and at managing both relational and NoSQL databases. A strong commitment to data privacy, security, and access control is essential to ensure the integrity and accessibility of our data lake and database systems.
Key Responsibilities
Data Lake Infrastructure
- Design and implement scalable data infrastructure, including data lakehouse systems built on cloud object storage (e.g., S3) with data lake and data catalog services, supporting diverse data formats such as Parquet, Avro, and JSON.
- Integrate modern data lakehouse frameworks such as Delta Lake, Apache Hudi, and Apache Iceberg to provide versioning, ACID transactions, and optimized storage for high-performance analytics (a minimal write-path sketch follows this list).
- Architect, containerize, and orchestrate end-to-end data workflows using Kubernetes (K8s) and distributed computing frameworks, enabling efficient, large-scale data processing and transformation.
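To make the lakehouse write path concrete, here is a minimal sketch using PySpark with the Delta Lake extension. The bucket paths, table location, and partition column are hypothetical placeholders, and Hudi or Iceberg would follow a similar pattern:

```python
# Minimal sketch: landing JSON events as a versioned, ACID Delta table on S3.
# Assumes the delta-spark package is on the classpath; all names are
# hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-write")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Read raw JSON events and append them to a curated, partitioned Delta table.
events = spark.read.json("s3a://example-raw-bucket/events/")
(
    events.write.format("delta")
    .mode("append")
    .partitionBy("event_date")  # assumes an event_date column exists
    .save("s3a://example-lake-bucket/curated/events")
)

# Time travel: every commit is versioned, so earlier snapshots stay queryable.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(
    "s3a://example-lake-bucket/curated/events"
)
```

The same table format works for batch analytics and incremental pipelines, which is what makes it the backbone of the architecture described above.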
Machine Learning & Deep Learning Infrastructure
- Develop and manage a robust feature store (training-data store) that supplies machine learning models with high-quality features (see the point-in-time join sketch after this list).
- Design end-to-end data and ML pipelines, from data preparation to model deployment, enabling automated workflows for rapid experimentation and production.
- Collaborate closely with research scientists to train and optimize deep learning models, supporting model benchmarking, validation, and continuous improvement.
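The core correctness guarantee a feature store provides is the point-in-time join: training rows only see feature values that were known at label time. Here is a minimal sketch in pandas; the entity, feature, and label columns are hypothetical:

```python
# Minimal sketch of a feature store's core operation: a point-in-time join
# that attaches the latest feature values known *as of* each label timestamp,
# preventing leakage of future data. Column names are hypothetical.
import pandas as pd

features = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02"]),
    "avg_speed_7d": [61.0, 58.5, 64.2],
})

labels = pd.DataFrame({
    "entity_id": [1, 2],
    "label_time": pd.to_datetime(["2024-01-06", "2024-01-03"]),
    "label": [0, 1],
})

# merge_asof picks, per label row, the most recent feature row at or before
# label_time for the same entity -- a point-in-time-correct training set.
training_set = pd.merge_asof(
    labels.sort_values("label_time"),
    features.sort_values("event_time"),
    left_on="label_time",
    right_on="event_time",
    by="entity_id",
)
```

Production feature stores add online serving and backfills on top, but this join is the invariant they all enforce.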
Data Ingestion Framework
- Build a resilient data ingestion framework that handles large, real-time data streams from core database replication logs, enterprise APIs, and GraphQL sources, facilitating data sharing and replication across systems (a minimal ingestion sketch follows this list).
- Structure multi-layered data storage within the data lake, creating a comprehensive system that includes raw, curated, and warehousing layers to support various analytical and operational needs.
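As a sketch of the raw-layer landing step of such a framework, the worker below consumes a change-data-capture stream and writes micro-batches to object storage. It assumes kafka-python and boto3; the topic, broker, and bucket names are hypothetical:

```python
# Minimal sketch of a raw-layer ingestion worker. Assumes kafka-python and
# boto3; topic, broker, and bucket names are hypothetical placeholders.
import json
import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "example.core-db.cdc",            # hypothetical CDC topic
    bootstrap_servers="broker:9092",
    value_deserializer=lambda v: json.loads(v),
    auto_offset_reset="earliest",
)
s3 = boto3.client("s3")

batch = []
for msg in consumer:
    batch.append(msg.value)
    if len(batch) >= 1000:
        # Land the micro-batch untouched in the raw layer; the curated and
        # warehousing layers are derived downstream.
        key = f"raw/cdc/offset={msg.offset}.json"
        s3.put_object(
            Bucket="example-lake-bucket",
            Key=key,
            Body="\n".join(json.dumps(r) for r in batch).encode(),
        )
        batch = []
```

Keeping the raw layer byte-faithful is what lets the curated and warehousing layers be rebuilt when transformation logic changes.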
Core Database Management
- Oversee and optimize core data storage solutions, including relational databases, NoSQL databases, and real-time databases, ensuring reliable and high-performance data access.
- Implement and manage robust data backup, disaster recovery, and high-availability solutions to protect critical data assets (see the backup sketch below).
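One backup path this role would automate, sketched under the assumption of PostgreSQL and S3 (connection details and bucket names are hypothetical):

```python
# Minimal sketch: a logical PostgreSQL dump shipped to object storage.
# Production setups would add WAL archiving, retention policies, and
# regular restore drills; all names here are hypothetical.
import subprocess
from datetime import datetime, timezone

import boto3

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
dump_path = f"/tmp/coredb-{stamp}.dump"

# pg_dump custom format supports parallel, selective restore via pg_restore.
subprocess.run(
    ["pg_dump", "--format=custom", "--file", dump_path, "coredb"],
    check=True,
)

boto3.client("s3").upload_file(
    dump_path, "example-backup-bucket", f"postgres/coredb/{stamp}.dump"
)
```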
Privacy, Security, and Access Control
- Secure Data Access: Implement access controls and encryption to safeguard data.
- Privacy Compliance: Ensure data handling meets privacy standards and regulations.
- Monitoring & Alerts: Set up systems to track data access and identify security risks.
- Data Masking: Use techniques such as keyed hashing or tokenization to protect sensitive data in analytics and ML workflows (see the sketch below).
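For illustration, here is a minimal column-level masking sketch: a keyed (salted) hash replaces raw identifiers so rows stay joinable without exposing PII. The salt handling and column names are hypothetical; a real deployment would pull the key from a secrets manager:

```python
# Minimal sketch of column-level masking before data reaches analytics/ML.
# A deterministic keyed hash keeps rows joinable without exposing PII.
import hashlib
import hmac
import pandas as pd

SALT = b"example-secret-salt"  # hypothetical; never hard-code in production

def mask(value: str) -> str:
    """Deterministic keyed hash: same input -> same token, not reversible."""
    return hmac.new(SALT, value.encode(), hashlib.sha256).hexdigest()[:16]

df = pd.DataFrame({
    "driver_license": ["TX1234567", "TX7654321"],
    "miles": [120, 340],
})
df["driver_license"] = df["driver_license"].map(mask)
```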
Qualifications
Required:
- Educational Background: Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field, or equivalent experience.
- Experience in Data Infrastructure: Proven experience designing and implementing scalable data lakehouse architectures, including proficiency with cloud storage (e.g., S3), data cataloging, and modern data lake frameworks like Delta Lake, Apache Hudi, or Apache Iceberg.
- Proficiency in Distributed Systems: Strong experience with distributed computing and container orchestration, specifically with Kubernetes (K8s), Spark, and other large-scale data processing frameworks.
- Data Engineering & Machine Learning: Skilled in building and managing feature stores, data pipelines, and ML workflows, with experience in ML model training, benchmarking, and production deployment.
- Strong Programming Skills: Proficiency in programming languages such as Python, SQL, and optionally Scala or Java, along with experience in data frameworks (e.g., Apache Spark, Kafka) and ETL tools.
- Database Management: Solid understanding of relational and NoSQL databases, real-time databases, and core principles of database backup, recovery, and high-availability strategies.
- Data Privacy and Security: Knowledgeable in data privacy standards and regulations (e.g., GDPR, CCPA), with hands-on experience implementing access controls, data encryption, and monitoring.
- Analytical and Problem-Solving Skills: Demonstrated ability to troubleshoot complex data infrastructure issues, optimize performance, and apply innovative solutions in dynamic environments.
- Collaboration and Communication: Strong interpersonal skills to work closely with cross-functional teams, including data scientists, ML researchers, and engineering teams.
Preferred:
- Experience in the autonomous driving industry.