Compute cluster performance optimization
Lund / Full Time / Lund, Sweden
Backtick Technologies provides consultancy services in software engineering, data engineering, data science, artificial intelligence and related fields. Today, Backtick employs 12 engineers and has customers in both the enterprise and startup worlds. In 2023, we’re building a state-of-the-art data platform product, and we could use your help!
Data engineering revolves around using multiple machines to transform and extract value from large datasets. Distributed computing always involves many tradeoffs, and finding the right ones is key to achieving high performance for the workload in question.
Backtick is building a data platform based on Kubernetes, S3-compatible object storage, Apache Spark for compute and open data formats. In doing so, a lot of design decisions have to be made, ranging from which cloud provider to use and how many executor machines to run, to the optimal size of individual files stored in the data lake.
The project will involve setting up Apache Spark and MinIO on a Kubernetes cluster, and analyzing which parameters have the largest impact on the performance of the system.
Research questions are mostly up to you and your institution. We will help you design the project so that it meets the necessary academic standards. We are curious about things like:
- How does the ratio of cores to memory on executor nodes affect query performance?
- How large is the impact of the underlying storage hardware?
- Which Spark parameters have the largest impact on overall performance?
Who are we?
At Backtick, we’re a mix of innovators, software engineers, data engineers and data scientists. We’re a small consultancy helping our customers go from data to production-ready machine learning systems. We have an office in central Lund, and you are welcome to sit with us. Read more about us at backtick.se
Who are you?
We are looking for two students with exceptional knowledge of Linux, networking and computer hardware, preferably with an interest in distributed computing and big data. You can expect to learn a lot about Kubernetes, distributed storage and distributed computing systems.
Start date & duration
Jan/Feb/Mar 2023, 30 HP (30 ECTS credits, ~1 semester)
Introduce yourself in a few lines to:
Johan Henriksson, CTO email@example.com