Thesis Topics
❇️ Last updated: 2025-11-05 12:47:17 CET
Here are several topics that we are actively offering (this list is updated regularly). For all applications, we require a cover letter, your curriculum vitae, your latest grades, and proof of university enrollment. You are also expected to have at least some prior knowledge of the topic, either through university courses or through project experience on GitHub.
Students from KIT can also take on these topics as a Hiwi position.
Topics (offered only as a mandatory university internship / master's thesis)
- Trust & Failure-Mode Evaluation Framework for Scientific LLM Agents
In this internship, you will help develop a systematic evaluation framework to measure when large language models (LLMs) used in scientific settings are correct, uncertain, or confidently wrong. You will design experiments to probe model reliability on real experimental data (e.g., diagnostic questions from the KATRIN experiment), define metrics for trust and failure modes, and build visual analytics to compare different LLMs under controlled conditions. A minimal sketch of such a failure-mode metric is shown after this list. This project is ideal for students interested in trustworthy AI, model evaluation, uncertainty quantification, or scientific applications of foundation models. By the end, you will have built an evaluation suite that can support a future NeurIPS-level research paper on the reliability of autonomous scientific agents.
- Domain-Adapted LLMs: Fine-Tuning and Benchmarking for Scientific Reasoning
This internship focuses on adapting and benchmarking small and mid-sized open-source language models (1B–7B parameters) for reasoning tasks in experimental physics. You will fine-tune an LLM using QLoRA or similar methods (see the setup sketch after this list), evaluate it against larger closed models (e.g., GPT-4, Claude) on domain-specific questions, and measure trade-offs between accuracy, hallucination rate, latency, and compute cost. The project is well suited for students who want hands-on experience with model training, prompt engineering, HuggingFace tooling, and GPU workflows. The final outcome will be a reproducible benchmark comparing “big general models” vs. “small domain-adapted models,” with direct relevance to deploying scientific AI agents on local HPC systems.
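To give a flavor of the first topic, here is a minimal sketch of a trust/failure-mode metric: it buckets each model answer by whether the model was confident and whether it was correct, so the dangerous "confidently wrong" mode becomes directly measurable. The confidence source (self-reported or logit-derived) and the 0.8 threshold are illustrative assumptions, not project requirements.

```python
import numpy as np

def failure_mode_summary(confidences, correct, threshold=0.8):
    """Classify answers into four trust/failure modes.

    confidences: per-answer confidence in [0, 1] (e.g., self-reported or
                 derived from token logits -- an illustrative assumption)
    correct:     boolean array, whether the answer matched the reference
    threshold:   confidence above which the model counts as "confident"
                 (0.8 is an arbitrary illustrative choice)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    confident = confidences >= threshold
    return {
        "confident_correct": float(np.mean(confident & correct)),
        "confident_wrong":   float(np.mean(confident & ~correct)),  # the dangerous mode
        "uncertain_correct": float(np.mean(~confident & correct)),
        "uncertain_wrong":   float(np.mean(~confident & ~correct)),
    }

# Example: three answers from a hypothetical diagnostic QA run
print(failure_mode_summary(confidences=[0.95, 0.91, 0.40],
                           correct=[True, False, True]))
```

A real evaluation suite would aggregate these rates per model and per question category, which is exactly the kind of comparison the visual analytics in the project would surface.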
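For the fine-tuning topic, the sketch below shows a typical QLoRA setup with the HuggingFace transformers and peft libraries: the base model is loaded in 4-bit NF4 quantization and only small low-rank adapter matrices are trained. The model name and LoRA hyperparameters are illustrative assumptions (any 1B–7B open model works), and running it requires a CUDA GPU with bitsandbytes installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-3.2-1B"  # illustrative choice of base model

# Load the base model in 4-bit NF4 quantization (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only these small matrices are trained
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,       # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of base parameters
```

From here, the wrapped model can be passed to a standard HuggingFace training loop; keeping the trainable-parameter count this small is what makes fine-tuning feasible on a single local GPU or HPC node.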