LLM Evaluation
by Aleph Alpha

Build domain-specific AI evaluation systems. Close the gap between public benchmark performance and real-world utility through robust, automated model evaluation.
LLM Evaluation track image

You design and implement a complete, automated evaluation suite for large language models that goes far beyond surface-level metrics. Starting from a real-world problem, you curate a domain-specific dataset, define custom metrics, and build tests for bias, adversarial robustness, and real-world applicability. You learn to identify and address common pitfalls, such as unresolved limitations after fine-tuning, metric–goal misalignment, and hallucination persistence. By the end of four weeks, you will have conducted a full evaluation cycle and developed a deep, intuitive understanding of the “solution space” for AI evals. Your work mirrors the role of a conscientious ML engineer, producing artifacts immediately relevant to industry, with the potential to publish results, benchmarks, and best practices on platforms like Hugging Face or even at AI conferences.

Past Dates

8 September 2025