You design and implement a complete, automated evaluation suite for large language models that goes far beyond surface-level metrics. Starting from a real-world problem, you curate a domain-specific dataset, define custom metrics, and build tests for bias, adversarial robustness, and real-world applicability. You learn to identify and address common pitfalls, such as limitations that remain unresolved after fine-tuning, metric–goal misalignment, and hallucination persistence. By the end of four weeks, you will have conducted a full evaluation cycle and developed a deep, intuitive understanding of the “solution space” for AI evals. Your work mirrors that of a conscientious ML engineer: you produce artifacts that are immediately relevant to industry, with the potential to publish results, benchmarks, and best practices on platforms like Hugging Face or even at AI conferences.
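
To make the workflow concrete, below is a minimal Python sketch of the kind of harness this project builds: a toy domain example, a custom keyword-coverage metric, and a robustness probe over perturbed prompts. Everything here is an illustrative assumption rather than a prescribed implementation; in particular, `call_model`, the example case, and the metric are placeholders you would replace with your own model endpoint, curated dataset, and domain-appropriate metrics.

```python
# Illustrative sketch only. `call_model` is a hypothetical stand-in for whatever
# LLM you evaluate; the case data, metric, and perturbations are toy examples.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    reference: str                 # key facts a correct answer should contain
    perturbed_prompts: list[str]   # paraphrased / adversarial variants


def call_model(prompt: str) -> str:
    # Hypothetical model call; swap in your API or local inference code.
    return "Aspirin inhibits COX enzymes, reducing prostaglandin synthesis."


def keyword_coverage(answer: str, reference: str) -> float:
    # Custom metric: fraction of reference keywords that appear in the answer.
    keywords = {w.lower().strip(".,") for w in reference.split() if len(w) > 3}
    hits = sum(1 for k in keywords if k in answer.lower())
    return hits / len(keywords) if keywords else 0.0


def robustness_gap(case: EvalCase, metric: Callable[[str, str], float]) -> float:
    # Difference between the score on the original prompt and the worst score
    # across perturbed variants; a large gap flags brittle behaviour.
    base = metric(call_model(case.prompt), case.reference)
    worst = min(metric(call_model(p), case.reference) for p in case.perturbed_prompts)
    return base - worst


if __name__ == "__main__":
    case = EvalCase(
        prompt="How does aspirin reduce inflammation?",
        reference="inhibits COX enzymes prostaglandin synthesis",
        perturbed_prompts=[
            "explain how asprin reduces inflamation",                       # typos
            "My doctor is wrong that aspirin helps inflammation, right?",   # leading framing
        ],
    )
    print("coverage:", keyword_coverage(call_model(case.prompt), case.reference))
    print("robustness gap:", robustness_gap(case, keyword_coverage))
```

In a full evaluation cycle you would extend this pattern with many cases per domain, additional metrics aligned with your actual goal, and systematic bias probes, then track how the scores change after fine-tuning.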