Kubernetes and software engineering practice are quietly revolutionizing data science by giving practitioners better infrastructure and more disciplined habits. Many tools build on these primitives and practices to make machine learning deployments on Kubernetes simple, portable, and scalable. However, bringing engineering discipline to data science workflows turns out to be a thorny problem, and reproducible research is harder to achieve than we might assume. In this talk, we’ll examine the problem of reproducible research from several angles and present tools we’ve built on Kubernetes that address different facets of it. You’ll see how to treat Jupyter notebooks as real software artifacts, not merely as ad hoc environments for discovery, and learn what that change in mindset entails. You’ll see how we build workflows from notebooks, how we automatically generate model services with CI/CD pipelines, and which tools we use to generate and track metrics that identify concept drift. You’ll also learn about some surprising challenges of reproducibility and why some convenient model operationalization workflows might require heroic practitioner discipline to produce consistent results.

Talk
Intermediate