Kubernetes has become the go-to platform for running containers, making it really easy to add or remove resources. For a true 12-factor app this is straightforward and the right way to go.
The challenge comes with long-running (stateful) apps or big data workloads like Hadoop, Spark, Flink, ...
This talk shows strategies to mitigate these challenges and even turn them to your advantage.
For example, instead of running a small cluster 24 hours a day, run a cluster 12 times bigger only for short intervals (~2h). If your jobs scale well, you get your results up to 12 times faster, and more business value, for the same resource consumption. Or take a live recording: you have to keep the cluster up until the event ends, even if that takes 24 hours. Or an AI training job, which usually does not play well with snapshot-based recovery.
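A back-of-the-envelope sketch of that trade-off (the 10-node baseline is a made-up number, and real jobs rarely scale perfectly linearly):

```python
# Same node-hour budget, very different wall-clock time.
small_nodes, small_hours = 10, 24          # hypothetical always-on cluster
big_nodes = 12 * small_nodes               # 12x bigger, short-lived cluster

node_hours = small_nodes * small_hours     # 240 node-hours either way
big_hours = node_hours / big_nodes         # ~2 hours of wall-clock time

print(f"Same budget of {node_hours} node-hours: "
      f"results after {big_hours:.0f}h instead of {small_hours}h.")
```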
- use the cloud dynamically (scaling capacity by 100x within minutes is possible)
- per-job clusters sized to fit the workload
- leverage multiple node pools/groups
- help from k8s operators for deployment (see the SparkApplication sketch after this list)
- how this works together with workflow engines like Airflow (see the DAG sketch after this list)
- the k8s cluster autoscaler: how to leverage it, and its pitfalls (see the Job sketch after this list)
- the k8s scheduler: alternatives and options to consider
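A minimal sketch of the Airflow angle, assuming Airflow 2.x with the cncf.kubernetes provider installed; the namespace, image, and job arguments are placeholders, and the import path differs slightly between provider versions:

```python
from datetime import datetime

from airflow import DAG
# Older provider releases expose the operator under
# airflow.providers.cncf.kubernetes.operators.kubernetes_pod instead.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="per_job_spark_cluster",
    start_date=datetime(2024, 1, 1),
    schedule=None,      # triggered on demand, one run per batch job
    catchup=False,
) as dag:
    run_job = KubernetesPodOperator(
        task_id="run_spark_driver",
        name="spark-driver",
        namespace="batch",                      # placeholder namespace
        image="my-registry/spark-job:latest",   # placeholder image
        cmds=["spark-submit"],
        arguments=["--master", "k8s://https://kubernetes.default.svc", "/app/job.py"],
        get_logs=True,
    )
```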
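A sketch of the operator angle, assuming the Kubeflow spark-operator is installed in the cluster; all names, the image, and the sizing are placeholders, and the exact spec fields depend on the operator version:

```python
from kubernetes import client, config

# A SparkApplication custom resource as understood by the spark-operator (v1beta2 API).
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "nightly-aggregation", "namespace": "batch"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "my-registry/spark-job:latest",
        "mainApplicationFile": "local:///app/job.py",
        "sparkVersion": "3.5.0",
        "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
        "executor": {"cores": 4, "instances": 20, "memory": "8g"},
    },
}

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="batch",
    plural="sparkapplications",
    body=spark_app,
)
```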
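And a sketch of how node pools, the cluster autoscaler, and the scheduler choice show up in a single Job spec, using the official kubernetes Python client; the node pool label, toleration, and image are placeholders:

```python
from kubernetes import client, config

def build_batch_job() -> client.V1Job:
    # Pod template pinned to a dedicated node pool and marked so the cluster
    # autoscaler will not evict it mid-run.
    pod_spec = client.V1PodSpec(
        restart_policy="Never",
        scheduler_name="default-scheduler",  # swap in an alternative scheduler here, e.g. "volcano"
        node_selector={"nodepool": "batch-highmem"},
        tolerations=[client.V1Toleration(
            key="dedicated", operator="Equal", value="batch", effect="NoSchedule")],
        containers=[client.V1Container(
            name="worker",
            image="my-registry/flink-job:latest",  # placeholder image
            resources=client.V1ResourceRequirements(
                requests={"cpu": "4", "memory": "16Gi"},
                limits={"cpu": "4", "memory": "16Gi"},
            ),
        )],
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(
            annotations={"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"}),
        spec=pod_spec,
    )
    return client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="batch-worker"),
        spec=client.V1JobSpec(template=template, backoff_limit=2),
    )

if __name__ == "__main__":
    config.load_kube_config()
    client.BatchV1Api().create_namespaced_job(namespace="batch", body=build_batch_job())
```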