Checkpointing in Neural Network Models – a fault-tolerance technique for long-running processes

Checkpointing is a process that saves a snapshot of an application’s state at regular intervals, so that after a failure the application can restart from the last saved state rather than from the beginning. This is especially useful when training deep learning models, which is often time-consuming. For a deep learning model, the state at any point in time is simply the model’s weights at that moment.
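To illustrate the general idea, independent of any deep learning framework, the sketch below periodically saves a small piece of state to disk and resumes from the last snapshot on restart. All names here (the file path, the `run` function, the step counter) are illustrative, not part of any library API.

```python
import os
import pickle

CHECKPOINT_PATH = "progress.ckpt"  # illustrative filename

def load_checkpoint():
    """Return the last saved step, or 0 if no checkpoint exists."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)
    return 0

def save_checkpoint(step):
    """Snapshot the current step so a restart can resume from here."""
    with open(CHECKPOINT_PATH, "wb") as f:
        pickle.dump(step, f)

def run(total_steps=10, checkpoint_every=3):
    step = load_checkpoint()           # resume from the last snapshot
    while step < total_steps:
        step += 1                      # stand-in for one unit of work
        if step % checkpoint_every == 0:
            save_checkpoint(step)      # periodic snapshot
    return step
```

If the process is killed mid-run, calling `run()` again picks up from the most recent snapshot instead of repeating all prior work. In model training, the saved state is the weights rather than a counter, but the save-at-intervals / restore-on-restart pattern is the same.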

How to Check-Point Deep Learning Models in Keras