c/ai-innovations • sean_hunt61 • 5d ago

My custom image model crashed hard after 3 days of training

I was trying to train a model on a set of 5,000 product photos to get a specific style. It ran fine for about 72 hours, then the whole thing just stopped with a CUDA out of memory error. I had to go back and cut my batch size in half, from 8 to 4, which added another full day to the training time. Tbh, I think a few corrupted image files in my dataset messed everything up. Has anyone else run into this with PyTorch lately and found a better fix?
2 Comments
jake_hall88
Ever try gradient checkpointing? Worked for me when I had the same issue as you. And @jade_hernandez is right, it's just part of the grind. Also, running a quick script to find any broken images before training saved me a ton of headaches later; rough sketches of both below.
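In case it helps, here's roughly what I mean by the checkpointing (untested sketch, `MyModel` and its block layout are just stand-ins for whatever your custom model actually looks like):

```python
import torch
from torch.utils.checkpoint import checkpoint

class MyModel(torch.nn.Module):  # hypothetical stand-in for your model
    def __init__(self):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            [torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
             for _ in range(8)]
        )

    def forward(self, x):
        for block in self.blocks:
            # Recompute this block's activations during backward instead of
            # storing them, trading extra compute for lower peak GPU memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```

And the broken-image check is basically PIL's `verify()` plus a forced decode, something like this (the folder name and extension list are guesses, point it at your actual dataset):

```python
from pathlib import Path
from PIL import Image

def find_broken_images(root="product_photos"):
    bad = []
    for path in Path(root).rglob("*"):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
            continue
        try:
            with Image.open(path) as img:
                img.verify()  # cheap structural check, no full decode
            with Image.open(path) as img:
                img.load()  # verify() misses some truncation, so decode too
        except Exception:
            bad.append(path)
    return bad

for p in find_broken_images():
    print("broken:", p)
```

Dropping whatever that prints before you kick off training beats finding out 72 hours in.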
jade_hernandez
Eh, CUDA errors happen all the time though. Cutting the batch size is just part of the process, not some huge disaster.