My custom image model crashed hard after 3 days of training
Honestly, I was trying to train a model on a set of 5000 product photos to get a specific style. It was going fine for about 72 hours, then the whole thing just stopped and gave me a CUDA out of memory error. I had to go back and cut my batch size in half, from 8 to 4, which added another full day to the training time. Tbh, I think my dataset had a few corrupted image files that messed everything up. Has anyone else run into this with PyTorch lately and found a better fix?
2 comments
jake_hall88 · 4d ago
Ever try using gradient checkpointing? It worked for me when I had the same issue. And @jade_hernandez is right, it's just part of the grind. Also, running a quick script to find any broken images before training saved me a ton of headaches later.
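For anyone who hasn't used it, gradient checkpointing trades compute for memory: activations inside the checkpointed block are discarded after the forward pass and recomputed during backward, so you can often keep the larger batch size. A minimal sketch using `torch.utils.checkpoint` (the model shape and dimensions here are made up for illustration):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Wraps a sub-network so its activations are recomputed in backward."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # use_reentrant=False is the recommended mode in recent PyTorch versions.
        return checkpoint(self.net, x, use_reentrant=False)

# Toy model: two checkpointed blocks and a head.
model = nn.Sequential(CheckpointedBlock(64), CheckpointedBlock(64), nn.Linear(64, 1))
x = torch.randn(8, 64)           # batch of 8, the size the OP originally wanted
loss = model(x).mean()
loss.backward()                  # activations inside each block are recomputed here
```

The recompute roughly adds one extra forward pass per checkpointed block, so training gets somewhat slower per step, but that's usually a better trade than halving the batch size and doubling the step count.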
jade_hernandez · 4d ago
Eh, CUDA errors happen all the time though. Cutting the batch size is just part of the process, not some huge disaster.