c/ai-innovations • sean_hunt61 • 5d ago

My custom image model crashed hard after 3 days of training

I was trying to train a model on a set of 5,000 product photos to get a specific style. It ran fine for about 72 hours, then the whole thing just stopped with a CUDA out of memory error. I had to go back and cut my batch size in half, from 8 to 4, which added another full day to the training time. Tbh, I think a few corrupted image files in my dataset messed everything up. Has anyone else run into this with PyTorch lately and found a better fix?
2 Comments
jake_hall88
Ever try gradient checkpointing? Worked for me when I had the same issue as you. And @jade_hernandez is right, it's just part of the grind. Also, running a quick script to find any broken images before training saved me a ton of headaches later; rough sketches of both below.
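In case it helps, here's roughly what I mean by the checkpointing (untested sketch, `MyModel` and its block layout are just stand-ins for whatever your custom model actually looks like):

```python
import torch
from torch.utils.checkpoint import checkpoint

class MyModel(torch.nn.Module):  # hypothetical stand-in for your model
    def __init__(self):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            [torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
             for _ in range(8)]
        )

    def forward(self, x):
        for block in self.blocks:
            # Recompute this block's activations during backward instead of
            # storing them, trading extra compute for lower peak GPU memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```

And the broken-image check is basically PIL's `verify()` plus a forced decode, something like this (the folder name and extension list are guesses, point it at your actual dataset):

```python
from pathlib import Path
from PIL import Image

def find_broken_images(root="product_photos"):
    bad = []
    for path in Path(root).rglob("*"):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
            continue
        try:
            with Image.open(path) as img:
                img.verify()  # cheap structural check, no full decode
            with Image.open(path) as img:
                img.load()  # verify() misses some truncation, so decode too
        except Exception:
            bad.append(path)
    return bad

for p in find_broken_images():
    print("broken:", p)
```

Dropping whatever that prints before you kick off training beats finding out 72 hours in.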
jade_hernandez
Eh, CUDA errors happen all the time though. Cutting the batch size is just part of the process, not some huge disaster.