Does this still work if you give it a pre-existing many-legged animal image, instead of first prompting it to add an extra leg and then prompting it to put the sneakers on all the legs?
I'm wondering if it may only expect the additional leg because you literally just told it to add said additional leg. It would just need to remember your previous instruction and its previous action, rather than to correctly identify the number of legs directly from the image.
I'll also note that photos of dogs with shoes on is definitely something it has been trained on, albeit presumably more often dog booties than human sneakers.
Can you make it place the sneakers incorrectly-on-purpose? "Place the sneakers on all the dog's knees?"
I'm wondering if it may only expect the additional leg because you literally just told it to add said additional leg. It would just need to remember your previous instruction and its previous action, rather than to correctly identify the number of legs directly from the image.
I'll also note that photos of dogs with shoes on is definitely something it has been trained on, albeit presumably more often dog booties than human sneakers.
Can you make it place the sneakers incorrectly-on-purpose? "Place the sneakers on all the dog's knees?"