If you want to see something rather amusing: instead of using the LLM aspect of Gemini 3.0 Pro, feed a five-legged-dog image directly into Nano Banana Pro and give it an editing task that requires an intrinsic understanding of the unusual anatomy.
Place sneakers on all of its legs.
It'll get this correct a surprising number of times (tested with BFL Flux2 Pro, and NB Pro).
Does this still work if you give it a pre-existing many-legged animal image, instead of first prompting it to add an extra leg and then prompting it to put the sneakers on all the legs?
I'm wondering if it may only expect the additional leg because you literally just told it to add said additional leg. It would just need to remember your previous instruction and its previous action, rather than to correctly identify the number of legs directly from the image.
I'll also note that photos of dogs with shoes on are definitely something it has been trained on, albeit presumably more often dog booties than human sneakers.
Can you make it place the sneakers incorrectly-on-purpose? "Place the sneakers on all the dog's knees?"
I imagine the real answer is that the edits are local because that's how diffusion works; it's not like it's turning the input into "five-legged dog" and then generating a five-legged dog in shoes from scratch.
Sounds like they used GenAI to make them. The "Editor" models (Seedream, Nano-Banana) can easily integrate a fifth limb to create the "dog with awkward walking animation".
I just re-ran that image through Gemini 3.0 Pro via AI Studio and it reported:
I've moved on to the right hand, meticulously tagging each finger. After completing the initial count of five digits, I noticed a sixth! There appears to be an extra digit on the far right. This is an unexpected finding, and I have counted it as well. That makes a total of eleven fingers in the image.
This right HERE is the issue. It's not nearly deterministic enough to rely on.
Thanks for that. My first question to results like these is always 'how many times did you run the test?'. N=1 tells us nothing. N=2 tells us something.
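For what it's worth, the gap between N=1 and a meaningful sample is easy to put numbers on. Here's a quick sketch using the standard Wilson score interval for a binomial proportion (the run counts below are made up for illustration):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial success rate."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, centre - half), min(1.0, centre + half))

# One success out of one run: the interval still spans most of [0, 1].
print(wilson_interval(1, 1))   # roughly (0.21, 1.0)
# 5 passes in 20 runs (a "25% success rate"):
print(wilson_interval(5, 20))  # roughly (0.11, 0.47)
```

So a single successful run is consistent with anything from a ~21% to a 100% success rate, and even 5/20 only narrows it to somewhere between ~11% and ~47%.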
Anything that needs to overcome concepts which are disproportionately represented in the training data is going to give these models a hard time.
Try generating:
- A spider missing one leg
- A 9-pointed star
- A 5-leaf clover
- A man with six fingers on his left hand and four fingers on his right
You'll be lucky to get a 25% success rate.
The last one is particularly ironic given how much work went into FIXING the old SD 1.5 issues with hand anatomy... to the point where I'm seriously considering incorporating it as a new test scenario on GenAI Showdown.
Some good examples there. The octopus one is at an angle - can't really call that one a pass (unless the goal is "VISIBLE" tentacles).
Other than the five-leaf clover, most of the images (dog, spider, person's hands) all required a human in the loop to invoke the "Image-to-Image" capabilities of NB Pro after it got them wrong. That's a bit different since you're actively correcting them.
Multimodal certainly helps but "pretty well" is a stretch. I'd be curious to know what multimodal model in particular you've tried that could consistently handle generative prompts of the above nature (without human-in-the-loop corrections).
For example, to my knowledge ChatGPT is unified and I can guarantee it can't handle something like a 7-legged spider.
In fact, one of the tests I use as part of GenAI Showdown involves both parts of the puzzle: draw a maze with a clearly defined entrance and exit, along with a dashed line indicating the solution to the maze.
Only one model (gpt-image-1) out of the 18 tested managed to pass the test successfully. Gemini 3.0 Pro got VERY close.
Super cool! Interesting note about Seedream 4 - do you think awareness of A* actually could improve the outcome? Like I said, I'm no AI expert, so my intuitions are pretty bad, but I'd suspect that image analysis + algorithmic pathfinding don't have much crossover in terms of training capabilities. But I could be wrong!
Great question. I do wish we had a bit more insight into the exact background "thinking" that was happening on systems like Seedream.
When you think about posing "solve a visual image of a maze" to something like ChatGPT, there's a good chance it'll try to throw a Python VM at it, threshold the image with something like OpenCV, and run a shortest-path-style algorithm over the result.
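For illustration, the pathfinding half of that pipeline is simple once the image has been reduced to walls and open cells. Here's a minimal sketch with a hardcoded grid standing in for the thresholded OpenCV output (the maze, start, and goal are all made up):

```python
from collections import deque

# Toy stand-in for a thresholded maze image: '#' = wall pixel, '.' = open.
# In practice this grid would come from OpenCV (grayscale + cv2.threshold).
MAZE = [
    "#.#####",
    "#.#...#",
    "#.#.#.#",
    "#...#.#",
    "#####.#",
]
START, GOAL = (0, 1), (4, 5)  # entrance on the top edge, exit on the bottom

def solve(maze, start, goal):
    """Breadth-first search: returns the shortest path over open cells."""
    prev = {start: None}  # visited set doubling as a backpointer map
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []  # walk the backpointers to reconstruct the route
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(maze) and 0 <= nc < len(maze[0])
                    and maze[nr][nc] == "." and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # no route from start to goal

path = solve(MAZE, START, GOAL)
print(path)  # shortest path, 13 cells from START to GOAL
```

The interesting part of the original puzzle is everything before this step: getting a model to extract a clean grid, entrance, and exit from a rendered image is where they tend to fall over.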
I too have been throwing messages in bottles into a silent sea for a pretty long time, but I think I'm okay with that. It doesn't help if you also have difficulty adhering to quintessential blog SEO best practices.
1. Consistent theme - A diverse set of interests and a lethal dose of ADD make this virtually impossible
2. Consistent updates - My articles tend to be rather unusual, and I'll often combine them with customized interactive layouts. Even a monthly post would be pretty ambitious for me.
On a slightly related note, I'm hoping that zines [1] see a resurgence in popularity, as they could be a good point of entry towards gaining readership for those whose sites are inadvertently running in stealth mode.
Wow, thanks for the link to Paged Out. Any ideas on how to discover more tech-focused zines? We have a zine culture in the UK, but afaik it's more culture/music focused (v happy to be wrong here).
Reminds me of an excerpt from Tom Wolfe’s book The Right Stuff in which fighter pilots perceived doctors as the enemy, and heaven forfend you saw a psychiatrist!
> A man could go for a routine physical one fine day, feeling like a million dollars, and be grounded for fallen arches. It happened!—just like that! (And try raising them.) Or for breaking his wrist and losing only part of its mobility. Or for a minor deterioration of eyesight, or for any of hundreds of reasons that would make no difference to a man in an ordinary occupation. As a result all fighter jocks began looking upon doctors as their natural enemies. Going to see a flight surgeon was a no-gain proposition.
This reminds me of when a friend became a cop. One day I saw him, or thought I saw him, from far away, but I couldn't greet him: my myopia kept me from being sure it was him. And since I sometimes drive without my glasses on... what if one day he caught me?
It always baffles me how blasé people are about driving safety. The rules for driving aren't even that hard to follow. Yet people just seem constitutionally unable to do so.
> I wanted her take on Wanderfugl, the AI-powered map I've been building full-time.
I can at least give you one piece of advice. Before you decide on a company or product name, take the time to speak it out loud so you can get a sense of how it sounds.
I grew up in Norway, and there's this idea in Europe of someone who breaks from corporate culture and hikes and camps a lot (called Wandervogel in German). I also liked how, when pronounced in Norwegian or Swedish, it sounds like "wander full". I like the idea of someone who is full of wander.
In Swedish the G wouldn't be silent so it wouldn't really be all that much like "wonderful"; "vanderfugel" is the closest thing I could come up with for how I'd pronounce it with some leniency.
The weird thing is that half of the uses of the name on that landing page spell it as "Wanderfull". All of the mock-up screencaps use it, and at the bottom with "Be one of the first people shaping Wanderfull" etc.
Also, do it assuming different linguistic backgrounds. It could sound dramatically different to people who speak English as a second language, and they're going to be a whole lot of your users, even if the application is in English.
I'm a native speaker of English, northern California dialect. I pronounce every one of those letters, to varying degrees. Some just affect the mouth shape by subtle amounts, but it is there.
Just FYI, I would read it out loud in English as “wander fuggle”. I would assume most Americans would pronounce the ‘g’.
I thought ‘wanderfugl’ was a throwback to ~15 years ago when it was fashionable to use a word but leave out vowels for no reason, like Flickr/Tumblr/Scribd/Blendr.
Thanks for flagging. We're not open-source — the GitHub link shouldn't have been on the site. Removing it now.
We offer a private SDK for customers. If you want to test it, you can go to the website and create your account, or ping me at sukin@safekeylab.com.