It's also a matter of better input. If I feed machine A the text string "windows", what should the response even be? There's no context from which to build anything. Compare that to feeding machine B the prompt "I would like to know what kinds of windows I can buy for my house. The dimensions are 48x24". We can't really conclude that machine B is better, simply because the two machines aren't taking the same test.