Most people seem to expect some kind of quantitative analysis: N developers undertake M tasks with and without access to a given AI tool, the statistical evidence shows (or fails to show) an effect, and that result holds across other projects and tools.
In practice, this ideal scenario is very hard to achieve. Feasible experiments are necessarily narrow, with the expectation that their results can be (roughly) extrapolated beyond their specific experimental setup.
Another valid approach is qualitative research, for example a case study. This typically means studying one developer (or a few) and their specific context in great detail. The idea is that a deep understanding of how one person navigates their work and their tools can yield insights that transfer to our own situation.
Personally, in this particular area, I tend to prefer detailed qualitative accounts from developers working on projects similar to mine, with tools similar to mine.
But in any case, both approaches are valid and complementary.