It certainly wasn't easy, sushi.

Nor was it controlled to the point that it would pass peer review in a relevant technical journal. But it was good enough to seriously raise my confidence level in the purchase I made.
The testing involved three people (myself included), and some planning. The room had reasonable acoustics to begin with (a friend's "listening" room with at least partially effective room treatments), and the source gear was competent.
We gave a lot of thought to the logistics of swapping speakers, and this is the plan we came up with (I know, it's kinda involved, but we were really trying to do this "right"):
(1) We assigned a number to each of the seven pairs, 1 through 7.
(2) One listener was then blindfolded.
(3) Dice were rolled to randomly choose two of the seven pairs.
(4) The two pairs were both placed into a reasonable position (with both left speakers side by side - with just a few inches between them - and the same for the right). Which pair went on the outside and which on the inside was determined by the order in which their numbers were rolled. We wanted the two side by side so that cable swapping could be done quickly, and so that direct comparisons between "A" and "B" could be made.
(5) After that "trial" with that listener, the two pairs were returned to another room with all the other speakers. Which two pairs that listener had heard in that trial was not revealed until the end of the test. (In this way, it was only single-blind, but with random pairing and ordering.)
(6) The next listener was blinded, and the other two of us repeated steps 3, 4, and 5. We did this over several days, each trial taking roughly half an hour, with dozens upon dozens of trials, until each of us had heard each speaker paired with each of the others, and most had heard several identical pairings with the "outside/inside" order reversed. It wasn't exhaustive of the possible combinations, but it was complete enough that after examining the results we were comfortable in interpreting them to show a consistent trend.
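For anyone who wants to plan something similar, here is a rough sketch of the draw-and-track part of the steps above. It is not what we actually used (we rolled dice and kept notes by hand); `random.sample` stands in for the dice, and the names `draw_trial`, `run_schedule`, and the listener labels are made up for illustration. It also only tracks unordered pairings, not the outside/inside order, and purely random draws will repeat pairings, so the trial count it produces says nothing about how many trials we ran.

```python
import itertools
import random

SPEAKER_IDS = range(1, 8)  # seven pairs, numbered 1 through 7


def draw_trial(rng):
    """Pick two distinct pairs; the draw order decides outside vs. inside."""
    outside, inside = rng.sample(list(SPEAKER_IDS), 2)
    return outside, inside


def run_schedule(listeners, rng, max_trials=10000):
    """Rotate the blinded listener, drawing trials until every listener
    has heard every unordered pairing (or max_trials is reached)."""
    all_pairings = {frozenset(p) for p in itertools.combinations(SPEAKER_IDS, 2)}
    heard = {name: set() for name in listeners}
    log = []  # kept hidden from the listeners until the end of the test
    for trial in range(max_trials):
        listener = listeners[trial % len(listeners)]
        outside, inside = draw_trial(rng)
        heard[listener].add(frozenset((outside, inside)))
        log.append((listener, outside, inside))
        if all(h == all_pairings for h in heard.values()):
            break
    return log, heard


if __name__ == "__main__":
    log, heard = run_schedule(["A", "B", "C"], random.Random())
    print(f"{len(log)} trials drawn")
    for name, pairings in heard.items():
        print(name, "heard", len(pairings), "of 21 possible pairings")
```

Seven pairs give 21 possible unordered pairings (42 if you count outside/inside order), which is why covering them for three listeners takes so many sessions.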
Level matching was the one thing we could never figure out. We didn't have identical preamps and power amps at our disposal, which would have made this easier. The only thing we could think of was to give the blinded listener complete control of the volume remote. The level was set to zero between each swap so that relative sensitivity would be more difficult to discern, and the listener was encouraged to vary the volume often to cover the full dynamic range of the speaker... and hopefully mask to some degree the bias that mismatched levels introduce.
In the end, our preferences could be attributed to the interplay of each speaker's sensitivity and the volumes chosen, but since this at least reflects a quality that matters under real listening conditions at home, we felt it was a possibility we could live with.