craig,
You can safely ignore tony's complaint, as your proposed test method does offer that control. As you have pointed out, you only need to verify that the speakers are the same either before [b]or[/b] after they have both played for the same length of time. It doesn't really matter which.
The only disadvantage of comparing after both are broken in is that if they still sound different, the test really doesn't offer any conclusive results one way or the other. If they continue to sound the same, or if one is different and then in time sounds the same, then the test produces meaningful results. If they remain different, all you really know is that the speaker matching was poor and that they may or may not have drifted relative to one another during testing.
A point I'd like to stress, however, is that it isn't sufficient to simply note "same or different" for each trial or session. After all, a bias that leads you to believe break-in is real might have you thinking "different" at first and "same" later on.
You need to be able to offer proof that your ability to distinguish a difference is real, not imagined. The only way to do so, in the context of your test, is to consistently identify the two speakers absolutely. In other words, if your very first listening session leads you to believe they are "different" then you need to try and label them A and B. The next session, you need to try and match your A and B labels. That kind of stinks, but that's how it works. You need to try and focus on some aspect of the sound that you believe makes them different (if you think you hear a difference). Note which speaker belongs on which end of the spectrum for whatever aspect you choose (for example, A is brighter, B less so, and in each trial you try to consistently label the bright one as A).
With three speakers you could arrange an ABX test that would be much easier on you, at least, but with two you need to have some auditory memory from one trial to the next.
I hope you understand my point. You can't simply record "same or different" each time, because technically they will always be different, and there will be nothing objective to match your results to. You have to record "this speaker sounds X, so it is A" each time, and hope for consistency.
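If it helps to put a number on "consistency", here is a rough sketch of how you might score the sessions. It assumes someone else keeps track of which speaker is really which, that each session is a plain right-or-wrong call, and that pure guessing gets it right half the time. The function name and the sample log are made up for illustration; they are not results from any real test.

```python
from math import comb

def p_value_by_guessing(correct: int, trials: int, chance: float = 0.5) -> float:
    """Probability of getting at least `correct` right out of `trials` by guessing alone."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Hypothetical log: True where your A/B label matched the speaker's real identity.
sessions = [True, True, False, True, True, True, True, True, False, True]
hits = sum(sessions)

# Prints the number of hits, the number of sessions, and the chance of doing
# at least that well by coin-flipping.
print(hits, len(sessions), round(p_value_by_guessing(hits, len(sessions)), 4))
```

The smaller that last number, the harder it is to write your labels off as lucky guesses. With only a handful of sessions even a perfect score isn't very convincing, so plan on more sessions rather than fewer.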
Of course, with the dramatic break-in claims made, you would think that it would be easy to home in on one aspect of the sound and use it as an identifier to label A and B correctly. You might be surprised.