I've also been interested in how different pattern sizes and configurations affect my Samsung display calibration, so I created some test patterns with the following workflow (I don't know if it's something you can use, though):
1. Create patterns in Photoshop (0-255) and write them to uncompressed TIFFs
2. Import and transform to YUV in Final Cut Pro (both SD and HD versions)
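For anyone who wants to sanity-check the step 2 conversion by hand, here is a rough Python sketch of a 0-255 RGB to video-level Y'CbCr transform. I don't know exactly what Final Cut Pro does internally, so this just assumes the standard Rec.709 (HD) and Rec.601 (SD) luma coefficients with 8-bit 16-235 / 16-240 scaling; the function name is my own.

```python
def rgb_to_ycbcr(r, g, b, kr=0.2126, kb=0.0722):
    """Convert full-range 8-bit R'G'B' (0-255) to video-range 8-bit Y'CbCr.

    Assumption: defaults are the Rec.709 (HD) luma coefficients.
    Pass kr=0.299, kb=0.114 for Rec.601 (SD).
    """
    kg = 1.0 - kr - kb
    # Normalize the gamma-encoded components to 0..1
    rn, gn, bn = r / 255.0, g / 255.0, b / 255.0
    # Luma and color-difference signals (Cb, Cr in -0.5..+0.5)
    y = kr * rn + kg * gn + kb * bn
    cb = (bn - y) / (2.0 * (1.0 - kb))
    cr = (rn - y) / (2.0 * (1.0 - kr))
    # Scale to 8-bit video levels: Y' 16-235, Cb/Cr 16-240 centered on 128
    return (round(16 + 219 * y), round(128 + 224 * cb), round(128 + 224 * cr))


if __name__ == "__main__":
    print(rgb_to_ycbcr(255, 255, 255))                    # HD white
    print(rgb_to_ycbcr(255, 0, 0))                        # HD red
    print(rgb_to_ycbcr(255, 0, 0, kr=0.299, kb=0.114))    # SD red
```

The same Photoshop value lands on different Y'CbCr codes in SD and HD for the colors, which is one reason to keep separate SD and HD versions of the patterns.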
This seems to work quite well. I've benchmarked some 15% area windowed patterns against AVSHD and get great agreement in luminance values and dE.
I've also compared small gray scale and color patterns embedded within captured movie frames, to simulate an APL based on real material. I analyzed a series of images from the movie "The Tree of Life" (which has wonderful natural-looking cinematography) and found that average video levels tend to be lower than I expected (25-30%). You often get a bimodal histogram, though, with a main peak around 25% and a second, smaller peak in the 60-70% range for typical sunlit scenes. So I tested a set of patterns using a test image with a 27% average video level and embedded patterns at 55% (1 standard deviation above the mean). The results are summarized here:
Average comparison with AVSHD windowed, dE=0.6 +/- 0.6
Average repeatability, dE=0.7 +/- 0.4
Yellow highlights comparisons that yield >3 dE differences and red highlights those >6 dE. Using the 15% area windowed patterns as the reference, both the scene-based vs. windowed and the AVSHD Large APL vs. windowed comparisons showed >3 dE differences at the low end of the gray scale and in the primaries. The APL patterns showed the larger differences. Average gamma was quite stable for all the patterns on my contrast and gamma settings (2.29, 36 fL peak white). I don't consider any of these differences dramatic; the largest measured difference of 7.1 dE is still quite hard to discern. It does show, though, that at the 3-6 dE level the patterns you choose make a difference on displays with APL-dependent response.
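In case it helps anyone repeat the scene analysis, here is a rough sketch of how the "average video level plus one standard deviation" choice for the embedded patterns could be computed. It assumes you already have 8-bit luma codes (16-235) for the frame pixels and that a population standard deviation is used; the helper names are made up for illustration, and this is not the exact tool I used.

```python
from statistics import mean, pstdev


def code_to_percent(y_code):
    """Map an 8-bit video-range luma code (16-235) to 0-100% video level."""
    return (y_code - 16) / 219 * 100


def embed_level(levels_percent, k=1.0):
    """Return (average level, average + k standard deviations), in percent.

    Assumption: population standard deviation over all sampled pixels.
    """
    avg = mean(levels_percent)
    return avg, avg + k * pstdev(levels_percent)


if __name__ == "__main__":
    # Hypothetical luma codes sampled from one frame, not real measurements
    sample_codes = [30, 45, 60, 75, 150, 170]
    levels = [code_to_percent(c) for c in sample_codes]
    avg, target = embed_level(levels)
    print(f"average level {avg:.1f}%, embed patterns at {target:.1f}%")
```

With a 27% average and a 55% embed level, as in my test image, the implied standard deviation is about 28%.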