I think I can throw another concept into the mix that might clarify this: visual acuity as measured in arcminutes (or arcseconds), as in degrees-minutes-seconds. This is essentially the fraction of our visual field that an element takes up. It's measured as an angle, which is a linear, one-dimensional measurement, and it can be translated into physical sizes if we specify a fixed viewing distance. Essentially, if we can halve the amount of our vision an object takes up (a pixel, a circle of light, the tip of a needle, a drop of ink on paper, whatever), then we can use a collection of those to represent an image with twice as much detail.
A screen whose pixels are 1 arcminute across from our viewing position is half as sharp as a screen with 0.5 arcminute pixels, because on the latter we can resolve something half the linear size. That's been pretty well accepted in the photography/print world for a while.
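To make that concrete, here's a rough back-of-the-envelope sketch in Python (the function name and the example numbers are mine) that converts a pixel pitch and viewing distance into an angular size in arcminutes:

```python
import math

def pixel_angular_size_arcmin(pixel_pitch_mm: float, viewing_distance_mm: float) -> float:
    """Angular size of one pixel, in arcminutes, at a given viewing distance."""
    theta_rad = 2 * math.atan((pixel_pitch_mm / 2) / viewing_distance_mm)
    return math.degrees(theta_rad) * 60  # degrees -> arcminutes

# A 24" 1080p monitor has a pixel pitch of roughly 0.28 mm.
# At a typical ~600 mm viewing distance each pixel subtends about 1.6 arcminutes;
# halving the pitch (or doubling the distance) halves that angle.
print(pixel_angular_size_arcmin(0.28, 600))  # ~1.6
print(pixel_angular_size_arcmin(0.14, 600))  # ~0.8 -> twice as sharp
```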
Another similar concept is lines of resolution - it's relevant on analog CRT displays, where a display with 800 lines of resolution is going to be twice as sharp as a display with 400 lines. With square or square-ish pixels, you don't get 4x more detail from 2x more lines of resolution.
I realize this is just rehashing some things other people have said, but with different ways of measuring it; still, it shows that resolution has commonly been measured as a linear, one-dimensional factor. Think of it like the circumference of a circle: a circle is a two-dimensional object, but if you double the circumference, the circle will appear twice as big, not 4x bigger. Heck, we even measure our screen sizes by the diagonal - nobody is going to brag that their 60" TV is 4x the size of a 30" TV, because that's just not how we perceive images. Yes, it does have 4x the area and thus takes 4x more raw material to make, but it's not 4x bigger.
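A quick sanity check on the diagonal-vs-area point (a sketch assuming 16:9 screens; the function name is mine):

```python
import math

def screen_dims(diagonal_in: float, aspect=(16, 9)):
    """Width, height, and area of a screen, computed from its diagonal."""
    w, h = aspect
    scale = diagonal_in / math.hypot(w, h)
    width, height = w * scale, h * scale
    return width, height, width * height

small = screen_dims(30)
big = screen_dims(60)
print(big[0] / small[0])  # 2.0 -> looks twice as big in each direction
print(big[2] / small[2])  # 4.0 -> but has 4x the area (and raw material)
```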
I think the same confusion started happening with digital images early on, with camera megapixels (mentioned before, but I'll elaborate on this too). The common pro advice on digital camera resolution was: "Don't worry about megapixel counts - an 8 megapixel camera isn't twice as sharp as a 4 megapixel camera, it's only about 1.4x (a factor of √2) sharper, but megapixels are an easier way to inflate the apparent specs." I realize the same advice was also given to emphasize that other things matter, like lens quality, dynamic range, noise, and color quality, but I pretty clearly remember articles being written about how megapixel counts are nonsense because they make it look like there's a bigger difference in sensor resolution than there really is.
Long story short, we don't count pixels; our vision works based on the apparent size of tiny circles of light, and the smaller you can make the radius of those circles, the sharper the image will be. In the digital world, that means you do need to divide the area of a pixel by 4 to fit twice as many pixels in each dimension, but perceived resolution is most definitely proportional to that linear resolution, assuming the pixel shrinks equally in both directions.
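In code, the megapixel version of this is just a square root (a sketch; the function name is my own, and it assumes both sensors share the same aspect ratio):

```python
import math

def linear_sharpness_ratio(megapixels_a: float, megapixels_b: float) -> float:
    """Perceived (linear) sharpness ratio between two sensors of the same
    aspect ratio: the square root of their pixel-count ratio."""
    return math.sqrt(megapixels_a / megapixels_b)

print(linear_sharpness_ratio(8, 4))   # ~1.41 -> 8 MP is ~1.4x sharper
                                      # than 4 MP, not 2x
print(linear_sharpness_ratio(16, 4))  # 2.0 -> you need 4x the pixels to
                                      # double the linear resolution
```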
The 200x100 image with a 2:1 pixel ratio didn't really resonate with me - it only has double the pixels, but it also has either "exactly the same resolution" or "double the resolution" depending on which way it's measured. You really need to keep the pixel aspect ratio constant to make an apples-to-apples comparison of resolutions. 200x200 is definitely twice as sharp as 100x100, but a weird 2:1 pixel ratio display was only better on, for example, old computers, because it gave you extra pixels to use for dithering. If you've ever played a game in monochrome 640x200 graphics on an old system (CGA's high-res 1-bit mode; the Hercules card was similar at 720x348), they could dither the heck out of it horizontally with not-too-awful results despite the monochrome 1-bit display, and text could be sharper too relative to 320x200 color CGA graphics.
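To put numbers on that pixel-aspect-ratio objection, here's a small sketch comparing per-axis pixel density (the panel size and function name are hypothetical):

```python
def per_axis_density(h_pixels: int, v_pixels: int, width_mm: float, height_mm: float):
    """Pixels per millimetre along each axis: the per-axis 'linear resolution'."""
    return h_pixels / width_mm, v_pixels / height_mm

# Same hypothetical 100 mm x 100 mm panel:
square = per_axis_density(100, 100, 100, 100)  # square pixels
tall   = per_axis_density(200, 100, 100, 100)  # 2:1 pixel ratio

print(square)  # (1.0, 1.0) -> equally sharp both ways
print(tall)    # (2.0, 1.0) -> twice as sharp horizontally, unchanged
               # vertically: "double the resolution" one way, "the same" the other
```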
It's still true that 4K has 4x the pixels of 2K (1080p), but you'll only perceive it to be twice as sharp at best.