Great work, thanks for sharing

this is something that could be used to create interesting 2d effects along shaders.
I'm using your FastMem2Sprite in my Steam API wrapper (little modified) for copying users avatar pictures to sprite and that's why I had some question about code responsible for flipping image. In your code one of lines is:
unsigned char* high = (unsigned char*) &pixels((height -1) * width);
in c++ examples the last variable isn't 'width' but 'stride' -> 4*width, and I was wondering why it's working here, then looking at rest of code reminded me that as function argument you are using glb int array without casting it to 'uint char/byte'. This raised another question, glb int in 32bit are 4bytes, but in 64bit are they 4 or 8bytes? But after compiling to 64bit win it looks that they are still 4bytes and code is working without problems. One thing here, not sure if
typedef unsigned int size_t;
is really needed, it will compile without this in 32bit, and in 64bit it will give an error as size_t is already defined somewhere in c++ libs, it would be good to have at least some check here.
One thing that gave me a headache (as I forgot about it), was that 'power of 2 requirement' for texture size, I switched texture for some 720p image and it wasn't working, it took me couple minutes to figure out that it needs to be ^2

but that's really not an issue, as for any tricks we can use larger sprite/texture.
And thing about OpenGL ES, yeah unfortunately it lack's glGetTexImage but similar result could be achieved with glReadPixels like in this code:
https://stackoverflow.com/questions/53993820/opengl-es-2-0-android-c-glgetteximage-alternativedid You checked something like this? I don't have Android toolset now, so can't play with this, but one of most desired use case would be on Android to simulate some shaders that are not available in GLB on android.
Ah, I would forget but here are result's of this benchmark on old i5-4300u w Intel HD 4400 iGPU:
FastS2M + FastM2S -> 115cps
GLBS2M + GLBM2S -> 9.8cps
FastS2M + GLBM2S -> 10.7cps
GLBS2M + FASTM2S -> 54.5cps
With such speed it could be used to some real-time effects on smaller textures.