NX Gamer Does Some Great In-Depth Technical Analysis

Great Example of Horrible API Interface Design

Linux gamepad support (via /dev/input/event) is one of the best examples of horrible interface design that I can remember. It should serve as a reminder of "how not to do it".

(1.) Sparsely Segmented IDs
Every large gap in a sequence of IDs, say for input event codes, increases complexity for the developer: lookup tables can no longer be dense, special branching is required, etc.

(2.) Mix of Same Buttons With Different and Similar IDs on Different Devices
There is a dead obvious common mapping between XBOX and PLAYSTATION controllers which should have been used for common IDs. Instead the devices use different sets of IDs for the same buttons, which results in angry developers.

(3.) The Same IDs Getting Used for Different Buttons on Different Devices
Another example of the same disease: creating extra complexity for the developer. The point of an interface is to reduce complexity. When the app needs to special-case the device, the app is effectively writing the driver too.

(4.) Using Base 10 Numbering Without Leading Zeros
Talking about the numbering of "event#" in "/dev/input/event" file names. Instead of something dead simple for a program, like "/dev/input/event##" where "##" is a hex number with a leading zero when under 0x10, let's instead make it base ten and require the developer to special-case numbers under 10 (remove the leading zero). Fail.

(5.) Keeping Information Used Together On Unaligned Memory
For example, "struct input_id { __u16 bustype; __u16 vendor; __u16 product; __u16 version; };", it should be dead obvious that many developers would want to load "{vendor, product}" as a 32-bit value.

(6.) Providing No Good Way to Know if a Device is a Gamepad
The logic of the interface seems to be that developers need to keep an up-to-date {vendor, product} table for all current and future gamepads so they can write all the special-case per-device code required, or fall back on the even worse option of matching against a string table of device names.

(7.) Resorting To User-Space Daemons to Translate
Fantastic idea. First, let's add extra latency to input. Second, let's make it so that applications which handle the native /dev/input/event(s) manually now see the same device duplicated as some other kind of device, with no obvious way to know the device was duplicated. Third, let's push the problem to the user, creating another thing which needs to be configured by an expert and can easily go wrong.

Suggestions on How to Fix This Mess
Add INPUT_PROP_GAMEPAD so devs instantly know if the device is a "gamepad" or not. Any device which has the INPUT_PROP_GAMEPAD property uses a standard common set of contiguous input event codes.


The Order 1886!

This review contains no spoilers...

The Order 1886 is a fantastic game, one of my personal favorite games of all time.

Initially I was caught off guard by the doubt cast by various critics out to smear the game. They ended up doing me a favor, in that I now have a great list of online publications which I know to avoid spending any future time reading.

My quest started with a pre-order roughly 16 hours prior to launch, followed by a playthrough on easy, beginning in the early morning and wrapping around a regular working Friday.

Why Easy? And a Word on "Replay Value"
When I want a gaming challenge, aka something on "hard", and something with "replay value", I take on the best humans I can find in competitive multiplayer: often in Call of Duty or Killzone. It is the people who bring you back, not specifically the game. The Order 1886 serves a different purpose in my eyes: to provide a self-contained story experience which can be consumed, enjoyed, and remembered, and in this regard it excels. There is no expectation or need for replay value in this kind of game, just like there is no need for shoes to double as a toaster oven.

Quality Over Quantity
The game had the right length for me: not too long, paired with high production value. To experience the ultimate in a given art form, to be immersed in attention to detail so fine that the mind is transplanted into the scene: this is where The Order takes you. The Order is a ride, part film, part cover shooter, with time in between to get absorbed in the style, sound, material and environment of years past. Gunplay feels refined, with fantastic audio/visual feedback expressed in the technology of the era. The focus on quality pixels brings The Order to a place no other game has yet ventured visually.

To all those at Ready At Dawn, your hard work is much appreciated. I look forward to whatever you have in store next!


Leaving Something for the Imagination

Lack of information can invoke the perfect reconstruction of the mind. For visuals, it seems the depth and slope of the uncanny valley are proportional to the spatial and temporal resolutions of the output device. This 4K and eventual 8K craziness, while awesome for 2D and print, has an unfortunate consequence for real-time realistic 3D: available perf/pixel tanks at the same time required perf/pixel skyrockets due to the increased correctness required for perceptual reality.

The industry continues to shoot itself in the foot focusing on quantity instead of quality, raising the baseline cost required for digital 3D content to climb out of the uncanny valley.

The industry also seems to be on a roll reducing the quality of display technology. Sitting through an "IMAX" film at a DLP based theater destroyed the respect I had for that brand. Paying extra for a "cubism" filter applied to what otherwise might have been a good experience is not what I had in mind when going to the theater. Quite a shock for someone who grew up with analog IMAX and OMNIMAX (IMAX projected in a dome).

Scan-and-hold LCDs have killed the quality of motion, and strobed LCDs have insane frame-rate requirements compared to a similar experience on a CRT. With typical HDTVs and LCD displays, 60Hz sits in the no-man's land between having more perf/pixel for something visually interesting, and having enough frame-rate on a scan-and-hold device to remove enough blur in motion (120Hz and higher required). For this scan-and-hold generation the true purpose of the "motion-blur filter" is not to add realistic motion blur, but remove enough visual quality to mask the distortion caused by scan-and-hold in motion.

Making Lemonade
Display technology trends provide a powerful polarization: general flexible rendering solutions attempting to solve all problems will produce mediocre results (jack of all trades, master of none). IMO everything interesting can only be found by sacrificing something others view as required, which in turn enables you to do something otherwise impossible.

Giving up frame-rate, for example the hot path for realistic rendering on PC: leverage variable refresh rate (to be able to simultaneously maximize the quality of animation and GPU utilization), render letterboxed around 30Hz (to maximize perf/pixel), run all game logic on the GPU reading input from CPU-filled persistent mapped ring buffer (minimize input latency), use heavy post processing like motion blur and extreme amounts of film grain (remove enough exactness to invoke the mind's reconstruction filter).

4K presents a serious problem in that in-display up-sampling can add latency, and often in-GPU-scan-out or in-display up-sampling is total garbage (for example, too strong a negative lobe adds a halo effect). The way I'd tackle the 4K display is actually the opposite of convention: output native, but use the increased resolution to simulate a synthetic CRT shadow mask or maybe a very high ISO film (massive grain), to reduce the required internal target resolution to something under 1080p. On that topic, the majority of the trending "pixel art" games completely missed the point of the arcades: ultra-low-latency input with constant frame-rate (not possible on mobile platforms or in browsers), arcade joystick input (something well-grounded and precise which can take a pounding), and high-quality non-blocky pixels produced by a CRT.

My personal preference is for the most extreme tradeoffs: drop view-dependent lighting, go monochrome, drop resolution, drop motion blur, drop depth of field, no hard lighting, no hard shadows, remove aliasing, add massive amounts of film grain, maximize frame-rate, and minimize latency. Focus on problems which can be solved without falling into the valley; produce something which respects the limits of the machine, and yet strives for timeless beauty.


Continued Notes on Custom Language

Continuing notes on portable development for Linux and Windows using a custom language...

Using WINE on Linux for Windows development is working great: I'm able to use OpenGL 4.x in native 64-bit WIN32 binaries on Linux, using the same compiler to build both Linux and Windows binaries, both of which can be run from the same Linux box.

Forth-like Language as a Pre-Processor for Code-Generation
My 1706 byte compiler takes a stream of prefixed text: mostly {optional prefix character, then word, then space} repeated...

word \compile call to address at word\
:word \compile pop data stack into word\
.word \compile push 64-bit value at word onto data stack\
'word \compile conditional call to address at word if pop top of stack is non-zero\
`word \copy opcodes at word into compile stream\
%word \compile push address of word, still haven't used this\
{ word \define word, stores compile address in word, then at closing stores size (used for opcodes)\ }
34eb- \compile push hex number negated\
,c3 \write raw byte into compile stream, x86 ret opcode in this example\
"string" \compile push address of string, then push size of string\

The compiler strips whitespace and comments while converting the input words into machine code (using the conventions above), then executes the machine code. There is no "interpreter" as would normally be used in a Forth based language. The compiler does not even have the standard Forth opcodes; these are instead just specified in the input source file. For example, ADD,

{ add ,48 ,03 ,03 ,83 ,eb ,08 }

Which is the raw opcode stream for "ADD rax,[rbx]; SUB ebx,8" where "rax" is the top of the data stack, and "rbx" points to the 2nd item on the data stack. Since Forth opcodes are zero operand, it is trivial to just write them in source code directly (the language is easily extendable). I use under 30 Forth style opcodes. After an opcode is defined, it can be used. For example,

10 20 `add

Which pushes 16 then 32 on the data stack (numbers are all in hex), then adds them. To do anything useful, words are added which pop a value from the data stack and write them to a buffer. For example write a byte,

{ byte ,40 ,88 ,07 ,83 ,c7 ,01 ,48 ,8b ,03 ,83 ,eb ,08 }

Once the compiler is finished executing the machine code generated by the source, which in turn is used to write a binary into a buffer, the compiler stores that buffer to disk and exits. In order to do anything useful the next step is to use the language and source opcodes which extend the language to build an assembler. Some bits of my assembler (enough for the ADD opcodes),

\setup words for integer registers\
0 :rax 1 :rcx 02 :rdx 03 :rbx 04 :rsp 05 :rbp 06 :rsi 07 :rdi
8 :r8 9 :r9 0a :r10 0b :r11 0c :r12 0d :r13 0e :r14 0f :r15

\words used to generate the opcode\
{ REXw 40 `add `byte }
{ REX .asmR 1 `shr 4 `and .asmRM 3 `shr `add `dup 'REXw `drp }
{ REXx .asmR 1 `shr 4 `and .asmRM 3 `shr `add 48 `add `byte }
{ MOD .asmR 7 `and 8 `mul .asmRM 7 `and `add .asmMOD `add `byte }
{ OP .asmOP 8 `shr 'OPh .asmOP `byte }
{ OP2 :asmOP :asmRM :asmR 0c0 :asmMOD REX OP MOD }
{ OP2x :asmOP :asmRM :asmR 0c0 :asmMOD REXx OP MOD }

\implementation of 32-bit and 64-bit ADD\
{ + 03 OP2 } { X+ 03 OP2x }

Afterwards it is possible to write assembly like,

.rax .rbx + \32-bit ADD eax,ebx\
.rax .rbx X+ \64-bit ADD rax,rbx\

Due to the complexity of the x86-64 ISA, I used roughly 300 lines to get a full assembler (sans vector opcodes), with a majority of those opcodes not even getting used in practice. The ref.x86asm.net/coder64.html site is super useful as an opcode reference.

Binary Header
Next step is writing source to generate either a PE (Windows) or ELF (Linux) binary header. ELF with the "dlsym" symbol used roughly 70 lines (mostly comments to describe the mess of structures required for an ELF). The PE header I generate for WIN32 binaries looks similar to this example from Peter Ferrie, which is a rather minimal header with non-standard overlapping structures. I added an import table for base Kernel32 functions like "LoadLibraryA", out of fear that manual run-time "linking" via the PEB would trigger virus warnings on real Windows boxes. I'm not really attempting to hit a minimum size (like a 4KB demo), but rather just attempting to limit complexity. WINE handles my non-standard PE with ease.

If I was to write an OS, I wouldn't have binary headers (PE/ELF complexity just goes away). Instead I would just load the binary at zero, with execution starting at 4 with no stack setup, with binary pages loaded with read+write+execute, and then some defined fixed address to grab any system related stuff (same page always mapped to all processes as read-only). This has an interesting side effect that JMP/CALL to zero would just restart the program (if nop filled) or do exception (if invalid opcode filled). Program start would map zero-fill and setup stack. I'd also implement thread local storage as page mapping specific to a thread (keeping it simple).

ABI: Dealing With the Outside World
Having your own language is awesome ... dealing with the C based Linux/Windows OS is a pain. I use an 8-byte stack alignment convention; the ABI uses a 16-byte stack alignment convention. The ABIs for the Linux kernel, Linux C export libraries, and 64-bit Windows are all different. Here is a rough breakdown of register usage,

_ ___ _ LXK LXU WIN
0 rax .
1 rcx . k0 a3 a0
2 rdx . a2 a2 a1
3 rbx . s0 s0 s0
4 rsp X
5 rbp X t0 t0 t0
6 rsi . a1 a1 a4 <- stored before call on WIN
7 rdi . a0 a0 a5 <- stored before call on WIN
8 r8_ . a4 a4 a2
9 r9_ . a5 a5 a3
a r10 . a3 k0 k0
b r11 . k1 k1 k1
c r12 X t1 t1 t1
d r13 X t2 t2 t2
e r14 . s1 s1 s1
f r15 . s2 s2 s2

rax = return value (or syscall index in Linux)
rsp = hardware stack
a# = register argument (where Windows 64-bit a4 and a5 are actually on the stack)
t# = temp register saved by callee, but register requires SIB byte for immediate indexed addressing
s# = register saved by callee, no SIB required for immediate indexed addressing
k# = register saved by caller if required (callee can kill the register)

I use a bunch of techniques to manage portability. A set of words {abi0, abi1, abi2 ...} and {os0, os1, os2 ...} (for things which map to Linux system calls) map to different registers based on platform. The word "ABI(" stores the stack into R13, aligns the stack to 16 bytes, then sets up a stack frame for the ABI safe for any number of arguments. The words "ABI", "ABI5", and "ABI6+" do ABI based calls based on the number of integer arguments needed for the call. This is needed because Linux supports 6 arguments in registers while Windows only supports 4. Then later ")ABI" gets my 8-byte aligned stack back,

{ ABI( .abiT2 .rsp X= .rsp 0fffffffffffffff0 X#& .rsp 50- X#+ }
{ ABI \imm\ #@CAL } \call with up to 4 arguments\
{ ABI5 \imm\ #@CAL } \call with 5 arguments\
{ ABI6+ \imm\ #@CAL } \call with 6 or more arguments\
{ )ABI .rsp .abiT2 X= }

With the following words overriding some of the above words on Windows (slightly more overhead on Windows),

{ ABI(.W .abiT2 .rsp X= .rsp 0fffffffffffffff0 X#& .rsp 80- X#+ }
{ ABI5.W .abi4 .abiW4 PSH! \imm\ #@CAL }
{ ABI6+.W .abi4 .abiW4 PSH! .abi5 .abiW5 PSH! \imm\ #@CAL }

So a C call to something like glMemoryBarrier would end up being something like,

ABI( .abi0 \GL_BUFFER_UPDATE_BARRIER_BIT\ 200 # .GlMemoryBarrier ABI )ABI

And in practice the "ABI(" and ")ABI" would be factored out to surround a much larger group of work. The "#@CAL" translates to "CALL [RIP+disp32]"; since all ABI calls are manually dynamically linked, ".GlMemoryBarrier" is the address which holds the address of the external function (in practice I rename long functions into something smaller). Since both Windows and Linux lack any way to force libraries into the lower 32 bits of the address space, and x86-64 has no "CALL [RIP+disp64]", I decided against run-time code patching due to complexity (it would be possible via "MOV rax,imm64; CALL rax"). Both Windows and Linux require slightly different stack setup. Convention used for Linux (arg6 is the 7th argument),

-58 pushed return
-50 arg6 00 <---- aligned to 16-bytes, ABI base
-48 arg7 08
-40 arg8 10
-38 arg9 18
-30 argA 20
-28 argB 28
-20 argC 30
-18 argD 38
-10 argE 40
-08 argF 48
+00 aligned base

Convention used for Windows,

-88 pushed return
-80 .... 00 <---- aligned to 16-bytes, ABI base
-78 .... 08
-70 .... 10
-68 .... 18
-60 arg4 20
-58 arg5 28
-50 arg6 30
-48 arg7 38
-40 arg8 40
-38 arg9 48
-30 argA 50
-28 argB 58
-20 argC 60
-18 argD 68
-10 argE 70
-08 argF 78
+00 aligned base

The C based ABI and its associated "{save state, call, ... nesting ..., return, load state} around a small amount of code, repeated ..." pattern forces inefficient code in the form of code bloat and constant shuffling of data between functional interfaces. Some percentage of callee-save registers are often used to shadow arguments, data is often loaded into register arguments only to be saved to memory again for the next call, caller saves happen even when the callee does not modify the register, etc.

I'd much rather be using a "build command structures in memory, fire-and-forget an array of pointers to commands" model. The fire-and-forget model is more parallel friendly (no return), and provides the ability to reuse command data (patching). The majority of system calls or ABI library calls could just be baked command data which exists in the binary. Why do I need to generate complex code to do constant run-time generation and movement of mostly static data?

I conceptually treat registers as a tiny fast compile-time immediate-indexed RAM (L0$). Register allocation is a per-task process, not a per-function process. There is no extra shuffling of data, no push/pop of stack frames, etc. For example, register allocation is fixed purpose during the two passes of the compiler,

.rax :chr \input character\ .rax :dic \dictionary entry address\ .rax :str#
.rcx :hsh \hash value\ .rcx :jmp \jump address\ .rcx :num .rcx :str$
.rdx :pck1 \string packing 1\ .rdx :num- .rdx :siz
.rbx :pck2 \string packing 2\ .rbx :chr$1
.rbp :dic$ \dictionary top pointer, not addressed from\
.rsi :chr$ \input pointer\
.rdi :mem$ \memory stack, compile stack\
.r8 :def$ \define stack\
.r15 :zero \set to zero\
.rax :top \rax used for .dic only at launch\
.rcx :cnt \counter on text copy\
.rdx :fsk$ \float stack pointer, points to 2nd entry\
.rbx :stk$ \data stack pointer, points to 2nd entry\
.rsi :txt$ \used for source on text copy\
.rdi :out$ \output pointer, points to next empty byte\

My dev environment is basically two side-by-side terminal windows per virtual desktop with an easy switch between virtual desktops. I'm using nano as a text editor and have some rather simple color syntax highlighting for my language. No IDE, no debugger. Proper late stone-age development.

To port to WIN32 I did have to fix some code generation bugs with regard to having a non-zero ORG. On Linux I load the binary at zero, so file offsets and loaded offsets are the same. This is not possible on Windows. IMO there is more utility in keeping it simple than in having dereferences of page zero fault. When tracking down bugs in code generation or binary headers I just use "hexdump -C binary". My language supports injection of any kind of data during code generation, so it is trivial to just wrap an instruction or bit of code or data with something like "--------" which is easy to find via "hexdump -C binary | less".

The Forth inspired language I use has only one possible error: attempting to call a word which has not been defined (which calls zero). My compiler does not print any error messages; it simply crashes on that error. Since in practice this effectively never happens, I've never bothered to have it write out an error message. The last time I misspelled a word, it was a super quick manual log search (commenting out code) to find the error. When compiling and testing is perceptually instant, lots of traditional complexity is just not needed.

As for regular development, I started programming when a bug would take down the entire machine (requiring reboot). Being brought up in that era quickly instills a different kind of development process (one that does not generate bugs often). Most bugs I have are opcode mistakes (misspellings), like load "@" instead of store "!", or using a 32-bit form of an opcode instead of the 64-bit form. The only pointers which are 64-bit in my binaries are pointers to library functions (Linux loads libraries outside the 32-bit address space), the stack (I'm still using the OS default instead of moving to the lower 32-bit address space), or pointers returned by a library. When dealing with bugs with an instant iteration cycle, "printf" style debug is the fast path. I've built some basic constructs for debugging,

BUG( "message" )BUG \writes message to console\
.bugN .rax = 10 BUG# \example: prints the hex value in RAX to 16 digits to console\

Adding something to my "watch window" is just adding the code to output the value to the console. This ends up being way better than a traditional debug tool because console output provides both the history and timeline of everything being "watched".

In the past I had a practice of building a custom editor and run-time for each language, the idea being that it was more efficient to compile from a binary representation which has a dictionary embedded in the source code (no parsing, no hashing). Ultimately I moved away from that approach due to the complexity involved in building the editor I wanted for the language, mostly the complexity of interfacing with the OS. It is really easy to fall into the trap of building a tool which is many times more complex than the problem the tool is engineered to solve.

Decoupling from the dependency of a typical compiler and toolchain on modern systems has been a great learning experience. If FPGA perf/$ was competitive with GPUs I'd probably move on to building at the hardware level instead...


Transparency/OIT Thoughts

Possible to do the 64-bit single pass method from OIT in GL4.x with 32-bit atomics by packing {16-bit depth, 16-bit color channel} in 32-bits, where color channel is on a Bayer grid. Would require a demosaic.

Benjamin 'BeRo' Rosseaux posted an example of a hybrid atomic-loop weighted-blended order independent transparency implementation which mixes the 2-pass algorithm from "OIT in GL4" with depth-weighted blending for the tail.

If rasterizing in compute, it might be possible to do a single pass version of "OIT in GL4" with only 32-bit atomics by packing {16-bit depth, RG},{16-bit depth, BD} into a pair of 32-bit values, where RGBD encodes HDR. The trick is leveraging {even,odd} pairs of invocations (aka threads) in compute to do atomics to a pair of 32-bit values where the pair is 64-bit aligned. The API does not ensure that the pair of 32-bit atomics happens atomically, but in practice I believe desktop hardware (AMD/NV) will do that anyway. Clearly doing the pair of 32-bit atomics serially in one invocation won't work.

I still prefer stochastic methods with spatial+temporal post filtering. One such method, with a K-entry array per pixel, is to use a per-pixel atomic to grab an array index, then store out {depth, color, alpha} packed into a 64-bit value. A post process does an in-register sorting network to sort, reduces to one {color, alpha}, then moves on to filtering, then later composites with opaque. With only K bins per pixel it is possible to overflow, but there is a biased method to avoid overflow: stochastically do more aggressive dropping of the fragment if "alpha < threshold(gl_FragCoord, K_index)", where the dither pattern is based on gl_FragCoord, and the K_index is read (only do the atomic if the fragment is not dropped). The threshold progressively increases as K_index approaches the max K. Just a very rough front-to-back sort could help this out a bit (the sorting could be amortized into a multipass algorithm, only doing a bit of the sort per frame). This stochastic method could be interesting if "depth" for the fragment is stochastically set to somewhere in the probability distribution of the volume which a billboard represents. With proper noise filtering this could remove the billboard order-change pop problem. Also with NVIDIA's NV_shader_thread_group extension it might be possible to reduce a per-fragment atomic to a 2x2 fragment-quad atomic for this kind of algorithm (but not for the "OIT in GL4" algorithm).


AMD64 Assembly Porting Between Linux and Windows

One of the unfortunate differences between Linux and Windows for AMD64 assembly is that the two have completely different ABI calling conventions when accessing common system libraries like OpenGL. However it is possible to bridge the gap. Arguments 0 through 3 are in registers on both platforms, but in different registers (easy macro workaround),

Linux__: rdi rsi rdx rcx r8 r9
Windows: rcx rdx r8 r9

The solution for portability is to target Windows as if it had 6 argument registers since both rdi and rsi are callee save,

Windows: rcx rdx r8 r9 {rdi rsi}

But prior to a C library call with more than 4 integer arguments, push rsi and rdi onto the stack, then subtract 32 bytes from the stack pointer to reserve the Windows "register parameter area" (shadow space).

Finally, don't use the "red zone" from the Linux ABI, and also don't use the "register parameter area" from the Windows ABI.


Random Notes on Maxwell

Notes From GTX 980 Whitepaper | Maxwell Tuning Guide
GTX 980
16 geometry pipes
... One per SM
16 SMs
... 96KB shared memory per SM
... ?KB instruction cache per SM
... SMs divided into 4 quadrants
... Pair of quadrants sharing a TEX unit
... Each Quadrant
....... Issue 2 ops/clk per warp to different functional units
....... Supports up to 16 warps
....... Supports up to 8 workgroups
4 Memory Controllers
... 512KB per MC (2MB total)
... 16 ROPs per MC (64 total)

Only safe way to get L1 cached reads for read/write images is to run warp sized workgroups and work in global memory not shared by other workgroups. Hopefully an application can express this by typecasting to a read-only image before a read.

Using just shared memory, this GPU can run 64 parallel instances of a 24KB (data) computer without going out to L2.

(EDIT from Christophe's comments) This GPU has an insane untapped capacity for geometry: 16 pipes * more than 1GHz * 0.333 = maybe 5.3 million triangles per millisecond. Or enough for 2 single-pixel triangles per 1080p screen pixel per millisecond...


The Source of the Strange "Win7" Color Distortion?

EDIT: Root caused. Two problems: (1.) The Dell monitors at work have some problems in "Game" mode. They work fine in "Standard" mode. I'm guessing "Standard" uses some (latency adding?) logic to correct for some color distortion of the panel, and this logic gets turned off in "Game" mode. (2.) Displays tested at work are slow IPS panels with larger gamut than the fast displays at home. The hue-to-warm-to-white transition in my algorithm is too punchy in the yellows for large gamut displays. Algorithm needs some more tuning there.

I have a shadertoy program which presents photo editing controls and a different way to handle over-exposure. Fully saturated hues blend towards white instead of clamping at a hue, and all colors in over-exposure take a path towards white which follows a warm hue shift. So red won't go to pink then white (which looks rather unnatural), but rather red to orange to yellow to white. This test program also adds 50% saturation to really stress the hard exposure cases.

The result looks awesome on my wife's MacPowerBook, and on my home Linux laptop using either a CRT or the laptop LCD. However at work on a Win7 box with two Dell monitors, and on a co-worker's personal Win7 laptop, the result looks like garbage. Specifically some hues, on their route towards white, have a discontinuous gradient which looks like some color management mapping operation failed.

I tried lots of different things to adjust color management in Win7 to fix it, but was unable to. I was certain this had to be a problem in Win7, because two different Win7 machines with completely different displays had the same problem. Then another co-worker running Win8 with a set of two different Dell monitors (also color calibrated) got a different result: one display had the same problem, the other looked good.

So not a Windows problem, but rather some new problem I had never seen before on any display I have personally owned: LCD displays with really bad distortion on saturated hues. And apparently it is common enough that 4 of 5 different displays in the office had the problem. 4 of these displays were color calibrated via the GPU's per-channel LUT, but calibrated to maintain maximum brightness. Resetting that LUT made no difference. However changing some of the color temp settings on the display itself did reduce the distortion on one monitor (only tested one monitor).

Maybe the source of the problem is that the display manufacturers decided that the "brightness" and "contrast" numbers were more important, so they overdrive the display by default to the point where it has bad distortion? Changing the color temp would reduce the maximum output of one or two channels. Not at work right now, so not able to continue testing the theory, but guessing the solution is to re-calibrate the display at a lower brightness.