Finally figured this out. Pages 11, 12 and 13 of the official ITU-R HDR report (BT.2390) explain OOTF (the opto-optical transfer function) in detail: https://www.itu.int/dms_pub/itu-r/opb/rep/R-REP-BT.2390-10-2021-PDF-E.pdf
 
	But I'll try to explain it in my own words:
 
	
 
	The image on the left is the unintuitive result of keeping the light fully linear from the real world to the nits on the display. The camera essentially records the number of photons hitting each part of the sensor, so if one pixel has a signal value of 0.5 and another has 0.25, the first literally received twice as many photons as the second. If we preserve that ratio all the way to the display, so that one pixel shines exactly twice as bright as the other, we'd intuitively expect a perfectly faithful representation of reality, but in most cases we don't get one. There seem to be multiple contributors to this effect. In my experiments it was more pronounced with daylit scenes, with scenes containing some atmosphere, with scenes containing blooming or lens flares, with displays that can't show true black, with displays that have low peak brightness, and when viewing the display from further away. It also depends on the ambient light level in the viewing environment.
 
	But if the scene light is well matched to the display, and the viewing environment issues and all the aforementioned factors are avoided, then preserving the 1:1 light ratio does produce the most realistic looking image possible, one that arguably couldn't be made better by changing anything, provided that the real life scene itself is as aesthetically pleasing as possible. In over 99% of cases those conditions aren't met, so grading is needed to compensate for all those effects, and the average grade that one would want is what OOTF is. In the above image, it turns the washed out look into contrast that feels more true to real life (even though it's technically further away from it).
 
	OOTF is what breaks that 1:1 ratio between the light that hits the camera sensor and the light emitted by the display, so the pixel that was twice as bright before will now perhaps be three or four times as bright as the other pixel.
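To put a number on that, here is a minimal sketch that assumes OOTF can be approximated by a pure power law with a system gamma of 1.2. That is a simplification on my part (real OOTFs are not a single exponent, and the stretch is stronger in the shadows), and `simple_ootf` and `SYSTEM_GAMMA` are names made up for this illustration.

```python
# Assumption: OOTF approximated as a pure power law ("system gamma").
# 1.2 is a commonly quoted ballpark; real OOTFs are more complex.
SYSTEM_GAMMA = 1.2


def simple_ootf(scene_linear: float, gamma: float = SYSTEM_GAMMA) -> float:
    """Map relative scene light (0..1) to relative display light (0..1)."""
    return scene_linear ** gamma


dark, bright = 0.25, 0.5

# Ratio as the sensor saw it, and ratio after the OOTF grade.
scene_ratio = bright / dark
display_ratio = simple_ootf(bright) / simple_ootf(dark)

print(f"scene ratio:   {scene_ratio:.2f}")
print(f"display ratio: {display_ratio:.2f}")
```

With this simplified curve the 2:1 ratio grows to roughly 2.3:1. How much it grows in practice depends on the actual OOTF and on where the two pixels sit in the tonal range.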
 
	So when we're grading ourselves, we don't really "need" OOTF. We can use it as a starting point if we like, and many may feel it does half the job for them. I, for example, find it easier to grade HDR footage without OOTF, as it becomes much easier to control the shadow detail that way. OOTF is necessary for those who record home videos and then want to watch them on their TV immediately, without grading, and it's obviously essential for live broadcasts. If OOTF weren't part of the pipeline there, in over 99% of cases people would find that the resulting video looks really washed out and just bad.
 
	I'll quote the above document:
 
 
	This means that traditionally, EOTF was hardcoded by the physics of the CRT. So a Rec.709 camera would, in addition to encoding with the inverse of that gamma (OETF), at the same time add that basic grade called OOTF, so that footage would look nice on CRTs by default.
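The Rec.709 case can be sketched numerically by chaining the camera curve and the display curve. The OETF below uses the published BT.709 constants. The display side is my assumption: an idealised pure 2.4 power law with zero black level, standing in for a CRT. The function names are just for this example.

```python
def oetf_bt709(scene_linear: float) -> float:
    """BT.709 camera encoding: relative scene light (0..1) -> signal (0..1)."""
    if scene_linear < 0.018:
        return 4.5 * scene_linear
    return 1.099 * scene_linear ** 0.45 - 0.099


def eotf_gamma24(signal: float) -> float:
    """Idealised CRT-like display: signal (0..1) -> relative display light.

    Assumption: pure 2.4 power law with zero black level.
    """
    return signal ** 2.4


def ootf(scene_linear: float) -> float:
    """End-to-end result: camera encoding followed by display decoding."""
    return eotf_gamma24(oetf_bt709(scene_linear))


for value in (0.01, 0.05, 0.18, 0.5, 1.0):
    print(f"scene {value:.2f} -> display {ootf(value):.4f}")
```

Because the camera curve is not the exact inverse of the display curve, the round trip is not an identity: peak white stays at 1.0, while mid grey and the shadows come out darker than they went in, which is the added contrast.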
 
	These days we use log profiles or record in raw. When we invert the log curve via the CST (Color Space Transform) effect, we can choose whether we also want to apply the default grade, which Resolve calls "Apply Forward OOTF". This means that two transformations take place: first the log encoding is inverted and we arrive back at scene referred linear light, and then the OOTF function is applied to that to give a starting point, or even a final result, that looks more normal in most cases. If we undo the log profile with both OOTF checkboxes disabled, then we'll be seeing scene linear light, and it's up to us to grade it to look good, to create our own artistically improved OOTF for that particular footage.
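The two steps can be sketched like this. Everything here is illustrative: the log curve is a toy curve I made up (not any camera's real profile), the OOTF is a simple 1.2 power law, and the `apply_forward_ootf` flag only mirrors the name of the checkbox. It does not reproduce what Resolve computes internally.

```python
import math

A = 255.0  # hypothetical steepness of the toy log curve, not from any camera


def toy_log_encode(scene_linear: float) -> float:
    """Toy log profile: scene linear light (0..1) -> log signal (0..1)."""
    return math.log2(1.0 + A * scene_linear) / math.log2(1.0 + A)


def toy_log_decode(signal: float) -> float:
    """Exact inverse of toy_log_encode."""
    return ((1.0 + A) ** signal - 1.0) / A


def simple_ootf(scene_linear: float, gamma: float = 1.2) -> float:
    """Assumed stand-in for OOTF: a pure power law."""
    return scene_linear ** gamma


def decode(signal: float, apply_forward_ootf: bool) -> float:
    scene_linear = toy_log_decode(signal)  # step 1: undo the log encoding
    if apply_forward_ootf:
        return simple_ootf(scene_linear)   # step 2: add the default grade
    return scene_linear                    # left as scene linear light


recorded = toy_log_encode(0.18)  # mid grey as stored in the log file
print("scene linear:", decode(recorded, apply_forward_ootf=False))
print("with OOTF:   ", decode(recorded, apply_forward_ootf=True))
```

With the flag off, the original 0.18 comes back (up to floating point error). With it on, the same pixel lands lower, at 0.18 raised to the power 1.2.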
 
	But sometimes LUTs that invert log profiles already contain OOTF as part of their transformation, so the LUT actually contains the result of two sequential transformations. If the person grading instead wants only to undo the log encoding and arrive at scene linear light, a CST effect can be applied that converts from timeline gamma to timeline gamma, and with the "Apply Inverse OOTF" checkbox checked it returns the footage to scene referred linear light, essentially undoing the OOTF grade. One of the benefits of grading in scene referred linear light is that gain, a simple multiplication, acts like a physically correct exposure control. The other benefit is personal preference: I, for example, find it gives me much better control over blacks and shadows, as they don't come pre-compressed by OOTF.
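Here is a small sketch of both ideas, again assuming OOTF is a plain 1.2 power law so that its inverse is easy to write down. The function names are hypothetical, and a real LUT's baked-in curve may not be this simple.

```python
GAMMA = 1.2  # assumed system gamma for this illustration


def forward_ootf(scene_linear: float) -> float:
    return scene_linear ** GAMMA


def inverse_ootf(display_linear: float) -> float:
    """Undo the assumed OOTF, returning scene referred linear light."""
    return display_linear ** (1.0 / GAMMA)


def exposure(scene_linear: float, stops: float) -> float:
    """Gain in scene linear light: each stop doubles the light."""
    return scene_linear * 2.0 ** stops


# What a LUT with baked-in OOTF would hand us for an 18% grey card.
lut_output = forward_ootf(0.18)

# Undo the grade, then push exposure by one stop.
recovered = inverse_ootf(lut_output)
brighter = exposure(recovered, stops=1.0)

print("recovered scene linear:", recovered)
print("one stop brighter:     ", brighter)

# Gain keeps the ratio between two pixels intact, like opening the iris would.
print("ratio after gain:", exposure(0.5, 1.0) / exposure(0.25, 1.0))
```

The recovered value is the original 0.18, one stop of gain doubles it to 0.36, and the 2:1 ratio between the two test pixels is unchanged by the gain.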