Connecting the mask to the Effect Mask input of the MediaIn node gives me the behavior I would expect: both the mask and the footage are scaled down by the Transform node. Connecting the mask to the Effect Mask input of the Transform node behaves differently, though. The mask appears to be applied in un-transformed space, so the output ends up as an odd amalgam of the scaled and the original footage.
A workaround is to leave the mask connected to MediaIn, or to apply it further downstream. For example, add a Merge node: connect the Transform to the Merge's Foreground input, the mask to the Merge's Effect Mask input, and a Background node (solid color, with Alpha set to 0) to the Merge's Background input.
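
To make the difference between the three hookups easier to see, here is a toy model in plain Python. It is not Fusion's API, just a rough sketch on a single row of 8 pixels. It assumes that an Effect Mask on a node blends that node's output with its primary input (the Background, in the case of a Merge), and that an Effect Mask on MediaIn simply multiplies the footage. The helper names `scale_half` and `blend` are made up for illustration.

```python
def scale_half(row):
    """Stand-in for the Transform node: shrink a row to half size about
    its centre (nearest neighbour); uncovered pixels become 0."""
    n = len(row)
    out = [0] * n
    start = n // 4
    for i in range(n // 2):
        out[start + i] = row[2 * i]
    return out


def blend(original, processed, mask):
    """Assumed Effect Mask behaviour: processed where mask is 1,
    original where mask is 0."""
    return [m * p + (1 - m) * o for o, p, m in zip(original, processed, mask)]


footage = [1, 2, 3, 4, 5, 6, 7, 8]
mask = [0, 0, 0, 1, 1, 1, 1, 1]       # mask covers pixels 3..7
transparent = [0] * len(footage)       # Background with Alpha 0

# 1) Mask on MediaIn: footage is masked first, then both are scaled together.
upstream = scale_half([f * m for f, m in zip(footage, mask)])

# 2) Mask on Transform: un-transformed mask mixes scaled and original footage.
on_transform = blend(footage, scale_half(footage), mask)

# 3) Workaround: mask on a Merge, scaled footage as FG over a transparent BG.
via_merge = blend(transparent, scale_half(footage), mask)

print(upstream)      # [0, 0, 0, 0, 5, 7, 0, 0]
print(on_transform)  # [1, 2, 3, 3, 5, 7, 0, 0]
print(via_merge)     # [0, 0, 0, 3, 5, 7, 0, 0]
```

In this sketch the second result still contains the original pixels 1, 2, 3 next to the scaled footage, which is the amalgam described above. The Merge version never shows the original footage, but note that the mask stays at its own size there and just crops the already-scaled image, so it is not identical to masking at MediaIn.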