MicroPython assumes valid prelude and bytecode in .mpy files
We're running MicroPython as a task in an embedded product, and feel it's relatively safe to run Python code in this sandboxed environment. It accesses the file system in a restricted context, and Python code shouldn't be able to access data outside of the MicroPython heap and data structures embedded in the firmware.
We're looking at supporting .mpy files in this environment. It seems safe to allow .mpy files created on the device and stored such that the user cannot modify the contents. We'd also like to support use of mpy-cross to compile files that require a larger heap.
But we're concerned that users could modify the .mpy file in ways that would (for example) allow for reading any memory address or overwriting areas of RAM outside of the MicroPython heap or its task's stack.
To begin with, I've been looking at the file contents outside of the actual bytecode, and would like to implement sanity checks on some of the values. For example, n_def_pos_args must be <= n_pos_args. And it looks like n_state should be at least n_pos_args + n_kwonly_args + 1.
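To make the idea concrete, here is a minimal sketch of those two prelude checks in Python (the style of mpy-tool.py). The field names come from the prelude as described above; `MpyValidationError` is a hypothetical exception name for illustration, not an existing MicroPython API.

```python
class MpyValidationError(ValueError):
    """Raised when a decoded .mpy prelude fails a sanity check."""

def validate_prelude(n_state, n_pos_args, n_kwonly_args, n_def_pos_args):
    # n_def_pos_args counts defaults for positional args, so it can
    # never exceed the number of positional args themselves.
    if n_def_pos_args > n_pos_args:
        raise MpyValidationError("n_def_pos_args > n_pos_args")
    # The VM needs one state slot per argument plus at least one slot
    # of working stack, hence the "+ 1" lower bound suggested above.
    if n_state < n_pos_args + n_kwonly_args + 1:
        raise MpyValidationError("n_state too small for declared args")
    return True
```

Checks like these are cheap enough to run unconditionally at import time, which is the point of doing them once in the loader rather than on every VM dispatch.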
Is it possible to calculate a value for n_exc_stack by doing a validation pass on the bytecode? Or even a sanity check on the three _args settings (ensure the bytecode doesn't reference an arg index beyond what's configured)? Are there other checks we could perform?
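As a sketch of what such a validation pass might look like: a single scan over decoded (opcode, operand) pairs can reject out-of-range argument slots and compute a lower bound on exception-block nesting to compare against the encoded n_exc_stack. The opcode values below are hypothetical placeholders; a real implementation would use the constants from py/bc0.h for the target MicroPython version, and would need control-flow-aware analysis since a linear scan ignores jumps.

```python
# Hypothetical opcode values for illustration only; real code would
# decode the actual instruction stream using the py/bc0.h constants.
OP_LOAD_FAST = 0xB0      # operand: local/arg slot index (assumed)
OP_SETUP_EXCEPT = 0x47   # opens an exception handler block (assumed)
OP_POP_BLOCK = 0x48      # closes the innermost block (assumed)

def scan_bytecode(code, n_args):
    """Single pass over (opcode, operand) pairs: reject references to
    arg slots beyond what the prelude configured, and track the deepest
    exception-block nesting seen, giving a lower bound to sanity-check
    the encoded n_exc_stack against."""
    depth = 0
    max_depth = 0
    for op, arg in code:
        if op == OP_LOAD_FAST and arg >= n_args:
            raise ValueError(
                "bytecode references arg slot %d but only %d configured"
                % (arg, n_args))
        elif op == OP_SETUP_EXCEPT:
            depth += 1
            max_depth = max(max_depth, depth)
        elif op == OP_POP_BLOCK:
            depth = max(0, depth - 1)
    return max_depth
```

A loader could then require the encoded n_exc_stack to be >= the returned value; this catches simple corruption, though a hostile file with carefully constructed jumps would still need a proper control-flow pass.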
I feel it's better to add this burden to the import phase and reject invalid .mpy files there than to add range checks to the VM.
We plan to implement these behind a MICROPY_ configuration macro and eventually submit a PR. Open to recommendations on a name for that macro.
Further ways to improve .mpy encoding
Following up after https://github.com/micropython/micropython/pull/4564 .
- https://github.com/micropython/micropython/issues/4378 remains an issue.
- After #4564, there are two sufficiently different (i.e. independent) encodings for the bytecode: the in-memory encoding used by the VM, and the .mpy encoding, which is noticeably transformed during loading to reach the in-memory form. That also means further transformations can be added. One such transformation would be to not store the extra cache bytes for name-lookup opcodes in .mpy - they're useless there, and needed only by the VM. The MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE flag could then be dropped from the .mpy header, making .mpy files a bit more portable (fully portable between "beefy" ports of MicroPython which use unicode and MPZ long ints, as long as small ints are limited to 32 bits). The trade-off is that responsibility for handling MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE shifts to other tools, like mpy-tool.py.
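The loading-time transformation described above could be sketched like this: expand portable cache-free bytecode into the in-memory form by appending a zeroed cache slot after each name-lookup opcode. The opcode values are assumed placeholders (real code would use py/bc0.h), and bytecode is modeled as (opcode, operand) pairs to sidestep real instruction decoding.

```python
# Hypothetical name-lookup opcode values for illustration only.
NAME_LOOKUP_OPS = {0x1B, 0x1C, 0x1D}  # e.g. LOAD_NAME/LOAD_GLOBAL/LOAD_ATTR (assumed)

def insert_cache_bytes(mpy_code):
    """Expand portable .mpy bytecode (stored without cache bytes) into
    the in-memory form used when MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE
    is enabled, by appending an initially-empty cache slot after each
    name-lookup opcode."""
    out = []
    for op, arg in mpy_code:
        out.append((op, arg))
        if op in NAME_LOOKUP_OPS:
            out.append(("CACHE", 0))  # one cache byte, zeroed until first lookup
    return out
```

On a build without the option, the loader would simply skip this step, so the same .mpy file serves both configurations without the header flag.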