esp32: Heap fragmentation and mbedtls/lwip.
This is a meta-issue to track esp32 heap fragmentation issues and analysis of possible solutions and workarounds.
(See original analysis in #5543 and other reports/related work in #7038, #8628, #8662, #8251, #5355, #7061, #5219, #5808, #7214)
The high-level issue is that on, for example, a pico-d4 with IDF v4.4 the RAM layout at startup is:
Showing data for heap: 0x3ffb9a20
Block 0x3ffbb4e8 data, size: 4452 bytes, Free: Yes
Showing data for heap: 0x3ffc9ba8
Block 0x3ffcf174 data, size: 69256 bytes, Free: Yes
Showing data for heap: 0x3ffe0440
Block 0x3ffe0a9c data, size: 13440 bytes, Free: Yes
Showing data for heap: 0x3ffe4350
Block 0x3ffe49ac data, size: 112208 bytes, Free: Yes
MicroPython's current logic is to calculate the total (8-bit capable) IDF heap size (244072) and the largest contiguous free block (110592), and then try to allocate min(244072 / 2, 110592) = 110592. (get_largest_free_block returns 110592 even though that block is actually 112208 bytes, i.e. it rounds down to 4kiB. Looking at the implementation of the IDF allocator, it does power-of-two allocations.)
This leaves four fragments for the IDF heap: 4452 + 69256 + 13440 + 1612 bytes (88760 bytes in total).
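The sizing rule described above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not the port's actual C code: `micropython_heap_size` and `round_down_4k` are names made up for illustration, and the 4kiB granularity is modelled as a plain round-down.

```python
# Free blocks reported by the IDF at startup (bytes), from the dump above.
free_blocks = [4452, 69256, 13440, 112208]
total_idf_heap = 244072  # total 8-bit capable IDF heap, as quoted above


def round_down_4k(n):
    """Model the 4kiB granularity seen in the largest-free-block result."""
    return (n // 4096) * 4096


def micropython_heap_size(total, largest_free):
    """Hypothetical helper: half the total heap, capped by the largest block."""
    return min(total // 2, largest_free)


largest_block = max(free_blocks)
largest_reported = round_down_4k(largest_block)
mp_heap = micropython_heap_size(total_idf_heap, largest_reported)

# What remains for the IDF: the untouched blocks plus the tail of the big one.
tail = largest_block - mp_heap
idf_fragments = [b for b in free_blocks if b != largest_block] + [tail]

print("MicroPython heap:", mp_heap)
print("IDF fragments:", idf_fragments, "total:", sum(idf_fragments))
```

The tail computed this way is 1616 bytes rather than the 1612 quoted above. The difference is presumably allocator bookkeeping, which this sketch does not model.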
From this, the IDF allocates for:
- mbedtls (16kiB + 4kiB per wrapped socket for buffers, plus a further ~15kiB at peak)
- lwip (~4-7kiB per socket)
- wifi stack (36kiB for initialisation, 2.5kiB for active, 7kiB for connect = 45.5kiB total) (further 3.5kiB if you also enable softAP)
- bluetooth stack (19kiB for ble.active(1), plus small extra for NimBLE service registrations).
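To get a feel for how quickly these consumers exhaust the IDF heap, the figures in the list can be added up. The sketch below is a rough estimate only: `idf_demand_kib` is a hypothetical helper, it uses the upper lwip figure (~7kiB) for every socket, it reads the mbedtls buffers as 16kiB + 4kiB per wrapped socket, and it assumes the ~15kiB peak is needed by one handshake at a time. It also ignores fragmentation entirely, so real behaviour will be worse.

```python
KIB = 1024

# IDF heap left after MicroPython takes its share (fragments quoted above).
idf_available_kib = (4452 + 69256 + 13440 + 1612) / KIB


def idf_demand_kib(tls_sockets=0, plain_sockets=0, wifi=True, soft_ap=False, ble=False):
    """Hypothetical estimate of peak IDF heap demand, in kiB."""
    demand = 0.0
    if wifi:
        demand += 45.5  # initialisation + active + connect (total from the list)
    if soft_ap:
        demand += 3.5
    if ble:
        demand += 19.0  # ble.active(1); service registrations are ignored
    # lwip: assume the upper ~7kiB figure for every open socket.
    demand += 7.0 * (tls_sockets + plain_sockets)
    if tls_sockets:
        # mbedtls: 20kiB of buffers per wrapped socket, plus one ~15kiB peak.
        demand += 20.0 * tls_sockets + 15.0
    return demand


scenarios = [
    ("wifi + 1 http socket", dict(plain_sockets=1)),
    ("wifi + 1 https socket", dict(tls_sockets=1)),
    ("wifi + BLE + 1 https socket", dict(tls_sockets=1, ble=True)),
]
for label, kwargs in scenarios:
    need = idf_demand_kib(**kwargs)
    print(f"{label}: need ~{need:.1f} kiB, have ~{idf_available_kib:.1f} kiB")
```

Under these assumptions, even a single https socket with wifi alone asks for slightly more than the IDF has left.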
For example, when fetching an http resource: after wifi is connected, the largest available IDF heap alloc is 25600. This drops to 15616 while the socket is open (9.75kiB total) and returns to 25600 after close.
When fetching an https resource, at the point wrap_socket is called, the available IDF blocks are (22528, 13312, 1600) (total ~37kiB). There's a further ~7kiB of lwip mallocs, plus the 35kiB of mbedtls. The request subsequently fails due to an OOM.
mbedtls calls malloc a lot -- the sequence of allocations up to the first free is (16717,4429,220,128,2240,16,344,1435,32,32,172,260,4,16,16,16,16,16,16,16,344,1306,32,32,32,32,172,260,4,16,16,344,1380,32,32,32,172,516,4,16,4,4,32).
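Together with the totals above (~7kiB of lwip plus ~35kiB of mbedtls against ~37kiB available), this shows clearly why it's impossible to do an SSL request on IDF 4.4 with this layout.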
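The shortfall can be checked with simple arithmetic. The sketch below uses only the figures quoted above (the free IDF blocks at wrap_socket time, ~7kiB of lwip, ~35kiB of mbedtls, and the recorded allocation sequence). The variable names are illustrative, and allocator per-block overhead is ignored, which would only make things tighter.

```python
KIB = 1024

# Free IDF blocks at the point wrap_socket is called (bytes).
free_blocks = [22528, 13312, 1600]

# mbedtls allocation sizes up to the first free, as recorded above.
mbedtls_allocs = [
    16717, 4429, 220, 128, 2240, 16, 344, 1435, 32, 32, 172, 260, 4,
    16, 16, 16, 16, 16, 16, 16, 344, 1306, 32, 32, 32, 32, 172, 260, 4,
    16, 16, 344, 1380, 32, 32, 32, 172, 516, 4, 16, 4, 4, 32,
]

available = sum(free_blocks)
needed = 7 * KIB + 35 * KIB  # lwip + mbedtls, per the estimates above
shortfall = needed - available

# How many of the recorded requests are tiny (32 bytes or less)?
tiny = sum(1 for n in mbedtls_allocs if n <= 32)

print(f"available: {available} bytes, needed: ~{needed} bytes")
print(f"shortfall: ~{shortfall} bytes")
print(f"{len(mbedtls_allocs)} mbedtls allocs before the first free, {tiny} of them <= 32 bytes")
```

Note that the two large buffer allocations at the start of the sequence have to come out of the single 22528-byte block, since neither of the other blocks can hold the first one.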
I implemented #8526 for ESP32 and used mbedtls_platform_set_calloc_free to intercept mbedtls allocs. The implementation reserves a contiguous 64kiB for the IDF (for lwip, wifi) and then uses all remaining IDF blocks for split heaps. With this, an https request succeeds, but there can only be one concurrent request and BLE cannot be enabled. The problem is that, in order to keep everything else working (BLE, wifi, lwip), you can only really afford to take the first block anyway, so the benefit of split heap is marginal here.
py/gc: Support multiple heaps (version 2).
Enable the addition of heap space at runtime. Advantages:
- The ESP32 has a fragmented heap so to use all of it the heap must be split.
- Support a dynamic heap while running on an OS, adding more heap when necessary.
Rewritten PR of #3533. The biggest difference is that multiple heaps support can now be disabled (and is disabled by default) to reduce code size. I hope it is also more stable, as I made the changes after looking at how the memory manager actually works.
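To illustrate what adding heap space at runtime means, here is a toy model in Python. It is not how py/gc is implemented (the real collector tracks blocks in an allocation table and can free them). `Area`, `MultiHeap` and `add_area` are hypothetical names, and the allocator is a simple bump allocator that never frees.

```python
class Area:
    """One contiguous region of memory, handed out front to back."""

    def __init__(self, size):
        self.size = size
        self.used = 0

    def alloc(self, n):
        """Return the offset of n bytes in this area, or None if it is full."""
        if self.used + n > self.size:
            return None
        offset = self.used
        self.used += n
        return offset


class MultiHeap:
    """A heap made of several non-contiguous areas."""

    def __init__(self, first_area_size):
        self.areas = [Area(first_area_size)]

    def add_area(self, size):
        """Extend the heap with another region while the program is running."""
        self.areas.append(Area(size))

    def alloc(self, n):
        """Try each area in turn; return (area index, offset)."""
        for index, area in enumerate(self.areas):
            offset = area.alloc(n)
            if offset is not None:
                return index, offset
        raise MemoryError(f"no area can hold {n} bytes")


heap = MultiHeap(1024)
first = heap.alloc(800)   # fits in the initial area
try:
    heap.alloc(400)       # does not fit in what is left of area 0
except MemoryError:
    heap.add_area(2048)   # grow the heap instead of giving up
second = heap.alloc(400)  # now served from the new area
print(first, second)
```

A single allocation still has to fit inside one area, which is why splitting the heap helps with total capacity but not with the largest object that can be allocated.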
With this code, I managed to extend the MicroPython heap to ~200kB on the ESP32:
MicroPython v1.9.3-241-gfbc575cd1-dirty on 2018-01-24; ESP32 module with ESP32
Type "help()" for more information.
>>> import micropython
>>> micropython.mem_info(True)
stack: 752 out of 15360
GC: total: 206976, used: 5200, free: 201776
No. of 1-blocks: 30, 2-blocks: 7, max blk sz: 264, max free sz: 6936
GC memory layout; from 3ffb30a0:
00000: h=AhhBMh=DhhhDBBBBAhh===h===Ahh==h==============================
00400: ================================================================
00800: ================================================================
00c00: ================================================================
01000: =========================================h==Bh=ShShhThAh=h=Bh==B
01400: ..h.h=......h=..................................................
(87 lines all free)
17400: ................................................
GC memory layout; from 3ffe4de0:
(108 lines all free)
1b000: ........................
>>>
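The sizes of the two heap areas can be recovered from the layout dump itself. Each dump line covers 0x400 bytes and shows 64 blocks, so one character stands for 16 bytes. The sketch below is a check on the numbers rather than anything from the patch: the helper name is made up, and the block counts on the last line of each area (48 and 24) are counted by hand from the output above.

```python
BYTES_PER_LINE = 0x400
BLOCKS_PER_LINE = 64
BLOCK_SIZE = BYTES_PER_LINE // BLOCKS_PER_LINE  # 16 bytes per GC block


def area_size(last_line_offset, blocks_on_last_line):
    """Size of one area: offset of its last dump line plus the blocks on it."""
    return last_line_offset + blocks_on_last_line * BLOCK_SIZE


area1 = area_size(0x17400, 48)  # area starting at 3ffb30a0
area2 = area_size(0x1B000, 24)  # area starting at 3ffe4de0
total = area1 + area2

print("area 1:", area1, "bytes")
print("area 2:", area2, "bytes")
print("total:", total, "bytes")
```

The two areas add up to the 206976 bytes reported on the "GC: total" line, and neither area on its own reaches that size.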
This is necessary because by default, the esp32 does not have a contiguous memory area:
I (343) cpu_start: Pro cpu up.
I (344) cpu_start: Single core mode
I (344) heap_init: Initializing. RAM available for dynamic allocation:
I (347) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
I (353) heap_init: At 3FFDCE60 len 000031A0 (12 KiB): DRAM
I (360) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
I (366) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
I (372) heap_init: At 4008FC7C len 00010384 (64 KiB): IRAM
I (379) cpu_start: Pro cpu start user code
I (172) cpu_start: Starting scheduler on PRO CPU.
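The lack of a contiguous area can be read off the boot log by comparing where each region ends with where the next one starts. The sketch below parses the heap_init lines quoted above. The regular expression and variable names are illustrative, and the IRAM-only region is left out on the assumption that it is not suitable for a general data heap.

```python
import re

LOG = """\
I (347) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
I (353) heap_init: At 3FFDCE60 len 000031A0 (12 KiB): DRAM
I (360) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
I (366) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
I (372) heap_init: At 4008FC7C len 00010384 (64 KiB): IRAM
"""

PATTERN = re.compile(r"At ([0-9A-F]{8}) len ([0-9A-F]{8}) \(\d+ KiB\): (\S+)")

# Each region becomes (start address, end address, capability string).
regions = []
for match in PATTERN.finditer(LOG):
    start = int(match.group(1), 16)
    length = int(match.group(2), 16)
    regions.append((start, start + length, match.group(3)))

# Keep only regions usable for data (assumption: everything except plain IRAM).
data_regions = [r for r in regions if r[2] != "IRAM"]

# Gap between the end of one region and the start of the next.
gaps = [nxt[0] - cur[1] for cur, nxt in zip(data_regions, data_regions[1:])]

for start, end, caps in data_regions:
    print(f"{start:08X}-{end:08X} {end - start:6d} bytes {caps}")
print("gaps between regions (bytes):", gaps)
print("largest single region:", max(end - start for start, end, _ in data_regions))
```

Every gap is non-zero, so no two regions can be merged, and the largest single region caps the size of a non-split MicroPython heap.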
I have tested using tests/run-tests and haven't seen a regression (both on unix and esp32).
Image size changes:
| port | change |
|---|---|
| unix | +14 |
| bare-arm | 0 (unchanged) |
| minimal | +20 |
| stm32 | -116 |
| esp8266 | +136 |
| esp32 | +80 (without patch), +432 (with patch adding multiheap support) |
I have tried to keep the image sizes unchanged, but some changed anyway. Most of these ports (with the exception of bare-arm) jumped around a lot during development, so I suspect the differences mostly come from the optimizer behaving inconsistently across ports rather than from the change itself.