ESP32: Threading with 2 or 3 similar threads not properly scheduled
Platform: ESP32 Wroom
MP-Version: MicroPython v1.19.1 on 2022-06-18; ESP32 module with ESP32
I observed that the computation performance of two or three identical threads started in parallel is distributed very unevenly, the worst case being one of three threads getting only about 1 % of the computation time.
Further investigation revealed that there are very long intervals during which a single thread is not scheduled at all. This may be the reason for the first observation.
After a few runs, the test program appended below exhibits the following output:
Threads tmin:         0         0 ms
Threads tmax:      9981        10 ms
Iterations:         957    302461
showing that the first of the two threads was inactive for 9.98 of the 10 seconds total, completing only 957 iterations instead of the roughly 150000 (half of the ~303000 combined total) that fair scheduling would give.
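A quick back-of-the-envelope check of the numbers above (values taken from the run shown; variable names are illustrative):

```python
# With fair scheduling, the combined iteration count of both threads
# would split roughly evenly between them.
total = 957 + 302461     # iterations of thread 0 + thread 1 in the 10 s run
fair_share = total // 2  # each thread's expected share under fair scheduling
print(fair_share)        # ~150000, versus the 957 actually observed
```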
This problem seems to vanish when four or more similar threads are active; the tmax values are then quite similar, at 20-40 ms.
This is only observed on esp32. Testing on the Pyboard and on the Unix port of MicroPython showed fair scheduling.
The test program:
import _thread
from time import sleep, sleep_ms
from machine import idle, Timer

flag = False
tim0 = Timer(0)
tim0.init(period=20000, mode=Timer.ONE_SHOT, callback=lambda t: print('interrupt 0'))

rmin = [0, 0]
rmax = [0, 0]
ret = [0, 0]

def th_tims(i):
    global rmin, rmax, ret
    while not flag:
        idle()
    tmin = 1_000_000; tmax = 0; n = 0
    t0 = tim0.value()
    t1 = tim0.value()
    td = t1 - t0        # put some computation here,
    t1 = min(td, tmin)  # .. so we don't produce the minimum here
    t1 = max(td, tmin)
    t1 = min(td, tmin)
    t1 = max(td, tmin)
    while flag:
        t1 = tim0.value()  # get weird results without the lock (4, 5, 10 seconds for max)
        td = t1 - t0
        tmin = min(td, tmin)
        tmax = max(td, tmax)
        t0 = t1
        n += 1
    rmin[i] = tmin
    rmax[i] = tmax
    ret[i] = n

_thread.start_new_thread(th_tims, (0,))
_thread.start_new_thread(th_tims, (1,))
flag = True
sleep(10)
flag = False
sleep_ms(50)
print('Threads tmin: {:9d} {:9d} ms'.format(rmin[0], rmin[1]))
print('Threads tmax: {:9d} {:9d} ms'.format(rmax[0], rmax[1]))
print('Iterations:   {:9d} {:9d}'.format(ret[0], ret[1]))
An initial version used machine.ticks_us(); I switched to a hardware timer, suspecting that ticks_us() was not thread-safe.
That suspicion was wrong. It was discussed here.
As a partial workaround, a lock (acquired and released inside the while flag: loop) can be used. The threads then show pauses of up to 0.5 seconds and overall throughput is almost halved, but scheduling is quite fair.
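The workaround can be sketched as follows. This is an illustrative reduction, not the original test program: the hardware timer is dropped, the names (worker, counts, done) are made up, and the run time is shortened so it also runs on CPython's _thread module. The key point is the shared lock taken once per loop iteration, which forces a scheduler interaction on every pass.

```python
import _thread
import time

lock = _thread.allocate_lock()  # shared lock, acquired once per iteration
flag = True
counts = [0, 0]                 # per-thread iteration counters
done = [False, False]           # set when each worker exits its loop

def worker(i):
    while flag:
        with lock:              # acquire/release inside the busy loop
            counts[i] += 1      # stands in for the real computation

_thread.start_new_thread(worker, (0,))
_thread.start_new_thread(worker, (1,))
time.sleep(0.5)                 # let both threads run for a short while
flag = False
time.sleep(0.1)                 # give the workers time to observe the flag
for i in range(2):
    done[i] = True
print('Iterations:', counts[0], counts[1])
```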
esp32: Fix uneven GIL allocation between Python threads.
Summary
Explicitly yield each time a thread mutex is unlocked. Closes #15423.
Key to understanding this bug is that Python threads run at equal RTOS priority; although the ESP-IDF FreeRTOS scheduler (and, I believe, vanilla FreeRTOS) will round-robin equal-priority tasks in the ready state, it makes no similar guarantee for tasks moving between the ready and waiting states.
The pathological case of this bug is a Python thread task that is busy (i.e. never blocks): it will hog the CPU far more than expected, sometimes for an unbounded amount of time, even though it periodically unlocks the GIL to allow another task to run.
Assume T1 is busy and T2 is blocked waiting for the GIL. T1 is executing and hits a condition to yield execution:
1. T1 calls MP_THREAD_GIL_EXIT.
2. FreeRTOS sees T2 is waiting for the GIL and moves it to the ready list (but does not preempt, as T2 has the same priority, so T1 keeps running).
3. T1 immediately calls MP_THREAD_GIL_ENTER and re-takes the GIL.
4. A pre-emptive context switch happens, T2 wakes up, sees the GIL is not available, and goes back on the waiting list for the GIL.
To break this cycle, step 4 must happen before step 3. That is a very narrow window of time, so it may not happen regularly, and quantisation of the timing of the tick interrupt that triggers a context switch may mean it never happens.
Yielding at the end of step 2 maximises the chance for the other task to run.
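The four steps above can be reduced to a toy, deterministic model (plain Python, not ESP-IDF code; simulate, explicit_yield and the step granularity are all invented for illustration). Each loop iteration is one pass through steps 1-4; without an explicit yield after unlocking, the running task re-takes the GIL before the waiter ever gets the CPU:

```python
def simulate(explicit_yield, iterations=100):
    """Model two equal-priority tasks contending for the GIL."""
    running, waiting = "T1", "T2"
    progress = {"T1": 0, "T2": 0}
    for _ in range(iterations):
        # Step 1: the running task calls MP_THREAD_GIL_EXIT.
        # Step 2: the waiter becomes ready but is not preempted (same priority).
        if explicit_yield:
            # The fix: yield right after unlocking, so the ready task runs now.
            running, waiting = waiting, running
        # Step 3: whichever task holds the CPU re-takes the GIL and does work.
        progress[running] += 1
        # Step 4 (without the yield): the tick-driven switch finds the GIL
        # already re-taken, so the waiter goes back to waiting.
    return progress

print(simulate(False))  # {'T1': 100, 'T2': 0} -- pathological starvation
print(simulate(True))   # {'T1': 50, 'T2': 50} -- fair round-robin
```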
Testing
Adds a test case based on the code in the linked bug report. The test fails consistently on esp32 without this fix and passes with it. It also passes on rp2, the only other port where thread tests are currently enabled, and passes when run manually on the unix port.
This PR also includes a commit to enable the thread tests on the esp32 port.
Trade-offs and Alternatives
Could make a port-specific MP_THREAD_GIL_EXIT macro that yields, to avoid the overhead of yielding every time a thread mutex is unlocked. However, almost all of the thread-mutex lock/unlock events in the esp32 port are for the GIL (the only other thread mutexes are the GC mutex and the mutex that protects thread creation and cleanup).
This work was funded through GitHub Sponsors.