ESP32: Threading with 2 or 3 similar threads not properly scheduled
Platform: ESP32 Wroom
MP-Version: MicroPython v1.19.1 on 2022-06-18; ESP32 module with ESP32
I observed that the computation performance of two or three identical threads started in parallel is distributed very unevenly, the worst case being one of three threads getting only about 1 % of the computation time.
Further investigation revealed that there are very long intervals during which a single thread is not scheduled at all. This may be the reason for the first observation.
After a few runs, the test program appended below exhibits the following output:
Threads tmin:         0         0 ms
Threads tmax:      9981        10 ms
Iterations:         957    302461
showing that the first of the two threads was inactive for 9.98 of the 10 seconds total, completing only 957 iterations instead of the roughly 150000 (half of the ~303000 combined total) that fair scheduling would give.
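A quick back-of-the-envelope check of the numbers above (values taken from the run shown; variable names are illustrative):

```python
# With fair scheduling, the combined iteration count of both threads
# would split roughly evenly between them.
total = 957 + 302461     # iterations of thread 0 + thread 1 in the 10 s run
fair_share = total // 2  # each thread's expected share under fair scheduling
print(fair_share)        # ~150000, versus the 957 actually observed
```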
This problem seems to vanish when four or more similar threads are active; the tmax values are then quite similar, at 20-40 ms.
This is only observed on esp32. Testing on the Pyboard and on the Unix port of MicroPython showed fair scheduling.
The test program:
import _thread
from time import sleep, sleep_ms
from machine import idle, Timer

flag = False
tim0 = Timer(0)
tim0.init(period=20000, mode=Timer.ONE_SHOT, callback=lambda t: print('interrupt 0'))

rmin = [0, 0]
rmax = [0, 0]
ret = [0, 0]

def th_tims(i):
    global rmin, rmax, ret
    while not flag:
        idle()
    tmin = 1_000_000; tmax = 0; n = 0
    t0 = tim0.value()
    t1 = tim0.value()
    td = t1 - t0        # put some computation here,
    t1 = min(td, tmin)  # .. so we don't produce the minimum here
    t1 = max(td, tmin)
    t1 = min(td, tmin)
    t1 = max(td, tmin)
    while flag:
        t1 = tim0.value()  # get weird results without the lock (4, 5, 10 seconds for max)
        td = t1 - t0
        tmin = min(td, tmin)
        tmax = max(td, tmax)
        t0 = t1
        n += 1
    rmin[i] = tmin
    rmax[i] = tmax
    ret[i] = n

_thread.start_new_thread(th_tims, (0,))
_thread.start_new_thread(th_tims, (1,))
flag = True
sleep(10)
flag = False
sleep_ms(50)
print('Threads tmin: {:9d} {:9d} ms'.format(rmin[0], rmin[1]))
print('Threads tmax: {:9d} {:9d} ms'.format(rmax[0], rmax[1]))
print('Iterations:   {:9d} {:9d}'.format(ret[0], ret[1]))
An initial version used machine.ticks_us(); I switched to a hardware timer, suspecting that ticks_us() was not thread-safe.
That suspicion was wrong. It was discussed here.
As a partial workaround, a lock (acquired and released inside the while flag: loop) can be used. The threads then show pauses of up to 0.5 seconds and overall throughput is almost halved, but scheduling is quite fair.
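The workaround can be sketched as follows. This is an illustrative reduction, not the original test program: the hardware timer is dropped, the names (worker, counts, done) are made up, and the run time is shortened so it also runs on CPython's _thread module. The key point is the shared lock taken once per loop iteration, which forces a scheduler interaction on every pass.

```python
import _thread
import time

lock = _thread.allocate_lock()  # shared lock, acquired once per iteration
flag = True
counts = [0, 0]                 # per-thread iteration counters
done = [False, False]           # set when each worker exits its loop

def worker(i):
    while flag:
        with lock:              # acquire/release inside the busy loop
            counts[i] += 1      # stands in for the real computation

_thread.start_new_thread(worker, (0,))
_thread.start_new_thread(worker, (1,))
time.sleep(0.5)                 # let both threads run for a short while
flag = False
time.sleep(0.1)                 # give the workers time to observe the flag
for i in range(2):
    done[i] = True
print('Iterations:', counts[0], counts[1])
```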
esp32: Fix uneven GIL allocation between Python threads.
Summary
Explicitly yield each time a thread mutex is unlocked. Closes #15423.
Key to understanding this bug is that Python threads run at equal RTOS priority; although the ESP-IDF FreeRTOS scheduler (and, I believe, vanilla FreeRTOS) will round-robin equal-priority tasks in the ready state, it makes no similar guarantee for tasks moving between the ready and waiting states.
The pathological case of this bug is a Python thread task that is busy (i.e. never blocks): it will hog the CPU far more than expected, sometimes for an unbounded amount of time, even though it periodically unlocks the GIL to allow another task to run.
Assume T1 is busy and T2 is blocked waiting for the GIL. T1 is executing and hits a condition to yield execution:
1. T1 calls MP_THREAD_GIL_EXIT.
2. FreeRTOS sees T2 is waiting for the GIL and moves it to the ready list (but does not preempt, as T2 has the same priority, so T1 keeps running).
3. T1 immediately calls MP_THREAD_GIL_ENTER and re-takes the GIL.
4. A pre-emptive context switch happens, T2 wakes up, sees the GIL is not available, and goes back on the waiting list for the GIL.
To break this cycle, step 4 must happen before step 3. That is a very narrow window of time, so it may not happen regularly, and quantisation of the timing of the tick interrupt that triggers a context switch may mean it never happens.
Yielding at the end of step 2 maximises the chance for the other task to run.
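The four steps above can be reduced to a toy, deterministic model (plain Python, not ESP-IDF code; simulate, explicit_yield and the step granularity are all invented for illustration). Each loop iteration is one pass through steps 1-4; without an explicit yield after unlocking, the running task re-takes the GIL before the waiter ever gets the CPU:

```python
def simulate(explicit_yield, iterations=100):
    """Model two equal-priority tasks contending for the GIL."""
    running, waiting = "T1", "T2"
    progress = {"T1": 0, "T2": 0}
    for _ in range(iterations):
        # Step 1: the running task calls MP_THREAD_GIL_EXIT.
        # Step 2: the waiter becomes ready but is not preempted (same priority).
        if explicit_yield:
            # The fix: yield right after unlocking, so the ready task runs now.
            running, waiting = waiting, running
        # Step 3: whichever task holds the CPU re-takes the GIL and does work.
        progress[running] += 1
        # Step 4 (without the yield): the tick-driven switch finds the GIL
        # already re-taken, so the waiter goes back to waiting.
    return progress

print(simulate(False))  # {'T1': 100, 'T2': 0} -- pathological starvation
print(simulate(True))   # {'T1': 50, 'T2': 50} -- fair round-robin
```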
Testing
Adds a test case based on the code in the linked bug report. The test fails consistently on esp32 without this fix and passes with it. It also passes on rp2, the only other port where thread tests are currently enabled, and passes when run manually on the unix port.
This PR also includes a commit to enable the thread tests on the esp32 port.
Trade-offs and Alternatives
Could make a port-specific MP_THREAD_GIL_EXIT macro that yields, to avoid the overhead of yielding every time a thread mutex is unlocked. However, almost all of the thread-mutex lock/unlock events in the esp32 port are for the GIL (the only other thread mutexes are the GC mutex and the mutex that protects thread creation and cleanup).
This work was funded through GitHub Sponsors.