QUERY · ISSUE

ESP32: Threading with 2 or 3 similar threads not properly scheduled

openby rkompassopened 2022-11-26updated 2023-01-06

bugport-esp32

Platform: ESP32 Wroom
MP-Version: MicroPython v1.19.1 on 2022-06-18; ESP32 module with ESP32

I observed that the computation performance of 2 or 3 identical threads started in parallel is very unevenly distributed, the worst case being one of 3 threads getting only 1 % of computation power.
Further investigation revealed that there are very long durations during which a single thread is not activated. This may be the reason for the first observation.
A test program appended below after a few runs exhibits the following output,

Threads tmin:         0         0 ms
Threads tmax:      9981        10 ms
Iterations:         957    302461

showing that the first of 2 threads is not active in 9.98 seconds of 10 seconds total, resulting in only 957 iterations instead of about 150000 which would be expected in fair scheduling.
This problem seems to vanish if there are 4 or more similar threads active, then the tmax take quite similar values of 20-40 ms.
This is only observed on esp32. Testing of Pyboard and Unix port of MP revealed fair scheduling.

The test program:

import _thread
from time import sleep, sleep_ms
from machine import idle, Timer

flag = False
tim0 = Timer(0)
tim0.init(period=20000, mode=Timer.ONE_SHOT, callback=lambda t:print('interrupt 0'))
rmin = [0, 0]
rmax = [0, 0]
ret = [0, 0]

def th_tims(i):
    global rmin, rmax, ret
    while not flag:
        idle()
    tmin = 1_000_000; tmax = 0; n = 0
    t0 = tim0.value()
    t1 = tim0.value()
    td = t1 - t0        #   put some computation here, 
    t1 = min(td, tmin)  # .. so we don't produce the minimum here
    t1 = max(td, tmin)
    t1 = min(td, tmin)
    t1 = max(td, tmin)
    while flag:
        t1 = tim0.value()    # get weird results without the lock (4, 5, 10 seconds for max) 
        td = t1 - t0
        tmin = min(td, tmin)
        tmax = max(td, tmax)
        t0 = t1
        n +=1
    rmin[i] = tmin
    rmax[i] = tmax
    ret[i] = n

_thread.start_new_thread(th_tims, (0,))
_thread.start_new_thread(th_tims, (1,))

flag = True
sleep(10)
flag = False
sleep_ms(50)
print('Threads tmin: {:9d} {:9d} ms'.format(rmin[0], rmin[1]))
print('Threads tmax: {:9d} {:9d} ms'.format(rmax[0], rmax[1]))
print('Iterations:   {:9d} {:9d}'.format(ret[0], ret[1]))

From an initial version with machine.ticks_us() I switched to hardware timer, suspecting that ticks_us() was not thread-safe.
This is wrong. It was discussed here.

As a partial workaround the introduction of a lock (aquired and released in the while flag: loop) could be used. The threads then have breaks of up to 0.5 seconds and overall performance is almost halved, but scheduling is quite fair.

CANDIDATE · ISSUE

RFC ESP32 application performance relative to STM

closedby peterhinchopened 2020-12-22updated 2024-09-14

needs-info

tl;dr There is evidence that application performance of ESP32 is substantially below that of the Pyboard 1.1.

This was prompted by two observations. In this issue users noted that on ESP32 the GPS demo failed at message rates above 2Hz, whereas on a Pyboard 1.1 it was tested at 10Hz. Secondly lcd160cr-gui produces snappy screen changes on a Pyboard, but is visibly sluggish on an ESP32.

I wrote a script (listing below) which launches a metrics task which does nothing except maintain statistics of how frequently it is scheduled. Every 10s it issues a report to the REPL which includes these figures:

Maximum, minimum and average interval between the task being scheduled.
Mean frequency with which the task is scheduled.
Free RAM.

This demo script was modified to launch the metrics task. While the demo ran, I operated the touch GUI. The reports with the worst value of maximum interval are reproduced below:

Pyboard 1.1
Max 72148μs Min 273μs Avg 1955μs No. of calls 5113 Freq 511
mem free 34928
ESP32
Max 1460920μs Min 1102μs Avg 6363μs No. of calls 1571 Freq 157
mem free 43152

The worst case on ESP32 is ~20x slower, the mean scheduling rate (measured by average time or frequency) is consistently about 3x slower on ESP32. In both cases over many measurements free RAM was similar, varying from ~35-47K as I operated the GUI with the ESP32 being generally rather better than the Pyboard.

I would be interested to hear any views on why this is the case. Neither the GUI nor the GPS applications use interrupts or timers: they are pure uasyncio V3 code using polling (touch GUI) and stream I/O (GPS). Apart from hardware setup, code is identical on the two platforms.

Running metrics.py as a stand-alone benchmark produced this outcome (very consistent between reports):

ESP32
Max 5489μs Min 2061μs Avg 2232μs No. of calls 4479 Freq 447
mem free 106528
Pyboard 1.1
Max 480μs Min 449μs Avg 478μs No. of calls 20901 Freq 2090
mem free 98192

metrics.py

import uasyncio as asyncio
import gc
from utime import ticks_us, ticks_diff

def metrics():
    ncalls = 0
    max_d = 0
    min_d = 100_000_000
    tot_d = 0
    st = 'Max {}μs Min {}μs Avg {}μs No. of calls {} Freq {}'
    async def func():
        nonlocal ncalls, max_d, min_d, tot_d
        while True:
            tstart = ticks_us()
            t_last = None
            while ticks_diff(t := ticks_us(), tstart) < 10_000_000:
                await asyncio.sleep(0)
                if ncalls:
                    dt = ticks_diff(t, t_last)
                    max_d = max(max_d, dt)
                    min_d = min(min_d, dt)
                    tot_d += dt
                ncalls += 1
                t_last = t
            print(st.format(max_d, min_d, tot_d//ncalls, ncalls, ncalls//10))
            gc.collect()
            print('mem free', gc.mem_free())
            ncalls = 0
            max_d = 0
            min_d = 100_000_000
            tot_d = 0
    return func

# Example of call
async def main():
    asyncio.create_task(metrics()())  # Note the syntax
    while True:
        await asyncio.sleep(0)

# asyncio.run(main())

ESP32: Threading with 2 or 3 similar threads not properly scheduled

RFC ESP32 application performance relative to STM

Keyboard