ESP32: Threading with 2 or 3 similar threads not properly scheduled
Platform: ESP32 Wroom
MP-Version: MicroPython v1.19.1 on 2022-06-18; ESP32 module with ESP32
I observed that the computation performance of 2 or 3 identical threads started in parallel is very unevenly distributed, the worst case being one of 3 threads getting only 1 % of computation power.
Further investigation revealed that there are very long durations during which a single thread is not activated. This may be the reason for the first observation.
A test program appended below after a few runs exhibits the following output,
Threads tmin: 0 0 ms
Threads tmax: 9981 10 ms
Iterations: 957 302461
showing that the first of 2 threads is not active in 9.98 seconds of 10 seconds total, resulting in only 957 iterations instead of about 150000 which would be expected in fair scheduling.
This problem seems to vanish if there are 4 or more similar threads active, then the tmax take quite similar values of 20-40 ms.
This is only observed on esp32. Testing of Pyboard and Unix port of MP revealed fair scheduling.
The test program:
import _thread
from time import sleep, sleep_ms
from machine import idle, Timer
flag = False
tim0 = Timer(0)
tim0.init(period=20000, mode=Timer.ONE_SHOT, callback=lambda t:print('interrupt 0'))
rmin = [0, 0]
rmax = [0, 0]
ret = [0, 0]
def th_tims(i):
global rmin, rmax, ret
while not flag:
idle()
tmin = 1_000_000; tmax = 0; n = 0
t0 = tim0.value()
t1 = tim0.value()
td = t1 - t0 # put some computation here,
t1 = min(td, tmin) # .. so we don't produce the minimum here
t1 = max(td, tmin)
t1 = min(td, tmin)
t1 = max(td, tmin)
while flag:
t1 = tim0.value() # get weird results without the lock (4, 5, 10 seconds for max)
td = t1 - t0
tmin = min(td, tmin)
tmax = max(td, tmax)
t0 = t1
n +=1
rmin[i] = tmin
rmax[i] = tmax
ret[i] = n
_thread.start_new_thread(th_tims, (0,))
_thread.start_new_thread(th_tims, (1,))
flag = True
sleep(10)
flag = False
sleep_ms(50)
print('Threads tmin: {:9d} {:9d} ms'.format(rmin[0], rmin[1]))
print('Threads tmax: {:9d} {:9d} ms'.format(rmax[0], rmax[1]))
print('Iterations: {:9d} {:9d}'.format(ret[0], ret[1]))
From an initial version with machine.ticks_us() I switched to hardware timer, suspecting that ticks_us() was not thread-safe.
This is wrong. It was discussed here.
As a partial workaround the introduction of a lock (aquired and released in the while flag: loop) could be used. The threads then have breaks of up to 0.5 seconds and overall performance is almost halved, but scheduling is quite fair.
RFC ESP32 application performance relative to STM
tl;dr There is evidence that application performance of ESP32 is substantially below that of the Pyboard 1.1.
This was prompted by two observations. In this issue users noted that on ESP32 the GPS demo failed at message rates above 2Hz, whereas on a Pyboard 1.1 it was tested at 10Hz. Secondly lcd160cr-gui produces snappy screen changes on a Pyboard, but is visibly sluggish on an ESP32.
I wrote a script (listing below) which launches a metrics task which does nothing except maintain statistics of how frequently it is scheduled. Every 10s it issues a report to the REPL which includes these figures:
- Maximum, minimum and average interval between the task being scheduled.
- Mean frequency with which the task is scheduled.
- Free RAM.
This demo script was modified to launch the metrics task. While the demo ran, I operated the touch GUI. The reports with the worst value of maximum interval are reproduced below:
Pyboard 1.1
Max 72148μs Min 273μs Avg 1955μs No. of calls 5113 Freq 511
mem free 34928
ESP32
Max 1460920μs Min 1102μs Avg 6363μs No. of calls 1571 Freq 157
mem free 43152
The worst case on ESP32 is ~20x slower, the mean scheduling rate (measured by average time or frequency) is consistently about 3x slower on ESP32. In both cases over many measurements free RAM was similar, varying from ~35-47K as I operated the GUI with the ESP32 being generally rather better than the Pyboard.
I would be interested to hear any views on why this is the case. Neither the GUI nor the GPS applications use interrupts or timers: they are pure uasyncio V3 code using polling (touch GUI) and stream I/O (GPS). Apart from hardware setup, code is identical on the two platforms.
Running metrics.py as a stand-alone benchmark produced this outcome (very consistent between reports):
ESP32
Max 5489μs Min 2061μs Avg 2232μs No. of calls 4479 Freq 447
mem free 106528
Pyboard 1.1
Max 480μs Min 449μs Avg 478μs No. of calls 20901 Freq 2090
mem free 98192
metrics.py
import uasyncio as asyncio
import gc
from utime import ticks_us, ticks_diff
def metrics():
ncalls = 0
max_d = 0
min_d = 100_000_000
tot_d = 0
st = 'Max {}μs Min {}μs Avg {}μs No. of calls {} Freq {}'
async def func():
nonlocal ncalls, max_d, min_d, tot_d
while True:
tstart = ticks_us()
t_last = None
while ticks_diff(t := ticks_us(), tstart) < 10_000_000:
await asyncio.sleep(0)
if ncalls:
dt = ticks_diff(t, t_last)
max_d = max(max_d, dt)
min_d = min(min_d, dt)
tot_d += dt
ncalls += 1
t_last = t
print(st.format(max_d, min_d, tot_d//ncalls, ncalls, ncalls//10))
gc.collect()
print('mem free', gc.mem_free())
ncalls = 0
max_d = 0
min_d = 100_000_000
tot_d = 0
return func
# Example of call
async def main():
asyncio.create_task(metrics()()) # Note the syntax
while True:
await asyncio.sleep(0)
# asyncio.run(main())