QUERY · ISSUE
extmod/uasyncio Provide means of scheduling I/O with high priority
This was discussed at length in the past, and prototyped with early versions of uasyncio. There is known demand for it in high-speed UART applications and audio processing. Salient points:
- Enable I/O readiness to be tested on every pass of the scheduler.
- The facility to be available on a per-interface basis. A subset of active interfaces can be selected to run at high priority.
This would improve real-time throughput and reduce buffering requirements.
CANDIDATE · ISSUE
uasyncio scheduling is not fair
If several tasks are scheduled for the same time, uasyncio may execute them out of the order in which they were scheduled. This is because the heap data structure used by uasyncio for scheduling is not itself stable: it does not preserve insertion order for entries with equal keys.
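A minimal sketch of the instability, using CPython's heapq and an illustrative `Entry` class which, like a utimeq entry, compares only on its time field:

```python
import heapq

class Entry:
    """A scheduled task; compares only on time, like a utimeq entry."""
    def __init__(self, time, name):
        self.time = time
        self.name = name
    def __lt__(self, other):
        return self.time < other.time

h = []
for name in ("a", "b", "c", "d", "e"):
    heapq.heappush(h, Entry(0, name))  # all scheduled for the same time

# Pop everything back off: the order is whatever the heap's internal
# sift operations produce, not necessarily the insertion (FIFO) order.
popped = [heapq.heappop(h).name for _ in range(len(h))]
print(popped)
```

With CPython's heapq this prints the five names in a permuted order rather than `a b c d e`, which is exactly the unfairness described above.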
The initial version of uasyncio worked around a compound-key issue in the (u)heapq module by using an extra incrementing field after the scheduled time. At the same time, this implicitly protected against the issue discussed in this ticket, but cost an extra 33% of memory. The field was later optimized out, which made uasyncio vulnerable to this issue.
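That workaround can be sketched with standard heapq (the helper names here are illustrative, not the original uasyncio API): a monotonically incrementing counter as the second field of the key breaks ties between equal times in FIFO order, at the cost of one extra word per entry:

```python
import heapq
import itertools

# Sketch of the old compound-key workaround: (time, seq, task) entries.
# The seq field never repeats, so equal-time entries pop in FIFO order.
_seq = itertools.count()

def schedule(heap, time, task):
    heapq.heappush(heap, (time, next(_seq), task))

def pop_task(heap):
    return heapq.heappop(heap)[2]

heap = []
for name in ("a", "b", "c"):
    schedule(heap, 0, name)  # all scheduled for the same time

order = [pop_task(heap) for _ in range(3)]
print(order)  # FIFO order is now guaranteed
```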
Another way to address this issue would be to switch to a stable priority queue implementation. Unfortunately, such implementations (at least those which don't require extra memory overhead) are noticeably more complex in code, and will likely be slower due to a higher constant factor, even though they maintain the same O(log N) worst-case complexity.
@dpgeorge : So, in addition to all the other in-flight issues, I'd like to get a move on with this one, because uasyncio issues are accumulating and I'd like them resolved in FIFO manner. It's a bit of a sad situation that one of the reasons to make utimeq was to get rid of the extra field, and now it needs to be put back :-(. Let's consider other options. E.g., we could spec out that utime.time_ms() and friends return no more than 24 bits of ticks; then we can pack an extra byte of discriminating field into the same word. But that would let only 256 coroutines be scheduled in one millisecond (read: in one go at the start), which isn't much. Well, we could say 20 bits of ticks, which raises the number of simultaneous coroutines to a more realistic figure, but clamps the timing functions noticeably.
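The packing idea can be sketched in Python (the constants and helper names are illustrative, not an actual utimeq API): 24 bits of ticks plus an 8-bit discriminator share one 32-bit word.

```python
# Hypothetical sketch of the packing discussed above.
TICKS_BITS = 24
SEQ_BITS = 8
TICKS_MASK = (1 << TICKS_BITS) - 1
SEQ_MASK = (1 << SEQ_BITS) - 1

def pack(ticks, seq):
    """Combine a 24-bit tick value and an 8-bit sequence number into one word."""
    return ((ticks & TICKS_MASK) << SEQ_BITS) | (seq & SEQ_MASK)

def unpack(word):
    """Split a packed word back into (ticks, seq)."""
    return word >> SEQ_BITS, word & SEQ_MASK

w = pack(0x123456, 7)
print(hex(w), unpack(w))
```

Comparing packed words orders first by ticks and then by sequence number, which is the tie-break behaviour wanted here; the trade-off, as noted above, is that only 2**8 == 256 entries can share one tick before the sequence field wraps.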
On a positive note, if we add a whole-word counter, it could be used as a "thread id", though I'm not exactly sure how useful such an implementation would be (e.g., would it allow communicating with coroutines, or safely killing one, etc.).
Note that the amount of memory taken by the uasyncio queue is considerable; I recently tweaked it to be 0.5K instead of 1.5K for the 32-bit case, and that already fails some load tests.
I would be very wary of reducing the precision of the ticks functions. With 24 bits, ticks_us only lasts 16 seconds.
One option might be to store the differences in ticks between successive elements in the timeq, and then store the absolute value of the first one separately (in the main timeq object, not the list). The differences should generally take up fewer bits.
Another place to store info is in the lower 2 bits of the callback/arg pointers, since they should always be aligned to a machine word (alignment to a GC block would even give 4 bits in each pointer).
I would still be interested to see alternative implementations of the queue algo that are stable.
Ideally the size of the queue would grow to accommodate the application. Otherwise it's going to be difficult for users to pick the right length. Any events that get lost due to a full queue will break the application, so the length must be chosen to be the maximum over all running time (like stack depth). This means that there'll be a lot of wasted space in the queue because most of the time you won't need the maximum.
Nope, that's not going to work. The idea of utimeq is that there's a "sliding base" for all current elements in the queue, roughly against the head of the list. If we store diffs against the head, then every time the head is popped off, the deltas for every other element in the queue need to be updated.
For the callback, perhaps, but args are generally supposed to be arbitrary objects. So the few bits which could be scavenged there wouldn't be usable for anything but flags.
Me too, and I looked into those when I coded utimeq, but the algorithms behind them appear to be complicated; I couldn't find existing small implementations, so coding it all up, multiplied by the "sliding base" handling, would be quite a big scope creep.
Well, the idea is the opposite (or at least orthogonal): there should be a statically allocated queue which is able to handle a given resource capacity (e.g., 10 networking connections); that capacity should be handled reliably (read: without memory allocation and related risks); and overflowing that capacity should not affect system stability, i.e. the system should still be able to handle the planned capacity after such an overload. Ideally, the planned capacity should be handled even during overload periods, but that's of course (much) harder to achieve. If we don't plan it like that, we're just building a DoS toy, and will never be able to compete with C software (well, doing stuff in C of course doesn't automatically guarantee capacity and recovery, so we can establish a better baseline here).
So, the problem (one of them, the initial one) is that it won't be a "thread id", it will be a "queue id": as entries get popped off the queue and then pushed again (usually, unless the coroutine has ended), this id will change all the time for the same coroutine. So, as is, it won't be usable for anything.
Anyway, I guess we need to fix this issue (because, again, it's just one of several which are starting to pile up) and look into optimizations later. I coded this on the plane; I need to test a bit more and then push it.
Another idea is to pick an element at random from the head: a random choice from among all the elements which have the same timestamp. With a proper (or even just good-enough) random source, this would prevent certain coros from hogging the scheduler.
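A rough sketch of that idea on top of heapq (the function name is hypothetical): every entry sharing the head's timestamp is popped, one is chosen at random, and the rest are pushed back. Note the pop-and-repush of all same-timestamp entries, an O(N) cost in the worst case.

```python
import heapq
import random

def pop_random_ready(heap):
    """Pop a random entry from among all entries with the head's timestamp.

    Entries are (time, task) tuples; the heap must be non-empty.
    """
    head_time = heap[0][0]
    ready = []
    while heap and heap[0][0] == head_time:
        ready.append(heapq.heappop(heap))
    choice = random.choice(ready)
    ready.remove(choice)
    for entry in ready:  # push the non-chosen entries back
        heapq.heappush(heap, entry)
    return choice

h = [(0, "a"), (0, "b"), (1, "c")]
heapq.heapify(h)
print(pop_random_ready(h))  # randomly (0, "a") or (0, "b"), never (1, "c")
```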
That sounds like a pretty difficult goal. For example, you'd need to guarantee that coros that close sockets (or free resources in general) are always added to the queue; or at least, if they aren't pushed, that they are executed regardless.
Yes, it's a difficult goal, and there's a lot left to do on the uasyncio side yet.
This skews the properties of the queue. A naive implementation adds an O(N) factor, and even that adds enough code complexity. A more advanced sub-algorithm, if one exists, is still a function of N, with even more complex code. And it would be a pain to test.
https://github.com/micropython/micropython-lib/commit/48ead94116c6a33d60200cece635e432b3592b35 added a test for fair, deterministic round-robin scheduling. https://github.com/micropython/micropython/commit/830ce74f324d3a384e383e840de4ee229c41ba36 is a (straightforward, as discussed above) change to make it pass.
I'm not closing this ticket; let's see if any follow-up work ensues in the next couple of months. But the only viable idea seems to be to find and implement a stable-queue algorithm.
Closing.