[ESP32 port] network_ppp does not manage pcb properly; can cause LWIP assertion or unmanaged pcb
In handling of active(false), network_ppp.c ignores the return value of pppapi_free() and always sets self->pcb to NULL. However, pppapi_free() can fail if the PPP link is not currently dead, in which case the PCB remains allocated.
The implication of this is:
- The PCB is now freed and yet network_ppp.c no longer has a reference to it. The PCB is essentially 'leaked' at this point.
- Subsequent attempts to call active(true) will cause an LWIP assertion (which leads to a panic) due to the fact that the network interface we are trying to add already exists (due to the previously-unfreed pcb).
- Also note that since the pcb didn't get freed, it's possible that the ppp_status_cb() function will be called by LWIP. If it does, the fact that self->pcb is set to NULL can cause an Access Violation (or Load Prohibited) since the ppp_status_cb() function assumes self->pcb is non-NULL.
The scenario above is exacerbated by the fact that:
- network_ppp.c has a 4000 ms timeout for the pppapi_close() call. However, if pppapi_close() was called while the LCP finite state machine had not yet set the phase of the PPP to DEAD, then LCP layer will try to send 2 TERM_REQs to its peer.
- If the peer is not answering, both of these TERM_REQs will timeout. In the version of LWIP, the timeout is configured for 6 seconds per attempt, for a total of 12 seconds. Therefore the LCP layer will require more than the 4 (or 4000ms) given by network_ppp.c.
- As a result, active(false) will timeout on pppapi_close(). Furthermore its call to pppapi_free() will fail and this failure will trigger the initial problem described above where network_ppp.c has NULLed its reference to the PCB eventhough the PCB hasn't been freed successfully in LWIP.
esp32/network_ppp: Bugfixes for deadlocks and crashes when disconnecting.
Summary
A follow-up to #17138 in which I reworked the ESP32 PPP implementation to be closer to the extmod implementation.
In that PR it was mentioned that:
PPP is quite hard to get right and modems can be very finicky. So it's great now that the code running on esp32 boards is also (indirectly) testing the extmod code, and vice versa
Well, I found three issues on the ESP32. Two of them are specific to the ESP32 and it's use of a thread-safe API, and I think one of them (the PCB cleanup) also is probably also wrong in the extmod version.
The three fixes (each a separate commit):
Use thread-safe API for PPPoS input
The ESP32 port uses the thread-safe API, but in the previous PR the PPP input function was accidentally changed to use the non-safe API. It happens to work fine, but the correct way is to use the thread-safe API as we do elsewhere in the implementation (and did before this change was accidentally introduced).
(extmod doesn't use the thread-safe API so isn't affected.)
Use non-thread-safe API inside status callback
The status callback runs on the lwIP tcpip_thread, and thus on the ESP32 we must use the non-thread-safe API because the thread-safe API would cause a deadlock (because it would wait on that same tcpip_thread to first finish executing the status callback).
(extmod doesn't use the thread-safe API so isn't affected, other than a change that doesn't change any functionally but keeps the two files as similar as possible when diffing them.)
Correctly clean up PPP PCB after close
If PPP is still connected, freeing the PCB will fail (see lwIP code here) and thus instead we should trigger a disconnect and wait for the lwIP callback to actually free the PCB.
When PPP is not connected we should check if the freeing failed, warn the user if so, and only mark the connection as inactive if not.
When all this happens during garbage collection the best case is that the PPP connection is already dead, which means the callback will be called immediately and cleanup will happen correctly. The worst case is that the connection is still alive, thus we are unable to free the PCB (lwIP won't let us) and it remains referenced in the netif_list, meaning a use-after-free happens later when lwIP traverses that linked list.
While this change does not fully fix the garbage collection case, on the ESP32 port specifically it does improve how the PPP.active(False) method behaves: It no longer immediately tries to free (and fails), but instead triggers a disconnect and lets the cleanup happen correctly through the status callback. (extmod doesn't have the .active() method.)
Testing
I've so far only tested this on the ESP32 port, by repeatedly connecting/disconnecting/deleting, and checking via GDB that the netif_list did not get corrupted anymore. As for other ports: I did not test them, but am fairly confident the change makes sense; as the linked code from the exact lwIP submodule used by extmod shows the ppp_free function indeed can fail and thus the change to handle it in the callback seems correct.