Opened 5 years ago
Closed 5 years ago
#2230 closed defect (fixed)
Timer crash in STUN session
Reported by: | nanang | Owned by: | nanang |
---|---|---|---|
Priority: | normal | Milestone: | release-2.10 |
Component: | pjnath | Version: | trunk |
Keywords: | timer-refactoring | Cc: | |
Backport to 1.x milestone: | Backported: | no |
Description
Reported that there is a time gap in function pj_timer_heap_poll() after the unlock_timer_heap(ht) that pj_stun_session can be destroyed and crash inside on_cache_timeout()
Stack trace:
Exception thrown: read access violation. tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred on_cache_timeout(pj_timer_heap_t * timer_heap, pj_timer_entry * entry) Line 226 at pjnath\src\pjnath\stun_session.c(226) pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * next_delay) Line 651 at pjlib\src\pj\timer.c(651)
Analysis:
This should be the crashing line: sess = tdata->sess; which indicates that tdata pool was already released when on_cache_timeout() was invoked. Who destroyed the tdata? It should not be STUN session destroy because the timer is scheduled using sess->grp_lock. So far, the (only) suspect is app, i.e: via pj_stun_msg_destroy_tdata(). So, created a patch that introduces tdata->grp_lock which is used in scheduling tdata cache timeout timer (was scheduled using sess->grp_lock). However, it still crashed on different location:
Exception thrown: read access violation. atomic_var was 0xFFFFFFFFFFFFFFFF. occurred pj_atomic_dec_and_get(pj_atomic_t * atomic_var) Line 800 at pjlib\src\pj\os_core_win32.c(800) grp_lock_dec_ref(pj_grp_lock_t * glock) Line 554 at pjlib\src\pj\lock.c(554) pj_grp_lock_dec_ref(pj_grp_lock_t * glock) Line 647 at pjlib\src\pj\lock.c(647) pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * next_delay) Line 654 at pjlib\src\pj\timer.c(654)
Analysis:
The call stack indicates that tdata had been destroyed after timer callback on_cache_timeout() returned (might be destroyed by the callback itself or by other thread). For STUN response scenario (the only scenario that triggers cache timer scheduling), delete_tdata() should only be invoked by either on_cache_timeout() callback or STUN session destroy. And in this case, it seems that both invoked delete_tdata() in a race scenario which destroyed tdata prematurely.
The initial patch is then updated with additional ammo:
- also add ref STUN session when scheduling cache so STUN session is guaranteed to be still valid in the callback,
- in the on_cache_timeout(), destroy_tdata() is done inside STUN session lock, because it updates sess->cached_response_list.
Thanks Hugo Sobral for the report.
Change History (1)
comment:1 Changed 5 years ago by nanang
- Owner set to nanang
- Resolution set to fixed
- Status changed from new to closed
In 6069: