Opened 7 days ago

Closed 7 days ago

#2230 closed defect (fixed)

Timer crash in STUN session

Reported by: nanang Owned by: nanang
Priority: normal Milestone: release-2.10
Component: pjnath Version: trunk
Keywords: timer-refactoring Cc:
Backport to 1.x milestone: Backported: no

Description

Reported that there is a time gap in function pj_timer_heap_poll() after the unlock_timer_heap(ht) that pj_stun_session can be destroyed and crash inside on_cache_timeout()

Stack trace:

Exception thrown: read access violation.
   tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred
on_cache_timeout(pj_timer_heap_t * timer_heap, pj_timer_entry * entry) Line 226
   at pjnath\src\pjnath\stun_session.c(226)
pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * next_delay) Line 651
   at pjlib\src\pj\timer.c(651)

Analysis:
This should be the crashing line: sess = tdata->sess; which indicates that tdata pool was already released when on_cache_timeout() was invoked. Who destroyed the tdata? It should not be STUN session destroy because the timer is scheduled using sess->grp_lock. So far, the (only) suspect is app, i.e: via pj_stun_msg_destroy_tdata(). So, created a patch that introduces tdata->grp_lock which is used in scheduling tdata cache timeout timer (was scheduled using sess->grp_lock). However, it still crashed on different location:

Exception thrown: read access violation.
   atomic_var was 0xFFFFFFFFFFFFFFFF. occurred
pj_atomic_dec_and_get(pj_atomic_t * atomic_var) Line 800
   at pjlib\src\pj\os_core_win32.c(800)
grp_lock_dec_ref(pj_grp_lock_t * glock) Line 554
   at pjlib\src\pj\lock.c(554)
pj_grp_lock_dec_ref(pj_grp_lock_t * glock) Line 647
   at pjlib\src\pj\lock.c(647)
pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * next_delay) Line 654
   at pjlib\src\pj\timer.c(654)

Analysis:
The call stack indicates that tdata had been destroyed after timer callback on_cache_timeout() returned (might be destroyed by the callback itself or by other thread). For STUN response scenario (the only scenario that triggers cache timer scheduling), delete_tdata() should only be invoked by either on_cache_timeout() callback or STUN session destroy. And in this case, it seems that both invoked delete_tdata() in a race scenario which destroyed tdata prematurely.

The initial patch is then updated with additional ammo:

  • also add ref STUN session when scheduling cache so STUN session is guaranteed to be still valid in the callback,
  • in the on_cache_timeout(), destroy_tdata() is done inside STUN session lock, because it updates sess->cached_response_list.

Thanks Hugo Sobral for the report.

Change History (1)

comment:1 Changed 7 days ago by nanang

  • Owner set to nanang
  • Resolution set to fixed
  • Status changed from new to closed

In 6069:

Fix #2230: Fixed crash in STUN session due to race condition which leads to premature tdata destroy.

Note: See TracTickets for help on using tickets.