Opened 2 weeks ago

Closed 2 weeks ago

Last modified 10 days ago

#2225 closed enhancement (fixed)

Timer heap refactoring

Reported by: ming Owned by: ming
Priority: normal Milestone: release-2.10
Component: pjlib Version: trunk
Keywords: timer-refactoring Cc:
Backport to 1.x milestone: Backported: no

Description

Some timer crash reports that are difficult to reproduce and analyse from log & callstack. Here are typical reported callstacks:

#0  0x00007f7ae3dc0249 in copy_node (ht=0x184cbf0, slot=2305, moved_node=0x7f797732f0c0) at ../src/pj/timer.c:137
#1  0x00007f7ae3dc0677 in reheap_up (ht=0x184cbf0, moved_node=0x7f780e0f2990, slot=2305, parent=1152) at ../src/pj/timer.c:208
#2  0x00007f7ae3dc0882 in remove_node (ht=0x184cbf0, slot=2305) at ../src/pj/timer.c:254
        parent = 1152
        moved_node = 0x7f780e0f2990
        removed_node = 0x7f797f4a9e08
#3  0x00007f7ae3dc0b64 in cancel (ht=0x184cbf0, entry=0x7f797f4a9e08, flags=1) at ../src/pj/timer.c:353
        timer_node_slot = 2305
#4  0x00007f7ae3dc1061 in cancel_timer (ht=0x184cbf0, entry=0x7f797f4a9e08, flags=0, id_val=0) at ../src/pj/timer.c:591
        count = 32634
#5  0x00007f7ae3dc10ea in pj_timer_heap_cancel (ht=0x184cbf0, entry=0x7f797f4a9e08) at ../src/pj/timer.c:611
#6  0x00007f7ae3d0f0c8 in pjsip_endpt_cancel_timer (endpt=0x184c908, entry=0x7f797f4a9e08) at ../src/pjsip/sip_endpoint.c:848
#0  0x00007f543c47dc2a in copy_node (ht=0xbdc8f0, slot=0, moved_node=0x7f51653ca820) at ../src/pj/timer.c:137
#1  0x00007f543c47e146 in remove_node (ht=0xbdc8f0, slot=0) at ../src/pj/timer.c:245
        parent = 139999765322679
        moved_node = 0x7f51653ca820
        removed_node = 0x7f50e53f7e30
#2  0x00007f543c47eb97 in pj_timer_heap_poll (ht=0xbdc8f0, next_delay=0x7f5123517cd0) at ../src/pj/timer.c:643
        node = 0x7f50fe3fad70
        grp_lock = 0x7f52d8e53ba8
        now = {sec = 12628573, msec = 998}
        count = 1

As we can see, the callstack does not really tell the owner of the entry or if the entry is still valid or already freed, because the entry is allocated by application. Moreover, timer structure is public, including its internal states such as _timer_id, _timer_value, _grp_lock, so application may access those internal states directly, which is potentially harmful.

So the goal of this ticket are to:

  • Refactor current implementation:
    • Maintain duplicates of entry timer instances and hide the internal states, except _timer_id. Tthis behavior is configurable via the compile-time setting PJ_TIMER_HEAP_USE_COPY (default enabled).
    • Set timer debug (PJ_TIMER_DEBUG) to on by default.
    • Provide some protection against crash due to accidental timer entry modification or premature deallocation (to get the protection, PJ_TIMER_HEAP_USE_COPY must be enabled to allow timer heap to check against the copy if the timer entry has changed. PJ_POOL_RELEASE_WIPE_DATA (default: disabled) is also recommended to be switched on to make sure any freed memory to be wiped out first.)

When the protection detects some changes, PJSIP will print log beginning with "Bug!" such as "Bug! Polling entry 0x7fad11807ea8 from ../src/pjlib-test/timer.c line 646 has been deallocated without being cancelled". The timer heap itself can then resume as per normal, but it is recommended to fix the real bug itself when encountering such message.

  • Update timer stress test (continuation of #2176)
    • Testing race conditions with many worker threads performing random tasks of scheduling, cancelling, and polling.

Change History (3)

comment:1 Changed 2 weeks ago by ming

  • Owner set to ming
  • Resolution set to fixed
  • Status changed from new to closed

In 6058:

Fixed #2225: Timer heap refactoring

comment:2 Changed 2 weeks ago by ming

In 6059:

Re #2225: Fixed incorrect pool to be released in pjlib timer test

comment:3 Changed 10 days ago by nanang

In 6067:

Re #2225: Fixed compile errors in timer on MSVC2005.

Note: See TracTickets for help on using tickets.