Ticket #329 (closed bug: fixed)

Opened 4 years ago

Last modified 3 years ago

jack segfault in debugmodule

Reported by: bsjones Assigned to:
Priority: major Milestone:
Component: generic Version: FFADO SVN (trunk)
Keywords: Cc: nils
The device the bug applies to:

Description

Experience segmentation fault in fedora 15 jack version 1.9.7, ffado 2.1.0 rev 1913. Also occurs on previous versions of jack 1.9.6-6 in fedora 15

*** glibc detected *** /usr/bin/jackd: free(): invalid pointer: 0x00007ffff7db8fe0 *** ======= Backtrace: ========= /lib64/libc.so.6[0x3dea2787ba] /usr/lib64/libffado.so.2(_ZN18DebugModuleManagerD2Ev+0x8e)[0x7ffff7a6558e]

Seems to be occurring on jack initialisation after unloading jackd_firewire.so. There are no firewire devices on this machine

Bugzilla ticket: https://bugzilla.redhat.com/show_bug.cgi?id=684392

Not sure if this should be posted there as well. Attached full backtrace Smolt: http://www.smolts.org/client/show/pub_912bfc89-4884-45de-ba54-367e4bd53433

Package build info: libffado http://koji.fedoraproject.org/koji/buildinfo?buildID=219120 jack http://koji.fedoraproject.org/koji/buildinfo?buildID=237324

[Thread debugging using libthread_db enabled] jackdmp 1.9.7 Copyright 2001-2005 Paul Davis and others. Copyright 2004-2011 Grame. jackdmp comes with ABSOLUTELY NO WARRANTY This is free software, and you are welcome to redistribute it under certain conditions; see the file COPYING for details [New Thread 0x7ffff77ba700 (LWP 10425)] Cleaning up leftover debug module: DeviceManager? *** glibc detected *** /usr/bin/jackd: free(): invalid pointer: 0x00007ffff7db8fe0 *** ======= Backtrace: ========= /lib64/libc.so.6[0x3dea2787ba] /usr/lib64/libffado.so.2(_ZN18DebugModuleManagerD2Ev+0x8e)[0x7ffff7a6558e] /usr/lib64/libffado.so.2(+0x9d076)[0x7ffff7a59076] /lib64/ld-linux-x86-64.so.2[0x3de9a13b68] /lib64/ld-linux-x86-64.so.2[0x3de9a1462e] /lib64/ld-linux-x86-64.so.2[0x3de9a0e8a6] /lib64/libdl.so.2[0x3deaa0152f] /lib64/libdl.so.2(dlclose+0x1f)[0x3deaa0100f] /usr/lib64/libjackserver.so.0[0x392424bf23] /usr/lib64/libjackserver.so.0[0x392424cfae] /usr/lib64/libjackserver.so.0(jackctl_server_create+0xc1a)[0x3924250aca]

Attachments

jack-backtrace.txt (12.3 kB) - added by bsjones on 04/05/11 15:05:35.
jackd backtrace
ffado-diag.log (6.6 kB) - added by bsjones on 04/05/11 15:06:44.
linus_memcpy_libffado.valgrind (146.2 kB) - added by bsjones on 04/06/11 04:21:31.
no_preload_libffado.valgrind (146.3 kB) - added by bsjones on 04/06/11 04:22:19.
gdb-ffado-patched.txt (10.4 kB) - added by bsjones on 04/07/11 14:37:59.
gdb arfter remove delete
linkit (4.2 kB) - added by adi on 04/14/11 06:22:44.
shellscript to trigger the final linking stage
libffado-debian-ld.so.lzma (1.8 MB) - added by adi on 04/14/11 08:01:46.
libffado.so compiled with FC15 toolchain but linked with Debian's ld
libffado-fedora-ld.so.lzma (1.8 MB) - added by adi on 04/14/11 08:02:22.
libffado.so compiled with FC15 toolchain AND linked with Fedora's ld
ctor-test.cpp (260 bytes) - added by cladisch on 04/14/11 23:54:48.
static constructors test program
libffado-2.1.0-add-link-flags (1.3 kB) - added by bsjones on 04/26/11 00:59:52.
Patch to pass link flags into scons
ffado-linker.patch (457 bytes) - added by adi on 09/30/11 19:50:21.
possible fix

Change History

04/05/11 15:05:35 changed by bsjones

  • attachment jack-backtrace.txt added.

jackd backtrace

04/05/11 15:06:44 changed by bsjones

  • attachment ffado-diag.log added.

04/06/11 00:29:22 changed by adi

This is weird.

I just tried the same on Debian with ffado trunk, and even without an attached device, nothing fails.

Any chance to try again on i386? Or with an older libc? This somehow reminds me of

https://bugzilla.redhat.com/show_bug.cgi?id=638477

hence I'm asking.

04/06/11 04:20:53 changed by bsjones

We have duplicate error reports from x86 users as well. However I'll try an older libc a bit later.

As for the glibc memcpy issue I have tested with Linus's patch in LD_PRELOAD and without and without - still segfaults - valgrind out attached here.

I also tried relocating the memcpy functions to memove as described here but also with the same result. https://bugzilla.redhat.com/attachment.cgi?id=487982

Any other suggestions?

04/06/11 04:21:31 changed by bsjones

  • attachment linus_memcpy_libffado.valgrind added.

04/06/11 04:22:19 changed by bsjones

  • attachment no_preload_libffado.valgrind added.

04/06/11 05:41:26 changed by wagi

Looking at the code, the DebugModuleManager? tries to delete an invalid DebugModul?. So there is a race in the destroy path. Probably concurrent access of the m_debugModules vector....

DebugModuleManager::~DebugModuleManager()
264	{
265	    // cleanin up leftover modules
266	    for ( DebugModuleVectorIterator it = m_debugModules.begin();
267	          it != m_debugModules.end();
268	          ++it )
269	    {
270	        fprintf(stderr,"Cleaning up leftover debug module: %s\n",(*it)->getName().c_str());
271	        m_debugModules.erase( it );
272	        delete *it;
273	    }

Line 272 is the problem.

04/07/11 01:21:31 changed by adi

Hi!

I just copied libffado.so.2 compiled on a Debian system to FC15. Result: works like a charm.

Given that it's the same source, it seems we're talking compiler/toolchain now.

04/07/11 02:18:50 changed by wagi

I think the 'delete *it' is wrong because the debug modules register/unregister themselfs at the manager. So the owner of the modules is not the manager.

The manager is a singleton you don't really know when that thing is destroyed. Hence this code depends on the libc order of cleaning things up. For most installation the debug modules have been removed before the destructor from the manager is called. But for few others...

So the simple solution is just to remove the 'delete *it' and everything is fine. In case my reasoning is wrong we leak a few bytes during the shutdown of the system... do we really care?

04/07/11 14:37:59 changed by bsjones

  • attachment gdb-ffado-patched.txt added.

gdb arfter remove delete

04/07/11 14:52:58 changed by bsjones

Thanks for looking into this.

Patching as suggested gets us along a little further in the DebugModule? destructor but falls over a line before in the erase.

Can we mitigate this some how in JackDriverLoader?.cpp do you think? ie. get all of our descriptors then unload all of them when done?

04/12/11 08:57:04 changed by wagi

Sorry for the late response.

You have to remove the fprintf, since it derefences the pointer too.

Index: debugmodule.cpp
===================================================================
--- debugmodule.cpp     (revision 1979)
+++ debugmodule.cpp     (working copy)
@@ -267,9 +267,7 @@
           it != m_debugModules.end();
           ++it )
     {
-        fprintf(stderr,"Cleaning up leftover debug module: %s\n",(*it)->getName().c_str());
         m_debugModules.erase( it );
-        delete *it;
     }
 
     if (!mb_initialized)

04/12/11 09:33:40 changed by adi

wagi,

I'm afraid to say your suggestions don't work. Even if you remove all code in this loop body, it segfaults at a later stage.

It's so utterly broken that you cannot even debug it, gdb is aborting:

../../gdb/infrun.c:3049: internal-error: handle_inferior_event: Assertion `inf' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) y

If we look at the code in question, it's iterating over a vector. As long as the vector is valid, this cannot possibly go wrong. If the vector is invalid, the question is why it's invalid. (memory corruption elsewhere, race condition, whatsoever)

I found a quick workaround: leave everything in debugmodule.cpp as is and simply remove line 61 in ffado.cpp:

// this is very much nescessary, as otherwise the
// message buffer thread doesn't get killed when the
// library is dlclose()'d

static void exitfunc(void) __attribute__((destructor));

static void exitfunc(void)
{
    delete DebugModuleManager::instance();

}

Given that the library compiled on a Debian system runs fine on FC15, I still claim toolchain issue. It's nothing obvious, I can confirm g++-4.6.1 creates working binaries.

04/12/11 09:50:03 changed by wagi

I see, I'm firing up my build system again and then I can start to dig a bit deeper.

I think the whole debugmodele concept got a bit out of hand. IIRC, Stefan Richter also complained about ffado not reporting any helpful message if used with jack. I completely agree. I'm trying to figure out how we can simplify things.

04/12/11 10:26:47 changed by arnonym

Just 2 cents:

Deleting elements from a container while iterating over it is just asking for trouble.

A better solution will probably be:

while( !m_debugModules.empty() )
{
  delete m_debugModules.takeFirst();
}

Just in case this slipped your mind...

04/12/11 10:32:42 changed by wagi

Of course, that's normally a very bad idea... I have lost all my C++ skills.

04/13/11 01:55:37 changed by janek

This is the standard way erase&delete is done:

Index: debugmodule.cpp
===================================================================
--- debugmodule.cpp	(revision 1979)
+++ debugmodule.cpp	(working copy)
@@ -264,12 +264,11 @@
 {
     // cleanin up leftover modules
     for ( DebugModuleVectorIterator it = m_debugModules.begin();
-          it != m_debugModules.end();
-          ++it )
+          it != m_debugModules.end(); )
     {
         fprintf(stderr,"Cleaning up leftover debug module: %s\n",(*it)->getName().c_str());
-        m_debugModules.erase( it );
         delete *it;
+        it = m_debugModules.erase( it );
     }
 
     if (!mb_initialized)

Note that I haven't compiled the code nor tested.

04/13/11 03:14:43 changed by adi

Still segfaulting in debugmanager.

04/13/11 06:15:49 changed by bsjones

I have tried the following. At first the delete line was uncommented and segfaulted the first time there.

I commented that line out and ran into another invalid pointer in DebogModuleManager::unregisterModule which seems to be called from the ~DebugModule? destructor. Commenting out this call gives us a running jack at the expense of some lost bytes. My knowledge of c++ is fairly rudimentry and I can't say I know what's goingn on here. Its interesting to note from jack's output that ~ebugModuleManager is being called twice.

diff -Nur libffado-2.1.0.orig/src/debugmodule/debugmodule.cpp libffado-2.1.0/src/debugmodule/debugmodule.cpp
--- libffado-2.1.0.orig/src/debugmodule/debugmodule.cpp 2009-12-20 02:12:53.000000000 +1000
+++ libffado-2.1.0/src/debugmodule/debugmodule.cpp      2011-04-13 22:56:05.000000000 +1000
@@ -83,10 +83,10 @@
 //              << " at DebugModuleManager"
 //              << endl;
 //     }
-    if ( !DebugModuleManager::instance()->unregisterModule( *this ) ) {
-        cerr << "Could not unregister DebugModule at DebugModuleManager"
-             << endl;
-    }
+    //if ( !DebugModuleManager::instance()->unregisterModule( *this ) ) {
+    //    cerr << "Could not unregister DebugModule at DebugModuleManager"
+    //         << endl;
+    //}
 
 }
 
@@ -263,13 +263,11 @@
 DebugModuleManager::~DebugModuleManager()
 {
     // cleanin up leftover modules
-    for ( DebugModuleVectorIterator it = m_debugModules.begin();
-          it != m_debugModules.end();
-          ++it )
+    while (!m_debugModules.empty() )
     {
-        fprintf(stderr,"Cleaning up leftover debug module: %s\n",(*it)->getName().c_str());
-        m_debugModules.erase( it );
-        delete *it;
+         fprintf(stderr,"Cleaning up leftover debug module: %s\n",m_debugModules.back()->getName().c_str());
+         // delete m_debugModules.back();
+         m_debugModules.pop_back();
     }
 
     if (!mb_initialized)

04/14/11 05:07:04 changed by adi

Success. Well, sort of. With *unmodified* code, that is, current svn HEAD, I was able to build a working libffado.so on FC16.

All I did was to use Debian's binutils, to be precise, version 2.21.0.20110327-3.

I think it's time to escalate the bug to Fedora's binutils managers, at least I'd value their input.

04/14/11 05:24:45 changed by bsjones

Thanks. Before I hit up the binutils maintainers for comment can you summarise for me what you think is happening? I'm afraid I am out of my depth here.

thanks again, B

04/14/11 06:21:54 changed by adi

Ok. When compiling FFADO under Debian, the resulting library works in both, Debian and Fedora. When compiling FFADO under Fedora, the resulting library neither works in Debian nor Fedora.

So one can assume the code isn't at fault here, but the compiler and its friends create an erroneous binary.

So I replaced Fedora's binutils (containing assemblers, linkers and so on) with the one from Debian, and now Fedora can also create working binaries.

I've narrowed it down to the final linking stage: if I compile everything with Fedora's toolchain and just switch to Debian's ld for creating libffado.so, I get a working binary.

In other words, it seems it's the linker. I cannot draw further conclusions at the moment, it's now up to the ELF experts. If you want to file a bug report, link to this ticket.

I've also attached a small shell script to be run from the source's toplevel directory. That's what I use to overlay the final linker stage with Debian's ld.

04/14/11 06:22:44 changed by adi

  • attachment linkit added.

shellscript to trigger the final linking stage

04/14/11 06:39:59 changed by bsjones

Thanks for the info! Will post back any updates here

04/14/11 08:01:46 changed by adi

  • attachment libffado-debian-ld.so.lzma added.

libffado.so compiled with FC15 toolchain but linked with Debian's ld

04/14/11 08:02:22 changed by adi

  • attachment libffado-fedora-ld.so.lzma added.

libffado.so compiled with FC15 toolchain AND linked with Fedora's ld

04/14/11 08:04:44 changed by adi

JFTR, I've attached two libraries to this ticket, one contains the pure Fedora version (segfaulting) and the other one contains the version which uses Debian's ld (working) for the final linking stage (linking also done on Fedora, just a replaced ld binary and linker scripts).

This might help the binutils guys or ELF experts to look for differences.

04/14/11 18:00:27 changed by bsjones

Performing the last link step with ld.gold also provides a working libffado.so . This was done with revision 1913 in Fedora 15

04/14/11 23:54:48 changed by cladisch

  • attachment ctor-test.cpp added.

static constructors test program

04/14/11 23:54:52 changed by cladisch

If I sort the symbols by address and compare them, I get, among others, this difference:

diff -u libffado-debian-ld-objdump.txt libffado-fedora-ld-objdump.txt
...
 0025f970 l    d  .eh_frame     00000000              .eh_frame
 002a3c04 l     O .eh_frame     00000000              __FRAME_END__
 002a3c08 l    d  .gcc_except_table     00000000              .gcc_except_table
-002b4000 l     O .ctors        00000000              __CTOR_LIST__
-002b4000 l    d  .ctors        00000000              .ctors
-002b41c0 l     O .ctors        00000000              __CTOR_END__
-002b41c4 l     O .dtors        00000000              __DTOR_LIST__
-002b41c4 l    d  .dtors        00000000              .dtors
+002b4000 l    d  .init_array   00000000              .init_array
+002b41bc l    d  .fini_array   00000000              .fini_array
+002b41c0 l     O .ctors        00000000              __CTOR_LIST__
+002b41c0 l    d  .ctors        00000000              .ctors
+002b41c4 l     O .ctors        00000000              __CTOR_END__
+002b41c8 l     O .dtors        00000000              __DTOR_LIST__
+002b41c8 l    d  .dtors        00000000              .dtors
 002b41cc l     O .dtors        00000000              __DTOR_END__
 002b41d0 l     O .jcr  00000000              __JCR_END__
 002b41d0 l     O .jcr  00000000              __JCR_LIST__

Apparently, the Fedora linker moves most constructor/destructor calls of static objects from the .ctors/.dtors to the .init_array/.fini_array sections, where they are probably not found by the C++ runtime.

Please check whether the attached test program fails with the Fedora linker.

04/15/11 00:09:11 changed by cladisch

On further thought, the problem is probably not that the constructors aren't executed, but that their execution order changes: GCC bug 46770.

04/15/11 01:11:05 changed by wagi

Thanks for tracking this down. I figure from that we should fix our code then. This proves why singletons are just an anti pattern :)

04/15/11 14:16:41 changed by bsjones

Rebuilding binutils-2.21.51.0.6 with --enable-initfini-array=no also produces a working libffado.so .

04/26/11 00:59:52 changed by bsjones

  • attachment libffado-2.1.0-add-link-flags added.

Patch to pass link flags into scons

04/26/11 01:04:57 changed by bsjones

To get around this issue we are going to use the ld.gold linker until a resolution can be found from binutils/upstream.

I've modified the scons scripts as per the patch above to accept an alternative path where we point ld to /usr/bin/ld.gold - is this how you'd recommend doing this or is there a more obvious way? Not sure if you want to include something like this in a similar way to COMPILE_FLAGS.

thanks

Brendan

09/03/11 04:42:17 changed by nils

  • cc set to nils.

09/09/11 03:05:38 changed by wagi

FYI, I have done some work on removing the singleton debug module. The code is available here:

http://repo.or.cz/w/ffado.git/shortlog/refs/heads/debug-cleanup

This needs some more work because I have removed quite a few debug levels and features, which nobody seemed to use anyway. But it breaks probably some of the debug messages people expect to see. So we need some coordinated review of the debug messages.

Could someone try someone who is seeing this bug the above version?

Thanks

09/29/11 11:37:14 changed by adi

JFTR, it's no longer possible to hide the problem with binutils-gold as of 2.21.53.20110922-1 (Debian version).

However, the backtraces still differ between binutils-gold and plain binutils.

09/29/11 17:07:43 changed by jwoithe

I have recently upgraded my development PC such that it now includes binutils 2.21.x. Previously (on a system with binutils 2.20...) I was not able to reproduce this bug. If my system now exhibits these issues I intend to have a closer look. It would appear that the binutils version thing may have been a false lead and that instead we're dealing with a memory-corruption type bug. The tricky bit is that it might not even be in the debug module - that might just be the thing which happens to be taken out by the underlying problem. Therefore I think it's important to understandthe cause of this issue if at all possible even though the existing debug module will be replaced with Daniel's rewrite at some point.

09/30/11 05:16:33 changed by jwoithe

I've now recompiled trunk (presently r1995) in the new environment (gcc 4.5.2, binutils 2.21.51.0.6.20110118). The crash in the debug module (or anywhere else for that matter) still does not occur on this system. The behaviour does not change regardless of whether a firewire device is plugged in or not. The lack of consistency suggests to me that we might be looking at a memory corruption or initialisation issue somewhere (which may or may not be in the debug module). Not being able to trigger the bug myself does make it a little difficult to investigate. Hmm.

09/30/11 18:46:35 changed by adi

Weird. In contrast to my i386 Debian unstable system, I cannot reproduce this bug on my amd64 Debian unstable system.

OTOH, this ticket contains crashlogs from an amd64 machine, so it's not an amd64<->i386 thing.

Jonathan, if you like, you can have a login on my i386 machine. Just contact me, if interested.

09/30/11 19:50:21 changed by adi

  • attachment ffado-linker.patch added.

possible fix

09/30/11 19:52:16 changed by adi

I've added ffado-linker.patch to this ticket. Those who like, give it a whirl. It fixes the problem for me, and I can't see no obvious negative impact so far, but broader testing (and subsequent feed-back) would be highly appreciated.

09/30/11 23:47:33 changed by jwoithe

Adi: I'm going to be short of time today and possibly tomorrow but I'll check the effect of the patch out on my box as soon as I can (keeping in mind that to this point I haven't seen the crash). What exactly does the "nodelete" linker switch do? I'll have to look it up.

Even so, what's your feeling as to what this patch is doing - do you think it's fixing the problem, or is it instead working around some other issue? Like you said in comment 31, it's strange that it seems to come and go from system to system, which is why I still wonder about there being some memory corruption happening somewhere in the system. Precisely how this linker option effects any sort of behavioural change is still unclear to me. Time and testing will tell though.

10/04/11 23:35:17 changed by adi

News. "-znodelete" doesn't fix the problem. While it hides the problem when calling libffado2 from jackd, it will still crash with this minimal test program:

echo 'int main() {}' |gcc -x c -lffado - && ./a.out

While debugging, I've noticed that XMLDeserialize calls the DebugModuleManager? *constructor* rather late (I'd say upon termination). This then might or might not collide (race condition) when accessing m_debugModules.

valgrind's first complaint:

Cleaning up leftover debug module: DeviceManager
==14771== Invalid free() / delete / delete[]
==14771==    at 0x4023EAC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==14771==    by 0x41241F1: DebugModule::~DebugModule() (debugmodule.cpp:91)
==14771==    by 0x4124821: DebugModuleManager::~DebugModuleManager() (debugmodule.cpp:294)
==14771==    by 0x4121569: exitfunct (ffado.cpp:61)
==14771==    by 0x400E4C9: _dl_fini (dl-fini.c:244)
==14771==    by 0x42DA4DE: __run_exit_handlers (exit.c:78)
==14771==    by 0x42DA54E: exit (exit.c:100)
==14771==    by 0x42C1E4D: (below main) (libc-start.c:260)
==14771==  Address 0x428941c is 0 bytes inside data symbol "m_debugModule"

(ignore the line numbers for debugmodule.cpp, I've hacked a bit on it)

Anyway, still no useful indication where to go. The best advice I have atm is to drop line 61 in ffado.cpp.

(follow-up: ↓ 37 ) 10/07/11 04:38:58 changed by jwoithe

Hmm. I'll be the first to admit that I'm not up to speed on all the exotic features of C++, but something doesn't quite add up to me here. The exitfunct calls the debug module destructor; fair enough. This iterates over the debug modules which are still associated with the manager to get rid of them. This seemingly deletes the iterator before using it in the "++it" operation as part of the "for" loop. Ordinarily this would count as a dereference after deallocation; are the rules different for C++ iterators - in other words, is this a valid use of iterators in C++?

Then there's the issue of erasing the current item from the iterator. Is that valid?

Having written that, I reckon this discussion has come up before. If it has, the continued existence of this code would suggest that it was decided that it was acceptable.

(follow-up: ↓ 38 ) 10/07/11 14:25:20 changed by dehnhardt

Jonathan,

I tried debugging a little bit today and your remarks seems to be leading in the right direction.
But before some thoughts:

1.) I don't have this error on my machine. To reproduce it, I had to comment out the erase-statement in DebugModuleManager::unregisterModule.
After doing that I always had the error.
The question is: What prevents modules from being unregistered on some setups?
I found out, that - even when commenting out the erase-statement in unregisterModule - the DebugModule? destructor is called. (But I did not find where the Module is deleted)
But it might be a sign that the modules are already deleted and only the reference in the vector is not erased. This would explain why even the getName() is not working...
In this case, erase would be prefectly sufficient in ~DebugModuleManager()

2.) ~DebugModuleManager() is called twice during jack startup!!! This seems weird to me...

Back to the vector:
The increment in the "for loop" is at least unnesessary. Erase seems to position the iterator to the next element. The additional increment will result in deleting only every second element in the vector..., this could cause trouble when the second last element is erased: The iterator is at the last element but is incremented again...

BTW:

In RegisterModule() is a return-statement in the for loop, which makes all the "already_present stuff" useless.

Holger

(in reply to: ↑ 35 ) 10/07/11 16:37:41 changed by cladisch

Replying to jwoithe:
Most changes to a vector invalidate all iterators into this vector. So neither are you allowed to dereference any iterator after calling erase(), nor is it possible to reliably continue the iteration.

Consider a simple vector implementation where its contents are a plain array and the iterator type is a plain pointer. After erasing an element, it will have been overwritten by the next element. Furthermore, after erasing the last element, incrementing the iterator makes it point to the element behind the new end().

(in reply to: ↑ 36 ; follow-up: ↓ 40 ) 10/08/11 13:48:25 changed by dehnhardt

Replying to dehnhardt:

Ok, replying to my own comment;-)

1.) I don't have this error on my machine. To reproduce it, I had to comment out the erase-statement in DebugModuleManager::unregisterModule.

I have overseen, that this is called in the destructor of DebugModule.

This explains, why every operation on this module in ~DebugModuleManager() fails.

BUT this seems to reproduce exactly the behavior of this bug. Could it be, that calling unregisterModule may fail in the DebugModule destructor under some circumstances?

To proove or disprove, could someone who suffer from this bug apply the following patch, start and stop jackd (without any ffado debug level) and post the output?

Holger

PATCH

Index: src/debugmodule/debugmodule.cpp
===================================================================
--- src/debugmodule/debugmodule.cpp     (Revision 1998)
+++ src/debugmodule/debugmodule.cpp     (Arbeitskopie)
@@ -83,6 +83,7 @@
 //              << " at DebugModuleManager"
 //              << endl;
 //     }
+    cerr << "*~* delete DebugModule " << this << endl;
     if ( !DebugModuleManager::instance()->unregisterModule( *this ) ) {
         cerr << "Could not unregister DebugModule at DebugModuleManager"
              << endl;
@@ -265,11 +266,12 @@
     // cleanin up leftover modules
     for ( DebugModuleVectorIterator it = m_debugModules.begin();
           it != m_debugModules.end();
-          ++it )
+           )
     {
-        fprintf(stderr,"Cleaning up leftover debug module: %s\n",(*it)->getName().c_str());
+       cerr << "*~* erase from vector DebugModule " << *it << endl;
+        //fprintf(stderr,"Cleaning up leftover debug module: %s\n",todel->getName().c_str());
         m_debugModules.erase( it );
-        delete *it;
+        //delete *it;
     }
 
     if (!mb_initialized)
@@ -445,7 +447,8 @@
     {
         if ( *it == &debugModule ) {
             already_present=true;
-            return true;
+           break;
+            //return true;
         }
     }
 
@@ -464,7 +467,7 @@
 
     for ( DebugModuleVectorIterator it = m_debugModules.begin();
           it != m_debugModules.end();
-          ++it )
+          it++ )
     {
         if ( *it == &debugModule ) {
             m_debugModules.erase( it );


10/08/11 14:54:01 changed by rsalveti

Just to say that this bug is also happening at Ubuntu Oneiric, as described by bug https://bugs.launchpad.net/ffado/+bug/869420

The current version used by Ubuntu is from svn r1985, built with binutils-gold (2.21.53.20110810).

(in reply to: ↑ 38 ) 10/08/11 15:12:08 changed by rsalveti

Replying to dehnhardt:

Replying to dehnhardt: Ok, replying to my own comment;-)

1.) I don't have this error on my machine. To reproduce it, I had to comment out the erase-statement in DebugModuleManager::unregisterModule.

I have overseen, that this is called in the destructor of DebugModule. This explains, why every operation on this module in ~DebugModuleManager() fails. BUT this seems to reproduce exactly the behavior of this bug. Could it be, that calling unregisterModule may fail in the DebugModule destructor under some circumstances? To proove or disprove, could someone who suffer from this bug apply the following patch, start and stop jackd (without any ffado debug level) and post the output?

Output from jackd when using libffado with your patch, on Ubuntu (on top of r1985): http://paste.ubuntu.com/704633/

10/08/11 20:13:25 changed by jwoithe

I don't have a lot of time right now (I have to go out), but I am really questioning the validity of doing the erase() call in the way it's currently done in the DebugModuleManager? destructor. Clement's comment certainly seems to confirm my gut feeling that it isn't valid code.

Calling unregisterModule() from the DebugModule? destructor isn't a fundamental problem AFAICT, and in fact would seem to be the right thing to do for other reasons. When the DebugModuleManger? destructor runs it calls erase before deleting the individual DebugModules?, so when the unregisterModule/erase call is called from the DebugModules?() the erase won't be called a second time because by that point the module is no longer in the manager.

More to come when I get back online.

(follow-up: ↓ 43 ) 10/09/11 04:08:28 changed by jwoithe

I think what we ought to do is try to clean up these destructors which use iterators to delete objects from lists. The construct under suspicion in the debug module system is used in many other places within FFADO; if it is indeed invalid code then this may go a long way to explaining the random crashes we're still seeing on exit on certain machines.

The typical pattern we're seeing is as follows.

  for (Iterator it = SomeVector.begin(); it != SomeVector.end(); it++) {
    SomeVector.erase(it);
    delete *it;
  }

In other words, this goes through SomeVector? and deletes each element in it. At least that's the intention.

Checking the STL documentation, it seems that the above pattern is wrong - this is consistent with Clement's advice earlier. An iterator erase() method "Removes from the vector container either a single element (position) or a range of elements", and "This invalidates all iterator and references to position (or first) and its subsequent elements".

Since the call to erase() invalidates the iterator, it seems to me that we simply can't use vector iterators to effect cleanup of vectors and their elements. At the very least, after erase() the iterator is invalidated and thus the use of it in the following delete() operation is undefined.

The question is what form this alternative might take. I'm thinking of something like this:

  while (!SomeVector.empty()) {
    SomeVector.pop_back();
  }

The pop_back() method calls the removed object's destructor, so that side of things is taken care of. What I'm not sure of is whether a dynamically allocated object's memory will also be freed by this action. I don't think it is, in which case something like this may be in order:

  while (!SomeVector.empty()) {
    Element e = SomeVector.get_last_element();
    SomeVector.erase_element(e);
    delete e;
  }

The idea here being that we get the last element in the vector, remove it from the vector, then call delete() on it to invoke the destructor and free its memory. Then repeat until the vector is empty.

By way of concrete example, I'm considering something like this in the context of ~DebugModuleManager?() in place of the existing iterator-based deletion of m_debugModules elements:

  while (!m_debugModules.empty()) {
    DebugModule *mod = m_debugModules.back();
    m_debugModules.erase(m_debugModules.begin()+m_debugModules.size()-1);
    delete mod;
  }

I've compile-tested this on my box, but because the error condition is not triggered for me I don't know whether it helps or not.

Note that we can't use m_debugModules.end() as the argument to erase() because end() returns a reference to the "just past the end" element, not the final element itself.

Of course in the light of this whole discussion, it could well be that the debug module segfaults are just the end result of memory corruption caused by one of the other iterator-erase-delete constructs which have executed earlier. Since it seems that the existing construct could do crazy things, any one of these under the correct circumstances could give rise to the problems we're seeing.

If the above suggested construct makes sense to those more familiar with C++ than I, I'll go through and replace all the existing iterator-erase-delete loops with this new version and see if it changes things.

(in reply to: ↑ 42 ; follow-up: ↓ 44 ) 10/09/11 05:36:36 changed by cladisch

Replying to jwoithe:

... The question is what form this alternative might take. I'm thinking of something like this:

  while (!SomeVector.empty()) {
    SomeVector.pop_back();
  }

The pop_back() method calls the removed object's destructor, so that side of things is taken care of. What I'm not sure of is whether a dynamically allocated object's memory will also be freed by this action.

If the vector's element type is a pointer, it is only the pointer itself that is destructed. If the SomethingManager object owns the Something objects, you might consider using smart pointers.

I'm considering something like this in the context of ~DebugModuleManager() in place of the existing iterator-based deletion of m_debugModules elements:

  while (!m_debugModules.empty()) {
    DebugModule *mod = m_debugModules.back();
    m_debugModules.erase(m_debugModules.begin()+m_debugModules.size()-1);
    delete mod;
  }

This looks correct, but you can use pop_back() instead of erase(...).

As the same code is used multiple times, consider using a helper function:

template<typename T>
void delete_pointers_from_vector(vector<T*>& vec)
{
  while (!vec.empty()) {
    T* ptr = vec.back();
    vec.pop_back();
    delete ptr;
  }
}

Or even more C++ish:

template<typename T>
inline void delete_ptr(T* ptr)
{
  delete ptr;
}

template<typename T>
void delete_pointers_from_vector(vector<T*>& vec)
{
  vector<T*> d;
  d.swap(vec);
  for_each(d.begin(), d.end(), delete_ptr<T>);
}

(in reply to: ↑ 43 ) 10/09/11 05:49:35 changed by jwoithe

Replying to cladisch

Thanks for the comments and clarification. I'll do up a patch incorporating this in the next day or so; then we can see whether it makes any difference to those experiencing the crash on exit. Even if it doesn't fix the crashes it seems like it's a fix we need to make regardless.

10/10/11 00:12:37 changed by adi

  • status changed from new to closed.
  • resolution set to fixed.

Fixed in r1999. Kudos to David Henningsson for the patch (via e-mail).

10/11/11 02:53:10 changed by jwoithe

For the record, here is the text of David's post to ffado-devel which explains the rationale of the above patch:

I can reproduce the crash, and have attached a patch which fixes it here.

There are two problems, which both are already mentioned in the report:

First, debug modules are not allocated pointers, in fact, the standard way of making one (using DECLARE_DEBUG_MODULE macro) does not. Therefore you must not delete it.

Second, DebugModuleManager::instance() becomes a dangling pointer after the instance has been deleted, so DebugModule? instances must not reference it.

The attached patch fixes both problems.

In the longer term we should also fix the issue of DebugModuleManager::instance() becoming a dangling pointer on close-down, for completeness. With the revised code though this isn't a current problem. Also, similar uses of iterators to delete member objects of other manager functions in ffado also need to be addressed along the lines of that outlined in comment:42 and comment:43. While most of these occur in destructors they do crop up in other places as well.

03/28/12 04:47:48 changed by jwoithe

Note: the problems discussed here may have been related to those raised in ticket #283.