Ticket #354 (closed bug: fixed)

Opened 12 years ago

Last modified 12 years ago

ffado-test BusReset: sometimes this command does not end

Reported by: yves.g Assigned to:
Priority: minor Milestone:
Component: generic Version: FFADO SVN (trunk)
Keywords: ffado-test BusReset Cc:
The device the bug applies to: All

Description

I am opening this ticket after a discussion on ffado-devel, just to leave a trace in this trac, since the problem is still open.

The problem is with Focusrite Saffire Pro24, Ubuntu Precise Pangolin 12.04 (using jackdmp). I am using recent revisions of svn (last time I observed the problem was with r2161).

The command "ffado-test BusReset" sometimes never ends, being blocked after printing the line:

Doing busreset on port 0

When I launch qjackctl, this bug prevents the application from responding. If I kill the process ffado-test, then qjackctl continues working, and jack is started normally.

I am attaching a corresponding extract of /var/log/kern.log and the output of "strace ffado-test BusReset".

Attachments

strace.log (29.3 kB) - added by yves.g on 06/03/12 07:13:45.
result of "strace ffado-test BusReset"
kernel.log (11.1 kB) - added by yves.g on 06/03/12 07:14:35.
content of /var/log/kern.log

Change History

06/03/12 07:13:45 changed by yves.g

  • attachment strace.log added.

result of "strace ffado-test BusReset"

06/03/12 07:14:35 changed by yves.g

  • attachment kernel.log added.

content of /var/log/kern.log

06/04/12 04:46:40 changed by jwoithe

  • priority changed from major to minor.

Some points to stimulate further discussion. Firstly, I'm reducing the priority to "minor" because as far as I can see there's really no reason why anyone should want to routinely run "ffado-test BusReset". Sure, the fact that this hangs is a problem we need to fix, but unless the issue can be demonstrated to affect the primary functions of libffado it doesn't really count as a "high priority" bug. I should note that libffado normally does not issue bus resets.

Another thing to note is that running ffado-test has nothing to do with jackd, so jackd itself (no matter what version) can be discounted here.

Stefan: I would appreciate your thoughts on the strace log and kernel log which have been attached by the reporter. I can't see anything obviously amiss in the kernel log. As to the strace, it seems a futex is taken from within the bus reset ioctl and is never released by whatever is holding it. My guess is that one of the threads within ffado-test is holding a lock on the firewire device for some reason. I can't quite understand why though, because theoretically ffado-test shouldn't have to do much except initiate the bus reset. A closer study of ffado-test may be needed here.

yves.g: Another test that's worth doing is to get gscanbus and try initiating a bus reset from that. This will at least tell us whether the problem is isolated to ffado-test. In addition, what happens if you try "ffado-test BusReset" without the Pro24 connected?

Finally, I wonder whether this issue is isolated to the Pro24 or happens with other interfaces as well. I can't really see how ffado-test's bus reset initiation could possibly depend on the devices on the bus, but stranger things have happened.

06/04/12 06:01:56 changed by jwoithe

  • owner changed.
  • component changed from devices/dice to generic.

Here are some additional developments. Firstly, I can reproduce the hang of "ffado-test BusReset" with the RME FF800 connected. More importantly, however, I can also trigger the problem without any active devices on the bus at all. This pretty much rules out the problem being specific to the Pro24, so I've removed DICE from the ticket component.

"ffado-test BusReset" calls the busreset() function. It's a pretty simple function: create an Ieee1394Service object, use this to initiate a bus reset, and then exit. As part of the exit the Ieee1394Service destructor is called. Running gdb on the apparently frozen process reveals a number of active threads (far more than one might have expected, but then again Ieee1394Service does set up a lot of the infrastructure that would be required if streaming were to be done). There are 10 in all:

(gdb) info threads
  10 Thread 0xb6c2fb70 (LWP 2373)  0xb74dd920 in sem_wait@@GLIBC_2.1 () from /lib/libpthread.so.0
  9 Thread 0xb642fb70 (LWP 2374)  0xb72c61dc in clock_nanosleep () from /lib/librt.so.1
  8 Thread 0xb5c2fb70 (LWP 2375)  0xb72c61dc in clock_nanosleep () from /lib/librt.so.1
  7 Thread 0xb542fb70 (LWP 2376)  0xb6f42a1c in epoll_wait () from /lib/libc.so.6
  6 Thread 0xb4c2fb70 (LWP 2377)  0xb6f42a1c in epoll_wait () from /lib/libc.so.6
  5 Thread 0xb442fb70 (LWP 2378)  0xb6f42a1c in epoll_wait () from /lib/libc.so.6
  4 Thread 0xb3c2fb70 (LWP 2379)  0xb72c61dc in clock_nanosleep () from /lib/librt.so.1
  3 Thread 0xb342fb70 (LWP 2380)  0xb74de224 in __lll_lock_wait () from /lib/libpthread.so.0
  2 Thread 0xb2c2fb70 (LWP 2381)  0xb6f0486c in nanosleep () from /lib/libc.so.6
* 1 Thread 0xb6e326d0 (LWP 2372)  0xb74d7c59 in pthread_join () from /lib/libpthread.so.0

Thread 1 corresponds to the "main" thread, the one which calls busreset(). A simple "where" shows that this is hung up in the Ieee1394Service destructor:

#0  0xb74d7c59 in pthread_join () from /lib/libpthread.so.0
#1  0xb760fafd in Util::PosixThread::Stop (this=0x9897e48) at src/libutil/PosixThread.cpp:198
#2  0xb75e69f9 in IsoHandlerManager::~IsoHandlerManager (this=0x98841b8, __in_chrg=<value optimized out>)
    at src/libieee1394/IsoHandlerManager.cpp:494
#3  0xb75e6b6c in IsoHandlerManager::~IsoHandlerManager (this=0x98841b8, __in_chrg=<value optimized out>)
    at src/libieee1394/IsoHandlerManager.cpp:507
#4  0xb75dee37 in Ieee1394Service::~Ieee1394Service (this=0xbfd666b4, __in_chrg=<value optimized out>)
    at src/libieee1394/ieee1394service.cpp:113
#5  0x0804bba1 in busreset (port=0) at tests/test-ffado.cpp:223
#6  0x0804bf7a in main (argc=4, argv=0xbfd67fa4) at tests/test-ffado.cpp:303

This indicates that the bus reset succeeds and that the problem is in the close-down of the Ieee1394Service object.
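
For orientation, the busreset() code path described above has roughly the following shape. This is only an illustrative sketch, not the actual tests/test-ffado.cpp code; the header path and the initialize()/doBusReset() method names are assumptions and may not match the real Ieee1394Service API.

// Hypothetical sketch of the busreset() flow -- not the actual FFADO code.
// The header path and the initialize()/doBusReset() names are assumptions.
#include "libieee1394/ieee1394service.h"

static int busreset(int port)
{
    Ieee1394Service service;            // constructor sets up the helper threads
    if (!service.initialize(port)) {    // open the requested port
        return -1;
    }
    service.doBusReset();               // initiate the bus reset itself
    return 0;
    // Leaving scope runs ~Ieee1394Service(), which deletes the
    // IsoHandlerManager and joins its threads -- the point where the
    // hang is observed.
}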

The other threads are as follows.

Thread 3 appears to be stuck waiting for a lock as part of its creation:

#0  0xb74de224 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0xb74d91ad in _L_lock_810 () from /lib/libpthread.so.0
#2  0xb74d8fd1 in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0xb760ebdc in Util::PosixMutex::Lock (this=0x9897e70) at src/libutil/PosixMutex.cpp:134
#4  0xb760f1ac in Util::PosixThread::ThreadHandler (arg=0x9897e48) at src/libutil/PosixThread.cpp:67
#5  0xb74d6c04 in start_thread () from /lib/libpthread.so.0
#6  0xb6f4202e in clone () from /lib/libc.so.6

I find frames 2 and 3 in thread 1 interesting: IsoHandlerManager.cpp:507 is the end of the ~IsoHandlerManager() destructor. IsoHandlerManager.cpp:494, on the other hand, is part way through ~IsoHandlerManager() acting on the same object, which doesn't make a whole lot of sense. This has me wondering whether there's another subtle object-destruction race condition going on here: that something is causing ~IsoHandlerManager() to be called twice on the same object. What I can't work out is why the first call frame remains on the stack while the second is active. And to my eyes the disposal of the IsoHandlerManager object should be fine: when the Ieee1394Service destructor is called it simply calls delete on the IsoHandlerManager; there are no iterators involved, so there's no obvious mechanism for control to re-enter ~IsoHandlerManager().

If this is indeed what's happening it's little surprise we're getting an apparent hang because the object being acted on has already been disposed of (or at least its components have been). Of course the question is what's behind this.

06/04/12 06:35:02 changed by jwoithe

Some more debugging info about the threads.

  • Thread 10: DebugModuleManager::mb_thread_func thread
  • Thread 9: Util::Watchdog::WatchdogHartbeatTask thread
  • Thread 8: Util::Watchdog::WatchdogCheckTask thread
  • Thread 7: the BUSRST thread
  • Thread 6: the ARMSTD thread
  • Thread 5: the ARM-RT thread
  • Thread 4: this thread runs the cycle timer helper
  • Thread 3: the lock being requested is ISOXMT
  • Thread 2: the IsoTask type handled by this thread is eHT_Receive

Thread 1 (the main thread) is supposedly hung up at IsoHandlerManager.cpp:494, waiting for a thread to complete. That thread is the iso transmit thread, which is precisely the thread that's held up on a lock during thread object creation (thread 3). It's still not clear to me why ~IsoHandlerManager() is apparently being re-entered, but the state of thread 3 may not be a direct consequence of this because thread 3 never really gets to run. IsoHandlerManager::IsoTask::Execute() should be called from Util::PosixThread::ThreadHandler() at PosixThread.cpp:90 (as happens in thread 2), but instead it got hung up acquiring a lock earlier on during startup, at PosixThread.cpp:67.

06/04/12 06:39:45 changed by stefanr

How about running ffado-test BusReset in valgrind or a similar tool in order to check for use-after-free, use of uninitialized memory and the like?

Furthermore, do debugging aids exist for lock dependency checking? I.e. something which detects lock recursion, ABBA locking etc.? (I merely know that kernel programmers have such means, so I wonder if there isn't anything of that kind for userspace programming.)

06/04/12 07:22:43 changed by jwoithe

Ok, here's a theory arrived at by tracing thread entry points. It seems that some of the FFADO threads take a noticeable time to start, and that sometimes they have not all started by the time the bus reset component of ffado-test has completed its work and the Ieee1394Service destructor is called. This then causes trouble because the close-down code assumes that all threads are operational and behaves accordingly. In the case of "ffado-test BusReset", if the ISO transmit thread wasn't completely started by the time the Ieee1394Service destructor was called, bad things would happen and the freeze would be the result. Whenever that thread's startup completed before the destructor was called, "ffado-test BusReset" would exit cleanly.

Svn r2162 addresses this in the obvious way, by inserting a small delay to give the thread objects a chance to start before moving on (a neater solution may be to query the thread objects for their status and continue when they're all running, but I don't have time to work out how to do that right now, and it may not be necessary to go with something that complex). This appears to have made "ffado-test BusReset" stable on my system (in that I haven't had a freeze since implementing r2162). However, since we're dealing with a timing-related issue it would be good if others could test this before I close this ticket as done. For example, to cope with slower machines it may be necessary to increase the delay somewhat.
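
For illustration only (this is not the r2162 diff, and the delay value shown is arbitrary), the workaround amounts to the following pattern around thread creation:

// Shape of the r2162 workaround, illustrative only: after creating a helper
// thread, pause briefly so its handler has a chance to start running before
// the caller can proceed (and potentially tear everything down again).
#include <pthread.h>
#include <unistd.h>

static void *thread_handler(void *arg)
{
    (void)arg;
    // ... take the thread's lock, then run its main loop ...
    return 0;
}

int start_with_grace_period(pthread_t *tid)
{
    int err = pthread_create(tid, 0, thread_handler, 0);
    if (err != 0)
        return err;
    usleep(1000);   // ad hoc grace period; the value used in r2162 may differ
    return 0;
}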

Yves: please test r2162 and report your findings.

I'm also wondering whether r2162 may address some intermittent start-up glitches I've seen from time to time (again, caused by the assumption that certain threads are running when in fact they're not). That's off-topic for this ticket though; time will tell.

Running ffado-test in valgrind isn't a bad suggestion irrespective of the outcome of the above testing. I'll put it on my list of things to do since I'm out of time now (it's getting late). I'm unsure about the lock dependency checking: I'd have to check that out.

06/04/12 12:38:25 changed by yves.g

While still on r2161, before testing the fix, I verified that gscanbus went OK. I also verified that "ffado-test BusReset" finished immediately when the Pro24 was not connected. Conversely, it did not finish when the Pro24 was powered on (repeated a few times).

I then compiled r2162, and after that the command "ffado-test BusReset" returned to normal behaviour, finishing correctly after resetting the bus. I rebooted my PC several times, with the Pro24 on or off during reboot. I did not observe any situation where ffado-test was blocked. Maybe I should wait two or three days and run several more tests before concluding that r2162 solves the problem (the error was intermittent, as noted in the title). But up to now, after two hours of repeated tests, the problem has not occurred, and we may reasonably presume that it is solved. If it does not reappear in the next few days, I will mark the ticket as solved.

06/04/12 16:34:42 changed by jwoithe

  • device_name changed from Focusrite Saffire Pro 24 to All.

Quoting from comment 6:

I verified that gscanbus went OK

Ok, that provides further confirmation that the fault is not in the bus reset itself, but rather within the FFADO infrastructure. In fact, as explained in comment 5 the bus reset itself does actually complete; it was the tear-down of the FFADO objects which caused issues.

I also verified that "ffado-test BusReset" finished immediately when the Pro24 was not connected.

I think you'll find that if you tried a large number of times without the Pro24 connected (or connected but not turned on), you would have seen the hang. As I said, I was able to get the hang regardless of whether the FF800 was turned on. What I neglected to mention was that with the device turned off it was much harder to trigger the hang - but it could be done. Given the nature of the problem as we now understand it, though, this is not surprising. The point here is that the issue is totally independent of the device - which again comes as no surprise, since no device-specific code is executed by Ieee1394Service, or more generally by "ffado-test BusReset".

I then compiled r2162 ... I did not observe any situation where ffado-test was blocked.

This is good news. Thanks for confirming.

As I mentioned, a somewhat more robust solution is to introduce a method whereby the threads can indicate when they are actually running and ready for use. That would avoid the need to guess an appropriate delay. However, if it can be proven that the simple solution of the added delay is adequate I'm inclined to run with that, at least for now.

06/11/12 11:07:31 changed by yves.g

I have been testing for one week, and unfortunately the problem reappeared on one of those seven days. The other days went OK. When the problem happened, I switched my interface off and on and tried to start jack again. This failed until the fourth attempt, which succeeded.

I would say that the delay introduced by Jonathan in r2162 solves the problem most of the time, but in some circumstances it is not sufficient.

(follow-up: ↓ 11 ) 06/11/12 16:28:13 changed by jwoithe

I would say that the delay introduced by Jonathan in r2162 solves the problem most of the time, but in some circumstances it is not sufficient.

Thanks for the report.

A more likely scenario is simply that the delay isn't quite long enough: that in most cases it is, but that there are cases where it is not. I will do one of two things in the next day or so, depending on the outcome of my pondering the situation: either I'll increase the delay, which I suspect will resolve the problem, or I'll put in some signalling so it's possible to determine when all threads have started processing. The latter is a lot more complex, so if a simple increase in the delay value fixes things then I'm inclined to go with that. We'll see.

The other thing in the back of my mind is that "ffado-test BusReset" is not a user command - people generally should not need to do this at all. It's therefore hard to justify spending a lot of developer time fixing something which, at the end of the day, is just a convenience tool for developers. Anyway, more to follow.

(follow-up: ↓ 12 ) 06/12/12 06:29:41 changed by jwoithe

I have re-analysed the situation and now understand fully what was going on. r2164 contains a more robust fix for the problem which doesn't rely on an ad hoc usleep(). So long as I haven't missed any corner cases, r2164 should resolve this issue.

The deadlock described in this ticket occurred via the following sequence of events. Firstly, PosixThread::Start() was called to start the thread, which it did. It returned to the caller and execution of the program continued. In the case of "ffado-test BusReset", the main program completed very quickly, resulting in a call to PosixThread::Stop() in very short order. Critically, due to scheduling behaviour, PosixThread::Stop() could, at times, get executed before the relevant thread's PosixThread::ThreadHandler() had a chance to get going. The end result was that PosixThread::Stop() acquired m_lock and then blocked, waiting for the thread to terminate.

In the meantime, PosixThread::ThreadHandler() was scheduled and it tried to take m_lock. This blocked because in this scenario PosixThread::Stop() already had the lock. This resulted in the observed deadlock.

r2164 addresses this problem by ensuring that PosixThread::ThreadHandler() acquires m_lock before anything else (essentially PosixThread::Start() will not return until the thread has signaled that it has the lock). Even if PosixThread::Stop() is called very soon after PosixThread::Start(), the deadlock cannot occur because PosixThread::ThreadHandler() already holds the lock at this point. This ensures the logic in PosixThread::Stop() (and PosixThread::Kill(), incidentally) works as expected.
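
For clarity, here is a simplified, self-contained sketch of the start-up handshake just described. It is not the actual r2164 code: the m_lock/Start()/Stop()/ThreadHandler() names follow the ticket, but the condition-variable signalling and the Stop() details are illustrative only.

// Simplified illustration of the r2164 idea -- not the actual FFADO code.
#include <pthread.h>
#include <atomic>

class ThreadSketch {
    pthread_t         m_thread;
    pthread_mutex_t   m_lock;          // the lock the ticket calls m_lock
    pthread_mutex_t   m_startup_lock;  // protects the handshake below
    pthread_cond_t    m_startup_cond;
    bool              m_handler_has_lock;
    std::atomic<bool> m_running;

    static void *ThreadHandler(void *arg) {
        ThreadSketch *self = static_cast<ThreadSketch *>(arg);

        // Take m_lock first, then tell Start() we have it.  This ordering is
        // the fix: Stop() can no longer win the race for m_lock.
        pthread_mutex_lock(&self->m_lock);
        pthread_mutex_lock(&self->m_startup_lock);
        self->m_handler_has_lock = true;
        pthread_cond_signal(&self->m_startup_cond);
        pthread_mutex_unlock(&self->m_startup_lock);

        // ... one-time thread setup would happen here ...
        pthread_mutex_unlock(&self->m_lock);

        while (self->m_running) {
            // ... Execute()-style work ...
        }
        return 0;
    }

public:
    ThreadSketch() : m_handler_has_lock(false), m_running(true) {
        pthread_mutex_init(&m_lock, 0);
        pthread_mutex_init(&m_startup_lock, 0);
        pthread_cond_init(&m_startup_cond, 0);
    }

    int Start() {
        int err = pthread_create(&m_thread, 0, ThreadHandler, this);
        if (err != 0)
            return err;
        // Do not return until ThreadHandler() holds m_lock; a Stop() issued
        // immediately afterwards therefore cannot acquire m_lock first.
        pthread_mutex_lock(&m_startup_lock);
        while (!m_handler_has_lock)
            pthread_cond_wait(&m_startup_cond, &m_startup_lock);
        pthread_mutex_unlock(&m_startup_lock);
        return 0;
    }

    int Stop() {
        // Pre-r2164 failure mode: if this ran before ThreadHandler() had taken
        // m_lock, Stop() held m_lock and blocked in pthread_join() while the
        // handler blocked waiting for m_lock -- the observed deadlock.  With
        // the handshake in Start(), this acquisition simply waits until the
        // handler's setup section has finished and released the lock.
        pthread_mutex_lock(&m_lock);
        m_running = false;
        int err = pthread_join(m_thread, 0);
        pthread_mutex_unlock(&m_lock);
        return err;
    }
};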

The reason I took the time to fix this properly is that I have a suspicion that similar race conditions may be behind some of the intermittent start-up problems I've been investigating lately. These don't cause deadlocks, but could (for example) cause thread-based objects to be used prematurely, giving erratic or undefined results.

yves: please re-test using svn r2164 and report your findings.

(in reply to: ↑ 9 ) 06/12/12 13:25:57 changed by yves.g

Replying to jwoithe:

The other thing in the back of my mind is that "ffado-test BusReset" is not a user command - people generally should not need to do this at all. It's therefore hard to justify spending a lot of developer time fixing something which, at the end of the day, is just a convenience tool for developers. Anyway, more to follow.

You are right, "ffado-test BusReset" should not be necessary. I probably use it too frequently. There was a time when I got into the habit of using it systematically before starting jack, and back then it helped me a lot. I will try to forget this command...

(in reply to: ↑ 10 ) 06/12/12 13:29:45 changed by yves.g

Replying to jwoithe:

yves: please re-test using svn r2164 and report your findings.

I compiled r2164 and started jack without any problem. I will continue testing. Since the error was random, it will take some time for me to be convinced that it has disappeared.

Let's talk about it later...

06/13/12 15:42:35 changed by stefanr

yves.g, in a perfect world you wouldn't need it, but anyway: "firecontrol <<< br" is an alternative to "ffado-test BusReset". Or you write your own 7-line utility which opens /dev/fw1 and issues an FW_CDEV_IOC_INITIATE_BUS_RESET ioctl.
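
Such a utility would look roughly like the following sketch. The /dev/fw1 path is only an example and should be replaced by whichever /dev/fw* node corresponds to the controller; error handling is minimal.

/* Minimal bus-reset utility along the lines stefanr suggests: open a
 * firewire character device and issue FW_CDEV_IOC_INITIATE_BUS_RESET. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/firewire-cdev.h>

int main(void)
{
    int fd = open("/dev/fw1", O_RDWR);   /* adjust to the right /dev/fw* node */
    if (fd < 0) {
        perror("open /dev/fw1");
        return 1;
    }
    struct fw_cdev_initiate_bus_reset reset;
    reset.type = FW_CDEV_SHORT_RESET;    /* or FW_CDEV_LONG_RESET */
    if (ioctl(fd, FW_CDEV_IOC_INITIATE_BUS_RESET, &reset) < 0) {
        perror("FW_CDEV_IOC_INITIATE_BUS_RESET");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}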

06/20/12 14:13:48 changed by yves.g

  • status changed from new to closed.
  • resolution set to fixed.

One week later, having tested almost every day, I confirm that the error no longer appears. I propose to consider this bug as "fixed".

06/20/12 17:05:26 changed by jwoithe

Great! Thanks yves.g for the ongoing testing and verification of the fix.

I suspect that the issue addressed by this fix may well have contributed to some of the unpredictable startup glitches reported from time to time. So even if the usage which prompted this ticket is somewhat uncommon, I think the resulting fix is valuable.