Ticket #216 (closed bug: wontfix)

Opened 15 years ago

Last modified 12 years ago

Unhandled XRUN

Reported by: j.silvestre
Assigned to:
Priority: major
Milestone: FFADO 2.1
Component:
Version: FFADO SVN (2.0 branch)
Keywords:
Cc:
The device the bug applies to:

Description

Sometimes jackd reports an xrun even with a big buffer, and sometimes this causes jackd to stop. Attached is a log file with an xrun and an unhandled xrun that causes jackd to stop. The full log file is more than 1.3 GB, so I removed what I think are the useless parts.

Attachments

ffado-jack.log-cut.bz2 (116.1 kB) - added by j.silvestre on 04/28/09 14:54:02.
ffado-diag.log (6.5 kB) - added by j.silvestre on 04/28/09 14:54:24.
ffado-jack.log-20120406 (268.8 kB) - added by jwoithe on 04/07/12 04:10:08.
Copy of ffado jack log from j.silvestre on 6 April 2012
ffado-diag-20120406 (6.3 kB) - added by jwoithe on 04/07/12 04:10:58.
Copy of ffado-diag from j.silvestre on 6 April 2012
ffado-diag (6.3 kB) - added by j.silvestre on 04/07/12 07:32:31.
ffado-jack.log (181.0 kB) - added by j.silvestre on 04/07/12 07:32:50.
ffado-jack-saffire.log (343.8 kB) - added by j.silvestre on 04/11/12 08:49:03.
ffado-jack-fa101.log (195.4 kB) - added by j.silvestre on 04/11/12 08:49:48.
ffado-jack-both.log (283.1 kB) - added by j.silvestre on 04/12/12 06:46:33.
ffado-jack-both-xrun-1024.log.xz (0.7 MB) - added by j.silvestre on 04/13/12 03:30:00.
ffado-jack-both-xrun-1024-1.log.xz (0.8 MB) - added by j.silvestre on 04/13/12 06:29:52.
ffado-jack-fa101-256-strip.log (182.7 kB) - added by j.silvestre on 05/06/12 02:58:58.

Change History

04/28/09 14:54:02 changed by j.silvestre

  • attachment ffado-jack.log-cut.bz2 added.

04/28/09 14:54:24 changed by j.silvestre

  • attachment ffado-diag.log added.

04/28/09 14:58:42 changed by j.silvestre

  • version changed from FFADO 2.0-rc1 (1.999.40) to FFADO SVN (2.0 branch).

(follow-up: ↓ 3 ) 04/28/09 15:02:48 changed by ppalmers

There is some sort of bug in the cycle timer update code; I've seen more than one report of this. I'll have to check it out. I suspect that it is a concurrency bug.

(in reply to: ↑ 2 ) 10/12/09 12:09:55 changed by wagi

Replying to ppalmers:

There is some sort of bug in the cycle timer update code; I've seen more than one report of this. I'll have to check it out. I suspect that it is a concurrency bug.

Can you elaborate a bit more? I have started to walk through the code but without any results so far.

11/22/09 10:18:47 changed by ppalmers

  • milestone changed from FFADO 2.0 to FFADO 2.1.

Moving this to 2.1 to see whether more issues come up.

03/28/12 05:14:47 changed by jwoithe

Since this was moved to the 2.1 milestone I am not aware of any further issues in this regard. What do others think - is this still a valid ticket, or should I close it off?

About the only thing I can think of is that perhaps this is another manifestation of the effects of NTP (see ticket #242).

03/29/12 01:11:57 changed by wagi

IIRC, around the time when this bug was reported, someone (Stefan Richter?) added a workaround in the ohci driver for reading correct values from the cycle timer register. Under certain conditions some hardware reported bogus values. The trick is to sample the register a few times and interpolate the values.

Rereading what Pieter said on the cycle timer update code in 2009, it could be that the fix in the ohci layer fixed this bug as well.

Of course, NTP remains a problem, but that has to be addressed in a different way I guess.

03/29/12 03:08:13 changed by jwoithe

  • status changed from new to closed.
  • resolution set to fixed.

What Wagi said in the previous comment (comment 6) makes sense. The OHCI fix, as far as I can tell, dealt with some buggy hardware which is clearly a separate issue to NTP (ticket #242). Given this and Wagi's comment I'll close this ticket as "fixed" (it can always be reopened if the same problem is observed again). Ticket #242 remains open to deal with the separate NTP issue.

03/29/12 09:23:22 changed by stefanr

I am not sure whether Pieter suspected specifically a bug in FFADO's handling of controller bugs regarding cycle timer, or generally a bug in FFADO's cycle timer usage (per comment 2). Even if it was the former, it still cannot be concluded that recent kernels' handling of those controller bugs automatically prevents such a suspected issue in FFADO from happening. Maybe the respective FFADO code paths are still taken.

What's more: The reporter has got a "Texas Instruments PCIxx12 OHCI Compliant IEEE 1394 Host Controller [104c:803a]". I have a suspicion that this one is indeed quirkier than most other TI controllers. (Example of a PCIxx12 oddity: https://bugzilla.redhat.com/show_bug.cgi?id=608544) However, we know little about TI PCIxx12. It is as yet absent from firewire-ohci's hardwired workarounds table (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/firewire/ohci.c;h=7f5f0da726dafde5ae23ad1343861e12857c5002#l280). Hence the kernel's cycle timer workaround is not active for this controller.

An owner of a TI PCIxx12 can use "test_cycle_time" from http://user.in-berlin.de/~s5r6/linux1394/utils/ to check whether this controller needs the workaround. It would be good to get to know whether or not this is the case. If yes, it is quick and easy to get the necessary table entry released into the mainline kernel and subsequently all actively maintained distributor kernels.

So the ideal course of action would be:

  • The reporter (or anybody else who can) tests PCIxx12 for the hardware bug. Instructions are at the top of the source file of "test_cycle_time".
  • If the hardware bug is present, the reporter manually activates firewire-ohci's workaround. I will gladly give instructions, it's simple. Then the reporter attempts to reproduce the initial issue of occasional unhandled xruns.
  • If there still are unhandled xruns, someone familiar with FFADO's debug messages needs to compare the old and new logs and figure out whether this is still the same problem or something new.

Re comment 6: The kernel's workaround was written by Clemens.

03/29/12 11:58:51 changed by wagi

Thanks, Stefan, for really looking into the log files.

Maybe adding/including Stefan's test_cycle_time to the ffado-diag script would be a good idea.

03/30/12 04:19:13 changed by jwoithe

From comment 9:

Maybe adding/including Stefan's test_cycle_time to the ffado-diag script would be a good idea.

To get adequate statistics, test_cycle_time runs for 60 seconds. I'm not sure it's a good idea to put something so lengthy into ffado-diag. However, others may disagree.

In the meantime I've pulled Stefan's test_cycle_time-20100125 into FFADO's tests/ directory as test-cycle-time (to be consistent with the naming convention used for other tools in that area). It's in r2100. At the very least this makes test-cycle-time more easily accessible. If we suspect someone's experiencing the problem they can at least be directed to run this now. If others think that this is something we can routinely add to ffado-diag then I'm not fussed if it's added to that script. Who knows; if it's put in there we might pick up some other chipsets which suffer from the problem.

Note that test-cycle-time needs to be run as root or via sudo.
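
For example, something along these lines should do it (a rough sketch only - this assumes an FFADO svn checkout already built with scons, and the exact path of the built binary may differ on your setup):

    # run the cycle timer register check from the top of an FFADO source tree
    sudo ./tests/test-cycle-time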

03/30/12 04:21:17 changed by jwoithe

... I've pulled Stefan's test_cycle_time-20100125 into FFADO's tests/ directory ... It's in r2100.

Err, make that r2101.

04/03/12 04:29:56 changed by j.silvestre

Running test_cycle_time gives:


grep: /usr/share/misc/oui.db: No such file or directory
080038 (unknown Vendor OUI)

0 cycleOffset backwards out of 32775740 samples (0.00e+00)
0 cycleCount backwards (0.00e+00)
0 cycleSeconds backwards (0.00e+00)

As I now use ffado only occasionally I can't say if the bug is still present or not. As far as I remember this bug occurred quite rarely.

(follow-up: ↓ 14 ) 04/03/12 04:55:07 changed by jwoithe

j.silvestre - thanks for taking the time to run test_cycle_time on your hardware. From these results it seems we can discount the hardware issues mentioned by Stefan in comment 8. I assume this test was run against the PCIxx12 firewire card mentioned in the original report.

Having finally had a chance to look at the attachments, I notice a few other things. Firstly, this is, as the original report says, a report against FFADO 2.0. To really progress this we would need a test using a recent FFADO version. Secondly, the test concerned had two devices connected to the firewire bus - an Edirol FA-101 and a Saffire LE - and FFADO did aggregate both as far as I can tell. Whether this has any bearing on the xrun remains to be seen.

To move this ticket on, a test with FFADO trunk is definitely warranted - starting with a single device and then adding the second device if things seem stable. Since the 2.0 days significant work has been done on the streaming system, and on device aggregation in particular, and it's possible that at least some of the contributors to the erratic xruns have in fact been fixed.

Given that j.silvestre now only occasionally runs FFADO, an upgrade to a more recent version may be hard to justify. If you wish to do this then that would be appreciated, but at the same time if it's all too hard and you don't have the time I would understand completely.

If further testing is possible we can re-open this ticket to see if it can be progressed further (assuming that makes sense). All indications are that it will be very hard to nail down.

(in reply to: ↑ 13 ) 04/03/12 05:02:00 changed by wagi

Replying to jwoithe:

Having finally had a chance to look at the attachments, I notice a few other things. Firstly, this is, as the original report says, a report against FFADO 2.0. To really progress this we would need a test using a recent FFADO version. Secondly, the test concerned had two devices connected to the firewire bus - an Edirol FA-101 and a Saffire LE - and FFADO did aggregate both as far as I can tell. Whether this has any bearing on the xrun remains to be seen.

Just to toss in an idea: Couldn't this be a sync problem? The FA-101 really wants to be the clock source, IIRC.

04/03/12 05:13:31 changed by jwoithe

It could be. The trace provided contains a bunch of sync attempts and none of them seems to work, but in the end FFADO continues anyway. The eventual failures (one of which triggers the fatal xrun) may well be a consequence of the syncing difficulties. The trouble is that I don't know enough about the streaming and synchronisation systems to be able to say too much about them based on the contents of this log.

So yes, I wouldn't rule out the possibility of a device sync issue here.

(follow-up: ↓ 17 ) 04/03/12 05:42:55 changed by j.silvestre

I've just updated ffado to the latest svn and jack is currently running with the FA-101 only. I still have the Saffire so if needed I can run both together. I really don't remember for sure whether the bug occurred with only one device at a time. Of course computer hardware is the same as in the bug report. Kernel is not. Now it's 2.6.33.7-rt30 and new firewire stack.

(in reply to: ↑ 16 ) 04/03/12 16:50:51 changed by jwoithe

  • status changed from closed to reopened.
  • resolution deleted.

Replying to j.silvestre:

I've just updated ffado to the latest svn and jack is currently running with the FA-101 only.

Ok, this in and of itself is good news. Thanks for testing with the latest FFADO.

I still have the Saffire so if needed I can run both together.

Given that the original problem occurred with both devices connected to the system, to be fair we probably need to replicate that situation here.

Of course computer hardware is the same as in the bug report.

Thanks for confirming.

Kernel is not. Now it's 2.6.33.7-rt30 and new firewire stack.

Ok, that's clearly a fair bit newer than the one in the original report, which is good. Could you perhaps run ffado-diag again on the system as it stands at the moment so we can determine whether library versions or anything else may be contributing to the problem?

It would be good to try an even newer kernel at some point if that's not too much hassle, since there have been fixes to the new firewire stack since 2.6.33 which improve things for FFADO. However, if this is done libraw1394 should be upgraded to at least 2.0.7 - so there's a fair bit of work involved. Unless you're particularly keen to do a kernel upgrade, let's take things one at a time and cross the whole kernel-upgrade bridge later if it proves necessary.

So to summarise the next steps: run ffado-diag on the current system and post the results. Also re-run the original test - with both the FA-101 and Saffire LE connected - and see what happens.
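
In shell terms that's roughly the following (a sketch only - the jackd options shown are just an example, use whatever you normally run):

    # capture a fresh diagnostic snapshot
    ffado-diag > ffado-diag.txt
    # re-run the original test with both devices connected, keeping the verbose log
    jackd --verbose -P75 -dfirewire -v6 -n3 -r44100 -p512 2>&1 | tee ffado-jack.log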

I'll re-open this ticket since it seems we can make some progress on determining what the current status really is.

04/06/12 06:53:11 changed by j.silvestre

ffado doesn't use both devices. Looks like both are seen but only the first one is used.

Jackd invocation is: jackd --verbose -P75 -dfirewire -v6 -n3 -r44100 -p512

http://jdlv.homelinux.org/ffado-diag http://jdlv.homelinux.org/ffado-jack.log

As this laptop is now retired upgrading kernel and libs will not be a problem.

04/07/12 04:08:21 changed by jwoithe

Thanks for these latest logs. According to the ffado-jack.log file, I agree that only one of the two devices is now used. However, in the original log from 3 years ago it seems that ffado did attempt to use both devices. The evidence is in the lines after the 00101133847 timestamp: there are two transmit stream processors and two receive stream processors, one each for the two different devices.

That aside, it seems that the latest run has progressed further than the first one you posted about. The streams are now aligned and the streaming system attempts to run (the alignment failed in the initial logs). In fact, the run you've posted appears to have operated just fine. Is this what you observed at the time?

On the basis of this the initial problem may well be associated with the synchronising of the two devices (as suggested by Wagi in comment 14). It seems the old FFADO tries to use both devices by default while the current trunk only uses the first device unless told otherwise.

What do you think is the best way of proceeding at this point? To me it seems that at least the current FFADO works correctly with the single device on your system, although extended testing would tell us whether there is still an intermittent glitch affecting FFADO. It would also be interesting to tell FFADO to use both devices and see what happens (sorry, I don't know off-hand how to do this - others might be able to assist here). How feasible are either of these test proposals?

On the subject of the system setup, if the xruns persist it would be worth trying a newer kernel (I think anything from 2.6.37 on is good - am I right here Stefan?) and a newer libraw1394 (2.0.7 or later). The versions in use on your system (kernel 2.6.33.7.2-rt30-1-686 and libraw1394 2.0.5) are quite old and have known issues, although I don't know for sure that any of these known issues align with the symptoms you are seeing.

To ensure they remain accessible into the future I'll attach copies of the two latest logs to this ticket.

04/07/12 04:10:08 changed by jwoithe

  • attachment ffado-jack.log-20120406 added.

Copy of ffado jack log from j.silvestre on 6 April 2012

04/07/12 04:10:58 changed by jwoithe

  • attachment ffado-diag-20120406 added.

Copy of ffado-diag from j.silvestre on 6 April 2012

04/07/12 06:19:56 changed by stefanr

On software versions vs. dual device usage:

Initially, the newer kernel drivers firewire-ohci/core + libraw1394 did not work for more than a single audio device on the bus at a time. A respective bug in the /dev/fw* ABI was fixed in kernel 2.6.36 and libraw1394 2.0.6. The same or older kernel and libraw1394 did not have this problem if used with the older kernel drivers ohci/ieee/raw1394.

Normally FFADO would by default aggregate all audio devices which it finds on the bus. But in this case it did not do so, probably because of this firewire-core ABI bug.

(Note, if libraw1394 is built from source, first make sure to update linux-headers to 2.6.36 or later. Or specify the path to a separate copy of >=2.6.36 firewire kernel headers in a ./configure switch to libraw1394.)
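
A rough illustration of what that amounts to (the grep check and the paths are examples only; pointing the preprocessor at a directory whose linux/ subdirectory holds the newer firewire-cdev.h works with a stock autoconf build even without the dedicated ./configure switch):

    # do the installed headers already know about the fixed ABI?
    grep -q FW_CDEV_EVENT_REQUEST2 /usr/include/linux/firewire-cdev.h && echo "headers are new enough"
    # if not, build libraw1394 against a separate copy of >=2.6.36 headers
    cd libraw1394-2.0.7
    ./configure CPPFLAGS="-I/path/to/newer-headers/include"
    make && sudo make install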

On software versions vs. xrun avoidance:

If firewire-ohci/core are used, libraw1394 >= v2.0.7 is indeed recommended. Which kernel version to recommend I am not so sure of; I would say take the newest that is easily available on your distributions. (Regarding xrun avoidance, improvements in other drivers and in core components like the scheduler are of even greater importance than the occasional firewire driver improvements.)

04/07/12 07:32:31 changed by j.silvestre

  • attachment ffado-diag added.

04/07/12 07:32:50 changed by j.silvestre

  • attachment ffado-jack.log added.

04/07/12 07:34:29 changed by j.silvestre

Replying to jwoithe: yes, ffado seems to run fine with a single device, but it was tested for a few minutes only...

Now the kernel is 3.0.27-rt46 and libraw1394 is 2.0.8, built from source with linux-headers 3.0.27-rt46. FFADO still runs a single device.

04/08/12 03:47:26 changed by jwoithe

Regarding the kernel and libraw1394 versions, it's good that you're now running the most current of both since that eliminates a whole bunch of variables.

I've never attempted to aggregate multiple devices before so I'm not sure what's involved exactly. Like Stefan wrote in comment 20, I was under the impression that FFADO aggregated all devices it found without anything special having to be done. Clearly this isn't happening in your case, but it's something that we probably ought to get to the bottom of.

Looking at the latest jackd log, it is clear that FFADO did see both the FA-101 and the Saffire LE on the bus. While probing the Saffire though it seems FFADO encounters a bunch of FCP errors and ultimately decides that it can't discover the unit. I'm wondering whether this is the reason FFADO does not go on to aggregate the Saffire with the FA-101, but somewhat more important is why the FCP errors occur in the first place. (At this point in time it would probably be helpful if I knew more about the Bebob protocol than I do :-) ).

Could you please try running jackd as you did before but with only the Saffire LE plugged into the bus? This will tell us whether the FCP errors are a consequence of having multiple interfaces plugged into the bus or are instead a sign of something deeper going wrong.

04/11/12 08:49:03 changed by j.silvestre

  • attachment ffado-jack-saffire.log added.

04/11/12 08:49:48 changed by j.silvestre

  • attachment ffado-jack-fa101.log added.

04/11/12 08:55:29 changed by j.silvestre

Added ffado-jack-saffire.log (Focusrite Saffire alone) and ffado-jack-fa101.log (FA101 alone).

I tried changing the firewire cable linking the two devices; no change, jackd still runs only one device at a time.

04/11/12 17:53:33 changed by jwoithe

Thanks for these additional tests. Evidently when the devices are plugged in separately they each seem to work just fine. It's only when they are plugged in together that one throws those FCP errors which in turn prevent FFADO from discovering one of the devices. The obvious conclusion is that the FCP errors noted in the earlier log are a consequence of having both devices plugged into the bus. The next thing to determine is why this happens. Of course all this is quite a different problem to the original ticket issue (which may or may not be solved now).

I'm no expert on the protocol used by either of these two devices - I'm not even sure what the FCP error means in practice. Could someone else who is familiar with the FA-101 / Saffire LE protocol take a look at the log from the latest attempt to run them together (ffado-jack.log)? As far as I can tell, the reason both devices are not aggregated by FFADO is the following error (which occurs multiple times throughout the probing of the Saffire LE device):

handleFcpResponse: FCP response node id's don't match! (ffc1, ffc0)

I'm wondering whether it's got something to do with the Bebob driver confusing the two devices for some reason.

04/11/12 23:41:12 changed by stefanr

handleFcpResponse: FCP response node id's don't match! (ffc1, ffc0)

I have not yet looked at the new ffado-jack.log. But this particular error is exactly what libraw1394 would give if it is too old (less than 2.0.6) or compiled against too old kernel headers (less than 2.6.36).

About ffado-diag from 04/07/12:

 Prerequisites (dynamic at run-time)...
   libraw1394 ........ 2.0.8
 Prerequisites (static at compile-time)...
   libraw1394 ........ 2.0.5

I always fail to remember what these two different Prerequisites sections want to tell us.

Please check with

ldd $(which jackd | sed -e s/bin.jackd//)lib/jack/jack_firewire.so
ldd $(ldd $(which jackd | sed -e s/bin.jackd//)lib/jack/jack_firewire.so | grep libffado.so | cut -f3 -d' ')

whether jack_firewire or libffado is still linked against an old libraw1394.

04/12/12 00:14:39 changed by jwoithe

Regarding those prerequisites, a quick glance in ffado-diag tells me the following.

  • The versions reported as "dynamic at runtime" are obtained by querying the relevant package for its version number, usually via a call to "pkg-config --modversion <name>".
  • The "static at compile-time" versions are dumped from the file static_info.txt, which for most people would end up in /usr/local/share/libffado/python/. The file itself is generated using the ffado-diag-static script at compile/install time; the versions it reports are those returned from "pkg-config" as run on the system at the time "scons install" was run.

Given this, the ffado-diag output from 7 April 2012 indicates that the static_info.txt file being accessed by ffado-diag on 7 April claims that libraw1394 version 2.0.5 was installed at the time this file was created. Theoretically this would have been the version present at compile/install time, but perhaps there's been some other glitch which has meant that static_info.txt is stale on this system.

If static_info.txt is accurately indicating the version of libraw1394 that ffado was compiled against, is this an issue? I don't think ffado adapts to the libraw1394 version it's compiled or linked against. Therefore the fact that ffado may have been compiled against libraw1394 2.0.5 shouldn't be an issue, should it? What matters is what it's run against, and the system clearly has 2.0.8 installed now. Of course, that still leaves open the possibility that libraw1394 was compiled against headers from a pre-2.6.36 kernel. I'm not sure there's a succinct way of testing for that, is there?
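
Incidentally, if anyone wants to see what their local copy claims, something like the following should show it (the path is the default mentioned above; the exact layout of the file is a guess):

    grep -i raw1394 /usr/local/share/libffado/python/static_info.txt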

04/12/12 04:47:00 changed by j.silvestre

$ ldd $(which jackd | sed -e s/bin.jackd//)lib/jack/jack_firewire.so
	linux-gate.so.1 =>  (0xb778e000)
	libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0xb7670000)
	libjackserver.so.0 => /usr/lib/libjackserver.so.0 (0xb75c8000)
	libdl.so.2 => /lib/i686/cmov/libdl.so.2 (0xb75c3000)
	librt.so.1 => /lib/i686/cmov/librt.so.1 (0xb75ba000)
	libpthread.so.0 => /lib/i686/cmov/libpthread.so.0 (0xb75a1000)
	libffado.so.2 => /usr/lib/libffado.so.2 (0xb72ba000)
	libm.so.6 => /lib/i686/cmov/libm.so.6 (0xb7294000)
	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb7275000)
	libc.so.6 => /lib/i686/cmov/libc.so.6 (0xb712f000)
	/lib/ld-linux.so.2 (0xb778f000)
	libiec61883.so.0 => /usr/lib/libiec61883.so.0 (0xb7123000)
	libraw1394.so.11 => /usr/lib/libraw1394.so.11 (0xb7116000)
	libconfig++.so.8 => /usr/lib/libconfig++.so.8 (0xb7101000)
	libxml++-2.6.so.2 => /usr/lib/libxml++-2.6.so.2 (0xb70dc000)
	libxml2.so.2 => /usr/lib/libxml2.so.2 (0xb6fb1000)
	libglibmm-2.4.so.1 => /usr/lib/libglibmm-2.4.so.1 (0xb6f59000)
	libgobject-2.0.so.0 => /usr/lib/libgobject-2.0.so.0 (0xb6f1b000)
	libsigc-2.0.so.0 => /usr/lib/libsigc-2.0.so.0 (0xb6f15000)
	libgthread-2.0.so.0 => /usr/lib/libgthread-2.0.so.0 (0xb6f10000)
	libglib-2.0.so.0 => /lib/libglib-2.0.so.0 (0xb6e46000)
	libz.so.1 => /usr/lib/libz.so.1 (0xb6e32000)
	libgmodule-2.0.so.0 => /usr/lib/libgmodule-2.0.so.0 (0xb6e2e000)
	libpcre.so.3 => /lib/libpcre.so.3 (0xb6dfb000)

(follow-up: ↓ 29 ) 04/12/12 04:49:10 changed by j.silvestre

$ ldd $(ldd $(which jackd | sed -e s/bin.jackd//)lib/jack/jack_firewire.so | grep libffado.so | cut -f3 -d' ')
	linux-gate.so.1 =>  (0xb7722000)
	libpthread.so.0 => /lib/i686/cmov/libpthread.so.0 (0xb7408000)
	libiec61883.so.0 => /usr/lib/libiec61883.so.0 (0xb73fc000)
	libraw1394.so.11 => /usr/lib/libraw1394.so.11 (0xb73ee000)
	libconfig++.so.8 => /usr/lib/libconfig++.so.8 (0xb73d9000)
	libxml++-2.6.so.2 => /usr/lib/libxml++-2.6.so.2 (0xb73b4000)
	libxml2.so.2 => /usr/lib/libxml2.so.2 (0xb728a000)
	libglibmm-2.4.so.1 => /usr/lib/libglibmm-2.4.so.1 (0xb7232000)
	libgobject-2.0.so.0 => /usr/lib/libgobject-2.0.so.0 (0xb71f3000)
	libsigc-2.0.so.0 => /usr/lib/libsigc-2.0.so.0 (0xb71ed000)
	libgthread-2.0.so.0 => /usr/lib/libgthread-2.0.so.0 (0xb71e8000)
	librt.so.1 => /lib/i686/cmov/librt.so.1 (0xb71df000)
	libglib-2.0.so.0 => /lib/libglib-2.0.so.0 (0xb7116000)
	libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0xb7020000)
	libm.so.6 => /lib/i686/cmov/libm.so.6 (0xb6ffa000)
	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb6fdc000)
	libc.so.6 => /lib/i686/cmov/libc.so.6 (0xb6e96000)
	/lib/ld-linux.so.2 (0xb7723000)
	libdl.so.2 => /lib/i686/cmov/libdl.so.2 (0xb6e92000)
	libz.so.1 => /usr/lib/libz.so.1 (0xb6e7d000)
	libgmodule-2.0.so.0 => /usr/lib/libgmodule-2.0.so.0 (0xb6e79000)
	libpcre.so.3 => /lib/libpcre.so.3 (0xb6e46000)

(in reply to: ↑ 28 ) 04/12/12 05:55:02 changed by stefanr

Replying to j.silvestre:

libraw1394.so.11 => /usr/lib/libraw1394.so.11 (0xb73ee000)

And the date of this file confirms that this is the one which you installed yourself, right?

Replying to jwoithe:

If static_info.txt is accurately indicating the version of libraw1394 that ffado was compiled against, is this an issue? I don't think ffado adapts to the libraw1394 version it's compiled or linked against. Therefore the fact ffado may have been compiled against libraw1394 2.0.5 shouldn't be an issue, should it? What matters is what it's run against, and the system clearly has 2.0.8 installed now.

Correct; if a client was built against libraw1394 2.0.x and libraw1394 is dynamically linked, then it is sufficient to replace the .so.* from 2.0.x with 2.0.y in order to make full use of all 2.0.x->2.0.y fixes.

Of course, that still leaves open the possibility that libraw1394 was compiled against headers from a pre-2.6.36 kernel. I'm not sure there's a succinct way of testing for that, is there?

If one desperately wanted to get to the bottom of it, I guess "strace" could be used to watch whether FW_CDEV_EVENT_REQUEST = 0x2 or FW_CDEV_EVENT_REQUEST2 = 0x6 are read from /dev/fw*, third u_int32_t in the binary read() stream. Or whether the very first ioctl() contains fw_cdev_get_info.version >= 4 as input from client to kernel. --- Perhaps we should simply ship a copy of the kernel header file in the libraw1394 sources to eliminate this worry.

j.silvestre, something like "grep FW_CDEV_EVENT_REQUEST2 /usr/include/linux/firewire-cdev.h" would have found occurrences of this _REQUEST2 at the time when you built libraw1394, right?

04/12/12 06:24:16 changed by j.silvestre

Yes the date of the file libraw1394.so.11 confirms that it's the one I've installed.

"grep FW_CDEV_EVENT_REQUEST2 /usr/include/linux/firewire-cdev.h" says nothing at all.

Maybe the problem comes from this?

7.1.1 Kernel headers

Most "normal" programs don't need kernel headers and in fact may break if you use them directly; instead they should be compiled against the headers with which glibc was built, which are the versions in /usr/include/linux and /usr/include/asm of the Debian system.

So do not put symlinks to the directories in /usr/src/linux from /usr/include/linux and /usr/include/asm, as suggested by some outdated documents.

If you need particular kernel headers for some kernel-specific application programs, alter the makefile(s) so that their include path points to dir-of-particular-kernel-headers/include/linux and dir-of-particular-kernel-headers/include/asm. 

04/12/12 06:46:33 changed by j.silvestre

  • attachment ffado-jack-both.log added.

(follow-up: ↓ 33 ) 04/12/12 06:56:02 changed by j.silvestre

So I copied the firewire include files from linux-headers 3.0.27-rt46 into /usr/include/linux and then rebuilt libraw1394. Now FFADO uses both devices together.
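
In shell terms that was roughly the following (paths are illustrative; the two files in question are presumably linux/firewire-cdev.h and linux/firewire-constants.h):

    cp /usr/src/linux-headers-3.0.27-rt46/include/linux/firewire-cdev.h /usr/include/linux/
    cp /usr/src/linux-headers-3.0.27-rt46/include/linux/firewire-constants.h /usr/include/linux/
    cd libraw1394 && ./configure && make && sudo make install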

(follow-up: ↓ 34 ) 04/12/12 07:01:08 changed by wagi

So adding something like

grep FW_CDEV_EVENT_REQUEST2 /usr/include/linux/firewire-cdev.h

to ffado-diag would be a good idea, no? :)

(in reply to: ↑ 31 ) 04/12/12 07:11:39 changed by jwoithe

Replying to j.silvestre:

So I copied the firewire include files from linux-headers 3.0.27-rt46 into /usr/include/linux and then rebuilt libraw1394. Now FFADO uses both devices together.

Great stuff. So the cause of the inability to aggregate these two devices was the fact that the new libraw1394 was compiled against old kernel headers. It's good to get to the bottom of that. Thanks Stefan for leading us towards the solution.

So, having resolved that little side issue, the remaining question is whether the subject of this ticket is still an issue with the newer libraw1394/kernel/ffado. Obviously given the intermittent nature of the fault a reasonable amount of testing may be needed. However, any feedback you can give as to the probable state of those xruns with the newer software would be appreciated.

(in reply to: ↑ 32 ; follow-up: ↓ 35 ) 04/12/12 07:16:37 changed by jwoithe

Replying to wagi:

So adding something like
grep FW_CDEV_EVENT_REQUEST2 /usr/include/linux/firewire-cdev.h
to ffado-diag would be a good idea, no? :)

Perhaps. However, the critical point isn't what that header says at the time that ffado is compiled, but rather what header was present when libraw1394 was compiled. In other words, what we really want to know is whether a suitably new kernel header was present when libraw1394 was compiled. Since libraw1394 may come from a binary distribution there's no guarantee that the kernel headers on a user's computer match those used on the machine to compile the binary. So while adding this to ffado-diag may be useful to a point, one would have to be very careful when basing any conclusions on the subsequent reports. For these reasons I'm more of the view that it's better left out, but I'm open to contrary arguments.

(in reply to: ↑ 34 ) 04/12/12 08:37:49 changed by stefanr

Replying to j.silvestre:

Admittedly I know close to nothing about how distributors deal with kernel header files. I only know how it is done on Gentoo: there, a package named linux-headers provides header files which are already appropriately converted from the original kernel source headers.

Independently of Linux distributions, it is generally true that the header files from a kernel source tarball or kernel source package should not be used directly to build userland programs against, nor copied to /usr/include. In the case of the two "linux/firewire-c*.h" files, those can be copied without worry because they don't require conversion. Also, libraw1394's ./configure contains an option with which it is possible to specify a custom directory with "linux/firewire-c*.h" in it. --- Anyway; this incident convinced me to copy these two header files directly into the libraw1394 sources in order to avoid all this from the next libraw1394 release on. Sorry for all the confusion.

Replying to jwoithe:

I agree that this kind of test in ffado-diag would not be reliable at all. Let me think about whether there is a way to do a proper runtime test for libraw1394 features.

Back to the subject of this ticket:

Two devices aggregated will certainly stress the FFADO stream engine, the kernel, and the controller more than a single audio device. j.silvestre, the new kernel which you are now using hopefully behaves better than the one from when you opened this ticket. When you get around to testing this some more, remember that very large buffers may in fact degrade FFADO's ability to keep the streams stable. This also somewhat depends on the particular IEEE 1394 controller, as far as I have heard. IOW, medium-sized buffers are typically the easiest to handle for FFADO on many systems.

04/12/12 08:53:10 changed by j.silvestre

So what buffer size do you recommend for a meaningful test?

Jackd is now running with both devices and already generated an xrun after about 30 minutes with 3 buffers of 1024 samples. Unfortunately I lost the log file by mistake.

04/12/12 17:18:03 changed by jwoithe

In relation to buffer sizes, there's a trade-off: the larger the buffer size the lower the rate of interrupts and therefore potentially there's CPU load savings to be had. On the other hand, if the buffers are too large it can be difficult to keep the device synced because of the way the various protocols function (with some devices being more prone to problems than others). I've seen numbers of 256, 512 and 1024 being mentioned in the context of "medium-sized".

You reported that with a buffer size of 1024 you managed to get an xrun. In a way that's good because it obviously didn't take too long to trigger; I suspect it should be fairly easy to replicate this test to recreate the accidentally lost log file. This is probably valuable since it may give us an indication as to what went wrong (that is, whether there is a problem lurking within FFADO, or if it's due to some as-yet-unidentified local issue).

As far as testing for stability, it would then be interesting to try with buffer sizes of 512 and/or 256 and see if that improves things in any way.
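
For reference, that just means varying the -p value in the usual invocation, along these lines (the other options are only an example - keep whatever you've been running):

    jackd --verbose -P75 -dfirewire -v6 -n3 -r44100 -p256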

04/13/12 03:30:00 changed by j.silvestre

  • attachment ffado-jack-both-xrun-1024.log.xz added.

04/13/12 03:43:56 changed by j.silvestre

Finally got and logged an xrun.

I don't know if it's important or not, but the two devices are bus-synchronised. If needed I can sync them via SPDIF.

04/13/12 06:29:52 changed by j.silvestre

  • attachment ffado-jack-both-xrun-1024-1.log.xz added.

04/13/12 06:30:59 changed by j.silvestre

Another one.

04/14/12 16:51:58 changed by jwoithe

In general when running multiple digital audio devices - no matter what the setting - it is important to ensure they are all synchronised to the same digital clock. I'm not sure whether FFADO's streaming system is designed to cope with devices which are running from different digital clocks, but I would suggest it would be a good idea to run the devices synchronous in any case.

In light of this I think it would be good to arrange for the devices to be synchronised to the same clock and then test that with FFADO. Syncing them via SPDIF as you proposed would be fine.

04/21/12 09:34:19 changed by j.silvestre

Syncing the devices via SPDIF doesn't change the xrun occurrence. Neither do smaller buffers: with 256 samples instead of 1024 the xruns are still here. I have the log files but they are quite big, up to 100 MB after compression. Now jackd is running with the Saffire alone. Waiting for an xrun...

04/21/12 21:32:03 changed by jwoithe

Thanks for testing with the devices synched via SPDIF. It was a bit of a long shot that this would fix the problem, but it's still good to have confirmed that the xruns have nothing to do with that.

Similarly for the use of smaller buffer sizes - it's good to have ruled that out too. For the moment just hold onto the log files locally - don't try to put them up somewhere, given their size. They probably don't give any more information than the logs already posted.

Testing with each device separately for an extended period of time is a good idea. If the frequency of xruns is lower than with both devices active we can conclude that there's something about the aggregated device setup that's causing this. Whether it's a fundamental bug, or something that occurs due to the increased CPU workload, remains to be seen. Let's wait and see the outcome of this latest round of tests.

05/01/12 11:49:21 changed by j.silvestre

Ten days of continuous streaming with the Saffire at 3x256 samples and still no xrun. Should I do the same test with the FA101?

05/01/12 17:48:30 changed by jwoithe

Wow. That's some test. I think we can reasonably conclude that the Saffire is stable on your system in isolation. :-)

I think it would be worth doing the same thing with the FA101. You probably don't have to run all the way to 10 days if you don't want to - I'll leave the final time length up to you. Assuming that this also proves stable it will pretty much confirm that the problem is triggered through the use of both devices simultaneously. We can then concentrate on that.

05/06/12 02:58:58 changed by j.silvestre

  • attachment ffado-jack-fa101-256-strip.log added.

05/06/12 03:16:32 changed by j.silvestre

Finally got an unhandled xrun with a single device. It took quite a long time, about 4 days. As the log file is a bit heavy I cut it to keep only the beginning and the end. This file is ffado-jack-fa101-256-strip.log. If needed, the full log is available at http://jdlv.homelinux.org/ffado-jack-fa101-256.log.xz

(follow-up: ↓ 47 ) 05/06/12 05:10:39 changed by jwoithe

Regarding comment 45, it's very interesting that it took 4 days before an unhandled xrun was generated, especially since the other device kept going for 10 days without a hint of a problem. It does suggest that perhaps there might be an issue in the FA101 driver which is upsetting the Saffire/FA101 aggregation. Perhaps the aggregation makes the underlying problem - whatever that may be - somewhat more likely. Of course it's also possible that this was a transient error condition which has never happened before and is totally unrelated to the problems you see when the two devices are aggregated.

The pertinent lines from the log appear to be the following.

Debug (IsoHandlerManager.cpp)[ 417] waitForActivity: (0x94bf708) sem_timedwait() timed out (result=-1)
Warning (IsoHandlerManager.cpp)[ 292] Execute: Timeout while waiting for activity
Fatal (IsoHandlerManager.cpp)[ 348] Execute: (0x94bf838, Receive) Handler died: now: CE2FB0FC,
  last: C7A37897, diff: 80029797 (max: 49152000)
Warning (StreamProcessor.cpp)[ 173] handlerDied: Handler died for 0x94ef308
Debug (PosixThread.cpp)[  94] ThreadHandler: (ISORCV) ThreadHandler: exit 0x94bf958
Debug (StreamProcessorManager.cpp)[1340] waitForPeriod: exit due to xrun...
Debug (StreamProcessorManager.cpp)[1397] waitForPeriod: Xrun on RECV SP 0x94ef308 due to ISO side xrun

As far as I can remember, this is basically saying that the incoming data (from the audio device) stopped for no reason. That is, from the point of view of FFADO, the device stopped sending audio data. This is what is meant by "ISO side xrun": the receive threads had a buffer under-run caused by a lack of data coming in from the firewire iso stream.

The xrun noted in ffado-jack-both-xrun-1024-1.log.xz seems to be slightly different from what we see above: waitForActivity() does not report a problem, nor do we see any notification of "Handler died". Instead, among other things we see things like

putPacket: (0x8bf6070) dropped 17 packets on cycle 3090, 'dropped'=0, cycle=3090, m_last_cycle=3072

I think this is a notification by the FFADO receiver that we failed to see any data from the device for (in this case) 18 iso cycles. This is not totally unrelated to what we infer from the latest log file; the difference is that in ffado-jack-both-xrun-1024-1.log.xz the device starts sending data again before FFADO gives up on it, while in the latest log FFADO decides that the data flow has ceased and reacts accordingly. The common theme here is that data from the device is being interrupted for an extended period of time. In both cases there's no obvious sign as to why this might have occurred.

Interestingly enough, I note that it is the receive processor of the FA101 which gives trouble in ffado-jack-both-xrun-1024-1.log.xz and in the most recent test. So maybe there's something about the FA101 which is causing it to cease sending data for a certain length of time, and that the action of running two devices on the bus somehow makes the problem more likely to occur.

In any case, we're still left wondering why we are apparently seeing an interruption in the data stream from the device. It could be a strange FFADO bug, a physical problem with the device (either a bug or a fault), an issue with the PCIxx12 firewire card, or something else I'm overlooking.

(in reply to: ↑ 46 ; follow-up: ↓ 48 ) 05/06/12 11:14:45 changed by wagi

In any case, we're still left wondering why we are apparently seeing an interruption in the data stream from the device. It could be a strange FFADO bug, a physical problem with the device (either a bug or a fault), an issue with the PCIxx12 firewire card, or something else I'm overlooking.

I would not rule out that the FA-101 is just buggy. The FA-101 was the first product with the DM1000 chip and it has an early version of the BeBoB stack. Both have more than enough known bugs which are normally hidden. But running the device for 4 days might just trigger one of those.

(in reply to: ↑ 47 ; follow-up: ↓ 49 ) 05/06/12 16:23:03 changed by jwoithe

Replying to wagi:

The FA-101 was the first product with the DM1000 chip and it has an early version of the BeBoB stack. Both have more than enough known bugs which are normally hidden. But running the device for 4 days might just trigger one of those.

Would it therefore be a stretch to consider that running the FA101 aggregated with another device might then make the problem easier to trigger? Otherwise, if we are to put this down to a buggy device we still have to explain why running the FA101 with another device seems to take FFADO down much sooner.

(in reply to: ↑ 48 ) 05/15/12 23:06:04 changed by wagi

Replying to jwoithe:

Replying to wagi:

The FA-101 was the first product with the DM1000 chip and it has an early version of the BeBoB stack. Both have more than enough known bugs which are normally hidden. But running the device for 4 days might just trigger one of those.

Would it therefore be a stretch to consider that running the FA101 aggregated with another device might then make the problem easier to trigger?

Yes, that is what I assume. The Saffire did run without trouble with the same setup for a long time and the FA-101 did not. Maybe it is just coincidence. If you want to rule this out a bigger sample pool is needed...

Otherwise, if we are to put this down to a buggy device we still have to explain why running the FA101 with another device seems to take FFADO down much sooner.

Either we answer this question or we ask the reporter to do more tests, but I don't think we should.

I still have an FA-101 here (not used for a long time, I don't know if it still works) with a serial console connected to it. There are some limited ways to debug the device, but it was really a long time ago and I did not really work with that unit. So I could give it a spin. But I wouldn't expect any answers.

06/01/12 04:31:23 changed by jwoithe

  • status changed from reopened to closed.
  • resolution set to wontfix.

Based on the evidence at hand, it seems likely that the root cause of the problems being discussed is traceable back to a bug in the FA-101 which happens to become easier to hit when aggregating multiple devices on the same bus. Obviously we can't fix the buggy firmware, and for similar reasons it's not feasible to come up with a workaround without expending a lot of effort. As a result, I'll close this ticket as "wontfix".

If anyone ends up doing the sorts of tests described in comment 49 or otherwise determines the nature of the root cause in more detail then please re-open this ticket to allow discussion to continue.