Ticket #242 (closed bug: fixed)

Opened 5 years ago

Last modified 2 years ago

ntp interferes with ffado

Reported by: ppalmers Assigned to: ppalmers
Priority: major Milestone: FFADO 2.x
Component: Version: FFADO 2.0-rc2 (1.999.42)
Keywords: Cc:
The device the bug applies to: all

Description

As I was taught recently, sometimes it is necessary to configure _small_ buffers to avoid xruns.

Furthermore I learned just yesterday that jackd/ffado crashed (i.e. exited unexpectedly) on my current working PC due to interference by ntpd (the network time daemon). Lines in /var/log/messages like

Nov 25 23:19:17 stein ntpd[7685]: time reset +2.020631 s Nov 25 23:35:18 stein ntpd[7685]: time reset +1.945118 s Nov 25 23:50:44 stein ntpd[7685]: time reset +2.053456 s

coincided with when ffado exited. This was always accompanied by

firewire ERR: wait status < 0! (= -1) DRIVER NT: could not run driver cycle

in jackd's log. I have since disabled the ntpd system service and these crashes are gone.

Furthermore, fast successions of xruns seemed to possibly cause jackd/ffado crashes too So if you are able to get rid of the xruns e.g. by means of a smaller buffer, you might cure your crashes as a very welcome side effect.

[Disclaimer: I'm a FFADO newbie and don't know what I'm talking about.] -- Stefan Richter

Attachments

jackd_r2171_new-libraw1394.txt (112.0 kB) - added by stefanr on 06/23/12 09:37:24.
With ffado r2171 and new libraw1394, jackd fails to start.
ffado-test_r2171_new-libraw1394.txt (80.2 kB) - added by stefanr on 06/23/12 09:38:45.
Inability of r2171 to bring up the streaming threads also shows through in "ffado-test Discover".
jackd_r2174_new-libraw1394.txt (38.6 kB) - added by stefanr on 06/25/12 15:18:00.
r2174 is alas still unabe
ffado-test_Discover_-v6_DICE_r2166_new-libraw1394.txt (49.1 kB) - added by stefanr on 06/26/12 06:02:12.
for completeness: "ffado-test Discover -v", r2166, Saffire PRO24
ffado-test_Discover_-v6_DICE_r2174_new-libraw1394.txt (49.2 kB) - added by stefanr on 06/26/12 06:04:00.
for completeness: "ffado-test Discover -v6", r2174, Saffire PRO24
jackd_r2174_new-libraw1394_Mac-mini_bebob-driver.txt (42.1 kB) - added by stefanr on 06/26/12 11:21:05.
r2174 does not work on my 2nd PC with good RTC either
clock_monotonic.patch (4.4 kB) - added by stefanr on 06/27/12 14:09:30.
suggested patch for simplification and perhaps more robustness even: replace CLOCK_MONOTONIC_RAW by CLOCK_MONOTONIC
kernel-version-recommendation.patch (1.1 kB) - added by stefanr on 06/27/12 15:14:39.

Change History

05/23/10 06:51:52 changed by stefanr

The description lumps several unrelated issues together. Clarification:

The subject issue here --- reset of the local clock by ntpd causes ffado/ jackd to die --- is 100% reproducible on my system and independent of which controller and which kernel drivers are in use.

As a means for a possible fix, a new ioctl has been added to firewire-core in kernel 2.6.34 which lets userspace choose between CLOCK_REALTIME, CLOCK_MONOTONIC, and CLOCK_MONOTONIC_RAW as the local clock, similar to clock_gettime() of libc. I am not exactly sure whether POSIX' CLOCK_MONOTONIC or Linux' CLOCK_MONOTONIC_RAW is the right one to fix this bug. The firewire-core and raw1394 ioctls which back libraw1394's raw1394_read_cycle_timer() use CLOCK_REALTIME and are thus exposed to clock jumps due to ntpd and the likes.

There hasn't been a move to extend libraw1394 to the new ioctl yet. Ways to do that could be

  • a new libraw1394 function as alternative to raw1394_read_cycle_timer(),
  • a libraw1394 environment variable that alters the behavior of raw1394_read_cycle_timer().

Either way would require changes to both libffado and libraw1394. The latter way may not be such a good idea, it seems more convoluted.

If somebody implements a ffado backend to the new kernel driver stack that bypasses libraw1394 entirely, then the new ioctl should be used instead of FW_CDEV_IOC_GET_CYCLE_TIMER. (IMO, kernel 2.6.34 could be made a hard requirement of such a backend, considering that there are other important advantages in that one over .32 or .33.)

05/24/10 08:38:52 changed by stefanr

Affected:

jackd + ffado + linux1394

jackd + ffado + juju

Not affected:

jackd + alsa

(follow-up: ↓ 4 ) 03/27/12 07:30:20 changed by jwoithe

Wouldn't this affect jackd's timing as much as FFADOs? In other words, even if FFADO were to utilise an alternative clock source, wouldn't jackd remain exposed? Or does jackd take care to choose the best clock available?

Is this a large enough issue to maintain it as a 2.1 milestone? Given that there's no move on the horizon to rewrite ffado to use the new stack directly, the fix seems fairly involved as it would span FFADO and libraw1394 at the very least. Is there anyone with the time and required knowledge who could oversee such a fix within the next month? If not, I propose postponing this to at least FFADO 2.2.

(in reply to: ↑ 3 ; follow-up: ↓ 5 ) 03/27/12 14:22:22 changed by stefanr

Replying to jwoithe:

Wouldn't this affect jackd's timing as much as FFADOs? In other words, even if FFADO were to utilise an alternative clock source, wouldn't jackd remain exposed? Or does jackd take care to choose the best clock available?

I can't answer that. All I know is that jackd remains undisturbed by local clock resets if it uses an ALSA device instead of FFADO.

Is this a large enough issue to maintain it as a 2.1 milestone? Given that there's no move on the horizon to rewrite ffado to use the new stack directly, the fix seems fairly involved as it would span FFADO and libraw1394 at the very least.

Adding a new function to libraw1394 and getting it released would be one of the easier aspects of this problem. Could take some time to trickle out into distributions of course.

Is there anyone with the time and required knowledge who could oversee such a fix within the next month? If not, I propose postponing this to at least FFADO 2.2.

As reporter, I definitely don't mind if you postpone this; a 2.1 release should not be blocked by this.

Of course it is hard to say how mayn people are affected by this. At the time when I discovered the problem, I had an ntpd installation which logged every clock reset to the syslog. After an ntpd update quite a while ago, it doesn't do so anymore (on Gentoo Linux, and certainly on other distros), so now I would be totally clueless why jackd + ffado always dies with unhandled xrun after 0...15 minutes runtime. So WRT ability to detect the presence of this issue on any other affected user's setup, this is quite a nasty problem.

On the flipside, as soon as Clemens and you are done implementing kernelspace streaming components, this issue is resolved as well. :-)

(in reply to: ↑ 4 ; follow-up: ↓ 6 ) 03/27/12 16:28:54 changed by jwoithe

  • milestone changed from FFADO 2.1 to FFADO 2.x.

Replying to stefanr:

Replying to jwoithe:

Is this a large enough issue to maintain it as a 2.1 milestone? Given that there's no move on the horizon to rewrite ffado to use the new stack directly, the fix seems fairly involved as it would span FFADO and libraw1394 at the very least.

Adding a new function to libraw1394 and getting it released would be one of the easier aspects of this problem. Could take some time to trickle out into distributions of course.

As I understand it, there are two aspects to this approach: * get the new function in libraw1394 * make ffado use the new function if it's available (we'd need scons to check for this so as to not require that everyone upgrade libraw1394 if they have no need to do so)

To me there seems little reason not to request that the new function go into libraw1394 as soon as possible. Once it's there we can have FFADO call it. This would, from my understanding of the problem, resolve the issue as much as we can.

What is the policy of adding new functions to libraw1394? Is any attempt made to maintain backwards compatibility through the use of weak linkage?

I guess another option which avoids the need for people to upgrade libraw1394 immediately would be to add the function to FFADO's code base; if a sufficiently new version of libraw1394 were available then the libraw1394 function could be called; otherwise we could just go ahead and do the ioctl() ourselves. In time, once the new libraw1394 has propagated into distributions we could remove the FFADO version.

Does this sound like a reasonable plan?

As reporter, I definitely don't mind if you postpone this; a 2.1 release should not be blocked by this.

Ok, thanks. However, if it does boil down to the changes noted above, and if the above process is acceptable we could probably come up with something fairly quickly. I've changed the milestone to 2.x, but if it happens to make 2.1 then great.

Of course it is hard to say how mayn people are affected by this.

True. I've never run ntpd on my audio machines so it's not something which has ever affected me. It probably comes down to how many distributions enable ntp out the box (or via a simple click-through interface during installation).

So WRT ability to detect the presence of this issue on any other affected user's setup, this is quite a nasty problem.

Agreed. The time jumps instigated by ntpd are of variable length, so there's no point in FFADO attempting to deduce whether the time jump is due to NTP or something else. FFADO could print a warning if such a jump is seen saying the NTP sometimes causes this, but since most people run jack via qjackctl they probably won't see it. So coming up with something which might fix the problem in some cases is probably better than doing nothing.

We could possibly add a check for a running ntpd to ffado-diag. I might look into doing this tonight.

On the flipside, as soon as Clemens and you are done implementing kernelspace streaming components, this issue is resolved as well. :-)

True, but that is still a way off (I need to see an example playback implementation before I know enough to port other drivers to the in-kernel option). Therefore, if there is a way of working around the NTP issue in a relatively clean way we should probably aim to do so.

(in reply to: ↑ 5 ; follow-up: ↓ 7 ) 03/28/12 14:32:01 changed by stefanr

Replying to jwoithe:

What is the policy of adding new functions to libraw1394? Is any attempt made to maintain backwards compatibility through the use of weak linkage?

Although my being libraw1394 maintainer, I am new to this and not sure. But my understanding is that there would be a new function added to libraw1394, libraw1394's minor version number be incremented, and that would be it.

I guess another option which avoids the need for people to upgrade libraw1394 immediately would be to add the function to FFADO's code base; if a sufficiently new version of libraw1394 were available then the libraw1394 function could be called; otherwise we could just go ahead and do the ioctl() ourselves. In time, once the new libraw1394 has propagated into distributions we could remove the FFADO version.

The libraw1394 API does not expose a file descriptor that could be used for such a purpose.

Of course it is hard to say how mayn people are affected by this.

True. I've never run ntpd on my audio machines so it's not something which has ever affected me. It probably comes down to how many distributions enable ntp out the box (or via a simple click-through interface during installation).

It is not the mere act of running ntpd that throws FFADO off. It becomes an issue only if the RTC is exceedingly bad (like the one on my main PC which is >2000 ppm too slow, causing ntpd to reset the clock by +2 seconds every 15 minutes, instead of its usual harmless gradual adjustments). However, of course also any other clock adjustment, e.g. manually by the user, kills FFADO.

(in reply to: ↑ 6 ; follow-ups: ↓ 8 ↓ 9 ) 03/28/12 15:59:29 changed by jwoithe

Replying to stefanr:

Replying to jwoithe:

What is the policy of adding new functions to libraw1394? Is any attempt made to maintain backwards compatibility through the use of weak linkage?

... But my understanding is that there would be a new function added to libraw1394, libraw1394's minor version number be incremented, and that would be it.

Hmm, ok, so if this were to be utilised without breaking things for people who didn't need the added burden of upgrading libraw1394 just to try FFADO trunk, FFADO would have to check libraw1394 for the existance of the new function at compile time and conditionally use it if it was found. One can do this with autoconf so I guess scons can too (I've just got to look up the documentation).

This wouldn't make the feature runtime detectable (for that you need weak linkage) but in the scheme of things I doubt that is going to be a practical issue. In cases like this where few people really require the new libraw1394 feature, I'm more concerned about making it possible for users to keep trying FFADO trunk without having to mess with other libraries on the system.

I guess another option which avoids the need for people to upgrade libraw1394 immediately would be to add the function to FFADO's code base ...

The libraw1394 API does not expose a file descriptor that could be used for such a purpose.

Ah, ok. Well, that kills off that alternative then.

It is not the mere act of running ntpd that throws FFADO off. It becomes an issue only if the RTC is exceedingly bad (like the one on my main PC which is >2000 ppm too slow, causing ntpd to reset the clock by +2 seconds every 15 minutes, instead of its usual harmless gradual adjustments). However, of course also any other clock adjustment, e.g. manually by the user, kills FFADO.

Thanks for the qualification.

On balance I think the best way to resolve this issue is the following:

  • get the new function into libraw1394 as soon as is practical.
  • code FFADO to use the new function in the event that it is present at compile time.

I should be able to deal with the FFADO side of things. Can you take care of libraw1394?

(in reply to: ↑ 7 ; follow-up: ↓ 11 ) 03/30/12 00:13:40 changed by stefanr

Replying to jwoithe:

Hmm, ok, so if this were to be utilised without breaking things for people who didn't need the added burden of upgrading libraw1394 just to try FFADO trunk, FFADO would have to check libraw1394 for the existance of the new function at compile time and conditionally use it if it was found. One can do this with autoconf so I guess scons can too (I've just got to look up the documentation).

This would be desirable, even though libraw1394 upgrades are not overly hard.

(...)

Can you take care of libraw1394?

OK, I will do as soon as spare time permits. Keep an eye on linux1394-devel.

(in reply to: ↑ 7 ) 03/31/12 12:03:35 changed by stefanr

Replying to jwoithe:

FFADO would have to check libraw1394 for the existance of the new function at compile time and conditionally use it if it was found.

http://scons.org/doc/HTML/scons-user/x4167.html looks like the configuration test is as easy as

    if conf.CheckFunc('raw1394_read_cycle_timer2')
        conf.env.Append(CPPDEFINES = '-DHAVE_RAW1394_READ_CYCLE_TIMER2')

Or the libraw1394 version number could be tested.

04/01/12 05:51:45 changed by jwoithe

In relation to comment 9: yes, it looks to be that easy. Either approach mentioned would work - it would depend on one's mood at the time as to which one was chosen. The generally accepted approach in situations like this is (I think) to test for a feature rather than a version number so I'd probably go for that (of course version number checks are useful when there's been a critical bugfix in existing functions).

(in reply to: ↑ 8 ) 05/11/12 05:00:43 changed by jwoithe

Replying to stefanr:

Can you take care of libraw1394?

OK, I will do as soon as spare time permits. Keep an eye on linux1394-devel.

I gather time has not yet permitted (either that or I've missed a bunch of linux1394-devel digests). Realistically, when do you think you'll have a chance to look into this new libraw1394 function? Is it something we can expect in the next month or so (in which case we can have ffado use it for the 2.1 release), or will it slip into Q3 2012?

05/11/12 11:13:10 changed by stefanr

I have not done this yet; I will Cc you when I get to it. I am thinking of a new function which is extended relative to raw1394_read_cycle_timer() similar to how FW_CDEV_IOC_GET_CYCLE_TIMER2 is extended relative to FW_CDEV_IOC_GET_CYCLE_TIMER: http://lxr.linux.no/#linux+v3.3/include/linux/firewire-cdev.h#L853

As the first and maybe even only user of this extended API, do you prefer to keep the local_time out-parameter as

u_int64_t *local_time /* local system time in microseconds since Epoch */

or should we change it to

struct timespec *tp /* local system time in seconds and nanoseconds (since the Epoch) */

like the respective out-parameter of clock_gettime()?

05/12/12 04:58:57 changed by jwoithe

To be honest I'm not all that fussed. Obviously if we maintain the system of the call this is modeled after it'll make it truly a drop-in replacement within the FFADO code. However, in the interests of future-proofing the API (for cases where extended resolution may be needed) perhaps we ought to go with the microseconds approach. At the end of the day, running with the timespec parameter isn't going to create any issues from FFADO's end that I can see.

06/19/12 05:17:53 changed by jwoithe

Initial patches which add raw1394_read_cycle_timer_and_clock() to libraw1394 (to be included in libraw1394 2.1.0) have been released. Svn r2167 includes a first cut at code to make use of this function if it is available at run time. The intent here is that if one runs with libraw1394 2.1.0 with r2167 or later, FFADO should be relatively immune to disruptions caused by NTP time adjustments due to the use of the CLOCK_MONOTONIC_RAW clock source. In other words, if the theory developed in this ticket is correct, r2167 with a suitably new libraw1394 should resolve the NTP interference problem.

r2167 has been tested and verified not to cause trouble when using libraw1394 earlier than 2.1.0. It is entirely possible that the need for some tweaks will become apparent once testing starts with the revised libraw1394.

06/22/12 12:16:00 changed by stefanr

Here is what I tested now:

  • installed the modified libraw1394
  • built and installed ffado r2166
  • stopped ntpd
  • started jackd with Saffire PRO24 on a TI XIO2213A/B card
  • ran it for about 100 minutes without a single xrun
  • started ntpd, got jackd killed after a few minutes

(So all this confirms that I have a well-working system with r2166 if ntpd is off, but still get instant crash of jackd if the realtime clock is reset. Next steps:)

  • checked out r2167, did scons -c and deleted all the ffado stuff in /usr/local (the scons install routines have frequently been extremely unreliable and unpredictable for me), then built and installed ffado r2167
  • started jackd with Saffire PRO24 (needed to switch it off and on again before ffado succeeded to recognize the Saffire as such)
  • got frequent xruns, connected with a brief audible gap (10 xruns in 25 minutes) and unhandled xruns with crash (2 of those in 25 minutes)
  • tried "date -s ..." and thus provoked xruns or even unhandled xruns = crash.

The startup message says "ffado_streaming_init: libffado 2.999.0-2167 built Jun 22 2012 19:39:10", so there is at least a chance that I am indeed running r2167.

Starting jackd with trailing -v6 shows boatloads of lines like

2279659063900: Warning (cycletimer.h)[  97] wrapAtMaxTicks: insufficient wrapping: 32885376728681314
2279659063905: Warning (IsoHandlerManager.cpp)[1651] getPacket: reconstructed CTR counter discrepancy
2279659063908: Warning (IsoHandlerManager.cpp)[1657] getPacket:  ingredients: 51A, 9851A000, 9651A106, 96790B62, 9678C106, 76, 75, 75, 1847212294
2279659063913: Warning (IsoHandlerManager.cpp)[1658] getPacket:  diffcy = -626

continuously scrolling by pretty fast. This is regardless whether ntpd is running or not.

  • switched off ntpd
  • reverted to ffado r2166
  • started jackd with -v6, got none of those ctr discrepancies
  • kept jackd running without xruns for another hour

So, it seems the new CLOCK_MONOTONIC_RAW usage conflicts with other parts in ffado itself or in the entire application stack that still use CLOCK_MONOTONIC or CLOCK_REALTIME. AFAIU all three clocks run at the same speed if left alone (i.e. if no ntpd or date or friends are being used); they merely have different start offsets.

06/22/12 14:22:37 changed by cladisch

Oh well:

src/bebob/bebob_dl_mgr.cpp:    clock_gettime(CLOCK_REALTIME, &timeout);
src/debugmodule/debugmodule.cpp:    clock_gettime(CLOCK_MONOTONIC, &ts);
src/libieee1394/IsoHandlerManager.cpp:    if (clock_gettime(CLOCK_REALTIME, &ts) == -1) {
src/libieee1394/ieee1394service.cpp:            err = raw1394_read_cycle_timer_and_clock(m_util_handle, &cycle_timer, &local_time, CLOCK_MONOTONIC_RAW);
src/libieee1394/ieee1394service.cpp:        err = raw1394_read_cycle_timer_and_clock(m_util_handle, cycle_timer, local_time, CLOCK_MONOTONIC_RAW);
src/libstreaming/StreamProcessorManager.cpp:        if (clock_gettime(CLOCK_REALTIME, &ts) == -1) {
src/libutil/PosixMessageQueue.cpp:        clock_gettime(CLOCK_REALTIME, &timeout);
src/libutil/PosixMessageQueue.cpp:    clock_gettime(CLOCK_REALTIME, &timeout);
src/libutil/PosixMessageQueue.cpp:    clock_gettime(CLOCK_REALTIME, &timeout);
src/libutil/SystemTimeSource.cpp:    clock_gettime(CLOCK_REALTIME, &ts);
src/libutil/SystemTimeSource.cpp:    clock_nanosleep(CLOCK_REALTIME, 0, &ts, NULL);
src/libutil/SystemTimeSource.cpp:    int err = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &ts, NULL);
tests/systemtests/realtimetools.cpp:    clock_gettime(CLOCK_REALTIME, &ts);
tests/systemtests/realtimetools.cpp:    clock_nanosleep(CLOCK_REALTIME, 0, &ts, NULL);
tests/systemtests/realtimetools.cpp:    clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &ts, NULL);
tests/systemtests/test-clock_nanosleep.cpp:    clock_gettime(CLOCK_REALTIME, &ts);
tests/systemtests/test-clock_nanosleep.cpp:    clock_nanosleep(CLOCK_REALTIME, 0, &ts, NULL);
tests/systemtests/test-clock_nanosleep.cpp:    int err = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &ts, NULL);

There is that libutil/SystemTimeSource; I wonder why nobody is using it ...

This also would be the place to check if CLOCK_MONOTONIC_RAW is available; not every kernel has it.

06/23/12 01:39:40 changed by jwoithe

Whoa, it seems that CLOCK_REALTIME is being used in almost every other location throughout FFADO. I guess that's what I get for changing things in bits of FFADO that up to now I haven't had a lot to do with.

Hmm. This calls for some sort of central define I think to determine which clock to use. Of course, CLOCK_MONOTONIC_RAW can only be activated if a sufficiently recent libraw1394 is in place. I'll see about getting a suitable patch up soon.

Stefan: thanks for the initial testing. I need to get the patched libraw1394 onto my development system to check this out further I think.

06/23/12 02:29:01 changed by stefanr

There is a cut-and-paste recipe for getting libraw1394 from git at https://ieee1394.wiki.kernel.org/index.php/Juju_Migration#libraw1394, but it should also be possible to apply the patch from linux1394-devel to a released libraw1394 source tarball.

06/23/12 07:10:49 changed by jwoithe

I have committed r2171. This should ensure that all timing-related parts of the FFADO streaming system reference the same clock source. Thanks Clemens for doing the obvious (a grep for CLOCK_ in the FFADO source) to highlight the fact that there were were more time calls than just the one related to the cycle timer.

I've tested r2171 against an "old" libraw1394 and it doesn't appear to have caused any regressions under those conditions. I haven't had time to get a "new" libraw1394 onto this system yet and I'm out of time right now, so it remains untested on systems where raw1394_read_cycle_timer_and_clock() is available.

Stefan wrote:

There is a cut-and-paste recipe for getting libraw1394 from git ...

Thanks for the reference.

... it should also be possible to apply the patch from linux1394-devel to a released libraw1394 source tarball

That's the approach I intend to take.

06/23/12 09:37:24 changed by stefanr

  • attachment jackd_r2171_new-libraw1394.txt added.

With ffado r2171 and new libraw1394, jackd fails to start.

06/23/12 09:38:45 changed by stefanr

  • attachment ffado-test_r2171_new-libraw1394.txt added.

Inability of r2171 to bring up the streaming threads also shows through in "ffado-test Discover".

06/25/12 07:24:03 changed by jwoithe

Thanks Stefan for the latest round of testing. I finally got a chance to put the latest libraw1394 (from git) on my development machine (it was easier doing that than merging patches) and thus was able to confirm your observations. I should note in passing that "ffado-test Discover" doesn't actually start the streaming threads. The timing errors you were seeing were from the cycle timer DLL which is started by libffado regardless of its mode of operation (which is why "ffado-test Discover" experienced them even though it doesn't do streaming).

To cut a long story short, it turns out that clock_nanosleep() doesn't implement support for CLOCK_MONOTONIC_RAW at the kernel level and simply returns immediately (see kernel/posix-timers.c in the kernel source). This messed pretty badly with FFADO's timing in general and the cycle timer DLL in particular, as one might expect. There was no chance streaming could ever work in this scenario.

FFADO svn r2172 addresses this problem; it simply uses CLOCK_MONOTONIC for the clock_nanosleep() call whenever FFADO's system clock source is set to CLOCK_MONOTONIC_RAW. With this in place I was able to confirm that streaming works once again with an RME FF800.

Note that I've chosen to use CLOCK_MONOTONIC_RAW as the preferred system time source because CLOCK_MONOTONIC can still be slewed by NTP changes (CLOCK_MONOTONIC doesn't jump though, nor does it go backwards). Therefore to my eyes CLOCK_MONOTONIC_RAW would appear to be the better source of "system time" for FFADO.

Please re-test with r2172 or later on your system with the problematic clock to see if, after all these false starts, we are any closer to making FFADO less susceptible to NTP-induced clock changes.

06/25/12 15:18:00 changed by stefanr

  • attachment jackd_r2174_new-libraw1394.txt added.

r2174 is alas still unabe

06/25/12 15:20:47 changed by stefanr

r2174 is alas still unable to start [replaced keyboard, still practicing]

06/25/12 15:29:17 changed by stefanr

PS, it's the same whether ntpd is on or off. To eliminate all other factors, I tried

and got consistent results: r2166 works, r2174 does not.

06/26/12 04:11:39 changed by jwoithe

Hmmm. It gets trickier now because I can no longer reproduce the problem on my system (r2174 without ntpd works fine with the RME FF800, libraw1394 git from last night, kernel 3.2.4). To be sure I just re-ran it again.

You're still seeing the CycleTimerHelper? negative step errors, which is pretty much exactly what we saw prior to r2172. On my system this was caused by clock_nanosleep(CLOCK_MONOTONIC_RAW,...) returning without delay. The continued existence of timing errors would ordinarily suggest that the driver you're using (DICE) calls a clock function directly which doesn't utilise the official system clock (giving rise to a clock value mismatch). However, as far as I can tell this doesn't happen, so I'm quite puzzled.

06/26/12 05:01:03 changed by stefanr

The tests with r2166 <-> r2171 in comment 15 were done with the DICE driver (Focusrite Saffire PRO 24), but the tests with r2166 <-> r2174 in comment 22 were done with the BeBoB driver (Terratec PhaseX24FW). I shall repeat the latter with the DICE driver/ Saffire.

06/26/12 05:52:40 changed by stefanr

Same result with DICE driver, Saffire PRO 24, and three different controllers (TI XIO2213A/B, LSI/Agere FW643e, JMicron JMB381). r2166 is fine, r2174 not. Tested without ntpd, current libraw1394 of course (double-checked that no older versions exist in {/usr,/usr/local}/{include,lib}), and performance CPU frequency scaling governor (although ondemand works on my system too).

Here is the tail of a session with r2174, Saffire PRO24, JMB381:

2602487413684: Warning (IsoHandlerManager.cpp)[1568] putPacket: reconstructed CTR counter discrepacy                                                                                                
2602487413689: Warning (IsoHandlerManager.cpp)[1574] putPacket:  ingredients: 50, 28050000, 2A05012, 2A025550, 2A023192, 20, 21, 21, 516342162
2602487413693: Warning (IsoHandlerManager.cpp)[1575] putPacket:  diffcy = 45
2602487413697: Warning (IsoHandlerManager.cpp)[1568] putPacket: reconstructed CTR counter discrepacy
2602487413701: Warning (IsoHandlerManager.cpp)[1574] putPacket:  ingredients: 51, 28051000, 2A05112, 2A0256A8, 2A023192, 20, 21, 21, 516345234
2602487413705: Warning (IsoHandlerManager.cpp)[1575] putPacket:  diffcy = 46
2602487413710: Warning (IsoHandlerManager.cpp)[1568] putPacket: reconstructed CTR counter discrepacy
2602487413716: Warning (IsoHandlerManager.cpp)[1574] putPacket:  ingredients: 52, 28052000, 2A05212, 2A0257CF, 2A023192, 20, 21, 21, 516348306
2602487413720: Warning (IsoHandlerManager.cpp)[1575] putPacket:  diffcy = 47
2602487413724: Warning (IsoHandlerManager.cpp)[1568] putPacket: reconstructed CTR counter discrepacy
2602487413729: Warning (IsoHandlerManager.cpp)[1574] putPacket:  ingredients: 53, 28053000, 2A05312, 2A025940, 2A023192, 20, 21, 21, 516351378
2602487413733: Warning (IsoHandlerManager.cpp)[1575] putPacket:  diffcy = 48
2602487413740: Warning (IsoHandlerManager.cpp)[1568] putPacket: reconstructed CTR counter discrepancy
2602487413743: Warning (IsoHandlerManager.cpp)[1574] putPacket:  ingredients: 54, 28054000, 2A054192, 2A025AB0, 2A023192, 20, 21, 21, 516354450
2602487413749: Warning (IsoHandlerManager.cpp)[1575] putPacket:  diffcy = 49
2602487413855: Warning (StreamProcessorManager.cpp)[ 909] alignReceivedStreams: xrun while aligning streams...
2602487413865: Error (StreamProcessorManager.cpp)[ 873] syncStartAll: Could not align streams...
2602487413870: Fatal (StreamProcessorManager.cpp)[1021] start: Could not syncStartAll...
2602487413877: Warning (devicemanager.cpp)[ 929] startStreaming: Failed to start SPM!
firewire ERR: Could not start streaming threads: -1
2602487424170: Fatal (ffado.cpp)[ 220] ffado_streaming_start: Could not start the streaming system
DRIVER NT: could not start driver
cannot start driver
2602487575860: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282477201442.834961! (correcting to nominal)
2602487776199: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282477126033.470703! (correcting to nominal)
2602487975504: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282477045726.331055! (correcting to nominal)
2602488173988: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282476960558.433594! (correcting to nominal)
2602488374161: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282476870401.022461! (correcting to nominal)
2602488573279: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282476775365.975586! (correcting to nominal)
2602488776155: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282476675226.498047! (correcting to nominal)
2602488973224: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282476570374.581055! (correcting to nominal)
2602489176290: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282476460376.753906! (correcting to nominal)
2602489373550: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282476345654.441406! (correcting to nominal)
2602489429428: Fatal (IsoHandlerManager.cpp)[ 348] Execute: (0xd2b8d0, Receive) Handler died: now: 2E0A348B, last: 2A054192, diff: 49395449 (max: 49152000)
2602489429448: Warning (StreamProcessor.cpp)[ 173] handlerDied: Handler died for 0xd62af0
2602489438844: Warning (ieee1394service.cpp)[1498] freeIsoChannel: Invalid channel number: -1
2602489438862: Error (dice_avdevice.cpp)[1365] startstopStreamByIndex: Could not deallocate iso channel for SP 0 (ATX 0)
2602489438867: Warning (devicemanager.cpp)[ 963] stopStreamingOnDevice: Could not stop stream 0 of device 0xd59bc0
2602489440436: Error (dice_avdevice.cpp)[1352] startstopStreamByIndex: ISO_CHANNEL register != 0xFFFFFFFF (=0x00000001) for ARX 0
2602489440444: Warning (devicemanager.cpp)[ 963] stopStreamingOnDevice: Could not stop stream 1 of device 0xd59bc0
2602489576091: Error (CycleTimerHelper.cpp)[ 508] Execute: negative step: -7282476225792.833984! (correcting to nominal)
no message buffer overruns
stefan@stein ~ $ 

It looked the same on the other two controllers.

This and the prior tests were on the PC with the exceedingly bad clock which causes ntpd to force a jump by about two seconds every two minutes. I don't know if this is of any importance: Due to ntpd usage for most of the time prior to this test session, a considerable offset from CLOCK_MONOTONIC_RAW to CLOCK_MONOTONIC has accumulated now. Here is a part of the output of testlibraw:

  - cycle timer: 70 seconds, 7026 cycles, 1937 sub-cycles
    local time from CLOCK_REALTIME: 1340714283538424 us = Tue Jun 26 14:38:03 2012
    cycle timer: 70 seconds, 7027 cycles, 771 sub-cycles
    local time from CLOCK_MONOTONIC: 2604599289565 us
    cycle timer: 70 seconds, 7027 cycles, 943 sub-cycles
    local time from CLOCK_MONOTONIC_RAW: 2603303207421 us

(30 days uptime, 22 minutes difference between the two monotonic clocks)

To do: Test on my second PC which has a normally working RTC.

06/26/12 06:02:12 changed by stefanr

  • attachment ffado-test_Discover_-v6_DICE_r2166_new-libraw1394.txt added.

for completeness: "ffado-test Discover -v", r2166, Saffire PRO24

06/26/12 06:04:00 changed by stefanr

  • attachment ffado-test_Discover_-v6_DICE_r2174_new-libraw1394.txt added.

for completeness: "ffado-test Discover -v6", r2174, Saffire PRO24

06/26/12 06:05:12 changed by jwoithe

I've been running further tests. Although I can't reproduce the negative step errors you're seeing, I do observe glitches in the playback with the FF800 if I use "-p256 -n3". By way of comparison, "-p1024 -n4" works fine. The glitches with "-p256 -n3" go away if I force the clock source to be CLOCK_REALTIME. CLOCK_MONOTONIC behaves in much the same way as does CLOCK_MONOTONIC_RAW. The result of testlibraw on this system is:

  - cycle timer: 29 seconds, 1124 cycles, 1245 sub-cycles
    local time from CLOCK_REALTIME: 1340715422771679 us = Tue Jun 26 22:27:02 2012
    cycle timer: 29 seconds, 1125 cycles, 471 sub-cycles
    local time from CLOCK_MONOTONIC: 10345790071 us
    cycle timer: 29 seconds, 1125 cycles, 1098 sub-cycles
    local time from CLOCK_MONOTONIC_RAW: 10345780173 us

Uptime is only a shade under 3 hours but in any case the clock on this PC is quite good so I wouldn't expect to see the massive differences that you see. It smells as if something somewhere is still using a clock that's not the same as that used by the SystemTimeSource? object. Investigations along these lines have so far come up empty (places where usleep() etc are called are either part of device setup or don't impact on DLL operation). I will keep looking.

06/26/12 06:13:54 changed by stefanr

I don't know if it is related: Why does ffado-test Discover -v6 on my system complain about "Debug (CycleTimerHelper?.cpp)[ 393] Execute: (0x81e7b0) have to retry CTR read, diff unrealistic: diff: 13598, max: +/- 3072 (try: 5) 203949968"? Both r2166 and r2174 show those ten cycle timer read retries due to "unrealistic" difference.

06/26/12 06:29:52 changed by stefanr

jackd -P66 -t2000 -dfirewire -r44100 -p1024 -n4 -v3

dies after about 15 seconds,

jackd -P66 -t2000 -dfirewire -r44100 -p1024 -n4

dies after about 20 seconds,

jackd -P66 -t2000 -dfirewire -r44100 -p2048 -n4

dies after about 45 seconds,

jackd -P66 -t2000 -dfirewire -r44100 -p4096 -n4

dies after about 100 seconds, with "firewire ERR: wait status < 0! (= -1)", "DRIVER NT: could not run driver cycle". During runtime my untrained ear did not hear glitches in playback.

06/26/12 07:03:27 changed by jwoithe

In relation to comment 27, I noticed that both your "working" (r2166) and "nonworking (r2174) logs showed those "diff unrealistic" messages. The fact that they occur in both leads me to believe that they are not the root cause of the issues we see when switching the clock to CLOCK_MONOTONIC / CLOCK_MONOTONIC_RAW.

FYI there is a thread which periodically reads the cycle timer; it's set to wake up once every 200000 microseconds (according to the ffado-test_Discover_-v6_DICE_r2174_new-libraw1394.txt dump) to do this. Before it goes to sleep it estimates what time it should be woken and what "tick" value this corresponds to. When it wakes it compares the actual time/ticks with the predicted ones and if the difference is "too much" you get the message we're talking about. "Too much" here means a single cycle - 3072 ticks. The inclusion of detection attempts for unrealistic cycle counter values seems to stem from buggy hardware hardware - see comments in CycleTimerHelper?.cpp, CycleTimerHelper::Execute().

In your case we're seeing positive diffs, which means wake-up occurred quite a bit later than was expected.

Thinking this through, I wonder whether the difficulties we're seeing are related to the fact that while we can read the clock from CLOCK_MONOTONIC_RAW we can't use that clock to time a sleep. Currently the fallback is to use CLOCK_MONOTONIC for the sleeps, but if the speed of the two clocks differ (1) it's feasible that this could upset some of the FFADO timing code - especially those parts which attempt to predict wake up times. This doesn't, however, explain why r2166 showed the "diff unrealistic" messages. In r2166 everything to my knowledge used CLOCK_REALTIME, so there should be no difference between the clock used to source localtime and that used to time sleeps.

It wouldn't surprise me if it turns out there is more than one gremlin at work here.

(1) Off-hand, I can't see how the speed of CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW could be different in the absence of ntpd; my understanding was that the two only differed due to ntp-induced clock slew of the former.

(follow-up: ↓ 33 ) 06/26/12 07:44:18 changed by adi

CLOCK_MONOTONIC_RAW cannot sleep. From the kernel's kernel/posix-timers.c:

        struct k_clock clock_monotonic_raw = {
                .clock_getres   = hrtimer_get_res,
                .clock_get      = posix_get_monotonic_raw,
        };

while CLOCK_MONOTONIC has an nsleep entry:

        struct k_clock clock_monotonic = {
                .clock_getres   = hrtimer_get_res,
                .clock_get      = posix_ktime_get_ts,
                .nsleep         = common_nsleep,
                .nsleep_restart = hrtimer_nanosleep_restart,
                .timer_create   = common_timer_create,
                .timer_set      = common_timer_set,
                .timer_get      = common_timer_get,
                .timer_del      = common_timer_del,
        };

(follow-up: ↓ 34 ) 06/26/12 08:06:13 changed by adi

Addendum: r2174 works fine on my DICE-based io14. Let me know if I should test other devices, too.

06/26/12 11:21:05 changed by stefanr

  • attachment jackd_r2174_new-libraw1394_Mac-mini_bebob-driver.txt added.

r2174 does not work on my 2nd PC with good RTC either

(follow-up: ↓ 35 ) 06/26/12 12:20:35 changed by stefanr

Addendum to my comment on jackd_r2174_new-libraw1394_Mac-mini_bebob-driver.txt:

While r2174 consistently fails to start on my 2nd PC with -p256 -n3 (new libraw1394, ntpd off or on), it actually works flawless on this 2nd PC with -p2048 -n4 (new libraw1394, ntpd off or on).

The "Warning (TimestampedBuffer?.cpp)[1068] incrementFrameCounter: (0x9fdb940) difference rather large" which is seen in the log with the smaller buffer setting message never occured in a session with the larger buffer setting which is now running for an hour already. In this good session, there was actually no warning at all (except for the usual SPLIT_TIMEOUT thing, the two "updateState: ignoring identity state update from/to ePS_Created", and the one "getPacket: Instantanous samplerate more than 1% off nominal" directly at sturtup).

1st PC with bad RTC: ASUS M3A78-EM, Phenom II x4 2.5 GHz, 8 GB RAM, kernel 3.3.1 with a few kernel debugging options and firewire drivers circa at version 3.5, x86-64 kernel (mainline) and x86-64 userland (Gentoo, except for mainline libraw1394 and ffado), 30 days uptime

2nd PC with good RTC:¹ Apple Mac mini, Core 2 Duo 2.33 GHz, 2 GB RAM, kernel 3.4 with some more kernel debugging options than on the 1st PC, same firewire drivers like on 1st PC, x86-32 kernel and x86-32 userland, 1.5 hours uptime

¹) ffado has always been working on this 2nd PC while ntpd was running; ntpd never had to resort to clock reset on this PC.

(in reply to: ↑ 30 ) 06/26/12 17:10:47 changed by jwoithe

Replying to adi:

CLOCK_MONOTONIC_RAW cannot sleep.

Indeed - see comment 20. This is precisely why I've used CLOCK_MONOTONIC for sleeps as of r2172.

(in reply to: ↑ 31 ) 06/26/12 17:14:37 changed by jwoithe

Replying to adi:

Addendum: r2174 works fine on my DICE-based io14.

That's interesting. What period size was used, and how many buffers were requested? I've found with the RME that "-p1024 -n256" works great with r2174, but "-p256 -n3" produces choppy audio.

Let me know if I should test other devices, too.

If it's easy to arrange it might be useful to build up a more complete picture of what's going on. In any case, there's something about the use of CLOCK_MONOTONIC_RAW that's messing things up, and somewhat confusingly the symptoms seem to vary between devices and/or drivers.

(in reply to: ↑ 32 ) 06/26/12 17:20:49 changed by jwoithe

Replying to stefanr:

While r2174 consistently fails to start on my 2nd PC with -p256 -n3 (new libraw1394, ntpd off or on), it actually works flawless on this 2nd PC with -p2048 -n4 (new libraw1394, ntpd off or on).

The result on the second PC more or less mirrors what I've found: with CLOCK_MONOTONIC_RAW (or CLOCK_MONOTONIC) in use, larger buffers seem to work considerably better than smaller ones. Of course with CLOCK_REALTIME the smaller buffers work so there's definitely a regression when we shift to these alternative clocks.

The "Warning (TimestampedBuffer?.cpp)(1068) incrementFrameCounter: (0x9fdb940) difference rather large" which is seen in the log with the smaller buffer setting message never occured in a session with the larger buffer setting ...

Again, that's similar to what I've seen with the RME (I don't get this particular message at small buffer sizes, but I do see others related to timing issues).

... the one "getPacket: Instantanous samplerate more than 1% off nominal" directly at sturtup).

This will occur in many cases right at the start while things adjust to the actual clock values delivered by the system. It's not, as far as I can tell, implicated in the monotonic clock issues.

(follow-up: ↓ 40 ) 06/27/12 13:09:58 changed by stefanr

Is this perhaps a startup race? (Something uses SystemTimeSource::clock_id before it was initialized? And large buffers defer the first critical use long enough to put it fater initialization...)

(follow-up: ↓ 38 ) 06/27/12 14:01:35 changed by stefanr

Tested a few things again on my "bad" PC and noticed:

  • "jackd -dfirewire -p256 -n3 -v3" failed as before.
  • Surprisingly, one atttempt of "jackd -dfirewire -p2048 -n4 -v3" worked for half an hour without problem. A few minutes into this session I started ntpd and forced realtime clock reset by means of ntp-client and "date -s ...", which did not disturb the session as expected.
  • I stopped this good session and then tried "jackd -dfirewire -r44100 -p2048 -n4 -v3". This went fine for about 10 minutes, then suddenly the series of "negative step"s showed up and jackd died. After this, I was unable to restart jackd, no matter what parameters I tried.

Tested with Saffire PRO 24 on Agere FW643e which I also used yesterday, among other combinations.

06/27/12 14:09:30 changed by stefanr

  • attachment clock_monotonic.patch added.

suggested patch for simplification and perhaps more robustness even: replace CLOCK_MONOTONIC_RAW by CLOCK_MONOTONIC

(in reply to: ↑ 37 ; follow-up: ↓ 42 ) 06/27/12 15:02:48 changed by stefanr

Re clock_monotonic.patch: The syntax highlighting viewer is broken, it misses a hunk in src/libutil/SystemTimeSource.cpp. Please view the patch in "original format".

Replying to stefanr:

Tested a few things again on my "bad" PC and noticed: - "jackd -dfirewire -p256 -n3 -v3" failed as before. - Surprisingly, one atttempt of "jackd -dfirewire -p2048 -n4 -v3" worked for half an hour without problem. A few minutes into this session I started ntpd and forced realtime clock reset by means of ntp-client and "date -s ...", which did not disturb the session as expected.

Actually, this first bad session and second good session was with clock_monotonic.patch applied on top of r2174. I reverted that patch, then ran the following session which lasted only 10 minutes and could not be recovered:

- I stopped this good session and then tried "jackd -dfirewire -r44100 -p2048 -n4 -v3". This went fine for about 10 minutes, then suddenly the series of "negative step"s showed up and jackd died. After this, I was unable to restart jackd, no matter what parameters I tried.

After this, I applied clock_monotonic.patch again and "jackd -dfirewire -p2048 -n3 -v3" worked again immediately and ran nicely for another hour.

So at the moment it looks like my bad PC and r2174 really do not cope well with the mixture of CLOCK_MONOTONIC_RAW and CLOCK_MONOTONIC, whereas using CLOCK_MONOTONIC alone seems to work (with large buffers). Note that CLOCK_MONOTONIC is indeed supposed to be sufficient: The gradual clock shifts which ntpd induced never were a problem on my two PCs (and anybody else's I presume), only clock resets are the issue.

Now on to tackle the problem with small buffers.

(follow-up: ↓ 43 ) 06/27/12 15:14:23 changed by stefanr

On kernel-version-recommendation.patch: The firewire-core/-ohci drivers of kernel 2.6.28 don't actually work for FFADO. Version 2.6.32 was the very first that worked; kernel 2.6.36 had several issues fixed that affected some audio related use cases. (And raw1394 of the old linux1394 stack never gained a request which lets clients choose a clock other than CLOCK_REALTIME.)

06/27/12 15:14:39 changed by stefanr

  • attachment kernel-version-recommendation.patch added.

(in reply to: ↑ 36 ) 06/28/12 07:19:22 changed by jwoithe

Replying to stefanr:

Is this perhaps a startup race? (Something uses SystemTimeSource::clock_id before it was initialized?

I thought of this possiblity, but as far as I can see none of the streaming code can possibly be started before SystemTimeSource::clock_id is initialised by Ieee1394Service::initialize(). If anything earlier did happen to use SystemTimeSource::clock_id earlier it would use CLOCK_REALTIME (that's what clock_id is set to on startup), but since none of those times could propagate into the streaming system it would be irrelevant if it did.

While this is my present understanding, I could well be wrong.

06/28/12 07:23:17 changed by jwoithe

Regarding clock_monotonic.patch, your findings are interesting. I tried precisely this approach at one point and observed the same problematic behaviour as when CLOCK_MONOTONIC_RAW was used. I therefore discounted the use of CLOCK_MONOTONIC as a fix because my testing seemed to indicate that it didn't fix the problem.

However, given that it seems to work for you I will re-test after forcing the use of CLOCK_MONOTONIC to make sure the result I observed wasn't due to some other issue at the time.

(in reply to: ↑ 38 ) 06/28/12 07:30:39 changed by jwoithe

Replying to stefanr:

Note that CLOCK_MONOTONIC is indeed supposed to be sufficient: The gradual clock shifts which ntpd induced never were a problem on my two PCs (and anybody else's I presume), only clock resets are the issue.

Hmm, ok. I was concerned that the time slews also created issues.

Now on to tackle the problem with small buffers.

I think this is related to the use of CLOCK_MONOTONIC (or CLOCK_MONOTONIC_RAW). In comment 41 I mentioned that CLOCK_MONOTONIC was broken for me; on reflection I think this was probably tested with small buffers (-p256 -n3). However, if CLOCK_REALTIME was forced then "-p256 -n3" worked perfectly - this suggests to me that it's not the small buffer sizes which are the problem. So even if CLOCK_MONOTONIC works for large buffer sizes, it still seems to be the cause of some underlying problem. That problem, it seems, is papered over by the use of large buffer sizes, but is almost certainly still lurking under the surface even then.

In light of these things I will re-test all this on my setup so we can be definite about what's going on. It may be a few days before I get a chance to do so though since I'm busy for a couple of days with other things.

(in reply to: ↑ 39 ) 06/28/12 07:36:22 changed by jwoithe

Replying to stefanr: regarding kernel-version-recommendation.patch: that makes sense to me. Applied as r2175. Thanks for the background info and the patch.

06/28/12 14:27:32 changed by stefanr

The -p2048 -n3 session which I started yesterday with r2174 + clock_monotonic.patch on my bad PC is still running --- now for 24 hours, with ntpd doing its thing. During this session, just one recoverable xrun occurred when I ran "emerge-webrsync" on my root filesystem which is a reiserfs filesystem for historical reasons.

(follow-up: ↓ 48 ) 06/28/12 14:57:35 changed by stefanr

For what it's worth, I applied this hack on top of r2175 + clock_monotonic.patch:

--- a/src/libutil/SystemTimeSource.cpp
+++ b/src/libutil/SystemTimeSource.cpp
@@ -38,7 +38,7 @@ DECLARE_GLOBAL_DEBUG_MODULE;
 
 namespace Util {
 
-static clockid_t clock_id = CLOCK_REALTIME;
+static clockid_t clock_id = CLOCK_MONOTONIC;
 
 bool
 SystemTimeSource::setSource(clockid_t id)

With this, "jackd -dfirewire -p256 -n3 -v3" still fails to start. Similar to or same as before, lots of "reconstructed CTR counter discrepancy" warnings and "Execute: Timeout while waiting for activity" precede the ending in "Could not align streams..." ... "Could not syncStartAll..." ... "Could not start the streaming system" ... "cannot start driver".

(follow-ups: ↓ 47 ↓ 49 ) 06/28/12 16:28:56 changed by stefanr

Some more data:

The "reconstructed CTR counter discrepancy" came from IsoHandlerManager?.cpp line 1657 in getPacket, not from line 1568 in put_packet.

Back on r2175 + clock_monotonic.patch. I noticed that the threshold of successful startup for my 1st PC + Saffire PRO 24 + Agere controller is around -p256 -n4.

  • "jackd -dfirewire -p256 -n3 -v3" = throws lots of "reconstructed CTR counter discrepancy" and "Execute: Timeout while waiting for activity", then dies soon.
  • "jackd -dfirewire -p256 -n4 -v3" = starts and client applications can connect to jackd, but ffado throws a loop of "Execute: Timeout while waiting for activity", audio playback may or may not be heard,
  • "jackd -dfirewire -p256 -n5 -v3" = good,
  • "jackd -dfirewire -p512 -n3 -v3" = good,

(good meaning it does start normally and keeps running for 15 minutes or longer without warnings after start and without perceivable glitches).


Questions:

  • Is there any exchange of timestamps going on between ffado and upper jack layers or client processes?
  • Do the nanosleep() and gettimeofday() calls in debugmodule or the usleep() calls in libieee1394 and elsewhere matter?
  • Could there be some arithmetic going on somehwere in FFADO which relies on absolute system timestamps being a few decades greater than 0, rather than just some weeks or hours?
  • Is it that smaller buffers cause FFADO to call clock_nanosleep() for briefer sleeping periods which happen to work with CLOCK_REALTIME but not with CLOCK_MONOTONIC? (Would be counter intuitive; and clock_getres() advertizes 1 ns precision for CLOCK_REALTIME, CLOCK_MONOTONIC, and CLOCK_MONOTONIC_RAW on my PC.)
  • Could it be that CLOCK_REALTIME is synchronous on all CPUs but CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW differ between CPUs (cf. man clock_getres), and FFADO gets confused by that when being moved from CPU to CPU?

(in reply to: ↑ 46 ) 06/28/12 17:00:26 changed by stefanr

Replying to stefanr:

- Could it be that CLOCK_REALTIME is synchronous on all CPUs but CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW differ between CPUs (cf. man clock_getres), and FFADO gets confused by that when being moved from CPU to CPU?

Note to self: Determine the threshold between "doesn't-start" and "starts" on the 2nd PC. Reboot and test again to see whether the threshold shifted notably. Reboot again but with "nosmp" kernel parameter and retest smaller buffer settings.

(in reply to: ↑ 45 ) 06/28/12 17:48:59 changed by jwoithe

Replying to stefanr:

For what it's worth, I applied this hack on top of r2175 + clock_monotonic.patch: ... With this, "jackd -dfirewire -p256 -n3 -v3" still fails to start. Similar to or same as before ...

I'm not surprised. In reality the value this is initialised to will only ever be used if the new libraw1394 API is not detected. If the new API is present then SystemTimeSource::setSource(CLOCK_MONOTONIC_RAW) will be called before any other SystemTimeSource? calls are made (assuming my earlier analysis is correct regarding calling order). Therefore on your system - which has the new API - there will be no runtime difference when you made this change; by the time any SystemTimeSource? timer functions are called clock_id will be CLOCK_MONOTONIC_RAW (or CLOCK_MONOTONIC if you've forced it to this in ieee1394service).

In other words, with the new libraw1394 API present the value of clock_id at the time SystemTimeSource? is used to get time values (or instigate sleeps) will be that which is set via SystemTimeSource::setSource() as called from ieee1394service.

If you want to override the actual clock used, the best way is to simply add a SystemTimeSource::setSource() call of your choosing to the ieee1394service initialization method to override the setting made by the code that's already there.

(in reply to: ↑ 46 ) 06/28/12 18:03:48 changed by jwoithe

Replying to stefanr:

- Is there any exchange of timestamps going on between ffado and upper jack layers or client processes?

No. The generated timestamps are only used internally by the streaming layer within FFADO itself.

- Do the nanosleep() and gettimeofday() calls in debugmodule or the usleep() calls in libieee1394 and elsewhere matter?

I don't think so. Those in debugmodule seem to exist only to induce delays in the processing of the debug messages. Since this is entirely isolated within the debugmodule it really doesn't matter if the actual sleep time varies slightly from what's requested or what timer is used to measure the sleep. THe usleep() calls are similar in that they are acting outside of the streaming system: either during device setup, configuration or control. The streaming layer doesn't see any of this so their presence is irrelevant as far as I can tell.

- Could there be some arithmetic going on somehwere in FFADO which relies on absolute system timestamps being a few decades greater than 0, rather than just some weeks or hours?

I don't know but I suspect not. I can't think of a case where this would be significant. When dealing with timestamps the only arithmetic issue I can think of is overflow, but this is the opposite scenario to what we're dealing with (that is, if overflow were a problem it would be worse for larger timestamp values).

- Is it that smaller buffers cause FFADO to call clock_nanosleep() for briefer sleeping periods which happen to work with CLOCK_REALTIME but not with CLOCK_MONOTONIC? (Would be counter intuitive; and clock_getres() advertizes 1 ns precision for CLOCK_REALTIME, CLOCK_MONOTONIC, and CLOCK_MONOTONIC_RAW on my PC.)

It's certainly possible that smaller buffer sizes are causing certain clock_nanosleep() calls to be briefer. I agree that it's counter-intuitive that this would work for CLOCK_REALTIME and not the others, especially since the resolutions are all reported identically. However, perhaps it's worth exploring this, if for no other reason than to definitely rule it out.

- Could it be that CLOCK_REALTIME is synchronous on all CPUs but CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW differ between CPUs (cf. man clock_getres), and FFADO gets confused by that when being moved from CPU to CPU?

That's also a good point. This is getting into the lower level detail of the clock implementations which I'm not familar with: I'll have to do some reading. If CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW are not synchronous across CPUs then that would certainly give trouble if FFADO was migrated. How often would migration occur though? We're seeing the problem occcur pretty much immediately and continuously; if caused by migration there's a lot of migration going on.

Of course different FFADO threads could be running on different CPUs. However, I don't think the issues we're seeing are associated with timestamps which are passed between the streaming threads. I'll take a look at this though since it's worth checking.

06/29/12 11:57:25 changed by stefanr

I now tested my 2nd PC with the good RTC, unmodified r2175, Terratec PhaseX24FW (BeBoB), Agere FW323, kernel 3.4. Exactly the same results as with the 1st PC per comment:46:

  • "jackd -dfirewire -p256 -n3 -v3" = throws lots of "reconstructed CTR counter discrepancy" and "Execute: Timeout while waiting for activity", then dies soon.
  • "jackd -dfirewire -p256 -n4 -v3" = starts and typically plays, but ffado throws a neverending loop of "Execute: Timeout while waiting for activity"
  • "jackd -dfirewire -p256 -n5 -v3" = works fine,
  • "jackd -dfirewire -p512 -n3 -v3" = works fine.

I do get exactly this result also after reboot and also with "nosmp" kernel parameter, i.e. CPU degraded to single core. IOW the issue is not SMP related.

(follow-up: ↓ 52 ) 06/30/12 13:04:56 changed by stefanr

Repeated testing on my first PC confirmed that clock_monotonic.patch is absolutely necessary. Without it, I cannot start jackd due to negative timesteps, no matter what buffer sizes I set. I.e. the second sentence in the patch description has proven to be wrong: The patch is not just an improvement, rather it is indispensable.

(in reply to: ↑ 51 ; follow-up: ↓ 53 ) 06/30/12 21:37:10 changed by jwoithe

Replying to stefanr:

Repeated testing on my first PC confirmed that clock_monotonic.patch is absolutely necessary. Without it, I cannot start jackd due to negative timesteps, no matter what buffer sizes I set.

Thanks for this latest round of testing. I am yet to get time to run the tests I eluded to in comment 42 but hope to soon. There definitely seems to be a driver dependency in this behaviour though, since I can run the FF800 with "-p1024 -n4" quite happily while for you this isn't possible.

In any case, as per your earlier tests, using CLOCK_MONOTONIC isn't sufficient because it causes thing to go wrong with small buffer sizes. Likewise, on my system with the FF800 I get problems at lower buffer sizes with CLOCK_MONOTONIC_RAW or CLOCK_MONOTONIC that don't occur at all if CLOCK_REALTIME is used throughout. It seems to me that while CLOCK_MONOTONIC does something to improve things, there's still something wrong - something which happens to be covered over when larger buffers are in use. To be sure of all this I'll retest everything here and report back, at which case we can try to work out a way forward.

(in reply to: ↑ 52 ; follow-up: ↓ 55 ) 07/01/12 02:13:39 changed by stefanr

Replying to jwoithe:

Replying to stefanr:

Repeated testing on my first PC confirmed that clock_monotonic.patch is absolutely necessary. Without it, I cannot start jackd due to negative timesteps, no matter what buffer sizes I set.

Thanks for this latest round of testing. I am yet to get time to run the tests I eluded to in comment 42 but hope to soon. There definitely seems to be a driver dependency in this behaviour though, since I can run the FF800 with "-p1024 -n4" quite happily while for you this isn't possible.

Just to be clear, this is not dependent on drivers. I do get the same result like you with either CLOCK_MONOTONIC (r2174 + patch) or CLOCK_MONOTONIC_RAW/CLOCK_MONOTONIC mixture (plain r2174) on my 2nd PC with good RTC, FW323 controller, and DICE as well as BeBoB driver. For a time (but not anymore), I also got this result with plain r2174 like with r2174 + patch on my 1st PC with bad RTC, XIO2213/ JMB381/ FW643e controllers, and DICE as well as with BeBoB driver. (The result being that small buffers regressed relative to CLOCK_REALTIME whereas larger buffers work, here: beginning with -p256 -n5, or -p256 -n4 with warnings.)

Only on my 1st PC with bad RTC and large uptime, the following difference between plain r2174 and r2174 + patch showed up: At first, the former survived only about 10 minutes while the latter could run for hours. Right now I can't start the former anymore but the latter is still reliable. Both with sufficiently large buffers. Again, my 2nd PC does not have this problem, or at least not yet.

I don't know whether this special problem of my 1st PC is a result of the bad RTC or of the larger uptime or of both in combination. It is on the other hand quite safe to say that it is not the result of different drivers or controllers or whatever.

I now moved my 2nd PC to another room and will have it gather several days uptime there such that the offset between CLOCK_MONOTONIC_RAW and CLOCK_MONOTONIC increases, then see whether the 2nd PC becomes affected the same way as the 1st PC. There are probably quicker and more definite ways to determine whether this is about the offset between the two clocks or about the speed difference between the two clocks, but I guess there is no rush.

So far I did all my tests after ntpd already ran at least at boot. (It was subsequently switched off during some of the tests, but that made no difference except of course with CLOCK_REALTIME.) One test which I also should do eventually is to boot the 1st PC with bad RTC without ntpd ever started, so that CLOCK_MONOTONIC_RAW and CLOCK_MONOTONIC remain at the same offset and speed.

In any case, as per your earlier tests, using CLOCK_MONOTONIC isn't sufficient because it causes thing to go wrong with small buffer sizes.

Yes. Maybe we need to instrument the clock_nanosleep calls somehow, to check that they indeed delay for about the period which is requested.

(follow-up: ↓ 56 ) 07/06/12 06:25:57 changed by jwoithe

I have made further progress I think. It turns out that we call sem_timedwait() in two places, and the absolute timeout value it takes is only checked against CLOCK_REALTIME - there is no way to change this. Revision r2177 checks in a fix for this issue: the time used to calculate the timeout value is obtained from CLOCK_REALTIME rather than Util::SystemTimeSource?. Since the time obtained in the affected functions is only ever used with sem_timedwait() it is safe to do this.

With the changes in r2177 I can successfully run the RME FF800 with "-p256 -n3" on my development machine whereas beforehand these settings were unstable.

Stefan: I would be interested to see what r2177 (or later) does on your two boxes.

(in reply to: ↑ 53 ) 07/08/12 07:53:07 changed by stefanr

Replying to stefanr:

I don't know whether this special problem of my 1st PC is a result of the bad RTC or of the larger uptime or of both in combination. It is on the other hand quite safe to say that it is not the result of different drivers or controllers or whatever. I now moved my 2nd PC to another room and will have it gather several days uptime there such that the offset between CLOCK_MONOTONIC_RAW and CLOCK_MONOTONIC increases, then see whether the 2nd PC becomes affected the same way as the 1st PC.

I had my 2nd PC with good RTC collect 7 days uptime. This shifted CLOCK_MONOTONIC by +8_385_207 µs (+8.4 seconds) relative to CLOCK_MONOTONIC_RAW. This means that CLOCK_MONOTONIC is 13 ppm faster than CLOCK_MONOTONIC_RAW.

During the entire 7 days I had unmodified FFADO r2175 running with "jackd -dfirewire -p256 -n5" without xrun. I stopped this session in order to try current svn.

For comparison: My 1st PC with bad RTC currently has 42 days uptime and gathered 1_812_593_820 µs (30 minutes) difference of CLOCK_MONOTONIC - CLOCK_MONOTONIC_RAW. This amounts to CLOCK_MONOTONIC running 500 ppm faster than CLOCK_MONOTONIC_RAW. (Yet CLOCK_REALTIME still needs to be reset because that 500 ppm speed adjustment is not enough by far to keep up with the rest of the world.)

Conclusion of the 7 days long experiment on my first PC: Several seconds absolute difference between CLOCk_MONOTONIC and CLOCK_MONOTONIC_RAW are not a problem for FFADO r2175.

(in reply to: ↑ 54 ; follow-up: ↓ 57 ) 07/08/12 14:03:12 changed by stefanr

Replying to jwoithe:

I have made further progress I think. It turns out that we call sem_timedwait() in two places, and the absolute timeout value it takes is only checked against CLOCK_REALTIME - there is no way to change this. Revision r2177 checks in a fix for this issue: the time used to calculate the timeout value is obtained from CLOCK_REALTIME rather than Util::SystemTimeSource?. Since the time obtained in the affected functions is only ever used with sem_timedwait() it is safe to do this.

But if a clock reset happened during sem_timedwait(), it will wait for too short or too long, right?

With the changes in r2177 I can successfully run the RME FF800 with "-p256 -n3" on my development machine whereas beforehand these settings were unstable. Stefan: I would be interested to see what r2177 (or later) does on your two boxes.

I tested r2178 on both PCs now, together with what is to become libraw1394 v2.1.0.

Results on my 2nd PC with good RTC:

  • -p256 -n3 and -p128 -n3 work fine. I did not test any other buffer settings yet.
  • DICE driver with Saffire PRO 24 as well as BeBoB driver with PhaseX24FW work alike.
  • CLOCK_REALTIME can be reset into the past and into the future without audible glitch and without xrun.
  • Unmodified r2178 as well as r2178 patched with clock_monotonic.patch work alike on this PC.

Results on my 1st PC with bad RTC:

  • I was unable to start unmodified r2178 with BeBoB driver and PhaseX24FW.
  • I could start r2178 + clock_monotonic.patch with BeBoB driver and PhaseX24FW. I let it run for more than an hour and it survived many clock resets from me and several from ntpd during that time (with glitches, see below).
  • I could start unmodified r2178 with DICE driver and Saffire PRO 24 twice. It ran for a while and survived some resets of CLOCK_REALTIME (more on that below) but died when a long series of "reconstructed CTR counter discrepancy" happened after a clock reset. After this, I was no longer able to start unmodified r2178 with DICE driver and Saffire PRO 24. The driver startup could not recover from series of "CTR counter discrepancy".
  • I could start r2178 + clock_monotonic.patch with DICE driver and Saffire PRO 24. I let it run for more than an hour and it survived many clock resets from me and several from ntpd during that time (with glitches like the BeBoB driver, see below).
  • -p256 -n3, -p128 -n3, and -p64 -n3 work alike.
  • CLOCK_REALTIME can be reset into the past without glitch/ without xrun.
  • If CLOCK_REALTIME is reset into the future by seconds or by hours, it almost always causes a clearly audible gap in playback and a reported xrun if buffers are set to -256 -n3 or less. Further observations:
    • The xrun begins with one "(IsoHandlerManager?.cpp)[ 292] Execute: Timeout while waiting for activity" warning, then a series of "(IsoHandlerManager?.cpp)[1656] getPacket: reconstructed CTR counter discrepancy" and associated warnings follow, concluded by "(devicemanager.cpp)[ 998] waitForPeriod: XRUN detected" and another incrementFrameCounter and getPacket warning.
    • There was one occasion at which only one "Execute: Timeout while waiting for activity" was logged but unlike the other times, no CTR counter discrepancy/ xrun/ audible glitch happened.
    • The xrun at CLOCK_REALTIME reset into the future happens with unpatched r2178 (when it could started at all) as well as with r2178 + clock_monotonic.patch.
  • CLOCK_REALTIME can be reset into the past as well as into the future without glitch and without xrun with larger buffer sizes: -p256 -n5, or -p512 -n3, or -p2048 -n3 all prevent the glitch.

To summarize:

  1. All is fine on my 2nd PC with good RTC. libraw1394-git plus r2177 or later improve FFADO on that PC in one regard: FFADO is now immune against clock resets. While ntpd never resets the clock on this PC, user intervention could cause clock resets, or a switch between standard time and daylight saving time would cause it.
  1. Results of libraw1394-git plus r2177 or later are mixed on my 1st PC with bad RTC:
    1. There is a regression relative to 2166 (or relative to r2177 with libraw1394 v2.0.9): FFADO is no longer able to start, or occasionally starts but dies soon after. This is fixed or at least successfully worked around by clock_monotonic.patch.
    2. Without or with clock_monotonic.patch, but only on my 1st PC with bad RTC, there is a handled xrun with audible gap when the system clock is reset into the future and buffers are -p256 -n3 or smaller.
    3. That xrun can escalate into an unhandled xrun and hence terminate jackd if clock_monotonic.patch was not applied. (But without clock_monotonic.patch, ffado is unlikely to work at all anyway.)
    4. This glitch does not happen if buffers are -p256 -n5 or bigger.
    5. This glitch does not happen if the clock is reset into the past.

(in reply to: ↑ 56 ; follow-up: ↓ 58 ) 07/08/12 17:17:08 changed by jwoithe

Replying to stefanr:

Thanks for this latest round of extensive tests and the detailed report.

Replying to jwoithe:

It turns out that we call sem_timedwait() in two places, and the absolute timeout value it takes is only checked against CLOCK_REALTIME - there is no way to change this. Revision r2177 checks in a fix for this issue: the time used to calculate the timeout value is obtained from CLOCK_REALTIME rather than Util::SystemTimeSource?. Since the time obtained in the affected functions is only ever used with sem_timedwait() it is safe to do this.

But if a clock reset happened during sem_timedwait(), it will wait for too short or too long, right?

To be honest I'm not sure. I can't find any references which point to the behaviour of sem_timedwait() across a clock reset. I guess it depends on how it's implemented internally. If it sticks with the absolute time then I would expect spurious returns as you suggested. If instead it uses relative times it might work (clock_nanosleep() apparently delays correctly across time resets when working with a relative sleep interval).

If sem_timedwait() happens to return early I don't think that would be a huge issue. The early wakeup will be noted but I think the rest of the streaming system would just determine that there was no work to do and send things back to sleep again. I note that in the relevant places there is no check/retry in place in the event the call was interrupted, which further leads me to believe that early wakeups are relatively harmless. An early wakeup would be caused by a forward time set.

A late wakeup on the other hand (as caused by a time set into the past) could cause issues. I'm not sure there's much we can do about it though. The code is structured to use sem_timedwait(), and sem_timedwait() only ever uses CLOCK_REALTIME. Unlike pthread_cond_timedwait() there's no way to set the clock to anything else. Without a functional sem_timedwait() equivalent which can use a clock other than CLOCK_REALTIME I can't see any way to deal with this case.

Results on my 2nd PC with good RTC: :

Ok, that pretty much backs up the test results I had on my development PC. Progress of sorts has therefore been made.

Results on my 1st PC with bad RTC: - I was unable to start unmodified r2178 with BeBoB driver and PhaseX24FW.

That is strange. About the only thing left to be suspicious about is the speed difference between CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW, but I can't immediately understand why things would be so sensitive to the actual clock speed. About the only thing I can think of is that the initial values of the DLL may just be too far out when there's a gross error in the clock speed. The initial values are calculated assuming sample rates observed relative to an ideal clock. No clock is ideal of course and that's what the DLL's operation is meant to correct for. But if the actual clock used is miles out (as yours is on this PC) perhaps the DLL can't stabilise quick enough. I'll have a look into this in the next day or so to see if this is even remotely feasible.

- I could start r2178 + clock_monotonic.patch with BeBoB driver and PhaseX24FW. I let it run for more than an hour and it survived many clock resets from me and several from ntpd during that time (with glitches, see below).

This is puzzling. Aside from the speed difference (as discussed above) there should be no difference between CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW, and yet we get this differing behaviour. Even if a switch to CLOCK_MONOTONIC is ultimately made I would like to try to get to the bottom of this because it may well crop up for others. For example, if one is running their audio PC isolated from the network there's no NTP and CLOCK_MONOTONIC will not be speed corrected - it would be indistinguishable from CLOCK_MONOTONIC_RAW in this case I think. Such a user, when facing a PC like yours, would still see the problems you're seeing with CLOCK_MONOTONIC_RAW even if FFADO switches to CLOCK_MONOTONIC.

- I could start unmodified r2178 with DICE driver and Saffire PRO 24 twice. ... After this, I was no longer able to start unmodified r2178 with DICE driver and Saffire PRO 24. The driver startup could not recover from series of "CTR counter discrepancy".

That's odd too. Why could it start fine twice and then never again? And why was the DICE driver seemingly ok in this scenario while the BeBoB driver was apparently not. Perhaps you were just lucky with the DICE test; maybe the timer behaviour at that precise moment was just sufficiently "good" to permit FFADO startup.

- I could start r2178 + clock_monotonic.patch with DICE driver and Saffire PRO 24. ... - If CLOCK_REALTIME is reset into the future by seconds or by hours, it almost always causes a clearly audible gap in playback and a reported xrun if buffers are set to -256 -n3 or less. Further observations: - The xrun begins with one "(IsoHandlerManager?.cpp)[ 292] Execute: Timeout while waiting for activity" warning, then a series of "(IsoHandlerManager?.cpp)[1656] getPacket: reconstructed CTR counter discrepancy" and associated warnings follow, concluded by "(devicemanager.cpp)[ 998] waitForPeriod: XRUN detected" and another incrementFrameCounter and getPacket warning.

The "Timeout while waiting for activity" warning comes about as a result of an ETIMEDOUT errno from sem_timedwait(). If the time reset occurred between the clock_gettime() and the sem_timedwait() call I would expect this: the constructed timeout time would be well and truly passed by the time sem_timedwait() was called, so it would immediately time out. FFADO prints the warning and we move on. It's those "reconstructed CTR counter discrepancy" warnings that I don't immediately have an answer for. They usually indicate a loss of sync between the computer and the device, but it's not immediately clear to me why they would be prompted by an early return from sem_timedwait() in this context. Hmm, unless the streaming code is being iterated over unconditionally after the early sem_timedwait() return; then it's possible that assumptions within the streaming code about the expected changes in clock values may play a part.

In any case, an xrun is usually the ultimate result of a series of "reconstructed CTR counter discrepancy" messages, so the xrun is probably just a result of collateral damage.

- There was one occasion at which only one "Execute: Timeout while waiting for activity" was logged but unlike the other times, no CTR counter discrepancy/ xrun/ audible glitch happened.

Interesting.

- The xrun at CLOCK_REALTIME reset into the future happens with unpatched r2178 (when it could started at all) as well as with r2178 + clock_monotonic.patch.

Ok, that's a good observation since it means that the series of events which lead to the xrun are not really affected by CLOCK_MONOTONIC or CLOCK_MONOTONIC_RAW. It more or less confirms that the premature return from sem_timedwait() is to blame.

Another thought I will try to pursue: when the time is reset (either by ntp or manually), how long does it take for this to become "active", and during the transition what's the behaviour of clock_gettime(CLOCK_REALTIME, ...)? The reason I bring this up is that if these problems are traceable to sem_timedwait() then they would only occur if the clock reset occurred between the clock_gettime() and sem_timedwait() call. While clearly of finite probability the chances are rather small, and yet the problems we're chasing here are observed far more often than not. I'm wondering whether there's some other mechanism at play here. One thought I had was that maybe clock_gettime() was blocking for a while while the time reset was being done, but the problem with this hypothosis is that the same blocking behaviour would be present on your second "good RTC" PC.

To summarize: 1. All is fine on my 2nd PC with good RTC. libraw1394-git plus r2177 or later improve FFADO on that PC in one regard: FFADO is now immune against clock resets. While ntpd never resets the clock on this PC, user intervention could cause clock resets, or a switch between standard time and daylight saving time would cause it.

This on the whole is good because clearly it will cover a vast majority of the systems out there. However, I would still very much like to get to the bottom of the continued mis-behaviour of your "bad RTC" PC. If nothing else, solving this will, I expect, make FFADO more robust.

Thoughts and feedback about any of the above is more than welcome.

(in reply to: ↑ 57 ) 07/08/12 22:47:09 changed by stefanr

Replying to jwoithe:

Replying to stefanr:

- I could start r2178 + clock_monotonic.patch with BeBoB driver and PhaseX24FW. I let it run for more than an hour and it survived many clock resets from me and several from ntpd during that time (with glitches, see below).

This is puzzling. Aside from the speed difference (as discussed above) there should be no difference between CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW, and yet we get this differing behaviour. Even if a switch to CLOCK_MONOTONIC is ultimately made I would like to try to get to the bottom of this because it may well crop up for others. For example, if one is running their audio PC isolated from the network there's no NTP and CLOCK_MONOTONIC will not be speed corrected - it would be indistinguishable from CLOCK_MONOTONIC_RAW in this case I think. Such a user, when facing a PC like yours, would still see the problems you're seeing with CLOCK_MONOTONIC_RAW even if FFADO switches to CLOCK_MONOTONIC.

I should remove ntpd from the system start config, then reboot --- thus get a session with identical speed of both monotonic clocks --- and try without and with patch.

- I could start unmodified r2178 with DICE driver and Saffire PRO 24 twice. ... After this, I was no longer able to start unmodified r2178 with DICE driver and Saffire PRO 24. The driver startup could not recover from series of "CTR counter discrepancy".

That's odd too. Why could it start fine twice and then never again? And why was the DICE driver seemingly ok in this scenario while the BeBoB driver was apparently not. Perhaps you were just lucky with the DICE test; maybe the timer behaviour at that precise moment was just sufficiently "good" to permit FFADO startup.

This aspect is unchanged since r2174. I didn't take notes about it, but I suspect the few successful startups without patch (all which terminated with unrecoverable xrun after less than half an hour) were all with the DICE driver. But even so, the number of these occasions was too small to conclude whether DICE and BeBeB driver or streams differ in this regard. My gut feeling is they don't differ.

When unmodified r2178 fails immediately, it looks just like in attachment:jackd_r2174_new-libraw1394.txt.

- I could start r2178 + clock_monotonic.patch with DICE driver and Saffire PRO 24. ... - If CLOCK_REALTIME is reset into the future by seconds or by hours, it almost always causes a clearly audible gap in playback and a reported xrun if buffers are set to -256 -n3 or less. Further observations: - The xrun begins with one "(IsoHandlerManager?.cpp)[ 292] Execute: Timeout while waiting for activity" warning, then a series of "(IsoHandlerManager?.cpp)[1656] getPacket: reconstructed CTR counter discrepancy" and associated warnings follow, concluded by "(devicemanager.cpp)[ 998] waitForPeriod: XRUN detected" and another incrementFrameCounter and getPacket warning.

Since yesterday evening, I let jackd continue running.

2nd PC, BeBoB, -p128 -n3: running without xrun for 11 hours now

1st PC (bad RTC), DICE, monotonic_clock.patch applied, -p64 -n3: running for 9 hours now, clock resets have presumably happened in periods of a little bit more than a quarter hour (36 times), qjackctl shows only 30 xruns. I.e. it seems to be a race condition with about 80...85% probability of occurrence (on this PC).

22:22:50.002 XRUN callback (1).
  /* one reset without xrun */
22:54:24.831 XRUN callback (2).
23:10:02.788 XRUN callback (3).
23:25:49.750 XRUN callback (4).
23:41:34.650 XRUN callback (5).
23:57:14.556 XRUN callback (6).
00:12:51.511 XRUN callback (7).
00:28:29.398 XRUN callback (8).
00:44:16.350 XRUN callback (9).
  /* one reset without xrun */
01:15:43.198 XRUN callback (10).
01:31:27.102 XRUN callback (11).
01:47:14.086 XRUN callback (12).
02:02:56.968 XRUN callback (13).
02:18:39.910 XRUN callback (14).
  /* one reset without xrun */
02:50:09.779 XRUN callback (15).
03:05:51.708 XRUN callback (16).
  /* two resets without xrun */
03:52:57.463 XRUN callback (17).
04:08:38.374 XRUN callback (18).
04:24:21.339 XRUN callback (19).
04:40:01.265 XRUN callback (20).
04:55:41.153 XRUN callback (21).
05:11:24.088 XRUN callback (22).
05:27:07.051 XRUN callback (23).
05:42:53.960 XRUN callback (24).
05:58:30.882 XRUN callback (25).
06:14:11.829 XRUN callback (26).
06:29:57.762 XRUN callback (27).
  /* one reset without xrun */
07:01:26.608 XRUN callback (28).
07:17:07.530 XRUN callback (29).
07:32:52.489 XRUN callback (30).

Likewise, manual clock resets into the future which I tried yesterday did almost always (but sometimes not) cause an xrun on the PC with bad RTC. I did not count, hence don't have a probability.

On the other PC with good RTC, I tried about a dozen manual clock resets into the future (interleaved with about the same number of clock resets into the past), of which none caused an xrun. So the race condition does either not exist on the good PC or is at least far less probable than on the bad PC.

07/16/12 05:28:40 changed by jwoithe

Thanks again for the testing. I'm a little unsure of how to proceed from here. As far as I can tell, unmodified r2178 (and presumedly later) works fine on the PC with a good RTC - right? Then on the bad PC, unmodified r2178 fails to start while the modified variant starts but still gives xruns and the like. Please correct where I've misunderstood.

Assuming the above is vaguely accurate it does mirror what I'm seeing here. On my test PC (which has a reasonable RTC) I am seeing no practical difference between occasions where CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW are used. This doesn't surprise me because as far as I can tell both clock sources should be more than capable of supporting FFADO start up. My understanding of these two sources is that for FFADO CLOCK_MONOTONIC_RAW is probably the appropriate one to use. In any case, I am puzzled why with a bad RTC there is such a dramatic difference between the two.

My present theory is that perhaps CLOCK_MONOTONIC_RAW - being uncorrected in every way - may be so far off that the values used to initialise the DLLs result in the DLLs not locking. I would have expected them to eventually, but maybe we don't wait long enough for the cases where there's a dramatic difference between theory and practice. Or perhaps it doesn't have a regular period due to some hardware problem (which is possibly related to the inaccurate RTC timekeeping).

The start up fail log you attached produces a lot of negative time steps in the cycle timer helper. I find the existence of these interesting but don't know what to make of it yet. However, I do suspect that these negative time steps may be at least partly to blame for the "reconstructed CTR counter discrepancy" messages you get in abundance later on.

07/16/12 23:44:48 changed by jwoithe

stefan: I've done some more thinking about all this. Regarding the difference between CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW, I think I've identified one possible way where the behaviour could be different on your "bad-RTC" PC. On this PC, CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW clearly do run at different rates, and the difference is significant. In FFADO's timing code, a request to sleep when CLOCK_MONOTONIC_RAW is in use will in fact use CLOCK_MONOTONIC (see comment 20). For most systems the speed difference between the two clocks will be very small, so the distinction is largely irrelevant. However, on systems like your bad-RTC PC, I can imagine that the speed differences could cause some issues. I'm not convinced it's the sole reason for the negative time steps we've noted in the logs, but it could at least have something to do with the overall stability issues and explain the apparent differences in behaviour we see when the different clocks are used.

In light of this I intend to run some more tests myself in the next few days. Hopefully I'll have a fresh revision to test in reasonably short order.

07/17/12 06:06:12 changed by jwoithe

Further to comment 60, I've checked in r2186. In light of the realisation that rate differences between CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW could cause inconsistent sleeps when times are taken from the latter but the former is used for sleep timing, we'll try using CLOCK_MONOTONIC for everything. The ABS_TIMER mode is also utilised where appropriate if flagged as available at compile time.

Stefan: based on tests here and earlier ones done by yourself I expect that this change should permit correct operation on your "good-RTC" PC. The behaviour on the "bad-RTC" PC is unlikely to be different to that which you detailed in comment 58. However I think we're approaching a solution which may allow this ticket to be closed off.

Regarding the xruns flagged on the bad-RTC PC, I am beginning to think that we've done almost all we can do to deal with PCs whose RTCs are so badly out of whack. While you're still getting xruns flagged, it seems that FFADO is at least able to continue now. As to why these xruns are triggered I'm pretty much out of fresh ideas. A race condition during the setting of the clock (as you suggested) seems to be the most likely candidate at the moment, and if that's the case there's probably not a whole lot more FFADO can do.

As always, thoughts and test results are welcome.

07/18/12 07:04:04 changed by stefanr

I have r2186 up and running for 18 hours now.

1st PC with bad RTC, "jackd -dfirewire -p128 -n3 -v3", DICE driver:

  • starts and runs stable
  • backwards jumps of CLOCK_REALTIME (by date -s or the likes): no xrun, no audible glitch
  • forwards jumps of CLOCK_REALTIME: cause handled xrun with audible gap with a probability of circa 90%

2nd PC with good RTC, "jackd -dfirewire -p128 -n3 -v3", BeBoB driver:

  • starts and runs stable
  • backwards or forwards jumps of CLOCK_REALTIME: no xrun, no audible glitch

(Note to self: Have yet to test the PC with bad RTC booted without ntpd, so that CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW are theoretically identical, and then with ntpd so that the former gets faster than the latter but still hs got only an offset in the order of seconds rather than minutes. But I guess these remaining handled xruns are more likely connected to the speed difference rather than the offset between the monotonic clocks. Even if this will be confirmed, it still wouldn't quite explain why the clock jump related race only affects the bad PC, not the good PC.)

Jonathan, thanks much for your work on this.

Folks with sane hardware can now set their clock without crashing jackd.

Folks with bad hardware like mine now don't get mysterious unhandled xruns = crashes anymore (whose connection to ntpd would not be obvious because it seemed to become standard for ntpd to do clock resets without syslog message, unlike in earlier times when I noticed the problem). Instead, there are now handled xruns in regular intervals (here: periods of a bit more than 15 minutes), which at least point users' or supporters' suspicion toward system services like ntpd, so that the respective workaround to disable the service is more likely to be found.

(follow-up: ↓ 64 ) 07/18/12 20:11:28 changed by jwoithe

Thanks for this latest round of testing. I think we're converging on a solution.

Note that the current code is using CLOCK_MONOTONIC for everything at present. The relative speeds of CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW are therefore not important because the latter isn't being used. Your suggested test would still be worth doing but only if ieee1394service was hacked to use CLOCK_MONOTONIC_RAW for timekeeping. This test would then give us an indication as tp whether the stability issues with CLOCK_MONOTONIC_RAW in use come about due to clock speed differences (this being the current theory).

At present I can't quite see where the remaining xruns are coming from. With CLOCK_MONOTONIC in use everywhere they can't be blamed on clock speed differences, and as you said the issue is only occuring on the PC with the bad RTC. About the only thing I can think of is that the two systems (the "good-RTC" and "bad-RTC" PCs) are using different clock sources at the kernel level (for example: one using TSC and one using hpet), and that the two systems therefore respond slightly differently to forward time jumps.

With these caveats I'm inclined to suggest that this ticket can be closed as "fixed". Does anyone see a need for it to remain open now?

(in reply to: ↑ 63 ) 07/19/12 11:02:39 changed by stefanr

Replying to jwoithe:

Note that the current code is using CLOCK_MONOTONIC for everything at present.

Yep.

The relative speeds of CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW are therefore not important because the latter isn't being used.

Well, I thought it shouldn't be important because it shouldn't be used. :-)

At present I can't quite see where the remaining xruns are coming from. With CLOCK_MONOTONIC in use everywhere they can't be blamed on clock speed differences, and as you said the issue is only occuring on the PC with the bad RTC. About the only thing I can think of is that the two systems (the "good-RTC" and "bad-RTC" PCs) are using different clock sources at the kernel level (for example: one using TSC and one using hpet), and that the two systems therefore respond slightly differently to forward time jumps.

O, right. I forgot that there may be quite a few other factors. Regarding clock sources:

# bad PC
/sys/devices/system/clocksource/clocksource0/available_clocksource:tsc hpet acpi_pm 
/sys/devices/system/clocksource/clocksource0/current_clocksource:tsc
# good PC
/sys/devices/system/clocksource/clocksource0/available_clocksource:hpet acpi_pm 
/sys/devices/system/clocksource/clocksource0/current_clocksource:hpet

So I should test the bad PC with hpet as clock source.

With these caveats I'm inclined to suggest that this ticket can be closed as "fixed".

I for one agree.

07/19/12 23:25:19 changed by jwoithe

  • status changed from new to closed.
  • resolution set to fixed.

Thanks for the exhaustive rounds of testing Stefan.

For the purposes of bug tracking I'll close this as "fixed" but comments can still be added after closure. It would certainly be interesting to know whether changing the clock source on the "bad" PC has any effect on the remaining xruns (or on any other aspects of the clocks operation).

09/10/12 11:02:36 changed by stefanr

A minor update:

As you may remember, my "bad" PC went from unrecoverable xrun at clock reset to recoverable xrun at clock reset into the future (and no xrun at clock reset into the past). This was observed while the bad PC had already gathered an uptime in the order of multiple weeks.

Meanwhile the PC was power-cycled once and rebooted a few times since then. Now the bad PC behaves like intended and like my "good" PC behaves: No xrun, neither at clock resets into the future nor into the past.