Ticket #370 (closed bug: fixed)

Opened 1 year ago

Last modified 1 year ago

RME FF800 doesn't start from rev 2310

Reported by: juanramon Assigned to: jwoithe
Priority: major Milestone: FFADO 2.2
Component: devices/rme Version: FFADO SVN (trunk)
Keywords: Cc:
The device the bug applies to:

Description

I've tried every posterior revisions (2306 and up) and I cannot start jackd with my RME FF800 ( it works OK with Konnekt48) here is jackdbus message when it crashes

monon@userpc:~$ jackdbus auto no message buffer overruns no message buffer overruns no message buffer overruns 1371576536849970: (ffado.cpp)[ 92] ffado_streaming_init: libffado 2.1.9999-2331M built Jun 18 2013 01:24:45 *** glibc detected *** jackdbus: double free or corruption (!prev): 0x00007effd4015920 ***

jackdbus: malloc.c:2451: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) ((unsigned long) (old_size) >= (unsigned long)((((builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed. ViolaciĆ³n de segmento (`core' generado)

Attachments

jack.log (87.2 kB) - added by juanramon on 06/20/13 08:38:07.
diag-ffado-rev2010-commented.out (7.9 kB) - added by juanramon on 06/21/13 07:13:01.
jack-ffado-rev2010-commented.log (1.1 MB) - added by juanramon on 06/21/13 07:13:49.
jackd-debian-rev2336.log (80.7 kB) - added by juanramon on 06/21/13 08:09:25.
diag-debian-rev2336.out (7.7 kB) - added by juanramon on 06/21/13 08:09:50.
jack-ubuntu_x86-64_r2338-extract.log (3.7 kB) - added by juanramon on 06/22/13 05:49:14.
jackd-1.log (5.5 kB) - added by juanramon on 06/27/13 05:55:46.

Change History

06/18/13 16:55:44 changed by jwoithe

Thanks for the report.

This is rather interesting for a number of reasons. The fundamental issue seems to be the double-free. The fact that the same FFADO revision works with a different device suggests that there's something that the RME driver is doing which is precipitating this. A further complication is that plain jackd1 does not suffer from any problems that I've been able to detect (jackd1 is what is on my development machine, and I've used all recent revisions with a FF800 without any apparent problems).

From your initial comment I assume that r2305 is in fact ok. Checking r2306, I see that this was a two line patch in the MatrixMixer? pyqt code associated with ffado-mixer. This means that r2306 did not touch the driver code in libffado which is in turn used by jackd. I therefore don't understand how r2306 could result in any behavioural change under any jackd: libffado is unchanged by this revision so the resulting code executed by jackd in r2035 and r2306 should be identical. Even if there's a memory corruption bug in the RME driver which "randomly" stomps on memory, both revisions should behave identically.

Could you post the jackdbus command line you're using and information about the version of jackdbus which is installed on your computer? It might also be helpful to obtain a stack trace from the core file that's being generated by the crash - run "gdb <jackdbus executable including full path> core" and use the "where" command.

Finally, if that doesn't turn anything obvious up there may be some clues in the output from FFADO when verbose messages are enabled. This can be done most easily by adding "-v 6" to the end of the jackdbus command line (anywhere after the "-d firewire" should be fine).

06/20/13 08:38:07 changed by juanramon

  • attachment jack.log added.

06/20/13 08:42:42 changed by juanramon

  • summary changed from RME FF800 doesn't start from rev 2305 to RME FF800 doesn't start from rev 2310.

Ok you are right. rev 2309 is the last revision that works for me After compiling rev 2310 I get the output in the attachet file "jack.log"

06/20/13 08:44:10 changed by juanramon

I've changed title

06/20/13 23:27:18 changed by jwoithe

Thanks for the additional information. Having a behavioural change in r2310 is more believable since the driver did actually change in that revision. Having said that, the change is not in the streaming part of the driver. The most significant change in r2310 is the reading of the flash memory as part of the startup sequence. So maybe something's going wrong with that on your system for some reason.

To see whether this idea is possible I'd like you to make some changes to src/rme/fireface_hw.cpp. We'll work from r2310 since that is the first revision to break. Please comment out the "if" clause from line 82 to 94 inclusive. Rerun "scons", then "scons install" and retest. This prevents FFADO reading from the device flash on start up and should result in a behaviour which is very similar to r2309. If this change makes things work then it almost certainly indicates that your system is choking on the flash read.

As to why this might be I don't presently understand: all revisions since r2309 have worked for me on my development system. To assist me in pinpointing possible causes, could you tell me what distribution you're running (name and version) and whether you're running a 32 or 64 bit environment? Attaching the output of ffado-diag might also be useful for future reference.

An interesting observation is that with the exception of the "bandwidth" field, the values read from flash as reported in your log file seem quite sensible. On the grounds of that it seems unlikely to be related to the reading of the flash. Furthermore, FFADO falls over only during the startup of streaming; if the flash reading was somehow corrupting memory I'd expect trouble to hit much earlier. Anyway, we'll see.

06/21/13 07:12:40 changed by juanramon

Efectively after comment (/*...*/) this part of code it has worked ok. I attach a jackd log and ffado-diag output.

I'm using a 64 bits ubuntu-studio 12.04 with KXstudio repositories.

I'm going to try original and commented code in my Debian. I'will say something.

06/21/13 07:13:01 changed by juanramon

  • attachment diag-ffado-rev2010-commented.out added.

06/21/13 07:13:49 changed by juanramon

  • attachment jack-ffado-rev2010-commented.log added.

06/21/13 07:47:26 changed by jwoithe

Thanks for the extra information. I will take a closer look at the logs tomorrow (it's getting late here). With those lines commented out we still read the mixer settings from flash so we've ruled those functions out as a source of trouble. We are left only with the issue of reading device configuration from flash and parsing that. It's starting to smell like a subtle memory overflow bug.

The outcome of your tests in Debian will be interesting.

06/21/13 08:09:09 changed by juanramon

Hi. I've tried the last revision (2336) in my Debian 32 bits and it has the same behaivor.

With original code jack fails (see new attached files) If I comment the code you have said, jack works.

In both distros I'm using jack2 with dbus.

06/21/13 08:09:25 changed by juanramon

  • attachment jackd-debian-rev2336.log added.

06/21/13 08:09:50 changed by juanramon

  • attachment diag-debian-rev2336.out added.

06/21/13 17:45:16 changed by jwoithe

Thank you for testing this under 32 bits as well. This at least proves that the problem is not some obscure 64-bit issue within the RME driver code. For what it's worth I still can't reproduce the fault here with my FF800, but while yours continues to malfunction we at least have something to go on.

Regarding jackd-debian-rev2336.log, it appears that this is an unmodified r2336 and unsurprisingly it fails at the end. However, for some reason I'm not seeing any evidence of read_device_flash_settings() lines in the log (I would expect them between the last "addElement" line and the "discovery successful" line. These lines would not be printed if the lines were commented out, so I'm wondering whether there is any chance that this particular log file applies to a modified r2336 (even though the libffado version number reported at the top of the log would normally indicate the modification with an "M" suffix). I was particularly interested to observe the value reported for "limit bandwidth" on your 32 bit system. Of course if r2336 really was crashing with the lines commented out then we're back to the start again, so it's important to clarify this.

Assuming this issue can be reconciled I have another test for you. Starting from a fresh svn revision (either 2310 or the latest - it doesn't matter), comment out only line 92 in fireface_hw.cpp. This will leave the flash read call in place, but will omit the configuration write-back to the device. What I'm hoping to show is whether the problem is triggered by the flash read call or the configuration of the device based on what has been read from flash.

06/21/13 18:40:12 changed by juanramon

Yes, with this lines commented it works in both systems 32 and 64 bits. I'm going to comment line 92 in fireface_hw.cpp in both system aswell. I will tell.

Thanks for your effort.

06/21/13 18:59:02 changed by juanramon

As you have suspect that line (92) was the impediment for jack starts. Just only commenting it jack has strated fine.

If you need some more info (logs, outputs, etc) say me and I'll attach here

06/22/13 01:09:13 changed by jwoithe

It's very odd that removing the call to set_hardware_params() would resolve the segfault because that function doesn't really write to any data structures at all. I'm therefore thinking that we're dealing with memory corruption which is not necessarily related to the actual act of reading the flash and writing a configuration to the device. The effect of the corruption just happens to be different if we execute a certain code path.

Consequently I've returned to my earlier observation from your first log file that the "limit bandwidth" flag (as read from flash) contained a very random looking number, and one which bore no resemblance to the values which would normally be expected. Since all the other flash values look sensible there's no obvious reason why this particular flag would be bogus - it seems that this really is the value being read from flash. Anyway, I've taken a look at the consequences of this strange value on the code and there is a fairly serious one: the number of channels is calculated to be 12, but a full set of channel ports are created. This means that the requested size of the data packets (being based on the number of channels - 12) is too small for the number of ports created. This will almost certainly cause memory corruption within the transmit streaming code and a segfault is not all that surprising.

The reason I don't see this is that on my system the limit bandwidth returned from flash is one of the expected values. The only significant difference I can see between our systems at present is the hardware version reported by the RME: yours reports 0x24d while mine gives 0x242. It would be unlikely if this was the root cause though.

To this end, the latest svn revision (r2338) contains code to sanity check this bandwidth flag. If my theory is correct, you should be able to run an unmodified r2338 and not suffer segfaults. Could you please try this revision and report back what happens. If your "-v 6" output includes a "bogus firewire bandwidth limit flag" warning line (regardless of whether the fault is fixed) I'd like to see it. You should find it in the vicinity of a bunch of "read_device_flash_settings" log entries.

If r2338 does not cure the segfault, could you please post a fresh log (I only need one, so take your pick between your 32 and 64 bit systems)?

06/22/13 05:48:52 changed by juanramon

After compile and try rev 2338 there is no "bogus firewire bandwidth limit flag" warning line. And jack starts and works OK with FF800. I attach an extract of significant lines of jack log

06/22/13 05:49:14 changed by juanramon

  • attachment jack-ubuntu_x86-64_r2338-extract.log added.

06/23/13 00:02:02 changed by jwoithe

Thanks for the fresh test - the result is encouraging. The lack of the warning line indicates that during this latest test the bandwidth limit field was in fact read correctly from the flash (the latest log reports it having a value of 0, compared to 2117862543 in the first log). The fact that ffado/jackd worked correctly in this latest test goes a long way towards proving that the bogus flag was the root cause of the original problem. What remains unresolved is why the bogus flag value was reported in the first place. It may have been a transient problem and if so we may never get to the bottom of it.

In any case, could you please checkout r2339? Compared to r2338 it contains some minor code cleanups around the fix added in r2338. Test jackd startup with this new revision a number of times, each time checking to see if the warning appears. This will at least give us an idea whether the strange flag value was a transient issue or if it's merely intermittent. To save looking through the extensive "-v 6" logs, you could pipe stdout to grep with something like

  jackd ... |& grep read_device_flash_settings

If r2339 seems to check out ok after repeated testing, I would be inclined to close this ticket as resolved regardless of whether the warning is shown at all during the tests. At this stage I'm fairly confident that r2338 has addressed the fault but if the original problem is observed again in future this ticket can always be reopened.

06/27/13 05:55:26 changed by juanramon

I've tried to start jackd 7 times after reboot my PC 3 times. Just the first time after each PC reboot I get the output than you can see in the atachhed file. The rest of times, jackd starts without warnings about "read_device_flash_settings".

06/27/13 05:55:46 changed by juanramon

  • attachment jackd-1.log added.

06/27/13 19:30:52 changed by jwoithe

  • milestone changed from FFADO 3.0 to FFADO 2.2.

Thanks for the followup testing. Everything you've said makes sense. The reason you only see the read_device_flash_settings output after a reboot is that for subsequent runs the settings are almost certainly coming out of a shared memory cache. This means that ffado adopts to those rather than re-reading the flash contents. This means, for example, that any temporary changes you might have made to a control will be preserved across jackd and ffado-mixer invocations even when the settings have not been explicitly stored in the device flash.

What's also clear here is that we're no longer seeing unexpected bandwidth flag values returned by the device when flash is read. If such a bogus value were to reappear I am confident that ffado will correctly deal with it now. At the same time, it seems that the strange value which caused your initial crash was due to some condition within the FF800 which has since been corrected. On the basis of this I'd like to mark this ticket as "fixed". Are you comfortable with that based on your subsequent test results?

06/30/13 03:02:23 changed by juanramon

Yes, it finaly works and I can follow testing and working with it.

The last important issue to resolve is the ticket #363. He he he ;-)

Thank you very much.

06/30/13 03:17:55 changed by jwoithe

  • status changed from new to closed.
  • resolution set to fixed.

Thanks juanramon for confirming that things appear to be ok now. I'll close this as "fixed". If you do observe a recurrence, this ticket could be reopened.

As far as ticket #363 goes: yes, indeed. :-) It hasn't been forgotten, but it's proving difficult to identify just where the varying latency is coming from. It may have something to do with the stabilisation time of the DLL; testing and experimentation continues.

06/30/13 09:06:42 changed by juanramon

I supose is a hard work and I value it a lot. It's just I'm ever impatient to get FF800 working at 100 per cent. Thanks and regards