Log from the debugging of a segfault

The following is a cleaned up log I wrote for myself while debugging a bug. Writing a log while working helps me keep track of the debugging effort in case I’m interrupted (life, sleep, work, etc.). It also requires me to explain all findings to myself in fully spelled out sentences, making my thoughts considerably easier to follow.

In addition to serving as an example of a personal debug log, I hope it can be useful as an introduction to debugging segfaults or other low-level bugs.

How to reproduce

I have three computers running Ubuntu 11.04. Mopidy revision 9c23949 consistently crashes with a segfault on one of them when I do the following:

  1. Start Mopidy.
  2. Connect with an MPD client, e.g. ncmpcpp.
  3. Search for anything, e.g. foo.
  4. Wait less than 10 seconds for the segfault to happen. It always happens directly after a log message from Mopidy’s Spotify backend stating that it is Updating metadata.

Rule out the obvious

I check that I’m actually using the latest versions of the important pieces of software, not just some old version I installed by hand and forgot about.

I try uninstalling pyspotify v1.1+mopidy20110405 installed from my own Debian package, and install the latest revision (c6e2a02) from Git instead. Still the same error.

Nobody has reported the same problem, even though both other developers and users should have been on the same mix of versions for the last couple of months.

Digging in

Luckily I can consistently reproduce the segfault in less than 20 seconds of work. I’m not too familiar with debugging C programs and especially the infamous segfaults, but I’m taking on this fight.

I expect the problem to be in pyspotify, as the rest of the code is either pure Python or more well-tested and broadly used libraries. Also the segfault consistently happens directly after a log message from Mopidy’s Spotify backend.

Antoine Pierlot-Garcin, the new main contributor to pyspotify, says to rebuild pyspotify with CFLAGS="-g -O0". -g will include debug information that gdb will understand in the resulting binary. -O0 will override the -O2 default of the pyspotify build system, and turn off any optimizations to ease the debugging.

Googling “debugging segfaults” yields a nice howto on debugging segfaults.

In summary: Get a core dump, inspect it with gdb.

Possible causes: “There are four common mistakes that lead to segmentation faults: dereferencing NULL, dereferencing an uninitialized pointer, dereferencing a pointer that has been freed (or deleted, in C++) or that has gone out of scope (in the case of arrays declared in functions), and writing off the end of an array.”

Rerunning Mopidy nothing more interesting happens, except the usual segfault. No core dump is produced.

Hum, how to get a core dump? I’ve done it before, but I can’t remember. Google helps.

I try ulimit -c 50000 to get a core dump of maximum 50MB, and then reproduce the segfault.

Run python mopidy -v, connect with ncmpcpp, search for foo, wait less than 10 seconds. Kaboom:

DEBUG    2011-05-23 20:07:02,030 [7659:SpotifySMThread] mopidy.backends.spotify.session_manager
  Metadata updated
  Segmentation fault (core dumped)

Loads the core dump up in gdb:

gdb /usr/bin/python core

gdb complains that the core dump is truncated, and that is should be approximately 180MB.

man bash and search for ulimit. Aha. I can specify unlimited. But I’m not allowed to? Restart shell, run ulimit -c unlimited again. Works.

Rerun Mopidy to get an untruncated core dump.

Analyzing the core dump

Yay! gdb loads debug symbols from all the linked libraries that got them available. Notably, the proprietary libspotify library does not include debug symbols. So what does gdb say?

Core was generated by `python mopidy -v'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fef50dae5f0 in sp_album_add_ref () from /usr/lib/libspotify.so.7
(gdb)

Verifying a hypothesis

According to the libspotify docs for the album subsystem sp_album_add_ref takes an sp_album struct and increases the reference count of the album. Looking back at the list of reasons for segfaults from the segfault debugging howto, it may sound as we increase the reference count of NULL, which obviously isn’t good. Let’s see if that hypothesis is correct…

Looking in the howto for ways to proceed, backtrace reveals the events happening just before the segfault:

(gdb) backtrace
#0  0x00007fef50dae5f0 in sp_album_add_ref () from /usr/lib/libspotify.so.7
#1  0x00007fef51048801 in Track_album (self=0x7fef4d155a50) at src/track.c:72
#2  0x00000000004970ef in call_function (f=<value optimized out>, throwflag=<value optimized out>) at ../Python/ceval.c:3997
#3  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at ../Python/ceval.c:2666

Moving one step up the stack, we get to the pyspotify code:

(gdb) up
#1  0x00007fef51048801 in Track_album (self=0x7fef4d155a50) at src/track.c:72
72	    sp_album_add_ref(album);

Our guess was that we’re increasing the reference count on an album that is null. Let’s see what album actually was…

(gdb) print album
$1 = (sp_album *) 0x0

A pointer to an sp_album struct at the address 0x0, also known as NULL. Hypothesis confirmed.

Finding a solution

Let’s take a look at the pyspotify code in question:

static PyObject *Track_album(Track *self) {
    sp_album *album;

    album = sp_track_album(self->_track);
    Album *a = (Album *)PyObject_CallObject((PyObject *)&AlbumType, NULL);
    sp_album_add_ref(album);
    a->_album = album;
    return (PyObject *)a;
}

We create a pointer to a sp_album struct. Given a Spotify track, we request a reference to the related album, and assign the result to our pointer. Then we create an Album Python object, before we increase the reference count on the sp_album and give the reference to the Python object. Finally, the Python object is returned.

Lets try just returning NULL if we get NULL from sp_track_album:

diff --git a/src/track.c b/src/track.c
index f0d5956..3284029 100644
--- a/src/track.c
+++ b/src/track.c
@@ -68,6 +68,9 @@ static PyObject *Track_album(Track *self) {
     sp_album *album;

     album = sp_track_album(self->_track);
+    if (album == NULL) {
+        return NULL;
+    }
     Album *a = (Album *)PyObject_CallObject((PyObject *)&AlbumType, NULL);
     sp_album_add_ref(album);
     a->_album = album;

If we rebuild pyspotify and try to reproducing the segfault, we now get a familiar Python traceback instead:

/usr/lib/python2.7/threading.py:828: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
  return _active[_get_ident()]
ERROR    2011-05-23 22:33:05,214 [10571:SpotifySMThread] mopidy.utils.process
  error return without exception set
Traceback (most recent call last):
  File "/home/jodal/dev/mopidy/mopidy/utils/process.py", line 21, in run
    self.run_inside_try()
  File "/home/jodal/dev/mopidy/mopidy/backends/spotify/session_manager.py", line 40, in run_inside_try
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/pyspotify-1.1-py2.7-linux-x86_64.egg/spotify/manager.py", line 48, in connect
    self.loop(sess) # returns on disconnect
  File "/usr/local/lib/python2.7/dist-packages/pyspotify-1.1-py2.7-linux-x86_64.egg/spotify/manager.py", line 58, in loop
    self.timer = threading.Timer(timeout/1000.0, self.awoken.set)
  File "/usr/lib/python2.7/threading.py", line 731, in Timer
    return _Timer(*args, **kwargs)
  File "/usr/lib/python2.7/threading.py", line 742, in __init__
    Thread.__init__(self)
  File "/usr/lib/python2.7/threading.py", line 446, in __init__
    self.__daemonic = self._set_daemon()
  File "/usr/lib/python2.7/threading.py", line 470, in _set_daemon
    return current_thread().daemon
  File "/usr/lib/python2.7/threading.py", line 828, in currentThread
    return _active[_get_ident()]
  File "/home/jodal/dev/mopidy/mopidy/backends/spotify/session_manager.py", line 73, in metadata_updated
    self.refresh_stored_playlists()
  File "/home/jodal/dev/mopidy/mopidy/backends/spotify/session_manager.py", line 131, in refresh_stored_playlists
    SpotifyTranslator.to_mopidy_playlist(spotify_playlist))
  File "/home/jodal/dev/mopidy/mopidy/backends/spotify/translator.py", line 62, in to_mopidy_playlist
    if str(Link.from_track(t, 0))],
  File "/home/jodal/dev/mopidy/mopidy/backends/spotify/translator.py", line 34, in to_mopidy_track
    if dt.MINYEAR <= int(spotify_track.album().year()) <= dt.MAXYEAR:
SystemError: error return without exception set

Note that we get a SystemError and not e.g. AttributeError: 'NoneType' object has no attribute 'year' which we would expect if album() returned None. This is because we return NULL from Track_album, which Python considers an “error return”, without setting an exception code first. If we want to return None on failure instead of throwing an exception, we can change track.c as follows:

diff --git a/src/track.c b/src/track.c
index f0d5956..82897de 100644
--- a/src/track.c
+++ b/src/track.c
@@ -68,6 +68,8 @@ static PyObject *Track_album(Track *self) {
     sp_album *album;

     album = sp_track_album(self->_track);
+    if (!album)
+        Py_RETURN_NONE;
     Album *a = (Album *)PyObject_CallObject((PyObject *)&AlbumType, NULL);
     sp_album_add_ref(album);
     a->_album = album;

Rebuilding pyspotify again, and reproducing the error, we now get the expected Python error which we can handle nicely:

File "/home/jodal/dev/mopidy/mopidy/backends/spotify/session_manager.py", line 73, in metadata_updated
    self.refresh_stored_playlists()
  File "/home/jodal/dev/mopidy/mopidy/backends/spotify/session_manager.py", line 131, in refresh_stored_playlists
    SpotifyTranslator.to_mopidy_playlist(spotify_playlist))
  File "/home/jodal/dev/mopidy/mopidy/backends/spotify/translator.py", line 62, in to_mopidy_playlist
    if str(Link.from_track(t, 0))],
  File "/home/jodal/dev/mopidy/mopidy/backends/spotify/translator.py", line 34, in to_mopidy_track
    if dt.MINYEAR <= int(spotify_track.album().year()) <= dt.MAXYEAR:
AttributeError: 'NoneType' object has no attribute 'year'

By this point, I consider the bug squashed and ready to be reported.

The only part left is to decide how to best fix it. E.g. if album() should return None, throw an exception, or maybe return an empty album object. That’s another story.