The three-way hand-off has a problem: there's no way to arrange for the
state of the migration to be unambiguous in case of failure. If the
final "disconnect" message is lost (as in, the destination never
receives it whether it is sent by the sender or not), the destination
has no option but to quit with an error status and let a human sort it
out. However, at that point we can either arrange to have a .INCOMPLETE
file still on disc or not - and it doesn't matter which we choose, we
can still end up with dataloss by picking a specific calamity to have
befallen the sender.
Given this, it makes sense to fall back to a simpler protocol: just send
all the data, then send a "disconnect" message. This has the same
downside that we need a human to sort out specific failure cases, but
combined with --unlink before sending "disconnect" (see next patch) it
will always be possible for a human to disambiguate, whether the
destination quit with an error status or not.
Now that we have 3 mutexes lying around, it's important that we check
and free these if necessary if error() is called in any thread that can
hold them. To do this, we now have flexthread.c, which defines a
flexthread_mutex struct. This is a wrapper around a pthread_mutex_t and
a pthread_t. The idea is that in the error handler, the thread can
check whether it holds the mutex and can free it if and only if it does.
This is important because pthread fast mutexes can be freed by *any*
thread, not just the thread which holds them.
Note: it is only ever safe for a thread to check if it holds the mutex
itself. It is *never* safe to check if another thread holds a mutex
without first locking that mutex, which makes the whole operation rather
pointless.
First, Leaving off the source address caused a segfault in the
command-sending process because there was no NULL check on the ARGV
entry.
Second, while the migration thread sent a signal to the server to close
on successful completion, it didn't wait until the close actually
happened before releasing the IO lock. This meant that any client
thread waiting on that IO lock could have a read or a write queued up
which could succeed despite the server shutdown. This would have meant
dataloss as the guest would see a successful write to the wrong instance
of the file. This patch adds a noddy serve_wait_for_close() function
which the mirror_runner calls to ensure that any clients will reject
operations they're waiting to complete.
This patch also adds a simple scenario test for migration, and fixes
TempFileWriter#read_original.
trouble and into predictable cleanup functions (one for each of serve,
client & control contexts). We use 'fatal' to mean 'kill the thread' and
'error' to mean 'don't kill the thread', assuming some recovery action,
except I don't use error anywhere yet.