I started this morning hoping to sort out the issue whereby spans were not being split up correctly. First, I realised that this website isn't backed up at all, so I set up a script to take care of that. Then I got back to work. The issue is that matches are currently split into contiguous spans based only on whether the input words they match are contiguous. But what if the input words are contiguous and the matched words in the example are not? Such a match needs to be split into multiple spans. Further, what if the words are contiguous in both the input and the example, but their order is different?
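To make the two problem cases concrete, here is a minimal sketch of one possible splitting rule. It is not the code in my system, just the most naive rule I can think of: start a new span whenever the next match fails to follow its predecessor by exactly one position in both the input and the example. The `(input index, example index)` pair representation and the name `split_spans` are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (input word index, example word index).
Match = Tuple[int, int]


def split_spans(matches: List[Match]) -> List[List[Match]]:
    """Group matches into spans that are contiguous and in the same
    order in both the input and the example."""
    spans: List[List[Match]] = []
    for match in sorted(matches):  # walk the matches in input order
        if spans:
            prev_in, prev_ex = spans[-1][-1]
            cur_in, cur_ex = match
            # Extend the current span only if both sides advance by one word.
            if cur_in == prev_in + 1 and cur_ex == prev_ex + 1:
                spans[-1].append(match)
                continue
        # A gap or a reordering on either side starts a new span.
        spans.append([match])
    return spans


# Input words 0-4 are contiguous, but the example jumps (1 -> 5)
# and then goes backwards (6 -> 3).
print(split_spans([(0, 0), (1, 1), (2, 5), (3, 6), (4, 3)]))
```

Under this rule, two words that are adjacent on both sides but swapped, such as `[(0, 1), (1, 0)]`, end up as two single-word spans. Whether that is the right treatment for reordering is exactly the open question.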
Clearly, this is a problem that requires a fair bit of thought. My first thoughts turned to what I actually do with the spans, which is translate them. There's a further (and somewhat similar) splitting step here, as a contiguous phrase in the source language may not translate to a contiguous phrase in the target language. Until now, my approach has been to use just the first translation phrase for a span and throw away the rest. This is obviously not great, so I decided to have a quick look at just how bad it is.
I threw in a line of code to report how many target language phrases are thrown away from each span. It's quite a lot! Over the 500 test sentences, 1602 target phrases were thrown away, an average of 0.51 per span. This clearly needs fixing, but it has been an ongoing issue because phrase-based systems don't do well with this kind of one-to-many phrase translation (e.g. 'not' -> 'ne ... pas').
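For reference, the tally amounts to something like the sketch below. The function name `count_discarded` and the list-of-lists input are assumptions for illustration, and the data is a toy example, not my test set.

```python
from typing import Sequence, Tuple


def count_discarded(span_translations: Sequence[Sequence[str]]) -> Tuple[int, float]:
    """Return (total discarded phrases, average discarded per span),
    assuming only the first target phrase of each span is kept."""
    # Every phrase after the first one in a span is thrown away.
    discarded = sum(max(len(phrases) - 1, 0) for phrases in span_translations)
    average = discarded / len(span_translations) if span_translations else 0.0
    return discarded, average


# Toy data: three spans with 2, 1 and 3 target phrases respectively.
toy = [["ne", "pas"], ["le chat"], ["a", "b", "c"]]
print(count_discarded(toy))
```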
I decided to delve back into the Moses code to see if there was anything I could do about it. To my immediate horror I discovered that, in the updates since I last synced my Moses code, the stuff I added to provide the 'linked' XML options had disappeared! Except it hadn't; it had just been moved, which was a big relief, even though it took over an hour to rediscover it. Anyway, looking back at the code got me thinking that maybe the functionality I need is already there, provided by my 'linked' interface...