Dependency Matching

There was a problem with the dependency matching algorithm which resulted in some bad matches ranking more highly than they should. The algorithm allowed a single dependency in the input to match several dependencies in an example, provided they all had the same lemmas and relationships. For example, with the input:


amendments nos 10 , 11 and 15 are acceptable in principle .

...
(conj ,_3 11_4)
(conj ,_3 nos_1)
(ncmod _ nos_1 amendments_0)
(conj and_5 11_4)
...

A highly ranked match would be:


the commission welcomes , above all , amendments nos 1 , 2 , 7 , 8 , 9 , 10 , 11 and 13 , which apply to the committee procedures .

...
(conj ,_20 11_21)
(conj ,_18 11_21)
(conj ,_16 11_21)
(conj ,_14 11_21)
(conj ,_12 11_21)
(conj ,_10 11_21)
(conj ,_6 11_21)
(conj ,_10 nos_8)
(conj ,_6 nos_8)
(conj and_22 11_21)
(ncmod _ nos_8 amendments_7)
...

The match is ranking far higher than it should (based on the number of matching dependencies) due to a single dependency in the input matching several dependencies in the output.

This is fixed in rev 177 by modifying CCGSentence.java to keep track of which input dependency a connected match covers as it is being built up. Multiple matches which cover the same input dep cannot be contained in the same connected tree.

This fix led to a significant performance improvement, moving the BLEU score from 23.20 to 24.50.