I have never reported results on CALLHOME because of the (apparent) lack of an official train/validation/test split (or at least a validation/test split).
What experimental protocol does BUT use for reporting results?
Validation on part1, test on part2?
Validation on part2, test on part1?
Both?
cc @fnlandini