Seminar in Computational Linguistics

  • Date: –15:00
  • Location: Engelska parken, Room 9-3042
  • Lecturer: Margita Šoštarić
  • Contact person: Ali Basirat
  • Seminar

Co-reference phenomena in English and Croatian

Handling co-reference phenomena is a well-known stumbling block in the construction of MT systems. Various approaches to co-reference resolution have been proposed, both as standalone systems (Clark and Manning, 2016; Lee et al., 2017; Versley, 2008) and as attempts to integrate co-referential information into MT systems in order to improve their performance (Miculicich Werlen and Popescu-Belis, 2017; Hardmeier and Federico, 2010; Jean et al., 2017; Guillou, 2012; Hardmeier et al., 2013; Le Nagard and Koehn, 2010). As these phenomena differ greatly between individual languages, the subject is also of interest from the point of view of linguistics and translation studies. Here we investigate the matter for the language pair English-Croatian, because these structurally quite different languages also differ in how they make use of co-referential information. We start from an analysis of corpus data, following the approach laid out by Lapshinova-Koltunski and Hardmeier (2017). Much like those authors, we are interested in whether there are specific patterns in the use of co-referential elements that should be kept in mind in practical applications such as machine translation. The extracted phenomena are analyzed in three corpora covering different domains and text types: legal texts (the DGT translation memory), newspaper articles (the SETimes corpus) and translations of prepared speeches (the TED talks corpus). Based on frequency and our initial manual analysis, a number of phenomena in both languages are selected and examined in more detail. We extend the research by training an SMT system and a number of slightly differing NMT systems to gain insight into how the baseline systems handle the selected phenomena in practice. We devise a classification of errors for the analysis of MT output and compare overall system performance (based on BLEU scores) with a manual evaluation of system performance that focuses only on the observed phenomena.