Detecting Rhetorical Figures Based on Repetition of Words: Chiasmus, Epanaphora, Epiphora
- Plats: Humanistiska teatern, Thunbergsvägen 3H, Uppsala
- Doktorand: Dubremetz, Marie
- Om avhandlingen
- Arrangör: Institutionen för lingvistik och filologi
- Kontaktperson: Dubremetz, Marie
This thesis deals with the detection of three rhetorical figures based on repetition of words.
This thesis deals with the detection of three rhetorical figures based on repetition of words: chiasmus (“Fair is foul, and foul is fair.”), epanaphora (“Poor old European Commission! Poor old European Council.”) and epiphora (“This house is mine. This car is mine. You are mine.”). For a computer, locating all repetitions of words is trivial, but locating just those repetitions that achieve a rhetorical effect is not. How can we make this distinction automatically?
First, we propose a new definition of the problem. We observe that rhetorical figures are a graded phenomenon, with universally accepted prototypical cases, equally clear non-cases, and a broad range of borderline cases in between. This makes it natural to view the problem as a ranking task rather than a binary detection task. We therefore design a model for ranking candidate repetitions in terms of decreasing likelihood of having a rhetorical effect, which allows potential users to decide for themselves where to draw the line with respect to borderline cases.
Second, we address the problem of collecting annotated data to train the ranking model. Thanks to a selective method of annotation, we can reduce by three orders of magnitude the annotation work for chiasmus, and by one order of magnitude the work for epanaphora and epiphora. In this way, we prove that it is feasible to develop a system for detecting the three figures without an unsurmountable amount of human work.
Finally, we propose an evaluation scheme and apply it to our models. The evaluation reveals that, even with a very incompletely annotated corpus, a system for repetitive figure detection can be trained to achieve reasonable accuracy. We investigate the impact of different linguistic features, including length, n-grams, part-of-speech tags, and syntactic roles, and find that different features are useful for different figures. We also apply the system to four different types of text: political discourse, fiction, titles of articles and novels, and quotations. Here the evaluation shows that the system is robust to shifts in genre and that the frequencies of the three rhetorical figures vary with genre.