Explore great works with their world-wide translations
Project Overview, by Tom Cheesman, Kevin Flanagan and Stephan Thiel
The most important works of world literature, philosophy, and religion have been re-translated over and over again in many languages. The differences between these re-translations can help us understand (1) cross-cultural dynamics, (2) the histories of translating cultures, and (3) the translated works themselves, and their capacity to provoke re-interpretation.
We are building digital tools to help people explore, compare and analyse re-translations. As a ‘telescope array’ examines a celestial object from many slightly different angles, so a Translation Array explores a cultural work through its refractions in different languages, times and places.
The scope is vast. Numerous re-translations exist of Aesop, Aristotle, Hans Christian Andersen, Avicenna, the Bhagavad Gita, the Bible, Buddhist scriptures, Chekhov, Confucius, Dante, Dostoyevsky, Dumas…
Shakespeare makes a great case study. Our experiment here uses 37 of the German translations of Shakespeare’s Othello. That’s a small fraction of Othello translations: see our crowd-sourcing site.
We aim to create useful, instructive and enjoyable experiences for many kinds of text users, and to build cross-cultural exploration networks.
Our report on the work on this site, submitted to our main funder, the AHRC, in October 2012, is attached here.
The backbone of this site is Ebla, the corpus store, and Prism, the segmentation and alignment tool – both developed by Kevin Flanagan. See section 4 below.
The navigation and interaction interfaces, designed by Stephan Thiel and Sebastian Sadowski (Studio Nand), retrieve and visualize text data and mathematical data from the segmented, aligned corpus. See sections 5 and 6 below.
Another exploratory interface, ‘ShakerVis’, has been developed by Zhao Geng and Robert Laramee, using a sample of our Othello texts (seven speeches in ten versions). Currently there is no online installation. See section 6 below (video). Further documentation is here.
To stock this Translation Array with texts, we:
To create a Translation Array, a base text must be aligned with numerous versions in a given language. Some of those alignments may be straightforward, and others, less so. This application provides tools for doing this and for working with the results. It is designed to be as generally useful as possible. The user can define any given base text and align it with an arbitrary number of versions. Base text and versions are referred to here as documents. The base text and set of versions of it are together referred to as a corpus.
The kinds of relationship that may exist between a text and a translation (or version) of it are complex, and their definitions contested. Alignment refers to dividing up a text and a translation somehow into parts and drawing correspondences between the respective parts. This may lead to a straightforward, sequential, one-to-one correspondence between (say) sentences in a text and their renderings in a translation. However, with literary translation in particular, the correspondences are often much less straightforward.
The software powering this website has two main components:
Areas of interest in a document are demarcated using segments, which also can be nested or overlapped. Each segment can have an arbitrary number of attributes. For a play these might be ‘type’ (with values such as ‘Speech’, ‘Stage Direction’), or ‘Speaker’ (with values such as ‘Othello’, ‘Desdemona’), and so on.
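A minimal sketch of what such a segment record might look like (the class and field names here are illustrative, not Ebla's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical segment record: a span of character offsets in a document,
# plus an arbitrary dictionary of attributes.
@dataclass
class Segment:
    start: int                      # character offset where the segment begins
    end: int                        # character offset where it ends (exclusive)
    attributes: dict = field(default_factory=dict)

document = "OTHELLO. It is the cause, it is the cause, my soul."
speech = Segment(0, len(document), {"type": "Speech", "speaker": "Othello"})

# Segments are just offset pairs, so they may nest or overlap freely.
nested = Segment(9, 25, {"type": "Phrase"})
```

Because segments are stored as offsets rather than inline markup, any number of them can coexist over the same stretch of text.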
In a similar way to CATMA, segment positions are stored as character offsets within the document. (Unlike in CATMA, however, texts in Ebla can be edited without losing this information.) Uploaded documents can contain markup, so HTML documents can be stored (with some sanitisation performed). When uploaded HTML documents are retrieved from Ebla and displayed by Prism, their HTML markup is preserved, providing a WYSIWYG rendering of the document during segmenting, aligning or visualizing. Because segments can fall anywhere within an HTML document structure, when HTML formatting is to be applied to them for display within Prism, Ebla applies a recursive ‘greedy’ algorithm to add the minimum additional tags required to achieve the formatting without breaking the document structure. If only part of a document is to be displayed – say, a section from the middle – Ebla ensures that any tags required from outside that part are appended or prepended, so that the partial content rendered is both valid and correctly formatted.
After segmentation, documents can be aligned using an interactive WYSIWYG tool. Attribute filters can be applied to select only segments of interest, and limited ‘auto-align’ functionality can be used to expedite the process.
Ebla can be used to calculate different kinds of variation statistics for base text segments, based on aligned corpus content. These can potentially be aggregated for more coarse-grained use. The results can be navigated and explored using the visualization functionality in Prism. However, translation variation is just one of the corpus properties that could be investigated. Once aligned, the data could be analysed in many other ways.
Interface design should provide simple ways to explore complex data, so as to support research, but also appeal to non-research users who are interested in important cultural texts. The design work demonstrated here consists of experimental approaches, developed in an ongoing dialogue with linguists and software engineers.
The designs offer high-level visualizations of the corpus and of the structures of versions, as well as text-based views.
The ‘German Othello Timemap’ provides an interactive overview of the corpus metadata.
‘Alignment Maps’ are high-level structure visualizations, based on segmentation and alignment information. They make it possible to find comparative patterns, and identify texts of special interest.
For text-based views, readability is a key aspect. The ‘Parallel View’ (e.g. for Schiller and Voss, 1805) offers visualization tools to simplify navigation in the base text and a selected version. Filter and sort functions, using segment attributes, speed up the search for a specific passage in one or more versions, or the comparison of versions of speeches by a selected speaker. The ‘Parallel View’ could be extended to multiple parallel texts, with further navigational aids.
The ‘Eddy and Viv’ view enables a new kind of understanding of the base text, which is visually ‘annotated’ on the basis of metrics derived algorithmically from the translations, while the varying versions of any selected segment are displayed (see section 6).
Another aspect of text-based views, which we aim to implement in future releases, is the ability to edit text in a new collaborative way.
If we quantify differences among multiple translations, segment by segment, then we can explore how different parts of a work provoke different responses among translators as a group, and we can also devise new kinds of navigational aids for exploring a large corpus of translations. This is the basic idea behind the ‘Eddy and Viv’ view (‘E&V’). Eddy and Viv are both calculated by algorithms (explained below).
In the ‘Eddy and Viv’ view, a colour underlay annotates the base text, segment by segment. This shows which segments provoke most and least variation among translations. It means you can read Shakespeare through his German translators, even if you can’t read German.
The colour underlay represents ‘Viv’ values for base text segments. These values are derived from the ‘Eddy’ values calculated for each segment version. ‘Eddy’ values measure how different one segment version is from all the others. So the more variation there is among versions, the higher the Eddy values are, and the higher the Viv value is.
In the ‘Eddy and Viv’ view, when you select a base text segment, you see all the segment translations displayed. They are ranked in order from most typical to most unusual: that is, from lowest to highest Eddy value. Machine translations into English are also supplied. You can assess the variation (up to a point!) without knowing German.
Eddy values can also be used to visualize higher-level differences among translations. Our ‘Eddy History’ graph shows average Eddy values for each version, on a timeline: this gives an overview of translators’ general behaviour over time. Our ‘Eddy Variation Overview’ graph shows Eddy values for all segments in all versions: this enables detailed comparison of version variation.
Geng and Laramee’s work also uses Eddy analysis to explore how versions vary. Their ‘ShakerVis’ interface presents Eddy values in scatterplots and parallel coordinates, and enables us to select and compare groups of segments and versions. Word lists are presented in heatmaps, and the use of words can be tracked across versions.
Screencast of the interface and visualisations developed by Geng and Laramee
Eddy is a measure of how much one translation of a segment differs from all the other translations of it, in the same language, calculated in terms of the words used. Viv is a measure of how much the set of Eddys associated with one base text segment differs from the sets of Eddys for other base text segments.
(To put it differently: Eddy is the disturbance created when the flows of cultural histories intersect with the flows of individual re-translation decisions. Viv is the energy in the relation between the translated text and all its translations: all those disturbances, cultural changes and individual decisions.)
Eddy and Viv are not measures of properties of translations. They are measures of differences among translations. The purpose of implementing Eddy and Viv is to enable us to create new navigational aids for exploring such differences.
We are experimenting with different mathematical formulae for Eddy and Viv, based on methods from information retrieval and stylometry.
The algorithms applied here operate on a corpus of texts, consisting of a base text and a number of versions. But these algorithms process only the text within those version segments which are aligned with one base text segment. They do not compare the version segment texts in any way with the base text, nor with the rest of the text of the versions.
The algorithms begin by tokenising each of the version segments, to produce a list of words used and their frequencies (total number of occurrences) for each version. These are referred to below as 'version word lists'. These lists are combined to produce a list of all words found across all versions, with their respective overall frequencies. This is referred to below as the 'corpus word list'.
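The tokenisation and word-list steps can be sketched as follows; the tokeniser and the sample German lines here are illustrative assumptions, not the project's actual rules or data:

```python
from collections import Counter
import re

def tokenise(text):
    # Naive tokeniser for illustration: lowercase alphanumeric runs.
    return re.findall(r"\w+", text.lower())

# Three invented German renderings of one base text segment:
versions = [
    "Was sagt ihr?",
    "Was sagt ihr?",
    "Was meint ihr dazu?",
]

# One 'version word list' (word -> frequency) per version ...
version_word_lists = [Counter(tokenise(v)) for v in versions]

# ... combined into the 'corpus word list' with overall frequencies.
corpus_word_list = sum(version_word_lists, Counter())
```

Summing `Counter` objects adds frequencies word by word, which is exactly the combination step described above.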
Note that we do not exclude common function words (‘stopwords’), nor do we stem or lemmatise the texts or analyse compound words. Words of ‘low’ semantic value, and inflections, can be important in the differences between small-scale segment translations. However, we will experiment with these kinds of processing in future work.
Each word in the corpus word list is considered as representing an axis in N-dimensional space, where N is the length of the corpus word list. For each version, a point is plotted within this space whose co-ordinates are given by the word frequencies in the version word list for that version. (Words not used in that version have a frequency of zero.) The position of a notional 'average' translation is established by finding the centroid of that set of points.
An initial 'Eddy' variation value for each version is calculated by measuring the Euclidean distance between the point for that version and the centroid.
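A sketch of this centroid-and-distance calculation, reusing word lists of the kind built above (illustrative data and naming, not the project's implementation; `math.dist` computes Euclidean distance):

```python
import math
from collections import Counter

def eddy_values(version_word_lists):
    # Each distinct word is an axis; each version is a point whose
    # co-ordinates are its word frequencies (zero for unused words).
    vocab = sorted(set().union(*version_word_lists))
    points = [[wl.get(w, 0) for w in vocab] for wl in version_word_lists]
    # The notional 'average' translation is the centroid of the points.
    centroid = [sum(axis) / len(points) for axis in zip(*points)]
    # Eddy: Euclidean distance from each version's point to the centroid.
    return [math.dist(p, centroid) for p in points]

word_lists = [Counter({"was": 1, "sagt": 1, "ihr": 1}),
              Counter({"was": 1, "sagt": 1, "ihr": 1}),
              Counter({"was": 1, "meint": 1, "ihr": 1, "dazu": 1})]
eddies = eddy_values(word_lists)
# The two identical versions get equal Eddy values; the outlier scores higher.
```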
The values produced provide a meaningful basis for comparison of versions within that corpus (that is, comparing version segments aligned with a given base text segment). However, their magnitude is affected by the lengths of the texts involved. In order to arrive at variation values that allow (say) translation variation for a long segment to be compared with translation variation for a short segment, a normalisation needs to be applied to compensate for the effect of text length. To establish a suitable normalisation, we calculated variation for a large number of base text segments of varying lengths, then plotted average Eddy value against segment length. We found a logarithmic relationship between the two, and arrived at a normalisation function that gives an acceptably consistent average Eddy value regardless of text length.
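One plausible shape for such a normalisation, purely as an illustration: the report does not give the fitted function, so the logarithmic divisor below is an assumption, not the project's formula.

```python
import math

def normalised_eddy(raw_eddy, segment_length):
    # Hypothetical length compensation: if average raw Eddy grows roughly
    # logarithmically with segment length, dividing by a logarithm of the
    # length flattens that trend.
    return raw_eddy / math.log(segment_length + 1)
```

The fitted constants in the real normalisation function would come from the plot of average Eddy against segment length described above.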
The Viv value for a base text segment is the average Eddy value of its aligned version segments. (This combination of Eddy and Viv is our ‘Metric A’.)
In an alternative metric (‘Metric B’), a ‘distinctiveness value’ for each word in the corpus word list is derived by dividing the number of versions in the corpus by the word’s corpus frequency. This produces a high value for rarely-used words, and a low value for commonly-used words. The Eddy value for a version is the sum of these values for all the words in that version word list. The Viv value for a base text segment is again defined as the average Eddy value of its aligned version segments. (See TC, ‘Translation Sorting’.)
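This distinctiveness-based metric can be sketched as follows (illustrative data; summing over the distinct words of each version word list, rather than weighting by frequency, is an interpretive assumption here):

```python
from collections import Counter

def distinctiveness_metric(version_word_lists):
    n = len(version_word_lists)
    corpus_word_list = sum(version_word_lists, Counter())
    # Number of versions divided by corpus frequency: high for rare words,
    # low for words most translators share.
    distinctiveness = {w: n / f for w, f in corpus_word_list.items()}
    # A version's Eddy is the sum over the words in its word list;
    # Viv is again the average Eddy across the aligned versions.
    eddies = [sum(distinctiveness[w] for w in wl) for wl in version_word_lists]
    return eddies, sum(eddies) / n

word_lists = [Counter({"was": 1, "sagt": 1, "ihr": 1}),
              Counter({"was": 1, "sagt": 1, "ihr": 1}),
              Counter({"was": 1, "meint": 1, "ihr": 1, "dazu": 1})]
eddies, viv = distinctiveness_metric(word_lists)
```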
In a third metric (‘Metric C’), Eddy is calculated as in Metric A, but Viv is calculated as the standard deviation of a base text segment’s associated Eddy values, rather than their average. This takes account of the distribution of differences in the corpus, i.e. the varying numbers of identical segment versions.
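The standard-deviation version of Viv can be sketched as follows; the choice of population standard deviation (`pstdev`) rather than the sample variant is an assumption, as the report does not specify which is used:

```python
from statistics import pstdev

def viv_stddev(eddy_values):
    # Viv as the spread (standard deviation) of the Eddy values of the
    # version segments aligned with one base text segment.
    return pstdev(eddy_values)

# Identical translations everywhere -> zero Viv; divergence -> positive Viv.
viv = viv_stddev([3.5, 3.5, 8.0])
```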
If two translations of a sentence use the same words, but in a different order, or with different punctuation (potentially making a major difference to the meaning), they have the same Eddy value. Conversely, orthographic differences with no semantic significance lead to different Eddy values: this includes contractions, different spellings, and different renderings of compound words. But standardisation of orthography would be an enormous task, and would efface real textual differences.
In a short segment, when we compare 37 translations using Metric A, a unique version can have the lowest Eddy value, if its word list happens to place it closest to the notional ‘average’ version. Conversely, ten or more versions may have identical wording, but nevertheless share a surprisingly high Eddy value, if several other versions are closer to the calculated ‘average’. See for example the First Officer’s line “Here is more news”, or Roderigo’s line “What say you?” These kinds of cases call for refinement of the metrics.
The calculation of Viv is problematic as long as Eddy is based solely on word-counts. Semantic content analysis can be envisaged, to distinguish differences in surface wording from differences in meaning: translators’ differing interpretations. This may be becoming possible for texts in English, but for other languages, the necessary corpus linguistics resources are not available.
But Eddy and Viv are navigational aids based on relative differences, not exact measures of the properties of translations. They are intended to facilitate explorations of texts, not to substitute for reading them. A certain fuzziness is no bad thing.
More collaboration. Our work aims to excite and harness the knowledge and interests of users: students, researchers, theatre people, publishers, translators, fans (as with our ongoing crowd-sourcing project). We aim to build global networks of collaboration, by enabling users to interact and input data, analysis, and comment. We are currently seeking funding for further work, under the general title: ‘TransVis – Translation Visualized’.
More texts. The more translations, the more powerful the Array. In German alone we can include more printed, typescript, and born-digital texts, prompt-scripts from theatre archives, and texts from audiovisual archives (film, radio, television, online multimedia). And of course …
More languages. With multiple translations from many languages, we can explore which differences among translations are due to properties of target languages and cultures, and which are due to properties of the translated work.
More Arrays. Our approach can be applied to any corpus of comparable translations of whole works or parts of works, from and to any languages.
More visualizations. We have only scratched the surface of ways in which information encoded in multiple translations can be extracted and presented with interactive visualizations. There is great scope for more flexible, scalable tools. We aim to develop an adaptable toolsuite which will allow users to develop their own modes of analysis.
More cultural context. To explore not just how but why translators translate differently, we need contextual data: author/translator biographies, data on editions and other related events: theatrical and media productions, with reviews, etc., including audio-visual material. Relevant contextual data expands to include all aspects of cultural and intercultural history.
More audio-visual data. Our approach is unapologetically ‘text-centric’, but performance dimensions should not be neglected. Media documents are not just sources of more texts. Performances show how words (whatever the language) are re-interpreted by dramatic context, intonation, address, gesture, etc. – and ideally we would also encompass audience reactions.
Paratexts. In many source text editions as well as translations, paratexts carry important information: introductions, afterwords, footnotes, endnotes, glosses, glossaries, back-cover texts, etc. The prototype excludes them only for pragmatic reasons.
Source text instability. Not only translations vary: so do source works. Our ‘base text’ is yet another English version of Othello. Translations usually have plural sources: translators work from different editions, often more than one, and they usually also work from previous translations.
Genetics. Translators look at previous translations (whether or not they say so) and these influence their decisions. Translation history can be explored in Translation Arrays, revealing citations, dependencies, possibly plagiarism, but also negative relations: resistance through change. The history of translation theory is also implicit in translation texts, as well as sometimes explicit in paratexts.
Page images. The texts presented have been normalized in a process involving manual checking, which can introduce errors, and some formatting features have been omitted. Page images will help users appreciate the materiality of the texts, as well as check the Array versions.
All German texts are reproduced on our website with permission as follows:
| Translator | Year | Format | Copyright / permission |
| --- | --- | --- | --- |
| Bärfuss | 2001 | typescript theatre script (pdf) | Hartmann & Stauffacher GmbH |
| Baudissin ed. Brunner | 1947 (1832) | book (dual language) (study edition) | Brunner © not identified |
| Baudissin ed. Mommsen | 1855 (1832) | book | out of copyright |
| Baudissin ed. Wenig | 2000 (1832) | online (Gutenberg) | non-copyright |
| Bodenstedt | 1867 | book | out of copyright |
| Bolte and Hamblock | 1985 | book (dual language) (study edition) | Philipp Reclam jun. |
| Buhss | 1996 | typescript theatre script (pdf) | Henschel Schauspiel Theaterverlag |
| Engel | 1939 | typescript theatre script (typewriter) | Felix Bloch Erben |
| Engler | 1976 | book (dual language) (study edition) | Stauffenberg Verlag Brigitte Narr |
| Felsenstein | 1980 | book (dual language: Italian/German) | G. Ricordi & Co. S.p.A., Milan |
| Flatter | 1952 | book and theatre script (print) | Theater-Verlag Desch |
| Fried | 1970 | book (dual language) | Felix Bloch Erben |
| Gildemeister | 1871 | book | out of copyright |
| Gundolf | 1909 | book | out of copyright |
| Günther | 1992 | book (dual language) and theatre script (print) | Hartmann & Stauffacher GmbH |
| Karbus | 2006 | typescript theatre script (pdf) | Thomas Sessler Verlag |
| Laube | 1978 | typescript theatre script | Verlag der Autoren |
| Lauterbach | 1973 | typescript theatre script (typewriter) | Henschel Schauspiel Theaterverlag |
| Leonard | 2010 | typescript theatre script (pdf) | Christian Leonard & Shakespeare Company |
| Motschach | 1992 | typescript theatre script (typewriter) | Drei Masken Verlag |
| Ortlepp | 1839 | book | out of copyright |
| Rothe | 1956 | book and theatre script (print) | Thomas Sessler Verlag |
| Rüdiger | 1983 | typescript theatre script (typewriter) | Felix Bloch Erben |
| Schaller | 1959 | book | Pegasus Theater- und Medienverlag GmbH / Verlag Autorenagentur |
| Schiller | 1805 | book (scholarly edition) | Hermann Böhlaus Nachfolger / J. B. Metzler |
| Schröder | 1962 | book | Suhrkamp Theater & Medien |
| Schwarz | 1941 | typescript theatre script (typewriter) | Shakespeare-Bibliothek München |
| Swaczynna | 1972 | typescript theatre script (typewriter) | Jussenhoven & Fischer |
| Vischer | 1887 | book (study edition) | out of copyright |
| Wachsmann | 2005 | typescript theatre script (typewriter) | Gustav Kiepenheuer Bühnenvertriebs GmbH |
| Wieland | 1766 | online (Gutenberg) | out of copyright |
| Wolff | 1920 | book | out of copyright |
| Zaimoglu and Senkel | 2003 | book and typescript theatre script (pdf) | Rowohlt Theaterverlag |
| Zeynek | 1948 | typescript theatre script (typewriter) | Ahn & Simrock Bühnen- und Musikverlag |
| Zimmer | 2007 | typescript theatre script (pdf) | Deutscher Theaterverlag Weinheim |
Tom Cheesman, Kevin Flanagan and Stephan Thiel, ‘Translation Array Prototype 1: Project Overview’, at www.delightedbeauty.org/vvv (September 2012 - January 2013)