7.1 Text and Image
Political texts are not monomodal but frequently include images in the form of photographs, graphics, cartoons or videos, whose semiotic features may be indexical of wider stanced discourses, and which therefore contribute in ideologically significant ways to processes of meaning construction. Stocchetti (2011: 33) goes as far as to argue that visual communication has in fact become ‘the dominant form of political communication’. In news discourse, Caple (2013: 5) similarly argues that images tend to dominate the verbal text they accompany and in some cases the image itself may be the news story. Certainly, modern political discourse, especially online, is replete with attention-grabbing, memorable and impactful images which, like slogans, reduce complex issues to a single dimension and are used in targeted ways to reach and persuade sympathetic audiences. Wodak (2011: 71) argues that, throughout the twentieth century, iconic images have provided snapshots of political and historical events and have thereby served to ‘condense complex political processes in simplistic ways’ (2011: 72). The result is that history is reduced to static events captured by images which neglect socio-political and historical contexts (2011: 74–75).
In so far as such images typically feature alongside written text, they may, in a way that is analogous with co-speech gestures and which is not intended to assign any privileged status to one mode over the other, be described as co-text images. Co-text images are images which occur with a sufficient degree of proximity to written forms of language as to be considered part of the same text with the result that the meanings expressed in each mode are interpreted in light of one another. Language and image in multimodal texts do not function independently of one another then, but, rather, constitute multimodal ensembles (Kress 2010: 28) in which modes interact in the production of collective, layered and interrelated meanings. The meaning of any multimodal text, in other words, is not just a product of the individual modes that contribute to it but of the complex interplay between them resulting in a synthesised meaning that is greater than the sum of its parts. This view of multimodal meaning-making is supported by eye-tracking studies. For example, Bucher and Schumacher (2006) used eye-tracking to monitor attention sequences when subjects read pages of print and online newspapers. Their results showed that ‘elements on a page – be it a printed or online newspaper – are perceived in an alternating manner in order to build up an understanding of one element within the context of the other’ (2006: 360).
Studies of multimodality in political texts from the perspective of cognitive linguistics have been largely focussed on metaphor and show that conceptual metaphors identified in language as constitutive of a particular discourse receive similar expression in visual and multimodal forms of representation. For example, Bounegru and Forceville (2011) analysed political cartoons representing the global financial crisis of 2008 and found that the cartoons relied on metaphors including FINANCIAL CRISIS IS CATASTROPHIC WEATHER EVENT and FINANCIAL CRISIS IS CONTAGIOUS VIRUS. Catalano and Musolff (2019) analysed metaphor and metonymy in texts including online news reports and official documents such as a Border Patrol recruitment video concerned with immigration to the USA. A key finding was that war metaphors of the kind discussed in Chapter 6 were expressed both verbally and visually in the texts they analysed. For example, they show how the Border Patrol recruitment video depicts officers in scenes typical of war or military training to equate being in the Border Patrol with being in the army (2019: 33–34). The ‘enemy’ is not shown in these scenes but it is clear that the implied ‘enemy’ which the Border Patrol are ‘defending’ the country from is migrants (2019: 33–34). In Hart (2017), I analysed media representations of the 1984–1985 British Miners’ Strike and showed how the STRIKE IS WAR metaphor that permeated linguistic descriptions of the strike was similarly realised in news photographs which were reminiscent of warfare including culturally ingrained images of the First World War, such as the barbed wire barricades of German defences in the Battle of the Somme or the mythologised football matches played between German and Allied forces along the Western Front on a day of truce held at Christmas in 1914.
In some cases, the image was itself sufficient to invite a metaphorical reading, regardless of whether the metaphor was also expressed in verbal co-text. In other cases, however, the allusion was more vague, and apparent only in connection with verbal expressions of the metaphor found elsewhere in the text.
Metaphor, however, is not the only dimension of meaning with respect to which language and image may interact in political discourse. News discourse in particular can reveal how other dimensions of construal, like schematisation, viewpoint and attention, feature in the meanings of multimodal constructions. As Steen and Turner state:
Cognitive linguists routinely study basic mental operations and phenomena that are not exclusive to language but that are deployed in language and leave their mark on its structure … Since the news deploys other modalities than speech and text, it is an obvious project to look for the ways in which these basic mental operations and phenomena are deployed in those other modalities.
7.2 Intersemiotic Relations
A key endeavour of multimodal semiotics, as Royce (2007: 63) states, is to work out ‘what features make a multimodal text visually-verbally coherent’. In other words, what is it that gives a multimodal text texture? The question is generally approached in terms of language-image relations. However, language-image relations are modelled differently within different paradigms of linguistics (see Bateman 2014 for an overview). The most well-known distinction, and where most discussions of language-image relations begin, is made by Barthes (1977). Barthes distinguishes between relations of anchorage and relay. In anchorage, language performs an elucidatory function. Starting from the position that all images are polysemous, Barthes views language as providing anchorage to images in so far as it ‘directs the reader through the signifieds of the image, causing him to avoid some and receive others’ (1977: 40). According to Barthes, anchorage is ‘the most frequent function of the linguistic message’ and the one ‘commonly found in press photographs’ (1977: 40–41). In relay, language performs a diegetic function with language and image cohering as a function of the larger story which they are together advancing (1977: 41). According to Barthes, relay is less common with respect to fixed images (though it occurs in comic strips) and is more important for film.
Another influential approach to language-image relations comes from the perspective of systemic functional linguistics where researchers have extended different aspects of functional grammar as a basis for modelling intersemiotic relations (Bednarek and Caple 2012; Caple 2013; Caple and Knox 2015; Liu and O’Halloran 2009; Martinec and Salway 2005; Royce 1998, 2007; van Leeuwen 1991). Liu and O’Halloran (2009) focus on one particular relation in the form of intersemiotic parallelism, which they define as ‘a cohesive relation that interconnects both language and images when the two semiotic components share a similar form’ (2009: 372) (cf. also intersemiotic repetition [Royce 1998, 2007]). Parallel structures, based on shared transitivity configurations, make ‘significant contributions to establishing co-contextualisation relations between different modes and cause textual convergence’ (2009: 374). However, Forceville (1999: 170) argues that approaches to language-image relations based in systemic functional linguistics ‘compare visual structures too much with surface language instead of with the mental processes of which both surface language and images are the perceptible manifestations’. From a cognitive linguistic perspective, images do not cohere with linguistic structures per se but with mental imagery in the form of conceptualisations which language usages evoke. Indeed, it is precisely the conceptual import of language usages as described in cognitive linguistics that makes it a particularly useful framework for analysing language-image relations.
A key claim made in cognitive linguistics is that the meanings evoked by linguistic expressions are modal rather than purely propositional in form with the conceptual processes involved having a grounding in other areas of cognitive experience like reason and perception. Two things follow from this position. On the one hand, it means that a number of semiotic properties inherent to visual forms of representation, such as spatial arrangement, perspective and salience, figure also in language where they are discernible in the meanings attached to linguistic expressions. On the other hand, since the processes involved in language are not unique to language but are manifestations of more general cognitive processes, it follows that semiotic features normally associated with language, such as metaphor, are likely to show up also in non-linguistic modes of expression. This situation provides a theoretical basis on which the meanings expressed across multiple modes can be compared. Moreover, since various dimensions of imagery contribute to the meaning of verbal and visual forms, a multi-dimensional consideration of intersemiotic relations is afforded.
From a cognitive linguistic perspective, three basic types of intersemiotic relation can be proposed which feature differently in different communicative genres: complement, supplement and, drawing on Barthes, anchorage. When language and image stand in a complementary relationship with one another, there is a degree of overlap in the meanings they contribute. Complementary relations therefore include intersemiotic parallelisms. From a cognitive linguistic perspective, however, the shared form that characterises this echoic relation does not reside in the linguistic and visual structures of the text itself but holds between images in the text and mental imagery in the form of conceptualisations evoked by verbal parts of the text. It is therefore appropriate to speak of intersemiotic convergence in the sense that both modes converge on a common way of viewing and understanding the target situation. From this perspective, intersemiotic convergence can be observed with respect to particular dimensions of construal which are co-instantiated across verbal and visual representations. Thus, language and image may share the same schematisation patterns, distributions of attention, viewpoint specifications and/or metaphorical framings. The different affordances of language and image, as well as register and genre constraints operating over any text, dictate that the two modes are unlikely to overlap in every respect. Any reduplication between the two modes is necessarily scalar and a matter of degree rather than being total or absolute. In the genre of news reporting, however, we can expect to find a high degree of intersemiotic convergence. Rhetorically, intersemiotic convergence serves to tell the same story, from the same perspective. It upholds a consistent narrative with the version of events presented in each mode corroborating the version presented in the other. 
Another kind of complementary relation that is found particularly in connection with metaphor is frame-consistency. A metaphor may not be simultaneously expressed in language and image but where it is expressed in one mode, the representation in the other mode may be consistent with it by indexing some aspect of the source-frame. When language and image stand in a supplementary relationship with one another, one mode contributes meaning that the other does not. This can take the form of addition where, for example, a viewpoint value or metaphorical framing is provided by only one mode. Or it can take the form of specification where a conceptual element indexed in both modes is elaborated at a finer level of specificity in one mode compared to the other. To some extent, language and image will always stand in a supplementary relation with one another where, on the one hand, a clause may provide additional details concerning where and when the events depicted in an image took place while on the other hand the visually depicted elements of an image provide specific instantiations of the categories denoted by language. A further form of supplementarity lies in contradiction where language and image may be explicitly divergent with respect to one or more dimensions of construal. Contradiction is a feature of more artistic genres where incongruent representations are used to create tension, confusion or humour. Finally, in anchorage, the representation in one mode enables reference resolution or disambiguation in the other. Complement, supplement and anchor are not mutually exclusive but work together throughout a multimodal text to weave a complex tapestry of intersemiotic relations which are what make the text a text.
7.3 Multimodal Constructions: News Photographs and Their Captions
7.3.1 Multimodal Constructions
According to cognitive linguistics, linguistic structure emerges, via processes of abstraction, from usage events, whereby recurrent form-meaning pairings become conventionalised inside a system of symbolic units or constructions (Goldberg 1995, 2006; Langacker 1987, 1991, 2008). A symbolic unit consists of two poles: a phonological pole and a semantic pole. The semantic pole consists of semantic structure in the form of conceptualisations, while the phonological pole consists of representations whose only essential feature is ‘that of being overtly manifested, hence able to fulfil a symbolising role’ (Langacker 2008: 15). A key claim of cognitive grammar is that representations included under the rubric of phonological structures ‘include not only sounds but also gestures and orthographic representations’ (2008: 15). This has led to the notion of multimodal constructions (Dancygier and Vandelanotte 2017b; Kok and Cienki 2016; Steen and Turner 2013; Zima 2017; Zima and Bergs 2017). A multimodal construction, as shown in Figure 7.1, consists of a semantic structure that is conventionally associated with representations in more than one semiotic mode which figure regularly alongside one another in usage events.

Figure 7.1 Multimodal construction
From this perspective, any form of expression, whether visual, manual or auditory, that features alongside language in a usage event has the potential to become part of a multimodal construction. It follows, as Zima and Bergs (2017: 1) argue, that from a theoretical standpoint the usage-based model presented in cognitive linguistics is ‘particularly well-equipped to unite the natural interest of linguists in the units that define language systems with the multimodality of language use’. Following Goldberg (2006: 5), Zima and Bergs (2017) outline two criteria for a multimodal form-meaning pairing to achieve constructional status: (i) that the non-verbal feature is used recurrently with a given verbal structure and its meaning contribution is ‘not strictly predictable’ from its form; or (ii) that the two forms co-occur with ‘sufficient frequency’. The notion of multimodal constructions leads to a radical understanding of what constitutes the language system. As Kok and Cienki state:
Whether or not elements of expression qualify as linguistic does not depend on the modality through which they are expressed. Rather the grammatical potential of co-verbal behaviours is to be assessed according to their degree of entrenchment as symbolic structures in an individuals’ mind and the degree of conventionalisation of those symbolic structures within a given community.
The sociocultural level at which ‘community’ is defined has important consequences for the level at which constructions may be identified. While most of the research in construction grammar has addressed more general constructions found at the level of a given language, constructions may also be particular to a given discourse or genre (Antonopoulou and Nikiforidou 2011; Groom 2019). In other words, a specific discourse or genre may have its own repertoire of conventionalised form-meaning pairings that includes multimodal constructions.
In the genre of news discourse, one ideal site where multimodal constructions are especially likely to be instantiated is news photographs and their captions which, owing to their proximity within the text, are likely to be interpreted as elements of a single semiotic syntagm. Of course, co-text images are likely to be interpreted with respect to other salient elements of the text such as headlines, by-lines and lead paragraphs also. Barthes (1977: 16) recognised this when he stated that ‘the structure of the photograph is not an isolated structure; it is communication with at least one other structure, namely the text – title, caption or article – accompanying every press photograph’. Where news photographs and accompanying written text enter into systematic relations with one another, such that the form presented in one mode predicts the form presented in the other, they may be considered instantiations of a multimodal construction. Moreover, it is expected that the semiotic forms featuring in multimodal constructions within the genre of news discourse will exhibit a high degree of intersemiotic convergence. Where this is found to be the case, it may be taken as evidence in support of the claim that linguistic expressions include modal properties as part of their meanings.
7.3.2 News Photographs
News photographs are generally regarded as performing a documentary function, providing evidence of the events described in the story (Bednarek and Caple 2012: 115). They are seen as ‘transparent window[s] on the world, capturing the reality in front of the camera lens’ (Schwartz 2012: 223). It is certainly the case that, in contrast to language, the relationship between an image and its denotation is iconic rather than purely symbolic and thus images have greater claim to objectivity. However, from a semiotic perspective, it is recognised that news photographs are not neutral recordings of reality. As Machin (2007: 24) states, ‘there is no neutral documentation’. Images do not objectively portray events. Rather, in their content, composition and co-textual embedding, images construct and evaluate the realities they depict, standing as symbolic representations which are experienced with sometimes visceral emotions. Barthes recognised this when he highlighted the paradox between the received denotative status of the press photograph and its real connotative function:
[The] purely ‘denotative’ status of the photograph, the perfection and plenitude of its analogy, in short its ‘objectivity’, has every chance of being mythical … In actual fact, there is a strong probability … that the photographic message too – at least in the press – is connoted.
Like language, images impose a particular construal on the situations they depict. For example, news photographs necessarily represent only a specific window in space and time so that what is included in the frame is ‘only one fleeting moment in an entire event’ (Caple 2013: 144). Similarly, by necessity, news photographs present the depicted scene from a particular viewpoint placing some elements in the foreground and others in the background. And news photographs, like other image-types, are capable of recalling or alluding to images documenting other events, including images of historically significant events which are ingrained in cultural memory. When this is the case, news photographs are symbolic of the myths and emotions associated with past events, which are brought to bear in understanding the current one as a consequence. This ‘alternative’ view of news photographs leads Caple to argue that:
the press photograph is a key participant in the news storytelling process. Like the verbal text that surrounds it, it is a social construct that makes a significant contribution to the meaning of a news story. As such, it deserves to be scrutinized to the same extent as the verbal text.
Caple (2013: 10) further points out that while there is now a growing body of research investigating the news photograph from a social semiotic perspective, certain areas remain vastly under-researched. One such area is the way press photographs combine with captions, headlines and other written text (2013: 10). However, a few noteworthy exceptions are to be found which point to a general tendency for language and image to overlap, thus mutually reinforcing one another (e.g. Chovanec 2019; Martínez Lirola and Zammit 2017; Romano and Porto 2021). Specifically within a cognitive linguistics framework, Belmonte and Porto (2020) analysed European media representations of events on the Gaza border in May 2018. They examined four categories of framing device that correspond to focus, prominence, specificity and perspective as defined within cognitive grammar and observed a tendency for these facets of construal to ‘manifest indistinctively in the textual and the visual modes’ (2020: 59). Romano and Porto (2021) analysed representations of the Syrian refugee crisis in the UK and Spanish press. They present a quantitative analysis of path windowing and schematisation patterns with a focus on force-dynamics. With respect to path windowing, they found that in both language and image attention is directed primarily towards the routes taken by refugees (trajectory) followed by their point of arrival (goal). Refugees’ place of origin (source) is not represented at all in language and represented only very minimally in images.
The photographs therefore ‘corroborate the headlines’ interest in the middle stage’ (2021: 161) as they tend to depict refugees ‘on their way, either on the move – walking, crossing rivers, in small boats in the sea – waiting to catch a train or ship, trapped in camps or just taking a rest along their journeys’ (2021: 161). Likewise, at the final goal stage, language and image converge in representing the force-interactions that refugees are subject to. In half of the texts analysed, images and headlines reinforce one another with blockage, diversion and repulsion schemas accounting for the majority of reinforcing relations observed as refugees ‘are shown being stopped and sent back both in the verbal and visual modes’ (2021: 164).
In the analysis that follows, dimensions of construal are isolated for expository purposes. However, as I have suggested, language and image may converge with respect to multiple dimensions of construal which operate concomitantly with one another in any one instance of conceptualisation. For example, schematisation necessarily involves a viewpoint in both language and image (Dancygier and Vandelanotte 2017c; Langacker 2008). Similarly, attentional selection and distribution are necessary features of all conceptions (Langacker 2008; Talmy 2000). I will therefore return to some of the same images more than once where they serve as illustrations of intersemiotic convergence in more than one dimension of construal. Data in the following analysis does not represent a particular political topic but a genre. Examples are taken from a purposive sample of language-image combinations in online news discourse that cuts across political topics previously covered in the book.
7.3.3 Schematisation
Arguably the most fundamental dimension of conceptualisation lies in the image-schematic structuring of events (Langacker 2008; Talmy 2000). Image schemas represent recurrent patterns of sensory-motor and perceptual experience and stand as the meaningful base of lexical and grammatical units inside a system of symbolic assemblies. In discourse, the linguistic expressions selected in an event’s description impose upon it a particular image-schematic construal, defining the domain to which the event belongs and its internal organisation. Image-schematic structuring through language is therefore a matter of construal. Although constrained, it is not determined by any objective properties of the referential event. The same or ostensibly the same type of material situation may be schematised differently through alternate linguistic formulations. In multimodal texts, the same image-schematic construal may be co-instantiated in co-text images. For example, one image schema that constitutes an archetypal conception (Langacker 2008) is an action-chain subsuming a motion event. Events conceived this way involve one active participant, an agent, imparting energy to an entity (a mover) and propelling it toward another entity (the goal). This schema is represented in Figure 7.2. It is instantiated in the image given as example (1) where all three participants are represented and the interaction between them is suggested by the dynamicity of the image. For Kress and van Leeuwen (2006), this dynamicity is what gives the image its ‘narrative’ structure and is a product of vectors formed by visually depicted elements. According to Kress and van Leeuwen (2006: 59), ‘the hallmark of a narrative visual “proposition” is the presence of a vector’ which ‘may be formed by bodies or limbs or tools “in action”’.
In (1), such a vector is formed by the outstretched limb of the refugee which implies the bottle’s direction of travel toward the police. The vector formed in the image may therefore correspond with the arrows representing energy transference and path of motion in Figure 7.2. From a perspective akin to simulation semantics (e.g. Bergen 2012), it has been shown that static photographs of human actions where there is implied motion activate motor areas of the brain (Kim and Blake 2007; Kourtzi and Kanwisher 2000; Proverbio, Dederica and Zani 2009). This suggests that in understanding images, viewers ‘complete the picture’ by performing dynamic simulations which unfold along the lines laid down by image-schematic structures inhering in the image. The same schema is instantiated verbally in the caption that accompanied the image. The language-image combination in (1) may therefore be described as intersemiotically convergent with respect to the conceptual dimension of schematisation where the same image-schematic structure is instantiated across the two modes.
(1)

A refugee throws a bottle toward Hungarian police at the ‘Horgos 2’ border crossing into Hungary, near Horgos, Serbia. (telegraph.co.uk, 22 Sept. 2015; © Associated Press/Alamy Stock Photo)

Figure 7.2 Action-chain subsuming motion event
In the image schema instantiated multimodally in (1), only one participant is active with the police in the passive role of goal. However, an archetypal conception that involves two active participants is a two-sided action-chain in which there is a bidirectional exchange of energy. This schema, represented in Figure 7.3, is the one attached to reciprocal constructions. In discourse, it serves to construe physical interaction as two-sided and thus assigns mutual blame and responsibility for the event (Hart 2018b). It is discernible in images where both participants are shown actively participating in the event. In example (2), language and image can therefore be seen to converge as co-instantiations of a two-sided action schema. In the verbal component of (2), the two-sided schema is indexed by the reciprocal construction ‘Skirmishes broke out between X and Y’ while in the visual component we see police and protesters physically interacting with one another with vectors emanating from both participants.
(2)

Skirmishes broke out between demonstrators with masks and scarves covering their faces and makeshift police barricades at Piccadilly Circus. (dailymail.co.uk, 19 Mar. 2016; © Lee Thomas)

Figure 7.3 Two-sided action-chain
The two-sided schema inhering in (2) is instantiated in other images within the text as well as in other reciprocal forms found elsewhere in salient verbal portions of the text, including the headline and by-lines as in (3)–(5), suggesting a close association between verbal and visual forms.
(3)
Protesters clash with police as thousands take to streets of London for Refugees Welcome march. (dailymail.co.uk, 19 Mar. 2016)
(4)
Clashes broke out today between police and crowds of demonstrators at 15,000-strong pro-refugee London march. (dailymail.co.uk, 19 Mar. 2016)
(5)
Demonstrators wore masks and scarves over their faces as they clashed with police officers’ makeshift barricades. (dailymail.co.uk, 19 Mar. 2016)
In the domain of motion, Talmy (2000) identifies several different types of motion event. However, a basic distinction that is made is between motion events which are force-dynamically neutral and those that involve a force-dynamic component. In a force-dynamically neutral event, the impetus for motion begins with the figure and their ability to move is not hindered in any way. Motion events that involve a force-dynamic component include caused motion and impeded motion. In impeded motion, the figure’s ability to move freely is constrained by the presence of some ‘barrier’ which they are able to circumvent or penetrate in order to complete the intended translocation. These two types of impeded motion event are represented conceptually by image schemas as modelled in Figure 7.4. The arrows in Figure 7.4 represent a path of motion which the figure undertakes with respect to the ground. The stepped arrow in Figure 7.4b represents the change in state to the barrier brought about in the course of realising the motion event.


Figure 7.4 Impeded motion schemas, (a) and (b)
Many linguistic expressions, both open and closed class, include a force-dynamic conceptualisation as part of their meaning (Talmy Reference Talmy2000). This includes the try + infinitive construction which focuses on an effort to overcome an obstacle (without making known the outcome of this effort) (Talmy Reference Talmy2000/I: 436–437).Footnote 2 In news texts reporting immigration, language-image combinations such as (6) can be found where an impeded motion schema is instantiated verbally by forms like trying to reach Y and attempting to get to Y but also visually by the image of a fence, which immigrants are shown climbing over or through.
(6)

Migrants in Calais attempting to get to Britain. (express.co.uk, 15 Aug. 2015; © Rob Stothard/Stringer/Getty Images)
Language-image combinations such as (6) may thus be said to converge in schematising immigration in force-dynamic terms. They are candidates for a discourse-level multimodal construction where, in the visual mode, the barrier element of the impeded motion schema is conventionally instantiated by images of fences (see Dancygier and Vandelanotte Reference Dancygier and Vandelanotte2017c for discussion of visual instantiations of the barrier schema). However, there is a difference between the two modes in levels of specificity. While in the verbal mode, the form or nature of the impediment is not specified and the manner by which it is overcome (e.g. circumvented versus penetrated) is not expressed, this information is contained within the co-text image. In this sense, while convergent at a basic level of schematisation, the image in (6) may also be said to supplement the information expressed verbally. The schema instantiated by the image in (6) is the one shown in Figure 7.4b.
Images of fences being breached are a recurrent visual trope in anti-immigration discourse (Martínez Lirola Reference Martínez Lirola2017, Reference Martínez Lirola2022). Such images depict immigrants entering the country illicitly and thus emphasise the criminal nature of their activities. As a result, the basic image-schematic conception inhering in them may be elaborated metaphorically in the verbal mode as a more specific criminal act of ‘breaking and entering’. Such a framing is found in connection with (6) where the headline which the image in (6) accompanies took the form in (7):
(7)
Migrants trying to ‘break into’ Britain. (express.co.uk, 15 Aug. 2015)
Intersemiotic convergence in multimodal texts is not limited to events that fall within the domains of action, force and motion. Langacker (Reference Langacker2008) argues that (in English at least) the linguistic means of describing a perceptual event is a transitive clause. The image schema evoked by constructions like X saw Y, representing another archetypal conception, is therefore one derived metaphorically from an action-chain where, instead of agent and patient, participant roles are experiencer and zero and the process that connects them is one of ‘mental contact’ (Langacker Reference Langacker2008: 358). This schema is represented in Figure 7.5.

Figure 7.5 Perceptual event schema
In example (8), the cognitive verb ‘watch’ instantiates the perceptual schema given in Figure 7.5. Israeli soldiers are the experiencer who establishes ‘mental contact’ with Palestinian protesters. In the co-text image, Israeli soldiers are not shown engaged in acts of physical combat but are shown instead observing the protesters. As with depictions of action, Kress and van Leeuwen (Reference Kress and van Leeuwen2006: 117) argue that the sightlines of participants in an image form vectors that connect the sensing participant with other represented participants (or the viewer themselves where the direction of gaze is looking ‘out of’ the image directly at the viewer). Again, such vectors may correspond with the arrows representing energy transference or mental contact in depictions of image schemas. The image in (8) may therefore be analysed as a visual instantiation of the same schema that is instantiated verbally in the caption. In other words, language and image in (8) are intersemiotically convergent with respect to schematisation with both modes construing the event as one of perception rather than action. Ideologically, this construal serves a sanitising and thus legitimising function. As Machin (Reference Machin2007: 125–126) notes of photographs of US and allied soldiers in Iraq, such depictions do not show soldiers engaged in violent actions such as shooting the enemy. Although they are seen holding guns, the Israeli soldiers in (8) are not shown actually firing (despite the fact that on the day in question Israel Defence Forces shot and killed fifty-two Palestinian demonstrators). Instead, such depictions show that ‘the soldier keeps guard, vigilant but peaceful and disciplined’ (Reference Machin2007: 125).
(8)

Israeli soldiers across the border from the Gaza Strip watched the protesters. (nytimes.com, 14 May 2018; © Jack Guez/Getty Images)
Another aspect of schematisation concerns not the event-structure itself but the participants within it and relates to configurational schemas rather than conceptual archetypes. Talmy (Reference Talmy2000) suggests that distinctions within the linguistic category of number, namely singular versus plural (as well as aspectual distinctions like semelfactive versus iterative), are accounted for conceptually in terms of plexity of structure (see Figure 7.6).

Figure 7.6 Plexity of structure
For Talmy (Reference Talmy2000: 48), plexity is ‘a quantity’s state of articulation into equivalent elements’. While singular nouns specify a uniplex referent, plural nouns specify a multiplex one. In the context of immigration discourse, the construal evoked by examples like (6), involving the plural form ‘migrants’, is therefore one in which the referents are treated as multiplex. However, other nominal forms, including singular forms and collective nouns and noun phrases, construe referents in a uniplex fashion. In contrast to what Talmy describes as multiplexing, such expressions may therefore be said to reflect a cognitive operation of uniplexing. An example is found in (9) indexed by the collective noun phrase ‘column of migrants’, which construes the group of immigrants described as a uniplex structure of a specific oblong shape.
(9)

The huge column of migrants passes through field in Rigonce, Slovenia, after having been held at the Croatia border for several days. (dailymail.co.uk, 25 Oct. 2015; © Associated Press/Alamy Stock Photo)
In images, plexity is realised in the dispersion of, or degree of agglomeration of, visually depicted elements. The higher the degree of agglomeration (and therefore the lower the dispersion), the more uniplex the structure. The language-image combination in (9) is therefore intersemiotically convergent with respect to the dimensions of plexity and shape where the co-text image displays a high degree of agglomeration such that the individual migrants coalesce to form a uniplex structure that is similarly oblong. It is a strong candidate for multimodal constructional status where similar language-image combinations are found in other newspapers. For example, a near-identical image was published in the Telegraph with the caption in (10). In both texts, the collective noun phrase ‘column of migrants’ or ‘column of refugees and migrants’ features more than once in different regions of the text.
(10)
A column of migrants moves through fields after crossing from Croatia, in Rigonce, Slovenia. (telegraph.co.uk, 26 Oct. 2015)
Rhetorically, the conceptualisation associated with this multimodal construction does several things which are worth commenting on. In both modes, there is a loss in granularity with the result that immigrants are aggregated or de-individuated (van Leeuwen Reference van Leeuwen, Coulthard and Coulthard1996). Their own personal stories are not recognised and they can therefore all be viewed and treated in the same way. The image in (9) is also unbounded, other than by the frame of the photograph. That is, the endpoint of the ‘column’ does not fall within the frame of the photograph. A structure that is unbounded is conceived as ‘continuing on indefinitely with no necessary characteristic of finiteness intrinsic to it’ (Talmy Reference Talmy2000/I: 50). In this context, the unboundedness of the image serves to suggest a line of significant, perhaps indefinite, extent.
In a series of experiments investigating the ideological effects of photojournalistic images of the Syrian refugee crisis, Azevedo et al. (Reference Azevedo, De Beukelaer, Jones, Safra and Tsakiris2021) show that, compared to images depicting refugees in small groups, images depicting refugees as large groups of unidentifiable individuals – the dominant visual framing found in the media and the one presented by (9) – lead to greater implicit dehumanisation as well as increased support for anti-refugee policies and decreased support for pro-refugee policies. Beyond the denotative level, their findings therefore demonstrate empirically the power of specific visual framings at the connotative and ideological levels (Reference Azevedo, De Beukelaer, Jones, Safra and Tsakiris2021: 14). Specifically, Azevedo et al. argue that the increased dehumanisation that results from images of large anonymous groups resonates with a view of immigration as a security issue rather than a humanitarian one where refugees are considered to ‘be a crisis’ rather than as finding themselves ‘in a crisis’ (Reference Azevedo, De Beukelaer, Jones, Safra and Tsakiris2021: 14). Azevedo et al. thus conclude (Reference Azevedo, De Beukelaer, Jones, Safra and Tsakiris2021: 14) that what is made visible, and how, in photojournalism ‘has consequences for the ways in which we perceive and relate to other human beings, especially in a culture that is powered by images’ (emphasis in original).
7.3.4 Viewpoint
If schematisation represents one dimension of construal, in so far as it involves the apprehension of particular conceptual content in order to conceptualise the structural properties of the referential event, further dimensions of construal concern the way that conceptual content is viewed. One dimension of construal here relates to viewpoint where, as Langacker (Reference Langacker2002: 56) states, many linguistic expressions ‘presuppose a particular vantage point on the scene they describe as a crucial facet of their inherent semantic value’. Schematisation and viewpoint thus go hand in hand where viewpoint is an almost inescapable aspect of conceptualisation. For Langacker (Reference Langacker2008: 73), the relationship between the ‘viewer’ – the conceptualiser who apprehends the meaning of a linguistic expression – and the situation being ‘viewed’ constitutes a viewing arrangement. Crucially, however, viewpoint is a matter of construal and linguistic expressions impose contrasting viewpoints on the conceptual content they select for viewing. In Hart (Reference Hart2015a), the cardinal viewpoints made available to language are modelled in three dimensions: anchor (on the horizontal plane), angle (on the vertical plane) and distance. The viewpoint specification inherent in the meaning of linguistic expressions can be seen as a three-value coordinate within this system with the contrast in meaning between alternate linguistic expressions being appreciable as a shift in one or other dimension. Viewpoint is also an obviously inherent feature of images and the same viewpoints are available (Kress and van Leeuwen Reference Kress and van Leeuwen2006). Viewpoint is thus one further dimension of construal with respect to which the language usages and images in a multimodal text may converge. Indeed, the language-image combinations discussed above are observed to coincide with respect to viewpoint as well as schematisation.
For example, in Hart (Reference Hart and Hart2019b) it is shown empirically that the viewing arrangement associated with transitive verb constructions places the unfolding action on the sagittal axis. The viewing arrangement associated with the verbal component of (1) is thus the one modelled in Figure 7.7 where the event is construed from a position in line with the agent’s perspective. In (1), this viewing arrangement is replicated in the image which similarly positions the viewer behind the agent. This viewing arrangement has consequences for attention where the action of the agent is foregrounded – in the parlance of cognitive grammar, is assigned trajector status – in both language and image. The same parallels in perspective can be seen in (8) where the viewpoint presented by both language and image is from behind the experiencer in a perceptual event. Thus, in (1) the deviant behaviour of the refugee is foregrounded while in (8) it is the legitimate peaceful activities of the Israeli soldiers that receive particular attention.

Figure 7.7 Viewpoint + schematisation in example (1)
By contrast, the viewing arrangement associated with reciprocal constructions places the unfolding action on the transversal axis. Moreover, it is shown in Hart (Reference Hart and Hart2019b) that the left-right arrangement of participants on the transversal axis reflects iconically the information structure in the clause. Thus, the construction X clash with Y places the participants occupying the X and Y slots on the left and right of the conceptualiser respectively. This affords two different viewing arrangements as shown in Figure 7.8.Footnote 3 The same goes for constructions like ‘Skirmishes broke out between X and Y’ found in the verbal component of (2). Thus, if we assign protesters to Agent 1 and the police to Agent 2, then the viewing arrangement evoked by the linguistic expression in (2) is the one modelled in Figure 7.8a. The same viewpoint is presented by the image in (2) and thus the two modes are complementary in the dimension of viewpoint as well as schematisation. Although the viewing arrangement instantiated multimodally in (2) constitutes a more neutral perspective compared to sagittal arrangements, and attention is distributed more evenly over the scene, the particular left-right arrangement of participants on the transversal axis is not ideologically insignificant. For example, participants are more likely to receive blame and are perceived as more aggressive when they feature on the left of the viewing arrangement for both language and image (Hart Reference Hart and Hart2019b).


Figure 7.8 Viewing arrangements in reciprocal constructions, (a) and (b)
In (6), language and image are consistent in instantiating an impeded motion schema, the barrier element of which is specified in the image as a fence. Dancygier and Vandelanotte (Reference Dancygier and Vandelanotte2017c: 94–95) define the concept of barrier as a static structure that separates two areas of space, imposing a range of restrictions and allowing different forms of alignment. Crucially, the concept of barrier necessitates a viewpoint from one or other of the spatial regions it defines. Dancygier and Vandelanotte (Reference Dancygier and Vandelanotte2017c: 92) note that the addition of viewpoint means that the same type of structure is capable of yielding different experiential results. For example, barriers can be positively or negatively connoted depending on which side of the barrier one is located. A barrier may provide a protective structure or may present an obstacle that prevents the realisation of goals. In the context of immigration discourse, from the perspective of ‘us’, barriers in the form of fences are positively connoted as keeping unwanted elements out. However, barriers are capable of being breached allowing unwanted elements in. It is this semantic potential of the barrier concept that is exploited in (6) where the viewpoint shared across language and image is from ‘our’ side of the fence. In the verbal component, this viewpoint is provided deictically where the presumed physical location of the reader is Britain. Correspondingly, in the visual component, the viewpoint presented is from the goal on the other side of the fence to where the depicted motion event began.
Finally, in example (9), language and image coincide in construing a multiplex referent in a uniplex fashion. Plexity is related to viewpoint where it correlates with distance. The further a viewer is from a scene, the more uniplex it becomes in their eyes. Uniplex structures indexed by collective noun phrases such as ‘column of migrants’ in (9) involve a maximally distant viewpoint (on either the horizontal or the vertical plane). In the image, the aerial shot presents a viewpoint that is maximally distant on the vertical plane. Plexity is also related to attention where the resultant loss of granularity in a uniplexed structure occludes attention to otherwise independent elements.
7.3.5 Attention
Like viewpoint, attentional distribution is a necessary feature of conceptualisation. Attention is linked to viewpoint in so far as which aspects of a scene are in the foreground of attention and which are in the background is determined by viewpoint (Chilton Reference Chilton2014; Talmy Reference Talmy2000). Language-image combinations therefore often converge in both viewpoint and attentional configurations. One aspect of attentional configuration resides in what Talmy (Reference Talmy2000) calls windowing, which has its reflex in gapping (cf. profiling in cognitive grammar).Footnote 4 By virtue of explicit mention, language selects for attention a particular portion of a coherent body of conceptual content in the form of image schemas or event-frames. Windowing of attention (or profiling) has clear analogues in visual forms of representation where images, defined by the scope of their viewing frame, necessarily only capture one part of a wider event-structure. Intersemiotic convergence in the dimension of attention may therefore occur where language and co-text images capture the same portion of a larger event. To illustrate this, consider the example in (11):
(11)

Palestinian protesters look up at falling tear gas canisters near the border with Israel in the southern Gaza Strip on Tuesday. (wsj.com, 14 May 2018; © Said Khatib/Getty Images)
The viewpoint in both the language and the image is from the perspective of the Palestinian protesters. The event they are ‘witnessing’ is an instance of what Talmy (Reference Talmy2000: 265) calls an open path event. An open path event is a type of motion event involving ‘an object physically in motion over the course of a period of time, which is conceptualised as an entire unity thus having a beginning and an end, and whose beginning point and ending point are at different locations in space’ (Reference Talmy2000: 265).Footnote 5 Path windowing occurs when language directs attention over particular facets of the conceptually complete path. Three forms of path windowing are identified – initial, medial and final – which window attention on different phases of the event. This is represented in Figure 7.9.

Figure 7.9 Path windowing initial (a), medial (b) and final (c)
In (11), the object in motion is tear gas canisters. Such an event involves a launch site and a landing site. The beginning and end points of the path, however, do not receive linguistic representation. Instead, the nominalised form ‘falling tear gas canisters’ construes the event with medial windowing and initial + final gapping as in Figure 7.9b. Likewise, in the image, the viewer sees only the tear gas canisters as they are moving through the air and does not see from where they emanated or where they end up. The multimodal representation is thus convergent in distribution of attention (as well as other dimensions of construal) with the attentional configuration in Figure 7.9b being instantiated across both modes. A contrasting image involving final path windowing would be one capturing the moment of impact. Of course, as Talmy (Reference Talmy2000: 266) points out, given sufficient context, we can mentally trace the whole path to reconstruct or ‘complete’ it, but in (11) it is only the medial point on the path that is foregrounded in attention by both language and image. Example (11) may therefore be analysed as a multimodal enactment of both agent-based and patient-based mystification.
Intersemiotic convergence with respect to path windowing is also found in the context of immigration discourse. Immigration, similarly, is an instance of an open path event in which migrants depart a country of origin and end up at a destination country. Language-image combinations can converge in directing attention over particular aspects of this process. Compare examples (12) and (13):
(12)

A group of migrants crossing the Channel in a small boat headed in the direction of Dover, Kent, on 10 August. (independent.co.uk, 10 Aug. 2020; © PA Images/Alamy Stock Photo)
(13)

About 50 people were seen arriving in Dungeness in a dinghy on Monday. (bbc.co.uk/news, 22 Nov. 2021; © PA Images/Alamy Stock Photo)
In (12), the linguistic expression ‘crossing the Channel’ and the image both involve medial path windowing. In the image, neither the beginning nor the endpoint of the journey is shown (though the intended destination is specified in the verbal text). Medial path windowing therefore shows the journey in progress without having yet been completed.
By contrast, final path windowing shows the journey as having been successfully completed. Final path windowing is exhibited by both language and image in (13). Final path windowing is an inherent property of the verb arrive (Langacker Reference Langacker2002: 49). In the co-text image, final path windowing occurs where migrants are shown having disembarked a boat at a shoreline representing the terminus of their journey. Language and image in (13) are thus intersemiotically convergent in instantiating the attentional configuration in Figure 7.9c. Of course, the shoreline in the image could be any shoreline but it is specified in the verbal text as being that of a UK headland. The linguistic expression in (13) therefore exists in a supplementary as well as a complementary relation with its co-text image. Worth noting in connection with (13) is the repeated use of arrive, as well as other verbs which include final path windowing as an inherent semantic feature such as land and reach, throughout the text as in (14). Other images in the text also exhibit final path windowing.
(14)
Channel crossings: Fifty people land on Kent beach in single dinghy A group of about fifty people have crossed the English Channel and landed on a Kent beach in a single day. Photographs show dozens of people, including women and young children, arriving at Dungeness on Monday. On Sunday, eight boats carrying 241 migrants reached the UK, the Home Office said. Nearly 8,000 people have reached the UK in about 345 boats in 2021. (bbc.co.uk/news, 22 Nov. 2021)
This persistent pattern of attentional distribution not only maintains intersemiotic convergence throughout the text but maintains a consistent focus on the endpoint of the migrants’ journey at the expense of the journey itself and the hazards it presents, as well as the factors motivating people to undertake the journey in the first place. The ideological impact of path windowing in images is shown experimentally where a number of effects are observed. Azevedo et al. (Reference Azevedo, De Beukelaer, Jones, Safra and Tsakiris2021) demonstrate that images showing immigrants on land, compared to images showing them at sea, result in reduced feelings of pity and admiration towards the depicted immigrants and increased feelings of contempt. Photos of immigrants on land were also found to increase perceptions of symbolic threat.
7.4 Multimodal Metaphor
The central claim of conceptual metaphor theory is that metaphor ‘is primarily a matter of thought and action and only derivatively a matter of language’ (Lakoff and Johnson Reference Lakoff and Johnson1980: 153). If it is the case that metaphor is not a linguistic phenomenon per se but a cognitive one reflected in and effected through metaphorical linguistic expressions, then, as Forceville (Reference Forceville2002: 2) observes, ‘it should be capable of assuming non-verbal and multimodal manifestations as well as … purely verbal ones’. El Refaie (Reference El Refaie2003: 76) similarly argues that, from a cognitive perspective, ‘any form of communication can be seen as an instance of metaphor, if it is able to induce a metaphoric thought or concept’. Indeed, based on an understanding of metaphor as a cognitive process of frame-projection, Forceville and others have been able to demonstrate the non-verbal and multimodal occurrence of conceptual metaphors across a vast range of texts and genres (e.g. Forceville Reference Forceville1996; Forceville and Urios-Aparisi Reference Forceville and Urios-Aparisi2009). From the perspective of Cognitive CDA, the same dominant framings found in verbal articulations of a given political discourse can be expected to receive representation in other semiotic modes also. Researchers in Cognitive CDA have similarly shown that metaphors identified in language as constitutive of particular discourses are reproduced visually and multimodally within the same discourses. For example, El Refaie (Reference El Refaie2003) shows how the metaphors nation is a building and immigration is moving water, which are characteristic of right-wing media discourses of immigration (Charteris-Black Reference Charteris-Black2006), get expressed visually in editorial cartoons. 
Fridolfsson (Reference Fridolfsson, Carver and Pikalo2008) shows how the protest is war metaphor is represented visually in news photographs of protests which are ‘charged with visual references to war aesthetics like people hunching down in the streets or frightening faces taking protection in a smoky environment’. Silaški and Đurović (Reference Silaški and Đurović2019) show how the metaphor brexit is a journey is rendered pictorially and multimodally in political cartoons. And in the context of Covid-19, it has been shown how the covid-19 is war metaphor is expressed visually and multimodally across genres including cartoons and public service announcements (Domínguez and Sapiña Reference Domínguez and Sapiña2022; Feng and Wu Reference Feng and Wu2022). In the covid-19 is war metaphor, the virus is depicted as the enemy, doctors and other medical professionals are depicted as soldiers, and measures taken to curb the spread of the virus, such as vaccinations, are depicted as weapons. A visual example of this metaphor is found in (15) where a molecular representation of the virus is shown as the enemy in the crosshairs of a target weapon.
(15)
(© Rumi Fujishima/Getty Images)
In visual metaphor, source- and target-frames are accessed by visually depicted elements. Mappings are then established between elements of the source-frame and elements of the target-frame which may or may not be explicitly represented in the image. As with verbal metaphor, entailments arise within the target-frame as a consequence of epistemological relations inherited from the source-frame. While attempts have been made to set out a step-wise procedure for visual metaphor identification, identifying instances of visual metaphor is not straightforward (Šorm and Steen Reference Šorm, Steen and Steen2018). This is because ‘the boundaries between the literal and the metaphorical are fuzzy and highly context-dependent’ (El Refaie Reference El Refaie2003: 75). It is also the case that ‘every individual reader or viewer is likely to bring his or her own experiences and assumptions to the interpretation process’ (Reference El Refaie2003: 75). The study of visual metaphors must therefore be carried out in a way that is sensitive to social, political and historical contexts of use for the metaphoricity of an image is strongly influenced by such factors.
The clearest classification of visual metaphor is offered by Forceville (Reference Forceville and Gibbs2008, Reference Forceville, Klug and Stöckl2016) who distinguishes between different types of visual metaphor based on how the metaphor is rendered in the image. In a contextual metaphor, an object is metaphorised by virtue of the new visual context into which it is placed. The principle here is that ‘a visually rendered object is turned into the target of a metaphor by being depicted in a visual context in such a way that the object is presented as if it were something else – the source’ (Forceville Reference Forceville, Klug and Stöckl2016: 246). The example in (15) presents such an instance where the image of a coronavirus features as the target of a military weapon and thus assumes the metaphoric sense of an enemy in a war. In a hybrid metaphor, two objects that are normally distinct are merged to form a single gestalt. What is typical of this subtype is that ‘the target and the source have been physically integrated. We can recognise both, but we cannot “disentangle” them’ (Forceville Reference Forceville, Klug and Stöckl2016: 248). An instance of hybrid metaphor is presented by (16) where the ball and chain in the visual component of the text integrates into its form a symbol representing the European Union.Footnote 6 The third form that Forceville identifies is simile. In examples of simile, two objects are represented as independent items but are made to look similar in some way. Following Forceville (Reference Forceville, Klug and Stöckl2016: 248), this can be achieved by various means including ‘juxtaposing target and source, by presenting them in the same form or posture, by depicting them with the same attention-drawing colour or in the same (deviant) style, by lighting them identically … or by any combination of these’.
(16)

Forceville’s classification, however, captures only a particular subset of visual metaphors where, as El Refaie (Reference El Refaie2003: 80) observes, ‘there seems to be a whole range of different forms through which metaphorical concepts can be expressed visually’. Rather than defining visual metaphor according to its surface realisation, El Refaie (Reference El Refaie2019: 14) therefore argues that, from a cognitive perspective, it makes sense to see as a potential visual metaphor any aspect of visual representation that ‘invites us to consider one kind of thing, concept, or experience in terms of another’. At least one further type of visual metaphor can be specified, however. In holistic metaphor, only the target-frame is explicitly represented but the image as a whole is reminiscent of imagery from another context or frame. This is often based on interdiscursive or specific intertextual references that the image makes to images associated with the source-frame. Intertextuality and interdiscursivity can also work across modes so that images can reference spoken or written texts and text-types and vice versa. Thus, as Werner (Reference Werner2004) observes, the ‘echoing of themes, quotations, symbols, storylines, or compositional elements from older images and famous written texts may create visual metaphors’. An example of holistic metaphor is presented by the image in the well-known Brexit poster produced by the UK Independence Party, reproduced in (17), which may be analysed as a visual instantiation of the metaphor immigration is moving water.
(17)
(© United Kingdom Independence Party (2016))
Although it depicts people, the form in which those people are presented bears a structural resemblance to prototypical images of water, such as rivers or streams, and therefore evokes a liquid frame. In resembling images of rivers, the image in (17) might also be analysed as making an intertextual reference to an equally well-known speech delivered by Enoch Powell in 1968 entitled ‘Rivers of Blood’ which included the extract in (18):
(18)
We must be mad, literally mad, as a nation to be permitting the annual inflow of some 50,000 dependents, who are for the most part the material of the future growth of the immigrant descended population … In these circumstances nothing will suffice but that the total inflow for settlement should be reduced at once to negligible proportions, and that the necessary legislative and administrative measures be taken without delay. (Enoch Powell, 1968)
Visual metaphors based on intertextuality/interdiscursivity are necessarily a matter of degree with images having stronger or weaker associations with the source-frame for different readers at different points in time. The images involved may therefore be described as having a potential metaphoric reading whose realisation is dependent on readers recognising the intertextual or interdiscursive connections made with the source-frame. As Werner (Reference Werner2004) further states, ‘allusions to historical events and personages, or to past cultural texts (e.g. poems, novels, famous quotations, art) are only successful if the reader is able to access the allusionary base from which the analogies are drawn’.
In multimodal texts, verbal and visual modes may enter into different intersemiotic relations with respect to metaphor. The most studied of these is intermodal metaphor in which source- and target-frames are indexed by representations in different modes (Forceville Reference Forceville and Gibbs2008, Reference Forceville, Forceville and Urios-Aparisi2009, Reference Forceville, Klug and Stöckl2016). However, metaphors may also be expressed cross-modally where the same metaphor is expressed fully and simultaneously in both verbal and visual modes.Footnote 7 In such instances, the metaphor expressed in one mode may constitute a more specific elaboration of the metaphor expressed in the other. Such a relation is presented by (16). In (16), the verbal text ‘The EU is restricting us from unleashing the kind of innovation which creates jobs and grows our economy’ evokes a metaphor eu is restraint, which implies that by leaving the EU Britain will be free to pursue its own goals. The same metaphor is expressed in the image where it is more specifically incarnated as eu is ball and chain. In other words, the source concept of restraint is, in the visual component of the text, realised more specifically in the concept of a ball and chain which implies that Britain is a prisoner to, or potentially even a slave to, the European Union. Alternatively, images that have a potential metaphoric reading may be subject to a metaphor anchorage effect whereby a verbally expressed metaphor makes apparent and/or reinforces the metaphorical reading of the image so that language and image come to co-express the same metaphor.Footnote 8 Finally, the source-frame indexed figuratively in a verbal metaphor may receive literal representation in the visual mode. In such cases, we can speak of frame-consistent images. The metaphor is not repeated cross-modally but there is a partial overlap and thus degree of complementarity in the conceptual structures evoked by each mode. 
As an example, the verbal text in (19) expresses a protest is fire metaphor indexed by ‘engulf’ while the source-frame of fire is represented literally in the co-text image.
(19)

A protester raises their fist near a fire outside the White House as protest engulfed the country for another night. (dailymail.co.uk, 1 Jun. 2020; © Samuel Coram/Getty Images)
7.4.1 Images of Immigration
A well-documented metaphor in anti-immigration discourse is immigration is war (Catalano and Musolff Reference Catalano and Musolff2019; Hart Reference Hart2010). This metaphor receives visual and multimodal representation as well as verbal representation. A visual instantiation of the metaphor can be found in (20) which shows another of the targeted Facebook ads produced by the BeLeave and Vote Leave campaigns in the run-up to the Brexit referendum.
(20)

The arrows in (20) are reminiscent of the arrows used in illustrations of battle plans such as the one analysed by Machin (Reference Machin2007). Machin (Reference Machin2007: 60) describes such arrows as ‘abstractions that suggest the idea of troop movements’. The association between arrows and the concept of invasion is entrenched through other culturally salient texts. For example, as shown in Figure 7.10, arrows were used in the opening sequence of the sitcom Dad’s Army to represent the threat of invasion from Nazi Germany. Similar arrows used in the context of immigration discourse may therefore, as a function of interdiscursive and for some readers specific intertextual knowledge, access the war frame and serve to represent immigration as an ‘invasion’.

Figure 7.10 Arrows in opening sequence of Dad’s Army
The immigration is war metaphor may also be realised cross-modally by the language-image combination previously given as example (9) whose caption referred to a ‘column of migrants’. ‘Column’ is the designation for a linear formation of soldiers moving in file. It is also, by metaphorical extension, the designation for a moving group of ants. The caption in (9) may therefore be analysed as expressing one of two metaphors: immigration is war in which immigrants are an army or immigrants are animals in which immigrants are an ‘army’ of insects. The image in (9) has a potential metaphoric reading that is consistent with the imagery of both these metaphors. The long-distance aerial shot of a densely packed organisational unit moving through fields is reminiscent of images of both invading armies and insects. The metaphoric construal in both modes is therefore likely to be determined or anchored by metaphorical expressions in other prominent regions of the text which perform a frame-setting function. Here, the headline of the article is observed to express a militarising metaphor as in (21). The language-image combination in (9) is thus likely to converge in evoking an immigration is war metaphor.
(21)
On the march to western Europe: Shocking pictures show thousands of determined men, women and children trudging across the Balkans as politicians warn EU could collapse in weeks. (dailymail.co.uk, 25 Oct. 2015)
Another well-documented metaphor in anti-immigration discourse is immigrants are animals (e.g. Santa Ana Reference Santa Ana1999), which is most conventionally realised in bovid or bovine characterisations such as found in (22) and (23) indexed by ‘flock’:
(22)
Britain’s population to overtake France as growing migrant families flock to UK in droves. (express.co.uk, 11 Jul. 2016)
(23)
Hundreds of migrants flock to Cherbourg ferry port as Calais ‘Jungle’ demolition continues. (telegraph.co.uk, 2 Mar. 2016)
In (24), the immigrants are animals metaphor is indexed by ‘stampede’, which normally refers to the sudden, startled movement of a group of large animals, including bovid and bovine creatures such as wildebeest and cattle:
(24)
PICTURED: The hoax leaflet which caused a migrant stampede in deadly river crossing THIS single A4 piece of paper caused a killer stampede of migrants to cross a swollen river which claimed the life of a pregnant woman. (express.co.uk, 16 Mar. 2016)
One question is to what extent such a metaphor may be repeated cross-modally. Co-text images accompanying (24), such as the one shown in (25), cannot be said to instantiate an immigrants are animals metaphor on their own. However, images like (25) do bear some resemblance, both in terms of content (depicting animate beings crossing a river) and form, to images found in wildlife reports of animal behaviour such as the one shown in Figure 7.11. In the context of the verbally expressed metaphor, such interdiscursive connections may be highlighted so that the images, as a function of their specific textual environment, come to be interpreted figuratively in a way that echoes the metaphor expressed verbally. In other words, the images may be subject to a metaphor anchorage effect whereby their potential metaphoric reading, made available through interdiscursive connections with another genre, is realised as a consequence of the verbally expressed metaphor.
(25)
(express.co.uk, 16 Mar. 2016; © REUTERS/Stoyan Nenov)

Figure 7.11 Animal stampede
The interdiscursive connection with wildlife reporting is further reinforced through verbal expressions occurring elsewhere in the text where examples like (26) share the same narrative structure as descriptions of animal stampedes such as (27), which feature in texts where images like the one in Figure 7.11 are also to be found.
(26)
It sparked 2,000 migrants, some even in wheelchairs, to brave a perilous 6km trek, then attempt to cross the swollen Suva Reka river, where Greece lay on the other side. (express.co.uk, 16 Mar. 2016)
(27)
Up to 80,000 of the creatures - some weighing up to 43 stone - risk death as they dash through the treacherous waters in their annual migration. (dailymail.co.uk, 14 Aug. 2018)
An elaboration of the immigrants are animals metaphor consists in the metaphor immigrants are pests, which is realised in descriptions of immigrants as insects or rodents. This extreme version of the metaphor, which carries intertextual references to examples of Nazi discourse and thus represents a radicalisation of immigration rhetoric, receives visual representation in a political cartoon published by the Daily Mail newspaper in 2015.Footnote 9 In the cartoon, the purported open borders of the European Union are caricatured as allowing undesirable migrants, represented metaphorically as rats, to enter the Euro Zone unchecked. Notice that in this example of a contextual metaphor it is the source-frame element (rats) that is inserted into the target context to produce the metaphor rather than a target-frame element being inserted into a source context. Of course, as with verbal forms of metaphor, visual metaphors are not necessarily uncritically accepted by audiences who may instead resist them. And, as with verbal metaphors, one form of resistance involves highlighting parallels between the metaphor-indexing image and fascist discourse. This resistance may itself be visual or multimodal as in (28) which points out parallels between the contested image and co-text images accompanying water metaphors in discourse identified as Nazi propaganda.
(28)
(Twitter user, 31 Oct. 2022)
7.4.2 Body Poses
One special source of visual metaphor is body-poses. Body-poses can be general connotators of meaning where they are ‘used to suggest a certain kind of person, a certain set of values, particular ways of living’ (Machin Reference Machin2007: 29). However, in political discourses, they can also be resemioticised to establish a metaphorical mapping between actors in the target-frame and actors (whether generic or specific) in a source-frame. Such metaphorical mappings can perform characterisation and serve to bestow actors with the qualities, values etc. associated with actors in the source-frame. This often involves an appeal to particular historical figures or else archetypal figures (Lule Reference Lule2001) like the hero, the villain, the victim and the trickster which feature repeatedly across cultural narratives.
As one example, the image in (29) appeared on the front page of the Daily Mail (29 Aug. 2019) accompanied by the headline ‘Boris takes the gloves off’. It appeared in the context of Johnson (unlawfully) proroguing parliament prior to Britain’s planned departure date from the EU, thus preventing any attempt by MPs to block a ‘no-deal’ Brexit.Footnote 10 The pose struck by Johnson in (29) is one of a repertoire of body postures and moves (Ostermeyer and Sittler Reference Ostermeyer, Sittler, Hart and Kelsey2019) associated with boxing and thus establishes a mapping between Boris Johnson and a boxer. The image therefore portrays Johnson as a skilled combatant willing to ‘take on’ his opponents. This potential metaphoric reading is reinforced by the verbal expression in the headline which similarly references a boxing frame. The bare fists in the image and the verbal reference to gloves being removed (a well-known idiom) further suggest an intention to ‘fight’ ruthlessly without restraint or mercy.
(29)
(© Tobias Schwarz/AFP/Getty Images)
While the body-pose in (29) is suggestive of a type of actor, the pose in which Johnson is shown in (30) is one associated with the specific historical figure of Winston Churchill. A near identical image was published on the front page of the Daily Express (24 Dec. 2020) accompanied by the headline ‘The deal is done!’ In a context where securing Brexit is construed as a victory for Britain, the pose in which Johnson is depicted recalls Churchill’s iconic ‘victory’ sign, an emblematic gesture which came to stand as a symbol of national unity and resilience in the face of a common enemy. The image thus evokes a world war ii frame with a correspondence established between Johnson and Churchill. Churchill is revered as a national hero in popular memory where he himself has come to stand as a symbol of courage in the face of adversity. His mythologised character is frequently exploited in nationalist discourses (Kelsey Reference Kelsey2012). Within the metaphorical ‘war’ of Brexit, images like (30) bestow Johnson with Churchillian qualities implying commitment and resolve in having led Britain to victory over the EU.
(30)
(© Peter Summers/Stringer/Getty Images)
If, within the set of correspondences established by a brexit is world war ii metaphor, Boris Johnson is repersonified as Winston Churchill, then representatives of the EU may be compared to Adolf Hitler. This correspondence is realised in (31), a text posted on Twitter by Leave.EU, where the pose in which Angela Merkel is depicted mimics the salute associated with Hitler and Nazi Germany. The association with World War II is strengthened in the verbal component of the text which refers to ‘two world wars’.
(31)
(Leave.EU via Twitter, 8 Oct. 2019)
In comparing Johnson to Churchill and Merkel to Hitler, images such as those in (30) and (31) retell an archetypal narrative (Lule Reference Lule2001) of ‘hero versus villain’ played out in the context of World War II. In this narrative, the hero fights for the values and ideals of the society in which their story features. They are intelligent, brave and benevolent. By contrast, the villain is intelligent but malevolent, devious, driven by self-interest and intent on destruction.
The body-poses presented by (29)–(31) are reminiscent of general imagery associated with certain activity types and historical figures. In (32), documenting the gilets jaunes (yellow vests) protests in France, the body-pose struck, as well as other features of the image, recalls a specific text whose wider context is consequently brought to bear in conceptualising the target-scene. The image in (32) makes an intertextual reference to Victor Hugo’s Les Misérables, and specifically the famous ‘barricades’ scene found in promotional literature for the 2012 screen adaptation of the novel. It makes a further intertextual reference in turn to a well-known painting, namely Eugène Delacroix’s Liberty Leading the People, which was produced to commemorate the July Revolution of 1830 and shows Marianne, a national symbol of the French Republic, personifying the Goddess of Liberty. The painting is shown in Figure 7.12.
(32)
(© Alain Jocard/AFP/Getty Images)

Figure 7.12 Eugène Delacroix’s Liberty Leading the People.
By virtue of the intertextual references it makes, the image in (32) may therefore be analysed as a visual instantiation of a protest is revolution metaphor, which, arguably, provides a counter-perspective to the dominant protest is war metaphor. The meanings communicated by images, however, depend on subjective feelings toward the source-frame. While intertextuality imbues images with the cultural or historical appeal necessary to be successful in resonating with readers, such symbols are not univocal but multivocal, having different meanings for different people (Gill and Angosto-Ferrandez Reference Gill and Angosto-Ferrandez2018). National symbols such as Marianne (and Churchill) are therefore themselves sites of power struggle as much as they are vehicles used in struggles over power (Reference Gill and Angosto-Ferrandez2018). Depending on one’s point of view, the image in (32) may therefore construe the events in question either as a necessary act of rebellion or as an unwanted waging of civil war. As it was published in The Times, the co-text image was accompanied by the verbal text in (33), which repeatedly expresses a protest is war metaphor. For readers of The Times who recognise the intertextual reference, the image is therefore likely to be read in a way that complements the verbal metaphor and confers a negative evaluation on the event depicted.
(33)
(The Times, 2 Dec. 2018)