Translation on the internet is a full stack problem. Here are some of the issues to consider as we think about tackling hate speech in minority languages like Burmese.
In 2016, I did a visiting fellowship at the Nieman Foundation for Journalism, where I focused on what was admittedly a geeky topic: what translation looks like for journalists in a networked age, and what new tools, techniques and methods exist for fostering linguistic diversity in our reporting.
A recent report by Reuters about the devastation in Myanmar and the systemic issues Facebook has faced with the Burmese language pointed to a lot of the issues of inequality I explored. Here are a few excerpts:
- The government began reporting cases to Facebook, but Tun said he quickly realized the company couldn’t deal with Burmese text. “Honestly, Facebook had no clue about Burmese content. They were totally unprepared,” he said. “We had to translate it into English for them.”
- Many of the millions of items flagged globally each week — including violent diatribes and lurid sexual imagery — are detected by automated systems, Facebook says. But a company official acknowledged to Reuters that its systems have difficulty interpreting Burmese script because of the way the fonts are often rendered on computer screens, making it difficult to identify racial slurs and other hate speech.
I’m republishing some of my findings from my time at Nieman (originally published on the terrific Fold platform), in the hope of sparking some thoughts and conversation about what exactly is needed in how we deal with language online. In short: it’s a full stack problem, and multiple issues, from fonts to keyboards to content moderation, compound existing inequalities.
Some other resources:
- You can read this essay I wrote about the specific challenges facing Asia when it comes to language divides.
- If you’d rather watch a video, you can see this talk I gave at the 2017 Stockholm Internet Forum.
As my time as a Knight Visiting Nieman Fellow winds down, I wanted to reflect a bit on what I’ve learned about journalism, translation and the importance of the network in contemporary digital journalism. Much of this applies more broadly (language is going to be, and already is, a critical issue for technologists concerned about supporting the increased range of people online), but I’ll focus on the specifics of journalism in this post.
It’s been an incredible few weeks of interviews, conversations, seminars, workshops, historical research (especially at the beautiful Widener Library), Hacks/Hackers, and a conference on comments and going beyond them. We also managed to squeeze in a few pilot projects with Bridge, Meedan’s platform for translating social media. I’ll be writing a longer, more thoughtful version of my time for Nieman Lab in the coming weeks, so I won’t try to craft too much of a logical narrative in this post.
Instead, some notes to jot down:
We’re moving toward a majority internet population. With 3.3 billion people online and an 832% growth rate, the internet is incredibly diverse.
The “next billion” have arrived, and already, language diversity is steadily increasing. I’ve written before about how ostensibly “offline” communities, like those in rural northern Uganda, North Korea and Cuba, are impacted by the internet, and it’s important to keep in mind that the internet has ripple effects far beyond those who are formally online. As we crossed into a majority urban population, even rural areas have now oriented toward cities, providing raw and manufactured materials and serving as dumping grounds.
A similar effect will no doubt take place with the internet — even if not everyone is officially connected with a single user account, they will be pressured to find creative solutions to get connected. (Zachary Hyman and I have a piece coming out soon in Makeshift to this effect, and you can read what Julia Ticona and I discussed in the US context for Civicist.)
With regard to language, the sheer diversity of speakers online is stunning. From 2000 to 2015, we saw 6592% growth amongst Arabic speakers, 2080% amongst Chinese speakers and 3227% amongst Russian speakers, to name a few. Even more striking is the fact that English speakers will soon be the minority online, and the growth of languages outside the top ten continues apace. If the news is breaking, it’s almost always going to happen online too. And more importantly, it will be happening in many more languages than English.
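To make those percentages concrete, here is a minimal sketch of the arithmetic behind a growth figure. The function names and the sample user counts are my own illustrative assumptions, not numbers from any report; the point is only that a growth rate of several thousand percent means the later population is dozens of times the earlier one.

```python
def percent_growth(start, end):
    """Percentage growth from a starting count to an ending count."""
    if start <= 0:
        raise ValueError("start must be positive")
    return (end - start) / start * 100


def growth_multiplier(growth_percent):
    """How many times larger the end value is than the start value."""
    return 1 + growth_percent / 100


# Hypothetical counts: 1 million speakers online growing to 10 million.
print(percent_growth(1_000_000, 10_000_000))

# Multiplier implied by the growth figure quoted for Arabic speakers.
print(round(growth_multiplier(6592), 2))
```

Read this way, 6592% growth means roughly 67 times as many speakers online at the end of the period as at the start.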
Multilingual content hasn’t caught up with multilingual users.
This is both a challenge and an opportunity. According to the IDN World Report, English content is vastly overrepresented on the web. Part of this, of course, can be explained by the fact that many people speak English as a second language. But other languages, like Arabic, Chinese and Spanish, are severely underrepresented.
This sounds like an opportunity for content creators to make relevant content for speakers of these languages, whose experience of the internet is much more limited than that of English speakers. At the same time, adapting the current business models (advertising and pay to read) for these new markets will be a challenge. As BuzzFeed’s Greg Coleman pointed out, global advertising presents unique challenges. If so many people speak English, why bother with other languages?
As many of the interviews I’ve done made clear, readers tend to prefer their own language, even if they do speak English. I’d like to dive into this with more rigorous research, but it generally makes sense. As digital journalist and Nieman Fellow Tim de Gier described it to me, the internet is full of road bumps. Our job as journalists is to reduce those road bumps and point people to our articles. If it’s in another language, even one we speak, that’s just one more bump in access.
Networked journalism is here to stay. And it’s an opportunity for more diverse stories.
In 2006, Jeff Jarvis defined networked journalism as a field where “the public can get involved in a story before it is reported, contributing facts, questions, and suggestions. The journalists can rely on the public to help report the story; we’ll see more and more of that, I trust. The journalists can and should link to other work on the same story, to source material, and perhaps blog posts from the sources…. After the story is published — online, in print, wherever — the public can continue to contribute corrections, questions, facts, and perspective … not to mention promotion via links.”
He added that he hoped it would be a sort of self-fulfilling prophecy, as more newsrooms turned to networks to both source and distribute the news. Journalists are shifting from being simply manufacturers of news to being moderators of conversations.
This month, at the Beyond Comments conference hosted by MIT Media Lab and the Coral Project, it became increasingly clear that major news outlets are striving for an alternative. In a terrific panel moderated by Anika Gupta, journalists like Amanda Zamora, Joseph Reagle, Monica Guzmán and Emily Goligoski pointed out that we need to make a shift from thinking of the audience as an audience to thinking of them more as a community.
To meet both speed and accuracy, translators need better tech and better processes.
In a breaking news environment, both speed and accuracy are critical. Indeed, translation and technology have always worked closely together. There are two examples that stick in my mind. The first is the Filene-Finlay simultaneous translator, developed at IBM and used in the Nuremberg trials. The second is the printing press: in Western Europe, it wasn’t until books were translated from Latin to vernacular languages that they started to have an impact.
What does this look like in the digital context? It’s something we’re exploring at Meedan with Bridge, our platform for social media translation. Other great examples include Yeeyan, a Chinese platform for crowdsourcing news translation; Amara, for subtitling videos on platforms like TED; and Wikipedia.
But just as important as the tech, we need better systems and processes. The rigorous training of UN interpreters has made simultaneous interpretation at scale possible today. Glossaries, keeping up to date with the news, pairing interpreters together: this is the stuff that makes the tech powerful, because the humans behind it are more effective.
These processes can be supplemented with new tools in the digital context. Machine translation, translation memories, and dynamic, shared glossaries can all help, as can fostering a collaborative mindset. What’s most striking to me is the fact that interpretation at the UN is collaborative, with at least two interpreters per language pair. As we do away with the myth that translation is a one-to-one matter (i.e., one translator to one text), we can generate a stronger body of translations made possible through collaboration.
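For readers less familiar with these tools, here is a minimal sketch of what a translation memory and a shared glossary do. It is not how Bridge or any other platform is implemented; the class name, the function name, the 0.8 similarity threshold and the sample phrases are all my own assumptions for illustration. The fuzzy matching uses `SequenceMatcher` from Python’s standard `difflib` module.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Stores previously approved (source, target) pairs so they can be reused."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # minimum similarity to count as a match (assumed value)
        self.entries = []  # list of (source, target) pairs

    def add(self, source, target):
        self.entries.append((source, target))

    def lookup(self, source):
        """Return (target, score) for the closest stored source, or None."""
        best = None
        for stored_source, stored_target in self.entries:
            score = SequenceMatcher(None, source.lower(), stored_source.lower()).ratio()
            if score >= self.threshold and (best is None or score > best[1]):
                best = (stored_target, score)
        return best


def glossary_hits(text, glossary):
    """Return the glossary terms found in the text, with their agreed translations."""
    lowered = text.lower()
    return {term: target for term, target in glossary.items() if term.lower() in lowered}


# Hypothetical shared resources that several translators contribute to.
memory = TranslationMemory()
memory.add("Breaking news", "Noticias de última hora")
shared_glossary = {"ceasefire": "alto el fuego"}

print(memory.lookup("breaking news"))
print(glossary_hits("A ceasefire was announced today", shared_glossary))
```

The design choice worth noticing is that both structures are shared: every translator who adds an approved pair or an agreed term makes the next translator faster and more consistent, which is the collaborative model described above.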