Data visualizations have made a large impact on presenting research results in an understandable way. They are often aimed at a non-technical audience and treated as being as objective as the dataset they are based on.
But at its core, data visualization is an act of storytelling. Visualizations tell data-driven stories from the perspective of their creator(s), the data analysts and designers in particular. Like any story, the end result isn’t ‘objective’ and contains bias.
In the following post, we present four basic biases that may be found in data visualization. We describe how each bias appears in a visualization and offer recommendations for how it may be addressed and mitigated. In order to create visualizations that contain a minimal amount of bias, we must not only be cognizant of our own data visualization practices, but also savvy in our evaluation of the visualizations produced by others.
There is a premium on statistical analysis and numerical measurement in evaluating outcomes and results. This translates into visualizations that are valued for their inclusion of descriptive statistics and linear models. In addition, the use of metaphor in the development of visualizations is often gleaned from Western-centric perspectives, sometimes becoming ‘lost in translation’ when sharing these data visualizations with non-Western audiences. But what about when creators develop non-Western-centric data visualizations for the purpose of communicating a story or narrative? In the realm of international development, this can mean failure to fully communicate with Western donors, the result being a loss of confidence on the part of the donor regarding a program’s effectiveness.
One of the advantages of creating data visualizations is the ability to reach a wider audience that may not have the skills necessary to interpret complex calculations and data output. However, some visualizations (most often graphs) may still be created for a narrow audience, or a visualization may appear to tell one story when, upon more critical review, it is actually telling another. An example of this last point would be presenting a bar graph in which the value axis (usually the Y-axis) does not begin at zero, so that small differences between bars look much larger than they are.
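To make the truncated-axis example concrete, here is a minimal sketch in Python. The program names and values are made up for illustration, and `drawn_ratio` is a hypothetical helper, not part of any plotting library. It compares how much taller one bar is drawn than another when the value axis starts at zero versus at a higher baseline.

```python
def drawn_ratio(smaller, larger, baseline=0.0):
    """Return how many times longer the larger bar is drawn than the smaller one
    when the value axis starts at `baseline` instead of zero."""
    if baseline >= min(smaller, larger):
        raise ValueError("baseline must be below both values")
    return (larger - baseline) / (smaller - baseline)


# Made-up values for two hypothetical programs.
program_a = 50.0
program_b = 55.0

# With the axis starting at zero, bar lengths match the real ratio of the values.
honest = drawn_ratio(program_a, program_b, baseline=0.0)

# With the axis starting at 45, the same data is drawn very differently.
truncated = drawn_ratio(program_a, program_b, baseline=45.0)

print(f"Axis from 0:  bar B is drawn {honest:.2f}x the length of bar A")
print(f"Axis from 45: bar B is drawn {truncated:.2f}x the length of bar A")
```

In this made-up case a 10% difference in the data is drawn as a bar twice as long. A check like this can be run before publishing a chart, or the baseline can simply be stated clearly on the chart itself.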
There is even the risk of unintentional bias, particularly with color choice. For instance, using the red/green combination when labeling numeric values may be interpreted as ‘bad/good’ by some audiences.
Continuing to utilize Western-centric, high-level, and quantitatively focused tools for creating data visualizations, at best, reduces access to the story that the data are telling. At worst, this practice paves the way for misinterpretation that could lead to poor decision-making at the institutional or programmatic levels.
Whether we are aware of it or not, the language we use is not neutral or inert. It’s one of the reasons data services and algorithms are developed to recognize biased language (e.g., Textio as a service to highlight gender bias in job descriptions). But, as literary theorists like Edward Said have pointed out, language and identity politics have often been intertwined. Over time these words-as-identities can turn into negative (or positive) stereotypes that affect policies, economic mobility, etc.
When working with contested topics, such as racial disparity and inequality, words and other ‘minor’ details can make a big impact. In our own work, we (Jennifer and Norman) try to be as cognizant of these issues as possible and edit accordingly.
One example that Norman is intimately aware of is the language used in the politics of contested spaces in Israel-Palestine. Maps, in particular, become a site of contestation due to language, language that the data visualization creator might be using unintentionally or without realizing the larger issues!
I believe there are two key contestations in language that appear in maps: 1) how to refer to the territory between the 1949 Israel-Jordan armistice line and the western Jordanian border, the area most commonly referred to as the West Bank; and 2) what to call the separation barrier that Israel has built [1].
The name “West Bank” is a simple geographical descriptor. It is the western bank of the Jordan river, so why not simply call it the West Bank? However, in Israel and on pro-Israeli maps, this territory is referred to as Judea and Samaria, names for the territory from the Hebrew Bible. The contestation plays out in the recognition or lack of recognition of ‘historical’ ties to the land and has major implications for negotiations between the Israelis and Palestinians.
When map creators choose a name, they are intentionally or unintentionally taking a side in this land dispute.
The creator(s) of a data visualization can have the unenviable job of trying to create an easy-to-read visualization on contested topics. Language becomes a political stance and a way that bias can intentionally or unintentionally creep into a visualization, even if the choice is made with the best of intentions and to comply with other considerations, such as creating an uncluttered and understandable map.
Biased Data Narrative
One of the ways in which data visualizations may be biased is in the narrative that the creator of the visualization develops. The creator becomes a gatekeeper of the data, choosing which data to visualize, as well as the nature of the visualization. Without feedback from external reviewers or end-users (if appropriate), these visualizations can drive a singular, and perhaps misguided, agenda with or without the creator’s intention to do so.
In following data visualization standards, such as highlighting key data to make your point clear, alternative interpretations can become limited. People are wont to find patterns in random data, a tendency referred to as apophenia. There is also the tendency to reaffirm existing patterns. When a pattern is highlighted, it biases the audience to see that pattern, whether it exists or not, and to agree with the creator, thereby introducing the creator’s bias into the visualization.
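The point about patterns in random data can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings (20 series of 10 points each, a fixed seed, and a hand-written `pearson` helper), not a description of how any particular visualization was made. It generates pure noise, compares every pair of series, and reports the strongest correlation it can find.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


rng = random.Random(42)  # fixed seed so the run can be repeated
n_series, n_points = 20, 10  # assumed sizes, chosen only for illustration

# Every series is independent random noise: there is no real relationship.
series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]

# Compare every pair of series and keep the strongest correlation found.
pairs = list(itertools.combinations(range(n_series), 2))
correlations = {(i, j): pearson(series[i], series[j]) for i, j in pairs}
best_pair = max(correlations, key=lambda p: abs(correlations[p]))
strongest = correlations[best_pair]

print(f"{len(pairs)} pairs checked; strongest correlation is "
      f"{strongest:+.2f} between series {best_pair}")
```

With this many pairs and so few points per series, the strongest correlation is usually large enough to look meaningful on a chart. Highlighting only that pair would present noise as a pattern, which is the risk described above.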
While the creator of a visualization may have a preconceived notion of how they intend to visualize their data even before they conduct an analysis, it is important to leave room for discovery. Visualizations may change with changes in measurements over time, but also with changes in context, participant populations, audiences, etc.
The end result of the data visualization hides the methodology and thought process of the creator(s). It is never clear whether a pattern was determined before data analysis and then simply confirmed in the visualization, a common bias known as confirmation bias.
Validity, reliability, falsifiability, generalizability, and replication are five commonly accepted hallmarks of good empirical research. Replication, in general, ensures the validity and reliability of analysis results. Replication also supports the generalizability of these results to populations outside of the data sample. There is an inherent difficulty in reproducing data visualizations given the amount of personalized contributions involved in developing data visualization products. While the rise of tools like Python and R does make sharing scripts easier for reproducibility, it does not remove the replication bias due to personalized contributions.
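One modest way to make those personalized contributions visible is to publish a record of them alongside the visualization. The following is a minimal sketch of that idea, not an established standard: the `provenance` function, the field names, and the example data are all hypothetical. It fingerprints the data and writes down the design choices that shaped the chart.

```python
import hashlib
import json
import platform


def provenance(rows, choices):
    """Build a small record describing the data and design choices behind a chart.

    `rows` is the data as a list of dicts; `choices` is a dict of the creator's
    decisions. Both structures are assumptions made for this sketch.
    """
    # Serialize the data deterministically so the same data gives the same hash.
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return {
        "data_sha256": hashlib.sha256(payload).hexdigest(),
        "python_version": platform.python_version(),
        "choices": choices,
    }


# Made-up data and choices for illustration.
rows = [{"group": "A", "value": 50}, {"group": "B", "value": 55}]
choices = {
    "chart_type": "bar",
    "value_axis_start": 0,
    "palette": "blue/orange",
    "excluded_rows": "none",
}

record = provenance(rows, choices)
print(json.dumps(record, indent=2))
```

A record like this does not remove the creator's influence, but it lets a reviewer see which decisions were made and check that they are working from the same data.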
The biases we have discussed here are not unique to data visualization. In presenting data visualizations, the creator must assume that a general audience will take much of the presented data product for granted, more than they would if the data were presented in non-visual form.
The humanities and liberal arts place strong emphasis on understanding the context of cultural artifacts. When talking about written works, this means recognizing the reader and author as people who bring in bias based on their cultural backgrounds. Even though reflexivity as a method is being debated in the social sciences, the underlying need to reflect on bias and methodology can be brought into data visualization to position the creator.
Another way in which we can reduce bias is to take ownership of our processes and methods for analyzing and presenting data. Data visualization experts should be encouraged to develop procedures and techniques that demonstrate transparency in how they create visualizations. A meta-analysis of these procedures and techniques may even be included as part of a larger data visualization product.
About the Authors
Jennifer M.K. Parkin - Jennifer is an inspiring strategist who applies her wealth of methodological skills and training to solving problems at the intersection of conflict processes and public health. She holds a Master in Public Health and is working on completing her PhD dissertation in Political Science. Find out more at her website. Contact her on Twitter @jmkernerparkin.
Norman Shamas - Norman is an educator and activist whose work primarily focuses on technology, culture, and marginalization. They are a technologist by way of the humanities and liberal arts, a human rights worker by way of post-colonialism, and a data nerd by way of privacy advocacy. Contact them on Twitter @normanshamas.
[1] While we won’t be addressing this explicitly in the blog post, Israel has built a divider, under the guise of security, in the West Bank to denote land ‘ownership’. This man-made object has become a necessary feature in maps of the region. But what do you call this barrier? Most international sources refer to it as a barrier, because the term wall is contested due to its references to the Berlin Wall. If we look at the term in Hebrew (gader גדר), it is different from the word for wall (chomah חומה), which is the word used for the Berlin Wall. Without going into a full word study, gader is an older Hebrew word that refers to fences and other separations due to restrictions.