You don’t know quality when you see it: Bias and bibliometrics

One important discovery in psychology is that decisions and evaluations are affected by their context and by experiences accumulated over a lifetime.

We can lump these two things together — context and experience — and call them bias. Our decisions are usually biased. That is, they deviate from what they would be if we were purely objective.

To take one example from the work of Daniel Kahneman and Amos Tversky, our ability to make numerical predictions can be biased by another number present in the context; our prediction gets anchored to that other number.

Anchoring effects

In one of their studies, subjects were asked to predict the percentage of African nations that were members of the United Nations. Prior to making a prediction, they watched as a colorful wheel of fortune spun and eventually landed on some number.

The subjects were first asked if they thought the percentage of member countries was higher or lower than the number on the wheel. This made the number particularly salient. Then they were asked for a specific guess: what is the percentage of African nations in the UN?

The number on the wheel of fortune influenced their judgements. For example, those who spun a 10 guessed, on average, that 25% of the countries were members of the UN, while those whose wheel showed a 65 guessed, on average, 45%.

Context affected their predictions. This is normal; the work of Kahneman and many others makes it clear that there is a rich set of biases affecting our decisions.

Is academia a meritocracy?

Success in research and education rests on our ability to identify and promote high quality work. Hiring, promotion, evaluating our peers’ submissions to journals, awarding grant monies, giving grades to students, selecting students for advanced programs — all of these depend on the capacity to identify quality.

Yet when research on decision-making processes reveals pervasive bias, it’s only reasonable to expect that our decisions in academic contexts are also biased. Of course, we have a hard time believing that. We want to trust that academia is a meritocracy; we’re convinced that we know quality when we see it.

But do we? Fortunately, we don’t have to speculate. We’re starting to understand the role of bias in the evaluation of academic work. There is research that reveals our biases. Here are two examples and some thoughts about their implications for policy.

Biased assessment of scientific articles

When experts are asked to evaluate the quality of a paper that has already been published, are they influenced by the impact factor (IF) of the journal?

This question is investigated in the article The Assessment of Science, whose authors explored agreements and disagreements in expert evaluations of articles drawn from two large databases. The articles had already been published, and the expert assessors knew the journals in which they had appeared.

While this research finds relatively low levels of agreement between experts in their evaluation of the articles, it identifies a stronger correlation between assessor scores and the impact factor of the journal in which the article is published.

How might we interpret this result? What does it mean when a reviewer tends to give higher scores to papers that were published in journals with high impact factors?

Do more prominent journals publish better research?

One possibility is that the reviewer and the journal are both good at identifying quality. The high IF journal publishes better work and the reviewer sees that.

A second possibility is that the higher score given by the reviewer is actually due to the IF of the journal, i.e. that the reviewer is biased by the idea that a particular journal publishes higher quality research.

How could we tease these apart? That’s what The Assessment of Science shows us; here’s how they do it.

The number of citations a paper receives in a 5-year post-publication period is stipulated by the project to be an independent indicator of quality. If one paper has 50 citations and another has 75, the latter is taken to be of higher quality than the former.

This stipulation is in fact conservative, since we know that articles in higher-IF journals get more citations by virtue of appearing in such journals. How do we know that? There are studies of several thousand articles that have been published in more than one journal, and in those cases the version in the higher-IF journal garners more citations.

The researchers then look at pairs of papers that have the same number of citations but were published in journals with different impact factors. For example, Paper-1 might have 50 citations and appear in a journal with an IF of 5, while Paper-2 also has 50 citations but appears in a journal with an IF of 15.

For the purposes of the study, this means that the quality of the papers is the same, since they have the same number of citations. What do the reviewers say about these papers? They conclude that the one in the journal with higher IF is better.

Furthermore, if IF does influence citations, as the evidence suggests, then it might well be the case that Paper-1 is actually better, since Paper-2 is getting a citation boost based not on its quality alone, but on the visibility of the journal in which it is published.

When reviewers conclude that Paper-2 is better, they are being influenced by something other than the quality of the paper itself; they are displaying bias. The context for the review — manifesting itself in the prestige and IF of the journal in which the paper is published — is subjectively influencing the evaluation of quality of that paper.
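
To make this matched-pair design concrete, here is a minimal sketch in Python. The records, numbers and variable names are invented for illustration; this is not the authors' code or dataset.

    from collections import defaultdict
    from statistics import mean

    # Invented example records: (citations in the 5-year window, journal IF, reviewer score)
    papers = [
        (50, 5.0, 3.1), (50, 15.0, 4.2),
        (75, 4.0, 3.4), (75, 12.0, 4.0),
        (30, 2.0, 2.5), (30, 9.0, 3.3),
    ]

    # Group papers by citation count -- the study's stand-in for quality.
    by_citations = defaultdict(list)
    for citations, impact_factor, reviewer_score in papers:
        by_citations[citations].append((impact_factor, reviewer_score))

    # Within each group "quality" is held constant, so a systematic gap in
    # reviewer scores between low-IF and high-IF journals points to the
    # journal, not the paper, as the driver of the evaluation.
    for citations, group in sorted(by_citations.items()):
        group.sort()  # order by impact factor
        low_if_scores = [score for _, score in group[: len(group) // 2]]
        high_if_scores = [score for _, score in group[len(group) // 2:]]
        print(citations, mean(low_if_scores), mean(high_if_scores))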

Biased assessment of student exams

Another example of bias emerges when studying evaluation of students’ exams. In the article When precedence sets a bad example for reform: conceptions and reliability of a questionable high stakes assessment practice in Norwegian universities, we find the results of a study in Norway in which seven experts assigned grades to 50 psychology exams that had already been graded by an actual exam commission consisting of three people.

Two results are of particular relevance here. The first regards whether a particular paper should be awarded a passing grade or a failing grade. There was considerable variation among the experts regarding their level of agreement with the commission. One of the experts looked at all the papers and gave passing marks to only half of those the commission had passed. Three other experts gave passing marks to 70-some percent (72%, 75%, 78%) of those the commission had passed.

When it came to the failing papers, the level of agreement was even worse. One expert failed only 44% of the papers the commission had failed. Five others ranged from 61% to 89% agreement (61%, 61%, 67%, 83%, 89%).
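
To be clear about what these percentages measure, as I read them: each is the share of commission-passed (or commission-failed) papers on which an individual expert made the same call. A minimal sketch in Python, with invented decisions:

    # Invented pass/fail decisions, for illustration only -- not the study's data.
    commission = ["pass", "pass", "fail", "pass", "fail", "pass", "pass"]
    expert     = ["pass", "fail", "fail", "pass", "pass", "pass", "fail"]

    # Among the papers the commission passed, what share did this expert also pass?
    expert_calls_on_passes = [e for c, e in zip(commission, expert) if c == "pass"]
    agreement_on_passes = expert_calls_on_passes.count("pass") / len(expert_calls_on_passes)
    print(f"{agreement_on_passes:.0%}")  # 60% in this invented example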

The second result concerns a crucial threshold that students must exceed to be admitted to the professional studies program in psychology. Here the variation between the seven individual experts and the commission was even greater.

The highest level of agreement came from an expert who agreed with the commission 86% of the time about whether a particular paper was above or below the threshold. The lowest level of agreement was 43%.

When seven individual experts have such substantial disagreements among themselves and with the three-person commission that actually graded the papers, clearly they have not agreed upon a set of criteria that can be objectively implemented or reliably replicated.

There is bias based on the individual experts’ ideas of quality and merit — ideas that are built on lengthy professional experience and then brought along to the grading process.

Two ways to reduce the effects of bias in publishing

Based on these two studies, and others that contribute to the picture, we have to admit that even advanced expertise does not lead to reliable evaluations of the quality of work. We just aren’t good at judging quality — at least not in ways that yield agreement with other researchers. Bias is everywhere in decision-making processes, including in decisions about what is good, what is excellent and what is ground-breaking.

What are the implications of this conclusion? What does this mean for academic life and for the ways we evaluate the results of research? When we focus on publishing, what might we learn from acknowledging the reality of bias, including in our ability to judge academic or scientific quality?

The growing literature on bias leads me in two directions when I try to answer this question. First of all, where it is possible, we should build systems that start by acknowledging bias and then try to minimize its effects.

Secondly, and perhaps even better, we should actually try to build systems that simply avoid making use of what we now know to be inevitably subjective judgments of quality. Both of these can be illustrated in our publishing system.

Minimizing bias when evaluating articles

As an example of the first strategy — minimizing the effects of bias — consider the question of whether open access approaches to publishing can enhance the quality control function of peer review.

[In the talk which forms the basis for this text, I went on to discuss some benefits of OA publishing, such as enhanced longevity, increased likelihood of reaching students (especially for OA books), new approaches to peer review, and more. These benefits give more opportunities for evaluation and thereby a stronger foundation for confidence about the quality of the work, given an infrastructure that facilitates this. I have discussed these types of issues in previous blog entries, and will leave the reader here with a few links in the hope that you make it to the more urgent issue in the next section.]

The use and abuse of bibliometrics

The second response to bias I mentioned above is to build structures which minimize our dependency on subjective evaluations. This issue is particularly important today because last week the Norwegian government received a set of recommendations for modifying the Norwegian system, and those recommendations would increase — not decrease — our dependency on subjective evaluations. This, I would argue, is not the way for policy-makers to go.

We have a system in Norway which awards funds to universities based not only on how many publications they produce but also on a classification of the journals in which those publications appear. The majority of journals are called Level 1 journals, while a select group, Level 2, earns more funds. The committee that is responsible for administering this system describes Level 2 as being those journals “which publish the most significant publications within a particular field.”

That same committee also notes that awarding an outlet status as a Level 2 journal will “to some degree give guidance about where one sends one’s best works.” In other words, the system is built in a way that is likely to affect our behavior.
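
To see how such a signal steers behavior, here is a minimal, purely illustrative sketch in Python. The point weights are hypothetical placeholders, not the official Norwegian rates.

    # Hypothetical point weights per article, by journal level -- illustration only.
    LEVEL_WEIGHTS = {1: 1.0, 2: 3.0}

    def publication_points(journal_level, author_share=1.0):
        """Points earned for one article, scaled by the institution's share of the authors."""
        return LEVEL_WEIGHTS[journal_level] * author_share

    print(publication_points(1))        # 1.0
    print(publication_points(2))        # 3.0 -- the same article is worth more on Level 2
    print(publication_points(2, 0.5))   # 1.5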

The committee’s rules wisely avoid using impact factor as a criterion for Level 2, acknowledging the weaknesses of that measure. Instead, Norway uses recommendations from national field-specific committees that identify the journals to be considered for Level 2. These committees reach their recommendations through discussion and the kinds of political processes associated with all committee work.

Based on the research about bias, it is inevitable that the suggestions from the national committees — even if made without an ounce of cynicism — will reflect far more than excellence and high quality. They will reflect debates, compromises, special interests and other factors — in short, bias. All of this is normal and not necessarily scandalous, but we have built a system that rewards publication in places that don’t necessarily deserve that competitive advantage.

Bias will be present in creating the distinction between Level 1 and Level 2. Even if we believe we are nominating based on quality, that will be a subjective nomination; it will be unreliable and irreproducible.

Leadership and the pressure to publish

But let’s ask a bigger question: why is it important for the leadership in our sector to push people towards particular journals? Why should we have a financing system which rewards publications in one serious journal more than publications in another?

One of the traditional functions of journals is to help scientists decide what to read. Maybe those in leadership positions want their researchers to be visible and therefore find it legitimate to guide them in certain directions. With around 2,000,000 scholarly articles published every year, it is truer today than ever that we need help sorting through the sea of scholarly publications.

But we no longer need journal brands to help us with that work; instead, we can find what we need through Google, F1000 or Twitter. Indeed, we know that visibility is enhanced by publication in open access journals, so if visibility is to be stimulated, a better proposal would be to stimulate open access publishing — a topic the committee leaves untouched.

Instead, the recommendations given to the government of Norway last week remain hung up on the prestige associated with a few journals; the committee explicitly notes that there are relatively few papers by Norwegians in Science and Nature. In response to this, they recommend introducing a third level in the financing system, to reward such publications.

The mistake of a 3-level reward system

This third level of journals is supposed to be identified neither on the recommendations of the national committees nor on the basis of impact factor, but instead using a new bibliometric measure called the Article Influence Score (AIS). Without going into details here, I will simply note that AIS retains various biases, not least because of its parasitic relationship to IF. Published research has concluded that “AIS would not seem to be entirely necessary for the Social Sciences, and not at all necessary for the Sciences, relative to the leading journal performance measures that are already available.”
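
For readers who want a rough sense of the measure, the published definition, as I understand it, normalizes a journal's Eigenfactor score by that journal's share of all articles over a five-year window:

    \[
    \text{AIS} = \frac{0.01 \times \text{Eigenfactor score}}{\text{journal's articles} \,/\, \text{all articles, over five years}}
    \]

Like IF, it is a journal-level average; nothing in its construction says anything about the quality of an individual article.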

A recent study examines where Nobel Laureates in physics had published their landmark papers — those cited by the Nobel Committee when making the award. The authors conclude that “there is no correlation between journal IF and the quality of individual research papers, and IF should not be an exclusive criterion to evaluate the quality of scientific output of an individual or even the quality of a journal.” Evidence suggests that the same critique applies to AIS.

Adding a third level to the Norwegian system is a bad idea. This is a case in which a system will inevitably be built on biased determinations, and we would be best off leaving this proposal aside. We should not build a more biased system.

The committee also wants to raise the value of Level 2 publications, which they justify by explicitly stating their assumption that individual publications in more highly ranked journals are of higher quality. This assumption is unjustified and flies in the face of a substantial body of research that demonstrates exactly the opposite.

“The Impact Factor and the Article Influence Score are intended to assist in the evaluation of journals. To apply either score formulaically to […] ranking of individual articles is to misapply the numbers.” [source] There is every reason to assume that the selection process for the Level 2 journals in Norway also fails to be an appropriate indicator of quality for individual articles. It is therefore irrational to construct a system which rewards them.

What should the system look like and why

The system we have in Norway today is far from perfect, but there is value in continuity, not least because it lets us compare the quantitative measurements the system provides over time. So if we are going to change it, that should only be done in ways that are reliable, transparent, understandable and evidence-based.

If I were going to change it, I would acknowledge bias, and have only one level. In addition to monographs, I would let textbooks and anthologies earn rewards. I would move in the direction our colleagues in Great Britain have taken, letting only openly available publications count. The result of these changes would be a simple, clear and comprehensible system.

These changes would give us an incentive system that is about leadership, not about micromanagement. And leadership is necessary in universities; it’s right to find ways to stimulate research activity, including publishing.

But building a system on inevitably subjective evaluations leads to a flawed set of incentives. Instead of doing this, we should hire carefully and trust those we hire to do what is right in the context of their fields. We should encourage our researchers to publish in the places they think are right, where their work gets improved, where the impact and influence and visibility they want can be had. University leaders, politicians, and committees should not be in the business of telling researchers what journals to publish in.

Because I take this view, it is my hope that the Ministry of Education and Research ignores the suggestions they recently received about the publication system. Those suggestions are not evidence-based, they introduce an unnecessary degree of micromanagement and they are inevitably biased.

In short, they turn the reward system for scientific publishing into a process that is no more reliable than spinning a wheel of fortune.

This text is a slightly modified version of a keynote I gave at the University of Oslo, on January 22, 2015, at their seminar on Penger og Poeng.

To read about yet another proposal by this same committee, not touched on in the text above, see A bibliometric theater of the absurd.
