Should the Public Care About Software Testing?
A peer conference by The Association for Software Testing and The BCS Special Interest Group in Software Testing, November 22nd 2020.
The Association for Software Testing (AST) and the BCS Special Interest Group in Software Testing (SIGIST) are organisations whose remit includes the promotion and advancement of the software testing profession. Inherent in this is a shared interest in the way that software is used, understood, and trusted by the general public, and the impacts it has on lives and lifestyles.
Peer conferences are described by Adrian Segar as “a conference that is small, attendee-driven, inclusive, structured, safe, supportive, interactive, community-building, and provides opportunities for personal and group reflection and action.” The AST has a long history of peer conferences, and recently wrote a guide to running them, and the SIGIST team were keen to try the format out.
Given those things, when we decided to organise a joint peer conference, the topic Should the Public Care About Software Testing? was easy for us to agree on:
It’s our contention that most people don’t think very much about software development and software testing, despite software being deeply embedded in almost all aspects of our lives.
When there’s a publicised issue such as a national bank failing to process customer orders for several days, a government IT project overrunning for years and then being canned, or a self-driving car causing a fatal accident, then society might take notice for a while.
However, even when that happens it’s rare for the complexities and risks associated with the creation, integration, and maintenance of software systems to be front and centre in the discussion, and testing almost never makes it to the agenda at all.
In this workshop we aim to explore why that is, and what testers, testing organisations, software development companies, and governments could do to persuade the public that software testing is worth understanding and caring about.
This report contains core elements of the conversations that took place over seven hours on 22nd November 2020 with participants calling in from Canada, Germany, Israel, Portugal, the Netherlands, and the U.K. The nature of these kinds of events is that the conversation goes where the energy of the participants takes it and this report reflects that. Specifically, it is not a review of the field or a position paper from either SIGIST or AST.
Should the public care about software testing?
We could think of good reasons why the public should and should not care about software testing. On one hand, in order to have confidence in a piece of software, the public should care that testing was done and it was the right kind of testing. On the other, why can’t the public just delegate all that stuff to professionals and leave themselves free to simply use the software?
We thought that both sides of that answer could be addressed by establishing trust between software builders, industry, government, and the public. We grouped approaches to this into three broad categories:
- Push: develop guidance for producers of software.
- Publicise: provide communication around software products.
- Punish: penalise undesirable outcomes from software use.
We didn’t intend to make a specific set of proposals at this conference but think that any strong proposal should recognise at least three things:
- Software is made for humans.
- Software production has mixed incentives.
- Software testing is important.
The rest of this document expands on these points based on the presentations that were made and the conversations we had around them.
What the Public Wants
We settled on a split perspective of what the public wants from software testing.
At one level the public should care about software testing because of the increasing dependency of our lives and lifestyles on software. Who would want to use software that handles financial transactions if no-one has checked that it is secure and doesn’t lose money?
But, at another level, the public should be free to use the software with confidence that it “just works” without significant undesired side-effects. Few people care about the details of their G.P.’s background yet willingly take potentially dangerous drugs on their advice.
We paused and wondered whether, as a group of testers who care deeply about their craft, it’s difficult for us to accept that our role is not relevant to the consumers of the products we make.
After some reflection, we decided we’re comfortable that software testing as an activity in its own right is just one way to help to create working products that do no harm. That’s not to say that critical thinking and other testing skills have no value, but that the public doesn’t need to know who provided them in the development of a piece of software.
We noted risks connected to the production and use of software in several categories that we thought the public should care about but perhaps have little or no awareness of:
- Architecture: the way the software is structured and how it is deployed.
- Society: the human factors around software.
- Development: the processes used to design and create the software.
This section addresses each of those in turn.
The way in which a software product is structured and deployed affects the points at which it could fail. Unfortunately, many of these are not directly in the control of the organisations that create those products. The material consequences of a failure can vary enormously: from nothing, where loading your Facebook page takes an extra second, to existential, where someone dies.
In modern software development it is common to use packages created and maintained by third parties, either commercial or open source, and so largely outside the control of the organisation. Couple this with continuous build, delivery, and deployment systems and package-management software that is configured to use the latest version of any package and you have a recipe for mismatched assumptions and unintended behaviour.
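To make the mismatched-assumptions risk concrete, here is a deliberately tiny Python sketch. Everything in it is hypothetical: the two `distance` functions stand in for successive releases of an imaginary third-party package, and the silent change of unit stands in for the kind of behavioural shift an unpinned "latest version" policy can pull in without anyone noticing.

```python
# Hypothetical sketch: a caller written against v1 of an imaginary geo
# library assumes distances are returned in metres. A later release
# silently switches to kilometres; the caller's threshold is now wrong.

def distance_v1(a, b):
    """v1.0 of the hypothetical library: returns metres."""
    return abs(b - a)

def distance_v2(a, b):
    """v2.0 of the same library: silently changes the unit to kilometres."""
    return abs(b - a) / 1000

def close_enough(distance_fn, a, b):
    # Written against v1: "less than 500" was meant as "within 500 metres".
    return distance_fn(a, b) < 500

print(close_enough(distance_v1, 0, 800))  # False: 800 metres is too far
print(close_enough(distance_v2, 0, 800))  # True: 0.8 (km) now looks "close"
```

Pinning dependency versions, and re-testing deliberately when they are bumped, is one common mitigation; the point of the sketch is only that the caller's code did not change at all, yet its behaviour did.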
Further, much of today’s software is reliant on the internet at runtime. While some members of the public will be aware of terms like Internet of Things and have heard about security issues at major corporations, many will not be aware of the complexity of dependencies behind the apps they run on their phone, the streaming service on their TV, or the pacemaker inside their father’s chest.
It’s not just the dependencies, their asynchronous changes, and the network complexity though. There are so many devices, so many variant devices, and so many deployments to them, that it’s impossible to have a good mental model of the software space at anything other than the most general level. Without a good model it’s hard to conduct good risk analysis.
Although we expect computers to be deterministic, the software ecosystem we have accrued is chaotic and can easily exhibit emergent behaviours. These may be difficult to predict and difficult to recreate once they have occurred because of the number of variables that affect when they are seen.
Increasingly in contemporary software development, artificial intelligence (AI) approaches are replacing hand-crafted algorithms with behaviour that is learned from data. This can be a great force for good in applications such as identifying lung disease in chest x-rays, but also comes with great risks, such as unintentionally racist outcomes. Weapons of Math Destruction by Cathy O’Neil contains many such examples.
There is an old adage in computer science, garbage in, garbage out, which refers to the fact that if a computer program is given input that doesn’t conform to its expectations, it is likely to produce output that doesn’t conform to its user’s expectations.
In today’s world huge amounts of data are being pushed into machine learning (ML) systems, and these systems are likely to be set up to assume that the analysis they are performing is valid on that data. If the input is garbage, the output is likely to be garbage too although, crucially, it may be very difficult to tell if that’s the case at either end. On top of that, even if the input data is good, it’s a formidable challenge for a human to understand the ins-and-outs of the analysis.
The risk of this ending badly might be reduced when such systems are being created by our best and brightest technologists and scientists, people who we might expect to understand something of the powerful statistics behind ML and the assumptions that must hold true for them to be valid. Unfortunately, there are now commodity tools for AI software which mean that anyone can shove any old data in and have some kind of result come out, and then use it in whatever way they see fit.
Even if we assume that the software for ML is bug-free (which we would consider to be a very bold assumption), and used by people who understand it, with good intentions, there are still major risks with the data. Take the example of Amazon’s internal recruitment tool which effectively filtered women out of the hiring process because the material used to train the ML had been largely made up of male CVs.
A case of bias in, bias out, perhaps. Although, interestingly, it seems that the outcome bias might not be as directly related to input bias as naive expectations would suggest. This speaks to a tortuous problem: biases are notoriously hard to remove. Humans generally operate sensibly in the face of unknowns and ambiguity, are aware that biases exist in systems, and are meta-aware that their thinking is subject to biases. Computer systems, not so much.
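As a toy illustration of bias in, bias out, the Python sketch below "trains" a crude token-weighting screener on invented historical hiring data (all tokens and records are hypothetical, not from any real system). Because the historical hires skew towards one pattern, a candidate whose CV matches the majority scores higher than an otherwise identical candidate who doesn't.

```python
# Hypothetical sketch: a naive CV screener learns per-token weights from
# historical outcomes. Tokens common among past hires score positively,
# tokens common among past rejections score negatively.
from collections import Counter

def train(hired, rejected):
    """Learn a weight per token: (count among hires) - (count among rejections)."""
    hired_counts = Counter(t for cv in hired for t in cv)
    rejected_counts = Counter(t for cv in rejected for t in cv)
    tokens = set(hired_counts) | set(rejected_counts)
    return {t: hired_counts[t] - rejected_counts[t] for t in tokens}

def score(weights, cv):
    """Sum the learned weights of a candidate's tokens."""
    return sum(weights.get(t, 0) for t in cv)

# Invented historical data, skewed towards one group's typical tokens.
hired = [["java", "football"], ["java", "golf"], ["python", "golf"]]
rejected = [["java", "netball"], ["python", "netball"]]

weights = train(hired, rejected)

# Two candidates with identical skills differ only in a hobby token:
# the one matching the historical majority is ranked higher.
print(score(weights, ["java", "golf"]))     # higher score
print(score(weights, ["java", "netball"]))  # lower score
```

Nothing in the code mentions gender, yet the learned weights encode the skew in the training data, which is essentially the failure mode reported for the Amazon tool.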
However powerful the algorithms we apply to our data, there is always a risk that an unknown unknown lurks somewhere within. If this only becomes apparent to the pilot of a plane that is making radical manoeuvres to avoid a non-existent emergency, and who is prevented from intervening by deliberate safety mechanisms, then it’s too late.
Software development is subject to the same kinds of constraints as all business activities, dominant amongst which is the need to keep an eye on the bottom line. As in any industry, this necessarily leads to compromises, and with compromises come risks.
Knowing where and when to apply testing, and determining what kind of testing to do at that time, in that place, are non-trivial decisions. Placing them under time or cost pressure does not ease the difficulty.
When a ticketing website crashes under extreme demand or a hacker speaks to a young girl through a security camera in her bedroom it is natural to wonder about the degree to which business constraints affected the development and testing of the software involved.
For example, trying to save money by outsourcing testing is attractive to the balance sheet but risks denying the testers context, and hides the true testing activities from the software producer. Using generic testing services rather than dedicated specialists can miss much that is relevant. Not stopping to really think about what could possibly go wrong can certainly get a product to market faster, but is that a good way to provide a safe service?
We felt that it is important to have someone asking critical questions during software production if the outcomes generated by the software matter. We discussed whether those people need to be (designated as) testers and decided that they do not, but did note the psychological challenges of being both the creator and critic of the same piece of work. If we ask developers to also be responsible for testing, which is an increasingly common trend, then this increases the risk of missing problems with the software. The flip side to this is a sense, typically on the business side, that a person tasked to ask critical questions can be a drag on the team.
As committed and ethical testers it is our aim to help our employer to generate business value, and also to do the right thing for the societies we are part of. These two commitments may sometimes be in tension.
In the preceding sections we’ve outlined a selection of risks associated with contemporary software development and discussed the idea that the public should only need to care about software testing to the extent that they can trust that experts have exercised good judgement about where, what, how, why, and when to test.
In this section we’ll talk about approaches for establishing that trust. We came up with three broad classes of approach, which we’re calling push, publicise, and punish.
Pushes are applied up front to influence behaviour during the development of a product. We identified three groups who can provide them: practitioner, industry, and consumer.
At the practitioner level, we talked about a code of ethics or conduct (AST and SIGIST members sign up to such codes), or by analogy with the medical profession a “Hippocratic oath”. The proposal is that individuals would provide bottom-up checks and balances to temper company self-interest. A chartered status scheme for software professionals, such as those offered by BCS, could also provide a baseline standard of behaviour, technical and ethical. We wondered who gets to decide what level the bar is set at, and what that bar should look like.
Historically, dissatisfied consumers vote with their feet. This can still be effective against smaller companies but the behemoths of software are teaching the public to like, or at least lump, what they’re given. Consumer campaigns might be able to change this dynamic though, and we discussed the idea of clean software analogous to clean air, something that consumers could demand as a right. Trusted advisors such as Which? who are independent, transparent, and well-respected can also play an influencing role.
On the industry side, finding ways to incentivise ethical behaviour over pure profit-seeking is desirable, but challenging. Also challenging, given the diversity of applications for computer software, are the ideas of standards, regulation, and certification for its development. We’ll discuss this in some depth later on but it’s worth mentioning the concept of industry standard rigs which, for some specific task, define a set of benchmarks that should be met to enter the market, such as the Java Device Test Suite.
We also wondered whether, at a company level, the widespread introduction of processes designed to inhibit mistakes, ideally with the publication of changes introduced as a result, could help build public confidence that industry is trying to improve.
Publicisation encompasses any approach which puts information into the public domain with the intention of helping consumers to formulate and ask the right kinds of questions about the software products they use.
In certain domains it is common for there to be material packaged with a product that allows the (potential) user to evaluate a variety of quality metrics and risks in a context-relevant way. Foods typically have a range of labelling requirements related to nutrition, production methods, or packaging, white goods (large electrical goods used domestically such as refrigerators and washing machines, typically white in colour) will often come with energy ratings prominently displayed, and medicines are required to list potential side-effects. We wondered how this might be applied to software.
The way that technical subjects are portrayed in the media has the power to influence public opinion. This applies to both media outlets (including traditional news services, websites, advertising and so on) and those whose voices they feature. It’s our belief that the public are willing and able to understand complex issues, if presented appropriately. For example R, the rate of infection of a disease, has been heavily reported during the SARS-CoV-2 pandemic. We’d like to find ways to help public discourse on software, and its risks, mature.
It’s common for specialists in a field to talk about the need to “educate” the public about their craft. We discussed this and generally felt that the conversation should be less about testing practices per se and more about the outcomes that are significant to the public interest at a time that matters to them. We said that it must be relevant, accessible, and importantly, wanted. Professor David Spiegelhalter was cited as a person who manages to pull off this difficult balancing act with aplomb in the world of statistics.
While push and publicise attempt to encourage desirable behaviour, punish is for penalising undesirable behaviour and for introducing additional practices to attempt to prevent it in future.
The aviation industry covers both of these bases: a team of air accident investigators will be called in to review significant incidents and issue recommendations that can take effect at an international level. This forms part of a global infrastructure that permits cross-border governance and interoperability of aircraft, airspace, and airports. National bodies such as the Federal Aviation Administration can also impose financial penalties on individuals and businesses operating in undesirable ways.
It’s worth noting that, perhaps because of this, aviation generally has a good reputation with respect to taking public safety seriously. For example, Atul Gawande’s The Checklist Manifesto talks at length about the way that the industry as a whole attempts to improve its behaviour through techniques such as providing simple checklists for important procedures.
Across industries there are bodies such as ombudsmen, consumer courts, regulators, and self-regulators which provide a means of reconciliation and redress for customers who are unhappy with the service they have received from a company, or who monitor the industry on a regular basis. These bodies may be appointed by government, regulators, or industry bodies and have the power to impose rules, require changes in behaviour, and levy fines.
Finally, we noted that sharing information about problems in an independent public place can help to make an issue more visible and put pressure on companies to change it in spite of activities they might undertake to hide it. The Ford Pinto’s fuel tank made it dangerously prone to fire for many years, but Ford is said to have calculated that fixing the flaw was worse for business (the bottom line again) than covering it up.
We talked around the viability of these kinds of after-the-fact bodies for the software industry. We are not naive about how easy it would be, asking ourselves questions like:
- Is software too broad a church for a single organisation to cover?
- What would qualify a person to be eligible to serve in an organisation like these?
- Who would fund it?
- What level of international cooperation would be required?
- How often could a single cause or responsible party be found in a complex ecosystem like software which includes participants such as the government, the producer, the purchaser, the user, the regulator, the tester, the developer, the whole product team, the algorithm, the training data, and so on?
- What kind of punishment would make sense when the industry ranges from bedroom operators to huge multinationals?
- How to regulate government IT projects?
- How to avoid hindsight bias, making it “obvious” that someone should have done something at some specific time and creating scapegoats?
- How could this be a transparent and productive process rather than a beanfeast for lawyers?
Standards, such as ISO 29119 and its precursor IEEE 829 for software testing, can apply across all of push, publicise, and punish. They can be used as guidelines for the development of software, they are explicit and in the public domain, and they can be invoked by regulatory bodies as rules to be followed.
We broadly agreed that, if a testing standard was expected to be a proxy for a product quality standard, then it was essentially trying to “drive software development from the back of the bus.”
For us, testing and quality can be separated: the most excellent testing could be done and a product still be terrible, for example because the business shipped it without fixing issues. Likewise, it would be possible for no explicit activities called testing to be done and a product still be immensely valuable, say in the case where the developers were extremely conscientious about checking the implementation with prospective users as they built it.
Assuming we were trying to make a standard for testing alone, we thought that the following questions should be considered:
- Process vs results: the way the software was tested or the outcome of the testing?
- Quality vs existence: a measure of how good the testing was or simply reassurance that some testing was done?
- Descriptive vs prescriptive: derived from existing practice or imposing practice?
- Objective vs subjective: is judgement encouraged or is the standard black and white?
- Principles vs rules: an envelope defining a space of acceptable behaviour or a straitjacket restricting all but required behaviour?
- General vs specific: applies independent of application context or only in certain contexts?
- High vs low risk: applied independent of hazard context or only in certain contexts?
- Self-regulation vs legal enforcement: do we trust industry to do the right thing or do we require the courts to keep things in line?
- Minimum bar vs all the things: is this the “table stakes” before actual testing starts or the least testing a producer can get away with?
We talked around the idea that, if testing standards are considered to be strict rules to follow, then some players will probably be prepared to invest time and money looking for loopholes. Such technicalities could allow them to follow the letter but not the spirit of the standard, in a drive to meet business priorities. Further, any strictly numeric requirements in a standard are metrics waiting to be gamed by the unscrupulous.
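As one illustrative way a numeric requirement can be gamed, consider a hypothetical rule demanding "100% line coverage". The Python sketch below (function and names invented) executes every line of the code under test without asserting anything: the metric is satisfied while nothing is actually verified, and a real bug sails through.

```python
# Hypothetical sketch of gaming a coverage metric: the "test" below runs
# every line of apply_discount, so a line-coverage tool reports 100%,
# yet it checks no results and misses an obvious bug.

def apply_discount(price, percent):
    # Bug: no validation, so a percent over 100 yields a negative price.
    return price * (1 - percent / 100)

def test_apply_discount_full_coverage():
    # Touches every line; the letter of the rule is met, the spirit is not.
    apply_discount(100, 10)
    apply_discount(100, 150)  # silently returns -50.0, nobody checks

test_apply_discount_full_coverage()
print("coverage achieved, nothing verified")
```

This is the letter-versus-spirit problem in miniature: any standard that reduces "tested" to a number invites exactly this kind of compliance.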
Technology and industry practice changes so rapidly that it could be a tremendous challenge for a standard to remain relevant, particularly one that is tied to specific methodology, architecture, artefacts and so on. Deepfakes are an example of an emergent application of technology that both government and industry are running to catch up with.
Software is such a broad church, applied in so many different ways in so many different domains, that we weren’t sure that one standard could apply universally. Could it make sense for a nuclear reactor’s safety monitoring infrastructure and the firmware in a Tamagotchi to be tested according to the same guidelines? What would those guidelines need to look like to make it a reasonable goal? We might get some ideas from the current attempt to regulate AI in the European Union.
We pursued this line further, wondering whether it would be possible to “slice” that enormous software space into pieces where some consistent standard could be applied. We thought of factors that could make this tractable:
- Be explicit about the context-specificity.
- Define the context carefully.
- Cover things that are valuable to people relevant in the context.
- Prefer thinner slices over thicker.
- Leave room for judgement.
We noted that attempts at slices such as these do already exist, for example in the UK gaming machine industry.
The conference didn’t set out to make a specific set of proposals but we agreed that any valuable proposal about the relationship between software testing and the public should recognise at least the following:
- Software is made for humans, and humans should be at the centre of its development and use. This includes understanding human biases and accounting for them.
- Software production has mixed incentives, notably tension between business needs and societal needs. These have to be balanced carefully and include incentives to test appropriately according to the context.
- Testing software appropriately is important, but we testers should not fixate on testing for its own sake, nor on the craft of testing, above the bigger picture concerns of making the software work, and work safely, for its users.
The Association for Software Testing
The Association for Software Testing is a professional organization for software testers. We are dedicated to advancing the understanding of the science and practice of software testing according to Context-Driven principles.
We strive to build a testing community that views the role of testing as skilled, relevant, and essential to the production of faster, better, and less expensive software products. We value a scientific approach to developing and evaluating techniques, processes, and tools. We believe that a self-aware, self-critical attitude is essential to understanding and assessing the impact of new ideas on the practice of testing.
The BCS Special Interest Group in Software Testing
The British Computer Society has 68,000 members in 150 countries, and a wider community of business leaders, educators, practitioners and policy-makers all committed to our mission. As a charity with a royal charter, our agenda is to lead the IT industry through its ethical challenges, to support the people who work in the industry, and to make IT good for society. The BCS Special Interest Group in Software Testing is the largest not-for-profit group of testing specialists in the UK.
Contact: email@example.com (the Chair)
The conference ran online from 15:00 BST to 22:00 BST on 22nd November 2020. The material created at the conference is jointly owned by the participants: Lalitkumar Bhamare, Fiona Charles, Janet Gregory, Paul Holland, Nicola Martin, Eric Proegler, Huib Schoots, Adam Leon Smith, James Thomas, and Amit Wertheimer.
This report was prepared by James Thomas using notes taken by him and Adam Leon Smith during the conference. Thank you to Adam, Amit Wertheimer, and Ros Shufflebotham for the helpful reviews.