A mechanism to standardize testing and measure the reliability of AI models.
AI helps businesses achieve greater efficiency and overall competency in their applications. With AI adoption, however, come challenges around establishing trust. There is a need for a research-backed AI testing framework that uses methods and processes to improve the security, privacy, explainability, bias, and calibration of AI models, that is, a mechanism that evaluates the non-functional aspects of testing. Such a framework must combine state-of-the-art and existing methods and tools to measure the reliability of AI models.
Currently, the framework’s scope is limited to testing image recognition and object detection models in computer vision, and text classification and sequence-to-sequence models in natural language processing. The aim is to extend it to tabular data systems and help standardize the testing and governance of AI models at all levels.
Software testing methodologies and standards used for AI systems remain inadequate.
State-of-the-art (SOTA) software testing methodologies are still evolving and remain restricted to certain data types. The prominent methodologies include waterfall, verification and validation, incremental, spiral, extreme programming, and agile.
While methodologies are approached from a product development or management perspective, software testing is classified into two types based on functionality:
Functional testing: It tests the behavior and usability of the software, verifying the input and output data. There are many established standards for functional testing; however, they do not cover all the requirements for testing AI models.
Non-functional testing: This testing assesses the application’s compliance with non-functional requirements; namely, security, performance, reliability, scalability, etc. It covers areas that functional testing does not address.
At present, there are no pre-defined tools or procedures to test AI models for non-functional aspects such as trust, robustness, privacy, explainability, and resilience. Moreover, the methods and value ranges available for trust-related aspects do not always fit the parameters of AI testing.
As a result, there is a widespread need to quantify trust aspects, define their ranges of validity, and assess models for compliance. A comprehensive mechanism that tests AI models against these aspects is imperative.
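To make this concrete, the sketch below shows how trust-related aspects could be quantified and checked against ranges of validity. The metric names and thresholds are hypothetical illustrations, not values prescribed by the framework.

```python
# Illustrative sketch only: hypothetical metric names and thresholds showing how
# non-functional requirements could be quantified and assessed for compliance.

ACCEPTABLE_RANGES = {
    "expected_calibration_error": (0.0, 0.05),   # lower is better
    "adversarial_accuracy_drop":  (0.0, 0.10),   # accuracy lost under attack
    "demographic_parity_gap":     (0.0, 0.08),   # fairness gap between groups
}

def check_compliance(measured: dict) -> dict:
    """Flag each measured metric as PASS or FAIL against its validity range."""
    report = {}
    for metric, value in measured.items():
        low, high = ACCEPTABLE_RANGES[metric]
        report[metric] = "PASS" if low <= value <= high else "FAIL"
    return report

print(check_compliance({
    "expected_calibration_error": 0.031,
    "adversarial_accuracy_drop": 0.14,
    "demographic_parity_gap": 0.05,
}))
```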
Established standards have also evolved for non-functional software testing. The latest, ISO 25010, covers eight quality characteristics: functional suitability, reliability, performance efficiency, usability, security, compatibility, maintainability, and portability. Despite these standards, non-functional testing requirements still fall short of expected norms, as the quality model does not cover some of the unique characteristics that AI systems possess.
Use of AI models has raised ethical and legal concerns.
Most AI models are black boxes whose inner workings are impenetrable. They pose inherent testing challenges, such as data inconsistencies, vulnerability to manipulation, insufficient training data, inexplicable predictions and outcomes, and imprecise calibration. The use of black-box models has raised ethical and legal concerns, including inadequate transparency, privacy risks, and biases in the data used for training.
A mechanism that aims to standardize the testing and governance of AI models.
The AI testing framework is built around a set of processes, exposed through APIs, that test the non-functional aspects of AI systems.
The framework evaluates models for their non-functional requirements based on the following aspects:
Compliance: This covers industrial and regulatory governance aspects. It includes the policies, processes, procedures, checklists, tools, and guidelines used to check the inner workings of the framework.
Ethical AI: A reliable model should be ethical. Privacy, societal and environmental concerns, and AI governance are key aspects that are assessed.
An AI model needs to respect user privacy. It should not collect data or use sensitive personal information for training without the user’s consent. Systems are also prone to model inversion attacks; private inference is one method that can prevent sensitive data from being recovered.
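As an illustration of the privacy point, one widely discussed mitigation against model inversion is to limit how much information a model exposes at inference time, since such attacks typically exploit full, high-precision confidence vectors. The sketch below shows this output-coarsening idea with hypothetical names; it is not the framework’s prescribed defense, and stronger options such as private inference exist.

```python
import numpy as np

def coarsen_prediction(probabilities: np.ndarray, top_k: int = 1, decimals: int = 1):
    """Return only the top-k classes with rounded confidences.

    Model inversion attacks often rely on full, high-precision confidence
    vectors; exposing less information at the API boundary reduces that risk.
    (Illustrative mitigation only, not the framework's prescribed method.)
    """
    order = np.argsort(probabilities)[::-1][:top_k]
    return [(int(cls), round(float(probabilities[cls]), decimals)) for cls in order]

# Example: a 4-class softmax output
probs = np.array([0.02, 0.91, 0.05, 0.02])
print(coarsen_prediction(probs))  # [(1, 0.9)]
```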
AI systems can also help address critical environmental issues. From this perspective, a model’s entire life cycle should be evaluated for its environmental impact. To build a sustainable model, training should consume as little energy as possible, and societal and social concerns should be considered while developing new methods for equitable and fair practices.
Moreover, data collection and processing should be fair and free of bias, as they influence outcomes. While there is no straightforward solution, there are ways to measure, detect, and mitigate bias. Calibration is equally critical, especially in clinical and financial use cases, as it measures the gap between a model’s predicted probabilities and its actual accuracy.
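For instance, calibration can be quantified with the expected calibration error (ECE), which compares predicted confidence with observed accuracy across confidence bins. A minimal sketch, assuming NumPy arrays of top-class confidences, predicted labels, and true labels:

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE: average |accuracy - confidence| per bin, weighted by the
    fraction of samples falling in each confidence bin."""
    confidences = np.asarray(confidences)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(bins[:-1], bins[1:]):
        mask = (confidences > low) & (confidences <= high)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: three predictions with their top-class confidences
print(expected_calibration_error(
    confidences=[0.95, 0.70, 0.60],
    predictions=[1, 0, 1],
    labels=[1, 1, 1],
))
```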
Trustworthy AI: The trustworthiness of AI models rests on the credibility and accuracy of their predictions, especially in industries such as healthcare, banking, and insurance, where decisions based on analytical projections can have significant implications.
Let’s take an example to understand the process involved in structuring the framework. Consider an application that takes image data, processes it, and outputs a classification of the image. It is essential to verify that the model is well calibrated, preserves privacy, and is resilient. Metrics that quantify each of these requirements must first be identified, and non-functional testing procedures applied to improve the model. An API should then analyze the model and generate a report on its non-functional aspects.
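A hypothetical sketch of that final, APIfied step is shown below. The endpoint, payload fields, and report structure are illustrative assumptions, not the framework’s actual interface.

```python
# Hypothetical sketch of the "APIfied" evaluation step described above; the
# endpoint, payload fields, and metric names are illustrative assumptions.
import json
import requests

payload = {
    "model_uri": "s3://models/image-classifier-v3",        # hypothetical location
    "test_data_uri": "s3://datasets/validation-images",    # hypothetical location
    "aspects": ["calibration", "privacy", "resilience"],   # non-functional aspects to assess
}

response = requests.post("https://ai-testing.example.com/v1/reports", json=payload)
report = response.json()

# The returned report could summarize each aspect with a metric and a verdict, e.g.:
# {"calibration": {"ece": 0.03, "verdict": "PASS"},
#  "privacy":     {"inversion_risk": "low", "verdict": "PASS"},
#  "resilience":  {"adversarial_accuracy_drop": 0.12, "verdict": "FAIL"}}
print(json.dumps(report, indent=2))
```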
Trust encompasses factors such as human-centricity and explainability. Human-centric values, principles, and perspectives are central to building AI systems, and safeguarding these is key to determining the trustworthiness of a model.
With advances in deep learning and artificial neural networks, data privacy and fundamental rights have become critical considerations when collecting data, training models, and interpreting sensitive information.
Explainability is the degree to which humans can understand the predictions and decisions made by AI models. The model should provide valid reasons for its automated decisions, as newer regulations expect service providers to ensure this. Explainability helps uncover biases and promote fairness. It also brings more transparency, helps identify weak spots, enables debugging, and increases consumer trust in the system.
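One simple, widely used explainability check is permutation feature importance, which measures how much a model’s score degrades when each input feature is shuffled. The sketch below uses scikit-learn on a public dataset purely as an illustration; SHAP or LIME are comparable options, and none of these is claimed to be the framework’s chosen explainer.

```python
# Illustrative explainability check: permutation feature importance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature and measure how much test accuracy degrades; larger
# drops indicate features the model relies on more heavily.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(X.columns, result.importances_mean),
                          key=lambda pair: pair[1], reverse=True)[:5]:
    print(f"{name}: {score:.3f}")
```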
Robust AI: This forms the framework’s core and is critical to every organization. Models must be robust across factors such as resilience, calibration, bias, performance, and fairness. While adversarial training can help withstand adversarial attacks, it can also degrade model performance, and adversarial defenses do not stop attackers from trying new methods.
Raw data is susceptible to attack from insiders and outsiders. An attacker who gains access to an AI model may be able to reconstruct its training data; these are known as model inversion attacks. A robust model should resist both intentional and unintended manipulation, maintaining a high level of security and accuracy even in the face of threats or changes in input. Robustness should also be evaluated with several metrics rather than a single one.
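As an illustration of robustness evaluation, the sketch below measures accuracy under a simple Fast Gradient Sign Method (FGSM) perturbation in PyTorch. It is one example attack, not the framework’s full robustness suite, and the model and image tensors are assumed to be supplied by the caller.

```python
# Sketch of one resilience check: accuracy under a simple FGSM perturbation.
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, images, labels, epsilon=0.03):
    """Accuracy on inputs perturbed by the Fast Gradient Sign Method."""
    images = images.clone().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Step each pixel in the direction that increases the loss most.
    perturbed = (images + epsilon * images.grad.sign()).clamp(0, 1).detach()
    with torch.no_grad():
        preds = model(perturbed).argmax(dim=1)
    return (preds == labels).float().mean().item()

# Usage with any classifier and a batch of image tensors scaled to [0, 1]:
# clean_acc = (model(images).argmax(1) == labels).float().mean().item()
# robust_acc = fgsm_accuracy(model, images, labels, epsilon=0.03)
# print(f"accuracy drop under FGSM: {clean_acc - robust_acc:.3f}")
```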
A governing body will help institute common standards and processes.
AI models need a governance panel comprising AI developers, government bodies, industry and consumer group representatives, and legal experts. The panel can be tasked with reviewing datasets to ensure they deliver the requisite model outcomes.
Developing clear standards for the application of AI is vital for an inside-out view of how models are designed and deployed in different contexts. The panel will also help uphold the ethical principles governing AI systems. Overall, the idea is to reduce inequities and provide a lever to drive sustainable, ethical, and robust AI practices across industries.
Additional references:
Johner Institute, Do You Need to Be Aware of ISO/IEC TR 29119-11 on Testing AI/ML Software?, published May 6, 2021. https://www.johner-institute.com/articles/software-iec-62304/and-more/isoiec-tr-29119-11-on-testing-aiml-software/
SoftwareTestingHelp, Software Testing Methodologies For Robust Software Delivery, published December 5, 2022. https://www.softwaretestinghelp.com/types-of-software-testing/
European Commission, Ethics Guidelines For Trustworthy AI: High-Level Expert Group on Artificial Intelligence, published April 8, 2019. https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=60419