MATE framework tackles accessibility with AI

DATE POSTED: June 25, 2025

Researchers from Tsinghua University have developed a multi-agent framework called MATE, designed to improve accessibility through automated modality conversion. The system uses large language models (LLMs) and other AI tools to help users with disabilities convert data—such as text, images, or audio—into formats they can understand more easily.

Addressing gaps in accessibility with AI

Despite the growing use of AI in many fields, accessibility remains underserved. Technologies often fail to accommodate users with visual, auditory, or cognitive disabilities. MATE (Multi-Agent Translation Environment) addresses this by dynamically converting inputs into user-friendly formats. For example, it can transform an image into a spoken audio description for a visually impaired user.

Limitations of existing systems

Traditional accessibility tools tend to be closed-source, task-specific, and inflexible. Many address only a single modality, as screen readers do, and cannot adapt to a broader range of user needs. MATE offers a general-purpose, open-source alternative capable of handling multiple forms of modality conversion in real time.

The multi-agent architecture of MATE

MATE uses multiple specialized agents that collaborate to execute different tasks. A central interpreter agent receives user prompts, identifies the desired modality conversion, and assigns the job to one of several expert agents. These include TTS (text-to-speech), STT (speech-to-text), ITT (image-to-text), TTI (text-to-image), and more.
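
A minimal Python sketch of this delegation flow is shown below; the expert functions and the keyword-based task identification are hypothetical stand-ins, not MATE's actual interpreter logic:

```python
# Hypothetical sketch of MATE-style routing: an interpreter identifies the
# requested modality conversion and hands the job to the matching expert agent.
from typing import Callable, Dict

def tts_expert(payload: str) -> str:
    # Stand-in for a real text-to-speech model.
    return f"[audio rendered from text: {payload!r}]"

def itt_expert(payload: str) -> str:
    # Stand-in for a real image-captioning model.
    return f"[text description generated for image: {payload!r}]"

EXPERTS: Dict[str, Callable[[str], str]] = {
    "TTS": tts_expert,  # text-to-speech
    "ITT": itt_expert,  # image-to-text
}

def interpret(prompt: str) -> str:
    """Toy replacement for the interpreter agent's task identification."""
    lowered = prompt.lower()
    if "read" in lowered or "aloud" in lowered:
        return "TTS"
    if "describe" in lowered or "caption" in lowered:
        return "ITT"
    return "UNK"  # unrecognized request

def handle(prompt: str, payload: str) -> str:
    expert = EXPERTS.get(interpret(prompt))
    if expert is None:
        return "Sorry, the requested conversion could not be identified."
    return expert(payload)

print(handle("Please read this paragraph aloud", "MATE converts modalities."))
```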

Privacy by design

Because MATE is designed to run locally, it minimizes the risk of data exposure. This makes it suitable for sensitive applications, such as digital healthcare systems. MATE can be integrated into existing institutional platforms to deliver accessibility support in real time without transmitting private data to external servers.

The ModConTT dataset

To train and evaluate the system, the team created ModConTT, a custom dataset consisting of AI-generated and human-verified prompts representing various modality tasks. These include nine conversion types, such as text-to-speech, image-to-audio, and audio-to-video, plus a category for unrecognized prompts.

Two dataset versions were created: one for comparing LLM interpreters and another for training classifiers from scratch. The full dataset includes 600 labeled examples and is publicly available for use in other accessibility-focused AI projects.
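
The published schema is not reproduced here; the toy records below only illustrate how labeled (prompt, task) pairs of this kind could be split between the two uses described above, comparing interpreters and training a classifier:

```python
# Illustrative only: field names and records are assumptions, not the real
# ModConTT schema. The split mirrors the dataset's two described uses.
import random

examples = [
    {"prompt": "Turn this report into spoken audio", "label": "TTS"},
    {"prompt": "Describe what is in this photo",     "label": "ITT"},
    {"prompt": "Make an image from this caption",    "label": "TTI"},
    {"prompt": "What's the weather tomorrow?",       "label": "UNK"},
]

random.seed(0)
random.shuffle(examples)
split = int(0.8 * len(examples))
train, test = examples[:split], examples[split:]
print(f"{len(train)} examples for classifier training, {len(test)} held out")
```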

ModCon-Task-Identifier model

As part of the framework, the researchers developed ModCon-Task-Identifier, a fine-tuned BERT model that classifies prompts into specific modality conversion tasks. This model significantly outperforms other LLMs and machine learning baselines on the ModConTT dataset.
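
As a sketch of the general approach rather than the authors' code, a BERT sequence-classification head can map a prompt to one of the task labels; the base checkpoint and the label abbreviations below are assumptions:

```python
# Minimal sketch of a BERT-based task classifier in the spirit of
# ModCon-Task-Identifier. The label set and checkpoint are illustrative;
# in practice the classification head would be fine-tuned on ModConTT.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["TTS", "STT", "ITT", "TTI", "ITA", "ATI", "TTA", "ATT", "ATV", "UNK"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

prompt = "Please convert this scanned letter into an audio recording."
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("Predicted conversion task:", LABELS[int(logits.argmax(dim=-1))])
```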

Performance comparison

Using classification metrics such as accuracy and F1-score, ModCon-Task-Identifier achieved an accuracy of 91.7% and an F1-score of 91.6%. In comparison, GPT-3.5-Turbo reached 75% accuracy, while logistic regression with BERT embeddings achieved 78.3% accuracy. These results show that a specialized fine-tuned model can capture prompt semantics more effectively than general-purpose LLMs.
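
Both scores are standard multi-class metrics; a toy computation with scikit-learn (the labels below are invented, not results from the paper) shows how they are obtained:

```python
# Toy example of the reported metrics; y_true/y_pred are made-up labels.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["TTS", "ITT", "TTI", "UNK", "TTS", "ITT"]
y_pred = ["TTS", "ITT", "TTS", "UNK", "TTS", "TTI"]

print("accuracy:   ", accuracy_score(y_true, y_pred))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```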

Evaluation of interpreter agents

The team tested three interpreter agents powered by GLM-4-Flash, Llama-3.1-70B-Instruct, and GPT-3.5-Turbo. GPT-3.5-Turbo delivered the best overall performance with an accuracy of 86.5% and a failure rate of only 0.4%. Most errors occurred in ambiguous or undefined tasks, such as those labeled “UNK” (unknown).

Insights from error analysis

While most tasks were handled with high accuracy, the most common failures involved audio-to-text and video-to-text conversions. Tasks that required converting speech to image, or vice versa, were recognized more consistently. These findings will guide future refinements of the framework.

Agent collaboration and model orchestration

Each agent in MATE is optimized for a specific task. For instance, the ITA expert combines image captioning (using BLIP) and TTS (using Tacotron 2) to turn an image into an audio description. Another agent, ATI, processes audio through Whisper and then uses Stable Diffusion to generate a corresponding image.
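
A rough sketch of such an image-to-audio pipeline, built here from the public BLIP checkpoint and torchaudio's Tacotron 2 bundle rather than MATE's own glue code, might look like this (the input file name is hypothetical):

```python
# Sketch of an ITA-style expert: caption an image with BLIP, then speak the
# caption with a Tacotron 2 + WaveRNN pipeline. Checkpoints and file names
# are assumptions, not MATE's implementation.
import torch
import torchaudio
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# 1) Image -> text with BLIP.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")  # hypothetical local file
inputs = blip_processor(image, return_tensors="pt")
caption_ids = blip.generate(**inputs, max_new_tokens=30)
caption = blip_processor.decode(caption_ids[0], skip_special_tokens=True)
print("Caption:", caption)

# 2) Text -> speech with Tacotron 2 and a WaveRNN vocoder.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
text_processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2()
vocoder = bundle.get_vocoder()

with torch.inference_mode():
    tokens, lengths = text_processor(caption)
    spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)
    waveforms, _ = vocoder(spec, spec_lengths)

torchaudio.save("description.wav", waveforms[0:1], sample_rate=vocoder.sample_rate)
```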

The entire architecture is built on Microsoft’s AutoGen framework, which allows seamless integration between agents and easy scalability. The interpreter agent dynamically chooses which expert to delegate a task to, enabling the system to operate efficiently across diverse scenarios.
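
A hedged sketch of that orchestration pattern using AutoGen's group-chat API appears below; the agent names, system messages, and llm_config are assumptions, not MATE's configuration:

```python
# Assumed AutoGen (pyautogen) group-chat setup: an interpreter agent and one
# expert agent registered under a chat manager that routes the conversation.
import os
from autogen import AssistantAgent, GroupChat, GroupChatManager, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-3.5-turbo",
                               "api_key": os.environ["OPENAI_API_KEY"]}]}

interpreter = AssistantAgent(
    name="interpreter",
    system_message="Identify the requested modality conversion and delegate "
                   "the task to the matching expert agent.",
    llm_config=llm_config,
)
tts_expert = AssistantAgent(
    name="tts_expert",
    system_message="You handle text-to-speech requests and report the result.",
    llm_config=llm_config,
)
user = UserProxyAgent(name="user", human_input_mode="NEVER",
                      code_execution_config=False)

chat = GroupChat(agents=[user, interpreter, tts_expert], messages=[], max_round=6)
manager = GroupChatManager(groupchat=chat, llm_config=llm_config)

user.initiate_chat(manager, message="Please read this paragraph aloud for me.")
```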

Real-world applications

MATE can be integrated into digital hospital assistants, educational tools, and accessibility solutions in public services. For example, in a hospital setting, it could convert medical documents into audio for patients with visual impairments. In education, it could help students with disabilities access course materials in preferred formats.

Video modality not yet supported

While MATE handles audio, image, and text transformations, it currently lacks video generation capabilities due to hardware and computational limitations. However, the team plans to integrate models like CogVideoX and ModelScopeTTV in future updates to support video-related use cases.

So what?

MATE depends on third-party models for some of its conversion tasks. Because many of these models are not optimized specifically for accessibility, there may be limitations in the quality or relevance of certain outputs. Additionally, the absence of optimized TTV (text-to-video) support limits its use in some accessibility scenarios.

Future research will focus on integrating more efficient and specialized models for video conversion, reducing system overhead, and extending MATE’s capabilities to new domains such as transportation, retail, and entertainment. Use cases like following cooking instructions via visual demonstrations or watching animated travel guides are already being explored.

MATE offers a robust, open-source solution for real-time accessibility support through intelligent modality conversion. It combines the flexibility of multi-agent systems with the power of LLMs and machine learning models, creating a versatile tool for individuals with disabilities. Its lightweight, privacy-focused architecture makes it suitable for local deployment in sensitive environments such as healthcare and education.
