November 2025

Can AI write accessibility specs?

TLDR: I built a little RAG-powered AI prompt that pulls together accessibility best practices for whatever component I’m advising teams on.

Then I had an existential crisis about using AI.

Generate accessibility guidance

Test it out and let me know if it’s helpful, or if you just… hate me.


Here’s my job:

Slack pings: “Hey, quick question about accessibility for this modal…”

Figma comment: “What’s the keyboard interaction for this accordion?”

Zoom call: “ARIA labels for this carousel?”

There’s usually a doc or a WCAG criterion that springs to mind, or, if it’s complex, I’ll disappear down a research rabbit hole.

Last month, my brain went wandering into a dangerous place: what if I just… let AI do this? 🙈

The process I go through to advise on a technical spec for a component build or implementation is pretty similar each time. I look at the relevant WCAG criteria, the ARIA Authoring Practices, MDN docs, platform guidance like Apple’s HIG and Material, and design system docs I trust.

From here, I might write a checklist or set of acceptance criteria, then jump on a call to share with a designer or engineer. I thought I’d have a pop at wanging all this into a prompt to see what happened.

1: Set up an API request

This was surprisingly straightforward, given I’m a designer and I’m not super technical. I wrote a JS function that calls the OpenAI API with my prompt. Netlify runs the function and stores the API key in environment variables.

To appease my conscience, I went with the GPT-4o-mini model - which supposedly means lower electricity consumption and a smaller carbon footprint. At least that’s what the guidance says. I mean, why would Sam Altman lie to us?
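For the curious, the function looks roughly like this. It’s a minimal sketch rather than my exact code (the file name, prompt wording and response handling are illustrative), assuming a recent Node runtime where fetch is available globally:

```js
// netlify/functions/generate-guidance.js (name is illustrative)
exports.handler = async (event) => {
  const { component } = JSON.parse(event.body || "{}");

  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // The key lives in Netlify's environment variables, never in the repo
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: "You are an accessibility specialist." },
        {
          role: "user",
          content: `Write accessibility guidance for a ${component} component.`,
        },
      ],
    }),
  });

  const data = await response.json();
  return {
    statusCode: 200,
    body: JSON.stringify({ guidance: data.choices[0].message.content }),
  };
};
```

Netlify exposes a function like this at /.netlify/functions/generate-guidance, so the front end just POSTs a component name to it.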

2: Write a prompt that doesn’t suck

My initial prompts produced unreliable results. Eventually, I got ChatGPT to help me improve my prompt, thus demonstrating how the web is SLOWLY EATING ITSELF.

Before I specified what a component was, the AI just made stupid shit up, hallucinating components based on their titles. Here are some of my favourite fabrications:

[Screenshot of my favourite hallucinated components: SickMonkey, Yaksmas, BossMan and NumberWang]

The danger here is that everything is formatted neatly so it looks authoritative even if it’s a load of old NumberWang.
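The fix was spelling out what the component actually is and telling the model not to guess. Something along these lines, where the wording is a simplified stand-in for my real prompt:

```js
// Simplified stand-in for my actual prompt: the definition is the key part
function buildMessages(componentName, definition) {
  return [
    {
      role: "system",
      content:
        "You are an accessibility specialist. Only describe the component " +
        "as defined by the user. If the definition is missing or unclear, " +
        "say so instead of guessing. Never invent a component from its name.",
    },
    {
      role: "user",
      content:
        `Component: ${componentName}\nDefinition: ${definition}\n\n` +
        "Write accessibility guidance for this component.",
    },
  ];
}
```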

I also discovered you can turn the model’s temperature down. Temperature controls how random the model’s word choices are: setting it to zero makes the output more precise and less likely to invent details. It doesn’t stop hallucinations entirely, but it makes the model far less prone to creative or speculative answers.
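In the API request it’s just one field on the body. OpenAI accepts values from 0 to 2 and defaults to 1:

```js
const body = JSON.stringify({
  model: "gpt-4o-mini",
  temperature: 0, // 0 = as deterministic as it gets; higher = more random
  messages, // built as in the sketch above
});
```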

3: Improve with RAG

After poking at the prompt for a bit, I was getting results that were mostly correct, but the advice was pretty generic: WCAG criteria that could apply to anything, and links to general guidelines instead of specific component docs.

Enter Retrieval Augmented Generation (RAG). Instead of letting a model make educated guesses based on the murky depths of the internets, you give it verified information to base a response on.

I altered my script like this (there’s a sketch of the full flow after the list):

  1. Retrieve: My Netlify function looks up the component in a directory of local JSON files containing links to WCAG, the ARIA Authoring Practices, MDN docs, Apple’s HIG, Material, and design system sites I know offer sound advice. Basically, the same sources I’d check myself. These performed best with no more than 4-5 links per component; any more and the function timed out or the results got muddled.

  2. Augment: The function builds a prompt that includes these verified sources, so the model knows exactly what it can reference.

  3. Generate: OpenAI API generates the guidance with a low temperature setting for factual accuracy.
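Stitched together, the whole function looks something like this. Again a sketch, not my real code: the file layout and JSON shape are made up for illustration, and it assumes the sources directory is deployed alongside the function:

```js
// netlify/functions/generate-guidance.js (simplified; paths and shapes are illustrative)
const fs = require("node:fs/promises");
const path = require("node:path");

exports.handler = async (event) => {
  const { component } = JSON.parse(event.body || "{}");

  // 1. Retrieve: look the component up in a directory of local JSON files,
  // e.g. sources/accordion.json -> { "definition": "...", "links": ["https://..."] }
  // (a real version should validate `component` before using it in a path)
  const file = path.join(__dirname, "sources", `${component}.json`);
  const { definition, links } = JSON.parse(await fs.readFile(file, "utf8"));

  // 2. Augment: fold the verified sources into the prompt so the model
  // knows exactly what it can reference
  const messages = [
    {
      role: "system",
      content:
        "You are an accessibility specialist. Base your guidance ONLY on " +
        "the sources provided, cite them in a final Sources section, and " +
        "say so rather than guessing if they don't cover something.",
    },
    {
      role: "user",
      content:
        `Component: ${component}\nDefinition: ${definition}\n` +
        `Sources:\n${links.join("\n")}\n\n` +
        "Write accessibility guidance for this component.",
    },
  ];

  // 3. Generate: low temperature for factual accuracy
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-4o-mini", temperature: 0, messages }),
  });

  const data = await response.json();
  return {
    statusCode: 200,
    body: JSON.stringify({ guidance: data.choices[0].message.content }),
  };
};
```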

Where I landed

This is useful for quick reference. But is the juice worth the squeeze?

Even with RAG and the OpenAI API’s temperature at its lowest setting, it still hallucinates. And the hallucinations come paired with the overconfidence of a crypto influencer explaining NFTs to your Nan.

Some of the code it prints in the semantic web section is startlingly wrong. But given the material LLMs are trained on is flawed, can we ever rely on them to be 100% correct? The WebAIM Million still shows that 94% of the internet’s top million homepages have WCAG failures, so I feel like we’re asking the impossible.

If I can’t guarantee it won’t hallucinate, then what’s the point? I might as well just have these sources bookmarked because I have to visit them to verify accuracy anyway. When someone’s ability to access critical information hinges on detail in the technical build, there’s no substitute for thoughtful human oversight.

Cue my existential AI crisis

Here’s the kicker: I actually enjoy researching and writing component accessibility guidance. That’s how I learn stuff.

Beyond accuracy, I’m grappling with how to use AI responsibly as an accessibility practitioner. The ethical and environmental costs are serious. Powerful AI systems rest on the invisible work of underpaid labour. Then there’s the underlying sense of dread that when the AI bubble bursts, lots of my friends will probably lose their jobs.

Mostly, it feels dirty feeding expert-authored content into OpenAI’s training machine, even if it’s publicly available. Like I’m doing a disservice to the incredible people who created it. I added a sources section to the end of the prompt for transparency on where advice came from. And I didn’t reference any personal blogs or sites because it felt wrong without permission.

While writing this, I watched Mike Monteiro’s talk How to draw an orange. It’s probably the best thing I’ve watched on the ethics of AI. Mike points out that Google’s giant ‘help me write’ button is not really helping you to write. It’s telling you that you’re too dumb to write yourself and that your writing isn’t good enough. AI can draw you a perfect orange, but you’ve learned nothing. That slaps hard.

I have huge respect for anyone who swerves the AI hype wagon entirely. My brain swings from quietly excited to guilt-spiralling. My inner Geri, the one who rescues spiders from baths and feels bad for unused emojis, is horrified that I’m publicly admitting to even this level of moral compromise.


If you have opinions, hit me on Bluesky