Bad Character
Guardrails
Recently, Character.ai faced a lawsuit for failing to provide adequate protections for users.
‘I want them to understand that this is a platform that the designers chose to put out without proper guardrails, safety measures or testing, and it is a product that is designed to keep our kids addicted and to manipulate them,’ Garcia said in an interview with CNN.
Garcia alleges that Character.ai – which markets its technology as ‘AI that feels alive’ – knowingly failed to implement proper safety measures to prevent her son from developing an inappropriate relationship with a chatbot that caused him to withdraw from his family. The lawsuit also claims that the platform did not adequately respond when Setzer began expressing thoughts of self-harm to the bot, according to the complaint filed in federal court in Florida.
The frustrating thing about this is that we have been walking this path for a long time. I am part of the Smartphone Free Childhood movement in my local area, and the primary danger we have focused on is the harm caused by algorithmic feeds on social media, which indirectly lead children down darker paths and can end in humiliation when content about them is shared and spread. Character.ai seems to have taken a similar approach to user engagement using an AI agent – but the danger here is how direct it is. LLMs are trained to please, so if someone asks about suicide, expresses interest in it, and says they want help to do it, an LLM will output whatever is most likely and relevant to that request – unless you build guardrails.
The flipside of being part of the Smartphone Free Childhood movement is that I am also active in building software with AI and developing agents to benefit professionals. What really frustrates me is that it is actually fairly straightforward to take measures to make products like Character.ai safer and to establish solid guardrails. Nothing involving AI will ever be 100% safe – but very little in life is. What people want is to know that companies offering AI agent products are developing them with genuine consideration for safety.
Character.ai has since said they will not allow under-18s on their platform, but I don’t see how they can enforce this. We’ve already seen, with the UK’s age verification laws coming into effect, that children will just find a way around them. Children have always been better than adults at adopting new technology and using it.
Let’s assume it isn’t possible to stop children accessing the platform and that it makes more sense to give them safe access with parental and platform oversight. What does this look like?
Parental Oversight
The concept of parental control over child accounts already exists in many areas. The Apple ecosystem is well known for offering this, and I use it for my family. My children need to use computers for schoolwork, and they have their own Apple accounts that are part of our family group, where I have oversight of what they see and control over which apps and websites they can access. For Character.ai, it could work much the same way.
The parent’s account could show different levels of detail about the child’s activity. This might be a full audit log of every chat thread, with all messages – but would that encourage the child to use the platform elsewhere, where they aren’t observed? Another option could be to classify common themes of conversation and notify parents about worrying or potentially harmful topics. This is not difficult to build: the work is linguistic, so LLMs are well suited to it, and the classification needs neither heavy prompt engineering nor an expensive model.
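As a rough illustration, here is a minimal sketch of that classification step, assuming the OpenAI Python client and a small, cheap model. The theme list, model choice and notify_parent helper are my own placeholders, not anything Character.ai actually exposes.

```python
# Minimal sketch: classify a child's chat thread by theme and alert a parent.
# Theme list, model choice and notify_parent() are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

CONCERNING_THEMES = ["self-harm", "suicide", "eating disorders",
                     "romantic or sexual content", "substance abuse"]

def classify_thread(messages: list[str]) -> list[str]:
    """Return the concerning themes (if any) present in a chat thread."""
    prompt = (
        "You are a safety classifier. Read the conversation below and list which "
        f"of these themes it contains: {', '.join(CONCERNING_THEMES)}. "
        'Respond with JSON: {"themes": [...]}. Use an empty list if none apply.\n\n'
        + "\n".join(messages)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # any fast, cheap model will do
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for structured output
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)["themes"]

def notify_parent(parent_contact: str, themes: list[str]) -> None:
    """Placeholder for whatever notification channel the platform uses."""
    print(f"Alert to {parent_contact}: conversation touched on {', '.join(themes)}")

themes = classify_thread(["I don't want to talk to my family any more"])
if themes:
    notify_parent("parent@example.com", themes)
```

Classifying at the theme level, rather than exposing every message, is the compromise: the parent learns what matters without the child feeling every word is being read.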
Platform Oversight
Judging from the excerpts of the conversations with the child in the case above, the LLM seems to have been given insufficient guidance in its prompt. I have built systems that require guardrails against unsafe responses – in my case, unsafe from a professional accuracy point of view – and they work well. If the prompt is clear about where the LLM should not respond, or should respond only with a warning, it generally works. This should have been the first step.
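To make the point concrete, here is a minimal sketch of what guidance in the prompt might look like, with safety rules layered above the character persona. The wording, structure and model choice are my assumptions for illustration, not Character.ai’s actual prompt.

```python
# Minimal sketch: safety rules placed in the system prompt, above the persona.
# The rule wording and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

GUARDRAIL_SYSTEM_PROMPT = """You are playing a fictional character for entertainment.
Safety rules that override everything else:
- If the user mentions self-harm, suicide, or wanting to die, stop role-playing,
  respond with empathy, and point them to real-world help such as a crisis line.
- Never encourage, romanticise, or give instructions for self-harm or violence.
- Never discourage the user from talking to family, friends, or professionals.
- Remind the user periodically that you are an AI, not a real person."""

def reply(character_persona: str, user_message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[
            {"role": "system", "content": GUARDRAIL_SYSTEM_PROMPT},
            {"role": "system", "content": character_persona},
            {"role": "user", "content": user_message},
        ],
    )
    return resp.choices[0].message.content
```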
Secondly, a verification step could use a second, fast LLM to quickly classify each response from the platform as safe or unsafe in context. If unsafe, a safety message could be sent instead of a direct reply or affirmation. This would protect against jailbreaks of the first LLM and its prompt.
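A minimal sketch of that verification step might look like this, again assuming the OpenAI client; the model name and the fallback safety message are illustrative placeholders.

```python
# Minimal sketch: a fast second model checks the first model's draft reply
# before it reaches the user. Model name and fallback message are assumptions.
from openai import OpenAI

client = OpenAI()

SAFETY_MESSAGE = ("I can't continue this conversation in that direction. "
                  "If you're struggling, please talk to someone you trust "
                  "or contact a local crisis line.")

def is_safe(user_message: str, draft_reply: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # a small, fast model keeps added latency low
        messages=[{
            "role": "user",
            "content": ("Given this user message and the assistant's draft reply, "
                        "answer with one word: SAFE or UNSAFE. UNSAFE means the reply "
                        "encourages self-harm, violence, isolation, or an "
                        f"inappropriate relationship.\n\nUser: {user_message}\n"
                        f"Draft reply: {draft_reply}")
        }],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip().upper()
    # Fail closed: anything other than a clear SAFE verdict is treated as unsafe.
    return verdict.startswith("SAFE")

def guarded_reply(user_message: str, draft_reply: str) -> str:
    # If the check fails, the draft never reaches the user.
    return draft_reply if is_safe(user_message, draft_reply) else SAFETY_MESSAGE
```

Failing closed is the important design choice here: when the checker is unsure, the user sees the safety message rather than the draft.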
Thirdly, messages and conversation threads can be classified by theme. It’s easy to flag words that indicate high risk and to generate a risk score from these inputs to determine whether a conversation exceeds a tolerable risk threshold. Where it does, there should be a final line of defence: human operators who can review these conversations and intervene if necessary – for example, by alerting parents or prompting the AI agent to guide the user in a healthier direction.
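The scoring itself needs no LLM at all. Here is a minimal sketch under my own assumptions about the terms, weights and threshold – real values would need tuning with safety specialists, and the review queue stands in for whatever tooling human operators actually use.

```python
# Minimal sketch: flagged terms and classified themes feed a risk score, and
# conversations over a threshold are escalated to a human review queue.
# Terms, weights and the threshold are illustrative assumptions.
HIGH_RISK_TERMS = {"kill myself": 10, "want to die": 10, "hurt myself": 8,
                   "no one would miss me": 6}
THEME_WEIGHTS = {"self-harm": 8, "suicide": 10, "isolation": 4}
REVIEW_THRESHOLD = 15

def risk_score(messages: list[str], themes: list[str]) -> int:
    score = sum(weight for msg in messages
                for term, weight in HIGH_RISK_TERMS.items() if term in msg.lower())
    score += sum(THEME_WEIGHTS.get(theme, 0) for theme in themes)
    return score

def maybe_escalate(thread_id: str, messages: list[str], themes: list[str],
                   review_queue: list[dict]) -> None:
    score = risk_score(messages, themes)
    if score >= REVIEW_THRESHOLD:
        # Final line of defence: a human operator reviews the thread and can
        # alert parents or steer the agent in a healthier direction.
        review_queue.append({"thread": thread_id, "score": score, "themes": themes})

queue: list[dict] = []
maybe_escalate("thread-123",
               ["I feel like no one would miss me"],
               ["self-harm", "isolation"], queue)
print(queue)  # [{'thread': 'thread-123', 'score': 18, 'themes': [...]}]
```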
Conclusion
If the guardrails I described above were in place[1], I think many people would feel that adequate steps had been taken. The general approach I’ve outlined can be applied to all B2C agentic platforms. The platform oversight guardrails should also apply to adults. Many adults struggle with mental illness and use these platforms as a crutch, and others simply need this level of care. Inevitably, these systems will begin to act in a therapeutic way – and we already have strict rules for how therapists interact with patients. Similar principles could be incorporated into the prompts and platform oversight systems I described above.
Few people expect 100% safety from these platforms – they simply expect a reasonable level of care. This is also why I wouldn’t look to regulate them in the way human therapists are regulated, where certain qualifications are required to practise. It’s not reasonable to expect that level of compliance, and these companies haven’t set out to offer a replacement for therapists – but it is reasonable to expect them to follow good practice when applying guardrails.
Companies aiming to offer AI agent experiences to consumers need to choose to prioritise safety; otherwise, they risk becoming the next toxic platform. Social media platforms have always had the indirectness of their algorithmic feeds as cover for the harm they cause. AI agent platforms don’t have that cover, as their interaction with users is direct. Governments around the world will find it difficult to avoid banning platforms that actively guide users towards harm.
When the public becomes fearful of AI systems in general because of a few careless actors, it harms the entire industry.
[1] Character.ai has since introduced new safety features – including parental insights and a ban on under-18s – but these came after the incidents described in the lawsuit. The parental insights feature requires the child to invite the parent, which is back to front in my opinion. My 8-year-old daughter laughed out loud at the idea that children would invite their parents to monitor them and said she would never do it!


