Why AI struggles to detect hate speech online
Hanna Duggal, Mohammed Haddad
As the world marks the International Day for Countering Hate Speech on June 18, AI systems tasked with detecting and removing toxic content still fall far short of human judgment. Al Jazeera examines the limitations of AI models, including inconsistent scoring, failure to detect implicit hate speech, and misclassification of reclaimed language.
Hate speech, once limited to face-to-face encounters, now travels farther and faster through anonymous online accounts. Marking the International Day for Countering Hate Speech on June 18, UN Secretary-General Antonio Guterres warned that social media platforms are amplifying this threat.
According to the UN definition, hate speech encompasses all forms of communication (speech, text, or behavior) that discriminate against or incite violence against a person or group. The grounds for such attacks may include race, ethnicity, religion, gender, sexual orientation, or disability. Hate speech is not limited to words; it can also take the form of images, cartoons, gestures, and objects.
A joint 2023 survey by Ipsos and UNESCO of 8,000 people across 16 countries found that more than two-thirds of internet users had encountered hate speech online. A third of respondents (33%) identified the LGBTQI community as the most targeted group, followed by ethnic and racial minorities (28%) and women (18%).
Since 2023, Meta, the owner of Facebook and Instagram, has removed fewer hate speech posts. In the final quarter of 2025, it removed 1.3 million posts from Instagram and 1.3 million from Facebook, compared with 7.4 million on Instagram and 5.8 million on Facebook in the final quarter of 2024. The decline coincides with Meta shifting away from proactive detection and relying more on user reports. By contrast, TikTok says that in the final quarter of 2025 it removed 96.3% of hate speech content before it was reported.
Social media companies are increasingly turning to AI moderation systems based on large language models (LLMs) to filter content automatically. These systems use labeled datasets and pre-trained language models to detect abusive language, then apply rules or scoring thresholds to decide whether content is hateful or violates company policies.
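The thresholding step described above can be sketched in a few lines of Python. This is a minimal illustration, not any platform's actual pipeline: the function name `apply_thresholds` and the cut-off values `remove_at` and `review_at` are assumptions made for the example, and the score is assumed to be a number between 0 and 1 already produced by some classifier.

```python
def apply_thresholds(score: float, remove_at: float = 0.9, review_at: float = 0.5) -> str:
    """Map a hatefulness score in [0, 1] to a moderation action.

    The cut-offs are hypothetical; real platforms tune their own
    values against their own policies.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= remove_at:
        return "remove"          # high confidence: take the post down
    if score >= review_at:
        return "human_review"    # uncertain: send to a human moderator
    return "allow"               # low score: leave the post up


# Hypothetical scores for three posts, as a classifier might return them.
actions = [apply_thresholds(s) for s in (0.95, 0.60, 0.10)]
```

The middle band is a design choice: sending uncertain cases to human reviewers trades speed for accuracy in exactly the cases where automated scores are least reliable.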
A 2025 study by the University of Pennsylvania found that different AI models vary widely in how they identify and classify hate speech, creating significant inconsistency. The study evaluated seven AI moderation systems, including models from OpenAI, Anthropic, DeepSeek, Mistral, and Google, and found major differences in how they scored hatefulness. The Mistral Moderation Endpoint often assigned scores very close to 1, meaning it rated almost every sample as highly hateful regardless of target group. In contrast, the OpenAI Moderation Endpoint gave much lower scores, sometimes less than half those of other models.
The study authors noted: 'If two systems return different results for the same content — flagging it as hate speech in one but not the other — it undermines the legitimacy of the moderation process.'
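A short sketch shows why differing score scales matter in practice. The scores below are invented for illustration and are not data from the study; the function names and the shared 0.5 threshold are likewise assumptions. One hypothetical system scores nearly everything close to 1, while another scores the same posts much lower, so the same threshold produces different verdicts.

```python
def flags(scores: list[float], threshold: float = 0.5) -> list[bool]:
    """Return True for every score at or above the threshold."""
    return [s >= threshold for s in scores]


def disagreement_rate(scores_a: list[float], scores_b: list[float],
                      threshold: float = 0.5) -> float:
    """Share of posts that one system flags and the other does not."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("both systems must score the same non-empty set of posts")
    flags_a = flags(scores_a, threshold)
    flags_b = flags(scores_b, threshold)
    differing = sum(a != b for a, b in zip(flags_a, flags_b))
    return differing / len(flags_a)


# Made-up scores for the same four posts from two hypothetical systems.
system_a = [0.97, 0.97, 0.92, 0.88]   # scores cluster near 1
system_b = [0.40, 0.55, 0.30, 0.10]   # scores run much lower

rate = disagreement_rate(system_a, system_b)
```

With these invented numbers, the two systems disagree on three of the four posts, even though both were given identical content and an identical threshold.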
Arkaitz Zubiaga, an associate professor at Queen Mary University of London, said AI can detect clear-cut hate speech, such as posts containing swear words or slurs aimed at a specific group, but misses more subtle cases. 'One difficult example is implicit hate speech, which often goes undetected because it contains no slurs,' Zubiaga said. This could be a seemingly positive message like 'I would love to see how wonderful the world would be if...' followed by content insulting a demographic group. The AI focuses on the positive framing and misses the hate.
Conversely, words that seem offensive but have been reclaimed by formerly targeted communities and used affectionately are often wrongly flagged by AI. Zubiaga explained: 'This is a case of reclaimed language, where keywords that were historically slurs are adopted and reused by the community they once offended in a loving way. While these cases should not be flagged as hateful, AI tends to do so.'
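Both failure modes Zubiaga describes can be reproduced with a deliberately naive keyword filter. This is an illustrative baseline only, not how the systems discussed in this article work: the blocklist entry `slur_x` and the group name `group_y` are placeholders, and both sample messages are invented.

```python
# Placeholder token standing in for a historical slur (hypothetical).
BLOCKLIST = {"slur_x"}


def keyword_flag(text: str) -> bool:
    """Flag a message if any word in it appears on the blocklist."""
    words = {w.strip(".,!?'\"").lower() for w in text.split()}
    return not BLOCKLIST.isdisjoint(words)


# Implicit hate: hostile meaning, positive framing, no listed keyword.
implicit = "I would love to see how wonderful the world would be without group_y."

# Reclaimed language: the listed keyword used affectionately by the community itself.
reclaimed = "Proud to be a slur_x, love my community!"

missed = not keyword_flag(implicit)      # the hostile message slips through
wrongly_flagged = keyword_flag(reclaimed)  # the affectionate message is flagged
```

The filter passes the first message and flags the second, the reverse of what a human moderator would decide, because it sees only surface vocabulary and not who is speaking or what is meant.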