Wikipedia:Wikipedia Signpost/2026-06-21/Recent research

Recent research

Proposed tagging system for AI involvement; successful and unsuccessful AI tools for contributors

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"Labeling the Spectrum of AI Involvement: New Tag Proposals for Wikipedia and Commons"

This paper^[1], presented last month at a workshop about "Diffusion of Harmful Content on Online Web" as part of the annual ACM WebSci conference, proposes a comprehensive system for disclosing AI involvement in contributions to Wikipedia and Wikimedia Commons, building on the edit tags system of the MediaWiki software. From the abstract:

"Building on prior work analyzing inconsistencies in Wikipedia’s Special Tags system, this paper argues that current tagging practices do not match the realities of modern editing workflows. Tool usage is frequently underreported, tag adoption varies widely across languages and topic areas, and editors often have incentives to hide tool involvement to avoid being held responsible for errors introduced by automated systems. Recent work, such as the development of LLM-based image captioning tools for Wikimedia Commons, illustrates that AI participation is already widespread and expanding. Instead of attempting to restrict AI use entirely, this work proposes a more practical strategy. This paper outlines a set of new tags organized into four major categories: Content Creation Tags for textual editing, Assistance and Verification Tags for evaluation and support functions, Metadata Suggestion Tags for organizational elements, and Media-Specific Tags for images, audio, and video. These tags document how, where, and to what extent AI systems contributed to Wikipedia content [...]"

"Proposed AI Tags by Category and Level of Involvement. The chart displays all proposed tags organized into four categories: Content Creation Tags (Text), Assistance and Verification Tags, Metadata Suggestion Tags, and Media-Specific Tags (Images, Audio, Video). The height of each bar indicates the level of AI involvement, from Lightest to Heavy/Heaviest. Color coding serves visual purposes only and does not indicate endorsement or legitimacy." (Figure 1 from the paper)

In the paper's introduction, the authors refer to extensive community discussions over the past years as a motivation for their system:

The community is also divided on what to do, with some advocating for a prohibition of AI tools and others embracing use of tools for light editing or even more.

In practice, a prohibition on AI-generated text would be impossible to enforce. Many people suggest banning AI-generated content, but this is hard because AI use is already common and difficult to detect. This work aims to enable contributors to Wikipedia to label their contributions accurately. Rather than discouraging or shaming contributors for using AI tools, labeling AI involvement is a more practical path that encourages honest disclosure.

The paper also mentions the numerous quality problems with AI-generated content on Wikipedia identified by the WikiProject AI Cleanup.

A "Demonstration" section walks through several concrete examples of applying the proposed tags. One involved prompting an LLM to "Polish this paragraph" for some text about Bangladesh, showing how AI tools can assist with stylistic refinement while leaving the substantive content and overall meaning under human control. Another, demonstrating the proposed "AI-Bias-Detection" and "AI-Bias-Removal" tags, used the article 3ality Technica (as an example of an article already tagged for NPOV problems by human editors).

Like much other peer-reviewed academic research about the impacts of the AI boom, this paper is already outdated in some respects at the time of its publication, both in terms of the capabilities of the models used in the "Demonstration" section (e.g. GPT-4) and in its summary of community discussions. That summary seems to predate English Wikipedia's WP:NOLLM policy, instituted in March 2026, which now outright prohibits some kinds of edits in the red-colored part of the figure (although not all kinds of AI use). However, the paper at least mentions that the community strongly opposes unreviewed AI text in articles. Also, for the above-quoted statement that "AI use is [...] difficult to detect", the authors cited a paper from 2020, i.e. from before the advent of current LLMs and AI detectors.

A section on the practical implementation of the proposed system emphasizes that it

[...] must be vetted through Wikipedia’s established governance processes: discussion on Village Pump, feedback from relevant WikiProjects, and potentially pilot programs in specific topic areas or language editions. Different communities may adapt the framework to local norms while maintaining core transparency principles. Technically, MediaWiki must provide easy tagging mechanisms, dropdown menus, checkboxes, or semiautomated suggestions based on edit patterns. Specifically, the proposed tags would be implemented using the ChangeTagsListActive hook, which allows new tags to be registered in the system. Editors would apply them through a collapsible checklist in the EditPage form, and once applied, tags would appear in revision history and be accessible via the API [...]
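As a rough illustration of the last point (this sketch is ours, not from the paper): change tags on revisions can already be read through the MediaWiki Action API, using `prop=revisions` with `rvprop=tags` and the `rvtag` filter. The tag name below is one of the paper's proposals and is not a registered tag on any wiki, so an actual request with it would not be expected to return revisions today. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# English Wikipedia's Action API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_tagged_revisions_url(page_title: str, tag: str, limit: int = 10) -> str:
    """Build an Action API URL listing a page's revisions that carry one change tag."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": page_title,
        # Ask for revision IDs, timestamps, users and the tags on each revision.
        "rvprop": "ids|timestamp|user|tags",
        # Only list revisions carrying this tag.
        "rvtag": tag,
        "rvlimit": limit,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


# "AI-Bias-Detection" is a tag proposed in the paper, used here hypothetically.
url = build_tagged_revisions_url("3ality Technica", "AI-Bias-Detection")
print(url)
```

If such tags were ever registered, the same query shape would let researchers measure how often each level of AI involvement is disclosed.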

"Designing for Human–AI Collaboration in Open Knowledge Work": Why the Wikimedia Foundation's "Computer-Aided Tagging" tool on Commons failed

"The CAT interface displaying unverified suggested tags for an example image file" (figure 2 from the paper)

From the abstract of this paper (accepted at the upcoming CSCW conference):^[2]

"This study investigates Wikimedia Commons contributors’ lived experiences with the Computer-Aided Tagging (CAT) tool, an AI-assisted image tagging system designed to improve Commons’ discoverability, searchability, accessibility, and multilingual support. Using a qualitative analysis of 595 CAT-related community comments from 11 wiki pages and 16 in-depth interviews, we identify seven key issues that contributed to CAT’s mixed reception and eventual deactivation. We also offer community-informed suggestions for improving the tool."

In the "Discussion" section, the researchers identify

[...] seven key issues that contributed to CAT’s mixed reception and eventual deactivation: (1) misalignment in the perspectives of Structured Data on Commons, (2) unclear definitions of the Depicts statement, (3) difficulties applying Depicts through CAT, (4) lack of integration between categories and CAT, (5) an ill-specified AI/ML task, (6) limited support for collaborative evaluation, and (7) a disconnect between CAT and Commons’ search functionality.

("Structured Data on Commons" is an effort launched by the Wikimedia Foundation in 2017 to integrate structured data from Wikidata on Commons. An earlier research publication had found it to have "made little progress on Commons because many contributors simply did not know about it or did not care", or "preferred their 'own' [category-based] system over a new structure designed by the foundation".)

(See also our past coverage of previous research by the two authors.)

Briefly

See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

ChatGPT assistance improved student editors' "linguistic polish", but did not improve (or worsen) content quality or verifiability

From the abstract:^[3]

"This randomized controlled study assessed how model assistance influences the quality of Wikipedia edits created by undergraduate audiology students [in Brazil]. Thirty-six participants were assigned to two groups: Group 1 (G1) edited without model support, and Group 2 (G2) edited with ChatGPT support [ChatGPT 5 Instant]. Twenty blinded expert reviewers evaluated a sample of 30 texts (15 per group) using a six-item Likert-scale instrument covering Content, References, and Language. Inter-rater reliability was high (Gwet's AC2 = 0.80). The analysis suggests that LLM assistance may lead to a significant improvement in Language (β = 0.400; p = 0.023; Cliff's δ = 0.400). G2 improved by 0.68 ± 0.47, while G1 improved by 0.28 ± 0.57. The analysis found no significant differences for Content (β = 0.167; p = 0.214) or References (β = −0.267; p = 0.859), although G2 scores in the latter category trended lower. [...] LLM assistance appears to risk substituting constructive contribution with linguistic polish."
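For readers unfamiliar with the effect size reported in the abstract: Cliff's δ compares every value in one group with every value in the other, and is the share of pairs where the first group is higher minus the share where it is lower. It ranges from −1 to 1, with 0 meaning no tendency either way. The following is a minimal sketch of that standard definition, not the authors' analysis code; the scores are made up for illustration and are not data from the study.

```python
def cliffs_delta(treatment: list[float], control: list[float]) -> float:
    """Cliff's delta: share of (treatment, control) pairs where the treatment
    value is higher, minus the share of pairs where it is lower."""
    greater = sum(1 for t in treatment for c in control if t > c)
    less = sum(1 for t in treatment for c in control if t < c)
    return (greater - less) / (len(treatment) * len(control))


# Made-up improvement scores (NOT data from the study).
with_llm = [1.0, 0.5, 1.0, 0.0]
without_llm = [0.5, 0.0, 0.5, 0.0]

delta = cliffs_delta(with_llm, without_llm)
print(delta)
```

Because it relies only on which value in a pair is larger, the measure is suited to ordinal ratings such as Likert scores.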

From the "Content selection" section:

"The 15 texts closest to the median in each group were selected, excluding the 3 with the greatest deviation (6 total). Extreme word counts were excluded to reduce heterogeneity, as very short edits may indicate disengagement or misunderstanding, and long edits could show pre-existing expertise"
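The selection step described in this quote can be sketched as follows. This is our reading, under the assumption that "deviation" refers to distance from the group's median word count; the paper's actual procedure may differ in details such as tie-breaking. The function name and the sample word counts are hypothetical.

```python
from statistics import median


def select_closest_to_median(word_counts: dict[str, int], keep: int) -> list[str]:
    """Return the IDs of the `keep` texts whose word counts are closest to the median."""
    center = median(word_counts.values())
    # Rank by absolute distance from the median; break ties by ID for determinism.
    ranked = sorted(
        word_counts,
        key=lambda text_id: (abs(word_counts[text_id] - center), text_id),
    )
    return sorted(ranked[:keep])


# Made-up word counts for six hypothetical texts (NOT data from the study).
group = {"t1": 120, "t2": 135, "t3": 40, "t4": 128, "t5": 310, "t6": 131}

selected = select_closest_to_median(group, keep=4)
print(selected)
```

In this toy example the very short text (`t3`) and the very long one (`t5`) are dropped. In the study, the same idea would be applied per group with `keep=15`.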

From the "Instrument" section:

"A questionnaire [...] was created to evaluate the quality of Wikipedia articles, consisting of six questions rated on a four-point Likert scale (Poor, Fair, Good, Excellent). Its construction was based on an open-licensed assessment rubric used in educational activities involving (Wiki Education Foundation, 2025; see also https://w.wiki/8Xzv). This rubric was selected for its use in Wikipedia educational contexts and its development by the Wikipedia educational community. The instrument is structured across three dimensions with six items. Content: Evaluates thematic scope (Q1), ensuring all relevant aspects are covered, and quality/accuracy (Q2), assessing if the information is current and clearly explained. References: Checks verifiability through citation coverage (Q3) and the reliability of the sources utilized (Q4). Language: Assesses textual mechanics and correctness (Q5) alongside the logical organization and coherence of topics (Q6)."

Dialog from the app, showing suggested machine-generated descriptions and reporting flags (figure from the paper)

"Integrating Machine-Generated Short Descriptions into the Wikipedia Android App: A Pilot Deployment of Descartes"

From the abstract:^[4]

"Short descriptions are a key part of the Wikipedia user experience, but their coverage remains uneven across languages and topics. In previous work, we introduced Descartes, a multilingual model for generating short descriptions [see this newsletter's 2023 coverage]. In this report, we present the results of a pilot deployment of Descartes in the Wikipedia Android app, where editors were offered suggestions based on outputs from Descartes while editing short descriptions. The experiment spanned 12 languages, with over 3,900 articles and 375 editors participating. Overall, 90% of accepted Descartes descriptions were rated at least 3 out of 5 in quality, and their average ratings were comparable to human-written ones. [...] The pilot also revealed practical considerations for deployment, including latency, language-specific gaps, and the need for safeguards around sensitive topics. These results indicate that Descartes's short descriptions can support editors in reducing content gaps, provided that technical, design, and community guardrails are in place."

The preprint (by several Wikimedia Foundation employees and researchers from EPFL) mentions that the underlying model "is currently deployed on LiftWing, Wikipedia’s production machine learning serving platform".

References

[1] Ahamed, Ashik; Wang, Max; Matthews, Jeanna (2026-05-25). "Labeling the Spectrum of AI Involvement: New Tag Proposals for Wikipedia and Commons". Companion Publication of the 2026 18th ACM Web Science Conference. WebSci Companion '26. New York, NY, USA: Association for Computing Machinery. pp. 98–107. doi:10.1145/3795513.3810451. ISBN 9798400724923. / author's copy
[2] Yu, Yihan; McDonald, David W. (2026-05-29). "Computer-Aided Tagging on Wikimedia Commons: Designing for Human-AI Collaboration in Open Knowledge Work". arXiv:2605.30800 [cs.HC].
[3] Corrale de Matos, Hector Gabriel; Wasmann, Jan-Willem; Peschanski, João Alexandre; Alvarenga, Kátia de Freitas; Jacob, Lilian Cassia Bornia (2026-05-21). "Large Language Models on Wikipedia Editing: The Stylistic Mask Effect in a Randomized Controlled Intervention in Health Education". doi:10.5281/zenodo.20320736 – via Zenodo.
[4] Šakota, Marija; Brant, Dmitry; Feng, Cooltey; Nowick, Shay; Ramadan, Amal; Schoenbaechler, Robin; Seddon, Joseph; Tanner, Jazmin; Johnson, Isaac; West, Robert (2026-01-12). "Integrating Machine-Generated Short Descriptions into the Wikipedia Android App: A Pilot Deployment of Descartes". arXiv:2601.07631v1 [cs.CL].
