Commons talk:Categories

From Wikimedia Commons, the free media repository
Latest comment: 4 days ago by Alaexis in topic AI-assisted diffusion
This is the talk page for discussing improvements to Commons:Categories.
Archives: 1, 2, 3, 4, 5, 6

Merging categories with identical scopes

A dispute has arisen over whether the categories for the two Boeing aircraft construction number systems should be merged or kept separate. For context, Boeing uses two separate construction number systems to designate its aircraft, so each Boeing airliner is given both a unique "manufacturer serial number" (msn) and a "line number" (ln). Currently, there are separate categories for msn and ln (for example, Category:Boeing 747-8 (msn 37075) and Category:Boeing 747-8 (ln 1449) both refer to the same individual aircraft). Unlike an aircraft's registration, which can change multiple times throughout the aircraft's service life, an msn/ln pairing never changes from the moment the aircraft is built to the moment it is scrapped or otherwise destroyed. In a sense, an msn/ln pairing is the absolute identifier for an individual Boeing airliner. As such, an aircraft's "msn" and "ln" categories should always be populated with identical subcategories and pages, with identical sorting. It is my understanding that this is a textbook example of COM:OVERCAT and that the categories should be merged. If I am mistaken, please let me know.

It should be noted that there are many thousands of "msn" and "ln" categories just like the above example, so if this is indeed OVERCAT, then it will be a huge effort to clean it up manually. Perhaps it could be a task for a bot.

Pinging Ardfern, who has been on the other side of this dispute. - ZLEA T\C 03:16, 1 November 2025 (UTC)

I think it's fine the way it is; the existence of these additional categories does no harm. Perhaps it would be good to link them with {{See also cat}}, though. - Jmabel ! talk 20:55, 21 November 2025 (UTC)

Recommending populating categories at creation

Regarding the creation of categories, this page currently has:

To create a new category:
[…]
2. Find images (or a gallery or other pages) which should be put in the new category. Edit this page, and at the end insert the new category reference. e.g. Category:Titles. Save the edited page. The new category appears as a red link at the bottom of the page.

So a user who adds just one image to a category and then creates it, leaving it near-empty, would be following this policy perfectly.

I think that creating categories containing only one file or very few files when there are more (or only a tiny fraction of the files that belong in them) is not beneficial overall. Such categories give people (who open them via Commons search, Web search, file cats, or category-subcategory browsing) a wrong impression of what's on Commons relating to the subject, aren't useful, and are basically misleading.

Thus, I suggest adding text to the section Creating a new category recommending that people also do a thorough search to find and add files in the scope of the new category before or directly after creating it. It could also be worth considering making some effort to populate a category a requirement, to reduce the number of excessively incomplete categories being created and to get more users to properly add files to new categories.

Currently, if a category has not yet been created (rather than created with just very few files), the user who eventually creates it will usually do a thorough search to check whether files are missing. This is a much rarer practice for categories that already exist, as people assume that the creator and other visitors have likely already done so (not always, of course, just more often than not).

The Creating a new category section could also describe techniques for finding relevant files, such as using search term + deepcategory:"a parent category that contains relevant files" in the search, and/or mention the tools Cat-a-lot and HotCat, which can make populating a category much easier, quicker, and more accessible than the antiquated plain wikitext editing for adding categories that's barely used anymore but is described in this subsection. Ideas and suggestions for the text could be added to this thread. Prototyperspective (talk) 17:37, 21 November 2025 (UTC)
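To make the search technique above concrete, here is a minimal sketch that only assembles the search string in the form quoted in the comment (search term + deepcategory:"parent category"). The function name and the example term and category are illustrative assumptions; the result is meant to be pasted into the Commons search box.

```python
def deepcategory_query(term: str, parent_category: str) -> str:
    """Build a search string of the form: term deepcategory:"Parent category".

    Hypothetical helper: it only formats text and does not contact Commons.
    """
    parent = parent_category.strip()
    # Accept the name with or without the "Category:" prefix.
    if parent.startswith("Category:"):
        parent = parent[len("Category:"):]
    return f'{term.strip()} deepcategory:"{parent}"'


if __name__ == "__main__":
    print(deepcategory_query("crosswalk", "Category:Streets in Seattle"))
```

The idea is to restrict a plain keyword search to files already somewhere beneath a relevant parent category, which narrows the candidates to review before adding them with Cat-a-lot or HotCat.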

I would agree that should be recommended, but it should be understood more as a "best practice" than as a command. I usually try to do that, but I can think of times I've skipped it, especially when introducing an intermediate category in the hierarchy to make sure a "leaf" category I'm adding gets an ancestor structure similar to other parallel categories. For example, if someone were introducing Category:John T. Williams memorial pedestrian crossing and Category:Howell Street, Seattle didn't already exist, so they had to create it, I wouldn't necessarily consider them obligated to see if they could further populate Category:Howell Street, Seattle. Great if they could do that, but far from required. - Jmabel ! talk 21:03, 21 November 2025 (UTC)
One could also add it like that and then maybe think about whether to phrase it less like a mere recommendation. One could also name exceptions or broad principles/types of categories where this doesn't make sense.
"if someone were introducing Category:John T. Williams memorial pedestrian crossing and Category:Howell Street, Seattle didn't already exist, so they had to create it, I wouldn't necessarily consider them obligated to see if they could further populate Category:Howell Street, Seattle." The better course of action, if they don't populate it, would be to just add the category as a redcat along with the closest existing category, such as Category:Public art in Seattle. Somebody who sees the category, via the other categories set on it or otherwise, can create it if they also populate it. If your goal is to create a certain category, there is no need and no justification for creating misleading, incredibly incomplete intermediary categories just because they fit on the category you wanted to create. The intended effect here is not that the user populates a street category, but that they refrain from creating the empty street category and leave it, e.g., to users motivated and skilled to do so, or to users who frequently or routinely create street categories and know what to do and how. Creating empty or very incomplete categories is, at least at this point, unconstructive.
Also, I'd like to add that the guidance in that section currently describes things as if the HotCat gadget was not a default-enabled gadget. Prototyperspective (talk) 15:42, 24 November 2025 (UTC)Reply
@Prototyperspective: I'm having trouble following at least some of that response. "the better course of action would be to just add the category as a redcat": not sure which category you mean. If you mean Category:Howell Street, Seattle, I disagree. It's much more likely to get populated if it is visible when coming down the hierarchy from Category:Streets in Seattle. - Jmabel ! talk 19:53, 24 November 2025 (UTC)
Yes, Howell Street, Seattle which you said you didn't populate. I outlined earlier and multiple times why it's problematic if categories are heavily incomplete. I could expand on that but I'm not sure if it was unclear or if you have anything to address those things. Such categories would be more likely to get populated if you leave creating them to those that do substantially populate them at creation. There are lots of fine-resolution subcategories around all the relevant topics already. Maybe it would be more likely to get a file here and there but it's not much of a help if it has just 2 or 4% of files. The course of action proposed here is to also populate the new subcategory you think should be set on a category you're also creating. The second best option would be to just use the closest category and leave creating the category to somebody who will populate the new category. That small category already has
  • Pedestrian crossings in Seattle
  • Boren Avenue, Seattle
  • Deer in art
  • Public art in Seattle
  • Monuments and memorials in Seattle
  • Denny Triangle, Seattle, Washington
  • Decorated pedestrian crossings
  • White road markings in the United States
plus a good substitute for the category Howell Street, Seattle, namely one (or several) of the parent categories now set on it. Again, it's all fine if you add a sizable fraction of the files that belong in it, but if not, how could people even tell that it's a stub category with <1% of the files? It's not useful, and it obstructs populating categories, since that is usually done when a new category is created. Prototyperspective (talk) 23:25, 24 November 2025 (UTC)
No, I did not say I did not populate Category:Howell Street, Seattle. Please re-read my remark, which was in the subjunctive and referred to a hypothetical situation for a hypothetical user. - Jmabel ! talk 01:17, 25 November 2025 (UTC)Reply
You're right on that, sorry. I'm addressing the hypothetical then which you argued by/for, not what was actually done. Prototyperspective (talk) 15:33, 25 November 2025 (UTC)Reply
(In fact, we seem to have few pictures along that street; if we have more, there is no indication in their respective descriptions. I still think it is an appropriate parent for a crosswalk category, even if it makes for a barely-populated category.) - Jmabel ! talk 01:23, 25 November 2025 (UTC)Reply
"we seem to have few pictures along that street; if we have more, there is no indication in their respective descriptions" In such cases it would be totally fine with what was proposed if it's put into stronger language than a mere loose recommendation. The proposal is:
  • about some minimum level of checking whether there are files that belong in the cat and adding them (e.g. searching for the name of the category and then adding the relevant ones from these search results)
  • about the fraction of files that are added to the cat compared to all files on Commons that belong in it
It is not about the total number of files in the cat (see "when there are more" in the original post). Prototyperspective (talk) 15:37, 25 November 2025 (UTC)
In practice, I believe I usually do a fair job of populating categories I create, but I don't think that is incumbent on everyone who creates a category. - Jmabel ! talk 01:26, 25 November 2025 (UTC)Reply

AI-assisted diffusion

This image should be in Category:Abkhazia/Cities in Abkhazia/New Athos/Iverian Mountain rather than in Category:Abkhazia directly

There are thousands of categories requiring diffusion. Many editors, myself included, have diffused thousands and thousands of images over their Commons careers, but it looks like the backlog keeps growing. For the most part it's drudgery, even though sometimes you learn something new or go down various rabbit holes trying to locate some ruins in the Caucasus.

This problem is related to but different from the problem of non-categorised media. We've been using bots to find categories for newly uploaded images lacking categories for 10+ years.

Would you use a user script that analyses images in a category with hundreds of images and suggests how to categorise them properly? Assume it's not perfect but has decent performance (e.g., it suggests categories for 80% of images, and 80% of suggestions are correct). Alaexis (talk) 11:47, 6 February 2026 (UTC)Reply
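For a sense of scale, here is a rough back-of-the-envelope sketch of what those figures would mean for one category. It assumes (this is an assumption, not something stated above) that the two 80% figures are independent and that each image gets at most one suggestion; the function name and the 500-image category are purely illustrative.

```python
def expected_outcomes(total_images: int, coverage_pct: int, precision_pct: int) -> dict[str, int]:
    """Split a category into correct, wrong, and missing suggestions.

    coverage_pct: share of images that receive any suggestion.
    precision_pct: share of those suggestions that are correct.
    Integer arithmetic keeps the illustration free of rounding noise.
    """
    suggested = total_images * coverage_pct // 100
    correct = suggested * precision_pct // 100
    return {
        "suggested": suggested,
        "correct": correct,
        "wrong": suggested - correct,
        "no_suggestion": total_images - suggested,
    }


if __name__ == "__main__":
    print(expected_outcomes(500, 80, 80))
```

Under these assumptions roughly 64% of the images would end up with a correct suggestion, 16% with a wrong one that a reviewer has to reject, and 20% with none.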

I would honestly love to see this, as someone who's made bulk category edits myself. A model parsing the image and relevant metadata could definitely be tested on a sample of images, and we can see how accurate the suggestions are and if it is worth pursuing this further. The two risks I am keeping in mind are either hallucinating non-existing categories, or hallucinating details about the image that are not present (for example, giving a specific species ID for an organism that can't be identified visually down to that level). Chaotic Enby (talk) 12:10, 6 February 2026 (UTC)Reply
Maybe we could even start only with metadata: the name, description, date, location, etc. I think that most of the time it would be sufficient to categorise an image. Alaexis (talk) 18:34, 8 February 2026 (UTC)Reply
Are you thinking of developing such a tool or something similar, or is this only about hypotheticals?
Furthermore, please see Commons:Bots/Work requests/Archive 18#Auto-addition of inferrable categories. Prototyperspective (talk) 21:56, 7 February 2026 (UTC)Reply
@Prototyperspective, I'm thinking about it and trying to gauge the interest/need. Thanks for sharing the link. Bots have advantages over user scripts but I'm a bit wary of fully automated categorising, I don't think we're quite there yet, accuracy-wise.
One could also imagine a bot that doesn't categorise files itself but rather suggests categories at the talk page of a relevant category, but that would require substantial manual work to implement the proposed changes. Alaexis (talk) 18:32, 8 February 2026 (UTC)Reply
Sure. That's why it would need testing, gradual rollout and a good design. The latter can for example be a metacategory 'auto-added categories that need checking' that gets removed when a user reviewed the categories or the categories being added only in commented out form so that they only get added once a user confirmed them with the click of a button using some new gadget. I think I wrote down some ideas relating to suggested categories but I can't find them anymore other than for example for video files in Category:Short films it would create a 'suggestion' to add Category:Short films videos at that BWR link. I don't think adding it to the Talk pages would be good and what you said about manual work also is a big point that needs to be considered. Prototyperspective (talk) 18:58, 8 February 2026 (UTC)Reply
Yeah, if you go this way, no need to involve talk pages. If it's a bot good enough to be worth using at all, it would make sense for it to write a list of possible categories straight into the file page, either commented out or otherwise disabled, plus one maintenance tag asking for human review of that file page. - Jmabel ! talk 21:37, 8 February 2026 (UTC)Reply
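A minimal sketch of that idea, assuming the suggestions are written as HTML comments at the end of the file page's wikitext followed by a single maintenance tag. The template name `{{Category suggestions need review}}` and the `suggested:` marker are hypothetical placeholders, not existing Commons conventions.

```python
REVIEW_TAG = "{{Category suggestions need review}}"  # hypothetical template name


def add_suggestions(wikitext: str, suggestions: list[str], review_tag: str = REVIEW_TAG) -> str:
    """Append commented-out category suggestions plus one review tag.

    The categories stay inactive until a human reviewer uncomments them.
    With no suggestions, the page text is returned without a tag.
    """
    lines = [wikitext.rstrip("\n")]
    for name in suggestions:
        lines.append(f"<!-- suggested: [[Category:{name}]] -->")
    if suggestions:
        lines.append(review_tag)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    page = "== Summary ==\n[[Category:Seattle]]\n"
    print(add_suggestions(page, ["1891 in Seattle"]))
```

Because the suggestions are inert comments, nothing changes in the category tree until a person confirms them, which keeps a human in the loop.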
Yeah, that's a good idea.
I have some experience with user scripts and zero experience with bots. I think that it would make sense to create a PoC to see if the algorithm indeed works and then proceed with building a bot. Would you be willing to test the algorithm? Alaexis (talk) 22:16, 8 February 2026 (UTC)Reply
I wonder how you'd identify the categories to suggest though – is it also about inferring it via the other categories or would you intend to use the metadata? Especially if it's not inferring from the already-set categories that seems like the hard part – file titles are often easy to misinterpret, descriptions are often long, date in description is sometimes false, geocoordinates are rarely set. It would be great if you could build a PoC I think but it may be best to think about how to find the cats to suggest beforehand because maybe the ideas for how to do that aren't viable. Prototyperspective (talk) 22:49, 8 February 2026 (UTC)Reply
I'd absolutely be willing to help with a proof of concept. Regarding errors with metadata, well – we kinda already have to deal with them when categorizing things. If we don't have geocoordinates (and the photo isn't a recognizable landmark), we might not give a much better location ID than what is already present. There should be a way for the model to just not suggest any refinement if none is reasonably supported by the data. Chaotic Enby (talk) 06:33, 9 February 2026 (UTC)Reply
Keep in mind that geocoordinates in EXIF or on file pages are often way off. I'd be loath to see a bot make decisions on that basis that were not checked by a human. - Jmabel ! talk 07:02, 9 February 2026 (UTC)Reply
Yes, presumably we'll always have a human in the loop mechanism. Chaotic Enby (talk) 07:32, 9 February 2026 (UTC)Reply
@Chaotic Enby, @Jmabel, @Prototyperspective, please try out Diffusor! Very keen to hear your thoughts. Alaexis (talk) 20:42, 9 February 2026 (UTC)Reply
Just tried it (sort of). Three minutes in it was on 3 of 19 of some task, and I stopped it. I take it that this is the sort of batch task you start when you are off to have dinner. - Jmabel ! talk 21:20, 9 February 2026 (UTC)Reply
All right, tried it on something smaller. Can I suggest that it would probably be an improvement if the refined categories were to be placed in the same place as the category being replaced, rather than at the bottom? - Jmabel ! talk 21:30, 9 February 2026 (UTC)Reply
Also:
  • It would be good if the summary indicated what has been changed. If this were used by someone else on one of my files, I'd rather not have to check the diff to see if what they did was sane.
  • Even at this experimental phase, it should probably have a tag so people can control whether they see these in their watchlist.
Jmabel ! talk 21:34, 9 February 2026 (UTC)Reply
Thanks for the feedback! Which category was too large for it? The summary will be easy to fix. What do you mean by the tag? Alaexis (talk) 22:51, 9 February 2026 (UTC)Reply
Category:Seattle overwhelmed it. I suspect it may be more that there is a heck of a hierarchy under that than the number of images in the category.
Tags: Special:Tags. And then you have the bot use that in the edit summary. Allows filtering the watchlist. - Jmabel ! talk 01:40, 10 February 2026 (UTC)Reply
@Jmabel Seattle is a great example. When I analysed it, it worked and suggested categories for 251 out of 374. There were two problems: it took a while to fetch the category tree and then to get responses from the LLM. In fact I limited the number of fetched categories to 2000 to keep the prompt sizes manageable.
Therefore for this image it suggested categories Great Seattle Fire and Seattle in 1890s rather than the more accurate Aftermath of the Great Seattle Fire and 1891 in Seattle respectively (since the tree crawl didn't get down to that level). This is an improvement compared to the original category but I can also imagine this being a bit frustrating - if I'm categorising images I'd want to do it properly.
Alaexis (talk) 19:26, 10 February 2026 (UTC)
Maybe something that could work would be to apply it recursively, going down from the current suggestions until it doesn't find any new relevant categories? The only worry I'm having is of it getting stuck in category loops, but keeping a history of checked categories would avoid this. Chaotic Enby (talk) 22:01, 10 February 2026 (UTC)Reply
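The recursive idea could be sketched roughly as follows. This is not how Diffusor works; it is a toy illustration over an in-memory graph, where `subcats` (a dict of subcategory lists) and `is_relevant` (standing in for whatever judges a category to fit the image) are both hypothetical. The `visited` set is the "history of checked categories" that prevents loops, and the parent/child links in the example graph are made up for the demonstration.

```python
from collections import deque
from typing import Callable


def refine(start: str, subcats: dict[str, list[str]], is_relevant: Callable[[str], bool]) -> list[str]:
    """Descend from start and return the deepest relevant categories found.

    A category is reported once none of its subcategories is both new
    and relevant. An empty list means no refinement is suggested.
    """
    visited = {start}  # history of checked categories, guards against loops
    queue = deque([start])
    deepest = []
    while queue:
        current = queue.popleft()
        found_new = False
        for child in subcats.get(current, []):
            if child in visited or not is_relevant(child):
                continue
            visited.add(child)
            queue.append(child)
            found_new = True
        if not found_new and current != start:
            deepest.append(current)
    return deepest


if __name__ == "__main__":
    toy_tree = {
        "Seattle": ["Great Seattle Fire", "Seattle in 1890s", "Parks in Seattle"],
        "Great Seattle Fire": ["Aftermath of the Great Seattle Fire"],
        "Seattle in 1890s": ["1891 in Seattle", "Seattle"],  # deliberate loop
    }
    fits = {
        "Great Seattle Fire",
        "Aftermath of the Great Seattle Fire",
        "Seattle in 1890s",
        "1891 in Seattle",
    }
    print(refine("Seattle", toy_tree, lambda name: name in fits))
```

In the toy graph the descent ends at the two more specific categories and the loop back to Seattle is skipped, because Seattle is already in the history.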
I have a couple of ideas on how to improve the performance (parallelised tree crawl, progressive disclosure of suggestions); I will need to see if it can be done easily.
As to the tags, it looks like creating new tags requires some permissions that I don't have. If this script gets widely used I'll ask someone to do it. Regarding the bot flag, it's for bots, and in fact not every bot is allowed to flag its edits per Commons:Bots#Bot_flag. Alaexis (talk) 19:30, 10 February 2026 (UTC)
@Alaexis: I wasn't suggesting a bot flag (which is for bots that run on their own), just a tag. I can easily set up the tag if you will use it. Would "cat-diffusor" be an acceptable tag? And with the description "category refinement suggested by Diffusor" (obviously, we would change that link once it is no longer in your user space). - Jmabel ! talk 00:06, 11 February 2026 (UTC)
That was fast; amazing, thanks for the development. Could you please move the button to the Tools panel instead of adding this red button to the top of category pages? That's also where I can use all the other gadgets I have installed and it doesn't take up space on the category page. Ideally, there would be a setting where the button is located but I don't think people would use it so often that having the button on the category page is needed.
As Jmabel said, it takes very long to load. Would it be possible to see the suggestions before it has finished loading or to make it scan just a few layers?
A frequent reason for long loading times is miscategorization: sometimes the subcategories are quite clearly wrong, sometimes less clearly so. An example of the latter: my first try of the tool was on Category:Beekeeping, but that category has Category:Diseases and disorders of bees->Bee mites, which contains lots and lots of subcategories, so the scan takes very long. Bee mites are just animals, though, and the only things that belong directly in the Diseases and disorders of bees branch are the subcats and files that show bees struggling with bee mites etc., not all files about the bee mites. So instead I created a subcategory for bees struggling with mites, Category:Bees infected with mites, and also included it in the bee mites cat. It contains the existing Category:Bombus with Acari, which was already built with the correct relational approach. It's really great to see somebody helping out with Commons development. Could you consider implementing the requested tool that shows, with one click, the category path explaining why a file appears in the deepcategory wall-of-images view of a category (there's a gadget to open this view with a click)? In this case, if one ran deepcategory:Beekeeping and wondered why there are so many microscopic images of mites, one could click a button and see Beekeeping->Diseases and disorders of bees->Bee mites, the path of categorization (and fix it if appropriate). This would also help with the tool you developed here: one could check the deepcategory view before running it to fix miscategorization that adds in large cat-branches, and one could fall back on the deepcat view if the tool takes too long to load. It would also help the community reduce miscategorizations overall.
It would also be better if the tool somehow first scanned for how long the loading will take and then displayed the estimated overall time.
Sadly, for the cats that the tool worked with, Category:Drone videos and Category:Time-lapse videos it displayed "Analysis complete. No suggestions were generated for any file." I'll try it again elsewhere. Probably, the more difficult parts here are how to infer the suggested categories (here again, I'd suggest adding first the inferrable categories per thread linked earlier and only then checking if one could build suggestions for more cats based on other methods).
I don't see a problem with placing categories at the bottom, btw, since that is the established standard, and files where they aren't placed at the bottom would be good to clean up eventually. Prototyperspective (talk) 00:03, 10 February 2026 (UTC)
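The category-path request in the comment above (seeing Beekeeping->Diseases and disorders of bees->Bee mites) amounts to a shortest-path search in the category graph. Here is a minimal sketch using breadth-first search over an in-memory dict. The dict, the function name, and the loop in the example graph are illustrative assumptions; a real tool would have to fetch subcategories from the wiki.

```python
from collections import deque
from typing import Optional


def category_path(root: str, target: str, subcats: dict[str, list[str]]) -> Optional[list[str]]:
    """Return the shortest chain of categories from root down to target.

    Returns None if target is not reachable from root.
    """
    parents: dict[str, Optional[str]] = {root: None}  # also marks visited nodes
    queue = deque([root])
    while queue:
        current = queue.popleft()
        if current == target:
            path = []
            node: Optional[str] = current
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for child in subcats.get(current, []):
            if child not in parents:
                parents[child] = current
                queue.append(child)
    return None


if __name__ == "__main__":
    toy_graph = {
        "Beekeeping": ["Diseases and disorders of bees"],
        "Diseases and disorders of bees": ["Bee mites"],
        "Bee mites": ["Beekeeping"],  # deliberate loop to show it is harmless
    }
    path = category_path("Beekeeping", "Bee mites", toy_graph)
    print("->".join(path) if path else "not reachable")
```

Breadth-first search is used here because it returns the shortest chain first, which is usually the one a person would want to inspect when hunting for a miscategorization.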
@Prototyperspective: Cats collectively at the bottom of the file is fine, but often (whether entirely advisable or not) there is some logic to the order of the cats within the cat list, e.g. topical cats kept separate from maintenance cats, or several cats that relate to one object in the photo (e.g. a building that doesn't merit a category of its own) grouped together. A lot of users prefer that be maintained, and in this case it should be easy. - Jmabel ! talk 01:43, 10 February 2026 (UTC)Reply
Thanks for the feedback! For some reason Category:Drone videos worked just fine for me. For instance, the video to the right has Drone videos of bodies of water and Drone videos of nature as suggested categories.
Can you re-run it and, if it doesn't work again, send me what you see in the browser console (usually you need to press F12 to get there)?
I'll take a look at the wishlist - I had no idea it existed. Alaexis (talk) 20:17, 10 February 2026 (UTC)Reply
Now it works – thank you very much for moving it to the Tools! The problem was that I had to allow alaexis.workers.dev in the NoScript Firefox addon first (no warning displayed so I didn't notice).
  • I get these errors in the browser console in case it helps somehow: "Content-Security-Policy: (Report-Only policy) The page’s settings would block the loading of a resource (connect-src) at https://publicai-proxy.alaexis.workers.dev/ because it violates the following directive: “default-src 'self' data: blob: https://upload.wikimedia.org https://meta.wi[…]" (4 times the error) but I also get these for one other gadget apparently
  • Could you please also hide the Suggestions buttons when the script has not been run? This is the biggest problem currently imo.
  • When clicking accept, could you make it directly write to the page instead of opening the diff like cat-a-lot does it? You could check the CAL code to see how it's done there and there could also be a setting where one can configure whether or not it should open a new diff tab.
  • When clicking Reject, it closes the Diffusor panel and one even has to run it all again. Instead, please make it move to the next item so one can use this to go through items on a page quickly one after another.
  • One mode of operation would be to go through each item one after another as just described – another would be to glance over the suggestions and click on "View suggestions" where the suggestions seem good. For the latter mode, could you enable seeing the suggested categories directly underneath the files?
  • At least for categories that have unidentified in the title it would be good if the tool checked as if one was in the parent category. So in Category:Drone videos from unidentified countries instead of showing no suggestions, it should show suggestions as if one was in Category:Drone videos by country.
  • For the prior case, maybe the prior scan of the tool at Category:Drone videos could be cached somehow so that it can be used when in a subcategory so it loads quicker.
  • An issue tracker would help with tracking such change requests and ideas and with discussing them individually. Maybe you could set up a github repo or track the issues on phabricator.
Given that this already produced quite good results and there's lots of ways things can be improved over time (via specifying hard-coded rules for suggestions like the ones described in the inferred categories bot work request thread or the rule 'never suggest 'xyz (text)' categories Jmabel hinted at), I think it's very plausible this will get widely used. Thanks again for developing this!! Hopefully you can find time to develop it further and maybe other wiki devs can help out too at some point. Prototyperspective (talk) 00:03, 11 February 2026 (UTC)Reply
I've hidden the "View Suggestions" buttons when the analysis hasn't been performed and also fixed the Reject.
I'd love to connect it to a github repo. Unfortunately I don't know how to do it here. en:wiki has bots like usync that automate this but I understand that they don't work here. Alaexis (talk) 14:52, 11 February 2026 (UTC)Reply
Thanks! I'm not sure I understand the issue with a github/gitlab repo: it was meant mainly as an issue tracker and secondarily also to better enable other devs to help out by allowing pull requests. The code would be copied manually to the script page, so there is no semi-automatic github->commons script deployment. Thus, 'connecting it to a github repo' just consists of adding a link to the github repo to the documentation page (ideally with the note that this is where the issues are tracked). Prototyperspective (talk) 17:26, 11 February 2026 (UTC)
Ah sure https://github.com/alex-o-748/diffusor
I was just saying that on en-wiki en:Wikipedia:USync keeps the script in sync with the repo. Alaexis (talk) 20:18, 11 February 2026 (UTC)
I'm going to guess that "…(text)" categories are almost always going to be bad suggestions. - Jmabel ! talk 18:42, 10 February 2026 (UTC)Reply
@Jmabel, @Prototyperspective, @Chaotic Enby, I've implemented the UI changes you guys suggested, I think that they make a lot of sense.
  1. Moved it to the Tools and moved all the status updates to the right panel
  2. The new category replaces the old one exactly where it was
  3. Detailed edit summary
If it's actively used and if I have time I'll do something with the performance for large categories. Alaexis (talk) 20:43, 10 February 2026 (UTC)Reply
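For readers curious what "replaces the old one exactly where it was" could look like, here is a minimal sketch. It is not Diffusor's actual code; the function name is hypothetical, and it deliberately ignores details such as underscores versus spaces and first-letter case in category names.

```python
import re


def replace_category_in_place(wikitext: str, old: str, new: list[str]) -> str:
    """Swap [[Category:old]] for the new categories at the same position.

    Only the first occurrence is replaced. A sort key after a pipe is dropped.
    """
    pattern = re.compile(
        r"\[\[\s*[Cc]ategory\s*:\s*" + re.escape(old) + r"\s*(\|[^\]]*)?\]\]"
    )
    replacement = "\n".join(f"[[Category:{name}]]" for name in new)
    # A function is used as the replacement so that backslashes in
    # category names are never interpreted as regex escapes.
    return pattern.sub(lambda match: replacement, wikitext, count=1)


if __name__ == "__main__":
    page = "[[Category:Seattle]]\n[[Category:Black and white photographs]]\n"
    print(replace_category_in_place(page, "Seattle", ["Great Seattle Fire", "Seattle in 1890s"]))
```

Keeping the replacement at the original position preserves whatever ordering the uploader gave the category list, for example topical categories kept apart from maintenance categories.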