User talk:Bluerasberry/Wikidata graph split

Important subject. If Wikidata is to become truly useful, then something needs to be done about those limits. For most major applications or types of items, it needs a far higher level of completeness via mass data imports, and these currently seem feasible only if something is done about those technical limitations. Potential applications include Scholia charts about studies on a subject or by an author (maybe at some point altmetrics scores and citation counts can also be queried?), books data, documentary films, software, ingredients, products, companies, and so on. Think especially of where people use databases in their daily lives – isn't that where we'd like Wikidata to come in? Also worth noting is that one can now build SPARQL queries using natural language, so it has become easier to query this vast dataset. Scaling isn't the only difficulty: it also needs people to actually do such imports (including Anna's Archive metadata about books, I think) and to work through the mostly archived/stale bot requests, and it probably needs some way to lock items or most of their properties, because it's not feasible to watch millions of items (alternatively, better patrolling tools). It does seem like genuine innovation and out-of-the-box thinking are needed to solve this problem of the technical limits. --Prototyperspective (talk) 01:08, 14 February 2026 (UTC)

@Prototyperspective: I hear you. For most of these, I think Wikidata's scaling really is the issue, because the datasets are available and for a generation we have been turning away Wikidata editors who would import them.

✅ Scholia profile for main subject, for example https://qlever.scholia.wiki/topic/Q2002254

✅ Scholia profiles for authors, https://qlever.scholia.wiki/author/Q3063122

✅ citations for alt-metrics https://diff.wikimedia.org/2017/04/06/initiative-for-open-citations/

❌ books - despite the existence of library catalogs, Wikidata modeling for books is among the more difficult projects

✅/❌ documentary films or films in general - no open dataset exists. There are IMDb, Letterboxd, and themoviedb.org (TMDB), all of which claim that their datasets are closed and proprietary. Wikidata is already the world's best open movie dataset. For documentaries, there are identified academic datasets for films which are unpublished or self-published and have no major publisher or distributor. For example, there are many student or scholarly documentaries which may be university-published. We can get those; people have tried in Wikidata, but again there is no capacity right now.

✅ software - https://doi.org/10.3897/rio.8.e94771

✅ ingredients

SPARQL query

SELECT ?cocktail ?cocktailLabel (SAMPLE(?recipe) AS ?recipe) (SAMPLE(?image) AS ?image)
WHERE
{
  {
    SELECT ?cocktail ?cocktailLabel (GROUP_CONCAT(DISTINCT ?ingredient; separator=", ") AS ?ingredientsList) (GROUP_CONCAT(DISTINCT ?garnishLabel; separator=", ") AS ?garnishList) (COUNT(DISTINCT ?ingredient) + COUNT(DISTINCT ?garnishLabel) AS ?count)
    WHERE
    {
      ?cocktail wdt:P31/wdt:P279* wd:Q134768;
                p:P186 ?materialStat.
      MINUS { ?materialStat pq:P518/wdt:P279* wd:Q2453629. }
      MINUS { ?materialStat ps:P186 wd:Q488463; pq:P366 wd:Q26876981. }
      MINUS { ?materialStat pq:P366 wd:Q59541. }
      ?materialStat ps:P186/rdfs:label ?materialLabel.
      FILTER(LANG(?materialLabel) = "en").
      BIND(?materialLabel AS ?ingredientSolo).
      OPTIONAL {
        ?materialStat ps:P186/rdfs:label ?materialLabel;
                      pq:P1114 ?quantity.
        FILTER(LANG(?materialLabel) = "en").
        BIND(CONCAT(STR(?quantity), " ", ?materialLabel) AS ?ingredientWithQuantity).
        OPTIONAL {
          ?materialStat pq:P1114 ?quantity;
                        ps:P186/rdfs:label ?materialLabel;
                        pqv:P1114/wikibase:quantityUnit ?unit.
          FILTER(LANG(?materialLabel) = "en").
          FILTER(?unit != wd:Q199).
          ?unit rdfs:label ?unitLabel.
          FILTER(LANG(?unitLabel) = "en").
          BIND(CONCAT(STR(?quantity), " ", IF(?quantity = 1, ?unitLabel, CONCAT(?unitLabel, IF(STRENDS(?unitLabel, "sh"), "es", "s"))), " ", ?materialLabel) AS ?ingredientWithUnit).
        }
      }
      BIND(COALESCE(?ingredientWithUnit, ?ingredientWithQuantity, ?ingredientSolo) AS ?ingredient).
      OPTIONAL {
        ?cocktail p:P186 [ ps:P186 ?garnish; pq:P366 wd:Q59541 ].
        ?garnish rdfs:label ?garnishLabel.
        FILTER(LANG(?garnishLabel) = "en").
      }
      ?cocktail rdfs:label ?cocktailLabel.
      FILTER(LANG(?cocktailLabel) = "en").
    }
    GROUP BY ?cocktail ?cocktailLabel
  }
  BIND(
    IF(REGEX(?ingredientsList, ", .*,"),
       REPLACE(?ingredientsList, "(.*), (.*)", "$1, and $2"),
       REPLACE(?ingredientsList, "(.*), (.*)", "$1 and $2"))
    AS ?ingredients).
  BIND(
    IF(REGEX(?garnishList, ", .*,"),
       REPLACE(?garnishList, "(.*), (.*)", "$1, and $2"),
       REPLACE(?garnishList, "(.*), (.*)", "$1 and $2"))
    AS ?garnishes).
  OPTIONAL {
    ?cocktail p:P186 [ ps:P186 ?glass; pq:P518/wdt:P279* wd:Q2453629 ].
    ?glass rdfs:label ?glassLabel.
    FILTER(LANG(?glassLabel) = "en").
    BIND(IF(REGEX(?glassLabel, "^[AEIOUaeiou]"), "an", "a") AS ?article).
    BIND(CONCAT(" in ", ?article, " ", ?glassLabel) AS ?container).
    BIND(", served" AS ?served).
  }
  OPTIONAL {
    ?cocktail p:P186 [ ps:P186 wd:Q488463; pq:P366 wd:Q26876981 ].
    BIND(" on the rocks" AS ?onTheRocks).
    BIND(", served" AS ?served).
  }
  OPTIONAL {
    FILTER(STRLEN(STR(?garnishes)) > 1).
    BIND(CONCAT(" with ", ?garnishes) AS ?garnish).
    BIND(", served" AS ?served).
  }
  BIND(CONCAT(
    ?ingredients,
    COALESCE(?served, ""),
    COALESCE(?onTheRocks, ""),
    COALESCE(?garnish, ""),
    COALESCE(?container, ""))
    AS ?recipe).
  OPTIONAL { ?cocktail wdt:P18 ?image. }
}
GROUP BY ?cocktail ?cocktailLabel
ORDER BY DESC(MAX(?count))


❌ products, we could, but other things are easier. I think we need a strategy to avoid commercial intrusion until we are established. I was Wikipedian at Consumer Reports and we made products like The Digital Standard, which evaluate the ethical compliance of products. I think Wikidata would be a good place to collect product certifications, warnings, and ethical assessments.

❌ companies, we could, but other things are easier. I do like the idea of profiling cities, and assuming that we profile cities, then also indexing company / nonprofit / civic org / government agency datasets which mesh with a Wikipedia-connected civic data profile, OpenStreetMap, or local government contracts if we get into local government profiling.

I think we are all ready to develop all of these as soon as Wikidata capacity increases, regardless of whether now is the time to do the data import. Bluerasberry (talk) 20:15, 16 February 2026 (UTC)

Well, it would be great if that is the case. The issue with existing datasets is how large they are in total compared to the overall data about papers, and if some imports were declined because of scaling issues, it's not so clear that people will be able to, and actually will, import datasets toward a high level of completeness. That's the reason why I find the Scholia pages currently not useful in practice: one can't know how complete the given dataset is, and usually it's quite incomplete.

Regarding films, there are IMDb and also scrapable media libraries like arte mediathek when it comes to documentaries. If data from IMDb and TMDB can't be imported, then this is a persistent big problem because that's one of the areas where people access data in their daily lives. One application would be for Wikiflix to also show films available for free on YouTube (or public broadcaster media libraries etc.) as proposed here, which I think could then very plausibly make Wikiflix the first heavily-used application of Wikidata outside of Wikipedia and make people learn about Wikidata for the first time (and use it outside of Wikipedia). Without the films data this can't be done. Obviously it also needs more than 0 replies in the thread on the talk page there.

Software isn't useful in practice. Even most of the most notable software doesn't have a Wikidata item. Currently, it can't be used to discover forks or alternatives or software using a certain tech etc. I'm not speaking of what is possible in theory with SPARQL but of practice, where it's taken into account how large the fraction of software with WD items is and which data they have (e.g. there's data about programming language used and license readily available on GitHub but nobody imports these; at least not after my bot work request, similar to most other bot import ideas/requests).
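
To make the software point concrete, here is a minimal sketch of the kind of query that would be useful if the data were there. It assumes the Wikidata identifiers Q7397 (software), P277 (programmed in) and P275 (copyright license), and that the usual wd:, wdt: and rdfs: prefixes are predeclared, as they are on the Wikidata Query Service. It is an illustration only, not a tested coverage check:

```sparql
# Sketch: software items that already state both a programming language and a license.
# Assumed identifiers: Q7397 = software, P277 = programmed in, P275 = copyright license.
SELECT ?software ?softwareLabel ?languageLabel ?licenseLabel
WHERE
{
  # Any instance of software or of one of its subclasses
  ?software wdt:P31/wdt:P279* wd:Q7397;
            wdt:P277 ?language;
            wdt:P275 ?license;
            rdfs:label ?softwareLabel.
  ?language rdfs:label ?languageLabel.
  ?license rdfs:label ?licenseLabel.
  # Keep English labels only, as in the cocktail query above
  FILTER(LANG(?softwareLabel) = "en").
  FILTER(LANG(?languageLabel) = "en").
  FILTER(LANG(?licenseLabel) = "en").
}
LIMIT 100
```

The query only returns items that already carry both statements, so whatever it returns reflects what has been imported so far rather than the software that exists.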

Ingredients is not done at all. Wikidata could step in and enable generic foods to be displayed in the open source app Waistline. Nothing has been done and this related low-participation proposal is stale (not sure to what extent, if any, this could help with the aforementioned or what else would need to be done for that). But ingredients aren't just about common generic foods; it's also the OpenFoodFacts kind of data one gets when scanning, for example, the barcode of a random shampoo in a grocery store.

Anyway, I agree that the technical limitations are a major issue of Wikidata and maybe currently its largest.

Thanks for the edit linked below; I think it's much better now. Prototyperspective (talk) 00:18, 17 February 2026 (UTC)

@Bluerasberry: some feedback above. It doesn't show on the page; maybe something is wrong with the page, I don't know.

Also, I'd add that the article could be shorter; far more readers would read it if it were half the length, and some content could be moved to another place, wrapped into collapsible templates, or become a second part. Additionally, I find texts like "and to me Wikipedia's last-generation janky technology is 💙 cute and endearing" cringey – you can write it like you want, of course, but you asked for feedback, and I'm just saying things like that certainly aren't helping the cause. Prototyperspective (talk) 17:58, 16 February 2026 (UTC)

@Prototyperspective: This time no emoji special:diff/1338593791/1338708372
