Wikidata talk:WikiProject Names

Reducing redundancy

Items for names take up a lot of space in Wikidata. For example, there are just under 600,000 items for surnames. That is only around 0.5% of all items, but the labels on these items which are the same as the English label account for 10% of all labels in Wikidata. The aliases which are the same as the English label account for a third of all aliases.

The size of Wikidata is causing problems, most notably for the query service, which is likely to stop working at some point in the next few years (see Wikidata:SPARQL query service/WDQS backend update) if we continue the way we are.

The developers are working on adding support for using the language code "mul" on labels (phab:T285156), designed to be used for things like this instead of copying the same label to hundreds of languages (and I hope they will also work on adding some simple dynamically generated descriptions after that - phab:T303677).

I think we can reduce the amount of redundancy on items for names before then though:

  • We could remove labels for country variants of a language, if they're the same as the first fallback language, because the fallback language is still the same language/script. This would remove at least 5 million redundant labels.
  • We could do the same for descriptions, for the same reason. This would remove at least 5 million redundant descriptions.
  • We could remove aliases which match another label and are in the wrong script, because they are not needed for searching and are entered under the wrong language anyway. It's hard to calculate using a query, but I think this would remove at least 50 million redundant aliases.
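
As a rough illustration of the first two bullets, the check for a redundant country-variant label or description could look like the sketch below. The `FIRST_FALLBACK` map and the function name are hypothetical and hand-written for this example; a real bot would need to take the fallback chains from the MediaWiki language configuration.

```python
from typing import Dict, List

# Hypothetical, hand-written map for illustration only: each country variant
# points to the first language it falls back to. A real bot would read the
# fallback chains from MediaWiki instead of hard-coding them.
FIRST_FALLBACK: Dict[str, str] = {
    "en-gb": "en",
    "en-ca": "en",
    "de-at": "de",
    "de-ch": "de",
    "pt-br": "pt",
}


def redundant_variant_codes(values: Dict[str, str]) -> List[str]:
    """Return variant codes whose value equals the first fallback's value.

    `values` maps language code -> label (or description) for one item.
    """
    redundant = []
    for code, value in values.items():
        fallback = FIRST_FALLBACK.get(code)
        # Only redundant if the fallback language exists and is identical.
        if fallback is not None and values.get(fallback) == value:
            redundant.append(code)
    return sorted(redundant)


labels = {
    "en": "Smith",
    "en-gb": "Smith",
    "en-ca": "Smith",
    "de": "Smith",
    "de-at": "Schmidt",  # differs from "de", so it is kept
    "pt-br": "Smith",    # no "pt" label present, so it is kept
}
print(redundant_variant_codes(labels))
```

The same function works for descriptions, since it only compares values per language code.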

If people agree, I should be able to make a bot to do this.

- Nikki (talk) 21:24, 29 December 2023 (UTC)

I usually never think about the size of Wikidata, but you're right that at this scale it has to be considered. I am even okay with letting a property function as a description, such as P31 being name (which would automatically give the description "name" unless it's overridden by something else); that would also allow removing descriptions in many languages at once. Also, if something has one name that applies to many languages, could there be a way of combining them?
Anyway, I support this effort and see it as vital to the sustainability of open data. Egezort (talk) 22:32, 29 December 2023 (UTC)
Thanks for the statistics. "We could remove aliases which match another label and are in the wrong script, because they are not needed for searching and are entered under the wrong language anyway. It's hard to calculate using a query, but I think this would remove at least 50 million redundant aliases." seems like a good idea to me.
The second paragraph seems to be based on fear. Wikidata could be improved to scale horizontally to avoid the bottlenecks imposed by a traditional MariaDB RDBMS, and with additional splitting of the WDQS, Blazegraph is good enough for a long time to come (at least 10 years).
Eventually QLever or similarly optimized backends could replace Blazegraph. Additionally, a NoSQL database could be provided by the WMDE team at any time as a complement to the REST API, which could enable users to run efficient MapReduce queries on the whole graph, similar to what we do in Quarry on the SQL RDBMS. --So9q (talk) 07:46, 24 September 2024 (UTC) So9q (talk) 08:00, 24 September 2024 (UTC)

Double given name

I met double given name (Q1243157) for the first time today.

The context was the artist (William) Francis Marshall (Q21459938), with given name (P735) = William Francis (Q104831048) coded by @Arroser: in 2021.

This seems to me quite wrong. IMO, in English at least, William Francis (Q104831048) is just a combination of two first names, not any kind of joint name; and (at least in English) even if somebody is habitually addressed by two first names, IMO (apart from a very few exceptions) those names would not be regarded as a joint or compound first name unless they were hyphenated.

Looking at query https://w.wiki/8hgg it seems there are quite a lot of these.

I see William Francis (Q104831048) was created by User:Moebeus in 2021 and has a Commons category (since 2016). Even so, I believe it should be deleted as not a real thing; along with almost all other English examples of this.

Do others agree? Jheald (talk) 12:47, 2 January 2024 (UTC)

I no longer edit names (except for adding missing ones when I need them). I appreciate the ping; if you want to delete any of the ones I've created, that's okay with me. Moebeus (talk) 15:04, 2 January 2024 (UTC)
@Jheald I am not sure the double given name item is necessary, but joint names which are conventionally spelled with a space in English and not a hyphen are relatively common. Punjabi first names often consist of a first part followed by an honorific, gendered, or tribal suffix; for example in the name Satwant Kaur (Q113570497) the Kaur part is what makes it a female name. I don't know what is included as "English names" here, as most names used by English speakers are derived from other languages (Francis from Latin for example), but there are a number of very common Hebrew-origin names spelled with a space in English as well, such as Mary Anne and Anne Marie. I have no idea about William Francis specifically, but spaces on their own should not be treated as a reason for considering a single name to be two names. عُثمان (talk) 20:01, 2 January 2024 (UTC)
Moebeus has created possibly hundreds of non-existent double given names on Wikidata (and another user on Commons, named JuTa, has done the same over there). As we are talking about saving space lower down on the page, I think it would be a good idea to revive this subject. Many of these "double given names" need deleting. StarTrekker (talk) 21:21, 24 September 2024 (UTC)

Possibly related: today I encountered Alexander Aleksandrovich (Q124888393), which is formally a double name but complete nonsense: it consists of a (real) given name plus a patronymic (which should be modelled, if at all, separately, with patronym or matronym (P5056)). I believe such "double names" should be nuked. --Infovarius (talk) 19:05, 25 September 2024 (UTC)

Missing first names

Hi, I checked the gender of mayors against the gender of their first names. It would be helpful if you could add missing names from de:Benutzer:Herzi Pinki/Von Frauenern und Männerinnen, e.g. all the names marked as undefined in Wikidata. I suspect names like Róbertné to be the Hungarian female version of Róbert etc., but I could not find a source for that assumption. Best, --Herzi Pinki (talk) 15:38, 16 June 2024 (UTC)

Kovács Róbertné means the wife of Kovács Róbert: in Hungary, it’s common that the wife takes the full name of her husband when they get married, Kis Júlia becoming Kovács Róbertné. However, her given name remains Júlia (Q19851095), so I don’t think the statement given name (P735) = Róbertné would be accurate. Unfortunately, it’s impossible to determine her given name only from the official name (this is also a problem in real-life situations, one doesn’t know how to address these -nés). So while you can assume sex or gender (P21) = female (Q6581072) (which, of course, is true only if the person is heterosexual), I don’t think you should add any given name (P735) statements.
By the way, Hungarian laws don’t allow gender-neutral given names, so when your subpage says that the mayor of Abádszalók (Q336820), Gyula Balogh, is neutral, it’s in fact clearly a “he”. —Tacsipacsi (talk) 20:30, 16 June 2024 (UTC)
Thanks. I handle Gyula as neutral, as WD is undecided: Gyula (Q9317185) vs. Gyula (Q124001429). Maybe this is a modelling flaw? Or can it be used for both genders outside of Hungary, for persons that never want to enter Hungary? I just said that the given name of Gyula Balogh is neutral, not him as a person. --Herzi Pinki (talk) 07:23, 17 June 2024 (UTC)
I think the female Gyula (Q124001429) is a result of a data error. It’s used only on Gyula Kajari (Q123423697), a person born in Ősi (Q383063), Hungary (so it’s not about inside/outside Hungary), and while the item states that Kajári was a female, and has a reference for that, the reference doesn’t seem to highlight in any way why it makes this surprising statement. While (according to hu:Magyarországon anyakönyvezhető utónevek listája) the rule dictating the strict distinction between male and female names exists only since 1965 (Kajári was born in 1926), Gyula is a pretty well-known male name (originating from the Old Hungarian title of Gyula (Q933316)), so I find it unlikely that a girl is given this name (again, a nonbinary person may be a possible explanation, but I’d expect that to be explicitly stated). Maybe just some librarian pushed the wrong button.
As for “persons that never want to enter Hungary”: I’m pretty sure nothing prohibits a male Boris (Q666112) (in Hungary, only Boris (Q61356723) is allowed) or a female Robin (Q1158139) (in Hungary, it’s classified as a male name) from entering the country: such rules should only apply to citizens. —Tacsipacsi (talk) 22:02, 17 June 2024 (UTC)
https://www.kozterkep.hu/9215/kajari-gyula-emlektabla gives evidence that this Gyula Kajari (Q123423697) is male. Do you have any idea how to fix this in the surrounding authority data and get rid of Gyula (Q124001429) here in WD? Best, --Herzi Pinki (talk) 11:11, 19 June 2024 (UTC)

Should fictional names have their own items?

For example, Cogita (Q116790893). No real person has this name, and presumably none ever will, so there will presumably only ever be one fictional character with it. That doesn't justify a separate item in my opinion. Has this ever come up before? I know this is far from the only one. —Xezbeth (talk) 14:14, 29 June 2024 (UTC)

@Xezbeth: I wonder about that too. The name "Sheev" has only ever been used by the character Palpatine in Star Wars, so the same problem applies there. StarTrekker (talk) 15:12, 26 August 2024 (UTC)
I believe that we should not create items for any names which have <2 bearers, fictional or not. At the same time I would allow a specific class "fictional name". --Infovarius (talk) 14:14, 27 August 2024 (UTC)

Laman and laman

Laman (Q96656279) and laman (Q96651329) with a lower case L both exist. I'm a little unsure about this; it doesn't seem the latter is actually used by anyone on Wikidata, and the URLs used for the item both seem to be dead. I'm not familiar with these ethnic groups and their names, or even what script they use. StarTrekker (talk) 22:52, 4 September 2024 (UTC)

@MonicaMu: who created both of them (I'm assuming by mistake or unknowingly, but better to ask and check before merging). Cheers, VIGNERON (talk) 19:07, 13 September 2024 (UTC)

Mul labels - proposal of massive addition

Ash Crow
Dereckson
Harmonia Amanda
Hsarrazin
Jura
Чаховіч Уладзіслаў
Joxemai
Place Clichy
Branthecan
Azertus
Jon Harald Søby
PKM
Pmt
Sight Contamination
MaksOttoVonStirlitz
BeatrixBelibaste
Moebeus
Dcflyer
Looniverse
Aya Reyad
Infovarius
Tris T7
Klaas 'Z4us' van B. V
Deborahjay
Bruno Biondi
ZI Jony
Laddo
Da Dapper Don
Data Gamer
Luca favorido
The Sir of Data Analytics
Skim
E4024
JhowieNitnek
Envlh
Susanna Giaccai
Epìdosis
Aluxosm
Dnshitobu
Ruky Wunpini
Balû
★Trekker

Notified participants of WikiProject Names

Following up from #Reducing redundancy (@Nikki, Egezort:): "mul" labels are now live (cf. Help:Default values for labels and aliases). Presently we have 578 given names and 166 family names with "mul" labels. In order to start adding them massively, I have prepared two queries; to be cautious, I considered only the items with a single native label (P1705) value:

My proposal would be to add "mul" to them and remove all redundant labels; if possible, it should be done by a bot that edits each of these items only once, otherwise the number of edits would be needlessly high. Opinions? Thanks, --Epìdosis 15:19, 13 September 2024 (UTC) P.S. also
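
To make the "edit each item only once" idea concrete, here is a minimal sketch of how a bot might plan one combined change per item. The `LabelEdit` structure and the function name are hypothetical; this is not the Wikibase API payload format, only a description of what a single edit would contain.

```python
from typing import Dict, List, TypedDict


class LabelEdit(TypedDict):
    """Hypothetical description of one combined edit for one item."""
    set_mul: str
    remove: List[str]


def plan_single_edit(labels: Dict[str, str], native_label: str) -> LabelEdit:
    """Plan one edit: set "mul" to the native label and drop identical labels.

    Labels that differ from the native label (e.g. transliterations into
    another script) are left untouched.
    """
    remove = sorted(
        code for code, value in labels.items()
        if code != "mul" and value == native_label
    )
    return {"set_mul": native_label, "remove": remove}


labels = {"en": "Mike", "fr": "Mike", "ru": "Майк", "ja": "マイク"}
plan = plan_single_edit(labels, native_label="Mike")
print(plan)
```

Computing the addition and all removals together is what allows them to be submitted as one edit instead of one edit per language.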

We have a new concern, brought to our attention by @Pallor: deleting the existing labels in a user's language drops the item in the search rankings (making it less discoverable).
Should we bring this to the development team's attention? And is there a solution we can propose? Iamcarbon (talk) 23:24, 4 October 2024 (UTC)
Epìdosis VIGNERON Peter F. Patel-Schneider Madamebiblio Zblace So9q Iamcarbon Difool

Notified participants of WikiProject Redundancy

This is an ideal use case for multi-language (mul) labels. However, through my testing, I've encountered several challenges that could arise during a broader rollout:
I have been adding mul labels to existing names (both family and given) and tracking any new labels that appear. While many bots have adapted, there's still work to be done to make them fully mul-aware. Additionally, although the community is gradually becoming familiar with mul labels, new users frequently re-add labels that were previously removed.
In testing the removal of certain label sets, many of my edits were reverted for two main reasons:
1) The removals were seen as unnecessary.
2) It broke existing queries that hadn't been updated to accommodate mul.
With these points in mind, I'm unsure whether we should proceed with a large-scale rollout affecting hundreds of thousands of items or take a slower, staged approach. The latter would give the community more time to adapt and understand the reasoning and benefits behind the change.
Inevitably, some user queries will break after the update, bots may continue re-adding labels, and we may need to present a stronger case for the long-term benefits of this shift. Iamcarbon (talk) 16:50, 13 September 2024 (UTC)
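
For tool and query authors wondering what "mul-aware" means in practice, a minimal sketch follows. It assumes "mul" is consulted after the requested language and that language's own fallbacks; the function name and the `fallbacks` parameter are hypothetical, and the exact position of "mul" in the real fallback chain should be checked against Help:Default values for labels and aliases.

```python
from typing import Dict, Optional, Sequence


def display_label(labels: Dict[str, str], lang: str,
                  fallbacks: Sequence[str] = ()) -> Optional[str]:
    """Return the label to show for `lang`, falling back to "mul" last.

    Assumption: the lookup order is the requested language, then its
    fallback languages, then "mul".
    """
    for code in (lang, *fallbacks, "mul"):
        if code in labels:
            return labels[code]
    return None


labels = {"mul": "Jack", "ru": "Джек"}

# A lookup that is not mul-aware finds nothing once the "en" label is removed.
print(labels.get("en"))

# A mul-aware lookup still resolves a label for English readers.
print(display_label(labels, "en"))
print(display_label(labels, "ru"))
```

A query or tool that only asks for the label in one specific language behaves like the first lookup, which is why it appears to "break" when redundant labels are removed.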
@Epìdosis: I fully agree to transition to "mul".
That said, Iamcarbon raises two good points. On the first, the objection is just plain wrong: the removal is highly necessary. On the second, queries (and other tools) need to be fixed/adapted (do we have any examples of such queries that I could look at?). For both, I guess it's mainly a question of time, and a staged approach seems to be the best way to go. Given that the Wikidata SQL tables are also being overloaded, proceeding slowly is also a good thing.
My proposal is to wait a bit (a week? so people can react to this first message) and to start with a first batch of 1000 first names and 1000 last names (including the items on Wikidata:WikiProject_Names#Sample_items).
Cheers, VIGNERON (talk) 19:03, 13 September 2024 (UTC)
Is "mul" actually live yet? ArthurPSmith (talk) 19:42, 13 September 2024 (UTC)
Yes, "mul" is live (see e.g. Mike (Q361309)). That said, the points of @Iamcarbon: are perfectly reasonable and thus I support the plan proposed by @VIGNERON:. Epìdosis 20:38, 13 September 2024 (UTC)
Does anyone have any reservations against doing a trial run removing labels on 100 family names today?
Here's an example edit:
https://www.wikidata.org/w/index.php?title=Q16869697&diff=prev&oldid=2247763563 Iamcarbon (talk) 21:39, 13 September 2024 (UTC)
Quick update: I went ahead and removed labels from 50 names to gather some more feedback.
So far, two of these changes have been reverted, one with the comment "Bad move. Screws up other tools".
I'm going to continue limited testing by removing label sets on another 50 items, hoping that we can get more feedback and more details on what we may be breaking for users. Iamcarbon (talk) 00:15, 14 September 2024 (UTC)
One thing that might need looking at is constraint violations - I don't know how many properties would be affected, but there are some that raise a violation if an item has a certain property (e.g. an identifier) and doesn't have a label in a specific language. DrThneed (talk) 00:45, 17 September 2024 (UTC)
As an example of this, see ResearchGate profile ID (P2038) on Deborah O'Connor (Q130311610). DrThneed (talk) 00:58, 17 September 2024 (UTC)
I'm unsure if we have any proposals to delete labels on humans yet. I imagine many things will need to be updated when we do. Iamcarbon (talk) 02:14, 3 October 2024 (UTC)
Do you have example items for a given name and a surname that are used in some other items and have been treated the way you proposed, so that I could see how this interacts with other items? William Graham (talk) 20:59, 13 September 2024 (UTC)
@William Graham: you can consider this (here I had to use two edits, one for removals and one for "mul" addition, but my intention is that the bot operator will edit each item just once). Epìdosis 21:06, 13 September 2024 (UTC)
@Epìdosis How does this edit look:
https://www.wikidata.org/w/index.php?title=Q16869697&diff=prev&oldid=2247763563 Iamcarbon (talk) 21:37, 13 September 2024 (UTC)
Reflecting this evening, I have discovered an issue that could potentially become disruptive if we massively remove non-"mul" labels before it is solved: see phab:T374745, which I have now opened. Basically, if I remove a label because it becomes a placeholder label generated by "mul", the system doesn't prevent users from recreating the concept in another item: i.e. "Jack" "given name" could be present at the same time in two items, once as "Jack" (mul label) and "given name" (English description) and once as "Jack" (English label) and "given name" (English description). I think this issue should be solved before we start massive cleaning of non-"mul" labels. Epìdosis 21:55, 13 September 2024 (UTC)
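
The duplicate scenario described above can be sketched as follows. This is only an illustration of the check one would want, not how Wikibase implements its uniqueness constraint; the item IDs, data layout and function names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory layout: {"labels": {...}, "descriptions": {...}}
Item = Dict[str, Dict[str, str]]


def effective_pair(item: Item, lang: str) -> Optional[Tuple[str, str]]:
    """Return the (label, description) a reader of `lang` effectively sees.

    The label falls back to "mul" when no label exists in `lang`.
    """
    label = item["labels"].get(lang) or item["labels"].get("mul")
    description = item["descriptions"].get(lang)
    if label is None or description is None:
        return None
    return (label, description)


def find_collisions(items: Dict[str, Item], lang: str) -> List[Tuple[str, ...]]:
    """Group item IDs that share the same effective label and description."""
    seen: Dict[Tuple[str, str], List[str]] = {}
    for item_id, item in items.items():
        pair = effective_pair(item, lang)
        if pair is not None:
            seen.setdefault(pair, []).append(item_id)
    return [tuple(ids) for ids in seen.values() if len(ids) > 1]


items = {
    # Item whose English label was removed in favour of "mul".
    "item_A": {"labels": {"mul": "Jack"}, "descriptions": {"en": "given name"}},
    # Newly created duplicate with an explicit English label.
    "item_B": {"labels": {"en": "Jack"}, "descriptions": {"en": "given name"}},
}
print(find_collisions(items, "en"))
```

A check that compares only explicit English labels would not see these two items as duplicates, which is the gap described in the comment.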
I have English, Japanese and mul enabled via Babel. With items like Isabella (Q16290308), I now see the Japanese label instead of the mul label in Special:Contributions, recent changes and related changes. Is there any way around this? —Xezbeth (talk) 05:10, 16 September 2024 (UTC)
Does moving mul above Japanese in your Babel box change the label it chooses by default? Iamcarbon (talk) 17:52, 16 September 2024 (UTC)
Also QuickNames doesn't work with mul labels, and it now allows you to create duplicates as it's checking for the en label that is no longer there, so that's a giant mess imminent. I really don't think en labels should be removed yet. —Xezbeth (talk) 05:39, 16 September 2024 (UTC)
Does anyone know who maintains QuickNames so we can reach out to let them know about the upcoming changes? Iamcarbon (talk) 17:33, 16 September 2024 (UTC)
It seems @Bargioni already updated it to support mul. Fantastic! So9q (talk) 16:06, 20 September 2024 (UTC)
  • Comment While I'm sure the massive deletion of labels in favor of "mul" has some benefits, I've encountered some significant problems (which is to be expected, because God knows what goes on in the extremes of Wikidata at any point in time). Now perhaps I'm the only person (not a bot) who uses the extremely useful New Q5 tool for creating and editing items about humans. However, if a name (such as James (Q677191) or Steven (Q17501985) or Smith (Q1158446)) has NO English label (or any "duplicate" language label), then it is now ignored by the tool and not incorporated into QuickStatements, resulting in incomplete entries (a person named "James Patrick Steven Smith" would be rendered solely as given name = "Patrick", series ordinal 2) (and I'm sure Patrick (Q18002623) is soon to be similarly affected). Thus, for every item with labels reduced to "mul", it requires much more manual effort to add each given name property. -Animalparty (talk) 23:59, 13 September 2024 (UTC)
    This is helpful! I went ahead and opened an issue for the new-q5 tool, hoping they can make the tool mul-aware. https://github.com/VDK/new-q5/issues/2 Iamcarbon (talk) 00:37, 14 September 2024 (UTC)
Is it possible to place mul first, instead of having other languages first?
I don't know how to add a mul entry. Is nameGuzzler adapted to this? Jérémy-Günther-Heinz Jähnick (talk) 09:20, 16 September 2024 (UTC)
Strongly support this idea. This is much better than copying the name to every other language using nameGuzzler even when nobody who speaks that language uses this name. The millions of labels for names have flooded Wikidata and made it impossible to have accurate label statistics for small languages. Almost all the labels are added with nameGuzzler or bots. They are absolutely irrelevant for the language; for example, why should an American name have a Belarusian label? Midleading (talk) 17:29, 16 September 2024 (UTC)
Another source of multiple labels currently is Mix'n'match - it creates labels in I think around nine languages when you create a new item through a catalogue. Many catalogues are for humans, who would presumably be better served by a mul label as default. DrThneed (talk) 21:26, 16 September 2024 (UTC)
 Support "mul" is a good step forward even if some tools need to be updated. ChristianKl 21:40, 23 September 2024 (UTC)
Support adding mul values, but please don't remove values yet until the full consequences of that have been understood using test items. Thanks. Mike Peel (talk) 10:52, 8 October 2024 (UTC)
I suggest removing labels in small languages that don't have an active Wikipedia yet. This helps fix phab:T333765, and also saves some space to be used by mul. Labels in such languages should only be added by native speakers rather than by bots. Midleading (talk) 09:21, 14 October 2024 (UTC)
A new issue has been brought to our attention: some editors expect given and family names to appear first in search results. Removing labels and aliases may push these items down in the search results.
There was some discussion from WMDE about giving "mul" labels a score boost in a closed issue, but this would be unlikely to solve the problem, as human names will soon have default labels too.
We could propose to WMDE that all given and family names receive a boost, but it is also unclear if this is a good thing, or if they will consider this change.
It's unclear if there are any other solutions here. Iamcarbon (talk) 19:47, 25 October 2024 (UTC)

Scripts and mul

I suppose that User:Jitrixis/nameGuzzler.js should be blocked then? And what about Wikidata:Namescript? Infovarius (talk) 18:47, 25 September 2024 (UTC)

nameGuzzler is unmaintained; we should stop using it. And we should replace all labels added by nameGuzzler with mul labels. Midleading (talk) 15:09, 6 November 2024 (UTC)

Bot to add mul labels & remove aliases

It looks like the plan to remove existing labels is going to be a long-term effort, while various Phab issues are worked through.

While these issues are being worked on, I think we can continue to make progress with two tasks:

1) Adding mul to existing items that already have a native label in the mul language, and continuing to engage with bot and gadget owners to make them mul-aware. When items have a mul label, new duplicated labels are less likely to be added.
2) Removing existing duplicated aliases that match the mul value. Items with a mul label are discoverable in all languages via search, and these duplicates are no longer needed.
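
As a rough sketch of the second task, the selection of aliases to remove might look like this. The function name and data layout are hypothetical and do not describe how CarbonBot is actually implemented; only exact matches with the mul label are selected.

```python
from typing import Dict, List


def aliases_to_remove(mul_label: str,
                      aliases: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return, per language, the aliases that exactly duplicate the mul label.

    Aliases stored under "mul" itself and aliases that differ from the mul
    label (e.g. forms in another script) are left alone.
    """
    removals: Dict[str, List[str]] = {}
    for code, values in aliases.items():
        if code == "mul":
            continue
        matches = [value for value in values if value == mul_label]
        if matches:
            removals[code] = matches
    return removals


aliases = {
    "ru": ["Smith", "Смит"],  # only the Latin-script duplicate is selected
    "ja": ["Smith"],
    "el": ["Σμιθ"],           # nothing to remove
}
print(aliases_to_remove("Smith", aliases))
```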

This gets us halfway to our goal, while we continue to work on existing issues.

A new bot has been proposed for these new tasks, where feedback is welcome.

https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/CarbonBot Iamcarbon (talk) 19:12, 16 October 2024 (UTC)