Tuesday 23 May 2017: Metadata Merge
A couple of days ago, Wikidata item Q30000000 was created. That was notionally the 30 millionth entry in Wikidata, the Wikimedia project that serves as Wikipedia’s structured database. And, fittingly, it was for a scientific paper, published in 2002 in Molecular Endocrinology.
I say that was appropriate, because today in Vienna the WikiCite conference opens. It celebrates, and is also intended to advance, the creation on Wikidata of a substantial open bibliography. With around half a million papers catalogued so far, the job is nothing like finished, even in the restricted area of scientific research publications. On the other hand, it is well started, and Wikidata is capacious enough to allow it room to grow.
So this is all about metadata: in other words, data that speaks of other data. When “bulk metadata collection” was in the news recently, it meant data about phone calls, for instance who called whom, not what was said. Metadata and collections go together, and there is a large overlap between metadata as an area and more traditional cataloguing.
What is happening on Wikidata is the accumulation of data about scientific papers, but not only those. OCLC, which runs the WorldCat catalogue of books in tens of thousands of libraries, has recently announced a metadata donation to Wikimedia. That follows its 2015 decision that VIAF, its authority control site, would crosslink closely with Wikidata. These are global initiatives in the library world, and operate on a large scale. One doesn’t need the exact numbers, nor to be able to follow exactly how the data is flowing, to understand that tracking down a reference online, and putting it to use, is becoming easier.
My own stake in this area comes via ContentMine, and its project to place scientific facts into Wikidata. In science, the difference between a fact referenced to the literature and the other kind is major. It is probably the case that much bibliographic and citation metadata is going to end up on Wikidata. WikiCite 2017 is dedicated both to that direction and to referencing on Wikipedia. Now more than ever, what is written online needs to be backed up solidly.
Much remains to be done, though, and a slick pressing of buttons will not yet merge the world’s bibliographic metadata, which largely remains in information silos. According to Wikipedia, “The collections of the Library of Congress include more than 32 million cataloged books and other print materials in 470 languages”. It sounds like some work would be involved. Further, as I mentioned last time, authors can be troublesome. A 2015 paper measuring the mass of the Higgs boson set a record with over 5,000 author credits. We ain’t seen nothing yet.