Tuesday 6 June 2017: Extract, Transform, Load
Down to business, then. "Extract, transform, load" is a decades-old catchphrase in the data world. It could be something as simple as taking a list of dates in the style “June 6, 2017”, usual in the USA, and changing them to “6 June 2017”, the common way in the UK. Read a date (extract). Take out the comma and swap the two parts (transform). Save it somewhere else (load). Looked at broadly, much work on Wikipedia follows the same pattern: extract factual information from a source, write it up in house style with the reference, save it in a Wikipedia article.
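In code, the date conversion is a few lines' work. A minimal Python sketch (the function name is my own, nothing standard):

```python
from datetime import datetime

def us_to_uk(date_text: str) -> str:
    """Turn a US-style date like 'June 6, 2017' into UK-style '6 June 2017'."""
    # Extract: parse the US-format string into a date object.
    parsed = datetime.strptime(date_text, "%B %d, %Y")
    # Transform: render it in day-month-year order, no leading zero, no comma.
    return f"{parsed.day} {parsed.strftime('%B %Y')}"

print(us_to_uk("June 6, 2017"))  # -> 6 June 2017
```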
More grandly, WikiFactMine, the ContentMine project I’m working on, wants to apply this paradigm to the whole scientific literature. How, exactly? Good question, but let’s work at it from the far end.
If you have been paying attention to these blogs, you won’t find it hard to guess that “load” should mean “import into Wikidata”. Just so. But what about “transform”? The reply “transform into Wikidata format” is not the answer the examiner is looking for. Faced with a scientific paper, it is not so much an answer as a way of begging the question. What exactly are we to extract? And how do we recognise such content?
So this is where we get into deeper water, of course. The right answer could be put in a few ways, but one concise term is “triples”. Wikidata is an example of a triplestore, and the extraction issue comes down to finding the right sort of factual “triples”. That’s the nub of the project.
An example of a triple, such as may be found in Wikidata, is that the disease erythropoietic protoporphyria has a genetic association with FECH, a human gene. It is a triple because a disease (1) is related (2) to a gene, and the relationship (3) has its definite, technical name. The typical Wikidata item involved has much more, including links to Wikipedia articles on the disease in six languages. For our purposes, the important thing is that the relationship has the code P2293. There are currently about 4,000 such relationships, called “properties”, available to use on Wikidata, with additions most days.
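As a sanity check, that property code is enough to pull such triples straight back out of Wikidata. A minimal Python sketch against the public SPARQL endpoint (the User-Agent string is an illustrative placeholder, and note that P2293 statements can sit on either the disease item or the gene item):

```python
import requests

# Fetch a few triples that use the "genetic association" property (P2293).
# Variable names follow the disease-to-gene direction of the example above.
QUERY = """
SELECT ?diseaseLabel ?geneLabel WHERE {
  ?disease wdt:P2293 ?gene .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "etl-sketch/0.1 (example)"},  # placeholder
)
for row in response.json()["results"]["bindings"]:
    print(row["diseaseLabel"]["value"], "--P2293-->", row["geneLabel"]["value"])
```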
To recap: for the WikiFactMine project, we aim to extract from the scientific literature facts that (a) can be expressed as triples, such that (b) the relationship comes from the master list of properties on Wikidata. The two things related also ought to be in Wikidata, but items for them can be created with much less ado than a new property, if need be. Then the codes for the three parts of the triple tell us exactly where to load the fact within Wikidata.
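With all three codes in hand, “load” becomes mechanical. A hedged sketch, assuming the tab-separated item/property/value line format used by batch-upload tools such as QuickStatements; the Q-identifiers here are placeholders, not the real items for the disease and gene above:

```python
# Placeholder identifiers, for illustration only.
triples = [
    ("Q_DISEASE", "P2293", "Q_GENE"),  # disease --genetic association--> gene
]

def to_upload_lines(triples):
    """Render (item, property, value) triples as tab-separated lines,
    in the style of a batch-upload tool such as QuickStatements."""
    return "\n".join("\t".join(triple) for triple in triples)

print(to_upload_lines(triples))
```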
For a human operator, the main requirement for such work is facility with Wikidata’s property list. Fully automating the process still looks like a serious problem, though: hence my interest in machine learning. Current thinking is to give the human plenty of automated support, while allowing the operator the last word: a semi-automated approach, therefore.
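What might that support look like, in miniature? A purely illustrative sketch, with hypothetical candidate data: an upstream matcher proposes triples, and the operator confirms or rejects each one:

```python
def review(candidates):
    """Human-in-the-loop filter: show each machine-proposed triple and
    keep only those the operator confirms. Purely illustrative."""
    accepted = []
    for subject, prop, value in candidates:
        answer = input(f"Assert {subject} --{prop}--> {value}? [y/N] ")
        if answer.strip().lower() == "y":
            accepted.append((subject, prop, value))
    return accepted

# Hypothetical output of an upstream text-mining step.
candidates = [("erythropoietic protoporphyria", "P2293", "FECH")]
print(review(candidates))
```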
Simply put, machines are not yet up to reading scientific literature with comprehension. Understanding remains a black art, and we can’t yet replace it by a black box. Wikipedia tells me that “nominal horsepower gives a convenient way of rating traction engines”. Makes me wonder how we’ll rate extraction engines, when we have them.