Tuesday 20 June 2017: Can You Recall Precisely?

It does seem a while now since I started the ContentMine job. It’s two months in fact, and well out of the rut of my old life. I was remembering a conversation, from the first days, in which I said that WikiFactMine was more like panning for gold than filling trucks with anthracite. My line manager gave a kind of secret smile. How was I supposed to guess that he had an actual mining background?

[IMAGE = Panning for gold]

The extractive metaphor really is of use for engaging with data, though. Think about coal dug up from deep mines along with stony spoil. Think about just any kind of ore. Processes going under the general name of screening separate out what is valuable from the waste.

That all suggests a modification of the catchphrase I introduced a couple of weeks ago, to “extract, screen, load”. Then, to go back to coal, consider the instruction to miners: “dig out all the coal in the mine, and nothing else!” As they say, it’s not going to happen.

The physical example is quite intuitive. The data counterpart can be confusing, or counterintuitive. The instruction to extract only what is wanted goes under the name of “precision”. The precision of data extraction is the percentage of the extracted data that is actually usable. The counterpart is called “recall”, and is the percentage of the possible useful content that a given process actually finds. This term is more obviously jargon. See Wikipedia on precision and recall.
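The two definitions can be sketched in a few lines of code. This is a toy illustration with made-up item names, not anything from the WikiFactMine pipeline: the “relevant” set is everything worth finding, the “extracted” set is what a process actually returned.

```python
# Toy example (assumed names): precision and recall of a data-extraction run.

relevant = {"gold1", "gold2", "gold3", "gold4"}      # everything worth finding
extracted = {"gold1", "gold2", "spoil1", "spoil2"}   # what the process returned

true_positives = extracted & relevant                # usable items actually found

precision = len(true_positives) / len(extracted)     # share of output that is usable
recall = len(true_positives) / len(relevant)         # share of useful content found

print(precision)  # 0.5
print(recall)     # 0.5
```

In the mining metaphor: precision is how clean the trucks are, recall is how much of the coal in the seam ends up in them.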

The common experience of using a search engine can bring this sort of language into focus. One doesn’t expect every hit for a search term to be relevant, in most cases. So precision is not the leading criterion. One does hope for decent recall, in other words that search will often be successful in finding the information that is out there on the Web. The reasons that search engine recall might be poor are a whole discussion in themselves, with “faults” in several different places.

Now, irrelevant hits are often called “false positives”. The language of false positives and false negatives originates in medical statistics. It really is equivalent, though a different way of looking at the ore and spoil you are uncovering.
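The two vocabularies line up exactly, which a short sketch can make explicit. The counts here are invented for illustration: true positives are the ore correctly extracted, false positives the spoil that got into the truck, false negatives the ore left in the ground.

```python
# Sketch (assumed counts): precision and recall restated in the
# medical-statistics vocabulary of false positives and false negatives.

tp = 2   # relevant items correctly extracted (ore in the truck)
fp = 2   # irrelevant hits wrongly extracted (spoil in the truck)
fn = 2   # relevant items the process missed (ore left behind)

precision = tp / (tp + fp)   # low precision means many false positives
recall = tp / (tp + fn)      # low recall means many false negatives

print(precision, recall)  # 0.5 0.5
```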

It is an interesting observation from the machine learning world that “domain experts”, i.e. people who know what they are talking about, are better at precision than recall. Actually, what sort of answer I get may well depend on how I think an expert could be useful. A question phrased as “are there any circumstances in which …?” might be good on the recall side: getting an expert to dig deep (note the metaphor). Just requiring simple explanations risks the pat, correct and unhelpful reply, the typical consultant platitude.

Big databases are popular. Wikipedia is clearly an example: I’m talking about its reliability issue in a couple of days, here at the Moore Library, and, no coincidence, online search will come in. Lower ambitions for broad coverage lead back to smaller, more specialised databases, with precision emphasised. My own experience with online research is that, with determined effort, it is remarkable what one can find. Such an effort takes intense screening of results and some systematic use of variant searches, the whole aim being to increase recall. Our time has seen the extractive approach to information blossom, primarily because the possibilities of recall have: precision not so much.
