
Tuesday 2 May 2017: Blobs and Trees

The ContentMine credo, as far as I have got hold of it, runs something like this: the scientific literature is there to be studied in the large, and distinct scientific concepts are the way to do it. Those concepts can be identified in Wikidata, which is where I come in. And the basic technique is text mining for facts.

Talking about “the scientific literature” does no more than identify a big blob of publications. We presume it is one unified blob, because all it takes when seemingly distant scientific concepts are brought together is one landmark paper. Then others will pile in.

There are numerous ways to emphasise the density of the literature, science being a collective effort and citations connecting it back and forth. What I’m currently working on is an attempt to spread things out. The possible benefit would be to spot where innovation is bringing concepts together. With ContentMine producing mined facts daily, that would be one way to put them to use.

Starting from the attractive idea of a “mileage chart”, or in more technical language a distance matrix, showing the separations of a range of concepts, consider updating it with a bunch of new facts. The addition of facts could only cause the distances to drop: any sensible notion of separation must be monotone in that fashion. Monitor what if anything drops, and check that it flags up innovation.
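A minimal sketch of that update step, in Python. It assumes every fact is an undirected link of length 1 between two concepts and that separation means shortest-path length. The function name `add_fact` and the four toy concepts are made up for illustration and are not ContentMine output. Because each new entry is a minimum involving the old entry, distances can only stay put or drop, and the returned list is the "monitor what drops" part.

```python
from itertools import product

INF = float("inf")

def add_fact(dist, u, v, weight=1):
    """Add an undirected link u-v to a shortest-path distance matrix.

    `dist` is a dict of dicts holding current shortest distances
    (0 on the diagonal, INF where no path exists). Returns the updated
    matrix and a list of (i, j, old, new) for every pair that dropped.
    """
    nodes = list(dist)
    new = {i: dict(dist[i]) for i in nodes}
    drops = []
    for i, j in product(nodes, repeat=2):
        # A shortest path uses the new link at most once, in either direction.
        via = min(dist[i][u] + weight + dist[v][j],
                  dist[i][v] + weight + dist[u][j])
        if via < dist[i][j]:
            new[i][j] = via
            if i < j:  # report each unordered pair once
                drops.append((i, j, dist[i][j], via))
    return new, drops

# Toy starting state (hypothetical): two unrelated pairs of concepts.
nodes = ["bacterium", "fungus", "infection", "mould"]
dist = {i: {j: (0 if i == j else INF) for j in nodes} for i in nodes}
for a, b in [("fungus", "mould"), ("bacterium", "infection")]:
    dist[a][b] = dist[b][a] = 1

# One new mined fact joins the two pairs.
updated, drops = add_fact(dist, "mould", "bacterium")
for i, j, old, new_d in drops:
    print(f"{i} - {j}: {old} -> {new_d}")
```

Updating from the old matrix rather than recomputing everything keeps the cost at one pass over the pairs per added fact, which matters if facts arrive daily.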

For some actual intuition, and a sense of scale, we need some taxonomic trees from biology. These already have a notion of distance: dog is eight steps away from cat via the order Carnivora. That is not the kind of distance we expect to see drop: the point is to relate it to some other kind of information, for example pharmaceuticals and their chemistry or function. Construct distances from species and drugs, and see how those may drop in the light of new research. The advent of penicillin brought fungal species into close relation with anti-bacterial treatments.
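To make the step count concrete, here is a small sketch of tree distance computed by walking up to the nearest common ancestor. The choice of ranks (species, genus, family, suborder, order) is an assumption: it is one simplified way of counting that gives eight steps between dog and cat, and a fuller taxonomy with more intermediate ranks would give a different number.

```python
# Child -> parent links for a deliberately simplified slice of the tree.
PARENT = {
    "Canis familiaris": "Canis",
    "Canis": "Canidae",
    "Canidae": "Caniformia",
    "Caniformia": "Carnivora",
    "Felis catus": "Felis",
    "Felis": "Felidae",
    "Felidae": "Feliformia",
    "Feliformia": "Carnivora",
}

def lineage(node):
    """Return the node followed by its ancestors, up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def tree_distance(a, b):
    """Count steps from a up to the nearest common ancestor and down to b."""
    steps_from_a = {n: k for k, n in enumerate(lineage(a))}
    for steps_from_b, n in enumerate(lineage(b)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("no common ancestor in this tree")

print(tree_distance("Canis familiaris", "Felis catus"))  # 4 up + 4 down = 8
```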

[Starfish Image]

The starfish picture is here not as a kitten substitute, or even as a member of a species. As a graphic, it can stand for a genus at the central disk, with five species represented by the tips of the five arms. In a simplified model of trees a fact can connect a tip to those of another such graphic: and this idea of “adjacency” is the fundamental point here. If you can imagine starfish-to-starfish links, then allow mutants such as those with six arms (snowflakes) and so on … that is how the distance matrix can be built up.
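The starfish model can be sketched as a small graph: a central node for the genus, one node per species at the arm tips, and a fact as an extra tip-to-tip link. Everything named here (`GenusA`, `GenusB`, the species labels, the helper functions) is hypothetical, and breadth-first search is simply one straightforward way to read distances off the adjacency data.

```python
from collections import deque

def starfish(adjacency, genus, species):
    """Add a genus node at the centre with one arm per species."""
    for s in species:
        link(adjacency, genus, s)

def link(adjacency, a, b):
    """Record an undirected adjacency between two nodes."""
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

def distances_from(adjacency, start):
    """Breadth-first search: step counts from start to every reachable node."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return seen

adjacency = {}
starfish(adjacency, "GenusA", [f"A{k}" for k in range(1, 6)])  # five arms
starfish(adjacency, "GenusB", [f"B{k}" for k in range(1, 7)])  # six-armed "snowflake"

before = distances_from(adjacency, "A2")
print("B2" in before)   # the two graphics are not yet connected

link(adjacency, "A1", "B1")  # one fact joins a tip to a tip
after = distances_from(adjacency, "A2")
print(after["B2"])      # A2 -> GenusA -> A1 -> B1 -> GenusB -> B2
```

Running `distances_from` from every node gives one row of the distance matrix at a time.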

So much for a theory: how would it work out in real life? There are two basic issues, the first being the need to work from a robust hold on the state of the art before expecting results on innovation. The second is that the added facts that create new “shortcuts” will require scrutiny, and there is an “adjacency matrix” as data to clean up. The situation is no different from depending too much on satellite navigation while in your car: you really may need a sceptical attitude. But both these subproblems are of interest in themselves, in terms of mining the literature.

I was more than just amused to find the following in a book by Umberto Eco: “The Porphyrian Tree tried to tame the labyrinth.” He explains that labyrinths come in three flavours, of which the third, meant here, is very much the blob. In linguistic terms, he calls the attempt an “impossible dream”. So watch out, semantic web people, this problem may fight back! (Quote from Semiotics and the Philosophy of Language, 1984, p. 84.)
