By Yakub Sebastian
While scientific discoveries may bring to mind images of scientists busily working in their labs, this is not always the case. Some of the grandest scientific discoveries did not originate from laboratory experiments. Albert Einstein’s General Theory of Relativity grew out of thought experiments rather than physical experimentation. Einstein theorized that a light beam can be bent by gravity simply by imagining what would happen when a beam of light passes through a chamber that is accelerating upward. It was the English physicist Arthur Eddington who later confirmed Einstein’s prediction by observing the deflection of starlight by the sun’s gravity during a total solar eclipse in 1919.
Many discoveries do, of course, take place in laboratories. Today, however, with only a personal computer and access to the World Wide Web, one can also contribute to scientific discovery. A huge amount of data exists in online literature. Quite interestingly, this enormous repository has become a new domain of scientific investigation in its own right. It is now possible to formulate new hypotheses merely by connecting several known concepts in the literature, and the hypotheses can subsequently be validated by actual experiments. This is analogous to coining an entirely new word from the 26 letters of the English alphabet without having to invent a new letter. This method is known as literature-based discovery (LBD).
To illustrate the power of this method, let’s look back to a remarkable medical discovery in 1986. Don R. Swanson, an American information scientist and expert librarian at the University of Chicago, stumbled upon an interesting finding while investigating a collection of medical journals. One paper he had read stated that patients with Raynaud’s syndrome were usually characterized by high blood viscosity. In another paper, he read that fish oil contains a substance that could lower blood viscosity. Armed with these two simple pieces of knowledge (i.e. fish oil lowers blood viscosity; high blood viscosity is found in patients with Raynaud’s syndrome), Swanson hypothesized that fish oil could be a treatment for the disease. To his surprise, however, these papers did not cite one another, nor were the two papers cited in any other journal. No one had publicly suggested that taking fish oil could alleviate Raynaud’s syndrome. This was new knowledge hidden in plain sight. Three years later, his hypothesis was clinically tested and, voilà, Swanson was proven right.
Swanson’s discovery has three important implications. Firstly, the opportunity to be involved in formulating plausible scientific hypotheses is now open to those without in-depth domain expertise. Swanson was neither a medical doctor nor a biologist. Secondly, many new discoveries can be made by synthesizing known facts and data found in the existing literature, provided that we are able to overcome two primary limitations: (1) the substantial computational power required to automate the task, and (2) the lack of effective and efficient information search methods. While the first can be addressed by the continual, exponential growth of computer processing power following Moore’s Law, the solution to the second limitation is less obvious. The challenge comes from the fact that novel knowledge is often not explicitly expressed in documents; otherwise it would no longer be novel. This leads to the third implication: the emergence of a new branch of computer science that concerns itself with the problem of searching for knowledge that is not explicitly stated in text.
Popular Web search engines, e.g. Google and Yahoo!, are excellent as long as the goal is to find information in documents that is directly relevant to the query. Relevance is normally determined by the occurrence of certain keywords in the search results. These search engines, however, fail to detect information that is not explicitly stated in the form of the user’s pre-defined keywords. In some cases, implicit knowledge is only discoverable if multiple documents are compared and analyzed together (as demonstrated by Swanson’s discovery). As if this were not challenging enough, information scientists have recently argued that Swanson’s simple syllogistic discovery method (i.e. A causes B; B causes C; therefore A causes C) is just one of many LBD methods by which implicit knowledge can be mined from scientific literature.
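For readers who like to see ideas in code, the syllogistic pattern above can be sketched in a few lines of Python. This is only a toy illustration, not Swanson's actual procedure: the function name `find_hidden_links`, the three hand-written "papers" and their concept labels are all assumptions made for the example, and co-mention in a paper is used as a crude stand-in for a real causal link.

```python
from collections import defaultdict
from itertools import combinations


def find_hidden_links(documents, start):
    """Toy A-B-C search (hypothetical helper, not a published algorithm).

    documents: iterable of sets of concept labels, one set per paper.
    start: the concept we begin from (the "C" in Swanson's example).
    Returns {candidate: bridging concepts} for every concept that shares
    a bridge with `start` but is never mentioned alongside it.
    """
    # Record which concepts appear together in at least one paper.
    cooccur = defaultdict(set)
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            cooccur[a].add(b)
            cooccur[b].add(a)

    direct = cooccur[start]  # concepts already linked to `start`
    hidden = defaultdict(set)
    for bridge in direct:
        for candidate in cooccur[bridge]:
            # Keep only candidates with no direct link to `start`.
            if candidate != start and candidate not in direct:
                hidden[candidate].add(bridge)
    return dict(hidden)


# Made-up miniature "literature" mirroring the fish oil example.
documents = [
    {"Raynaud's syndrome", "blood viscosity"},
    {"fish oil", "blood viscosity"},
    {"fish oil", "omega-3 fatty acids"},
]

links = find_hidden_links(documents, "Raynaud's syndrome")
print(links)
```

On this tiny input the only candidate is fish oil, bridged by blood viscosity, because no single paper mentions fish oil and Raynaud's syndrome together. A real system would also have to extract concepts from free text, rank a very large number of candidates, and tell genuine causal claims apart from mere co-mention, which is where the search problem described above becomes hard.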
This presents us with a challenging yet exciting research landscape in computer science. Harnessing a wide range of interdisciplinary expertise in artificial intelligence, linguistics, cognitive science, software engineering, and even biomedicine will be essential to future solutions. Swinburne University of Technology Sarawak Campus has collaborated with a local medical institution to design software capable of presenting previously unknown links between diseases and biological substances by synthesizing information stored in both clinical data sets and biomedical literature.
It is unfortunate that the current information deluge has not been matched by the capability to fully harness it. The result can be likened to a man who dies of thirst in the middle of a vast ocean. Advances in literature-based discovery research may help us escape such an intellectual tragedy.
Yakub Sebastian is an Associate Lecturer with the School of Engineering, Computing and Science at Swinburne University of Technology Sarawak Campus. He can be contacted at ysebastian@swinburne.edu.my