Information Networks

Bio-networks Information Networks Older projects

 

Ranking in information networks (WWW and citations)

Due to their rapid growth and large size, many information networks have become untenable to navigate without some sort of ranking scheme. This is particularly evident in the example of the World Wide Web, a network of pages connected by hyperlinks. A successful solution to the problem of ranking the Web is Google's PageRank algorithm. Another class of information networks that could benefit from such a ranking method are citation networks. These networks are comprised of scientific publications connected by citation links. Current methods of ranking publications based on the total number of citations received are rather crude. They are too "democratic" in treating all citations as equal and ignoring differences in importance of citing papers. One of the advantages of Google's PageRank algorithm is that it implicitly accounts for the importance of the citing article in a self-consistent fashion. However, there exist significant differences between the World Wide Web and citation networks that suggest a modification of the original PageRank algorithm. The most important difference is that, unlike hyperlinks, citations cannot be updated after publication. This makes ageing effects in citation networks much more pronounced than in the WWW. 

The success of the PageRank algorithm can be attributed, in part, to its ability to capture the behavior of people randomly browsing the network of web pages. Indeed, the PageRank of a given web page can be interpreted as the predicted traffic (quantified, for example, by the rate of downloads) for that page if every WWW user follows a random path starting from a randomly selected webpage. The assumption that a typical web surfer starts at a randomly selected webpage might be not completely unreasonable for the WWW, but it needs to be modified for citation networks. As all of us know, researchers typically start "surfing"  scientific publications from a rather recent publication that caught their attention on a daily update of a preprint archive, a recent volume of a journal or, perhaps, was featured in a news article in the popular media. Thus a more realistic model for the traffic along the citation network should take into account that researchers "surfing" the citation network preferentially start their quests from recent papers and progressively get to older and older papers with every step.

Another difference is the apparently shorter "attention span" for researchers surfing citation networks compared to web surfers. From our optimization procedure we found that the average depth of citations followed by researchers is only 1/0.5=2, while Google's PageRank algorithm assumes that web surfers are more persistent and follow on average  1/0.15~6 hyperlinks.

To account for strong ageing characteristics of citation networks, we modified the PageRank algorithm by initially distributing random surfers exponentially with age, in favor of more recent publications. The output of this algorithm, which we call CiteRank can be interpreted as approximate traffic to individual publications in a simple model of how researchers find new information. We optimized parameters of our algorithm to achieve the best performance. The results are then compared for two rather different citation networks: all American Physical Society publications between 1893 and 2003 and the set of high-energy physics theory (hep-th) preprints. Despite major differences between these two networks, we found that their optimal parameters for the CiteRank algorithm (especially the depth of citations) are remarkably similar. 

If you have publications in any of the American Physical Society journals (Physical Review Letters, Physical Review A-E, Reviews of Modern Physics) you can find out the rank of your articles calculated using just the number of citations as well as PageRank and CiteRank algorithms. 

Network connecting routers in the Internet

The data for connectivity of the Internet on the level of the so-called Autonomous Systems (AS) is available at the web site of the National Laboratory for Applied Network Research. Each AS is a large domain of IP addresses usually that usually belong to the same organization such as e.g. a university, a business enterprise, or an Internet Service Provider (ISP). Autonomous Systems connected by an edge in this network directly exchange information with each other.  

In 2002 we have measured the correlation profile of the Internet. This profile quantifies the degree-degree correlations in a given networks. To this end it compares the number of edges between pairs of nodes with degrees  K0 and K1 in a given complex network in question and its properly randomized version, and detects statistically significant deviations of one from another. 

The correlation profile of the Internet shown below is very different from that of the protein interaction network in yeast, even though their power-law degree distributions are deceivingly similar to each other.

Knowledge networks of opinions of customers on products

In our 2001 PRL paper with Yi-Cheng Zhang we introduced a concept of knowledge networks. A practically important example of a "knowledge network" is a collection of positive and negative opinions  that a large group of customers have about a number of products (e.g. books). Such networks of opinions are used by businesses (e.g. amazon.com) to recommend  their customers other products/books  that they haven't bought but which were highly rated by other customers with similar tastes. In the computer science lingo this process of reconstructing peoples' tastes based on their opinions on products is known as "collaborative filtering". We tested our algorithm on a large real-life knowledge networks containing  opinions of users on movies they saw.

The difference between "plain" and "knowledge" networks is illustrated below:

N nodes in knowledge networks are characterized by M internal degrees of freedom (M-dimensional vectors), while their edges carry scalar products of vectors on two nodes they connect.

This rather abstract mathematical scenario can be realized in at least two real-life situations:

The above mentioned network of opinions that a group of customers ("readers") has on a number of products ("books"). Vectors on nodes correspond to customer's tastes and product features, while scalar products on edges  ("overlaps" ) are opinions of a given customer on a given product.
In biology this scenario can be realized in a network of "key-and-lock" interactions in a large ensemble of biomolecules (e.g. proteins). Vectors in this case specify the set of keys and locks present on the surface of a given molecule, while scalar products ("overlaps" ) stand for some measure of interaction strength.

Our algorithm allows one to predict "opinions" that are not yet known based on the opinions, which are already known. In case of customers and products it corresponds to predicting opinions of customers on products they never tried, while in bioinformatics application - to predicting the interaction strength of two bio-molecules in silico. It turns out that loops  in knowledge networks allow one to reconstruct some of the missing weights of edges (unknown opinions or interaction strength). The ability for large-scale prediction of these weights first appears when the density of known opinions exceeds the regular percolation threshold p1 ~ 1/N above which first system-wide loops appear in the network.  What is unique to knowledge networks  is the existence of  the second threshold p2 ~ 2M/N (reminiscent of rigidity percolation), above which all missing opinions can be reconstructed exactly.