About Lists of Web Page Cross-references (citations)

I began making web pages in the mid-90s. The web was much smaller then and relevant pages were harder to find, so one way to make a useful page was to compile a set of good links on some topic. I discovered a trick for finding good, relevant content quickly: looking at which other pages sent people to my pages.

In fact, this method is fairly obvious because it is a classic research method. When you publish, you provide references to previous work. So as a first pass at finding good related material, people look up the papers you found useful enough to reference in yours. That is very much like the links people put on their web pages. But it only takes you backwards to older content, and a very limited amount at that. There are also citation indexes, which let you find the papers that reference a given paper. This works forwards. If you like a paper I wrote 5 years ago, perhaps you should find out who has referenced it - that will tend to lead you to more recent research on the same topic.

So how does this work on the web? When you go to a web page, your browser passes it a variety of information. One item is the "referer" page (the spelling comes from the HTTP header name), which is the page that contained the link you followed. This serves a variety of purposes. For example, if a web page is sending people to a non-existent page on my site (e.g. because of a typo), I can detect that, look up the contact information for the webmaster, and let them know the address they are using is wrong. For me, using this referer information to compile citation lists just seemed obvious. By the late 90s I had mostly automated the list generation. Coincidentally, for my own searching I had also started to prefer the results obtained from a project at google.stanford.edu over other search engines. Google was, in part, implementing the same basic idea in search engine form and of course went on to dominate the whole search engine market within a few years.
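As an illustration of the broken-link case above, here is a minimal sketch in Python. It assumes the server writes its access log in the common "combined" format, where the referer is the second-to-last quoted field. The names `LOG_PATTERN`, `parse_hit`, and `broken_inbound_links` are invented for this example and are not taken from the scripts behind this site.

```python
import re

# Assumed layout (combined log format):
#   host ident user [time] "METHOD path protocol" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "[^"]*"'
)

def parse_hit(line):
    """Return (path, status, referer) for one log line, or None if it does not match."""
    m = LOG_PATTERN.search(line)
    if m is None:
        return None
    referer = m.group("referer")
    if referer in ("", "-"):
        referer = None  # the browser sent no referer
    return m.group("path"), int(m.group("status")), referer

def broken_inbound_links(lines):
    """Collect (referring page, missing path) pairs from hits that returned 404."""
    pairs = set()
    for line in lines:
        hit = parse_hit(line)
        if hit and hit[1] == 404 and hit[2]:
            pairs.add((hit[2], hit[0]))
    return sorted(pairs)
```

Each pair returned by `broken_inbound_links` names an outside page and the wrong address it links to, which is the information needed before contacting that page's webmaster.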

The Present Method

I still maintain a few web pages on DC Tech. They all still contain a list of links for each topic based largely on lists generated automatically by watching http_referer (citations). The method has the following key aspects:

  • The updating begins by listing the web pages that have recently sent visitors to a given topic area.
  • The list is periodically processed then reset.
  • During processing, new pages are put on a "to be checked" list. This wasn't needed in the 90s, but these days the list gets filled with trash if I don't check the pages before including them.
  • Pages that I approve for the displayed list are rated automatically by a method described here. Briefly, the ranking is based on the amount of recent traffic coming from the other page, which is a measure of the significance of the page and its relevance to this site (as measured by the number of recent shared readers).
  • There are some weaknesses to http_referer. The fraction of browsers sending it is diminishing over time, likely as a result of security/privacy settings. There are still enough browsers sending it to produce good statistics, but that may not last. Also, some of the referers just don't seem to be right. I suspect that various browser bugs exist where another page in the history shows up instead of the actual referring page.
  • People following links from their email account, Facebook account, etc. may have "referers" set that cannot be resolved.
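
The steps in the list above can be sketched roughly as follows. This is a simplified illustration in Python, not the code that runs on this site. The function names, the choice to drop query strings, and the use of a plain hit count as the rating are all assumptions made for the example. The actual rating method is the one described separately and may differ.

```python
from collections import Counter
from urllib.parse import urlsplit

def normalize(referer, own_host):
    """Return a cleaned referring URL, or None if the referer should be ignored."""
    if not referer:
        return None
    parts = urlsplit(referer)
    # Skip anything that is not an ordinary web page (mail clients, apps, "-").
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    if parts.hostname == own_host:
        return None  # navigation within this site is not a citation
    # Assumption: the query string and port are dropped so variants of one page merge.
    return f"{parts.scheme}://{parts.hostname}{parts.path or '/'}"

def process_period(referers, own_host, approved, to_check):
    """Process one period's referers.

    Unknown pages are added to the to_check set for manual review.
    Approved pages are returned as (url, hits), most recent traffic first.
    """
    counts = Counter()
    for referer in referers:
        url = normalize(referer, own_host)
        if url:
            counts[url] += 1
    for url in counts:
        if url not in approved:
            to_check.add(url)
    # Stand-in rating: raw number of recent visits sent by each approved page.
    ranked = [(url, n) for url, n in counts.items() if url in approved]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked
```

After each call the caller would discard the period's referer list, which corresponds to the reset step, and review the `to_check` set by hand before moving any page into `approved`.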