34 billion German Internet users can teach you something about the web

Update 25.10.2008: See updates in blue text in the body of the post below

A research institute confused million with billion in a news release published last month. Someone called them and they fixed the one-letter error immediately. Case closed, so what could be learned from this? What is the importance of million vs. billion in light of the current financial market crisis anyway (sorry, bad joke)? After all, we are all human and we all make mistakes, right?

image

Good that you ask. The mission of line-of-reasoning is to help make the world a better place by explaining issues of today’s information infrastructure.

Starting from the institute’s small and unimportant error, we will show with concrete examples some of the simplest and most fundamental issues that have plagued the web since its beginning:

  • broken links: URLs are pointers, and they are great as long as the page they point to still exists. But if not …
  • copy and paste helps to avoid the issue of broken links, but it makes the removal of errors even more difficult
  • Google, currently THE web search engine, does not give you a good answer about some of the most essential information on the web, although Google itself has been, so to say, built on the answer to this question: What pages are linking to another page?

Read on for all details, artifacts and explanations.

Issue 1: Broken links

The original news release that spoke of 34.4 billion German Internet users was published under this link:

http://www.comscore.com/press/release.asp?press=2481

If you follow that link now you will end up with a page like this:

image

That happens because when the one-letter error (“b” -> “m”) was fixed, the news release was also republished (probably unintentionally) under a slightly different link:

http://www.comscore.com/press/release.asp?press=2484

image

The issue is that fixing the error and publishing the updated news release under a new URL immediately created a follow-up issue for those sites that had linked to the original news release.

Update 25.10.2008: picture inserted. This German blog post (see the newly inserted screenshot below) still points, for now, to the removed news analysis of the institute and can serve as a real-life example.

image

Update 25.10.2008: the FAZ blog post by Holger Schmid discussed below was updated (without further comment or a thank-you) after we commented on the post with a link to this article on 23.10.2008. The pingback and our comment have since been removed from the blog post. We keep that post as part of this article for consistency, and because the central point of this article has been proven once more: any information on any website can be changed or removed without notification. It is interesting that this happened on the site of a professional blogger of a newspaper that you would expect to stand for authentic, honest information and for journalistic ethics in general. A link update is a small change, we agree, but the removal of our comment and of the link to this post is not.

Example: this post on the blog (recommended, but German only) of one of the most respected German newspapers, the “Frankfurter Allgemeine Zeitung”, was published before the institute fixed the error. As a result the post now contains, a month after the correction, a meaningless link to the original news release, which no longer exists:

image

If someone wants to verify the source of this blog post after reading it, they will not be able to reach the source just by clicking on the provided link. That can create various follow-up problems (How do I find the source of the blog post? Can I rely on the blog post if I cannot find the source? etc.).

No one is to blame here, neither the research institute nor the FAZ blogger. It is probably very easy to find lots of broken links on this blog, line-of-reasoning, as well.
URLs are pointers to web pages and they are incredibly useful. But the simplicity of the concept comes at the price of potential “dead links”.
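Checking a list of links for dead ones can be sketched in a few lines of Python. This is only a minimal sketch under our own assumptions, not a finished tool: the function names `http_status` and `find_dead_links` are made up for this illustration, and a link counts as "dead" when the server answers with a status of 400 or above or cannot be reached at all. A server that answers a removed page with a normal-looking error page and status 200 would slip through, and a server that rejects HEAD requests would be flagged wrongly.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        # A HEAD request asks for the headers only, not for the page body.
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code  # the server answered, e.g. with 404
    except OSError:
        return None        # DNS failure, refused connection, timeout, ...


def find_dead_links(urls, status_of=http_status):
    """Return the URLs whose status is unknown or 400 and above (our assumption of 'dead')."""
    dead = []
    for url in urls:
        status = status_of(url)
        if status is None or status >= 400:
            dead.append(url)
    return dead
```

The `status_of` parameter exists so that the check can be tried with a lookup table of known statuses instead of real network requests.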

Issue 2: Copy and Paste

The simplest way to avoid broken links would be to copy and paste at least the relevant parts of other web pages into the new write-up. Besides the fact that this can and will cause all sorts of copyright issues, it is also a dead end, as a simple Google search demonstrates for our example:

http://www.google.com/search?q=%2234.4+billion+German+Internet+users%22&hl=en&client=firefox-a&rls=org.mozilla:en-GB:official&filter=0

image

Because the sites that show up in Google copied the complete news release of the institute directly after its initial release, they avoided the issue of the dead link, but they did not solve the problem. One could argue that copy and paste makes the problem even worse. Although the institute quickly fixed the “misspelling” on its own site, what could it do now (a month later) to remove the error from the other sites just as quickly and easily?

The web itself and the concepts it is based on do not address in any way the “implicit” relation that copy and paste creates between the originally published information and its reuse on other sites.
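Since the web does not record this relation, it can only be guessed after the fact by comparing texts. The following is a rough sketch of one common way to do that, namely comparing overlapping word sequences ("shingles"). The function names, the shingle size of four words and the sample sentences in the usage note are our own assumptions for illustration; they are not taken from the news release.

```python
import re


def shingles(text, size=4):
    """Return the set of all runs of `size` consecutive words, lowercased."""
    # Keep decimal numbers such as "34.4" together as a single word.
    words = re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(original, candidate, size=4):
    """Share of the original's shingles that reappear in the candidate (0.0 to 1.0)."""
    source = shingles(original, size)
    if not source:
        return 0.0
    return len(source & shingles(candidate, size)) / len(source)
```

For an invented original such as "the study counted 34.4 billion German Internet users last month", a page that pasted the sentence verbatim scores 1.0, while the corrected wording with "million" scores noticeably lower. A high score only hints that a page was copied from the original; it does nothing to get the error fixed there.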

Issue 3: Google is not providing a good answer for everyone to the central question:
What pages are linking to another page?

Based on our example URL http://www.comscore.com/press/release.asp?press=2481 we would like to know what pages are linking to the URL of the original news release. Google is the biggest search engine and it is using the information about what pages are linked to each other in its patented search algorithm. Surely we can make use of Google to find all web sites that link to http://www.comscore.com/press/release.asp?press=2481, right?

Wikipedia says the following today (23.10.2008) about Google’s patented PageRank algorithm:

image


Side remark: if you follow the link in Wikipedia to the source page of its Google PageRank quote today

image

you will be redirected by Google to this URL http://www.google.com/corporate/tech.html where you will find a different definition of PageRank than the one that Wikipedia quotes. That again nicely supports the point we are trying to make in this post.


The important point to notice is that Google does use the links/URLs in pages to decide on the importance of pages. Under “Advanced Search”, Google also offers a feature to search for pages that link to a specific page.

image

Google explains this feature like this:

“link: The query [link:] will list webpages that have links to the specified webpage. For instance, [link:www.google.com] will list webpages that have links pointing to the Google homepage. Note there can be no space between the “link:” and the web page url.”
http://www.google.com/help/operators.html

Our Google search for pages that link to the original news release page www.comscore.com/press/release.asp?press=2481 simply looks like this:

http://www.google.com/search?hl=en&q=link%3Awww.comscore.com%2Fpress%2Frelease.asp%3Fpress%3D2481&btnG=Google+Search&aq=f&oq=
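The search URL above looks cryptic only because the query is percent-encoded: “:” becomes %3A, “/” becomes %2F, “?” becomes %3F and “=” becomes %3D. As a small sketch, this is how such a URL can be put together with Python's standard library. The function name `google_link_query` is our own invention, and we leave out the btnG, aq and oq parameters that the browser added.

```python
from urllib.parse import urlencode


def google_link_query(target_url):
    """Build a Google search URL for the query link:<target_url>."""
    # The operator and the URL must be joined without a space in between.
    query = "link:" + target_url.strip()
    # urlencode percent-encodes ':', '/', '?' and '=' inside the query value.
    return "http://www.google.com/search?" + urlencode({"hl": "en", "q": query})


print(google_link_query("www.comscore.com/press/release.asp?press=2481"))
# http://www.google.com/search?hl=en&q=link%3Awww.comscore.com%2Fpress%2Frelease.asp%3Fpress%3D2481
```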

One would now expect to find all pages that link to www.comscore.com/press/release.asp?press=2481, including the specific FAZ blog post mentioned above (Google has that post indexed, as can be seen here), agreed?

But the result is … nothing:

image

Our little example makes it very easy to demonstrate and prove that Google keeps some of the most basic information available on the web (what page is linking to another page) away from the average Internet searcher.

The limitations of the Google link: operator are not a newly detected Google secret; they are a well-known fact among experts:
http://www.seo-theory.com/wordpress/2008/04/23/how-to-get-the-most-seo-bang-from-query-operators/
http://www.webmasterworld.com/google/3110041.htm
http://blog.pengoworks.com/index.cfm/2008/6/20/Do-not-trust-Googles-link-operator

Elsewhere, Google offers webmasters more information about their own sites than it offers the public through the link: operator. In doing so Google implicitly confirms the limitation of the link: operator:

“We’ve extended our support for querying links to your site to much beyond the link: operator you might have used in the past. Now you can use webmaster tools to view a much larger sample of links to pages on your site that we found on the web. Unlike the link: operator, this data is much more comprehensive and can be classified, filtered, and downloaded. All you need to do is verify site ownership to see this information.”

Source: http://googlewebmastercentral.blogspot.com/2007/02/discover-your-links.html

This confirms that if you do not own a site, Google will not show you the link-related data. Experts recommend using Yahoo Site Explorer in that situation. We did that, and Yahoo Site Explorer showed, as expected, the FAZ blog post (among others) as hit #42.

image

But Yahoo is not Google, and that brings up other issues (Yahoo very likely searches fewer sites than Google, etc.).
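The question at the heart of this issue, which pages link to a given page, is conceptually simple once the pages themselves are at hand: collect the links of every page and turn the table around. The sketch below does this for a handful of pages held in memory. It is only an illustration under our own assumptions (the names `backlink_index` and `_Hrefs` are made up, and the pages are passed in as a dictionary); it says nothing about how Google or Yahoo actually store their link data.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _Hrefs(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


def backlink_index(pages):
    """Map each link target to the sorted list of pages that link to it.

    pages: dict of page URL -> HTML source of that page.
    """
    index = defaultdict(set)
    for page_url, html in pages.items():
        parser = _Hrefs()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            target, _fragment = urldefrag(urljoin(page_url, href))
            index[target].add(page_url)
    return {target: sorted(sources) for target, sources in index.items()}
```

Looking up http://www.comscore.com/press/release.asp?press=2481 in such an index would list every collected page that still points to the original news release. The hard part is not this inversion but having crawled enough of the web to make the answer complete.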

Closing comments:

“34 billion German Internet users” have shown us some basic conceptual issues of today’s web. Those issues are well known. Several attempts were made to avoid exactly these issues even before the web as we know it became reality.
Ted Nelson’s project “Xanadu” is an outstanding example of those attempts. You have read this post up to here, so we assume you will be interested in learning more about Xanadu:

http://www.netvalley.com/intvalxan.html
http://www.wired.com/wired/archive/3.06/xanadu_pr.html
“The real work of writing is rewriting; and especially in big projects, is principally the  overview and control of large-scale rearrangement — a rearrangement process that used to be called “cut and paste” until those terms were redefined by the Macintosh in 1984.”
http://cipher.uiah.fi/pub_stuff/dVd/info_architecture/Xanalogical_Structure.pdf
also here http://www.xanadu.com.au/ted/XUsurvey/xuDation.html
http://www.wired.com/wired/archive/3.09/rants.html

Others have reached this conclusion before us, so let them speak for us:

“One profound insight can be extracted from the long and sometimes painful Xanadu story: the most powerful results often come from constraining ambition and designing only microstandards on top of which a rich exploration of applications and concepts can be supported.
That’s what has driven the Web and its underlying infrastructure, the Internet.”
Source: http://www.netvalley.com/intvalxan.html

“The web is built on very loose coupling and lots of unpredictable changes. It is, in many ways, a mess – but it’s a mess that works, delivering real value to lots of people every day.” Source: http://martinfowler.com/bliki/EvolutionarySOA.html







Posted on October 23rd, 2008 at 10:08 pm and filed under Issues explained.
