TL;DR version

I did some investigation into the Bing & Google spat. Once the Bing Toolbar is installed, IE 8 sends Microsoft a summary of every page URL that is requested. I found no evidence the toolbar singles out Google, or even search engines. I found no evidence of directly copying search results.

The results seem to indicate that Bing treats Google’s search result pages the same as any other URL on the world wide web.

The rest, it would seem, is all spin.

Background

Over the past 48 hours Google & Bing have been having a public spat about search results. The initial salvoes led to Google’s official blog post “Microsoft’s Bing uses Google search results—and denies it“, then a counter-post from Microsoft “Setting the record straight“. These two official posts have been accompanied throughout by other parties posting comments and criticism, and poking fun.

Spinny Spin Spin Spin

Two things have struck me as bad in this debate. First off, the insane levels of spin. Google’s original blog post minced no words, strongly implying the actions were deliberately targeted, and went directly on the attack with statements like “some Bing results increasingly look like an incomplete, stale version of Google results—a cheap imitation.”

Bing’s counterspin was just as shocking, associating Google’s tactics with “click fraud” and casting the focus across to other Bing features that have (allegedly) shown up in Google.

Google seem to have struck first in the spin war. Their official blog post painted Google as the innovator, and Bing as the unquestioned copycat. This left little room for either party to save face, but it left Google with a massive PR upper hand.

Science?

The other problem I see is with Google’s methodology. The blog post explains how they concocted a series of experiments to “determine whether Microsoft was really using Google’s search results in Bing’s ranking.”

The experiments produced results, but not in any kind of scientifically conclusive manner. There were no scientific controls, and the conclusions were presented as simply supporting Google’s argument, with no other plausible explanation ceded.

My Investigation

I decided to run my own investigation, armed with nothing but a humid Australian summer evening, Wireshark, and beer. I also have a Windows XP VM that I keep around just to run Adobe Lightroom.

First off, I installed a vanilla IE 8 and started my packet capture.

Straight away, I saw TLS encrypted connections to a Microsoft IP. Caught red-handed?

A quick session with the excellent freeware API Monitor let me read the unencrypted data.

Seems that this is the IE anti-phishing filter. A SOAP request is sent once for each domain the browser visits. The responses appear to be cached so the same domain is not requested twice.
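Based purely on what the captures show, the logic looks something like this sketch (my reconstruction, not Microsoft’s code; the send_soap_lookup helper and its “safe” verdict are hypothetical stand-ins):

```python
from urllib.parse import urlparse

checked_domains = {}  # domain -> cached verdict


def send_soap_lookup(domain):
    # Hypothetical stand-in: the real filter sends a SOAP request to
    # Microsoft and gets back a verdict for the domain.
    print(f"SOAP lookup sent for {domain}")
    return "safe"


def check_url(url):
    """Look up a URL's domain, but only once; later hits use the cache."""
    domain = urlparse(url).netloc
    if domain not in checked_domains:
        checked_domains[domain] = send_soap_lookup(domain)
    return checked_domains[domain]


check_url("http://www.google.com.au/search?q=green+fig+jam")  # lookup sent
check_url("http://www.google.com.au/preferences")             # cache hit, no traffic
```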

For example, here are the only two SOAP requests sent to Microsoft when I navigated to google.com.au, searched for “green fig jam”, and clicked the third link. Excuse the XML.

You’ll notice the only two URLs sent were the article I clicked on and some tracking URL on that same site. The search URL on http://google.com.au was not even looked up, because google.com.au had previously been cached as a safe domain.

The anti-phishing service seems to be innocuous. It does not provide enough information to enable any “copying of search results”.

Bing Toolbar

I couldn’t find any other data sent by plain old IE 8 (including Suggested Sites.) Time to install the Bing Toolbar and give away some more of my precious privacy:

Immediately the packet captures started getting interesting:

Unencrypted HTTP GETs to g.ceipmsn.com, one for every page that the browser loaded.

Here’s the full HTTP/1.1 session, as captured by Wireshark. Below are the individual GETs sent when I opened a new window and visited my website, projectgus.com.

This first GET seems to be an initial identifier sent when the browser opens (nothing is sent back in the response.) Then there is a GET when I typed projectgus.com into the address bar, then one for the Telstra GPL Violation link on the front page, and finally one when I clicked through to the GPL-Violations FAQ.

Among the interesting pieces of information that are sent:

  • A unique identifier for at least my Windows login session, maybe my computer. I never saw it change. (MI=0E6B38A7645C4121BC051BDBF57482AF-1)
  • Version numbers such as my VM’s OS version (XP Pro), and some others. (OS=5.1.2600)
  • Timestamp and timezone. (ts20110203221808108|tz-600)
  • The URL loaded. (euhttp://gpl-violations.org/faq/vendor-faq.html)
  • The link text clicked on to get to that URL. (atvendor%20FAQ)

There are also some other interesting fields that appear in some other captures, such as the original URL if a redirect occurs. There doesn’t seem to be any page content sent, just URLs and sometimes link text.
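For the curious, here’s a rough decoder for these fields. The pipe-delimited, short-prefix layout is my best guess from eyeballing the query strings, and the sample payload is just the values listed above stitched back together:

```python
from urllib.parse import unquote

# Prefixes seen in my captures; there may well be others.
PREFIXED = ("ts", "tz", "eu", "at")

SAMPLE = ("MI=0E6B38A7645C4121BC051BDBF57482AF-1|OS=5.1.2600|"
          "ts20110203221808108|tz-600|"
          "euhttp://gpl-violations.org/faq/vendor-faq.html|"
          "atvendor%20FAQ")


def decode_fields(payload):
    """Split on '|' and peel the key off each field."""
    fields = {}
    for part in payload.split("|"):
        if part[:2] in PREFIXED:          # e.g. ts..., eu..., at...
            key, value = part[:2], part[2:]
        else:                             # e.g. MI=..., OS=...
            key, _, value = part.partition("=")
        fields[key] = unquote(value)      # 'eu'/'at' values are URL-encoded
    return fields


for key, value in decode_fields(SAMPLE).items():
    print(key, "=", value)
```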

What about Google?

Here’s another full session, recorded when visiting google.com.au to search for Microsoft Seaport, then clicking on the third search result.

Here is some of the pertinent data sent to Microsoft from that session:


  • euhttp://www.google.com.au/search?hl=en&source=hp&biw=1676&bih=813&q=microsoft+seaport&aq=f...
  • euhttp://www.google.com.au/url?sa=t&source=web&cd=10&ved=0CFgQFjAJ&url=http%3A%2F%2Fsmartnetadmin.blogspot.com%2F2010%2F02%2Fremove-microsoft-seaport-search.html&rct=j&q=microsoft%20seaport&ei=ho1KTbQXjMlxw7vFtws...
  • |euhttp://smartnetadmin.blogspot.com/2010/02/remove-microsoft-seaport-search.html

… these are just some of the 12 URLs requested by the browser as it went through these motions. Ironically, most of the other page loads seem to be Google, Blogspot & Doubleclick all recording tracking data.

Interestingly, the ‘at’ field showing the link text doesn’t appear when tracking Google activity. I don’t know if this is IE8 honouring robots.txt, or some other factor.

It’s easy to see that if Bing mines this data then it could associate Google search queries with the resulting pages. This is simply because the query is encoded in Google’s URLs, and then the result page URL follows. In this way, though, it does not look like Google is being treated any differently from any other web site with human-friendly URLs.

In particular, Google’s experiment (injecting unique strings into special searches) could easily have produced these associations from the data seen here.
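To show how easily those associations fall out of the data, here’s a sketch using trimmed versions of the two capture URLs above. Note that nothing in it is Google-specific; it’s just a generic query string parse:

```python
from urllib.parse import urlparse, parse_qs

search_url = ("http://www.google.com.au/search"
              "?hl=en&source=hp&q=microsoft+seaport")
click_url = ("http://www.google.com.au/url?sa=t&source=web&cd=10"
             "&url=http%3A%2F%2Fsmartnetadmin.blogspot.com%2F2010%2F02%2F"
             "remove-microsoft-seaport-search.html&q=microsoft%20seaport")

query = parse_qs(urlparse(search_url).query)["q"][0]
clicked = parse_qs(urlparse(click_url).query)["url"][0]
print(query, "->", clicked)
# microsoft seaport -> http://smartnetadmin.blogspot.com/2010/02/remove-microsoft-seaport-search.html
```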

It’s worth noting that, on the face of it, mining tracking data in this way could provide better search results – by mining where users really do click all over the web, and which pages they bounce through before finding useful content.
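As a toy illustration of that idea (entirely my own invention, including the 30-second threshold and the made-up pass-through URL), a miner could separate the pages users bounce straight through from the page they finally settle on:

```python
DWELL_THRESHOLD = 30  # seconds; invented cutoff between "bounce" and "read"


def useful_pages(trail):
    """trail: list of (timestamp_in_seconds, url) in visit order."""
    useful = []
    for (ts, url), (next_ts, _next) in zip(trail, trail[1:]):
        if next_ts - ts >= DWELL_THRESHOLD:
            useful.append(url)        # user lingered here before moving on
    useful.append(trail[-1][1])       # user stopped here entirely
    return useful


trail = [
    (0, "http://www.google.com.au/search?q=microsoft+seaport"),
    (4, "http://example.com/thin-aggregator"),  # made-up pass-through page
    (7, "http://smartnetadmin.blogspot.com/2010/02/remove-microsoft-seaport-search.html"),
]
print(useful_pages(trail))  # only the final page counts as useful
```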

Of course, there is nothing in this investigation to suggest that Bing doesn’t have some special “google filter” running on their server side to extract Google clicks from the rest of the tracking data. However, there’s also absolutely no evidence to suggest that this does happen. Occam’s Razor would seem to apply.

So who done bad, again?

The behaviour I’ve seen explains Google’s experiments, but does not support the accusation that Bing set out to copy Google.

Bing Toolbar is tracking user clicks and Bing could use the result to improve search results. I don’t personally see any great distinction between this behaviour and Google’s many tracking, indexing and scraping endeavours which they use to improve their own search results.

While I personally dislike the privacy implications, Bing Toolbar is pretty upfront about it when it gets installed (unlike much web page user tracking.) The fact that the tracking is plain HTTP not HTTPS, with the content in plaintext, would seem to indicate that they weren’t seeking to hide anything.

Ultimately, the worrying question for me is this: has web search become so stagnant, and so generic, that this kind of name-calling and spin-doctoring is now the only way to carve out a brand?

8 thoughts on “Bing & Google – Finding some facts”

  1. Bob: Sure, but at this stage “server-side Google specific filtering” is a conspiracy theory. As I said above, Occam’s Razor applies here. If I were Bing, and I had a ton of clickthrough data showing users’ paths of clicks through the web, I’d mine the lot. Not just the Google-specific parts. That’s really useful data for tuning a search engine.

    The only way to suggest otherwise would be to run a control test. Repeat the exact same methodology as Google did, except running datasets on both Google and a non-Google domain. If such tests yield Google associations but no non-Google associations in Bing after a few weeks, there’s a case for some kind of Google-specific filtering (although not airtight, some anti-linkspam mechanism could always kick in against data from a small low-trust domain.)

  2. Gus,

    Very nicely done. Thank you. What I really appreciate is that you’ve provided a lot of info, with very little spin. Quite the opposite of nearly all the discussion that’s been happening on both sides of this controversy. In fact, I’m aghast at the lack of facts — and the indifference to the need for more facts — that is apparent at most news sources and blogs that have covered this episode. Thanks again for filling the gap.

    -Horse

  3. You do not know what experiments Google ran. You assert that they did not have a control based on what evidence? Do you really believe that you spending an hour on something that a team of 20+ Google engineers worked on over months is any way at all comparable? That is supremely arrogant.

  4. Cool down, now, Ron.

    I think Angus said that this “experiment” (Google’s own word) did not have any controls because Google didn’t mention any. Experiments are useless without controls.

    But you’re right, we don’t know all the experiments that Google ran, or any controls that they might have used. We only know what they are telling us.

    As far as Google is admitting, this was a “controlled” experiment only to the extent that they “controlled”, i.e. eliminated, all possible influences other than the one that they explicitly manipulated. They achieved this degree of “control” by using brand new, alien search terms that had never appeared on the face of the earth, except possibly in a password or a random keyboard session. They intentionally sent Bing some bogus clickstream information containing these alien terms. And then they cried wolf when they observed that, in some cases, Bing made use of the only “information” about these words that it could possibly have. They scarcely mention that Bing didn’t even use that “information” in the vast majority of cases.

    If a team of Google “engineers” worked on this for months, which I really doubt, it’s because that’s how hard they had to work to produce any semblance of a suggestion that Bing makes use of information that might have come from a Google search.

    And no, I don’t like Bing, or Microsoft for that matter. I think Google (the search engine) is great. But Google (the company) is quickly showing that they can stoop as low as Microsoft ever did.

  5. Ron, I’m going on the methodology and results described in the official blog post. If Google did more tests and has more results, then they should release that information as well. AFAIK no one has said there is more evidence out there, though.

    I’m not suggesting that I did more or less than their engineers, I’m just trying to add to the (very slim) amount of actual factual information that is currently available about this.

  6. Without ever specifically setting out to copy any particular site, you could improve search rankings by simply watching how users who end up at a particular domain or page stay there for long periods of time (either by repeated clicks to pages within the domain or an indication of stopping to read for long periods).

    One such example by itself is pretty useless (perhaps the user walked away for 30 minutes). 400 million such examples might mean “Users who start at xyz.com and query for ‘foo’ end up at xyz.com/foo for a ‘long period of time’”.

    Possible conclusion: Maybe we should return ‘xyz.com/foo’ when people search for foo.

    Obviously oversimplified, but definitely a plausible means of capturing user data from Bing or Google or Ebay or anyone’s toolbar and improving results. Why copy Google specifically when you can just track everyone equally and get even better tuning?

  7. @angus, thanks. Nice article (one of the few even mostly-balanced ones I’ve seen on this topic).

    However, I think you didn’t follow the logic entirely on one point. You said:

    “the query is encoded in Google’s URLs, and then the result page URL follows. In this way, though, it does not look like Google is being treated any differently from any other web site with human-friendly URLs.”

    “there is nothing in this investigation to suggest that Bing doesn’t have some special ‘google filter’ running on their server side to extract Google clicks from the rest of the tracking data. However, there’s also absolutely no evidence to suggest that this does happen.”

    While Google’s URLs are “human-friendly”, the only way for someone (you, me or Bing) to get the search term from the followed link is for them to have manually “trained” a parser to handle the URL. This leads me to my belief that they (Bing) can’t get meaningful feedback from user clicks unless the user is on a site for which they (Bing) have specifically trained their parsers to handle the URL.

    Doing this at a news outlet drives clicks/users to the news site and (potentially) increases the revenue for the site. Doing it for a competing search engine causes the user to bypass the competitor and the competitor loses money.

    At one point I had a form on my home page that allowed searching on a half dozen sites (Google, Wikipedia, etc.); the form would construct URLs just like those sites’ own search boxes do, of course. It took frequent effort to keep up with the changes in URLs created by the target sites’ search forms. So the implication that the ranking engine can just “get it” from the URL doesn’t make sense to me.

    @CGomez: “you could improve search rankings by simply watching how users who end up at a particular domain or page stay there for long periods of time”

    I think you’re simplifying too much. If I’m searching for a term or phrase, just because other users (or even me!) spend any amount of time on a site doesn’t mean it’s meaningful in any way to my current search.
