Digi Squeeb: Information Retrieval versus Data Retrieval

This week's exercises finally put last week's mind-mushing exercises into perspective.

The purpose was to draw a line between data retrieval (what we did last week) and information retrieval (what we did this week). Data retrieval is the kind of thing we do when we query a database. With information retrieval, the results are 'subjectively' relevant - I have a huge heap of documents, and I need to decide which of them are relevant to my needs and which are not.

I am a prolific user of Google. I use it at least a dozen times a day, if not more. It is easy to think from the perspective of a user. There is something I want. I type it in the search engine, and I hope I find what I'm looking for. Sometimes, a lot of frustration ensues.

But why is this? Why is it sometimes so difficult to find what you're looking for?

More often than not, the reason is that people don't index their 'documents' effectively. They lose sight of what it's like to be a user themselves.

As the owner of a website (or two), this is pretty pertinent for me. It is all too easy to speed through the creation of a website summary, keywords or tags. Usually you just want to get the creation bit out of the way and get going. But do those keywords fulfil users' needs? For example, say you run a football website. 'The best football site in the world', you may profess. But what if an American fan is searching for a football site? What if he uses the keyword 'soccer'? Your site may very well be the best in the world, but you're immediately cutting off a large proportion of your potential audience simply by not giving enough thought to your indexing terms.
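To make the football/soccer problem concrete, here's a rough Python sketch of a keyword index that also files each page under a synonym. Everything in it is made up for illustration: the page name, the keywords and the tiny SYNONYMS table are my own assumptions, and a real search engine does far more than this.

```python
from collections import defaultdict

# Hypothetical synonym table: alternative words a user might type
# for each indexing term. Made up for this example.
SYNONYMS = {"football": {"soccer"}}

def build_index(pages):
    """Map each keyword, plus its synonyms, to the pages tagged with it."""
    index = defaultdict(set)
    for name, keywords in pages.items():
        for keyword in keywords:
            term = keyword.lower()
            index[term].add(name)
            # Also file the page under every known synonym of the term.
            for alt in SYNONYMS.get(term, ()):
                index[alt].add(name)
    return index

# Hypothetical site tagged only with 'football'.
pages = {"best-football-site": ["football", "league tables"]}
index = build_index(pages)
print(sorted(index["soccer"]))
```

Without the synonym table, a search for 'soccer' would find nothing, even though the page is exactly what that user wants.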

All sites have different indexing needs, and these should be met with the user in mind. A Shakespeare website may want to index its documents by phrases. For example, a user may be searching for a particular line or quote, e.g. "To be or not to be." Therefore, indexing will have to be tailored to users' needs.
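As a minimal sketch of what matching a quote involves, the Python below checks whether the words of a phrase appear in a document in the same order, ignoring case and punctuation. The function names are my own, and this is a simplification: I'd assume a real phrase index stores word positions up front rather than scanning every document.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def contains_phrase(document, phrase):
    """Return True if the words of the phrase appear consecutively in the document."""
    doc = tokenize(document)
    words = tokenize(phrase)
    n = len(words)
    # Slide a window of n words across the document and compare.
    return n > 0 and any(doc[i:i + n] == words for i in range(len(doc) - n + 1))

line = "To be, or not to be, that is the question"
print(contains_phrase(line, "to be or not to be"))
```

The point is that word order matters here. A document containing all six words in a different order should not count as a match for the quote.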

In the computer lab, our task was to query two search engines - Google and Bing. Easy, I thought. Perhaps too easy. But of course, I was wrong.

We had to use different search models in order to get to different types of information. And depending on the type of information we had to find, certain search models worked better.

For example -

Natural language queries - I don't use these very often. But they often turned up very useful results, particularly for informational, exploratory searches, as long as the query was unambiguous. Finding out about the Civil War Levellers confused the search a bit, since 'levellers' could be any number of things.
Quotation marks - I use these most often. They're very useful for finding documents that contain certain words or short phrases. When a natural language query failed, adding quotation marks and deleting stop words usually helped narrow the search.
Boolean operators - Something I NEVER use. I'd always thought of them as kind of antiquated and redundant. So they proved to be - sometimes. In my tests, they simply did not seem to work with Google. However, they were compatible with Bing (Bing automatically uses the AND operator anyway). An example - searching for Jerusalem artichokes and how to cook them. For some reason, the search turned up quite a bit on growing Jerusalem artichokes. Finally I found a use for the NOT operator. With Google, this simply turned up more hits on growing artichokes. With Bing, it seemed to work as intended.
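For the curious, here's a toy Python sketch of what AND and NOT mean when applied to an inverted index (a map from each word to the documents containing it). The three documents are invented for the artichoke example, and this says nothing about how Google or Bing actually handle operators internally.

```python
def build_index(docs):
    """Build an inverted index: each word maps to the ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, must=(), must_not=()):
    """AND together every term in must, then remove documents matching must_not."""
    result = set().union(*index.values())  # start from every indexed document
    for term in must:
        result = result & index.get(term.lower(), set())
    for term in must_not:
        result = result - index.get(term.lower(), set())
    return result

# Hypothetical documents, made up for this example.
docs = {
    "a": "how to cook jerusalem artichokes",
    "b": "growing jerusalem artichokes at home",
    "c": "growing and how to cook jerusalem artichokes",
}
index = build_index(docs)
print(sorted(search(index, must=["jerusalem", "artichokes", "cook"])))
print(sorted(search(index, must=["jerusalem", "artichokes", "cook"], must_not=["growing"])))
```

With AND alone, the document that mentions both cooking and growing still comes back. Adding NOT 'growing' is what removes it, which is the behaviour I was hoping for in the lab.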

Another part of the exercise was to calculate the precision of each search engine's results. What did I discover?

Why, that Google has a higher precision rate than Bing. Of course. ;)
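For anyone wondering how precision is worked out: it is the number of relevant results returned divided by the total number of results returned. Here's a small Python sketch. The result lists and relevance judgements are invented placeholders, not my actual lab figures.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved results that were judged relevant."""
    if not retrieved:
        return 0.0  # nothing retrieved; treated as zero here by convention
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)

# Hypothetical example: four results returned, two of them judged relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
print(precision(retrieved, relevant))
```

Note that 'd9' is relevant but was never retrieved. Precision ignores that; it only measures how much of what came back was useful.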