Sunday, 30 October 2011

DITA Assignment, Part One - Web 1.0 - Data Retrieval vs Information Retrieval

This short essay seeks to answer a seemingly innocuous question: what are the differences between data retrieval (DR) and information retrieval (IR)?  To the layman, the difference between the two concepts may seem hazy.  Yet the two are fundamentally different.

First, however, it is important to be clear on what data and information actually are.  Data may be described as “a term for quantitative or numerically encoded information”, whilst information is “data that has been processed into a meaningful form” (Feather & Sturges, 2003).

Data is usually stored in a database, a “systematically ordered collection of information” (Feather & Sturges, 2003).  Retrieving data from the database requires the use of a query language, such as SQL.  This is a “structured way for retrieving search requests”, using artificial language commands (Feather & Sturges, 2003).

According to Baeza-Yates and Ribeiro-Neto (1999) a “data retrieval language aims at retrieving all objects which satisfy clearly defined conditions such as those in a regular expression or in a relational algebra expression. Thus, for a data retrieval system, a single erroneous object among a thousand retrieved objects means total failure.”

To clarify, database queries are structured as such:-

select ColumnA from TableB where CriteriaC_is_met

Any error in this structure - however minor - will cause the search to fail: the query is either rejected outright or returns no matches. (For more examples of SQL search queries, see here.)
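
The all-or-nothing behaviour described above can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration: the table name, column names and rows are hypothetical and are not taken from the database used on the course.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (title TEXT, publisher TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [
        ("Databases for Beginners", "Prentice Hall", 1980),
        ("Searching the Web", "Routledge", 2003),
        ("Indexing Basics", "Prentice Hall", 1995),
    ],
)


def run_query(sql):
    """Return the matching rows, or an error message if the query is malformed."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return f"query failed: {exc}"


# A well-formed query returns exactly the rows that satisfy the condition.
matches = run_query(
    "SELECT title FROM titles WHERE publisher = 'Prentice Hall' ORDER BY title"
)

# A condition that nothing satisfies returns no rows: there is no 'near miss'.
no_matches = run_query("SELECT title FROM titles WHERE publisher = 'Prentice Hal'")

# A structural error (a misspelt keyword) makes the whole query fail.
broken = run_query("SELEC title FROM titles WHERE publisher = 'Prentice Hall'")

print(matches)
print(no_matches)
print(broken)
```

A slightly misspelt value gives an empty result and a misspelt keyword gives an error; in neither case does the system try to guess what was meant.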

Information, however, is largely unstructured, existing in a number of formats and indexed in different ways.  Consequently information retrieval is based upon user information needs, and these are naturally subjective (Rosenfeld & Morville, 2007).  This means two things: –
  1. search queries will be based on those user needs and;
  2. search results will either be relevant or not.
To take point one, information queries may be divided into different types:-  navigational (searching for a website); transactional (searching for a service); or informational (searching for information on a certain subject) (MacFarlane, 2011).  The user may know exactly what they want to find; then again, they may not.  This ‘anomalous state of knowledge’ (ASK) informs the type of search query the user makes.  Where IR departs from DR is that IR search queries may take on different forms, for example, natural language and Boolean queries. (For a table outlining the differences between IR and DR, refer to Appendix A).

From personal study using various search queries on two different search engines (Google and Bing), natural language queries generally return relevant results, although using quotation marks and deleting stop words will narrow the search and increase precision.  Boolean operators also returned different results, as both search engines interpreted search queries in different ways (see Appendix B for the results of the above study).

Depending on the type of information required (e.g. transactional or informational), it is likely that search queries will return different results.  For example, Anne is doing a project on the Captain Swing Riots of the 19th century.  She wants as much information as possible, and decides to use two different search engines and compare their results.  In both Google and Bing she types in the natural language query ‘Who is Captain Swing?’ (minus quotation marks).  Google’s results are all relevant.  Bing’s top-rated result is also relevant, but all the following results are irrelevant (returning information on a band called ‘Captain Swing’).  Curious, Anne then deletes the stop words from her previous query, and types “Captain Swing” into both search engines (quotation marks included).  This time four of Google’s top five results are relevant; only one of Bing’s top five results is relevant.  Therefore, of the two search engines, Google has satisfied her needs more effectively.

Later, while using the natural language query ‘what are Jerusalem artichokes and how do I cook them?’, Anne discovers that many of the results are about growing artichokes.  This time she uses another strategy to narrow down her search – Boolean operators.  She types in ‘Jerusalem artichokes AND cook NOT grow’.  This is effective in the Bing search engine, but not in the Google search engine.  She later discovers that Google accepts other forms of Boolean operators, and that by typing ‘Jerusalem artichokes + cook – grow’, she will again find more relevant results.
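
As a rough illustration of what a Boolean query such as ‘Jerusalem artichokes AND cook NOT grow’ asks for, here is a minimal Python sketch using set operations over a tiny word index. The four documents are hypothetical, and this is not how Google or Bing actually process queries; it only shows the logic of the AND and NOT operators.

```python
# Hypothetical mini-collection, invented for illustration.
documents = {
    1: "how to cook jerusalem artichokes in butter",
    2: "how to grow jerusalem artichokes in your garden",
    3: "grow and cook your own jerusalem artichokes",
    4: "how to cook globe artichokes",
}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


index = build_index(documents)


def containing(word):
    """Return the ids of all documents containing the word."""
    return index.get(word, set())


# jerusalem AND artichokes AND cook NOT grow
# AND is set intersection (&); NOT is set difference (-).
result = (
    containing("jerusalem") & containing("artichokes") & containing("cook")
) - containing("grow")

print(sorted(result))
```

Document 3 mentions cooking but is excluded because it also mentions growing, which shows how rigid the NOT operator can be.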

As can be seen, natural language queries deal in a certain amount of ambiguity, and may not necessarily provide appropriate results.  With data retrieval, a search provides either a match or no match.  With information retrieval, a search must fulfil the user’s need. In short, it must be relevant.

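There are two ways of judging relevance – binary judgement (where something is relevant or it is not), or graded judgement (when some results are more relevant than others).  User satisfaction in IR may be evaluated by calculating the recall or the precision of the search results, where recall is the proportion of all the relevant documents that are actually retrieved, and precision is the proportion of the retrieved documents that are relevant.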

It is important to note that there is generally an inverse relationship between recall and precision - as one increases, the other tends to decrease.
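
Recall and precision are simple ratios, and a short Python sketch may make them concrete. It uses the standard definitions (precision = relevant retrieved / all retrieved; recall = relevant retrieved / all relevant) with binary relevance judgements. The ranked list and the judgements are hypothetical, not the figures from Appendix B.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


# Hypothetical ranked result list and binary relevance judgements.
ranked_results = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
relevant_docs = {"d1", "d2", "d4", "d8"}

top_2 = ranked_results[:2]  # a narrow search: only the first two results
top_8 = ranked_results[:8]  # a broad search: all eight results

print(precision(top_2, relevant_docs), recall(top_2, relevant_docs))
print(precision(top_8, relevant_docs), recall(top_8, relevant_docs))
```

In this made-up case the narrow search has perfect precision but finds only half of the relevant documents, while the broad search finds them all at the cost of halving precision.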

There are drawbacks to different methods of IR.  Boolean operators are not intuitive, but rigid; a search on teaching French in schools may equally return results on teaching in French schools (Feather & Sturges, 2003).  Likewise, natural language queries may result in low-precision results due to irrelevant documents that contain high levels of keywords “by chance or out of context” (Lee, Seo, Jeon and Rim, 2011).  Deleting stop words and adding quotation marks decreases recall. There are many ways in which user needs may not be satisfied, and there is no ‘right way’ of improving search results.  This is simply because it is the user’s needs that determine the type of search query used.

It is therefore important that the information to be searched is appropriately managed.  For example, is it in the correct format?  Should it be searched through keywords or keyphrases?  What about conflating words, including synonyms, and ignoring stop words?  These methods are all vital in making information more accessible to the user (MacFarlane, Butterworth and Krause, 2011).
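
To give a feel for what conflating words, including synonyms and ignoring stop words might involve, here is a deliberately crude Python sketch. The stop-word list, the synonym table and the plural-stripping rule are all assumptions made for illustration; a real system would use a proper stemmer and a much larger vocabulary.

```python
# Hypothetical stop-word list and synonym table, invented for illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "who", "what", "how", "to"}
SYNONYMS = {"soccer": "football"}


def conflate(word):
    """Very crude conflation: strip a plural 's' (real systems use a stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def index_terms(text):
    """Turn free text into a list of normalised index terms."""
    terms = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?'\"()")
        if not word or word in STOP_WORDS:
            continue  # ignore stop words and bare punctuation
        word = conflate(word)
        terms.append(SYNONYMS.get(word, word))  # map synonyms to one term
    return terms


print(index_terms("Who is Captain Swing?"))
print(index_terms("The best soccer sites in the world"))
```

If both the documents and the queries pass through the same normalisation, a search for 'soccer sites' and a page about a 'football site' end up sharing index terms.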

To conclude, data and information retrieval could not be more different.  Data has the advantage of not being subject-based.  A database is built with its own well-defined semantics.  It is the opposite for IR.  There are no well-defined semantics, and so the IR system has to interpret the semantic content of the documents and bring together what it deems relevant.  Reaching this goal appears to be a two-way street.  The information in the document itself must be well-managed by the creator; the user must also use an appropriate IR method according to his or her own information needs.  Likewise, the evaluation of search results will be determined subjectively by the user, according to those needs.


Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. [online] Boston, Massachusetts: Addison Wesley Longman Inc. Available at: [Accessed: 22 October 2011].

Feather, J. and Sturges, R. P. eds. (2003). International Encyclopedia of Information and Library Science. 2nd ed. London: Routledge.

Karlgren, J. (2004). Information retrieval: introduction. [online] Available at: [Accessed: 23 October 2011].

Lee, J., Seo, J., Jeon, J. and Rim, H. (2011). ‘Sentence-based relevance flow analysis for high accuracy retrieval.’  Journal of the American Society for Information Science & Technology [e-journal] 62 (9), pp. 1666-1675. Available through: JSTOR [Accessed: 25 October 2011].

MacFarlane, A. (2011). Lecture 04: Information Retrieval, INM348 Digital Information Technologies and Architectures. City University London [unpublished].

MacFarlane, A., Butterworth, R. and Krause, A. (2011). Lecture 03: Structuring and querying information stored in databases. INM348 Digital Information Technologies and Architectures. City University London [unpublished].

Rosenfeld, L. and Morville, P. (2007).  Information Architecture for the World Wide Web. 3rd ed. Cambridge: O'Reilly.

Appendix A

The following table by The Swedish Institute of Computer Science (SICS) clearly summarises the difference between data and information retrieval. [Accessed: 23 October 2011]

Information vs Data Retrieval

Exact match
Partial match
Query language
Natural (... well)
Query specification
Items wanted

Appendix B

The results of an exercise calculating the precision of various search results from Google and Bing.  The original spreadsheet may be viewed at

Wednesday, 26 October 2011

Web 2.0 - The internet as platform

This week's lecture introduced the idea of Web 2.0.

Web 2.0 is the idea of the internet as a platform, rather than a computer as a platform.  It is the web that can be written to as well as read.  Back in the day, if you wanted to put something on the web, it involved learning HTML and writing up a web-page manually.  Waaaaaay back in 1997, I had to go to my college library, get out a book on HTML, learn the basics, sit down in Notepad and type out my website.  Yawn.  And more often than not, it looked pretty pants.

Now fast-forward to 2011.  Why the hell would you want to bother with typing out pages of HTML just to make a website?  Google Sites has pre-made templates for you already.  Dreamweaver can write all the code for you.  Your thoughts can be put up on the web in a matter of minutes if you have a blog.  Seconds if you're on Facebook or Twitter.  And whatever type of computer platform you have, everything looks and works virtually the same.

This is the crux of Web 2.0.  Everyone can publish on it without having any technical skill at all.  Web 2.0 effectively harnesses network effects: its services get better the more people use them.  Social networks are at the heart of Web 2.0, and it can be said to be so successful because, through sites like Facebook, it mirrors the social networks and interactions in our everyday lives.

Interaction is basically at the heart of Web 2.0.  And this interaction isn't only of the purely social kind.  We all have the ability to become pseudo-experts by contributing to Wikipedia.  We can make Amazon better by leaving reviews.  We can create our own tags on Flickr and Delicious - indeed, in the ocean of un-indexed junk out there, folksonomies are becoming one of the most effective ways of organising web information.

The World Wide Web (version 2.0).

But all this inevitably comes at a price.  Some of the issues we discussed are mapped out here:-

  • Does having the ability to log every mundane event in your everyday life with ease create a propensity for narcissism?
  • Is Wikipedia as reliable as, say, the Encyclopaedia Britannica, and does it promote a culture of amateurism?
  • Does the internet, as a platform for freedom of speech, promote a 'safe' environment for people to act in an offensive and derogatory manner?
  • Would it be fair to say that important events somehow become trivialised by the hype of the web?
  • What does the privacy settings change on Facebook in February 2010 have to say about the public nature of the data we may unwittingly put on the web?
  • How do we deal with the ephemeral nature of the web?
What do you think?

Wednesday, 19 October 2011

Information Retrieval versus Data Retrieval

This week's exercises finally put last week's mind-mushing exercises into perspective.

The purpose was to draw a line between data retrieval (what we did last week) and information retrieval (what we did this week).  Data retrieval is the kind of thing we do when we query a database.  With information retrieval, the results are 'subjectively' relevant - I have a huge heap of documents, and I need to decide which are relevant to my needs and which are not.

I am a prolific user of Google.  I use it at least a dozen times a day, if not more.  It is easy to think from the perspective of a user.  There is something I want.  I type it in the search engine, and I hope I find what I'm looking for.  Sometimes, a lot of frustration ensues.

But why is this?  Why is it sometimes so difficult to find what you're looking for?

More often than not, the reason is that people don't index their 'documents' effectively.  They lose sight of what it's like to be a user themselves.

As the owner of a website (or two), this is pretty pertinent for me.  It is all too easy to speed through the creation of a website summary, keywords or tags.  Usually you just want to get the creation bit out of the way and get going.  But do those keywords fulfil users' needs?  For example, you run a football website.  'The best football site in the world', you may profess.  But what if an American fan is searching for a football site?  What if he uses the keyword 'soccer'?  Your site may very well be the best in the world, but you're immediately cutting off a large proportion of your potential audience simply by not giving enough thought to your indexing terms.

All sites have different indexing needs, and indexing decisions should be made with the user in mind.  A Shakespeare website may want to index its documents by phrases.  For example, a user may be searching by a particular line or quote, e.g. "To be or not to be."  Therefore, indexing will have to be tailored to users' needs.
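
To illustrate why a Shakespeare site might index by phrase rather than by keyword, here is a small Python sketch comparing the two kinds of match. The identifiers and the second line of text are hypothetical, and a real search engine would use a positional index rather than scanning every line.

```python
# Hypothetical lines to index; the second one is invented for illustration.
lines = {
    "hamlet-3.1": "To be, or not to be, that is the question",
    "other-page": "Whether to be or not is not for me to say",
}


def tokens(text):
    """Lower-case the text and strip punctuation, keeping every word."""
    return [w.strip(".,;:!?'\"") for w in text.lower().split()]


def keyword_match(query, text):
    """True if every query word appears somewhere in the text, in any order."""
    words = set(tokens(text))
    return all(w in words for w in tokens(query))


def phrase_match(query, text):
    """True only if the query words appear consecutively and in order."""
    q, t = tokens(query), tokens(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


query = "to be or not to be"
keyword_hits = sorted(k for k, v in lines.items() if keyword_match(query, v))
phrase_hits = sorted(k for k, v in lines.items() if phrase_match(query, v))

print(keyword_hits)
print(phrase_hits)
```

Note that the quote is made up almost entirely of stop words, so an index that throws stop words away could not answer this query at all.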

In the computer lab, our task was to query two search engines - Google and Bing.  Easy, I thought.  Perhaps too easy.  But of course, I was wrong.

We had to use different search models in order to get to different types of information.  And depending on the type of information we had to find, certain search models worked better.

For example -

  • Natural language queries.  I don't use these very often.  But they often turned up very useful results, particularly for informational, exploratory searches (when the information wanted was explicit; finding out about the Civil War Levellers confused the search a bit, since 'levellers' could be any number of things).
  • Quotation marks - I use these most often.  They're very useful for finding documents that contain certain words or short phrases.  When a natural language query failed, adding quotation marks and deleting stop words usually helped narrow the search.
  • Boolean operators.  Something I NEVER use.  I'd always thought of them as kind of antiquated and redundant.  So they proved to be - sometimes.  I discovered that they simply do not work with Google.  However, they were compatible with Bing (Bing automatically uses the AND operator anyway).  An example - searching for Jerusalem artichokes and how to cook them.  For some reason, the search turned up quite a bit on growing Jerusalem artichokes.  Finally I found a use for the NOT operator.  With Google, this simply turned up more hits on growing artichokes.  With Bing, it seemed to work as intended.
Another part of the exercise was to calculate the precision of each search engine's results.  What did I discover?

Why, that Google has a higher precision rate than Bing.  Of course. ;)

Addicted to Google and the "Church of Search".

Sunday, 16 October 2011

DITA exercises #3 - the AFTERMATH.

I finally got some feedback about my SQL exercises from session 3 of DITA.

What I have learned from the feedback is that there are many little 'pointers' that help to 'tighten up' the commands and thus return more reliable data from your query.

For example:-

  •  Using the = sign with > or < in order to actually include the number typed in the query, rather than just those numbers lesser or greater than it. (e.g. using >= 1980 as opposed to > 1980; the former includes 1980 in the search).
  • The use of % to make sure you get as many returns on your query as possible (e.g. using "%Prentice Hall%" as opposed to "Prentice Hall"; the former includes all matches including the name Prentice Hall, not just matches comprising ONLY the name Prentice Hall).
  • Keep in mind the difference between numbers and characters - in SQL, 0028007484 is treated as a number, whilst 0-0280074-8-4 is treated as a string of characters.  Therefore the command = will work with the former but not the latter; the 'like' command should be used instead.
Helen, a fellow student, had an excellent way of explaining how queries should be arranged:-
"The columns of data you want to obtain from the database (SELECT)
The table or tables that this data is sitting in (FROM)
The clauses that limit this data to exactly what you are interested in and no more (WHERE)

select columnA, columnB, columnC
from tablenameXYZ
where criteria1_is_met and criteria2_is_met"
I also found that having a diagram of the database's structure was very useful, as it helped me to visualise where and how I could retrieve the data. 
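
The pointers above can be tried out with Python's built-in sqlite3 module. This is only a sketch: apart from the ISBN quoted above, the table, columns and rows are hypothetical, and the course database may behave differently from SQLite in places.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE titles (title TEXT, publisher TEXT, year INTEGER, isbn TEXT)"
)
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?, ?)",
    [
        ("Book A", "Prentice Hall", 1980, "0-0280074-8-4"),
        ("Book B", "Prentice Hall International", 1975, "0-1111111-1-1"),
        ("Book C", "Routledge", 1990, "0-2222222-2-2"),
    ],
)


def titles_where(condition, params=()):
    """Run SELECT ... FROM ... WHERE <condition> and return the titles found."""
    sql = "SELECT title FROM titles WHERE " + condition + " ORDER BY title"
    return [row[0] for row in conn.execute(sql, params)]


# > excludes the boundary year; >= includes it.
after_1980 = titles_where("year > 1980")
from_1980 = titles_where("year >= 1980")

# = needs the whole value to match; LIKE with % matches it anywhere in the text.
exact_publisher = titles_where("publisher = ?", ("Prentice Hall",))
any_prentice = titles_where("publisher LIKE ?", ("%Prentice Hall%",))

# An ISBN with hyphens is stored as text, so LIKE can match part of it.
isbn_prefix = titles_where("isbn LIKE ?", ("0-0280074%",))

print(after_1980, from_1980)
print(exact_publisher, any_prentice)
print(isbn_prefix)
```

Each call follows the select / from / where arrangement described in the quote: the helper fixes the first two parts and only the where clause changes.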

Wednesday, 12 October 2011

What colour do you want that database?

Like many people, if someone said 'database' to me, I'd immediately think of a spreadsheet.  So when I knew we were learning databases this week, I was a little freaked out since frankly I hadn't done spreadsheets since I was in secondary school.

Needless to say, as with most things on this course, my preconceptions were totally wrong.

Like the internet, databases have a longer history than I'd thought.  But of course, back in the day there was a lot of duplication, inconsistency and redundancy of data.  Nowadays we tend to store data centrally, in order to mitigate these problems and aid the fast retrieval of data.

As an introduction to databases, we learned the basics - querying them using SQL.  In the lecture, it all seemed fairly easy, logical and generally straightforward.  I have to admit, I haven't got the most logical of minds, but once you get something, you should be able to follow the logic to its natural conclusion, right?

Heh heh.  Sitting in the computer lab after the lecture, my mind hit a complete blank.  What did I just learn?  How does it work again?  Like maths, it's like breaking a code - once you know the formula, it's easy to unravel the answers.  And I found I hadn't quite got a handle on that formula.  A proliferation of bad and incorrect commands followed.  And even more than that, a whole lot of simple, straight guessing.  Sometimes I would guess and guess until I gave up.  When I finally asked for help, I realised that there are just some commands you need to know; and that once you know them, you have the key.  A whole lot of doors unlocked.  Every time I learned a new 'key', it made it easier to guess what the answers to some of the exercises might be.

Still, it took me an evening of querying the Database Management System at home before I managed to complete all the exercises.  I recorded each answer, because I knew that if I didn't, I would forget them.  The best analogy I can come up with is SQL is like learning a language.  You will never learn to speak it fluently unless you use it, and if you don't practice you will get rusty.  So I am now spending my time making random queries of the database, trying to get it all to stick firmly in my head.

And I guess it will - after a million or so queries of practice. ;)

Yup.  Exactly what I would've said a few days ago...

Monday, 10 October 2011

Monday, 3 October 2011

DITA von Tease

What is the internet?

Yes, I am one of those people who would have answered - it's the World Wide Web.  And if someone had asked me, "What's the World Wide Web?", I would have answered - "Duh!  It's the internet!"

Today I learned that the internet and the WWW are not the same thing.

Really, it's one of those things you've heard about and know at the back of your head, but if you were to describe the difference between the two, you wouldn't have a hope in hell of explaining it.  Well, I wouldn't anyway.

Richard had a very good way of explaining it:-

"The internet is the road, and the World Wide Web is the car that drives the road."

In other words, the net is the infrastructure, and the World Wide Web is a service that allows you to 'navigate' the net.  There are other services that allow you to traverse the net in different ways to the World Wide Web.

The Information Super Highway.  There really is a reason for that vehicular analogy!

The 'net' began in the 1960s, and was developed by the military in order to make sure that even when one computer with information went down (or was destroyed by a nuclear warhead, as the case may be), that information would still be stored and shareable amongst other computers in the network.

Back in those days, computer servers were huge, clunky things that probably needed a whole department to run them.

The days when the Information Super Highway was more like the Information Super Monolith.

Nowadays, the gap between client and server computers is narrowing.  Client computers are now becoming more and more powerful, and the traditional distinctions between the two are blurring.  In a way, we are all becoming servers.  I'm not sure exactly how it works, but torrenting and peer exchange seem to be proof of that closing gap.

Anyway, in the computer lab we got to make our first web page, after learning the basics of HTML.  We kind of take it for granted nowadays that someone else will make our webpage for us.  Many sites have web templates, or we can use a program like Dreamweaver to create one like we would create a document in Word.  The last time I actually wrote HTML was in 1997.  But even then I was no whizz-kid at it.  I certainly didn't really get it.  So what did I learn today?

  • Tags like <b> and <i> are now 'old-fashioned'.  Semantic tags like <em> and <strong> are now the way to go, because they're more 'global-friendly'.
  • I finally know what CSS stands for!  I finally know what CSS is!  WOOT! XD
  • LAN = Local Area Network; WAN = Wide Area Network; Internet = a vast network of all these networks.  TA DA!!!
See, it's really all very simple; I think that these computer scientists really just like to tease the hell out of us with their crazy acronyms... ;)