Sunday 30 October 2011

DITA Assignment, Part One - Web 1.0 - Data Retrieval vs Information Retrieval

This short essay seeks to answer a seemingly innocuous question: what are the differences between data retrieval (DR) and information retrieval (IR)?  To the layman, the difference between the two concepts may seem hazy, yet they are fundamentally different.

First, however, it is important to be clear on what data and information actually are.  Data may be described as “a term for quantitative or numerically encoded information”, whilst information is “data that has been processed into a meaningful form” (Feather & Sturges, 2003).

Data is usually stored in a database, a “systematically ordered collection of information” (Feather & Sturges, 2003).  Retrieving data from a database requires the use of a query language, such as SQL.  This is a “structured way for retrieving search requests”, using artificial language commands (Feather & Sturges, 2003).

According to Baeza-Yates and Ribeiro-Neto (1999), a “data retrieval language aims at retrieving all objects which satisfy clearly defined conditions such as those in a regular expression or in a relational algebra expression. Thus, for a data retrieval system, a single erroneous object among a thousand retrieved objects means total failure.”

To clarify, database queries are structured as follows:

SELECT ColumnA FROM TableB WHERE CriteriaC_is_met;


Any error in this structure, however minor, will cause the search to fail: a malformed statement is rejected outright, and a condition that is even slightly wrong returns no matches. (For more examples of SQL search queries, see here.)
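The all-or-nothing behaviour of data retrieval can be sketched in Python using its built-in sqlite3 module. This is only a minimal illustration: the table name, its columns and its rows are invented for the purpose, and SQLite stands in for whichever database system might actually be used.

```python
import sqlite3

# Hypothetical in-memory table; the names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [
        ("Modern Information Retrieval", 1999),
        ("Information Architecture for the World Wide Web", 2007),
    ],
)

def titles_for_year(year):
    """Return every title whose year matches the condition exactly."""
    rows = conn.execute("SELECT title FROM books WHERE year = ?", (year,))
    return [title for (title,) in rows]

exact = titles_for_year(1999)      # condition met: one row comes back
near_miss = titles_for_year(1998)  # one year out: nothing comes back

# A malformed statement is rejected outright rather than partially answered.
try:
    conn.execute("SELEKT title FROM books")
    syntax_ok = True
except sqlite3.OperationalError:
    syntax_ok = False
```

There is no notion of a "close" answer here: the near-miss query returns an empty list, and the misspelled keyword raises an error instead of producing a best guess.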

Information, however, is largely unstructured, existing in a number of formats and indexed in different ways.  Consequently, information retrieval is based upon user information needs, and these are naturally subjective (Rosenfeld & Morville, 2007).  This means two things:
  1. search queries will be based on those user needs; and
  2. search results will either be relevant or not.

To take point one, information queries may be divided into different types: navigational (searching for a website); transactional (searching for a service); or informational (searching for information on a certain subject) (MacFarlane, 2011).  The user may know exactly what they want to find; then again, they may not.  This ‘anomalous state of knowledge’ (ASK) informs the type of search query the user makes.  Where IR departs from DR is that IR search queries may take on different forms, for example, natural language and Boolean queries. (For a table outlining the differences between IR and DR, refer to Appendix A.)

In a personal study using various search queries on two different search engines (Google and Bing), natural language queries generally returned relevant results, although using quotation marks and deleting stop words narrowed the search and increased precision.  Boolean operators returned different results on each engine, as the two search engines interpreted the queries in different ways (see Appendix B for the results of this study).

Depending on the type of information required (transactional, informational, etc.), search queries are likely to return different results.  For example, Anne is doing a project on the Captain Swing Riots of the 19th century.  She wants as much information as possible, and decides to use two different search engines and compare their results.  In both Google and Bing she types the natural language query ‘Who is Captain Swing?’ (minus quotation marks).  Google’s results are all relevant.  Bing’s top-rated result is also relevant, but all the following results are irrelevant (returning information on a band called ‘Captain Swing’).  Curious, Anne then deletes the stop words from her previous query, and types “Captain Swing” into both search engines (quotation marks included).  This time four of Google’s top five results are relevant; only one of Bing’s top five results is relevant.  Therefore, of the two search engines, Google has satisfied her needs more effectively.
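Anne's comparison of the quoted query can be expressed as precision over the top five results, using binary relevance judgements. The short Python sketch below uses the counts from the example; the order of the relevant and irrelevant results within each list is an assumption, and the function name is purely illustrative.

```python
def precision_at_k(judgements, k):
    """Fraction of the top k results judged relevant (binary judgement)."""
    top_k = judgements[:k]
    return sum(top_k) / len(top_k)

# True = relevant, False = irrelevant. The counts come from the example;
# the ordering within each list is assumed.
google_top5 = [True, True, True, True, False]
bing_top5 = [True, False, False, False, False]

google_p5 = precision_at_k(google_top5, 5)  # 4 relevant out of 5
bing_p5 = precision_at_k(bing_top5, 5)      # 1 relevant out of 5
```

On these figures Google's precision over its top five results is 0.8 against 0.2 for Bing, which is simply another way of stating Anne's conclusion.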

Later, while using the natural language query ‘what are Jerusalem artichokes and how do I cook them?’, Anne discovers that many of the results are about growing artichokes.  This time she uses another strategy to narrow down her search – Boolean operators.  She types in ‘Jerusalem artichokes AND cook NOT grow’.  This is effective in the Bing search engine, but not in the Google search engine.  She later discovers that Google accepts other forms of Boolean operators, and that by typing ‘Jerusalem artichokes + cook – grow’, she will again find more relevant results.
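How a Boolean query such as ‘Jerusalem artichokes AND cook NOT grow’ might be evaluated can be sketched with a small inverted index. This is a toy illustration in Python, not a description of how Google or Bing work internally; the four documents and the function name are invented.

```python
from collections import defaultdict

# Hypothetical mini-collection; the texts are invented for illustration.
documents = {
    1: "how to cook jerusalem artichokes",
    2: "how to grow jerusalem artichokes",
    3: "grow and cook jerusalem artichokes at home",
    4: "roast globe artichokes",
}

# Inverted index: term -> set of ids of the documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def boolean_and_not(required, excluded):
    """Ids of documents containing every required term and no excluded term."""
    hits = set(documents)
    for term in required:
        hits &= index.get(term, set())
    for term in excluded:
        hits -= index.get(term, set())
    return sorted(hits)

# 'Jerusalem artichokes AND cook NOT grow'
results = boolean_and_not(["jerusalem", "artichokes", "cook"], ["grow"])
```

Only document 1 survives. Document 3 also covers cooking, but it is discarded because it mentions growing, which shows how a NOT operator can remove useful results along with unwanted ones.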


As can be seen, natural language queries deal in a certain amount of ambiguity, and may not necessarily provide appropriate results.  With data retrieval, a search provides either a match or no match.  With information retrieval, a search must fulfil the user’s need. In short, it must be relevant.

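There are two ways of judging relevance: binary judgement (where something is relevant or it is not), or graded judgement (where some results are more relevant than others).  User satisfaction in IR may be evaluated by calculating the recall or the precision of the search results, where:

  Recall = number of relevant documents retrieved / total number of relevant documents in the collection
  Precision = number of relevant documents retrieved / total number of documents retrieved
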
It is important to note that there tends to be an inverse relationship between recall and precision: in practice, as one increases, the other usually decreases.
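The two measures, and the tension between them, can be illustrated with a short Python sketch. It uses the standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that are retrieved. The document ids and the two searches are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: documents 1 to 4 are the relevant ones in the collection.
relevant = {1, 2, 3, 4}

# A cautious search retrieves little, but everything it retrieves is relevant.
narrow = precision_recall({1, 2}, relevant)

# A wide search finds every relevant document, along with four irrelevant ones.
broad = precision_recall({1, 2, 3, 4, 5, 6, 7, 8}, relevant)
```

The narrow search scores a precision of 1.0 but a recall of only 0.5; the broad search reverses the figures, with a recall of 1.0 and a precision of 0.5.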

There are drawbacks to different methods of IR.  Boolean operators are rigid rather than intuitive; a search on teaching French in schools may equally return results on teaching in French schools (Feather & Sturges, 2003).  Likewise, natural language queries may produce low-precision results because of irrelevant documents that contain high levels of keywords “by chance or out of context” (Lee, Seo, Jeon and Rim, 2011).  Deleting stop words and adding quotation marks decreases recall.  There are many ways in which user needs may not be satisfied, and there is no ‘right way’ of improving search results.  This is simply because it is the user’s needs that determine the type of search query used.
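The ‘teaching French in schools’ problem can be reproduced in a few lines of Python. The sketch below contrasts a Boolean AND over individual terms, which ignores word order, with an exact phrase match of the kind that quotation marks request. Both functions and both document titles are illustrative assumptions.

```python
def matches_all_terms(query, text):
    """Boolean AND over the query terms, ignoring word order."""
    words = text.lower().split()
    return all(term in words for term in query.lower().split())

def matches_phrase(query, text):
    """Exact phrase match: the query words must appear together, in order."""
    words = text.lower().split()
    terms = query.lower().split()
    return any(
        words[i:i + len(terms)] == terms
        for i in range(len(words) - len(terms) + 1)
    )

# Hypothetical document titles.
wanted = "Teaching French in schools"
unwanted = "Teaching in French schools"
titles = [wanted, unwanted]

and_hits = [t for t in titles if matches_all_terms("teaching french schools", t)]
phrase_hits = [t for t in titles if matches_phrase("teaching french", t)]
```

The AND query returns both titles, whereas the phrase query returns only the wanted one. The phrase match is more precise, but it would also miss a relevant title worded slightly differently, which is the loss of recall noted above.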

It is therefore important that the information to be searched is appropriately managed.  For example, is it in the correct format?  Should it be searched through keywords or keyphrases?  What about conflating words, including synonyms, and ignoring stop words?  These methods are all vital in making information more accessible to the user (MacFarlane, Butterworth and Krause, 2011).
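A rough idea of what conflating words, including synonyms and ignoring stop words involves is given by the Python sketch below. Everything in it is a simplifying assumption: the stop word list and synonym table are tiny invented stand-ins, and the suffix stripping is a crude substitute for a proper stemming algorithm.

```python
# All three tables are tiny, hypothetical stand-ins for real resources.
STOP_WORDS = {"a", "and", "do", "how", "i", "is", "the", "them", "what", "who"}
SYNONYMS = {"bake": "cook", "roast": "cook"}
SUFFIXES = ("ing", "s")  # crude conflation, not a real stemmer

def conflate(word):
    """Strip one common suffix so that word variants share an index term."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, trim surrounding punctuation, drop stop words, then conflate and map synonyms."""
    terms = []
    for raw in text.lower().split():
        word = raw.strip("?.,!'\"")
        if not word or word in STOP_WORDS:
            continue
        word = conflate(word)
        terms.append(SYNONYMS.get(word, word))
    return terms

query_terms = index_terms("How do I cook Jerusalem artichokes?")
doc_terms = index_terms("Roasting artichokes")
```

After this processing the query becomes cook, jerusalem, artichoke and the document becomes cook, artichoke, so a document about roasting can be matched to a query about cooking even though the two share only one word on the surface.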

To conclude, data and information retrieval could not be more different.  Data has the advantage of not being subject-based.  A database is built with its own well-defined semantics.  It is the opposite for IR.  There are no well-defined semantics, and so the IR system has to interpret the semantic content of the documents and bring together what it deems relevant.  Reaching this goal appears to be a two-way street.  The information in the document itself must be well-managed by the creator; the user must also use an appropriate IR method according to his or her own information needs.  Likewise, the evaluation of search results will be determined subjectively by the user, according to those needs.

Blog URL:- http://digisqueeb.blogspot.com


References

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. [online] Boston, Massachusetts: Addison Wesley Longman Inc. Available at: http://people.ischool.berkeley.edu/~hearst/irbook/index.html [Accessed: 22 October 2011].

Feather, J. and Sturges, R. P. eds. (2003). International Encyclopedia of Information and Library Science. 2nd ed. London: Routledge.

Karlgren, J. (2004). Information retrieval: introduction. [online] Available at: http://www.sics.se/~jussi/Undervisning/IRI_vt04/Overview.html [Accessed: 23 October 2011].

Lee, J., Seo, J., Jeon, J. and Rim, H. (2011). ‘Sentence-based relevance flow analysis for high accuracy retrieval.’  Journal of the American Society for Information Science & Technology [e-journal] 62 (9), pp. 1666-1675. Available through: JSTOR [Accessed: 25 October 2011].

MacFarlane, A. (2011). Lecture 04: Information Retrieval, INM348 Digital Information Technologies and Architectures. City University London [unpublished].



MacFarlane, A., Butterworth, R. and Krause, A. (2011). Lecture 03: Structuring and querying information stored in databases, INM348 Digital Information Technologies and Architectures. City University London [unpublished].

Rosenfeld, L. and Morville, P. (2007).  Information Architecture for the World Wide Web. 3rd ed. Cambridge: O'Reilly.


Appendix A

The following table by the Swedish Institute of Computer Science (SICS) clearly summarises the difference between data and information retrieval (Karlgren, 2004). [Accessed: 23 October 2011]


Information vs Data Retrieval

                      DR              IR
Matching              Exact match     Partial match
Model                 Deterministic   Probabilistic
Query language        Artificial      Natural (... well)
Query specification   Complete        Incomplete
Items wanted          Matching        Relevant



Appendix B

The results of an exercise calculating the precision of various search results from Google and Bing.  The original spreadsheet may be viewed at http://www.student.city.ac.uk/~abkb824/Exercises.xlsx



