First, however, it is important to be clear on what data and
information actually are. Data may be
described as “a term for quantitative or numerically encoded information”,
whilst information is “data that has been processed into a meaningful form” (Feather & Sturges, 2003).
Data is usually stored in a database, a “systematically ordered
collection of information” (Feather & Sturges, 2003). Retrieving data from the database
requires the use of a query language, such as SQL. This is a “structured way for retrieving
search requests”, using artificial language commands (Feather &
Sturges, 2003).
According to Baeza-Yates and Ribeiro-Neto (1999), a “data retrieval language aims at retrieving all
objects which satisfy clearly defined conditions such as those in a regular
expression or in a relational algebra expression. Thus, for a data retrieval
system, a single erroneous object among a thousand retrieved objects means
total failure.”
To clarify, database queries take the following form:
SELECT ColumnA FROM TableB WHERE CriteriaC_is_met
Any error in this structure, however minor, will cause the search to fail outright, i.e. no matches. (For more examples of SQL search queries, see here.)
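The all-or-nothing behaviour described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table, its columns and its rows are invented for the example and do not come from any real database.

```python
import sqlite3

# Hypothetical in-memory table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("Title A", 1999), ("Title B", 2003)],
)


def titles_for_year(year):
    """Return every title whose year matches the condition exactly."""
    rows = conn.execute("SELECT title FROM books WHERE year = ?", (year,))
    return [title for (title,) in rows]


exact = titles_for_year(1999)      # condition met: one row comes back
near_miss = titles_for_year(1998)  # one year out: nothing, not "close" rows

# A structural error (here a misspelt keyword) makes the whole query fail.
try:
    conn.execute("SELEC title FROM books WHERE year = 1999")
    malformed_failed = False
except sqlite3.OperationalError:
    malformed_failed = True
```

A near miss in the condition returns an empty result, and a misspelt keyword is rejected before any rows are examined; at no point does the database offer an approximate answer.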
Information, however, is largely unstructured, existing in a number
of formats and indexed in different ways.
Consequently, information retrieval is based upon user information needs, and these
are naturally subjective (Rosenfeld & Morville, 2007). This means
two things:
- search queries will be based on those user needs; and
- search results will either be relevant or not.
To take point one, information queries may be divided into different
types: navigational (searching for a
website); transactional (searching for a service); or informational (searching for
information on a certain subject) (MacFarlane, 2011). The
user may know exactly what they want to find; then again, they may not. This ‘anomalous state of knowledge’ (ASK)
informs the type of search query the user makes. Where information retrieval (IR) departs from data retrieval (DR) is that IR search
queries may take on different forms, for example, natural language and Boolean
queries. (For a table outlining the differences between IR and DR, refer to Appendix A.)
From a personal study using various search queries on two different
search engines (Google and Bing),
natural language queries generally return relevant results, although
using quotation marks and deleting stop words will narrow the search and
increase precision. Boolean operators
returned different results on each engine, as the two search engines interpreted the same
queries in different ways (see Appendix B for the results of the above study).
Depending on the type of information required
(transactional, informational, etc.), it is likely that search queries will
return different results. For example,
Anne is doing a project on the Captain Swing Riots of the 19th
century. She wants as much
information as possible, and decides to use two different search engines and
compare their results. In both Google and Bing she types in the
natural language query ‘Who is Captain Swing?’ (minus quotation marks). Google’s results are all relevant. Bing’s top-rated result is also relevant,
but all the following results are irrelevant (returning information on a band
called ‘Captain Swing’). Curious, Anne
then deletes the stop words from her previous query, and types “Captain Swing” into both search engines (quotation marks included). This time four of Google’s top five results
are relevant; only one of Bing’s top five results is relevant. Therefore, of the two
search engines, Google has satisfied her needs more effectively.
Later, while
using the natural language query ‘what are Jerusalem artichokes and how do I
cook them?’, Anne discovers that many of the results are about growing
artichokes. This time she uses another
strategy to narrow down her search – Boolean operators. She types in ‘Jerusalem artichokes AND cook
NOT grow’. This is effective in the Bing
search engine, but not in the Google search engine. She later discovers that Google accepts other
forms of Boolean operators, and that by typing ‘Jerusalem artichokes + cook –
grow’, she will again find more relevant results.
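How a Boolean query such as ‘Jerusalem artichokes AND cook NOT grow’ is evaluated can be sketched with simple set operations. The example below is a minimal illustration, not a description of how Google or Bing actually work: the four documents are invented, and `build_index` and `boolean_search` are hypothetical helper names.

```python
# Toy document collection; the texts are invented for illustration.
DOCS = {
    1: "how to cook jerusalem artichokes in the oven",
    2: "how to grow jerusalem artichokes in your garden",
    3: "grow then cook your own jerusalem artichokes",
    4: "roast chicken recipe",
}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


def boolean_search(index, must, must_not=()):
    """Return ids of documents containing every `must` term and no `must_not` term."""
    if not must:
        return set()
    # AND: intersect the posting sets of all required terms.
    hits = set.intersection(*(index.get(term, set()) for term in must))
    # NOT: remove any document containing an excluded term.
    for term in must_not:
        hits -= index.get(term, set())
    return hits


INDEX = build_index(DOCS)
# Equivalent to: jerusalem AND artichokes AND cook NOT grow
result = boolean_search(INDEX, ["jerusalem", "artichokes", "cook"], ["grow"])
```

Here `result` contains only document 1. Document 3 does cover cooking, but it is discarded because it also mentions growing, which shows how strictly a Boolean query is applied.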
As can be seen, natural language queries deal in a certain amount of ambiguity, and may not necessarily provide appropriate results. With data retrieval, a search provides either a match or no match. With information retrieval, a search must fulfil the user’s need. In short, it must be relevant.
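There are two ways of judging relevance: binary judgement (where something is
relevant or it is not), or graded judgement (where some results are more relevant than
others). User satisfaction in IR may be
evaluated by calculating the recall or the precision of the search results, where:
- recall is the number of relevant documents retrieved divided by the total number of relevant documents in the collection; and
- precision is the number of relevant documents retrieved divided by the total number of documents retrieved.
It is important to note that there is generally an inverse relationship between recall and precision: as one increases, the other tends to decrease.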
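As a rough sketch of the two measures under binary relevance judgements: precision is the share of retrieved documents that are relevant, and recall is the share of all relevant documents that were retrieved. The document ids and the two result lists below are hypothetical, chosen only to show the trade-off.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical collection in which documents 1 to 6 are the relevant ones.
RELEVANT = {1, 2, 3, 4, 5, 6}

narrow = [1, 2, 3]            # a tight query: few results, all relevant
broad = list(range(1, 13))    # a loose query: documents 1 to 12

narrow_scores = (precision(narrow, RELEVANT), recall(narrow, RELEVANT))
broad_scores = (precision(broad, RELEVANT), recall(broad, RELEVANT))
```

The narrow search scores a precision of 1.0 but a recall of only 0.5, while the broad search reaches a recall of 1.0 at a precision of 0.5. By the same measure, the four relevant results in Google's top five for “Captain Swing” give a precision of 0.8 over those five results, against 0.2 for Bing.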
There are drawbacks to different methods of IR. Boolean operators are not intuitive, but rigid; a search on teaching French in schools may equally return results on teaching in French schools (Feather & Sturges, 2003). Likewise, natural language queries may result in low-precision results due to irrelevant documents that contain high levels of keywords “by chance or out of context” (Lee, Seo, Jeon and Rim, 2011). Deleting stop words and adding quotation marks decreases recall. There are many ways in which user needs may not be satisfied, and there is no ‘right way’ of improving search results. This is simply because it is the user’s needs that determine the type of search query used.
It is therefore important that the information
to be searched is appropriately managed. For example, is it in the correct format? Should it be searched through keywords or
keyphrases? What about conflating words,
including synonyms, and ignoring stop words?
These methods are all vital in making information more accessible to the user (MacFarlane, Butterworth and Krause, 2011).
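The three techniques mentioned above (ignoring stop words, conflating word forms, and mapping synonyms) can be sketched as a small text-preparation step. This is a deliberately naive illustration: the stop word list, the suffix rules and the synonym table are all assumptions made for the example, and a real system would use a proper stemmer and thesaurus.

```python
# All three tables below are illustrative assumptions, not a standard list.
STOP_WORDS = {"a", "an", "and", "are", "do", "how", "i", "in", "is",
              "of", "on", "the", "what", "who"}
SUFFIXES = ("ing", "es", "s")    # crude conflation rules, tried in order
SYNONYMS = {"revolt": "riot"}    # maps a term to its preferred form


def conflate(word):
    """Strip one common suffix so that related word forms share a single term."""
    for suffix in SUFFIXES:
        # Keep at least three letters so short words such as "swing" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Turn raw text into the list of terms that would be indexed or searched."""
    terms = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?'\"()")
        if not word or word in STOP_WORDS:
            continue
        word = conflate(word)
        terms.append(SYNONYMS.get(word, word))
    return terms


query_terms = index_terms("Who is Captain Swing?")
document_terms = index_terms("Revolts and rioting")
```

With these rules, the query ‘Who is Captain Swing?’ is reduced to the terms `captain` and `swing`, while ‘revolts’ and ‘rioting’ both end up as the single term `riot`, so a search on either word would find documents using the other.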
To conclude, data and information retrieval
could not be more different. Data has
the advantage of not being subject-based.
A database is built with its own well-defined semantics. It is the opposite for IR. There are no well-defined semantics, and so
the IR system has to interpret the semantic content of the documents and bring
together what it deems relevant.
Reaching this goal appears to be a two-way street. The information in the document itself must
be well-managed by the creator; the user must also use an appropriate IR method
according to his or her own information needs.
Likewise, the evaluation of search results will be determined
subjectively by the user, according to those needs.
Blog URL:- http://digisqueeb.blogspot.com
References
Baeza-Yates, R. and Ribeiro-Neto,
B. (1999). Modern Information Retrieval. [online] Boston, Massachusetts: Addison Wesley Longman Inc.
Available at: http://people.ischool.berkeley.edu/~hearst/irbook/index.html [Accessed: 22 October 2011].
Feather, J. and Sturges, R. P. eds. (2003). International Encyclopedia of Information and Library Science. 2nd ed. London: Routledge.
Karlgren, J. (2004). Information retrieval: introduction. [online]
Available at: http://www.sics.se/~jussi/Undervisning/IRI_vt04/Overview.html [Accessed: 23 October 2011].
Lee, J., Seo, J., Jeon, J. and Rim, H. (2011). ‘Sentence-based relevance flow analysis for high accuracy retrieval.’ Journal of the American Society for Information Science & Technology [e-journal] 62 (9), pp. 1666-1675. Available through: JSTOR [Accessed: 25 October 2011].
MacFarlane, A. (2011). Lecture 04: Information Retrieval, INM348 Digital Information Technologies and Architectures. City University London [unpublished].
MacFarlane, A., Butterworth, R.
and Krause, A. (2011). Lecture 03: Structuring and querying information
stored in databases, INM348 Digital Information Technologies and Architectures. City University London [unpublished].
Rosenfeld, L. and Morville, P. (2007). Information Architecture for the World Wide Web. 3rd ed. Cambridge: O'Reilly.
Appendix A
The following table by The Swedish Institute of Computer Science (SICS) clearly summarises the difference between data and information retrieval. [Accessed: 23 October 2011]
Information vs Data Retrieval
| | DR | IR |
| Matching | Exact match | Partial match |
| Model | Deterministic | Probabilistic |
| Query language | Artificial | Natural (... well) |
| Query specification | Complete | Incomplete |
| Items wanted | Matching | Relevant |
Appendix B
The results of an exercise calculating the precision of various search results from Google and Bing. The original spreadsheet may be viewed at http://www.student.city.ac.uk/~abkb824/Exercises.xlsx