Thursday, 29 December 2011

Web 2.0 and Web 3.0 – A Gamer’s Perspective


With the advent of faster internet connections, online gaming has grown in popularity; likewise, the emergence of Web 2.0 has given birth to the world of social networking.  But these two entities are not always entirely exclusive of one another.  Along with Facebook have also grown a plethora of social network games, such as Farmville and Mob Wars, where participants can interact with one another as well as play the game itself.

Similar games such as IMVU and Second Life include the option for members to create custom content (CC), which adds an extra dimension of interactivity for participants, allowing them to trade, exchange, and even buy and sell custom creations via the internet.  This essay will discuss the ways in which Web 2.0 and 3.0 have enabled users to create and exchange such content, as well as swap tutorials, discuss game information, and find an audience.  Specifically the essay will look at The Sims (2000), an old game by Maxis (now a subsidiary of Electronic Arts) which, despite two newer sequels, still retains a large following.  Many sites related to this game are now obsolete, motivating fans to preserve and archive custom content which can no longer be found elsewhere.  Three points will be discussed: - 1) how gamers have used Web 2.0 and 3.0 tools to effectively create, archive and publish The Sims custom content on the web; 2) how they use current advances in ICTs to create ‘information networks’ with other fans and gamers and; 3) how the archiving of custom content has been digitally organised and represented on the web.  Lastly, we will see how present technologies can be further used to enhance gamers' experiences in the future.

1. Creating and Publishing Custom Content (CC)

1.1.  XML and the creation of custom content for The Sims

Objects in the game The Sims are stored in IFF (Interchange File Format), a standard file format which was developed by Electronic Arts in 1985 (Seebach, 2006).  Over the years, gamers and fans have discovered ways to manipulate IFF files, and thus clone and customise original Sims objects.  This has been done with the tacit endorsement of the game creator, Will Wright (Becker, 2001), and has been supported by the creation of various open-source authoring and editing tools such as The Sims Transmogrifier and Iff Pencil 2.

The Sims Transmogrifier enables users to ‘unzip’ IFF object files, exporting the sprites into BMP format, and converting the object’s metadata into XML format (see Fig. 1).  XML (eXtensible Markup Language) is a standard ‘metalanguage’ format which “now plays an increasingly important role in the exchange of a wide variety of data on the web” (Chowdhury and Chowdhury, 2007, p.162).  Evidence of its versatility is that it can be used to edit the properties of Sims objects, either manually or via a dedicated program. The game application then processes the data by reading the tags in context.

Fig. 1 - An example of a Sims object's metadata expressed in XML.  The tags can be edited to change various properties of the object, such as title, price and interactions.

This highlights two points relating to the general use of XML:- a) it is versatile enough to be used to edit the properties of game assets and; b) it is accessible enough for gamers to effectively edit it in order to create new game assets.
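As a rough illustration of point b), the Python sketch below changes the title and price of an object held in a small XML fragment.  The tag names and values are hypothetical stand-ins chosen for this example; they are not Transmogrifier's actual export format, which appears only in Fig. 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical object metadata. The tag names are assumptions made for
# illustration, not the real Transmogrifier export format.
SAMPLE_XML = """
<object>
  <title>Plain Wooden Chair</title>
  <price>80</price>
  <description>A simple chair.</description>
</object>
"""

def edit_object(xml_text, new_title, new_price):
    """Return the XML text with the title and price elements replaced."""
    root = ET.fromstring(xml_text.strip())
    root.find("title").text = new_title
    root.find("price").text = str(new_price)
    return ET.tostring(root, encoding="unicode")

edited = edit_object(SAMPLE_XML, "Painted Wooden Chair", 120)
print(edited)
```

Only the two named elements are touched; everything else in the fragment is passed through unchanged, which is roughly what a creator does when editing such a file by hand.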

1.2. Pay vs. Open-Source

The pay vs. open-source debate has not left The Sims world unscathed.  In fact, it is a contentious issue that has divided the community for years.  CC thrives on open-source software; indeed, cloning proprietary Maxis objects in order to create new ones would not be possible without free programs such as Transmogrifier.  These tools have supported the growth of an entire creative community that modifies game content.  However, there are software tools that are not open-source and which are often met with disapproval by gamers.  Some CC archive sites, such as CTO Sims, have opted to make these programs available to their members for free, which has been controversial for some.

Even more contentious is the selling of CC, which has been made possible with the use of APIs (see Fig. 2).  Many players feel that they should not have to pay for items which have, after all, merely been cloned from Maxis originals.  Some creators choose to sit on the fence, putting Paypal donation buttons on their sites for those who wish to contribute to web-hosting costs.  It seems that such a debate would not exist without online transaction web services like Paypal, which have fed a controversial market in CC.  On the other hand, it is largely the open source 'movement' which has allowed creators to make their content in the first place, and so the latest web technologies have essentially fuelled both sides of a debate which looks set to continue.

Fig. 2 - A Paypal API added to the site Around the Sims.  The donation button allows payments to be made directly through Paypal's site.  Registration is not necessary.

1.3. Publishing CC on the Web

There are several methods through which creators publish their work.  One is blogging.  Fig 3 shows an example of a creator's blog, Olena's Clutter Factory, which showcases her CC and allows gamers to download them.  The Clutter Factory is a very successful site – it not only provides high quality downloads, but also a level of interactivity not often seen with The Sims fan sites.  Because it is hosted by Blogger, it allows fellow Google members to follow updates as new creations are added, post comments, and browse via tags.

Fig. 3 - A blog entry at Olena's Clutter Factory

Another example of effective publishing is the author's website, Sims of the World, which is hosted by Google Sites (see Appendix 1).  Google Sites allows the webmaster to use a variety of tools in order to publish data on the web.  This includes the ability to create mashups using what Google calls 'gadgets'.

Google gadgets are essentially APIs, “miniature objects made by Google users … that offer … dynamic content that can be placed on any page on the web” (Google, 2011).  Examples of Google gadgets are the Picasa Slideshow and the +1 facility.  Sims of the World uses Picasa Slideshow to display the site owner's online portfolio to fellow gamers in a sidebar.  Gamers are therefore able to view examples of the creator's work before choosing to download it.  If they like what they see, they have the option to use the +1 button, which effectively recommends the site to other gamers by bumping up its rating on Google's search engine.

Google Sites also allows the addition of outside APIs.  Sims of the World has made use of this facility by embedding the code from a site called Revolver Maps, which maps visitors to the site in real time.  Not only can the webmaster view previous, static visits to the site, they can also view present visitors to the site via a 'beacon' on the map.  This tool makes it possible to view visitor demographics and the popularity of the site.  For a more in-depth view, the site's visits may be mapped via Google Analytics, which provides more detailed information.

Google itself is an excellent tool for web publishers as it accommodates editing whilst on the move.  The Google Chrome browser now comes with a handy sync function, which stores account information in the 'cloud', and allows it to be accessed wherever the webmaster happens to be.  Browser information such as passwords, bookmarks and history is stored remotely, which makes accessing the Google Sites manager quick and convenient, with all relevant data available immediately.  This is an invaluable tool for webmasters, as well as for the regular internet surfer.

2. Social Networking and The Sims

"Participation requires a balance of give (creating content) and take (consuming content)." 
(Rosenfeld and Morville, 2002, p.415).

The customisation of game assets from The Sims has opened up a huge online community which shares, exchanges and even sells CC.  The kudos attached to the creation of CC means that creators often seek to actively display and showcase their work on personal websites, blogs and Yahoo groups.  This has created a kind of social networking where creators exchange links to their websites, comment on one another’s works, and display their virtual portfolio.  As Tomlin (2009, p.19) notes: “The currency of social networks is your social capital, and some people clearly have more social capital than others.”
How then do creators promote their social capital?

Gamers have been quick to use social networking in order to showcase creations or keep the wider community informed.  An example is the Saving the Sims Facebook page, which has a dedicated list of CC updates, as well as relevant community news.  Other examples include Deestar's blog, Free Sims Finds, which also displays pictures of the latest custom content for The Sims, Sims 2 and Sims 3.

Forums have also become the lifeblood of The Sims gaming community and its “participation economy” (Rosenfeld and Morville, 2002).  These include fora such as The Sims Resource, CTO Sims and Simblesse Oblige, which give gamers the opportunity to share hints, tips, and other game information as well as their own creations.  They also offer the chance to create relationships with other gamers and fans.  Themed challenges and contests are a popular pastime at these fora, with prizes and participation gifts given out regularly.  Threads are categorised by subject, can be subscribed to, and at CTO can even be downloaded as PDFs for quick, offline reading.

Simblesse Oblige is a forum that specialises in gaming tutorials.  Unlike webpage-based tutorials, Simblesse Oblige's tutorials have the advantage of being constantly updated by contributors (similar to the articles on Wikipedia).  Forum members may make suggestions in the tutorial thread, add new information to the tutorials, or even query the author.  Members are welcome to add their own tutorials (with the option of uploading explanatory images) to the forum and share their own knowledge with peers.  This constantly changing, community-driven forum is a far cry from the largely one-dimensional, monolithic Usenet lists of Web 1.0.

3. Digital Organisation of The Sims CC

3.1. Archiving old CC

The Sims is an old game, and over the years many CC sites have disappeared.  In recent years The Sims has seen a resurgence in popularity, and consequently there has been a move to rescue and preserve such content via sites such as CTO Sims, Saving the Sims, or Sims Cave.  Unfortunately, this 'rescue mission' has been a slap-dash, knee-jerk reaction to the sudden disappearance of so much ephemera - items that no one thought they would need.  Much has been done to save the contents of entire websites via the Internet Archive's Wayback Machine, for instance, but this cannot guarantee that any one website has been scraped in its entirety.

Sims collectors or archivists have resorted to community donations to help save what has been lost.  This is a patchy process, as gamers generally do not collate custom content in an organised fashion, sometimes not knowing where their 'donations' come from or what exactly they consist of.  Thus, archiving custom content is a time-consuming business which involves determining the nature of each file, labelling it, determining its properties and provenance and publishing it under the correct category (e.g. type of custom content, original creator/website etc.).

3.2. The Future of CC Archiving?

CTO Sims is a site - part archive, part forum - which attempts to rescue custom content from dead sites and publish it for the use of the community, categorising each item as closely to the original host site as possible.  However, this is often impossible, as some sites are no longer accessible even via the Internet Archive.  Since additions to the archive are made on a largely ad hoc basis (as and when donations are uploaded by the community), categorising is often inefficient, and duplications often occur, since there is no catalogue.  The number of files is ever-expanding, creating an increasing need for a cataloguing or indexing system of some kind. 

With the use of XML, RDF (Resource Description Framework) and other indexing tools, this may be the perfect time for such a system to be developed by the community.  It is hoped that present and future technologies will aid in the further preservation of these niche ephemeral items, which display a unique side of online pop-culture.  (See Appendix 2 for an example of a proposed RDF schema).

References and Bibliography

Becker, D., 2001. Newsmaker: The Secret Behind ‘The Sims’. CNET News, 16 March 2001.  Available at: [Accessed on 28 December 2011].

Brophy, P., 2007. The Library in the Twenty-First Century. 2nd ed.  London: Facet.

Chowdhury, G. G. and Chowdhury, S., 2007.  Organizing Information: from the shelf to the web.  London: Facet.

Feather, J. and Sturges, R. P. eds., 2003. International Encyclopedia of Information and Library Science. 2nd ed. London: Routledge.

Google, 2011.  Gadgets. [online] Available at: [Accessed on 28 December 2011].

Rosenfeld, L. and Morville, P., 2002.  Information Architecture for the World Wide Web. 2nd ed.   California: O’Reilly.

Seebach, P., 2006. Standards and specs: The Interchange File Format (IFF). 13 June 2006.  Available at: [Accessed on 27 December 2011].

Sihvonen, T., 2011. Players Unleashed! Modding The Sims and the Culture of Gaming. Amsterdam: Amsterdam University Press.

Tomlin, I., 2009. Cloud Coffee House: The birth of cloud social networking and death of the old world corporation.  Cirencester: Management Books 2000.

Appendix 1 

An example of a mashup on Sims of the World.

Appendix 2 

An example of a proposed RDF schema applied to CC for The Sims, composed by the author.

This picture shows a skin for The Sims.  The skin file (.skn, .cmx and .bmp) would be the subject.  ‘The Seamstress’ would be the object.  The predicate would be ‘has title’.

Here are examples of other RDF triples that can be applied to this item:-

Subject: this item
Object: skin file

Subject: this item
Object: Ludi_Ling

Subject: this item
Object: 11/08/2011

Subject: this item
Object: birthday present for JM

Subject: this item
Object: rar.

Subject: this item
Object: Member Creations/Skins
Predicate: Category
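To make the idea of a triple a little more concrete, here is a minimal Python sketch that stores triples as plain tuples and looks them up by predicate.  It is only an illustration, not an actual RDF serialisation: the subject "this item" is a placeholder rather than a real URI, and only the two triples whose predicates are stated above ('has title' and 'Category') are encoded.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Only the triples whose predicates are given in the appendix are encoded.
# "this item" is a placeholder label, not a real URI.
ITEM = "this item"
triples = [
    Triple(ITEM, "has title", "The Seamstress"),
    Triple(ITEM, "Category", "Member Creations/Skins"),
]

def objects_for(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]

print(objects_for(triples, ITEM, "has title"))
```

A catalogue built this way could, in principle, answer questions such as "which items fall under this category?" by filtering on the predicate.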

Sunday, 4 December 2011

The World of Open

One thing I'm all for is the world of open on the web.  I can't be bothered with registering, subscribing and paying.  Surely the whole point of the net is to have everything at your fingertips?  Who wants to jump through all those hoops?

I'm a big fan of FileHippo.  Many of the programs I've downloaded are open source.  I'm not crazy about the whole open source movement (I take a practical approach rather than an ideological one), but I fully support anything to do with open source and open data.  With freedom of information, it makes it a whole lot more practical to make everything available online instead of spending hundreds of man hours working through request after request.  Not to mention the advantages of creating RDF schemas for such data and creating huge mashups of information.

On a personal level, open source comes a lot into my everyday life creating for the Sims.  There is a big divide in the community about being 'fileshare friendly', or choosing to withhold or charge for the content you have created.  This goes not only for custom items, skins and objects, but also for custom programs made to get the most out of the game.  It is a very contentious issue, and still divides the community after many years.  It seems that open source is something that people feel strongly about on both sides of the divide.

Sharing my creations freely is, for me, being part of an online community to get the best out of your game.  The ability to push the boundaries with your creations and 'show off' what you can achieve is also a big motivation.  It is also a great way for creators and users to share their ideas and needs, to look outside the box and to create things from a whole new viewpoint.  Creators create in response to other players' needs as well as their own.  This makes the community atmosphere very productive, stimulating and open-minded.  And more often than not, people are passionate enough about a service to donate in order to cover costs, so it works both ways.

Friday, 25 November 2011

The journey into Web 3.0....

I'd always read of the semantic web as Web 2.0.  So it was a surprise to come across the concept of Web 3.0.  Actually, the term 'semantic web' has been bandied around for a long time.  What it actually is is up for debate.  The easiest way to think about it is in terms of Web 1.0 = read only, Web 2.0 = read/write, and Web 3.0 = read/write/execute.

What I take from the concept of Web 3.0 is the idea of making machines smarter, more 'intuitive'.  Of course, since computers are inherently stupid, this involves us giving them the means with which to become smarter and more intuitive.  Like giving them the semantics to work with, in order to retrieve data more effectively.  In short, data is not only machine readable, but machine understandable.

This is all technically theory, since the semantic web doesn't really exist - yet.  We still don't really have a system whereby computers can build programmes, websites, mashups, what have you, in an intelligent and 'intuitive' way according to our needs.  But the potential is there, with tools like XML and RDF.  These involve the creation of RDF triples, taxonomies, and ontologies.  What the hell, you may ask?  And I may well agree with you.  How I've come to understand it is that these are essentially metadata applied to information in order to relate those pieces of information together in a semantic way.  These relationships between data or pieces of information make it easier to retrieve them, because they are now given a context within a larger whole.

This does have its parallels with Web 1.0.  RDF Schemas (taxonomies expressed as groups of RDF triples) have a certain similarity to the concept of relational databases.  RDF schema 'nodes' can be equated to the primary key in relational databases.  In fact, the whole idea of RDF schemas can get a little confusing, so some of us found it easier to think of it in terms of a web-based relational database.

Relational database <-> RDF schema

The idea behind this is essentially to create relational databases on the net, but to link them semantically to one another, effectively creating the ability to find correlations between vast amounts of data that would otherwise be remote and essentially invisible.  The advantage is that we can get the accuracy of data retrieval results on the Web.  To take the diagram's example: we have a database of students' GCSE results.  If we pair this up with a demographic database, and apply an ontology to them (thereby creating rules of inference), we can discover correlations between GCSE results and student demographics.
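As a very rough sketch of the 'pairing up' step, the Python below joins two tiny tables on a shared student id and averages grades per group.  It is a plain relational-style join, not RDF or ontology-based inference, and every id, grade and region in it is invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: the student ids, grades and regions are invented.
gcse_results = {"s1": 7, "s2": 5, "s3": 8, "s4": 4}   # student id -> average grade
demographics = {"s1": "North", "s2": "South",
                "s3": "North", "s4": "South"}          # student id -> region

def average_grade_by_group(results, groups):
    """Join the two tables on student id and average the grades per group."""
    grouped = defaultdict(list)
    for student_id, grade in results.items():
        if student_id in groups:   # the shared key links the two datasets
            grouped[groups[student_id]].append(grade)
    return {group: mean(grades) for group, grades in grouped.items()}

print(average_grade_by_group(gcse_results, demographics))
```

The shared student id plays the part of the linking 'node'; the semantic web idea is that such links could be followed across datasets held on entirely different sites.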

The potential to make even more advanced and relevant mashups is the huge advantage of this technology.  However, the problem is that it requires quite a bit of skill and time investment in order to create these schemas, taxonomies and ontologies.  Most people would be satisfied with what they can create via Web 2.0 technologies - with social networking, blogging and mashups.  The potential to make faulty schemas is also a problem - as with a database, any error in organising the information, or in the data retrieval process will mean a complete inability to access data.

But the main problem I see is that these schemas are only really applicable to certain domains, such as librarianship, archiving, biomedics, statistics, and other similar fields.  For the regular web user, there are no particular benefits to applying an RDF schema to, say, your Facebook page - but it would probably be useful to Mark Zuckerberg.

The semantic web is probably more useful in the organisation of information on the web, and harnessing it.  But it would probably be largely invisible and irrelevant to the legions of casual web users out there.

Wednesday, 16 November 2011

The Web on the Move, Part 2

Taking what we had learned from Monday's lesson, our computer lab exercise asked us to design a mobile device app that fulfils our needs as City students.  There was a lot of discussion and good ideas flying around; a lot of looking at our own smartphones and seeing how they presented their apps and such.  There was also much checking of City's Moodle system and its many flaws and drawbacks.

I'd initially been a bit worried about this exercise, not knowing exactly what I could bring to it; but the conversation and ideas were so stimulating, I soon became very excited about designing my own app.  As soon as I got home, I started making a mockup of a City Moodle app on Photoshop.

A City Moodle app mockup.  Does not discriminate between smartphones (but only because I don't actually have an iPhone).

  • Mail - Instant access to your City email account.
  • Discussion - Feed of the latest posts on the Moodle discussion boards with the option of instant posting/replying.
  • Compass - GPS or wi-fi based showing your campus location.  It allows you to key in your classes and get room directions.
  • Bluetooth - Allows you to instantly download lecture notes and slides so that you can read them on the move.
  • Scanner - Liam's brilliant idea!  See a book in the bookshop, and scan the barcode to check its availability at City's library.
  • Library - Access to your library account and the library catalogue; allows you to reserve and renew books whilst on the move.

On top of this there is a drop-down menu which allows you to access your course details whenever you want.  The drop-down menu unfolds over the length of the screen, and folds when you touch outside the body of the menu.  I'm sure there's a lot more that could be integrated into this.  I think it would be a great idea to have such an app for City students; the exercise made me realise just how useful it could really be.

The Web on the Move, Part 1

So I guess this lesson succeeded in that it made me upgrade my ancient (i.e. 2 year old) phone and get a new touch-screen smartphone!  So unfair that I have to wait hours for it to charge before I use it! T_T

Frankly, I was surprised I managed to hold out for so long.  Smartphones are now pretty ubiquitous, and being a bit of a gadget fan, the temptation was certainly there to capitulate to the web as mobile platform.  So why are smartphones so popular?

Having a computer in your pocket whilst on the move is certainly a big pro.  The main pro of mobile devices is that they are context aware.  They know where they are, and can provide lots of information about their location, e.g. via an in-built GPS system.  The cons are obviously the limited screen and keyboard size (not so much of a problem for someone with tiny hands like me, and I like bite-sized phones, but hey...).  Other cons like limited connectivity and battery life are Moore's Law problems, and will probably get solved if you sit around for another couple of years (which is what I ended up doing before I got my new phone, lol!).

The main problem with mobile devices is that they don't really fulfil users' information needs.  It is the current technology that defines our needs.  For example, with my old phone, I knew there was no point in trying to connect to City's Moodle system, because my phone simply could not handle it.  It would have been mighty useful to have access to Moodle like I would have had on a smartphone, but it was impossible and so I would just wait until I had access to my laptop.

Looking at it from another angle, technology can also provide users with things they never knew they wanted.  Just now I saw an advert for the new iPhone 4S, with voice control technology.  Want to know if you need an umbrella when you go out tonight?  Just ask your phone.  It'll give you an immediate answer.  But who would have thought you'd even want this before it became reality?

My laptop has in-built voice recognition technology.  Wow, I thought!  I can really do with this.  After setting it up and using it for a day to get it to learn my speech patterns, guess what - I never used it again.  Maybe I didn't need it so much after all.

Wednesday, 9 November 2011

Personalisation and increased functionality on Web 2.0

A couple of weeks back we discussed the idea of the internet as a platform being one of the defining features of Web 2.0.  This week we learned more about how the ‘internet as platform’ functions – namely through the use of web services and APIs.

Web services allow the personalisation of information that is passed through the web to the user.  Essentially it is ‘middleware’, re-processing machine-readable data from the server in order to present it in a form that is uniquely tailored to the client.  As the client, we neither have to see nor understand how that data is re-processed in order to gain access to it.  It is delivered to us in a simple and convenient form by the web service without needing any significant programming or computing knowledge on our part.

For example, I have 2 laptops – one for home use and one for Uni work.  How can I have access to my web browsing data from just one machine when I am away from the other?  Google allows me to sync my browser data via my Google account; so when I use the Chrome browser from anywhere, I still have access to my bookmarks, passwords, usernames and other saved data.

XML is the ‘language’ of web services, similar to HTML yet much more flexible.  Learning to write HTML back in the late 90s, I discovered that it was basically rigid – HTML ‘tags’ or commands are prescribed and can only be read by certain programs.  XML is similar in that it uses such tags and commands, but these can be created by the user and provide unseen, machine-readable metadata.  It is also readable by a wide number of programs.  For example, object data for the game The Sims can be exported to XML, and can be manually manipulated in order to create custom content.

Nowadays, most programming is done through the use of APIs (Application Programming Interfaces), which essentially hide complex internal structures from the user of the API, thus making them more user-friendly.  APIs can be used by other programmers and developers to create their own widgets, gadgets, apps and countless other useful little gizmos which can personalise your data.  Hence we get Twitter feed apps for our iPhones and Google Map sat nav on our tablets.

In the ‘real world’, APIs have really brought out the functionality of my website.  I am able to embed a slideshow of all my creations onto my website, so users can immediately see what I have to offer.  I also have a Revolver Maps gadget embedded into my page, so that I can keep track of where my visitors come from, and who is currently browsing my site.  I also have a link to an RSS feed, and a Google +1 gadget added so that people can recommend and keep up to date with my site.  The power of the API and the web service lies in the fact that they are so accessible, so flexible, and able to tailor information to both you and your audience.  Ten years ago, setting up a sophisticated visitor counter on your website would’ve probably required some technical know-how.  Nowadays, thanks to APIs, it’s all there at the click of a button.

Sunday, 30 October 2011

DITA Assignment, Part One - Web 1.0 - Data Retrieval vs Information Retrieval

This short essay seeks to answer a seemingly innocuous question: what are the differences between data retrieval (DR) and information retrieval (IR)?  To the layman, the difference between the two concepts may seem hazy.  Yet the two are inherently different.

First, however, it is important to be clear on what data and information actually are.  Data may be described as “a term for quantitative or numerically encoded information”, whilst information is “data that has been processed into a meaningful form” (Feather & Sturges, 2003).

Data is usually stored in a database, a “systematically ordered collection of information” (Feather & Sturges, 2003).  Retrieving data from the database requires the use of a query language, such as SQL.  This is a “structured way for retrieving search requests”, using artificial language commands (Feather & Sturges, 2003).

According to Baeza-Yates and Ribeiro-Neto (1999) a “data retrieval language aims at retrieving all objects which satisfy clearly defined conditions such as those in a regular expression or in a relational algebra expression. Thus, for a data retrieval system, a single erroneous object among a thousand retrieved objects means total failure.”

To clarify, database queries are structured as such:-

select ColumnA from TableB where CriteriaC_is_met

Any error in this structure - however minor - will result in the failure of the search, i.e. no matches. (For more examples of SQL search queries, see here.)
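The all-or-nothing behaviour described above can be sketched with Python's built-in sqlite3 module.  The table, column names and rows are hypothetical, made up for this example.  Note the two different kinds of failure: a misspelt search criterion returns no rows at all, while a misspelt SQL keyword is rejected outright as a syntax error.

```python
import sqlite3

# Hypothetical in-memory table; the names and courses are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, course TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Anne", "Library Science"), ("Liam", "Information Science")],
)

def names_on_course(course):
    """Data retrieval: return only the rows that match the condition exactly."""
    rows = conn.execute("SELECT name FROM students WHERE course = ?", (course,))
    return [name for (name,) in rows]

exact = names_on_course("Library Science")
misspelt = names_on_course("Libary Science")   # one-letter typo in the criterion
print(exact, misspelt)

try:
    conn.execute("SELEKT name FROM students")  # error in the query structure itself
    malformed_ok = True
except sqlite3.OperationalError:
    malformed_ok = False
print(malformed_ok)
```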

Information, however, is largely unstructured, existing in a number of formats and indexed in different ways.  Consequently information retrieval is based upon user information needs, and these are naturally subjective (Rosenfeld & Morville, 2007).  This means two things: –
  1. search queries will be based on those user needs and;
  2. search results will either be relevant or not.

To take point one, information queries may be divided into different types: navigational (searching for a website); transactional (searching for a service); or informational (searching for information on a certain subject) (MacFarlane, 2011).  The user may know exactly what they want to find; then again, they may not.  This ‘anomalous state of knowledge’ (ASK) informs the type of search query the user makes.  Where IR departs from DR is that IR search queries may take on different forms, for example, natural language and Boolean queries. (For a table outlining the differences between IR and DR, refer to Appendix A).

From personal study using various search queries on two different search engines (Google and Bing), natural language queries generally return relevant results, although using quotation marks and deleting stop words will narrow the search and increase precision.  Boolean operators also returned different results, as both search engines interpreted search queries in different ways (see Appendix B for the results of the above study).

Depending on the type of information required, e.g. transactional, informational etc., it is likely that search queries will return different results.  For example, Anne is doing a project on the Captain Swing Riots of the 19th century.  She wants as much information as possible, and decides to use two different search engines and compare their results.  In both Google and Bing she types in the natural language query ‘Who is Captain Swing?’ (minus quotation marks).  Google’s results are all relevant.  Bing’s top rated result is also relevant, but all the following results are irrelevant (returning information on a band called ‘Captain Swing’).  Curious, Anne then deletes the stop words from her previous query, and types “Captain Swing” into both search engines (quotation marks included).  This time four of Google’s top five results are relevant; one of Bing’s top five results is relevant.  Therefore, of the two search engines, Google has satisfied her information needs more effectively.

Later, while using the natural language query ‘what are Jerusalem artichokes and how do I cook them?’, Anne discovers that many of the results are about growing artichokes.  This time she uses another strategy to narrow down her search – Boolean operators.  She types in ‘Jerusalem artichokes AND cook NOT grow’.  This is effective in the Bing search engine, but not in the Google search engine.  She later discovers that Google accepts other forms of Boolean operators, and that by typing ‘Jerusalem artichokes + cook – grow’, she will again find more relevant results.
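The logic of Anne's Boolean query can be sketched in a few lines of Python.  This is a toy matcher over four invented documents, not how Google or Bing actually process queries; it simply keeps documents that contain every required term and none of the excluded ones.

```python
# Hypothetical mini-collection; the documents are invented for illustration.
docs = {
    1: "how to cook jerusalem artichokes in butter",
    2: "how to grow jerusalem artichokes in your garden",
    3: "grow then cook jerusalem artichokes at home",
    4: "how to cook globe artichokes",
}

def boolean_search(docs, must_have, must_not_have=()):
    """Return ids of documents containing every `must_have` term
    and none of the `must_not_have` terms."""
    hits = []
    for doc_id, text in docs.items():
        words = set(text.split())
        has_all = all(term in words for term in must_have)
        has_banned = any(term in words for term in must_not_have)
        if has_all and not has_banned:
            hits.append(doc_id)
    return hits

print(boolean_search(docs, ["jerusalem", "artichokes", "cook"], ["grow"]))
```

Document 3 mentions cooking but is still dropped because it also contains 'grow', which shows how rigid a NOT operator can be.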

As can be seen, natural language queries deal in a certain amount of ambiguity, and may not necessarily provide appropriate results.  With data retrieval, a search provides either a match or no match.  With information retrieval, a search must fulfil the user’s need. In short, it must be relevant.

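There are two ways of judging relevance – binary judgement (where something is relevant or it is not), or graded judgement (when some results are more relevant than others).  User satisfaction in IR may be evaluated by calculating the recall or the precision of the search results, where recall is the proportion of all the relevant documents that the search retrieves, and precision is the proportion of the retrieved documents that are relevant.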

It is important to note that there is generally an inverse relationship between recall and precision - as one increases, the other tends to decrease.
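The two measures can be worked out in a few lines.  The sketch below uses binary relevance judgements; the function names and the example document labels are hypothetical, chosen only to make the arithmetic visible.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical example: a search returns 5 results, 4 of which are relevant,
# out of 8 relevant documents that exist in the whole collection.
retrieved = ["a", "b", "c", "d", "x"]
relevant = ["a", "b", "c", "d", "e", "f", "g", "h"]

print(precision(retrieved, relevant))  # 4 relevant out of 5 retrieved
print(recall(retrieved, relevant))     # 4 retrieved out of 8 relevant
```

Note that recall needs the total number of relevant documents in the collection, which is rarely known for a web search engine; this is one reason an exercise like the one in Appendix B tends to measure precision over the top few results instead.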

There are drawbacks to different methods of IR.  Boolean operators are not intuitive, but rigid; a search on teaching French in schools may equally return results on teaching in French schools (Feather & Sturges, 2003).  Likewise, natural language queries may result in low-precision results due to irrelevant documents that contain high levels of keywords “by chance or out of context” (Lee, Seo, Jeon and Rim, 2011).  Deleting stop words and adding quotation marks decreases recall. There are many ways in which user needs may not be satisfied, and there is no ‘right way’ of improving search results.  This is simply because it is the user’s needs that determine the type of search query used.

It is therefore important that the information to be searched is appropriately managed.  For example, is it in the correct format?  Should it be searched through keywords or keyphrases?  What about conflating words, including synonyms, and ignoring stop words?  These methods are all vital in making information more accessible to the user (MacFarlane, Butterworth and Krause, 2011).
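As a rough illustration of two of these methods, ignoring stop words and conflating synonyms, a query might be normalised before it is matched against an index.  This is a sketch only: the stop word list, the synonym table and the `normalise` function are assumptions made for the example, and a real system would use far larger lists and proper stemming.

```python
# Illustrative lists only; a real system would use much larger ones.
STOP_WORDS = {"what", "are", "and", "how", "do", "i", "them", "the", "a", "is", "who"}
SYNONYMS = {"soccer": "football"}  # map variant terms onto one index term


def normalise(query):
    """Lower-case a query, strip punctuation, drop stop words,
    and conflate synonyms onto a single term."""
    terms = []
    for word in query.lower().split():
        word = word.strip("?.,!'\"")
        if not word or word in STOP_WORDS:
            continue
        terms.append(SYNONYMS.get(word, word))
    return terms


print(normalise("What are Jerusalem artichokes and how do I cook them?"))
print(normalise("Who is Captain Swing?"))
print(normalise("the best soccer site"))
```

If documents are indexed with the same function, a search for 'soccer' and a search for 'football' end up looking for the same term.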

To conclude, data and information retrieval could not be more different.  Data has the advantage of not being subject-based.  A database is built with its own well-defined semantics.  It is the opposite for IR.  There are no well-defined semantics, and so the IR system has to interpret the semantic content of the documents and bring together what it deems relevant.  Reaching this goal appears to be a two-way street.  The information in the document itself must be well-managed by the creator; the user must also use an appropriate IR method according to his or her own information needs.  Likewise, the evaluation of search results will be determined subjectively by the user, according to those needs.


Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. [online] Boston, Massachusetts: Addison Wesley Longman Inc. Available at: [Accessed: 22 October 2011].

Feather, J. and Sturges, R. P. eds. (2003). International Encyclopedia of Information and Library Science. 2nd ed. London: Routledge.

Karlgren, J. (2004). Information retrieval: introduction. [online] Available at: [Accessed: 23 October 2011].

Lee, J., Seo, J., Jeon, J. and Rim, H. (2011). ‘Sentence-based relevance flow analysis for high accuracy retrieval.’  Journal of the American Society for Information Science & Technology [e-journal] 62 (9), pp. 1666-1675. Available through: JSTOR [Accessed: 25 October 2011].

MacFarlane, A. (2011). Lecture 04: Information Retrieval, INM348 Digital Information Technologies and Architectures. City University London [unpublished].

MacFarlane, A., Butterworth, R. and Krause, A. (2011). Lecture 03: Structuring and querying information stored in databases. INM348 Digital Information Technologies and Architectures. City University London [unpublished].

Rosenfeld, L. and Morville, P. (2007). Information Architecture for the World Wide Web. 3rd ed. Cambridge: O'Reilly.

Appendix A

The following table by The Swedish Institute of Computer Science (SICS) clearly summarises the difference between data and information retrieval. [Accessed: 23 October 2011]

Information vs Data Retrieval

Exact match
Partial match
Query language
Natural (... well)
Query specification
Items wanted

Appendix B

The results of an exercise calculating the precision of various search results from Google and Bing.  The original spreadsheet may be viewed at

Wednesday, 26 October 2011

Web 2.0 - The internet as platform

This week's lecture introduced the idea of Web 2.0.

Web 2.0 is the idea of the internet as a platform, rather than a computer as a platform.  It is the web that can be written to as well as read.  Back in the day, if you wanted to put something on the web, it involved learning HTML and writing up a web-page manually.  Waaaaaay back in 1997, I had to go to my college library, get out a book on HTML, learn the basics, sit down in Notepad and type out my website.  Yawn.  And more often than not, it looked pretty pants.

Now fast-forward to 2011.  Why the hell would you want to bother with typing out pages of HTML just to make a website?  Google Sites has pre-made templates for you already.  Dreamweaver can write all the code for you.  Your thoughts can be put up on the web in a matter of minutes if you have a blog.  Seconds if you're on Facebook or Twitter.  And whatever type of computer platform you have, everything looks and works virtually the same.

This is the crux of Web 2.0.  Everyone can publish on it without having any technical skill at all.  Web 2.0 effectively harnesses network effects: its services get better the more people use them.  Social networks are at the heart of Web 2.0, and Web 2.0 can be said to be so successful because, through sites like Facebook, it mirrors the social networks and interactions in our everyday lives.

Interaction is basically at the heart of Web 2.0.  And this interaction isn't only of the purely social kind.  We all have the ability to become pseudo-experts by contributing to Wikipedia.  We can make Amazon better by leaving reviews.  We can create our own tags on Flickr and Delicious - indeed, in the ocean of un-indexed junk out there, folksonomies are becoming one of the most effective ways of organising web information.

The World Wide Web (version 2.0).

But all this inevitably comes at a price.  Some of the issues we discussed are mapped out here:-

  • Does having the ability to log every mundane event in your everyday life with ease create propensity to narcissism?
  • Is Wikipedia as reliable as, say, the Encyclopaedia Britannica, and does it promote a culture of amateurism?
  • Does the internet, as a platform for freedom of speech, promote a 'safe' environment for people to act in an offensive and derogatory manner?
  • Would it be fair to say that important events somehow become trivialised by the hype of the web?
  • What does the privacy settings change on Facebook in February 2010 have to say about the public nature of the data we may unwittingly put on the web?
  • How do we deal with the ephemeral nature of the web?
What do you think?

Wednesday, 19 October 2011

Information Retrieval versus Data Retrieval

This week's exercises finally put last week's mind-mushing exercises into perspective.

The purpose was to draw a line between data retrieval (what we did last week) and information retrieval (what we did this week).  Data retrieval is the kind of thing we do when we query a database.  With information retrieval, the results are 'subjectively' relevant - I have a huge heap of documents, and I need to decide which are relevant to my needs and which are not.

I am a prolific user of Google.  I use it at least a dozen times a day, if not more.  It is easy to think from the perspective of a user.  There is something I want.  I type it in the search engine, and I hope I find what I'm looking for.  Sometimes, a lot of frustration ensues.

But why is this?  Why is it sometimes so difficult to find what you're looking for?

More often than not, the reason is because people don't index their 'documents' effectively.  They lose sight of what it's like to be a user themselves.

As the owner of a website (or two), this is pretty pertinent for me.  It is all too easy to speed through the creation of a website summary, keywords or tags.  Usually you just want to get the creation bit out of the way and get going.  But do those keywords fulfil users' needs?  For example, you run a football website.  'The best football site in the world', you may profess.  But what if an American fan is searching for a football site?  What if he uses the keyword 'soccer'?  Your site may very well be the best in the world, but you're immediately cutting off a large proportion of your potential audience simply by not giving enough thought to your indexing terms.

All sites have different indexing needs, and these should be made with the user in mind.  A Shakespeare website may want to index its documents by phrases.  For example, a user may be searching by a particular line or quote, e.g. "To be or not to be."  Therefore, indexing will have to be tailored to users' needs.

In the computer lab, our task was to query two search engines - Google and Bing.  Easy, I thought.  Perhaps too easy.  But of course, I was wrong.

We had to use different search models in order to get to different types of information.  And depending on the type of information we had to find, certain search models worked better.

For example -

  • Natural language queries.  I don't use these very often.  But they often turned up very useful results, particularly on informational, exploratory information (when that information was explicit.  Finding out about the Civil War levellers confused the search a bit, since 'levellers' could be any number of things).  
  • Quotation marks - I use these most often.  They're very useful for finding documents that contain certain words or short phrases.  When a natural language query failed, adding quotation marks and deleting stop words usually helped narrow the search.
  • Boolean operators.  Something I NEVER use.  I'd always thought of them as kind of antiquated and redundant.  So they proved to be - sometimes.  I discovered that they simply do not work with Google.  However, they were compatible with Bing (Bing automatically uses the AND operator anyway).  An example - searching for Jerusalem artichokes and how to cook them.  For some reason, the search turned up quite a bit on growing Jerusalem artichokes.  Finally I found a use for the NOT operator.  With Google, this simply turned up more hits on growing artichokes.  With Bing, it seemed to work as intended.
Another part of the exercise was to calculate the precision of each search engine's results.  What did I discover?

Why, that Google has a higher precision rate than Bing.  Of course. ;)

Addicted to Google and the "Church of Search".

Sunday, 16 October 2011

DITA exercises #3 - the AFTERMATH.

I finally got some feedback about my SQL exercises from session 3 of DITA.

What I have learned from the feedback is that there are many little 'pointers' that help to 'tighten up' the commands and thus return more reliable data from your query.

For example:-

  • Using the = sign with > or < in order to actually include the number typed in the query, rather than just those numbers less than or greater than it (e.g. using >= 1980 as opposed to > 1980; the former includes 1980 in the search).
  • The use of % to make sure you get as many returns on your query as possible (e.g. using "%Prentice Hall%" as opposed to "Prentice Hall"; the former includes all matches including the name Prentice Hall, not just matches comprising ONLY the name Prentice Hall).
  • Keep in mind the difference between numbers and characters - in SQL, 0028007484 is treated as a number, whilst 0-0280074-8-4 is treated as a string of characters.  Therefore an unquoted = comparison will work with the former but not the latter; the string has to be put in quotation marks, and the 'like' command should be used for partial matches.
Helen, a fellow student, had an excellent way of explaining how queries should be arranged:-
"
  • The columns of data you want to obtain from the database (SELECT)
  • The table or tables that this data is sitting in (FROM)
  • The clauses that limit this data to exactly what you are interested in and no more (WHERE)
select columnA, columnB, columnC from tablenameXYZ where criteria1_is_met and criteria2_is_met"
I also found that having a diagram of the database's structure was very useful, as it helped me to visualise where and how I could retrieve the data. 
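
To see these pointers in action, here is a small sketch that runs the same kinds of query from Python using its built-in sqlite3 module.  It is not the course database: the `titles` table, its column names and the three rows are all made up for illustration (only the hyphenated ISBN is taken from the feedback above), and SQLite may behave slightly differently from the Database Management System used in the lab.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration; the course
# database and its column names are not reproduced here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE titles (title TEXT, publisher TEXT, year INTEGER, isbn TEXT)"
)
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?, ?)",
    [
        ("Book A", "Prentice Hall", 1979, "0-0280074-8-4"),
        ("Book B", "Prentice Hall International", 1980, "0-1111111-1-1"),
        ("Book C", "Another Press", 1985, "0-2222222-2-2"),
    ],
)

# > 1980 leaves out 1980 itself; >= 1980 includes it.
after_1980 = conn.execute(
    "SELECT title FROM titles WHERE year > 1980 ORDER BY title"
).fetchall()
from_1980 = conn.execute(
    "SELECT title FROM titles WHERE year >= 1980 ORDER BY title"
).fetchall()

# = needs the whole value to match; LIKE with % matches it anywhere in the value.
exact = conn.execute(
    "SELECT title FROM titles WHERE publisher = 'Prentice Hall' ORDER BY title"
).fetchall()
partial = conn.execute(
    "SELECT title FROM titles WHERE publisher LIKE '%Prentice Hall%' ORDER BY title"
).fetchall()

# The hyphenated ISBN is stored as text, so it is written in quotes.
# In SQLite both LIKE and a quoted = find it.
isbn_like = conn.execute(
    "SELECT title FROM titles WHERE isbn LIKE '0-0280074-8-4'"
).fetchall()
isbn_equals = conn.execute(
    "SELECT title FROM titles WHERE isbn = '0-0280074-8-4'"
).fetchall()

print(after_1980, from_1980, exact, partial, isbn_like, isbn_equals)
```

Each query follows the same SELECT, FROM, WHERE order, with the WHERE clause doing the narrowing.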

Wednesday, 12 October 2011

What colour do you want that database?

Like many people, if someone said 'database' to me, I'd immediately think of a spreadsheet.  So when I knew we were learning databases this week, I was a little freaked out since frankly I hadn't done spreadsheets since I was in secondary school.

Needless to say, as with most things on this course, my preconceptions were totally wrong.

Like the internet, databases have a longer history than I'd thought.  But of course, back in the day there was a lot of duplication, inconsistency and redundancy of data.  Nowadays we tend to store data centrally, in order to mitigate these problems and aid the fast retrieval of data.

As an introduction to databases, we learned the basics - querying them using SQL.  In the lecture, it all seemed fairly easy, logical and generally straightforward.  I have to admit, I haven't got the most logical of minds, but once you get something, you should be able to follow the logic to its natural conclusion, right?

Heh heh.  Sitting in the computer lab after the lecture, my mind hit a complete blank.  What did I just learn?  How does it work again?  Like maths, it's like breaking a code - once you know the formula, it's easy to unravel the answers.  And I found I hadn't quite got a handle on that formula.  A proliferation of bad and incorrect commands followed.  And even more than that, a whole lot of simple, straight guessing.  Sometimes I would guess and guess until I gave up.  When I finally asked for help, I realised that there are just some commands you need to know; and that once you know them, you have the key.  A whole lot of doors unlocked.  Every time I learned a new 'key', it made it easier to guess what the answers to some of the exercises might be.

Still, it took me an evening of querying the Database Management System at home before I managed to complete all the exercises.  I recorded each answer, because I knew that if I didn't, I would forget them.  The best analogy I can come up with is that learning SQL is like learning a language.  You will never learn to speak it fluently unless you use it, and if you don't practise you will get rusty.  So I am now spending my time making random queries of the database, trying to get it all to stick firmly in my head.

And I guess it will - after a million or so queries of practice. ;)

Yup.  Exactly what I would've said a few days ago...

Monday, 10 October 2011

Monday, 3 October 2011

DITA von Tease

What is the internet?

Yes, I am one of those people who would have answered - it's the World Wide Web.  And if someone had asked me, "What's the World Wide Web?", I would have answered - "Duh!  It's the internet!"

Today I learned that the internet and the WWW are not the same thing.

Really, it's one of those things you've heard about and know at the back of your head, but if you were to describe the difference between the two, you wouldn't have a hope in hell of explaining it.  Well, I wouldn't anyway.

Richard had a very good way of explaining it:-

"The internet is the road, and the World Wide Web is the car that drives the road."

In other words, the net is the infrastructure, and the World Wide Web is a service that allows you to 'navigate' the net.  There are other services that allow you to traverse the net in different ways to the World Wide Web.

The Information Super Highway.  There really is a reason for that vehicular analogy!

The 'net' began in the 1960s, and was developed by the military in order to make sure that even when one computer with information went down (or was destroyed by a nuclear warhead, as the case may be), that information would still be stored and shareable amongst other computers in the network.

Back in those days, computer servers were huge, clunky things that probably needed a whole department to run them.

The days when the Information Super Highway was more like the Information Super Monolith.

Nowadays, the gap between client and server computers is narrowing.  Client computers are now becoming more and more powerful, and the traditional distinctions between the two are blurring.  In a way, we are all becoming servers.  I'm not sure exactly how it works, but torrenting and peer exchange seem to be proof of that closing gap.

Anyway, in the computer lab we got to make our first web page, after learning the basics of HTML.  We kind of take it for granted nowadays that someone else will make our webpage for us.  Many sites have web templates, or we can use a program like Dreamweaver to create one like we would create a document in Word.  The last time I actually wrote HTML was in 1997.  But even then I was no whizz-kid at it.  I certainly didn't really get it.  So what did I learn today?

  • Tags like <b> and <i> are now 'old-fashioned'.  Semantic tags like <em> and <strong> are now the way to go, because they're more 'global-friendly'.
  • I finally know what CSS stands for!  I finally know what CSS is!  WOOT! XD
  • LAN = Local Area Network; WAN = Wide Area Network; Internet = a vast network of all these networks.  TA DA!!!
See, it's really all very simple; I think that these computer scientists really just like to tease the hell out of us with their crazy acronyms... ;)

Monday, 26 September 2011

Information/Library Sciency...Stuffs?

Well, as they say, "You learn something new every day", and today, on my first day of the MA/MSc Library Science course, I learned some really random stuff that everyone should know but most people don't.  Like a byte is made up of 8 bits, a kilobyte is 1024 bytes, a megabyte is 1024 kilobytes, and so on.

To everyone who knows all this already, and who are sniggering behind their sleeves at me... Yeah.  I admit it.  I was one of those dumb people who actually believed that a kilobyte is 1000 bytes.  But hey, what is my first Library Science class for, if not to learn the BASICS of the BASICS.  You'd think that since I spend half my life online, I'd know all this stuff already.  But I realised, sometimes you get so immersed in something, and it becomes so much a part of your life, that you don't really stop to think about the ins and outs and how things work.  It's a bit like language.  How often do you think about grammar, or the letters that make up the words you say and why they're arranged that way?  So today, I really learned for the first time about the 'letters' of digital information - bits and bytes, and files and formats.
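
For the record, the arithmetic can be checked in a few lines of Python.  This sketch follows the binary convention described above, where each unit is 1024 of the one below it.  Strictly speaking, the SI prefix 'kilo' means 1000 and the 1024-byte unit is formally called a kibibyte, so the 1000-byte answer is not really wrong either; the constant names here are just labels chosen for the example.

```python
BITS_PER_BYTE = 8

# Binary convention used in this post: each unit is 2**10 = 1024 of the one below.
KILOBYTE = 1024                # bytes
MEGABYTE = 1024 * KILOBYTE     # bytes

print(MEGABYTE)                  # bytes in a megabyte
print(MEGABYTE * BITS_PER_BYTE)  # bits in a megabyte

# Decimal (SI) convention, for comparison.
SI_KILOBYTE = 1000
print(KILOBYTE - SI_KILOBYTE)    # extra bytes per kilobyte under the binary convention
```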

Random stuff (for me) to keep a note of:-

  • Formats can be interoperable, but most often are not.
  • What's the difference between files and documents?  Files = named collections of related digital information.  Documents = a build up of information using different files, e.g. a blog entry incorporating text, images, videos etc.

The easier part of the Digital Information Technology & Architectures class was actually making a web page with all the links and the images and such.  I'm much more comfortable with the doing than I am with the theory (except when Blogger decides it hates me :p).

Next up - Library & Information Science Foundation... Well, that's a pretty nebulous title.  I didn't quite know what to expect.  The last thing I was expecting was history - one of my fave hobbies!  Lyn gave an awesome and informative lecture on the history of information, documents and writing.  We kind of take it for granted that we live in a proverbial sea of information, something that surrounds us and is just 'there'.  But where did it all begin, and when did it start to be organised?  And how?  From cave paintings to scrolls, from the telegraph to the World Wide Web, the creation of information is a journey that is still going on, expanding, developing exponentially... And the big question is, how do we organise this proliferation of information?  And that's where Library Science comes in.  Fascinating stuff.

Which got me to thinking... history + libraries = the Library of Alexandria.  A storehouse of ancient knowledge lost when it burned down somewhere between 48 BCE and 642 CE.

Think about all the priceless information we lost along with the library.  Think of the glee of all those librarians had they ever had the chance to transcribe, archive and digitise all those documents.  Sadly, they're all (mostly) gone - and therefore probably one less thing for us to worry about. :p