An interesting aspect of the Web 2.0 phenomenon is its resemblance to a worldwide library to which anyone can contribute. It is a fact that, especially in universities but not only there, the web has long been considered a source of documentation; nowadays the process is amplified by the scale at which anyone can contribute their own knowledge. This facet of Web 2.0 is often called Library 2.0.
Compared to a traditional library, however, one should consider the issue of archiving and preserving this content. Especially from a library perspective, the preservation of what some would call this “cultural memory” has triggered many discussions, debates and research efforts in this area. Nevertheless, practical ideas on how to achieve this are still in their infancy. Some might recall the Internet Archive, currently the only large-scale preservation effort. Other efforts in this area are either less well known or focus on particular types of content.
The endeavor is made harder by the difficulty of predicting how the web will change. Its strengths from the content-access point of view (such as links, tags, blogrolls, chaining and syndication feeds) are also weaknesses when it comes to archiving all this material. Research in this area has revealed that the average lifetime of a web page is somewhere between 45 and 75 days. In addition, the constant changing and aggregation of content amplifies the problem.
The most important obstacle, however, concerns technological issues. First, content becomes obsolete because the format, access protocol or standards in use when it was published have evolved in the meantime. It is often difficult to upgrade all content to the new technologies, and some of it is always lost. Aside from preserving the content, there is also the matter of preserving the experience, which resides in the presentation of and user interaction with the site, given the palette of web client and server platforms deployed. And while the content data is usually easily accessible, this is seldom true for the code, scripts and databases that drive the site's experience.
There will always be parts of a site inaccessible to an external party, and the extra effort required to uncover the inaccessible files may very well involve the site's owner or administrator. In addition, with the currently available standards, the site itself can ask automated systems trying to archive its content to ignore some of the files. The Robot Exclusion Protocol, for example, is a way of instructing search or archiving engines (the so-called crawlers) to skip a set of files that are otherwise publicly accessible. It is estimated that this hidden web is 400 to 550 times larger than the content accessible to the end-user.
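As a minimal sketch of how an archiving crawler honors the Robot Exclusion Protocol, consider Python's standard `urllib.robotparser`. The robots.txt rules and URLs below are hypothetical examples, not taken from any real site:

```python
# Sketch: checking URLs against a site's robots.txt before archiving.
# The rules and URLs are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks each URL before fetching it.
print(parser.can_fetch("archive-bot", "https://example.org/index.html"))  # True
print(parser.can_fetch("archive-bot", "https://example.org/private/x"))   # False
```

Note that this is purely advisory: the files under `/private/` remain publicly accessible, yet cooperative crawlers will never collect them, which is exactly how such content drops out of archives.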
Suggested ways of uncovering this hidden web involve software that can detect and then try to replicate the behavior of different web pages, meaning that even if the code that produced that behavior remains hidden, the experience is preserved. Finally, cooperation with the site owner is the best solution to the data-collection problem, but it is also the least scalable. Deploying the OAI Protocol for Metadata Harvesting (OAI-PMH) could enable automatic retrieval of a site's entire content for archiving purposes; however, the reluctance, for various reasons, to adopt this technology at large scale, together with the extra effort and costs involved, has so far kept it a theoretical rather than a practical solution.
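To give a flavor of what OAI-PMH harvesting looks like, the sketch below parses a deliberately minimal, hypothetical `ListRecords` response using only the standard library. A real harvester would instead issue HTTP GET requests such as `https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc` and page through results via resumption tokens; the repository URL and record contents here are invented for illustration:

```python
# Sketch: extracting identifiers and titles from a (hypothetical)
# minimal OAI-PMH ListRecords response.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

sample_response = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:item-1</identifier></header>
      <metadata>
        <dc xmlns="http://purl.org/dc/elements/1.1/">
          <title>A sample archived page</title>
        </dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(sample_response)
records = []
for record in root.iter(OAI + "record"):
    identifier = record.find(f"{OAI}header/{OAI}identifier").text
    title = record.find(f".//{DC}title").text
    records.append((identifier, title))

print(records)  # [('oai:example.org:item-1', 'A sample archived page')]
```

The point of the protocol is exactly this uniformity: any repository exposing the same verbs and XML structure can be harvested by the same generic client, which is why it is attractive for archiving despite its slow adoption.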
To describe the multiple facets in which a web page can be presented to its users, a new term has been coined: content cardinality. A cardinality greater than one means that the same web page, identified by the same URL, can show slightly or entirely different content when viewed by different users. This is obviously true for sites that customize their content per user account, or that publish different content for every instance of the page.
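The idea can be made concrete with a small illustrative sketch (not a standard algorithm): given per-user snapshots of the same URL, the content cardinality is simply the number of distinct page bodies observed. The snapshots and user names below are hypothetical:

```python
# Sketch: estimating the content cardinality of one URL from
# hypothetical per-user snapshots of its body.
import hashlib

def content_cardinality(snapshots):
    """Count distinct page bodies among per-user snapshots of one URL."""
    digests = {hashlib.sha256(body.encode()).hexdigest()
               for body in snapshots.values()}
    return len(digests)

# The same URL as seen by three different (hypothetical) users.
snapshots = {
    "alice": "<html>Welcome back, Alice!</html>",
    "bob": "<html>Welcome back, Bob!</html>",
    "anonymous": "<html>Please log in.</html>",
}

print(content_cardinality(snapshots))  # 3
```

A static page would yield a cardinality of one no matter how many users view it, while a personalized page like the one above yields a higher value, showing why a single crawl can never capture every facet.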
Another issue regarding the preservation of web content is legal. In particular, intellectual property and privacy-related content is often a sensitive area that hinders archiving attempts.
These issues regarding the long-term preservation of Web 2.0 have triggered arguments both for and against. Some could argue that the dynamics of the web, consisting of numerous daily blog postings, data mash-ups, ever-changing wiki pages and personal data uploaded to social networking sites, is of limited value and not worth the trouble. However, there is a lot of material with at least some value, and some believe that we should preserve at least that part.
Unfortunately, as explained before, not only are the tools available to reach the hidden part of the web still in their infancy, but no major project archiving even the most accessible part of the web has taken shape yet. Temporary solutions could include individual preservation, where the site itself provides some means of archiving old data. This is not only a feature of most wikis; more and more sites provide their users with some sort of archiving repository. Nevertheless, it is obvious that preserving the essence of the web will require new research and perhaps new technologies.