Tuesday, August 14, 2007

Book Searching is not the same as Book Preserving

It has been fairly clear to the library community for a while now that the Google Book Search project is not going to deliver the quality needed to assure 'preservation'. There is now a rather detailed critique at First Monday, from Paul Duguid. His essay (noted via Peter Brantley) focuses on several editions of Sterne's bizarre novel, Tristram Shandy, that are included in GBS. His conclusion:

The Google Books Project is no doubt an important, in many ways invaluable, project. It is also, on the brief evidence given here, a highly problematic one. Relying on the power of its search tools, Google has ignored elemental metadata, such as volume numbers. The quality of its scanning (and so we may presume its searching) is at times completely inadequate [14]. The editions offered (by search or by sale) are, at best, regrettable. Curiously, this suggests to me that it may be Google’s technicians, and not librarians, who are the great romanticisers of the book. Google Books takes books as a storehouse of wisdom to be opened up with new tools. They fail to see what librarians know: books can be obtuse, obdurate, even obnoxious things. As a group, they don’t submit equally to a standard shelf, a standard scanner, or a standard ontology. Nor are their constraints overcome by scraping the text and developing search algorithms.
When I mentioned the article to a friend, he said that it was possibly a little unfair. But I guess that is the issue Google has to confront. If Google is going to assume the responsibility of scanning and, to speak plainly, of establishing these texts, it will attract the highest standards of scholarly nitpicking, which is often and notoriously unfair. That, after all, is why professors study the early editions of Tristram Shandy: they are professional and unrelenting pickers of nits. Companies such as ProQuest are used to collecting and aggregating materials with careful and scholarly procedures. They know that they will be pilloried if and when their scanning is unreliable or their selections are unwarranted.

I think that Dr Duguid has some good points, but there is perhaps more of a case to be made for Google than he allows. After all, his paper is a very good example of how easy it is to cite and use the material that Google is assembling. He clips and shows the messy pages he has found. Scholars will like that, though those in Europe will dislike the fact that, for strange copyright-related reasons, the Google citations do not work there. In the US that link will give you the first (or possibly second) page of the Harvard edition; in Europe nothing is visible.

But I also wonder about Google's methodology. Why should they ignore the way that librarians and scholars have assessed this material in the past? Not recording volume numbers seems like a laughable error. On the day on which the New York Times reports Google's and Microsoft's urgent drives to capture and utilise health records, we may wonder whether the medical services that Google develops can possibly be as haphazard as the Book Search record appears to be.


Alain Pierrot said...

“establishing the content”

This scholarly goal does not seem to fit Google's project, and, I agree, it would be unfair to assess Google's results against scholarly text establishment rules.
However, even without this kind of requirement, some points are worth mentioning and raise questions.
1°) To Google: how will you rank the different scanned books (possibly many editions of one title!) matching a query?
2°) To the libraries opening their shelves to Google:
a) how do you assess the quality of the digitized versions of your books?
b) how do you build the list of titles to be scanned?
c) how do you intend to maintain your essential service of supplying the right books (editions) to fit the individual needs of individual readers?

Adam Hodgkin said...

These are all good questions, and I think it is time that the libraries who are partnering with Google took more of a leadership role, especially in insisting on the provision of basic and very useful metadata and in ensuring that the production standards are acceptable. Is there a forum in which the GBS libraries can share their experience -- if not, there should be a Google BS users group. Perhaps Michigan will take the lead in organising it...

Alain Pierrot said...

I couldn't find a relevant forum (but I didn't spend much effort looking...).

As for metadata, a whole range of standards is available: MARC, MARCXML, ONIX, METS, ...

A good place to keep an eye on:

For French readers, Jean-Michel Salaün comments on Paul Duguid's analysis:

Both Karen Coyle and Jean-Michel Salaün are experts in the library and digital domains.