Geoff Nunberg reports that the metadata for the works in Google Books' online library, the very means of finding books in the first place, has been seriously mismanaged.
Nunberg, an adjunct full professor at the School of Information at the University of California at Berkeley, writes that the information about the books in Google Books, such as author, title, and date (referred to as "metadata"), is beset by mistaken entries, inconsistent cataloging, missing information, and wildly inaccurate dates. As a result, finding books in Google Books' collection can be very difficult, and sometimes all but impossible.
He illustrated the severity of the problem by showing that many books are dated decades before their subject matter or authors existed: "Do a search on 'internet' in books written before 1950 and Google Scholar turns up 527 hits." He continues: "you can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. 'Charles Dickens' turns up 182 results for publications before 1812."
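The sanity check Nunberg describes (looking for works dated before their author was born) is easy to express in code. The following is only a minimal sketch of that idea, not anything Nunberg or Google actually ran: the `Record` class, the `BIRTH_YEARS` table, and the `find_anachronisms` function are hypothetical names, and the second sample record carries a deliberately wrong date invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A hypothetical catalog entry; real metadata has many more fields."""
    title: str
    author: str
    year: int


# Assumed reference table of author birth years (1812 is the year cited above).
BIRTH_YEARS = {"Charles Dickens": 1812}


def find_anachronisms(records, birth_years):
    """Return the records dated earlier than their author's birth year."""
    return [
        r for r in records
        if r.author in birth_years and r.year < birth_years[r.author]
    ]


# Invented sample data: the second entry has an impossible date on purpose.
records = [
    Record("Oliver Twist", "Charles Dickens", 1838),
    Record("A Tale of Two Cities", "Charles Dickens", 1800),
]

flagged = find_anachronisms(records, BIRTH_YEARS)
for r in flagged:
    print(f"Suspicious date: {r.title} ({r.year})")
```

A check like this only catches the most blatant errors; a date that is wrong but still falls within the author's lifetime would pass unnoticed.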
One of the ironies of the serious metadata deficiencies at Google Books is that (Nunberg implies) most of the books were scanned at libraries, where the metadata had already been carefully prepared by librarians, and apparently little of it was used. Instead of using the metadata already available in the libraries from which the books were scanned, it appears that Google Books attempted to have computer algorithms pull the metadata out of the books automatically.
Though one of the main problems arising from the faulty metadata is the inaccessibility of thousands of books, an additional problem is that many of the books are now linked to false information: someone may find a book but come away believing that the book was printed years before its author was born, for example. In some ways, the metadata effectively erases years of scholarship dedicated to unraveling confusing publication histories: someone unfamiliar with the history of a book may not know, for example, that the date of publication provided by the book itself was false.
Nunberg has written about the metadata problems in two articles, one for the Chronicle of Higher Education, and one on a linguistics blog:
By Geoffrey Nunberg
Chronicle of Higher Education (August 31, 2009)
http://chronicle.com/article/Googles-Book-Search-A/48245/
Language Log (August 29, 2009)
Filed by Geoff Nunberg under Books, Computational Linguistics
http://languagelog.ldc.upenn.edu/nll/?p=1701
(The Language Log post has numerous illustrations.)
As a frequent user of Google Books myself, I can corroborate his complaints. Google Books thankfully provides access to many books which my comparatively small university library does not house (especially since many of the books I am interested in were published long before the public domain wall in the first quarter of the 20th century, and therefore show up in searches for "full access" books). Even so, it can be very frustrating to find that only 3 volumes of a 4-volume work are available. It is possible that all 4 volumes were in fact scanned, but if the 4th volume was mislabeled in its metadata, it is almost impossible to access.
One can do a title search for a particular journal, only to find missing issues of the journal—which never showed up in the original search—by accident later on. The usual workaround, I have found, is to search for phrases which regularly occur within the journal—though this technique does not work as well in older German-language journals, where Google's OCR seems unable to deal with the admittedly similar-looking letters of the Fraktur alphabet.
Ultimately, the metadata problem seriously lessens the use value of the Google Books collection, given that much of its reserves are (for the time being) effectively inaccessible or linked to corrupted and often misleading information.
UPDATE: Google Books has issued a response (subscribers only) in the letters section of the Chronicle of Higher Education ("Yes, Google Book Search Has Mistakes. It's Only Human," October 5, 2009).
UPDATE II: Peter Jacso has also written a piece, "Google Scholar’s Ghost Authors, Lost Authors, and Other Problems: Why the Popular Tool Can't Be Used to Analyze the Publishing Performance and Impact of Researchers" (Library Journal, Sept 24, 2009), documenting similar metadata weaknesses in Google Scholar. The main problem appears to be that Google, even though it was offered accurate metadata by librarians, thought it could generate the metadata through an algorithm. Among the results of this decision: over 900,000 papers attributed to the author "Password." Jacso writes,
In its stupor, the parser fancies as author names (parts of) section titles, article titles, journal names, company names, and addresses, such as Methods (42,700 records), Evaluation (43,900), Population (23,300), Contents (25,200), Technique(s) (30,000), Results (17,900), Background (10,500), or—in a whopping number of records—Limited (234,000) and Ltd (452,000). The numbers kept growing by several hundred thousands hits for the cumulative total of the above "authors" during the few days this paper was being written.
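The "ghost authors" Jacso lists suggest a simple screening step that a cataloger could apply to automatically extracted author fields. The sketch below is only an illustration under that assumption and does not reflect how Jacso or Google examined the records: the `SUSPECT_AUTHORS` set is built from the words named above, and `is_suspect_author` is a hypothetical helper.

```python
# Words that the parser reportedly mistook for author names (taken from
# the examples cited above), stored in lower case for comparison.
SUSPECT_AUTHORS = {
    "password", "methods", "evaluation", "population", "contents",
    "technique", "techniques", "results", "background", "limited", "ltd",
}


def is_suspect_author(name: str) -> bool:
    """Return True if the author field is one of the known ghost names."""
    # Normalize surrounding whitespace, a trailing period, and letter case.
    normalized = name.strip().rstrip(".").lower()
    return normalized in SUSPECT_AUTHORS


# Example author fields; the strings are made up for illustration.
for author in ["Password", "Ltd.", "Peter Jacso"]:
    label = "suspect" if is_suspect_author(author) else "plausible"
    print(f"{author}: {label}")
```

A fixed word list like this would miss most parsing errors, which is part of the argument for using the librarians' existing metadata in the first place.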