Friday, September 15, 2006

Hello again - BYU Digitized Book Project - Family Histories

For those of you who have noticed that this blog has not been updated in quite a while, I want to assure you that all is well. I will continue to add comments from time to time, but probably not as actively as before, because I went back to work for a living and have a job that takes me out of town a lot.

My website for Upstate New York Genealogy at www.ny-genes.com is still very active, and many of you kind folks have been sharing a lot of data with me. Thank you. I have also secured the following: www.UNYG.com, www.UNYG.net, and www.UNYG.org, and will probably point all of them to my Upstate New York Genealogy website in the future.

The following is a message from Steve Fox, who is working on the BYU Family History Book digitizing project. It is used here with permission.

“Hi everyone. My name is Steve Fox and I am developing the Family History Book Scanning cooperative project with BYU. Your conversations on this subject have reached me through a member of your list, and I have spent some time reading through the thread on the “dangers” of the OCR text files attached to the Family Histories. I know this list is intended for genealogical research, so I don’t know how much you would like me to address the issue of the OCR text, or if you would like me to just say “I hear you,” and go away. So let me know.

First, let me explain that the Family History scanning project completed in 2005 was a pilot project in which we wanted to learn the ins and outs of scanning, OCR, and publishing books on the Internet. We plan to do this in the future on a very large scale.

We have over 120,000 family histories in our collection, so the process has to be fast and require little human intervention. Since this was a pilot project, we could have withheld the images from the public, but we felt strongly that people would benefit from having the images online and that we could learn from the feedback we would receive. These images are linked to records in the Family History Library Catalog, saving the expense of ordering a film copy at a family history center, or making a trip to Salt Lake City. I can explain what you are seeing in the OCR text by telling you 1) we did not edit or add to the text in any way; it’s just the way it comes from the OCR engines, and 2) we ran 3 OCR engines (separate programs) on each book.

OCR programs are prone to errors, and we felt that running 3 different OCR processes and keeping all the results would increase the likelihood that the correct names would be in the index. This also resulted in the variants you have seen in the text that look like distortions of names and dates. If you look at other words in the text you will see the same thing happening. We did not intend the OCR text to be another rendition of the book; it was simply there to support the full-text search. Unfortunately, at the pace we need to achieve in processing this number of books, we don’t have the resources to read each book and correct the OCR text, so errors will continue to appear there.

Let me make clear that the names in these files are not going into the IGI or any other of our products. The text file only exists for the full-text searching in ContentDM, the publication software we use to post the images on the Internet. We are proposing significant changes to our process for 2007, and hope to do another 10,000 books.

Your comments here have been very useful to us because we now understand how some of you view the OCR text file. One of the proposed changes is to use only a single OCR engine. While this will reduce the number of variants in the text file, it will increase the risk of misread words that the OCR engine can’t recognize. This change is also needed because new versions of the publication software can’t handle the variant terms. I hope more of you will provide feedback on this project.

I would be very interested in knowing if you use the printing version of each book, found at the bottom of the table of contents (to the left of the images) which combines the whole book into a single file (2 to 3 files for very large books). We also have wrestled with the issue of 1 page per PDF file, versus very large single PDF files, so for now we are producing both. Your comments on these and any other issues related to this project will be very welcome. You can contact me at foxsj@ldschurch.org . I will do my best to answer any questions you may have. Steve”
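
For readers curious why keeping the output of all three OCR engines helps the full-text search, here is a minimal sketch in Python. It is only an illustration of the general idea Steve describes, not the project's actual software: the function name, the sample page number, and the three misread lines are all made up for the example.

```python
def build_index(engine_outputs):
    """Build a word -> set-of-page-numbers index from one or more OCR outputs.

    engine_outputs is a list of dicts, one per OCR engine, each mapping a
    page number to the text that engine produced for that page.
    Every reading from every engine is kept, so variants sit side by side.
    """
    index = {}
    for pages in engine_outputs:
        for page_no, text in pages.items():
            for raw_word in text.lower().split():
                word = raw_word.strip('.,;:()"')
                if word:
                    index.setdefault(word, set()).add(page_no)
    return index


# Hypothetical readings of the same printed line by three different engines.
# Each engine gets something wrong, but not the same thing.
engine_a = {12: "Jolm Srnith born 1842"}
engine_b = {12: "John Smith bom 1842"}
engine_c = {12: "John Smitli born 1B42"}

combined = build_index([engine_a, engine_b, engine_c])
single = build_index([engine_a])

# A search for "smith" finds page 12 only when all readings are kept.
print("combined:", combined.get("smith"))
print("single:  ", single.get("smith"))
```

The combined index finds the page because at least one engine read the surname correctly, but it also contains the distorted variants, which is why they show up in the text file. An index built from a single engine has fewer variants but misses the page whenever that one engine misreads the name.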
