Friday, January 11, 2013

The massive Amazon ASIN database at reBOOKed

One of the problems most reported about reBOOKed is that it is a pain to enter books from one's private collection. This is especially true for books that are not already in the database, as the user has to manually enter the Title, the Author, and (if they are ambitious) the Amazon ASIN. This is cumbersome and daunting for folks with large libraries.

The ideal solution would be to have some sort of scannerbot that comes to your home, automatically scans your library and inputs your books for you. That's not going to happen until sentient robots appear, so don't hold your breath. A secondary solution would be to have a scanner that works on book barcodes to get the information. There are solutions like that out there in the wild, and that may happen someday. Not now, but someday.

An interim solution is to get more books into the database to increase the odds that the book YOU are adding is already there. Over 500 books have been entered (mostly by reBOOKed staff and long-suffering family members), but that's just a drop in the bucket. So, what's the best way to get more books into the database - including their Amazon ASINs?

Recently, I stumbled across an online listing of Amazon ASINs that were scraped from the web and made available by archive.org's Open Library project. The downloadable file - which contains only a list of the 10-character codes - is over 400 MB. It's huge beyond belief. I broke it into files containing one million codes each and ended up with 47 files. That's 47 million entries. And none of them have any title/author information associated, just the ASIN codes.
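For the curious, the splitting step is simple enough to sketch in a few lines of Python (the chunk size matches what I used; the output filenames here are just illustrative, not the actual ones):

```python
def split_codes(source_path, lines_per_chunk=1_000_000):
    """Split one huge file of ASIN codes (one per line) into
    numbered chunk files of lines_per_chunk codes each."""
    chunk_paths = []
    out = None
    chunk = 0
    with open(source_path) as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk file every million codes.
                if out:
                    out.close()
                chunk += 1
                path = f"asins_{chunk:02d}.txt"
                out = open(path, "w")
                chunk_paths.append(path)
            out.write(line)
    if out:
        out.close()
    return chunk_paths
```

A 47-million-line input produces 47 files, the last one partially filled.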

To get the associated title and author, one could copy each code, paste it into the search box at Amazon, and record the results. This would take approximately 100 years to accomplish and would very quickly kill the operator from carpal tunnel and/or boredom. Rather than risk a stress-related death, I decided to try to use the tools that already existed to create a more automated process.

Inside reBOOKed, there is code that reads the ASIN for a book, asks Amazon for a copy of the picture, publication date, reviews, etc., and displays that information on the book's page. (Here's an example, viewable if you're a reBOOKed member and logged in.) To adapt this for the massive ASIN project, I simply wrote some code that reads through the list of ASIN codes, determines if they represent a book (as opposed to toys, food, tools, and everything else that Amazon sells these days), and writes the title and author to an ever-growing file.
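In spirit, the filtering-and-writing step looks something like this Python sketch. The real code talks to Amazon's API; here each `item` is a dict standing in for a parsed response, and the field names (`ProductGroup`, `Title`, `Author`) are assumptions, not the actual reBOOKed code:

```python
def extract_book(item):
    """Return (asin, title, author) if the item is a book, else None.
    `item` mimics a parsed Amazon response; field names are assumed."""
    if item.get("ProductGroup") != "Book":
        return None
    return (item["ASIN"], item.get("Title", ""), item.get("Author", ""))

def harvest(items, out_path):
    """Append one tab-separated row per book to an ever-growing file."""
    with open(out_path, "a") as out:
        for item in items:
            row = extract_book(item)
            if row:
                out.write("\t".join(row) + "\n")
```

Everything that isn't a book - toys, food, tools - simply gets skipped.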

The problem with the code is that it only runs in a browser and only if hosted on reBOOKed (it's part of the Amazon license). And, since browsers time out, it can't be a process that runs for hours or even minutes. No, I had to write code that reads in 100 lines, writes the result, then reloads the page to start reading the next 100 lines. It works, but it's ugly. Anything for you, my fabulous reBOOKed users.
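Stripped of the browser parts, the resume-where-you-left-off logic is roughly this (a Python sketch; in the real thing the "checkpoint" survives the page reload, and the batch size of 100 is the one mentioned above):

```python
BATCH = 100

def process_batch(codes, offset_path, handle):
    """Process the next BATCH codes, persisting the offset so the
    next 'page load' can resume where this one stopped.
    Returns the number of codes processed (0 means all done)."""
    try:
        offset = int(open(offset_path).read())
    except FileNotFoundError:
        offset = 0  # first run: start at the top of the list
    batch = codes[offset:offset + BATCH]
    for code in batch:
        handle(code)
    with open(offset_path, "w") as f:
        f.write(str(offset + len(batch)))
    return len(batch)
```

Each invocation plays the role of one page load: do 100, record where you stopped, and bail out before anything times out.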

So now that I'm getting hundreds of thousands of book listings, what can I do with them? I quickly discovered that there are hundreds of duplicates. Some are perfect duplicates, some are slight misspellings or punctuation differences, some are more complicated (author list reorganization, for example). One could plow through these individually to find and delete all of the duplicates, but that starts to have stress-related death overtones. Also, many hundreds of the listings are things like census reports, user manuals, and articles from magazines that - while valid - are really not all that useful.
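One way to catch the "slight misspelling or punctuation difference" duplicates automatically is fuzzy string matching. This is a sketch of the idea using Python's standard-library `difflib`, not the actual dedup code, and the 0.9 threshold is a guess:

```python
from difflib import SequenceMatcher

def is_near_duplicate(a, b, threshold=0.9):
    """Treat two title/author strings as duplicates if they match after
    whitespace/case normalization, or are very similar by ratio."""
    norm = lambda s: " ".join(s.lower().split())
    a, b = norm(a), norm(b)
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

It won't catch the complicated cases (a reordered author list looks quite different character-by-character), but it knocks out the easy ones without any plowing.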

My solution (partial solution, really) to this problem is to dump all of the results into Excel and use a script to clean up the problems. The script will need to continue to be tweaked as new problems are found (the word "paperback" added to some listings, the names of various publishers added to some titles, the list goes on and on). That's been done, and about 25% of all results are deleted immediately.
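The cleanup rules amount to two lists of patterns: ones that get stripped out of a title, and ones that mean the whole listing should be thrown away. A Python sketch of the approach (the actual script lives in Excel, and these particular patterns are illustrative, not the real rule set):

```python
import re

# Noise to strip from otherwise-good titles (illustrative patterns).
NOISE = [r"\s*\(paperback\)\s*$", r"\s*\[?mass market\]?\s*$"]
# Listings to drop entirely: census reports, manuals, etc. (also illustrative).
JUNK = [r"\bcensus\b", r"user'?s? manual", r"^articles?\b"]

def clean_title(title):
    """Return the cleaned title, or None if the listing should be dropped."""
    if any(re.search(p, title, re.IGNORECASE) for p in JUNK):
        return None
    for p in NOISE:
        title = re.sub(p, "", title, flags=re.IGNORECASE)
    return title.strip()
```

Every time a new flavor of junk turns up, it becomes one more pattern on one of the two lists.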

The result is still pretty ugly, though. The thousands of results include some that are obviously errors. One certainly does not want to corrupt good data (entered by loyal reBOOKed users) with junky errors (generated by me). But one certainly does not want to add steps to the already cumbersome book-entering process.

The solution, as it now exists, is to present two separate lists (one after the other) in the book selection box. The first part of the list is the known good information (loyal reBOOKed users), the second part of the list is the junky information (yours truly). The nice thing about this approach is that I can continue to tweak the junky data over time and update it into the website whenever it changes.

If a loyal user chooses a book from the junky list, it will immediately be added to the more useful list and the junky entry will be deleted. I also added a "report a problem" button if, for example, the junky listing doesn't have the right picture (many of the Amazon listings don't - I think it has to do with used book reselling). Over time, the useful list should get more useful and the junky list less oppressive.

This is a long-term project. Amazon limits the number of queries to 1 per second from anyone who doesn't sell thousands of Amazon products (which reBOOKed doesn't). At 1 per second, in an ideal world, that comes to about 1.5 years just to get the information from Amazon. This is not an ideal world, so double that. We'll be at this for a while. Stay tuned.
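For anyone who wants to check the arithmetic behind that estimate:

```python
# Back-of-the-envelope: 47 million ASINs at Amazon's 1-request-per-second cap.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

asins = 47_000_000
years = asins / SECONDS_PER_YEAR  # just under 1.5 years of nonstop querying
```

And that assumes the queries never stop and never fail, which they will.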
