Data quality issues? Isn't that something relevant only to business analysts, data-warehouse specialists or market researchers? Ten years ago most of us would probably have thought so. And according to Gartner, Inc. (the world’s leading information technology research and advisory company), the cost of poor Data Quality will continue to hurt organizations around the world (listen to the interesting Gartner Voice podcast here: The Cost of Poor Data Quality).
In this article we are going to show that the Internet has changed that as well: Data Quality issues are no longer hidden behind corporate (fire)walls. The Internet is “transporting” Data Quality issues directly to the computer of everyone who shops online.
To make this article more interesting, we are going to demonstrate that Data Quality issues cost you as a Consumer time and money, using the largest online retailer as an example: Amazon (probably everyone else has the same issues, but because of Amazon’s size they are easier to spot there). Please continue reading if you do not want to spend $20 too much on your next USB Hub just because of some Data Quality issues in Amazon’s product catalog.
Amazon is my preferred online shop because of its proven excellent service. I am interested in buying this Hi-Speed USB 2.0 7-Port Hub produced by Belkin via Amazon because someone recommended the product to me. I expected to find the Hub at a reasonable price directly by searching Amazon for the Belkin part number “F5U237v1”, which I found on the Belkin product page. Try this Amazon search yourself. This is the result I got today:
Surprisingly, one and the same Hub is found not once in the product catalog of Amazon.com but eight times. It goes by different names such as “7-PORT USB 2.0 Hub by Belkin”, “Belkin Hi-Speed USB 2.0 7-Port Hub F5U237v1”, “BLKF5U237V1 USB Hub, 2.0, 7-Port, Black” and others. Each of these eight entries is offered by several different merchants. The prices across these redundant catalog entries range from $30 to $63.99.
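To illustrate why such duplicates are hard to collapse, here is a minimal sketch of grouping listings by part number. The three titles are the ones quoted above; the pairing of prices to titles is made up for illustration, and the helper names `normalize` and `mentions_part` are our own, not anything Amazon uses.

```python
import re

PART_NUMBER = "F5U237v1"

# Titles are quoted from the article. Which price belongs to which title
# is an assumption made purely for illustration.
listings = [
    ("7-PORT USB 2.0 Hub by Belkin", 63.99),
    ("Belkin Hi-Speed USB 2.0 7-Port Hub F5U237v1", 30.00),
    ("BLKF5U237V1 USB Hub, 2.0, 7-Port, Black", 52.50),
]


def normalize(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def mentions_part(title, part_number):
    """True if the normalized title contains the normalized part number."""
    return normalize(part_number) in normalize(title)


# Split the listings into those that can be tied to the part number
# and those that cannot.
matched = [(t, p) for t, p in listings if mentions_part(t, PART_NUMBER)]
unmatched = [t for t, _ in listings if not mentions_part(t, PART_NUMBER)]
prices = [p for _, p in matched]

print("matched:", [t for t, _ in matched])
print("unmatched:", unmatched)
print("price range:", min(prices), "to", max(prices))
```

Case-insensitive matching catches the entry that hides the part number inside “BLKF5U237V1”, but the title that carries no part number at all stays unmatched. A real matching system would need more than string containment to recognize it as the same Hub.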
If you are unlucky, you might first navigate on Amazon.com to “Home+Garden”/“Home Improvement”. Searching for the Hub from there finds only two different entries in the product catalog today. The cheapest price you can get today in Amazon’s “Home Improvement” department is $52.50 (new). Compare this with the best price of $30 for the same Hub (new) in the “Electronics” department and you will probably agree that Data Quality is something that costs Consumers time and money.
But the Data Quality of the Amazon product catalog does not only make it harder to find the best price on Amazon. To me, the customer reviews are one of the most valuable features of Amazon. Looking at the catalog entry “Belkin Hi-Speed USB 2.0 7-Port Hub – Hub – 7 ports – Hi-Speed USB” and its review, I would probably think again about buying the Hub. The entry “Belkin Hi-Speed USB 2.0 7-Port Hub F5U237v1” for the same Hub provides a more balanced view from different reviewers. Bottom line: the Data Quality issue shown above also has a negative side effect on one of the most valuable features of Amazon, the customer reviews.
Based on our example, one could argue that it is mostly the merchants who create the Data Quality issues in Amazon’s product catalog. But this example makes it clear that this is not the case: “Belkin USB + Firewire Haub 6 PORT USB 2.0”. The misspelling in this entry (“Haub” instead of “Hub”) in the German catalog of Amazon (Amazon itself also uses this entry to sell the product) probably makes it difficult for Consumers to find and buy the product.
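A misspelling like this can be flagged with simple fuzzy matching. The following is a rough sketch using Python's standard `difflib` module; the vocabulary of known words and the function name `suggest_corrections` are assumptions made for this example, and nothing here reflects how Amazon actually checks its catalog.

```python
import difflib
import re

# Hypothetical vocabulary of words we expect in titles of this product family.
KNOWN_WORDS = ["belkin", "firewire", "hub", "port", "usb"]


def suggest_corrections(title, vocabulary=KNOWN_WORDS, cutoff=0.8):
    """Map each unknown word in the title to its closest known word, if any."""
    suggestions = {}
    for token in re.findall(r"[a-z]+", title.lower()):
        if token in vocabulary:
            continue
        # get_close_matches returns the best candidates above the cutoff.
        close = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if close:
            suggestions[token] = close[0]
    return suggestions


print(suggest_corrections("Belkin USB + Firewire Haub 6 PORT USB 2.0"))
```

With this vocabulary the only suggestion is to replace “haub” with “hub”. The cutoff of 0.8 is an arbitrary choice; a lower value would start to propose unrelated words.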
Cindy Cunningham (who formerly worked on the product catalog at Amazon) provided some interesting details in 2003 about the complex Data Quality issues Amazon is facing in its product catalog. As we have just seen, Gartner is right: Data Quality is an issue today, and will remain one in the future, for organizations like Amazon. But it is important to understand that this is not only an “internal” Enterprise issue. Data Quality issues directly hit every Consumer every day.
The good news comes last: Amazon really cares about the quality of its product catalog and sees it as one of its key competitive assets. In our own best interest as Amazon customers, we hope that top developers will apply for this recent job opportunity (Oct 26, 2007):
Software Development Engineer – Product Matching-022892
If you’re looking for engineering challenges related to automatic text processing, scalability, and performance we’re looking for you. The Amazon Product Matching team is responsible for processing millions of product descriptions each day and determining their similarity to other items already in our massive global product catalog. This is a high-throughput Information Retrieval problem that includes components of extraction, relevance ranked full-text search, heuristic processing, and statistical analysis. You will bring a background in not only object-oriented design and software development, but experience in IR and NLP. We will expect you to build and run a high-quality, low-latency service at the core of Amazon’s e-commerce platform. We will support your software development efforts and your professional growth within a small-team culture: ownership will be your guide and your reward.
We work closely with our customers to support each new product category launch as well as incrementally improving the quality of the Amazon product catalog, one of our key competitive assets.
We are seeking experts in building enterprise server side applications with strong emphasis on adaptability and auditability. Team goals include exploring state-of-the-art data storage techniques while scaling the present service.
* B.S. Computer Science and 2 years industry experience
* M.S. Computer Science and/or 5 years experience preferred
* Proficient in C++ and/or Java as well as Linux data processing tools
* Skilled OOD
* Experience with distributed software architectures
* Professional coder / knows how to write robust, high-performance code
Amazon Recruiting Department
Some questions for our readers:
- Have you found similar Data Quality issues on Amazon or other major online shops?
- If you work on Data Quality issues yourself: are there tools on the market for addressing Data Quality issues in product catalogs that you would recommend based on your own experience?
Thank you for reading.