Thursday, July 23, 2009

Title Not Found

With regard to the Amazon/1984 stories: The ignorance of how databases work, and of what data syncing is, continues. This article and its comments at Network World illustrate it.

Background: Kindle users were able to buy an ebook that was distributed without copyright authorization. As soon as Amazon was made aware of the problem, they removed the title (Orwell's 1984, to make the story interesting) from their catalog.

That's the back-story. What happened next should be a surprise only to people who don't understand how these things work, and I'm not referring to how ebooks are created or sold (or anything having to do with copyrights or license agreements).

A database is a collection of tables. A table is a collection of records. A record is a collection of fields. Fields are made up of bits/characters.

(A record may be synonymous with a "transaction" but it isn't always.)

I'll start with a single table.

A database like Amazon's contains customers. Customers have certain characteristics, such as names, shipping and billing addresses, join dates, last purchase date, phone numbers, etc. Some of those characteristics will be stored in a "customer" table, some will be stored in subordinate tables, but they're only subordinate because we think of them that way. They're just tables. The relationships the tables have to each other (if any) exist because of our understanding of how the data relates, not because computers/databases have any inherent understanding of these relationships.

The Amazon database also contains a "products" table. Those products make up their catalog. Products have characteristics of their own, such as supplier, unit of sale, weight, description, name, etc.

The "supplier" of an item will also be a record in a "supplier" table. That table will have the contact information for the supplier, their main contact address, as well as any other data elements Amazon might wish to maintain at that level.

Amazon will also have a price file (or a "price table") because all retailers do. They'll also have some sort of inventory tracking table (either maintained locally for the items they stock in their own warehouse or combined with a hook into the supplier's inventory system, if inventory is maintained by the supplier). In an operation like Amazon, inventory is all over the place, so a combination of methods is used.

Customer and supplier records are reasonably static. That doesn't mean that they don't ever change. They do change: new suppliers and customers are always being added, and existing records are updated when suppliers and customers move. Transaction tables are constantly changing... that's why they're called "transactions."

When you log on to Amazon you sign in with your email address, but because email addresses can change, that isn't your customer number. (Some database designers are stupid enough to use an email address as a customer record number, but as I said, they're stupid.) It is a given that every customer has a customer record number and every supplier has a supplier record number. Each product in the Amazon catalog has a product number, too, which can be a single item, or a combination of items sold as part of a bundle. (Think of a computer that may be sold in a bundle that includes a laptop computer, a power cord, and instruction manuals--each having unique part numbers, but when combined create a unique product number.)

Let's say that my customer record number is C001. I'll keep it simple and we'll assume that I only have one billing and one shipping address (we know that Amazon customers can store dozens of each, but we'll ignore that).

I initiate an order for product P001.

A transaction table exists to create a relationship record of that order. That record says that customer C001 bought P001. Nothing in the transaction table knows what a P001 is, nor does the transaction table know that C001 is me. It is just linking the two together with other information to keep the transaction unique (such as assigning it a transaction number, the date the purchase was made, the date the transaction was created, etc.).
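Here is a minimal sketch of that idea using Python's built-in sqlite3 module. The table and column names are my own assumptions for illustration, as is the placeholder customer name; nothing here is Amazon's actual schema. The point is that the order row holds nothing but record numbers, a quantity, and a date.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not Amazon's.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE products  (product_id  TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,  -- transaction number
        customer_id TEXT,                 -- just a record number
        product_id  TEXT,                 -- just a record number
        qty         INTEGER,
        order_date  TEXT
    );
""")
conn.execute("INSERT INTO customers VALUES ('C001', 'Example Customer')")
conn.execute("INSERT INTO products VALUES ('P001', 'Databases for Morons')")

# Customer C001 orders product P001.
conn.execute(
    "INSERT INTO orders (customer_id, product_id, qty, order_date) "
    "VALUES (?, ?, ?, ?)",
    ("C001", "P001", 1, "2009-07-23"),
)

# The order row links the two records; it carries no names or descriptions.
order_row = conn.execute("SELECT * FROM orders").fetchone()
print(order_row)
```

Notice that the order row never mentions a book title or a person's name. Those live in their own tables.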

On the front end, Amazon displays a report of this transaction to me and sends it in the form of an order confirmation in email. That order confirmation will include the name of the item I purchased because reports can be programmed to "look up" data from other tables and other databases. Since we have a product number (P001), the report can pull the descriptive information for that item from the products table. Since we have a hook from products to suppliers, the report can also display details about the supplier, such as:

On 7/23/2009 you purchased Qty: 1, Item #: P001-Databases for Morons from "Technical Publishers."

If I purchased more than one item, it would list the additional items purchased.
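The "look up" can be sketched as a join, again with sqlite3. The schema, the supplier number S001, and the function name `confirmation_lines` are assumptions made up for this example; only the sample wording of the confirmation comes from above.

```python
import sqlite3

# Hypothetical tables; the orders table stores record numbers only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE suppliers (supplier_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE products  (product_id TEXT PRIMARY KEY, name TEXT,
                            supplier_id TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id TEXT,
                            product_id TEXT, qty INTEGER, order_date TEXT);
    INSERT INTO suppliers VALUES ('S001', 'Technical Publishers');
    INSERT INTO products  VALUES ('P001', 'Databases for Morons', 'S001');
    INSERT INTO orders    VALUES (1, 'C001', 'P001', 1, '7/23/2009');
""")

def confirmation_lines(conn, customer_id):
    """Build report lines by looking up names through the record numbers."""
    rows = conn.execute(
        """
        SELECT o.order_date, o.qty, o.product_id, p.name, s.name
        FROM orders o
        JOIN products  p ON p.product_id  = o.product_id
        JOIN suppliers s ON s.supplier_id = p.supplier_id
        WHERE o.customer_id = ?
        ORDER BY o.order_id
        """,
        (customer_id,),
    ).fetchall()
    return [
        f'On {date} you purchased Qty: {qty}, Item #: {pid}-{pname} '
        f'from "{sname}."'
        for date, qty, pid, pname, sname in rows
    ]

for line in confirmation_lines(conn, "C001"):
    print(line)
```

The descriptive text appears only in the report. It is assembled at the moment the report runs, from whatever the products and suppliers tables contain at that moment.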

The above might give the appearance that the transaction record contains the descriptive information, but it doesn't. It only contains record numbers and dates. A table that contains redundant data, such as a transaction table that stores anything about a product or customer beyond its record number, is referred to as "denormalized." That redundancy is bad, not only because it uses more space in the database than it needs to, but because it can quickly create "out of balance" or "out of sync" situations (depending on the transaction type). When a customer changes their name, for example, their customer record is updated, and only that record. If the order table copied the customer name into its own records when an order was created, those copies won't get notified/changed when the name change occurs (without redundant/unnecessary programming). So this stuff is kept simple, i.e., normalized.
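A tiny sketch of the name-change problem, using plain Python dictionaries in place of tables. The customer names and field names are invented for the example.

```python
# Hypothetical customer table, keyed by record number.
customers = {"C001": {"name": "Pat Smith"}}

# Normalized order: stores only record numbers.
normalized_order = {"order_id": 1, "customer_id": "C001", "product_id": "P001"}

# Denormalized order: also stores a copy of the customer's name.
denormalized_order = {
    "order_id": 1,
    "customer_id": "C001",
    "customer_name": "Pat Smith",  # redundant copy
    "product_id": "P001",
}

# The customer changes their name; only the customer record is updated.
customers["C001"]["name"] = "Pat Jones"

# The lookup follows the record number and sees the new name.
name_via_lookup = customers[normalized_order["customer_id"]]["name"]
# The copied value was never told about the change.
name_via_copy = denormalized_order["customer_name"]

print(name_via_lookup)
print(name_via_copy)
```

The lookup returns the current name, while the copy is stuck with the old one. That mismatch is the "out of sync" situation.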

On the backend, a bunch of things happen when a customer clicks "submit" on their order. Something equivalent to a packing slip or pull order is created for the items in Amazon's warehouse. For items not in Amazon's warehouse, the equivalent of a pull order is sent to the drop shipper/supplier. Once the item(s) are pulled from inventory, a transaction is created to reduce inventory for the quantity of items purchased. A shipper order number is created (with a vendor such as UPS or FedEx) once the package is boxed. Debits are processed against my credit card. At each stage of the order fulfillment process a status change is noted in the transaction record that triggers an update in an email sent to me:
1. Order acknowledgement
2. Order processing
3. Order filled/shipping order assigned

If there are any glitches, such as a delay in processing, a discovery that an out-of-stock condition exists, etc., an exception will be generated.

When a system is first designed the developers try to trap as many exceptions as they can. When programming changes are required later, they usually have nothing to do with bugs in programming code, as those are caught and corrected early. They are generally required because there are new exceptions (where human beings get involved in the process and make mistakes), or because there are new types of product offerings that didn't exist when the system was initially created.

Reprogramming is always more problematic than initial programming. This is because there are hundreds (sometimes thousands) of hooks into different tables, databases, and systems that need to be adjusted when changes are made in one place that have to be cascaded to other parts of the process. The programmer responsible for making the changes doesn't know what all those hooks are, and exceptions/mistakes occur when the system encounters a condition that wasn't adjusted.

One of the processes that retailers have to account for is removals from catalog. There are dozens of reasons why a retailer may wish to "de-list" a product. It could have been a limited run, special pricing, a supplier goes out of business, the supplier wasn't reliable, a new supplier had the item available at a better profit margin, etc. A one-size-fits-all method of handling this won't work, because there isn't a one-size-fits-all reason. That makes things even more complicated for a programmer, but that's why they make the big $$$.

If a supplier is determined to be unreliable, the supplier won't be used and that includes delisting all the products they sold. If an item is no longer made, the supplier is still used for other items, just not that item.

The point being that (regardless of why or how it is done) retailers are constantly changing their catalog offerings.

You can't delete records. Well, you can, technically, but you don't because that creates a "title not found" condition. If, for example, a supplier record is deleted from the supplier table, when I look through my order history at Amazon, the spot that was originally filled in with "Technical Publishers" will be blank (or filled in with the error message "title not found"). If a customer record is deleted, when Amazon attempts to reconcile their credits from the credit card companies, they won't be able to tie back to a specific customer. Because of this, records are not deleted. They're "closed." This can be as simple a step as changing a status flag from "O" ("Open") to "C" ("Closed").
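To make the difference between deleting and closing concrete, here is a small sketch with dictionaries standing in for tables. The record layout and the helper names `supplier_name_for` and `close_supplier` are assumptions for illustration; the "O"/"C" flag and the "title not found" message are taken from the paragraph above.

```python
# Hypothetical tables keyed by record number.
suppliers = {"S001": {"name": "Technical Publishers", "status": "O"}}
products = {"P001": {"name": "Databases for Morons", "supplier_id": "S001"}}

def supplier_name_for(product_id, supplier_table):
    """Look up the supplier name for an order-history line."""
    supplier_id = products[product_id]["supplier_id"]
    record = supplier_table.get(supplier_id)
    return record["name"] if record else "title not found"

def close_supplier(supplier_table, supplier_id):
    """Close a record by flipping its status flag; the row stays put."""
    supplier_table[supplier_id]["status"] = "C"

# Option 1: close the supplier record.
closed = {key: dict(value) for key, value in suppliers.items()}
close_supplier(closed, "S001")

# Option 2: delete the supplier record outright.
deleted = dict(suppliers)
del deleted["S001"]

print(supplier_name_for("P001", closed))   # history still resolves
print(supplier_name_for("P001", deleted))  # lookup has nothing to find
```

With the closed record, old order history still resolves to a name. With the deleted record, the same lookup comes back empty.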

If I asked Amazon to delete my account, they wouldn't actually delete it. They'd "close" it, because deleting it would create havoc throughout their system (as well as create havoc from an accounting standpoint, because all the history associated with my account would be lost, and that's a no-no from a bookkeeping/accounting standpoint).

Now some folks might get upset that their data isn't deleted, but that's just tough. Before computers came into the picture to track these things, companies kept ledgers of all this stuff. No one would have suggested that a company go through 15 years of ledgers and take scissors to their books of record. The fact that people don't make the logical leap from ledger to computer record is their stupid fault, not the fault of the retailer. Not only would a company not want to delete your record, in many cases (where book of record requirements exist) it could be unlawful to do so.

So Amazon got into the ebook business. Their systems were initially designed for the sale of tangible items. There was certainly a lot of reprogramming required to offer this new service. One of the main changes is that Amazon had to create a storage account, where all the items a customer has purchased can be stored. And this is where this (obviously) gets confusing for some people.

They didn't literally create a storage box for each customer, as we might buy a new physical hard-drive to store additional data. A literal storage device would mean that Amazon would store the same item over and over again, hundreds of times, in each customer's unique storage box. That would be redundant and unnecessary, and the physical storage requirements would be mind-boggling. What they created was a virtual (meaning, "not real") storage box, similar to an order transaction record. Amazon has one pointer to one copy of each electronic media item they sell, each with a unique part number. When customer C001 (me) buys one, Amazon creates a transaction record in the equivalent of an e-book storage table, so it shows up in my storage box list. It is a virtual library of "my" stuff. The physical copy is stored in one location. The list of books in my storage box is an index (akin to a "link") to the file name and location of the item.
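A rough sketch of a virtual storage box, as I've described it. Everything named here (`EbookFile`, `catalog_files`, `library_entries`, the server name, the file name, customer C002) is hypothetical; it is one way to model "one copy, many pointers," not a description of Amazon's real system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EbookFile:
    """One stored copy of an e-book: where it lives and what it is called."""
    product_id: str
    server: str      # hypothetical server location
    file_name: str   # hypothetical file name

# One physical copy per product, no matter how many customers bought it.
catalog_files = {
    "P001": EbookFile("P001", "files.example.invalid", "p001.ebook"),
}

# The "storage table": one (customer, product) row per purchase.
library_entries = [("C001", "P001"), ("C002", "P001")]

def storage_box(customer_id):
    """Return the customer's virtual storage box: pointers, not copies."""
    return [
        catalog_files[product_id]
        for owner_id, product_id in library_entries
        if owner_id == customer_id
    ]

mine = storage_box("C001")
theirs = storage_box("C002")
# Both boxes point at the very same stored file object.
print(mine[0] is theirs[0])
```

Two customers, two storage boxes, one file. Each box is just a list of pointers into the same place.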

What I do not know for certain, but I would guess, is that Amazon doesn't have all the e-book files on their own servers. I would guess that they point to files on servers Amazon doesn't maintain themselves, as they have distributors who stock items for them (and I have some inside knowledge to validate my suspicion that they do this). What the record of each item would include would be the file name and the server location where it was stored. Whether the item was stored on Amazon's servers or on a supplier's server would be invisible to a customer, just as customers are unaware that their storage box is virtual, not literal.

A customer's Kindle is linked to their virtual storage box. When a customer buys a new item, a transaction is created that displays the item in the virtual storage box that, in turn, allows the item to be downloaded to the Kindle device. The customer can choose which items they want to have on their Kindle, up to the maximum storage of the device itself. This allows a customer to buy more items than the Kindle device can literally store, and to control which items are physically stored at any given time.

When a Kindle customer has their device radio on, the system will sync the list of items in their virtual storage with the list of items currently on the device. This is similar to doing a manual, physical inventory check every now and again.
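The inventory check can be sketched as a comparison of two lists of part numbers. The function name `inventory_check` and the part numbers are made up for the example, and this is only the comparison step; I'm not claiming to know what the real sync does with the result.

```python
def inventory_check(library_ids, device_ids):
    """Compare the virtual storage list with what is on the device."""
    library, device = set(library_ids), set(device_ids)
    return {
        # Bought, but not currently downloaded to the device.
        "in_library_not_on_device": sorted(library - device),
        # On the device, but no longer listed in the virtual storage box.
        "on_device_not_in_library": sorted(device - library),
    }

# Hypothetical state: three purchases, two of them downloaded.
report = inventory_check(["P001", "P002", "P003"], ["P001", "P003"])
print(report)
```

Most of the time the second list is empty, because everything on the device is also in the storage box. The interesting case is when it isn't.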

To put this in perspective, let's say that you went to the library. You go to the card catalog. You find the book you want in the catalog and the Dewey Decimal system locator. You go to that section of the library but the book isn't there.

If you immediately assert that Ninjas broke into the library, knowing that you wanted to read that book, and took that book to a fire pit and burned it so you couldn't read it, might others think your conspiracy theory a bit daft (and you a bit scary)? It could simply be that someone else had already checked the book out of the library. The librarian could have put the book in the wrong place. The card for the book could have had the wrong identifier number typed on it. The library may have delisted the book, but forgot to take the card index out of the card file. Where human beings are involved mistakes will be made, most often, innocently. Yes, it could be that Ninjas did all that, but until we have evidence of that, Occam's Razor requires we go with the simplest and most common explanation first.

What happens when Amazon delists an e-book item because the supplier is found to have been in violation of copyright law, with respect to some or all of the titles they sell? Did Amazon Ninjas break into your house and purposely destroy an ebook or is some other, more reasonable explanation plausible?

Was the supplier record deleted or closed? Was the product number deleted, or closed, and what happens to the titles listed in the virtual storage on Amazon when that happens? Was the file deleted from Amazon's servers? Was the link to the supplier's server deleted from the e-book's record?

Regardless of the dozens of innocent technical glitches that could have caused this, the title was no longer available in Amazon's catalog; therefore, the item was no longer available in the Kindle virtual storage box. When a customer synced their device to their virtual storage, the item was no longer there. Amazon didn't, despite the protests of idiots everywhere, send Ninjas out to reach into each customer's Kindle and run some sort of delete/destroy routine. The item itself, or the link to the item or the item's server, no longer existed, so it was no longer in the index of the Kindle device.
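Here is a sketch of that scenario under my own assumptions: that the storage box view only shows items whose catalog record is still open, and that a sync drops anything from the device that the view no longer lists. The data layout and the names `library_view` and `sync` are hypothetical. This illustrates the explanation I'm proposing, not Amazon's actual code.

```python
# Hypothetical catalog with open ("O") / closed ("C") status flags.
catalog = {
    "P001": {"title": "1984", "status": "O"},
    "P002": {"title": "Another Title", "status": "O"},
}
# Purchase history: never touched by a delisting.
purchases = {"C001": ["P001", "P002"]}

def library_view(customer_id):
    """Virtual storage box: purchases whose catalog record is still open."""
    return [
        product_id
        for product_id in purchases[customer_id]
        if catalog.get(product_id, {}).get("status") == "O"
    ]

def sync(device_ids, customer_id):
    """Keep only the device items that still appear in the storage box."""
    visible = set(library_view(customer_id))
    return [product_id for product_id in device_ids if product_id in visible]

device = ["P001", "P002"]
before = sync(device, "C001")

# The title is delisted from the catalog. Nobody touches the device.
catalog["P001"]["status"] = "C"
after = sync(device, "C001")

print(before)
print(after)
```

Nothing in this sketch targets a customer or a device. One status flag changes in the catalog, and an ordinary sync does the rest.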

Amazon will have to reprogram their system so this doesn't happen again. They said that, and apologized for the glitch. They refunded the money for the items purchased.

Folks who think that Amazon virtually broke into their Kindle to delete files have no clue how databases work, what virtual storage boxes are, or what syncing means.

Amazon discovered what happens when you de-list an e-book the way that worked for tangible items, in a process that had not been programmed to handle it, and they're fixing it. Maybe Oswald did it, but until we have evidence of that, let's assume a programming glitch because those happen thousands and thousands of times every day: no conspiracy or grand-theft illusions required.

Please, if you do not understand the above, shut up. You're embarrassing yourself.