Mechanical

Interview

Woken up this morning by my phone pinging me with a text message. This is odd as my phone was downstairs and on silent, but sure enough, I had a text message.

Checked my email and saw that Josh McCadams posted his logic programming interview with me. Unless you're really interested in logic programming, I recommend you steer clear. Nothing I say will make a lick of sense.

As a side note, if you do listen and you heard me say that I only knew of one "relational database" (things like PostgreSQL, MySQL, Oracle, SQL Server, etc., are not relational), gorthx sent me information about other open-source relational databases and I wrote a bit about relational databases on my O'Reilly blog. This blog entry hit Reddit and the comments are show just how far Reddit has declined.
Read the 'relational databases' entry, and...

The big difference is that a Relational database won't allow duplicate tuples/rows?

But...

Isn't that one of the deadly sins of database design?

The first rule of design, as I learned waaay back(1990 or thereabout), was 'Thou shalt understand thy data', and designing something that not only allows duplicate rows, but also needs it, show that someone didn't understand the data.

Not that I'm a professional or anything...
(Mostly doing SQL work on my Psion PDA, or in MySQL on my website)
A relational database is much more than that. A query is guaranteed to not return duplicate results. And while you're right that a relational database won't even allow duplicates in tables (called "relations" in the relational model), popular databases allow this, even though they shouldn't. That's because they haven't been designed from the ground up to be relational. Comparisons against datatypes which don't compare are compile time errors. There's a lot more, but those are two of the most obvious features.

Part of the problem is that Edgar Codd came up with the relational model while working at IBM, but they never implemented. Customers started demanding it and IBM had a team work on it. Unfortunately, they didn't get it right, they had no contact with Codd, and came up with a half-baked implementation called "SEQUEL". Larry Ellison saw this and "borrowed" the idea for Oracle. Even this half-baked implementation was superior enough to most competitors that it won the day and people stopped working on the relational model.

Edited at 2008-01-20 02:01 pm (UTC)
A SQL query won't return duplicates, either, if the database doesn't contain duplicates. (Assuming that whoever wrote the damn query wasn't a complete and utter moron, of course. Then all bets are off... )
Still, it's making certain that the duplicates never show up in the data in the first place that's the big problem.

I'm in a 'webapps' group at the office, which means I'm one of a select few(forgot the most importnt rule I learned in the RNoAF: Never volunteer) who fixes and updates a set of web-based tools.
(Most of the apps was written by consultants... )

One app 'just needed' a bit of tweaking, adding new postcodes and a few bits here and there...
Most of the data seemed as if it was laid out OK, but...
One table not only contained all postcodes(and corresponding names - separated by space - but also several other 'seldomly changed' stuff, like 'documentation type' and so on. In fact the table had TWO attributes; 'DType' and 'Description', the first was a CHAR3 and the second CHAR50...

I ended up throwing together a quick'n dirty VB-script that did SQL INSERT statements so that I could update the data needed. Then I zipped the scripts and hid the file in my personal homedirectory and told them that the webapps group would NOT support that app anymore because it needed a complete rewrite, which is out of our scope.
(I know that the system will need new functions added, soon, and with the database in the shape it is, it'll end up a total mess, if it's at all possible.)

Actually, when you start writing complex queries, it's quite easy for SQL to return duplicates. Further, because today's database don't follow the relational model, this problem gets worse, particularly since some of the higher normal forms are subtle. Consider the following table:

+-----------------------------------+
| Supplier_Product_Price_Shipping   |
+-----------------------------------+
| SNO     | PNO    | Price | Days   |
+---------+--------+-------+--------+
| S1      | P1     | .19   | 14     |
| S1      | P1     | .14   | 7      |
| S1      | P2     | .15   | 7      |
+------------------+-------+--------+


Imagine that this table represents the price per product, per supplier, based on how fast you want it shipped. That table is not normalized and for many people, that's very difficult to see. Each row looks unique and it might look like the dependencies are correct, but they're not.

One problem is that SQL does not distinguish between values and meaning. Should you slap DISTINCT on a query when you're pulling a subset of data? Maybe. But the subset may not have enough distinguishing data to know if each row is distinct or not. As a result, you have to know when to append enough information on a SELECT to return distinct data. The database should already know that.

Microsoft screwed this up in older versions of ACCESS when they used to have DISTINCT on by default. If you tried to select the distinct totals from an orders table, you'd get incorrect results if any total was duplicated. In a relational database, the DISTINCT is superfluous and it knows that just because two values are the same, it doesn't mean that they're the same thing (e.g., how many guys have the first name of "John"?).
I think I see what you mean...

Depending on what you search for, (say, searching for a specific product from a specific supplier, but not specifying or requesting delivery time or price.) you can get 'duplicate' results as P1 will show up several times.

I would probably have split this into two tables, one with Supplier/Product and a SPnr(unique index) in one table, and a second table with SPnr./Price/Delivery in another table.

Then again, I like to make overly complicated databases...
(I have this monstrous maze of a database to keep track of my computer collection and present it on the web... )

As for ACCESS...
IF on_remote_drive(database$) then
{Slowdown = number_of_tables(database$) * date% }

Feels like it, at least.

And if there were more than one user...
AAAARGH!

M$ SQL Server V1.0, running on an OS/2 v1.3 Lan Manager v2.2 server(on a 16MB, 20MHz/386) felt like a speed demon compared to a 'simple' Access database with 4 or 5 tables if it was on a networked drive and more than 2 users tried to access it...
(The SQL Server setup was part of an ancient network monitoring tool. I saved the floppies for posterity, and may even reinstall it on a PC one day... )
Cool. I can file myself under "and friends" and claim that I've hit the big time since I've been accused on the Internet of pulling things out of my ass. Whoo-hoo.