Chapter 02 - Data Modeling and Database Design
Chances are good that when you started learning about development in the programming world, you began with flowcharting. Flowcharts are a great tool for diagramming programmatic flow and for help in laying out the different components of your system. The purpose is to discover logic flaws and missing functionality before coding, rather than later during the cycle.
The reality is that flowcharts are rarely done. In fact, how many times have you been in a crunch to pull one together for a project, but waited until after the project was done to do the chart? This happens more often than any of us would like to admit. It's easy to pull it together after the fact because you already know what you've designed into the program, and you're not going to flowchart something you opted not to include in the application. It's a pretty safe way to approach things, provided you've reviewed the program along the way to ensure that the functionality meets the needs of your customer.
Database design isn't nearly as open to modification along the way. Because changing a database table after the fact alters the foundation for all other parts of the system, even in subtle ways, it often requires major overhaul work in the balance of the system. A simple change to a table can mean that entire portions of the application will stop functioning.
Probably the most important thing you can do when you bring up a database-oriented system is to diagram and design the database structure that will support the application. If you don't do this, you'll end up making some subtleand often not-so-subtlechanges to the structure later, probably at a time when the changes are far more expensive in the development cycle.
When you decide to develop a system based on Microsoft SQL Server, you've taken the first step toward implementing two separate architectures in your software. These are client/server and relational database tables. Both are powerful if used correctly and present a real-life advantage to your system if you take time to pay attention to the rules along the way.
This chapter introduces you to those rules. We'll show you the theory and practice that make up a client/server system, including how you can start to determine the best location (client or server) for a given piece of functionality to be implemented, guidelines for breaking apart procedural flow, and much more.
We'll also show you other parts of this book that can help you implement these concepts in your application. Since the decision to move to client/server depends heavily on the flow of the application and where things are physically being completed, SQL Server provides an ideal component to provide server-side functionality to bring client/server to your applications. It's important to understand what pieces of the database design will dovetail into your desires to move to the client/server world.
Your overall goal is likely a 3NF, or third normal form, database. This will be the best compromise in most cases between extremes when it comes to normalization versus functionality and ease of implementation. There are levels beyond 3NF but, in practice, they can begin to cloud the database design with more design issues than functional issues.
When you delve into the world of normalized databases, you are, by definition, starting down the road of relational databases. Structures before normalized databases used a series of pointers to retain relationships between different tables and values. You may recall implementing linked lists, where each row in a database table contains a pointer to both the next and previous rows. To traverse the database, you simply walked up and down this list of links between records.
In the relational world, you define columns that relate to each other among tables. These columns are keys to other values. Keys are used to uniquely define a relationship to another instance or set of information. This chapter will get into more about keys as you work with the definitions of the different normalization levels.
A key difference between SQL Server-type database implementations and other more traditional PC-based databases is the terminology used to describe the databases and their information. Keep in mind that a device, or physical file on the disk drive of the server, contains one or more databases. Databases contain one or more tables, and tables contain one or more columns of information. For each table of columns, one or more rows may exist.
In more traditional terms, there was no concept of a database as in SQL Server. Instead, you had a file that contained records of fields. The following table shows a basic comparison of terms between a SQL Server implementation and a more traditional database such as Btrieve or dBASE.
You'll need to keep these terms in mind and, if you use the newer terms in describing the tables you're designing, you can avoid problems with ambiguity between developers and designers.
With relational databases, you don't use ordered, or sorted, rows. You use real-time statementsthose that are evaluated when they are called or issued to control the presentation of the information. You also use joins and views to control how information is retrieved, rather than try to store the information in the most advantageous format possible at the outset. This allows for more dynamic access to the information in the database tables. It also lets you simply store the information, and then retrieve it in any manner you like.
Take a look at the different types of normalization up to, and including, the third normal form.
In the past you may have implemented a database schema where, for example, you stored the item code for each item orderedsuch as in a point of sale system, with the order record. Later, when your program queried the order, it retrieved and parsed this field and could determine what was ordered with that order record. Figure 2.1 shows an example of this. You had the opportunity to store one or more item numbers in with the order record.
Fig. 2.1 - Without 1NF, you could store more than one logical item in a physical record. This isn't valid when you normalize your database.
With 1NF, duplicates aren't allowed. You need to create a schema where only one item will be recorded for each order record on file. So to implement the point-of-sale solution mentioned earlier, you would have an order represented by one to n records containing the item code information that made up the order. This provides a slightly different challenge to retrieve the information for the record. You must have a means of retrieving each record associated with the order and making sure that you've retrieved each record, but no more or less. Of course, this will lead to order numbers, called out in each record. Later, this chapter gets more into the database design and entity relationship models. Figure 2.2 shows the results of this first pass at normalizing a database table.
Fig. 2.2 - With a 1NF table, each row is a single, atomic record. It must be able to stand alone.
For now, simply remember that with 1NF, you must have each row contain only one instance of the information, and all column values must be atomic.
You'll need to add an OrderItemID column to the table to fulfill the requirements for 2NF (see fig. 2.3). You'll keep the OrderNum, but the OrderItemID will be the primary unique identifier for the item in a given row.
Fig. 2.3 - After you add a unique row ID for each line item, this table now fits the 2NF model.
That's a strange definition until you understand what's really happening here. The whole goal of normalizing your tables is to remove redundant, non-key information in your tables. Reviewing figure 2.3 shown earlier, you'll quickly see that this model is broken quite nicely. Because you're storing descriptions in the table, and because these descriptions probably are used elsewhere in the database, storing them individually in the table is a problem.
Here's where you start to see the real advantages of moving to 3NF. Remember, the simple examples that we're using here relate to a grocery store. Imagine if you were storing your information as shown in figure 2.3, and that you needed to change the description of "Milk" to "Milk 2%" because your vendor has maliciously introduced "Milk 1%" to the mix. You'd need to write a utility to update the sales database tables and any other table that ever referenced "Milk" to show the correct percentage. A real nightmare that will often lead to problems is inconsistent information in the database. It would take forgetting or not knowing about only one table that needed to be updated to completely invalidate your use of the item "Milk" for sales history.
Normalization is the key to taking care of this. In figure 2.4, the problem is corrected quite simply by removing the description from the table altogether.
Fig. 2.4 - To fit the 3NF model, the description column is removed from the order items table.
Since an item code is already assigned to Milk, you have the information you need to set up the Inventory table, as shown in figure 2.5.
Fig. 2.5 - By creating an inventory table, you create a "home base" to refer to and use any time you reference inventory items.
Of course, once you've begun normalizing your table, it brings to mind the question of how to get a complete picture of the item sold. How can you find out all the information about the line item, what its description is, the sale price, and so forth? This is where relational databases and their use of Views come into play. In the example, you can create a quick query that returns the information you need quickly and easily. Figure 2.6 shows what a sample view would return to your application for the table examples.
Fig. 2.6 - When you create a relational view of the database tables, you can retrieve and work with a complete picture of the information, although it may be dispersed across several tables.
See Chapter 6, "Retrieving Data with Transact-SQL" for more information.
See Chapter 9, "Managing and Using Views" for more information.
You can see that by using the combined information between the tables, you still have the same full data set to work with, but at the same time, you're limiting the sources of information. You'll find quite often that the way you end up working with the information may not change, only the methods used behind the scenes to retrieve that information. This is the case with the 3NF table in this example. You're still storing the same information baseyou're just retrieving the information using a different means, which in this case is a logical view of the two related tables.
Next, you'll look into the logical pieces of the functional parts of your system. These pieces will play a key role in your database and overall system design decisions.
With SQL Server, you can implement true client/server systems. By this we mean that these systems can adhere to the concepts of client/server, allowing you divide functional components into cooperative operations that accomplish your application's goal. This sounds strange, but what it amounts to is dividing processing between the client and server in a way that makes sense to the application. With database-oriented systems, especially those where the database subsystem is open and accessible from many different points, it makes sense to implement an intelligent database layer that will manage the data. This layer is responsible only for storage and inquiries as they relate to the information. It has no responsibility for the presentation of information.
In the next couple of sections, you'll review what types of functions and operations reside in the client and server sides of the client/server model. Although these concepts aren't exhaustive, you need to understand them. For many people, client/server is just a fancy term for a PC database that resides in a common location and is accessed by many different workstations. After reading this chapter, you should understand that client/server is much more than a common storage location. You can't create a client/server system by using Microsoft Access database tables, for example, regardless of whether the tables are stored on a file server or a local system because no intelligent engine can process the database independently of your application. The logic controlling the data is still driven by your client-side application.
The client application will typically be written in a host languageoften Delphi, PowerBuilder, Visual Basic, C, or C++, for example. These applications allow users to perform add, change and delete operations against the database, where applicable.
The client application should avoid, at nearly all costs, having to work with the entire database table's contents. When a set of information is worked with, you should always think of it as a results set, not the entire data set available to you. By results set, we mean that you should ask the server application to filter and limit the information that will be presented to you so that the operations you carry out are completed against as small a set of information as possible.
One of the best descriptions we've heard comparing older systems with client/server is that of a file cabinet versus a folder. In older systems, you'd typically be doing the equivalent of asking for a file cabinet full of information, so you can take the time to sift through the contents to find the file you want. In this scenario, your client-side application is the piece doing the sifting. All information from the database table is passed through the client, and the client does the filtering to find the information you want to work with.
In the client/server world, you simply request the file folder you want, and that's what's returned. You don't filter through the file cabinetthe server process does. This limits network traffic, as only the results set is passed back over the network. The other very significant benefit of this is that it also increases performance for your application. Typically, server systems are powerful, very strong computing platforms. Since this optimized server platform can work with all information locally, it can do so at top speed. It will be processing the information at the best rate possible.
In short, your client-side application should be optimized to work with results sets. This works hand in hand with database structure and design because you need to make sure that you create the database in such a way that it can support this requirement. You'll have to define the joins, queries, stored procedures, and table structures to support this optimized query into the contents of the database.
In summary, here are some guidelines for the client side of your application:
With SQL Server, your goal is to create the results sets required by the client-side applications. The database engine will be responsible for carrying out the information storage, update, and retrieval in the system. When you first start working with SQL Server, notice that it has no user interface (UI) at all. Yes, utilities are available to help you manage it, but SQL Server in and of itself has no UI. This is by design. SQL Server exists to fulfill requests made of it to the point of returning the results from those requests. Unlike Access, dBASE, FoxPro, and others like them, SQL Server has no involvement in showing users the results of these queries.
When you're designing your database structures, you need to keep a very close eye on how you implement informational control in your system. For example, it may be that different people will need different access levels to the information. Security may beand often isa major issue in the query of the database. If this is the case, your table structures and joins need to reflect this requirement. Chapter 19, "SQL Server Security" provides additional information about the security considerations for your system.
You'll also need to keep in mind how users are going to be accessing your information. Remember that in today's world of open systems, new challenges exist in presenting and controlling information. As you create your database tables and the rules that govern them, you need to assume absolutely nothing about the client side of the application. A good question to ask yourself is, "When I receive this information, what needs to happen with it?" You can answer it by saying, "It needs to be stored for later retrieval." But is that all?
If you're storing sales information, you should validate the item code being sold. Does it exist in the inventory database? Is sufficient stock on hand to sell this item? Do you force sufficient stock levels, or do you allow a "negative stock" situation to occur and simply log the discrepancy in a suspected transactions table?
Each issue requires work on the database side through rules and triggers. It's true that you could expect the client application to complete these tasks, but what if someone is accessing your database from Excel or Word? Can you really assume that they've had the presence of mind to make sure that these checks are taking place? These important issues should be carried out by the server to make sure that they happen, regardless of the point of entry.
For additional information about rules and triggers, see Chapter 14, "Creating and Managing Triggers" and see Chapter 11, "Managing and Using Rules and Defaults" In addition, see Chapter 22, "Accessing SQL Server Databases Through Front-End Products"
It may seem like the right thing to do at the time you're implementing that really intricate trigger or rule, but a caution is in order: it can become a nightmare trying to move too much functionality into the wrong side of the client-server model. Think long and hard about other ways you can implement something if you find yourself putting into place an operation that breaksthe client/server model. You'll be glad you did.
Database flowcharts consist of entity relationship diagrams, or ERDs. ERDs show exactly how a database is structured, what the relationships are between the tables, what rules and triggers are involved in maintaining referential integrity, and so forth. One big benefit of the ERD is that you can sit down with the client and take a logical walk through the database, making sure that the system serves the client's needs.
Fig. 2.7 - This diagram shows a relational diagram for the grocery sales system.
we've added the customer table to be able to track an order for a customer, but apart from this addition, the table structure reflects the earlier tables and relationships. Take a look at how these basic objectsthe entities, attributes, and relationshipsapply to the simple model.
Fig. 2.8 - The basic entity is represented by a box that typically contains two sections.
The portion above the dividing line represents the identifying portion of the row. Remember, to have a normalized database in 3NF, you need to be able to uniquely identify each row instance in the database table. By placing the identifying characteristics above the line, it's easy to read and determine how the record will be retrieved in most cases.
In figure 2.8, you can see that, by the definition for the customer table, you'll most likely be retrieving records from it by using the CustomerID.
As the name implies, non-key attributes are those items that make up the entity that don't depend on any other entity. In other words, they don't make up, nor are they a part of, a key that's used in the entity.
Key attributes come in two different types: primary and (for lack of a better term) non-primary. Primary keys are always shown above the line, indicating that they're identifying attributes for this entity. If the attribute is a key to the entity but not a part of the identifying structure for the entity, it's placed below the line.
If an item refers to a key value in another table, it's known as a foreign key. Again, if you reference the basic model as in figure 2.9, you can see that the Customer table doesn't have any foreign-key segments, but the Order table does, as indicated by the "(FK)". The foreign keys in the Order table are non-identifying, but help designate the customer that the order refers to.
Fig. 2.9 - The basic ERD shows foreign keys as primary (identifying) and non-primary (non-identifying) columns in the sample tables.
Moving from the Order table to the Line Item table, you'll see that the OrderNumber is listed as an identifying component of the Line Item table. This means that to find a specific instance of an order line item, you need to know the OrderItemID, the ItemCode, and the OrderNumber. In this implementation, the Line Item table is an associative table between the Inventory table and the Order table.
A customer makes orders in the Order table, and an order includes line items that reference inventory items. You can also read the diagram in the other direction. For example, you could also say that inventory items are referenced by line items. In any event, you should be able to show concise paths for information to follow when trying to reach an end result.
In the examples, we're using ERwin by Logic Works. This tool allows you to define the different objects, and then place the relationships between the objects appropriately. In the relationship between the Customer and Order tables, figure 2.10 shows that the relationship is non-identifying and that the relationship is a zero, one or more relationship.
Fig. 2.10 - ERwin allows you to easily define the relationship between the Customer and Order tables.
You can also see that the key in the Order table that will be used to retrieve the customer information is CustomerID. It's automatically added to the Order table, and it's added to the non-key portion of the record. If you define an identifying relationship, CustomerID will be moved to the top portion, or identifying key portion, of the Order table entity.
Each of the other relationships is defined in a similar manner. You can walk down the table structure and determine exactly how the different entities will interact. The next couple of sections look at a methodology that will guide you through the design process with the customer.
While bringing a customer up to speed, terminology, methodology, and approach aren't the goal of client reviews. The goal of any system you'll endeavor to write is making sure that the database structure will support the functionality of the system. This is where the old maxim of "Determine the output first, the input will follow" comes to bear on a project. It's certainly true that the test of any system is the output. If you've created the best system ever devised to allow input of information, it's a sure bet that if you don't have a way to get meaningful information out of the system, your time has been all but wasted.
If you didn't know that the users needed to have an aging report from their point-of-sale system's accounts receivable subsystem, would you automatically store the date that the original invoice went out?
Truly the only way you can ensure that you're not coding in vain is to make sure that you can fulfill the output needs for the system.
Here's a general set of guidelines that not only will allow you to ensure that you've hit at least the high points of your target audience, but will also map nicely to the database design topics we've covered here:
If you didn't have the associative table between the Order Table and the Inventory table, you would end up with a many-to-many relationship between the two. This isn't a good way to accomplish this, as you wouldn't have a singular path for identifying an instance of an order record. Figure 2.10 shows the associative table.
Fig. 2.10 - An associative table has been implemented to remove the many-to-many relationship problem imposed by the Order and Inventory tables.
For example, if you're working with static tables in your application, as may be the case with the inventory table in the example, you may want to load that table to a local Access table for access by your application. If you were to do this once a day, you'd be reasonably assured of having the correct and up-to-date information at the workstations.
It may be that you're loading down tables that, when taken alone, don't provide a complete picture. For example, if you have a customer table, an account balance table, and a sales representative tableall related based on customer numberyou'll end up with a three-way join to return a complete picture of the information.
You may want to consider denormalizing this set of tables at the client. In this case, you could create a single table that would hold the information from all three tables as a single row of information. That way, when you request information for customer X, you'll receive all the information you need, and a further join won't be required.
Review things carefully with your customers, whether the customer is internal or an external client, and this will provide substantial leverage in your projects and will help you toward coming in on time and on budget.
The following chapters provide additional information that you'll find useful in your database design efforts: