tadhg.com
tadhg.com
 

Bookmarking Project: Database Structure

23:54 Fri 19 Jan 2007. Updated: 00:49 21 Jan 2007
[, , , , ]

I’ve made progress on my bookmarking project, and have most of the input side done, including the JavaScript and the PHP. The first thing I had to figure out on the backend side was the database structure, which is what I discuss here.

While it’s fairly simple, I’ve (shockingly) never made a table purely to manage relationships between entries in other tables before.

For this project, I wanted to track pages I’ve read online. I wanted those pages to have tags so that I can easily find them later. I could just put the tags in a comma-delimited list in a field in the pages table, but that’s not the right approach.

The other fields could be put in the pages table, things like URL, Author, Content, etc. So I needed three tables: pages, tags, and pages2tags. The first two are self-explanatory. The third manages the relationship between them.

The pages table has the following fields:

page_id
title
source
url
comments
date_added
content_type
character_set
domain
author

This is the non-tag metadata about the page, plus the content of the page (source). page_id is a different kind of metadata, the identifier for the entry in the table rather than something about the page itself.

The tags table is simpler, having only two fields:

tag_id
tag

The tag field contains the string that is the name of the tag. The tag_id field is just like page_id in the pages table.

So, if I’m not jamming the tags for each page into a field for that page, how do I correlate between the two?

The pages2tags table has three fields:

rel_id (Frank points out below that this field isn't strictly necessary.)
tag_id
page_id

rel_id is the equivalent of the page_id and tag_id fields in their tables. But in this, page_id and tag_id establish a relationship between a tag and a page.

Let’s say I bookmark to a Glenn Greenwald post and tag it with: politics, America, privacy, intelligence, recommended. This creates:

* An entry in the pages table, with the data for each field, plus a unique page_id.
* A new entry in the tags table for each tag that isn’t already in that table.
* The tag_id values for each tag.
* Five new entries in the pages2tags table, one for each of the tags. Each of the entries has the page_id for the new page, and the tag_id for that tag.

That’s it. To find out what tags a page has, look up that page_id in the pages2tags table. To find out what pages have a given tag, look up that tag_id in the pages2tags table.

That’s the data structure, a fairly simple one. But that model, of relationship tables, is extremely powerful and used very frequently.

Incidentally, in developing this project so far I’ve found myself dividing things into the following areas:

* Features/Use Cases—what the application needs to do.
* Data Structure—what data needs to be stored, and how to do that. (Which is what I covered in this post.)
* User Interface—what the user will need to do to use the application. (Which is what I covered in my post about the Javascript side.)
* The “connecting layer” that takes whatever the UI provides and turns it into the requisite data structure. (Which is what I did in PHP.)

This is hardly a revolutionary approach, and I don’t claim to have thought of it on my own; I’m merely stating that it’s making a lot of sense to me now, and I’m beginning to automatically divide projects up into these areas.

2 Responses to “Bookmarking Project: Database Structure”

  1. Frank Says:

    Your pages2tags table is an example of a junction or intersect table, which is the standard way to normalize a many-to-many relationship. Although there is not usually a reason to include a separate sequence in that table, like you have with the rel_id. Most of the time, you would just make the primary key a combination of the foreign keys, page_id and tag_id, since there would never be a reason to have a duplicate row. Is there any particular use you were thinking of for rel_id?

  2. Tadhg Says:

    Thanks for that info—at time of writing, I couldn’t remember the term “junction table”.

    No, I didn’t have a particular use for the rel_id, I guess I just put it in there by default. I added that as a note to the article itself, though.

Leave a Reply