tadhg.com

Guide to How the Web Works II: For Website Owners

21:14 Sun 18 Aug 2013

Second in a planned series of five posts about the technical side of the web. The first post covered what every web user should know, and this one is intended for people who own websites—who also need to know what was in the first post.

This is a work in progress, and I welcome feedback.

This is not a guide on how to run a business online, nor on how to build a website, but rather covers the basics you need to know in order to do those things.

Where it is: DNS

As discussed previously, the addresses for websites are made up of constituent parts, the most important of which are scheme (e.g. http://), domain (e.g. tadhg.com), and path (e.g. /wp/2013/07/28/guide-to-how-the-web-works-i-for-web-users/). Your website will use HTTP or HTTPS, because it’s a website; it’ll have various sections with paths pointing to them; but what about the part in the middle, the “domain name”?
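To make the three parts concrete, here is a minimal sketch that splits the example address using Python's standard `urllib.parse` module. The variable names are mine, chosen for illustration.

```python
from urllib.parse import urlsplit

# The example address from the paragraph above.
url = "http://tadhg.com/wp/2013/07/28/guide-to-how-the-web-works-i-for-web-users/"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # tadhg.com
print(parts.path)      # /wp/2013/07/28/guide-to-how-the-web-works-i-for-web-users/
```

Note that the parser reports the scheme as `http`, without the `://` separator.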

As with phones, computers on a network are typically addressed numerically. The internet is a global network, and each machine on it has an individual address, called an IP address[1]. Currently this is a string of four numbers, each from 0 to 255, separated by dots, e.g. 205.251.242.54[2].
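If you want to check whether a string is a well-formed address of this kind, a rough sketch using Python's standard `ipaddress` module might look like this. The helper name `is_ipv4` is hypothetical.

```python
import ipaddress

def is_ipv4(text):
    """Return True if text is four dot-separated numbers, each from 0 to 255."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        # Raised for out-of-range numbers, missing pieces, or non-numeric text.
        return False

print(is_ipv4("205.251.242.54"))   # True
print(is_ipv4("205.251.242.256"))  # False, because 256 is out of range
```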

IP addresses aren’t easy to remember, so DNS provides a way to map labels to them. Simplifying greatly, you pay (on a yearly basis) to get a domain name into a registry that internet devices look up to find corresponding IP addresses—somewhat like an automated centralized phone book, except that domain names in their full state must be unique.
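The lookup itself is something any program can ask for. The following is a small sketch, not a full DNS client: it leans on the operating system's resolver through `socket.gethostbyname`, and the `lookup` helper and its `resolver` parameter are names I made up so the function can be tried without a network connection.

```python
import socket

def lookup(domain, resolver=socket.gethostbyname):
    """Return the IPv4 address for domain, or None if it cannot be resolved."""
    try:
        return resolver(domain)
    except OSError:
        # socket.gaierror (a subclass of OSError) signals a failed lookup.
        return None
```

Calling `lookup("tadhg.com")` on a machine with internet access should return whatever address the domain currently points at; I have left the result out because it can change.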

Domain names use dots to separate the pieces, which are most broad at the end[3], i.e. I can put any number of pieces, separated by dots, before the start of tadhg.com and they will all be considered part of the tadhg.com domain, but I cannot meaningfully make any claim to tadhg.com.ie.
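The "broadest at the end" rule is easy to express in code. This is an illustrative sketch only; the function name `is_within` is hypothetical, and real-world rules about who controls which names are more involved.

```python
def is_within(name, domain):
    """Return True if name is domain itself or a subdomain of it."""
    name_labels = name.lower().rstrip(".").split(".")
    domain_labels = domain.lower().rstrip(".").split(".")
    # Compare from the right-hand end, where domain names are broadest.
    return name_labels[-len(domain_labels):] == domain_labels

print(is_within("a.b.c.tadhg.com", "tadhg.com"))  # True
print(is_within("tadhg.com.ie", "tadhg.com"))     # False
```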

The last section of a domain name is a TLD, for “top level domain”. The most common of these is .com, followed by .net and .org and the various two-letter country codes, and by newer non-country TLDs such as .name. A name has to be unique only within its TLD, which is one reason why nissan.org and nissan.com are different sites.

Some country TLDs enforce second-level categorization. For example, Australia requires commercial entities to use .com.au, so that if Amazon opened in Australia their Australian address would be amazon.com.au.

Using a www subdomain for websites is traditional, but not required; your site should have one, but should respond in exactly the same way whether or not the user adds www. to the URL.
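One way to think about this is that both forms of the hostname should reduce to a single preferred form. Here is a minimal sketch of that idea; `canonical_host` and its `prefer_www` flag are hypothetical names, and in practice this is usually handled in web server configuration rather than application code.

```python
def canonical_host(host, prefer_www=False):
    """Reduce a hostname with or without a leading www. to one preferred form."""
    bare = host.lower()
    if bare.startswith("www."):
        bare = bare[len("www."):]
    return "www." + bare if prefer_www else bare

print(canonical_host("www.tadhg.com"))  # tadhg.com
print(canonical_host("tadhg.com"))      # tadhg.com
```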

DNS can be changed. You can move your website to a different machine with a different IP address and point your domain at that new address. However, this change is not instantaneous, and can take days to percolate to the distant reaches of the internet.

This also means that if you ever lose control of the domain name, the new controller can point it at their own machine; your content will still exist, but will be very difficult for people to find.

Where it is: Hosting

DNS points devices to your IP address, but you still need a machine at that IP address that handles HTTP requests. This is your web server.

Realistically, you have three options for this:

Home
Host it on your home internet connection[4] using a machine of your own. This gets trickier if your ISP doesn’t provide you with an unchanging IP address (not all do).

Hosted Platform
If your site has some specific function, like a blog, you can host it on a platform providing that service, such as wordpress.com.

Virtual Server
You rent part of a machine from a provider such as Linode. Usually, they’ve set up the machine and you take it over. Here, your server is one of many running on the hardware. Some of these services have built-in administration tools meant to make it easier to manage them.

Other options exist[5], but the above three are best for beginners.

Once you have this set up, you should have the three necessary components for a web server:

  • A domain name pointing at an IP address.
  • A machine with that IP address.
  • Your content being returned by that machine to HTTP requests.
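To illustrate the third component, here is about the smallest program I can think of that returns content in response to HTTP requests, using only Python's standard library. It is a sketch for experimenting on your own machine, not something to run a real site on; the page text, the port number 8000, and the function names are all assumptions of mine.

```python
from wsgiref.simple_server import make_server

def app(environ, start_response):
    """Return the same small HTML page for every request (WSGI interface)."""
    body = b"<h1>Hello from my web server</h1>"
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

def serve(port=8000):
    """Listen on the given port until interrupted (Ctrl-C)."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Calling `serve()` and then opening http://localhost:8000/ in a browser on the same machine should show the page.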

Search Engines

You have to be very successful in order for any significant number of people to remember what your domain name is. New visitors will be incredibly unlikely to know it in advance. That means users need a way to find it, and the most likely way is via a search engine.

By “search engine” I primarily mean Google, but others still exist, including Duck Duck Go and Bing.

Unlike the DNS system, these search engines are not like a phone book. You can pay to get your business put into a specific category in the phone book, but you cannot do exactly the same thing with any of the search engines.

Most search engines have two sets of results: the “natural” results and the paid results. The majority of users—probably including you—learn to distinguish between the two quickly, and focus almost entirely on the “natural” results.

These results are the output of rather complicated systems that map the web and try to relate the locations on that map to search terms. Doing this is far more difficult than it sounds, and Google is where it is today because it was the first search engine to actually do it well.

It is not in the interest of search engines to let sites pay their way to the top of these results. If they did, their users would soon become fed up with the lower-quality results, and switch to another service, thus destroying the value of the paid results for that search engine. This is why Google won’t let you just pay them to be the first result for some search term.

Furthermore, if they won’t let you pay them, they also won’t tolerate for long a system in which you pay someone else who games their system to get you to that first slot. Bear this in mind when considering paying for search engine optimization.

There are ways to improve your standing in the results, but the first is quasi-tautological: be popular.

The way that Google and other sites judge “quality” boils down to what other sites link to you. If high-quality sites link to you, your ranking will improve[6]. If your content is interesting and you’re lucky, other sites might notice you and link to you, improving your ranking. Posting good content and doing non-invasive promotion are probably the best and safest ways to rise in the search engine results.

URL Construction Practices

This is in the “website owner” piece rather than the “web developer” piece because it’s related to search engine results, and because it’s a good practice that is overlooked far too frequently.

URLs on your site should be comprehensible, logical, and friendly. They should not look like ampersands and percent signs are waging a dirty war for control of the location bar.

They should also be predictable. If you have a contact page, the path for it should be /contact. If your site represents several locations with different contact information, and you’ve split those onto different pages, then your San Francisco contact page should be /contact/san_francisco—as well as /contact/sf. In such a case you should consider making /sf/contact and /san_francisco/contact go to the same spot[7].
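The San Francisco example can be sketched as a small lookup table. This assumes /contact/san_francisco is the address you want people to end up at; the `ALIASES` table and the `canonical_path` helper are hypothetical names.

```python
# Every alternative spelling points at the one preferred path.
ALIASES = {
    "/contact/sf": "/contact/san_francisco",
    "/sf/contact": "/contact/san_francisco",
    "/san_francisco/contact": "/contact/san_francisco",
}

def canonical_path(path):
    """Return the preferred path for any of its known aliases."""
    # Treat /contact/sf/ and /contact/sf as the same request.
    normalized = path.rstrip("/") or "/"
    return ALIASES.get(normalized, normalized)

print(canonical_path("/sf/contact"))  # /contact/san_francisco
print(canonical_path("/contact"))     # /contact
```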

Some examples (fake, but not without realistic bases) of how not to do it:


http://tadhg.com/wp?post_id=1234


http://tadhg.com/wp?category=2&author=1&title=guid_to_how_the_web_works_1_web_users&year=2013&month=07&day=28


http://tadhg.com/article/0,,30200-1303092,00.html

Avoid systems that make it difficult to bookmark and share individual pages—clean URLs help a great deal with this, and you want people to both come back and to share the site with each other.
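By contrast, a clean path can be derived from things a human already knows about the page, such as its date and title. The sketch below is one possible scheme, not a description of how any particular blogging software does it; `slugify` and `post_path` are hypothetical helpers.

```python
import re
from datetime import date

def slugify(title):
    """Lowercase the title and join its words with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

def post_path(published, title):
    """Build a readable /year/month/day/title/ path for a post."""
    return "/{:%Y/%m/%d}/{}/".format(published, slugify(title))

print(post_path(date(2013, 7, 28), "Guide to How the Web Works I: For Web Users"))
# /2013/07/28/guide-to-how-the-web-works-i-for-web-users/
```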

Good URL construction is likely to benefit your search engine ranking.

Linkrot

Links that point to content that no longer exists are annoying. But no content is forever, so it’s inevitable. However, if the content exists but has been moved, there’s no excuse for breaking the old URLs. Redesigning your website is no reason for http://example.com/contact to stop working just because you decided that http://example.com/contact_us is better; the former should be redirected to the latter. Doing otherwise inconveniences everyone who might come from a pre-redesign link, and those might include new customers or readers whose first experience of your site is now annoyance.
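In HTTP terms, "redirected" means answering a request for the old path with a 301 Moved Permanently status and a Location header naming the new one. Here is a toy sketch of that decision; the `MOVED` and `PAGES` tables and the `respond` function are hypothetical, and a real site would normally configure this in its web server or framework.

```python
# Old paths and where their content lives now.
MOVED = {"/contact": "/contact_us"}

# Paths that currently have content.
PAGES = {"/", "/contact_us"}

def respond(path):
    """Return an (HTTP status code, headers) pair for a requested path."""
    if path in MOVED:
        # Permanent redirect: browsers and search engines follow Location.
        return 301, {"Location": MOVED[path]}
    if path in PAGES:
        return 200, {}
    return 404, {}

print(respond("/contact"))     # (301, {'Location': '/contact_us'})
print(respond("/contact_us"))  # (200, {})
```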

This is another reason to be careful about URL construction—once you put something up online, links to it should work for as long as it’s still there. Not trying to ensure this is careless, obnoxious, and detrimental to your site.

In addition, breaking those links will likely have a negative impact on your search engine ranking, as the search engines notice the links are broken and therefore discount the benefit your site was gaining from links to those URLs on other sites.

Text is Best

The web is still a fundamentally text-based medium. Images and video and audio are all hugely popular, but without textual descriptions of those things, they’re very difficult to find.

Regardless of what kind of site you have, its textual elements should be text, and not images. Information you want to convey should be in text form, unless you’re podcasting or are doing pure video.

If you don’t get why this is true, consider the category of website that is the worst offender for this: restaurants. I’ve had to type out the address and/or phone number of restaurants many times in order to use them or share them, because the restaurant decided that plain text wasn’t good enough and thus used images or Flash instead. There are certainly times when I’ve been choosing between restaurants and chose the one whose website made it easier to extract information.

Users are coming to your site for information and content. Do not make it hard for them to get those things.

Copying

This includes not trying to stop them from copying what you put up. If they can see it, they can copy it; there are no magical solutions to this issue. What you’re reading right now is your browser’s local copy of the copy that my web server gave it. While it might be worth it to watermark photos, going beyond that is a lost cause that will just alienate people who might have otherwise become loyal fans.

Error Recognition

If you encounter problems with your own or another website, you at least need to be able to recognize and distinguish between the following four categories of error:

DNS
If the DNS setup for the site isn’t working, your browser will return something along the lines of “could not find the site”. Try the following:

  1. Open the site in a different browser to make sure it’s not a browser issue.
  2. Open another site to make sure you have internet access.
  3. Open a command line (e.g. Terminal in OS X) and enter ping sitename.tld and hit return (where “sitename.tld” is your domain).

    If that results in an “unknown host” error, it’s a DNS problem.

  4. Go to http://isup.me to see whether the problem is just for your part of the internet or is more widespread.

Unresponsive
If your web server’s DNS is working but the machine running it is having problems, your browser may return something like “example.com is not responding”.

The same steps as with DNS apply here—however, here ping will not return “unknown host” but will show the site’s IP address. If ping also shows responses from that IP, you may be able to narrow down the problem to the web server software rather than to the entire machine.

Not Found
Your site should have a custom 404 Not Found page, but sometimes browsers present their own. In any case, “Page Not Found” should be visible here, and perhaps the number 404, which is the error code the web server returns when given a path it doesn’t recognize.

  1. Re-enter the URL, with particular care about the presence or absence of trailing slashes.
  2. Try other URLs, including just the domain name, to determine whether or not the error is with that specific page or with the entire site.

Server Error
This should result in a page that has the term “server error” (maybe “internal server error”) on it, and is likely to be the result of a misconfiguration of or programming error in the web server software. The same steps that apply to the Not Found case apply here.

Distinguishing between these cases—and the case where your internet access is down—will make troubleshooting far easier, and will improve your ability to communicate with any technical folks you’re working with to resolve the issue.
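The four categories can be told apart mechanically, which may help make the distinctions clearer. The following is a rough sketch under several assumptions: it only tries plain HTTP on port 80, it cannot tell a site's DNS problem from your own internet access being down, and the names `classify` and `probe` are hypothetical.

```python
import http.client
import socket

def classify(dns_resolved, connected, status):
    """Map what was observed to one of the four error categories."""
    if not dns_resolved:
        return "DNS"
    if not connected:
        return "Unresponsive"
    if status == 404:
        return "Not Found"
    if status is not None and 500 <= status < 600:
        return "Server Error"
    return "OK"

def probe(host, path="/", timeout=5):
    """Try to fetch a page and report which category the result falls into."""
    try:
        socket.getaddrinfo(host, 80)
    except socket.gaierror:
        # The name could not be turned into an IP address.
        return classify(False, False, None)
    try:
        conn = http.client.HTTPConnection(host, timeout=timeout)
        conn.request("GET", path)
        status = conn.getresponse().status
        conn.close()
    except (OSError, http.client.HTTPException):
        # The name resolved, but nothing usable answered at that address.
        return classify(True, False, None)
    return classify(True, True, status)
```

For example, `probe("example.com", "/no-such-page")` would be expected to report "Not Found" if the server is up and has no such page.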

[1] This is a simplification; in practice one public IP address can be used by multiple machines to access the internet through use of Network Address Translation. There are many other details I’m skipping over.

[2] The “currently” qualifier is there because that describes an IPv4 address, and there is a slow movement to get the internet to use IPv6, which uses eight groups of four hexadecimal digits separated by colons.
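For the curious, Python's standard `ipaddress` module can print an IPv6 address in its full written-out form. The address below comes from a range reserved for documentation and is only an example.

```python
import ipaddress

# "::" is shorthand for a run of all-zero groups.
address = ipaddress.IPv6Address("2001:db8::1")
groups = address.exploded.split(":")

print(address.exploded)  # 2001:0db8:0000:0000:0000:0000:0000:0001
print(len(groups))       # 8
```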

[3] In contrast to both paths and IP addresses, both of which are most significant at the start: 192.xxx.xxx.xxx means “in the 192 address block”, and /users/xxxx/xxxx means “in the users directory”. Tim Berners-Lee has said that this order in domain names was a mistake.

[4] Although some ISPs have highly irritating rules against doing this.

[5] These include:

Dedicated (Colocation)
You get a machine and pay a data center to leave it there and hook it up to power and internet. You’re still responsible for maintaining the machine, setting it up, etc.

Dedicated (Rental)
You rent a machine from a provider such as Rackspace. Usually, they’ve set up the machine and you take it over.

Platform as a Service
You rent part of a machine, plus a variety of other services, from a provider such as Heroku. Here, they’re likely to have done more setup work for you, and in addition these services are typically set up to scale as needed—you won’t reach a hard limit for e.g. disk space as you might with the above options.

Virtual
Similar to the virtual server option, but here you are likely responsible for more management, and are charged on a usage basis rather than a fixed fee. Amazon AWS is probably the best-known example of this.

[6] Judging quality algorithmically is extraordinarily difficult, which is one reason why there are so few successful search engines.

[7] However, you should choose just one of these to be the canonical URL, and the others should redirect the browser to that one. Doing this should be trivial.
