http://www.cs.umd.edu/~pugh/intro-www-tutorial/
by William Pugh
(pugh at cs.umd.edu)
Dept. of Computer Science
Univ. of Maryland, College Park
Among other reasons for doing things this way, you might be using a browser that doesn't support multiple font sizes or wish to have a browser that provides an outline-based view of an HTML document.
However, as people have started to do more elaborate things with the WWW, some have tried to get more control over how their documents look. For example, some extensions allow you to control the size, color or font of text.
Another issue is that companies have raced past the official WWW Consortium: different browsers support different extensions, and some extensions are poorly thought out.
Most of what is described in this document is part of the official HTML 2.0
standard. I've included a few useful extensions and marked them as such.
These are fairly widely implemented, but you shouldn't depend on them.
I've ignored some obsolete extensions even though they
are widely used and supported.
This document was written in January, 1996. The WWW and HTML are rapidly
evolving; by 1997,
parts of this document will be substantially out of date
(although if you stick to the things in the 2.0 specification, your pages
should still look fine).
Unfortunately, I don't have the time in this mini-course to go over what makes a web page attractive, useful or easy to use. Fortunately, a number of others have done a much better job than I could do, including:
One quick note however: assume anyone reading your web page will get your documents at about 1K/second. Think real hard before having more than 20K of images on a web page.
A very useful suggestion: look at the source code for web pages; most browsers have a view source option.
Tags are delimited by angle brackets (e.g., <p>) and are
case insensitive. Some tags are containers: they have a start and end tag
(e.g., <h1> and </h1>). An end tag is always
formed by putting a slash at the start of the tag. Container tags must be
properly nested (e.g., <strong>you can't <em>overlap</strong> two character styles</em>).
Tags that a browser does not recognize (because they are extensions it doesn't handle) are generally ignored.
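As a small illustration of the nesting rule, here is a sketch of one properly nested fragment and one overlapping fragment. The sentence text is made up for the example; only the tag placement matters.

```html
<!-- Properly nested: the inner em closes before the outer strong. -->
<p>This is <strong>strong text with <em>emphasis inside</em> it</strong>.

<!-- Improperly nested: strong is closed while em is still open. Avoid this. -->
<p>This is <strong>strong text with <em>emphasis</strong> overlapping</em>.
```

Many browsers will try to cope with the second form, but how they display it is not something you can rely on.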
Tags can have attributes, such as <img src="pic.gif" align=top>, which has
two attributes, src and align, each of which is assigned a value.
It is always safe and sometimes required to quote attribute values
(particularly if they contain any unusual characters).
Sometimes, an attribute is simply present or not present, rather than
being assigned a value.
HTML documents can contain special characters, such as à. These
are denoted by an & followed by a code followed by
a semicolon; a code can be either
a name or # followed by a number.
For example, the string &agrave; will produce à.
A fairly complete list of special characters
is given at
http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_9.html#SEC9.7, but
note that not all browsers implement that list.
Most importantly, if you want to have
a less than sign appear in a document, you must use &lt;,
and if you want an ampersand to appear, use &amp;.
You are probably OK if a greater than sign appears naked in your document,
but you might want to use &gt; to be on the safe side.
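Here is a short sketch of these escapes in use. The sentences are invented for illustration; the numeric form &#224; assumes the ISO Latin-1 character numbering used by HTML 2.0.

```html
<!-- Displays as: Use <p> to start a paragraph. -->
<p>Use &lt;p&gt; to start a paragraph.

<!-- Displays as: AT&T -->
<p>AT&amp;T

<!-- A named code and a numeric code for the same character (a with a grave accent). -->
<p>Voil&agrave; and voil&#224;
```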
All white space is equivalent: a space, ten spaces, a new line or ten blank lines are all treated the same.
An HTML comment is written as <!-- Your comment goes here -->.
Note that two dashes are required to both
start and end a comment. Not all browsers recognize comments (this goes
for a lot of HTML), and people can see your comments (by viewing the source
of your document), so don't put any secrets in a comment.
The overall structure of an HTML document is:
<html>
<head>
<title>Your title goes here</title>
</head>
<body>
Your contents go here
</body>
</html>
You might be able to leave off the outer <html> ... </html>, and in more sophisticated applications, you might put more in the head. But for now, I recommend that you not worry about it.
Headers are used to indicate section headings. A <h1>level 1 header</h1>
is the top/biggest/most-important section header (e.g., a chapter heading),
a <h2>level 2 header</h2> is the next most significant, and
so on. You can use header levels 1 through 6, and a level X header
is delimited with <hX> and </hX>. You
probably want to avoid skipping levels (e.g., having a level 3 header immediately
inside a level 1 header, with no level 2 header in sight). This will probably
work, but future browsers may allow you to look at a document in outline
form and would get confused by such a structure.
The contents of a header can include text, images and line breaks (but not paragraphs, lists, horizontal rules or preformatted text).
Paragraphs are delimited by <p> and </p>, although the closing
</p> is optional and you will rarely see it or use it. In an earlier
version of HTML, <p> was used to separate paragraphs rather than to
start them. You will probably see that in some old documents, but you should
avoid it. The contents of a paragraph can include text, images and line breaks
(but not headers, lists, horizontal rules or preformatted text).
The exact formatting of a paragraph is up to the browser; it might put a blank line before each paragraph, or not. Empty paragraphs might be displayed as blank or ignored.
You can (in some browsers) center or right align headers or paragraphs,
for example by using <h2 align=center> ... </h2> or <p align=right>
(this is one of the reasons why <p> was changed to start a paragraph
rather than separate paragraphs).
Multiple <br>'s may or may not cause multiple line breaks.
Use <hr> to cause a horizontal rule on a line by itself.
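To tie the pieces so far together, here is a minimal sketch of a complete page using headers, paragraphs, alignment, a line break and a horizontal rule. The title and text are made up for the example, and the align attributes only take effect in browsers that support them.

```html
<html>
<head>
<title>Sample page</title>
</head>
<body>
<h1>Chapter title</h1>
<p>First paragraph of the chapter.
<h2 align=center>Section title</h2>
<p align=right>A right-aligned paragraph.
<p>The first line of some text<br>
and a second line after a forced break.
<hr>
<p>A paragraph below the horizontal rule.
</body>
</html>
```

Note that the level 2 header follows a level 1 header, so no levels are skipped.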
Ordered lists are denoted with <OL> ... </OL> and unordered
lists are denoted with <UL> ... </UL>. Both of these must contain
a series of list items, the start of each marked with <LI> (like paragraphs,
you can close a list item with </LI> but it is not needed and rarely
done).
Each list item can be pretty much whatever you want other than a header.
A definition list is denoted with <DL> ... </DL>. There are two types of items in a definition list: Terms (<DT>) and Definitions (<DD>). As with <LI>, you do not need to close <DT> or <DD>.
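The three kinds of lists can be sketched as follows. The item text is invented for illustration, and the closing tags for the items are left off, as described above.

```html
<ol>
<li>First step
<li>Second step
</ol>

<ul>
<li>An unordered item
<li>Another item, containing a nested list:
  <ul>
  <li>A nested item
  </ul>
</ul>

<dl>
<dt>HTML
<dd>HyperText Markup Language
<dt>URL
<dd>Uniform Resource Locator
</dl>
```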
An important note: < and & still have special meaning in preformatted text. You can't just convert a text document to html by putting <PRE> ... </pre> around it: You have to watch out for occurrences of < and & in your text.
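For instance, suppose the text you want to show is the C fragment if (a < b && b < c). A sketch of how it would need to be written inside a preformatted block (the fragment itself is just an example):

```html
<pre>
if (a &lt; b &amp;&amp; b &lt; c)
    printf("in order\n");
</pre>
```

White space and line breaks are kept as written inside the block, but every < and & has been replaced by its code.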
Blockquote
<blockquote> ... </blockquote> denotes a chunk of your document that is quoted from elsewhere; typically a browser indents that portion of your document (headers and all).
<em> ... </em>
<strong> ... </strong>
<code> ... </code>: Used for computer text; often displayed in teletype font
<i> ... </i>
<b> ... </b>
<tt> ... </tt>
<big> ... </big>
<small> ... </small>
Normally, the image is just treated as a (big) character. An alignment of top
tells the browser to align the top of the picture with the top of the line, and
an alignment of middle or bottom tells the browser to align the middle
or bottom of the image with the baseline of the text.
An alignment of left or right introduces some serious magic.
Rather than displaying the image within the current line of
text, it is displayed on the left or right side of the window, and text
wraps around it. Many browsers don't support this, and it is hard to predict
exactly how your image will look on a different browser or with a different
window width.
If a browser supports left/right alignment of images,
it may also support <br clear> to cause a line
break to a place that is clear of left and right justified images.
Alternatively, it may allow clear=left|right|all as
an attribute for pretty much anything (including headers and paragraphs) (this
is part of the proposed HTML 3.0 standard); this will cause the browser
to move the display of that element down until the left/right/both margins
are free of floating images.
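A rough sketch of a floating image followed by a clearing line break. The file name pic.gif is reused from the earlier example and the text is invented. I am assuming the clear=all form on <br>, which is an extension and may be ignored by browsers that don't implement it.

```html
<img src="pic.gif" align=left alt="A picture">
<p>This text wraps along the right side of the image in browsers
that support left alignment.
<br clear=all>
<p>This paragraph starts below the image, once both margins are clear.
```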
Early web browsers didn't display anything until they
had downloaded all images in the document.
More recent browsers try to display the web page as soon as possible.
However, until a browser starts receiving an image, it doesn't know how big
the image is or how much space to reserve for it on the page.
This can slow the display of the page and/or cause the page to be reformatted
as the images are downloaded.
In some browsers, specifying the height and width of an image
in the <img> tag eliminates this problem.
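For example, assuming pic.gif is 120 pixels wide and 80 pixels high (made-up dimensions; use the real size of your image), the tag could be written as:

```html
<img src="pic.gif" width=120 height=80 alt="A picture">
```

Browsers that don't understand width and height should simply ignore them.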
At the moment, there are two image formats you should be primarily concerned with:
A progressive version of jpeg is just starting to become available
(allowing rough approximations from just part of a file).
protocol://machine.name[:port]/dir1/dir2/file
The protocol describes how to get/access the document. Some typical protocols
are http (hypertext transfer protocol), ftp (file transfer protocol), gopher
(gopher protocol), file (a local file). The machine.name must be a standard
Internet domain name. Warning: wam might resolve appropriately within campus,
but not outside of it: use fully specified names.
The port is some TCP wizardry you don't really need to know about. If omitted, it uses the standard for whatever protocol you are using (80 for http). The only bit of information you might find useful: if the port is less than 1024 on a UNIX system, it must be set up by the system administrator of that machine. If 1024 or greater, anyone could be running it.
The directories specify a path from the root of the web file structure. You use UNIX style pathnames even if the server or client is on a Wintel or Apple system. One frequent exception: if the first directory is ~name, that resolves to the directory that has been set up for name.
If your path specifies a directory rather than a file, you will get a default file name (typically index.html but it might be something else). If such a file doesn't exist, you might get a directory listing or an error (depends on how the server is set up).
Normally, a reference to an HTML file is considered a pointer to the beginning
of the document. You can also point to an arbitrary named location within
an HTML document. To do so, simply append #location to the URL.
To name a location, use an anchor (<a>) with a name specified:
<a name=location> ... </a>
This associates the name specified
with the text inside the anchor. An anchor tag can specify a name,
an href, or both.
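A small sketch of a named location and two links that point to it. The name papers and the link text are chosen for illustration; the full URL is the sample page used in this tutorial.

```html
<!-- Name a location in the document. -->
<h2><a name="papers">Papers</a></h2>
<p>A list of papers would go here.

<!-- Link to that location from the same page. -->
<p>See the <a href="#papers">papers section</a>.

<!-- Link to that location from some other page. -->
<p>See the <a href="http://www.cs.umd.edu/users/pugh/index.html#papers">papers section</a>.
```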
You can leave off various prefixes of a URL and have the URL be treated as relative
to the location of the page containing the link.
As with UNIX file names, you can use .. as a directory name to
climb up to the parent directory.
For example, within the document
http://www.cs.umd.edu/users/pugh/index.html
the following shows the interpretation of some relative URLs.
Relative URL | Absolute URL |
---|---|
/Department/About.html | http://www.cs.umd.edu/Department/About.html |
intro-www-tutorial/ | http://www.cs.umd.edu/users/pugh/intro-www-tutorial/ |
pugh.gif | http://www.cs.umd.edu/users/pugh/pugh.gif |
../keleher | http://www.cs.umd.edu/users/keleher |
#papers | http://www.cs.umd.edu/users/pugh/index.html#papers |
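Written as HTML inside that same index.html, the relative URLs from the table would look something like this sketch (the link text and alt text are made up):

```html
<a href="/Department/About.html">About the department</a>
<a href="intro-www-tutorial/">The WWW tutorial</a>
<img src="pugh.gif" alt="Photo">
<a href="../keleher">Another user's pages</a>
<a href="#papers">Papers, further down this page</a>
```

One advantage of relative URLs is that the links between your own pages keep working if the whole directory is moved to another place or another server.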
Tables are very useful, but only some browsers implement them,
and they are not implemented consistently.
They don't all recognize the same tags/attributes, and
some allow only plain text in table cells, while others allow anything
(including lists and other tables).
There is a proposal for a table standard, and most of the
browsers that implement tables implement
the standard. Within this section, I'll mark
features that are not a part of the proposed standard.
Among other uses, you can use tables to generate multi-column documents (but this only works in browsers that allow arbitrary contents for a table cell). I've done this in the summary section below.
Overall structure of a table:
<table>
<tr>
row 1
<tr>
row 2 ...
</table>
You can also specify a caption:
<table>
<caption>
caption text 1 </caption>
<tr>
row 1
<tr>
row 2 ...
</table>
Overall structure of a row:
<td>
first entry 1
<td>
second entry 2 ...
You can close rows (</tr>) and cells (</td>), but
it isn't needed (unless you have tables inside of tables).
You can also use <th> for table cells.
Using <td> creates a data cell;
using <th> creates a header cell.
A data cell is typically displayed in normal font and left-justified.
A header cell is typically displayed in a bold font and centered.
If you close a header cell (optional), use </th>.
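Putting the pieces together, a minimal sketch of a table with a caption, one row of header cells and two rows of data cells. The contents are taken from the tools described later in this tutorial; the optional closing tags for rows and cells are left off.

```html
<table>
<caption>Some validation tools</caption>
<tr> <th>Tool <th>Purpose
<tr> <td>weblint <td>Checks HTML for errors and bad style
<tr> <td>MOMspider <td>Checks that links are valid
</table>
```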
<table border>
<table border=int>
<tr align=left|center|right>
<td align=left|center|right>
<tr valign=top|middle|bottom>
<td valign=top|middle|bottom>
<td rowspan=int>
<td colspan=int>
Here is an example table (taken from Teach Yourself More Web Publishing with HTML in a Week by Laura Lemay):
 | | Used Belt Deflection (Limit) | Used Belt Deflection (Adjust Deflection) | Set deflection of new belt |
---|---|---|---|---|
Alternator | Models without AC | 10mm | 5-7mm | 5-7mm |
Alternator | Models with AC | 12mm | 6-8mm | 5-7mm |
Power Steering Oil Pump | | 12.5mm | 7.9mm | 6-8mm |
Here is the HTML to generate it:
<TABLE BORDER>
<CAPTION>Drive Belt Deflection</CAPTION>
<TR>
  <TH ROWSPAN=2 COLSPAN=2></TH>
  <TH COLSPAN=2>Used Belt Deflection</TH>
  <TH ROWSPAN=2>Set<BR>deflection<BR>of new belt</TH>
</TR>
<TR>
  <TH>Limit</TH>
  <TH>Adjust<BR>Deflection</TH>
</TR>
<TR ALIGN=CENTER>
  <TH ROWSPAN=2 ALIGN=LEFT>Alternator</TH>
  <TD ALIGN=LEFT>Models without AC</TD>
  <TD>10mm</TD>
  <TD>5-7mm</TD>
  <TD ROWSPAN=2>5-7mm</TD>
</TR>
<TR ALIGN=CENTER>
  <TD ALIGN=LEFT>Models with AC</TD>
  <TD>12mm</TD>
  <TD>6-8mm</TD>
</TR>
<TR ALIGN=CENTER>
  <TH COLSPAN=2 ALIGN=LEFT>Power Steering Oil Pump</TH>
  <TD>12.5mm</TD>
  <TD>7.9mm</TD>
  <TD>6-8mm</TD>
</TR>
</TABLE>
This section describes several kinds of programs that are designed
to help validate your web pages. Two of these, weblint and htmlcheck,
you only need to run when you create your web page:
they check that your HTML is correct.
Simply looking at your page with a WWW browser is not sufficient. Many browsers
attempt to cope with HTML errors, but different browsers are able to cope with
different errors. Some errors that Netscape 1.1 used to cope with
aren't tolerated by Netscape 2.0.
Another checker, MOMspider, checks your links to see if they are
valid. This is useful not only when creating a web page,
but also as a weekly check to see if any of the
(off-site) web pages you point to have changed.
A listing of HTML validation tools, including some other nice
things like WWW cross-reference generators,
is provided at:
http://www.khoros.unm.edu/staff/neilb/weblint/validation.html
http://www.khoros.unm.edu/staff/neilb/weblint.html
weblint should be installed real-soon-now
on the departmental Unix machines.
There are a number of options (type man weblint or view
http://www.khoros.unm.edu/staff/neilb/weblint/manpage.html),
but you can run it with just:
weblint index.html
You can also just submit a URL or HTML to a form at:
http://www.unipress.com/weblint/
Weblint looks for certain bad things in your html and gives you
fairly useful error messages when it finds them. Some of the things
that weblint complains about aren't illegal, simply
things it thinks are bad style (like having a <h3> header
immediately inside a <h1> header).
http://www.webtechs.com/html-val-svc/
htmlcheck is installed on the departmental Unix machines.
There are a number of options (type man htmlcheck or
see the WWW documentation),
but you can run it with just:
htmlcheck index.html
You can also just submit a URL or HTML to a form at:
http://www.webtechs.com/html-val-svc/
htmlcheck tries to parse your document using the official specification for HTML (you can tell it which specification to use). When it finds an error, the error message may not be very useful and it may get confused so that any later error messages are worthless.
In my use, I always use weblint first and correct
any errors it finds.
Only then do I use htmlcheck; once I've gotten rid of the big
errors that weblint finds, the error messages from
htmlcheck are often more useful. I've found
that htmlcheck will often find problems that weblint
will miss, so I use both tools.
http://www.ics.uci.edu/WebSoft/MOMspider/
Momspider is run once a week on the departmental machine and checks links in html files to make sure that they point to valid pages (it checks them even if they point to another machine).
For web pages on the CS machine, you can fill out an on-line form to have your web page checked once a week and have a report emailed to you if there are any problems.
Sometimes, you will get back an error report, but when you check it out yourself you don't find any problem. If the machine hosting the other page was down when Momspider ran, then you'll get a report even if the machine came back up 5 minutes later.
http://www.w3.org/hypertext/WWW/Tools/Filters.html
Some of the most useful ones for converting to HTML are:
text2html.sed
converts text into preformatted HTML text. This just replaces < with &lt; and
& with &amp; and slaps <pre> ... </pre> around the whole thing.
Installed as txt2html-pre on departmental Unix machines.
txt2html
converts a plain text document to a formatted HTML document. If it finds a line
of --------, it replaces it with a <hr>, and so
on. If it finds a URL, it makes it active.
Installed as txt2html on departmental Unix machines.
latex2html
converts a latex document to html. Anything
it doesn't know how to convert (math, a figure, a table) it converts to
an embedded picture.
Really useful (if you have documents in latex).
You should read the man pages or web documentation; it has a lot of
capabilities you won't get by default.
It tends to break a latex document up so that each section becomes
a separate file; if you don't like this, you can change this
behavior.
In general, the ones I've played with are not ready for prime time (as of January 1996). The problems are:
All of the ones I've seen work OK as a first pass, but I'd need to spend some time cleaning up the resulting HTML code before I was happy with it. Some of them are not happy editing HTML documents created by anything other than themselves.
HTML editors are improving. Within 6 months, I expect HTML editors to be an important tool for creating HTML documents.
Courtesy of Jeff Hollingsworth
http://www.microsoft.com/msoffice/freestuf/MSWord/download/ia