Kaye and Geoff's web page documentation
This sort of interaction requires a computer program that can read and write files and interact in a general way with the operating system. The program and its files need to be stored on a computer which is always on and always connected to the internet. In addition, for security reasons the program needs to be accessible to web pages only via a controlled interface which allows just the desired information to be passed in either direction. So the obvious place for such programs is the same computer on which your web server is running (or another computer to which it is networked).
Programs designed to work in this way use the Common Gateway Interface (CGI), and the programs themselves are commonly called CGIs. A CGI is a script or program which runs under the direction of the web server, and typically adds dynamic behaviour to web pages by accessing databases, doing calculations from inputs, selecting files and so on. Users normally provide input data to a CGI using a form written in HTML. The browser and web server are responsible for passing these data to the CGI, which processes them and then passes information back via the server to the browser, telling it what page to display next (often the information is the actual HTML for a web page containing the results of the processing). Note that all CGIs (except those interfacing with the web page via AJAX) must return something - the browser sends the data to the web server which invokes the CGI, and the server expects something in return which it can pass back to the browser.
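To make the "must return something" rule concrete, here is a minimal sketch of a CGI which takes no input at all. The sub name hello_response and the wording of the page are our own inventions for illustration; the essential part is the Content-type line followed by a blank line, which lets the server tell the header from the page itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the complete response: a header, a blank line, then the page.
sub hello_response {
    my $body = "<html><head><title>Hello</title></head>"
             . "<body><p>Hello from a CGI.</p></body></html>\n";
    # The blank line after the header tells the server where the page starts.
    return "Content-type: text/html\n\n" . $body;
}

print hello_response();
```

Whatever the script prints on standard output is what the server passes back to the browser.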
This document explains the way in which a web page passes data to a CGI and what the CGI might do with that data. There are actual examples of code, but they are generally not suitable (without some modification) to use in real web pages since they have been simplified by leaving out error checking and any sophisticated behaviours. In particular most of the programs do not include any protection against malicious use. Security should always be considered when writing CGIs. Any input field can be given any sort of value by someone trying to compromise your system, so input fields should be limited to those really required, and their values should be validated as tightly as possible. Particular care needs to be taken with values which are to be used in an executable environment, including calls to the operating system, file processing, printing, and so on.
CGIs have to be located in a defined area on the server (traditionally the cgi-bin directory on Unix systems); they cannot just be in your normal HTML page area. If you are an author on someone else's server (for example an ISP) and you want to write CGIs, you should make sure that they allow them. It also helps to ask about access to telnet capabilities, access to the web server error log, and whether there are any restrictions (for example limiting access to operating system features) which the ISP applies. It is normally important to know what operating system the web server runs on, since this will limit the choice of languages which are available for you to use for your CGI, and also define the system interactions which your CGI can take advantage of. It should be clear that to create CGIs you not only need programming skills but also a reasonable understanding of the operating system they will run on. Here we are assuming that this is the case and are only attempting to give you the extra information you need to get started with writing CGIs.
All our examples presume a Unix server and are written in Perl. Just about all versions of Unix, and the Macintosh (as part of the Unix underlying OS X), come with Perl and it is available for PCs as well. If you have a Mac then it can be very useful to use the local web server to test your CGIs since you do not need to endlessly copy files to a remote server and reset their permissions. Be aware however that there can be some differences between the Mac's Unix and versions commonly found on commercial web servers.
Interaction between a form and the CGI
What is sent from a form to the server?
The browser sends the form's contents to the server as a single string, with each field separated by an ampersand (&). Each field is of the form name=data. The name is the value of the name attribute in the HTML which defines the form. This can be made clear in the following example, where we have created a form as follows:
It is up to the CGI to read and unpack the information in the string, and handle the information as required.
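As a rough sketch of the unpacking step, the following Perl splits such a string at each ampersand and then at the first equals sign. The sub name split_fields and the sample field names are hypothetical, and the sketch deliberately leaves any escaped characters exactly as they arrive.

```perl
use strict;
use warnings;

# Split a raw form string into a hash of name => data.
# Escaped characters are left exactly as they arrive.
sub split_fields {
    my ($raw) = @_;
    my %field;
    foreach my $pair (split /&/, $raw) {
        # Limit of 2: any further '=' signs stay part of the data.
        my ($name, $data) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $field{$name} = defined $data ? $data : '';
    }
    return %field;
}

my %field = split_fields('surname=Smith&age=42&comment=');
print "$_ => '$field{$_}'\n" for sort keys %field;
```

Note that if a name appears more than once in the string (as can happen with checkboxes), this sketch keeps only the last value.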
In fact, the string may not be exactly as shown, since most of the "special characters" (such as &$() or space) are escaped - they are translated by the browser and so appear as different characters, or as hexadecimal numbers preceded by a percent sign. Have a look at the translation table, to see how special characters are translated. The easiest way to illustrate this is by example: complete the fields in the form below and it will (subject to a few security restrictions) send back a copy of the data received on the server. You can try the form a number of times, putting in "special" characters (for example -+&%()) to see what happens to them (our tests suggest that the only non-alphanumeric characters less than ASCII code 127 which are not escaped are asterisk, hyphen, period and underscore; the rules say that all other characters should be escaped).
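The translation in both directions can be sketched in a few lines of Perl. The sub names unescape and escape are our own, the set of characters left alone follows the test results described above, and the sketch assumes plain ASCII input; real browsers and servers deal with other character sets as well.

```perl
use strict;
use warnings;

# Undo the browser's translation of one name or one data value.
sub unescape {
    my ($text) = @_;
    $text =~ tr/+/ /;                              # '+' stands for a space
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX is a character code in hex
    return $text;
}

# Mimic what the browser does before sending (ASCII input assumed).
sub escape {
    my ($text) = @_;
    # Letters, digits, asterisk, hyphen, period and underscore pass through.
    $text =~ s/([^A-Za-z0-9*\-._ ])/sprintf('%%%02X', ord($1))/ge;
    $text =~ tr/ /+/;                              # spaces travel as '+'
    return $text;
}

my $sent = escape('Fish & chips (large)');
print "$sent\n";
print unescape($sent), "\n";
```

The plus signs must be translated before the percent codes are decoded, otherwise a genuine plus sign (which arrives as %2B) would be turned into a space.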
The preceding example has the method attribute in the form tag set to "post". With "post", the data reaches the CGI on its standard input. If the method attribute is "get" then the information is passed to the CGI not via standard input, but as an environment variable called "QUERY_STRING". For example with this form:
You can view the Perl script which is invoked by this form.
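A CGI which has to cope with either method might fetch the raw string along the following lines. This is only a sketch: the sub name raw_form_data is hypothetical, while REQUEST_METHOD, CONTENT_LENGTH and QUERY_STRING are the standard environment variables set by the server.

```perl
use strict;
use warnings;

# Return the raw form string, wherever the server has put it.
sub raw_form_data {
    my ($in) = @_;    # filehandle to read "post" data from, normally \*STDIN
    my $method = uc($ENV{REQUEST_METHOD} || 'GET');
    if ($method eq 'POST') {
        # The server says how many bytes are waiting on standard input.
        my $length = $ENV{CONTENT_LENGTH};
        return '' unless defined $length && $length =~ /^\d+$/;
        my $raw = '';
        read($in, $raw, $length) if $length > 0;
        return $raw;
    }
    return defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : '';
}

my $raw = raw_form_data(\*STDIN);
print "Content-type: text/plain\n\n";
print "Received: $raw\n";
```

The reply is sent as plain text here so that whatever the visitor typed is displayed rather than interpreted as HTML.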
The environment variable which carries the passed information is called "QUERY_STRING" because there is an alternative way of doing the same thing - by appending a query string to the URL in the action attribute of the form tag, for example:
Note that the query string cannot be arbitrarily long; web servers typically apply a maximum of 1024 characters to this string. Even if a query string limit is not set, there will be a limit on the length of the entire URL. The amount of information allowed to be passed using a "post" request is not unlimited, but is usually much greater (128 KB to 2 GB) than that allowed with "get".
There is yet another way of passing information into an environment variable. If the CGI is invoked with an action such as:
It can be a bit "kludgy" transferring large amounts of information via an environment variable, so generally using method="post" is the preferred approach for passing information to a CGI. However the example below illustrates one simple application where the query string can be very useful - when a CGI is invoked without using a form. Here we want a CGI to be invoked from a pair of menu items, where each item produces a variant of the CGI's output possibilities.
You can view the Perl script and see how it works or try it out:
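The general idea can be sketched as follows. This is not the script linked above: the codes short and full, the sub name page_for and the page contents are all invented for illustration. Each menu item would be an ordinary link ending in ?short or ?full.

```perl
use strict;
use warnings;

# Codes which may arrive in the query string, and the output each selects.
# Both codes and wording are hypothetical.
my %variant = (
    short => '<p>Summary view.</p>',
    full  => '<p>Full view, with every detail.</p>',
);

sub page_for {
    my ($code) = @_;
    # Anything we do not recognise falls back to the default variant.
    $code = 'short' unless defined $code && exists $variant{$code};
    return "Content-type: text/html\n\n"
         . "<html><body>$variant{$code}</body></html>\n";
}

print page_for($ENV{QUERY_STRING});
```

Because the query string is only ever used as a key into a fixed table, nothing the visitor types into the URL can reach the page or the operating system.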
Environment variables
When a browser 'converses' with a server, it must identify itself, and it may send parameters in the calling string. As a writer of CGIs you have access to the information which the server knows about the browser (for example the browser type and the IP address of its server or proxy). This information is passed in (Unix) environment variables which can be accessed by your CGI. Again, this is easy to illustrate by using a CGI to return the environment variables. You can use the form below (whose only active element is a submit button) to look at the environment variables:
The HTML for the form is straightforward:
and you can view the Perl script which is used to pass the environment variable values back to the browser.
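A script of this kind needs very little code, since Perl makes the environment available in the %ENV hash. The following is a sketch rather than the script linked above; the sub names html_escape and environment_page are our own.

```perl
use strict;
use warnings;

# Make a value safe to show inside an HTML page.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Build a page with one table row per environment variable.
sub environment_page {
    my (%env) = @_;
    my $rows = '';
    foreach my $name (sort keys %env) {
        $rows .= '<tr><td>' . html_escape($name) . '</td><td>'
               . html_escape($env{$name}) . "</td></tr>\n";
    }
    return "Content-type: text/html\n\n"
         . "<html><body><table>\n$rows</table></body></html>\n";
}

print environment_page(%ENV);
```

Some of the values (the browser type, for example) are supplied by the visitor's browser, which is why they are escaped before being placed in the page.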
The following examples illustrate some of the power of CGIs. The first is a script to email the contents of a form.
Of course it is possible to just invoke the mail system using a "mailto:" value for the action attribute within a form, for example:
Browsers vary in how a "mailto" is handled. Some include the message as text in the mail, and some even manage to format it to some extent, but others just send it "raw" with all the escaped characters (as explained above) included. Others respond to the "mailto" by starting up a local copy of a mail program without sending anything immediately. If you want to know how your browser will act under these circumstances, the easiest way to find out is to try it and see, but remember that others using your web pages may have a different browser.
If you want the behaviour of your web page to be predictable under these circumstances, you can pass the contents of the form to a CGI which then processes the information and emails the result to the desired recipient. This allows full control of the process - for example the input fields can be checked to ensure that all required fields are filled in and the information can be reformatted to make it easy to read. More sophisticated processing can also be carried out, such as redirecting the email depending on the contents of the message, sending it to more than one recipient, saving the contents in a database, and so on.
The CGI will be more useful if it exhibits some "general" behaviour so that it can be used with many different forms. The names of the form elements can be used to indicate required fields and the CGI can reject input which does not have these fields filled in. In the following form which uses the cgi_femail CGI the email address is required (indicated by the "req_" prefix to its name). We have also 'hidden' the email address to discourage harvesters (well, unsophisticated ones, anyway) from adding us to their spam lists:
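The "req_" convention can be checked with a few lines of Perl once the form data has been unpacked into a hash. This sketch is not the cgi_femail code; the sub names missing_required and message_body and the sample fields are hypothetical, and the step which actually hands the message to the mail system is left out.

```perl
use strict;
use warnings;

# Return the names of required fields (those starting "req_") left empty.
sub missing_required {
    my (%field) = @_;
    my @missing = grep { /^req_/ && $field{$_} !~ /\S/ } sort keys %field;
    return @missing;
}

# Lay the fields out one per line for the body of the email.
sub message_body {
    my (%field) = @_;
    my $body = '';
    foreach my $name (sort keys %field) {
        (my $label = $name) =~ s/^req_//;   # hide the prefix from the reader
        $body .= "$label: $field{$name}\n";
    }
    return $body;
}

my %field = (req_email => ' ', req_name => 'Kaye', comment => 'Nice site');
my @missing = missing_required(%field);
print @missing ? "Please fill in: @missing\n" : message_body(%field);
```

A check like this only sees the fields which were actually submitted, so a stricter version would also keep its own list of the names it expects.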
The web is an excellent way to get feedback on the services and information you offer. One of the features commonly found on web sites is a guestbook, where visitors can register their comments. This is the sort of application where you probably do not want the information emailed to you immediately - it is enough to check out the guestbook from time to time, to see what comments have been made. The example outlined here is rather simple; a more complex (and realistic) version might, for example, check the content for offensive words or attempts to breach security, present a more appealing layout and allow you to archive out-of-date entries. You can try out the example but please do not try to use it to enter active links to your own site - that is not what it is for (and it will not work).
We assume that the guestbook will be made available via a webpage to anyone who wants to look at it, and anyone will be able to contribute comments, so two CGIs are required - one to add an entry, and one to display the existing entries. It is easiest to invoke each one from its own form, but both forms can be on the same web page, for example:
Here is the form as it looks on the web page. Note the extra feature: the option of specifying the number of entries to display. You can try it out (but note that entries containing HTML will be ignored).
As long as we are happy with a straightforward page layout, the CGI to read the guestbook is very simple, since it can take advantage of Perl's access to Unix system calls. The writing CGI is a bit more complex, but we can keep it reasonably simple by holding the HTML for the guestbook in three files: an unchanging header, a central section containing entries which we add to by appending new entries on the end, and an unchanging footer. In a 'serious' system you might also have one or more private CGIs to delete or archive entries.
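The three-file arrangement might be handled as in the sketch below. It is written with ordinary Perl file operations rather than the system calls mentioned above, and the sub names add_entry and guestbook_page are our own. The sketch assumes one entry per line in the entries file and, in keeping with the rule given earlier, ignores entries containing HTML.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Append one entry to the entries file. Returns 1 if the entry was added,
# 0 if it was ignored because it contained HTML.
sub add_entry {
    my ($entries_file, $name, $comment) = @_;
    return 0 if "$name$comment" =~ /[<>]/;
    s/\s+/ /g for $name, $comment;          # keep each entry on one line
    open(my $out, '>>', $entries_file) or die "Cannot open $entries_file: $!";
    # Lock the file so two visitors cannot interleave their entries.
    flock($out, LOCK_EX) or die "Cannot lock $entries_file: $!";
    print {$out} "<p><b>$name</b>: $comment</p>\n";
    close($out) or die "Cannot close $entries_file: $!";
    return 1;
}

# Build the page: header, the last $count entries (all if 0), footer.
sub guestbook_page {
    my ($header_file, $entries_file, $footer_file, $count) = @_;
    my @parts;
    foreach my $file ($header_file, $entries_file, $footer_file) {
        open(my $in, '<', $file) or die "Cannot open $file: $!";
        my @lines = <$in>;
        close($in);
        if ($file eq $entries_file && $count && @lines > $count) {
            @lines = @lines[-$count .. -1];
        }
        push @parts, @lines;
    }
    return join '', @parts;
}
```

The reading CGI would print a Content-type header followed by the result of guestbook_page, with $count taken from the form after checking that it is a positive whole number.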
Many web sites (including ours) allow users to search the site for a keyword or phrase, or a more complex arrangement of words. This facility is a very powerful method of providing information about the site and allowing rapid navigation to the areas of interest. There are a number of ways to implement site searching, but here we will look at the most flexible - writing your own CGI to do the task.
To illustrate the basic requirements, we start with a simple example - to search a single page and display lines containing a given word. The form needs to provide the name of the page to be searched, an input box to accept the word, and a button to submit the form, for example:
Search for a word
The CGI which the form invokes is written in Perl. You can try it here:
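The heart of such a CGI can be sketched as below. The page codes and file paths in %page_file are hypothetical, as is the sub name matching_lines. The form sends a code rather than a file name, so the CGI never opens a file chosen by the visitor, and the word is quoted so that characters such as a period are matched literally.

```perl
use strict;
use warnings;

# Pages which the form is allowed to name, and where each one lives.
# Both the codes and the paths are hypothetical.
my %page_file = (
    intro => '/var/www/html/intro.html',
    cgi   => '/var/www/html/cgi.html',
);

# Return the lines of a file which contain the word, ignoring case.
sub matching_lines {
    my ($file, $word) = @_;
    my @hits;
    return @hits unless defined $word && length $word;
    open(my $in, '<', $file) or return @hits;
    @hits = grep { /\Q$word\E/i } <$in>;    # \Q...\E: match the word literally
    close($in);
    return @hits;
}

my ($page, $word) = ('intro', 'gateway');   # these would come from the form
my @hits = exists $page_file{$page}
         ? matching_lines($page_file{$page}, $word)
         : ();
print scalar(@hits), " matching line(s)\n";
```

This matches the word anywhere in a line, including inside longer words and inside HTML tags, and the matching lines would need to be escaped before being shown on a results page.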
This search is not in fact very useful - it only (rather poorly) duplicates a feature found in most browsers. So our second example is more complex - in fact it is very close to the script we use to implement searching on our site. It is more useful in that it returns a page of active links to pages containing the search word. The word can be a phrase (it can contain spaces) and the search can be limited to a subset of pages - this makes sense with a site like ours which is made up of a number of more-or-less unrelated sub-sites.
There is no need to provide a working example of this search - it is at the top of each of our major pages. The HTML for the form looks like this:
Because our search feature only has a small space at the top of the page, it does not allow sophisticated search rules - whatever is entered is what is searched for (multiple words separated by spaces are treated as a phrase rather than separate key words). Also, it is not suitable as a general searching script - it expects our particular site structure and names, although it could be modified to work with a different structure. You can try it out by going to any of the major pages, for example the home page.
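For readers who want to experiment, a phrase search over a directory of pages can be sketched with Perl's standard File::Find module. This is not our site's script: the sub names pages_containing and results_html are hypothetical, and the sketch assumes that every page is a file ending in .htm or .html below a single directory, which is how a search could be limited to one sub-site.

```perl
use strict;
use warnings;
use File::Find;

# Return the pages under $root (paths relative to $root) containing $phrase.
sub pages_containing {
    my ($root, $phrase) = @_;
    my @found;
    return @found unless defined $phrase && $phrase =~ /\S/;
    find(sub {
        return unless -f $_ && /\.html?$/;
        open(my $in, '<', $_) or return;
        my $text = do { local $/; <$in> };   # read the whole page at once
        close($in);
        # \Q...\E treats whatever was entered as a literal phrase.
        return unless defined $text && $text =~ /\Q$phrase\E/i;
        (my $relative = $File::Find::name) =~ s{^\Q$root\E/?}{};
        push @found, $relative;
    }, $root);
    @found = sort @found;
    return @found;
}

# Turn the list of pages into a list of active links.
sub results_html {
    my ($base_url, @pages) = @_;
    return "<p>No pages found.</p>\n" unless @pages;
    return "<ul>\n"
         . join('', map { qq{<li><a href="$base_url/$_">$_</a></li>\n} } @pages)
         . "</ul>\n";
}
```

Reading every page on every search is acceptable for a small site; a larger one would probably build an index in advance.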
Problems: creating, testing and debugging your code
The complexity of CGIs and the interactions between them and web pages mean that getting it all working is much more challenging than writing web pages in HTML.
If you want to create CGI scripts, you may need to talk to the webmaster who looks after your server first. Some ISPs do not allow user-written CGIs, and even if they do, they may want to closely examine anything you do before it is allowed on their server. This is because CGIs run under the control of the web server, which usually has more privileges than normal users are allowed, so there will always be system security concerns with CGIs.
For example, one precaution typically applied is illustrated in our resub package which removes any backquote characters - under some conditions these can be used to invoke Unix commands from the information passed to the CGI. A related rule is never to use passed parameters within backquotes in Perl programs. The vertical bar and angle brackets are other characters in Unix which can be used to induce undesirable behaviour so if possible these characters should be screened out of your input. Where possible, pass codes to select parameters used inside the CGI rather than the parameters themselves; this simplifies and tightens up range-checking of the input.
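Both precautions can be sketched in a few lines. This is not the resub package itself: the sub names screen_input and file_for_code, the codes and the file paths are all hypothetical.

```perl
use strict;
use warnings;

# Remove backquotes, vertical bars and angle brackets from a value.
sub screen_input {
    my ($value) = @_;
    $value =~ tr/`|<>//d;
    return $value;
}

# Codes the form may send, and the file each one stands for (hypothetical).
my %file_for = (
    a => '/var/www/data/summary-a.txt',
    e => '/var/www/data/summary-e.txt',
);

# Return the file for a code, or nothing if the code is not one we issued.
sub file_for_code {
    my ($code) = @_;
    return unless defined $code && exists $file_for{$code};
    return $file_for{$code};
}

print screen_input('hello `whoami` | there <b>'), "\n";
my $chosen = file_for_code('e');
print defined $chosen ? "$chosen\n" : "unknown code\n";
```

With the second approach the range check is simply "is this one of our codes?", and a file name typed by the visitor is never used at all.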
You might also like to investigate Perl's 'taint' mode. There are many more security precautions which you need to consider when creating CGIs; for example have a look at Randal L. Schwartz's Unix Review Column 48.
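In taint mode (enabled with Perl's -T switch) values which come from outside the program cannot be used in system calls until they have been extracted by a regular expression match. A typical extraction looks like the sketch below; the sub name untaint_word and the choice of allowed characters and length are our own assumptions, to be tightened to suit each field.

```perl
use strict;
use warnings;

# Accept a value only if it consists of 1 to 32 letters, digits,
# underscores or hyphens. The captured copy is what taint mode trusts.
sub untaint_word {
    my ($value) = @_;
    return unless defined $value;
    return $value =~ /\A([A-Za-z0-9_-]{1,32})\z/ ? $1 : undef;
}

foreach my $input ('guestbook', 'page-2', 'x; rm -rf /') {
    my $clean = untaint_word($input);
    print defined $clean ? "accepted: $clean\n" : "rejected\n";
}
```

The important habit is to describe what is allowed rather than to list what is forbidden; a pattern such as (.*) would satisfy taint mode while checking nothing.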
Detailed instruction on programming in Perl or other languages is well beyond the scope of these pages, but here are some general hints on writing and debugging your CGI scripts: