4.1 Description file
Before we take a closer look at the parser itself, we will describe
the format of the description file, also known as the Home Page document
(home.html by default, but the name can be changed). On a Unix/Linux
system this file is stored by default in $HOME/.plucker.
On OS/2 the environment variable HOME is used to find the
location of your home directory (drive letters may also be used).
The installer should set the necessary environment variable for
you and also add the necessary directories to your system. You
can check the location by typing set home at a
command prompt.
The description file is an HTML document in which each link reference
may carry extra optional tags (written as attributes of the link, as in
the example further down):
- MAXDEPTH=n: This specifies how
deep the parser should follow the links embedded in a web page.
If MAXDEPTH is not given, the parser defaults to a depth
of 1, that is, it downloads only the page itself and does not follow any
links in it. To follow the links in the current page you would use
MAXDEPTH=2, to also follow the links in those pages you would
use MAXDEPTH=3, and so on. A high value used without any
of the available filtering mechanisms can result in an excessive
amount of data.
Hint: MAXDEPTH=2 can be very useful if you have a page that
contains only headlines that link to the full-text versions
of the articles. Many news tickers use this format.
- NOIMAGES: Use this tag if you are not interested
in downloading images.
When it is specified, every image is replaced with
its ALT text if available, otherwise with [img].
Hint: NOIMAGES is an effective way to decrease the size of
databases.
- STAYONHOST: Most web sites contain
references both to locally stored articles and to articles stored on
other hosts, so using a MAXDEPTH of 2 or higher could result in a
lot of unwanted data. To prevent this you may specify the STAYONHOST
tag for your link. The parser will then only download content that resides
on the same server as the top page. Together with
exclusionlist.txt this is a handy way to prevent the download
of pages that banners link to.
- STAYBELOW=text: Similar to
STAYONHOST, this tag tells the parser to only fetch pages whose URL
starts with text. For example, it can be used when the
articles linked from a page are stored on another server, in which case
STAYONHOST would not work properly. It can also be used to grab certain
articles out of a large listing, so that you get all headlines but
only the articles on specific subjects (provided the web server
offering the information is set up correctly).
NOTE: If =text is not given, it defaults to the content
of the href attribute (the URL you are pointing to).
- BPP=n: This option is used to specify
the bit depth that should be used for images. Valid values are 0 (i.e.
no images), 1, 2, 4, and 8.
NOTE: BPP=8 is currently only supported when the parser is used
on a Windows system.
- MAXWIDTH=width: Used to set the maximum
width of images.
- MAXHEIGHT=height: Used to set the maximum
height of images.
A simple example of a description file is:
<HTML>
<HEAD>
<TITLE>Plucker Home Page</TITLE>
</HEAD>
<BODY>
<A HREF="http://plucker.gnu-designs.com" MAXDEPTH=2 STAYONHOST NOIMAGES>Plucker home page</A>
</BODY>
</HTML>
This would download the front page of our web site and also follow any
links on the page if they are local to the host. No images would be
downloaded.
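As a further sketch, a description file can list several links, each with its own tags. The fragment below is illustrative only: the example.com URLs are hypothetical, the BPP, MAXWIDTH and MAXHEIGHT values are arbitrary choices rather than recommended settings, and quoting the STAYBELOW value in the usual HTML attribute style is an assumption.

```html
<HTML>
<HEAD>
<TITLE>Plucker Home Page</TITLE>
</HEAD>
<BODY>
<!-- Hypothetical URL: fetch the headline page, then follow only links whose URL starts with the STAYBELOW text -->
<A HREF="http://www.example.com/news/index.html" MAXDEPTH=2 STAYBELOW="http://www.example.com/news/">News headlines</A>
<!-- Hypothetical URL: no MAXDEPTH, so the default depth of 1 applies; images use a bit depth of 2 and a limited size -->
<A HREF="http://www.example.com/gallery.html" BPP=2 MAXWIDTH=100 MAXHEIGHT=100>Gallery</A>
</BODY>
</HTML>
```

The first link combines MAXDEPTH=2 with STAYBELOW so that only pages under the given URL prefix are followed. The second link omits MAXDEPTH, so only that single page is fetched, with its images constrained by BPP, MAXWIDTH and MAXHEIGHT.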
The description file (home.html) that is installed when your
Plucker directory is set up also contains a few examples.
The Plucker Team