
HTTPget

What is HTTPget?
How to use HTTPget
Command-line switches
Examples of using HTTPget
Disclaimer

What is HTTPget?

HTTPget is a program for retrieving large numbers of easily locatable files from the WWW. It uses threads to download several files simultaneously. The program is written in Java and has no graphical user interface, so it runs on a large number of platforms. You can download HTTPget here.

How to use HTTPget

java -jar HTTPget.jar <main-pattern> [-tos]
Main-pattern describes which files are to be downloaded. If no switches are given, the default behaviour is to download the file at every URL the pattern expands to.
A pattern is a text string, usually a URL, that contains brackets [ ] at the places where the content varies. There are two types of brackets:
[enum1, enum2, ... enumN] cycles through the text items supplied.

[a..b] will create a list of integers running from a to b.
Examples:

Weekday:[Monday, Wednesday, Friday]
Will create 3 texts:
Weekday:Monday
Weekday:Wednesday
Weekday:Friday
Writing
Numbers:[1..5]
will create 5 texts:
Numbers:1
Numbers:2
Numbers:3
Numbers:4
Numbers:5
It's possible to combine several brackets in one line.
[1..12][Monday, Wednesday, Friday]
This will create 36 lines, with all possible combinations of the numbers from 1 through 12 and the three weekdays.
A number can have leading zeros.
[001..245]
will result in a list in which all numbers have 3 digits. Numbers from 1 to 99 get leading zeros.
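The expansion rules above can be sketched in a few lines of Java. This is only an illustration of the rules as described here, not HTTPget's actual source: the class and method names are hypothetical, the order of the combinations is an assumption, and the sketch assumes that the padding width is taken from the lower bound as written and that ranges run upwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HTTPget itself.
class PatternExpander {

    /** Expands every [ ] bracket in the pattern into all combinations. */
    static List<String> expand(String pattern) {
        int open = pattern.indexOf('[');
        if (open < 0) {
            return List.of(pattern); // no brackets left, nothing to expand
        }
        int close = pattern.indexOf(']', open);
        if (close < 0) {
            throw new IllegalArgumentException("Unclosed bracket in: " + pattern);
        }
        String prefix = pattern.substring(0, open);
        String body = pattern.substring(open + 1, close);
        // Expand the remainder first, then combine it with each choice.
        List<String> tails = expand(pattern.substring(close + 1));

        List<String> result = new ArrayList<>();
        for (String choice : expandBracket(body)) {
            for (String tail : tails) {
                result.add(prefix + choice + tail);
            }
        }
        return result;
    }

    /** Expands the inside of one bracket: either "a..b" or "x, y, z". */
    static List<String> expandBracket(String body) {
        List<String> values = new ArrayList<>();
        int dots = body.indexOf("..");
        if (dots >= 0) {
            String from = body.substring(0, dots).trim();
            String to = body.substring(dots + 2).trim();
            int start = Integer.parseInt(from);
            int end = Integer.parseInt(to);
            // Assumption: the width of the lower bound decides the zero padding.
            int width = from.length();
            // Assumption: only ascending ranges are handled in this sketch.
            for (int i = start; i <= end; i++) {
                values.add(String.format("%0" + width + "d", i));
            }
        } else {
            for (String item : body.split(",")) {
                values.add(item.trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Prints the three Weekday texts from the example above.
        for (String text : expand("Weekday:[Monday, Wednesday, Friday]")) {
            System.out.println(text);
        }
    }
}
```

Each bracket multiplies the number of results, which is why 12 numbers combined with 3 weekdays give 36 lines.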
 

Switches

Several switches can be put on the command-line of the program. The default behaviour of the program is to download the files specified by the main-pattern, and save them on disk with the file-names they had on the website.
 
-t <N>
Sets how many files to download simultaneously. The default is one. If you have a high-speed connection and the target site is also on broadband, setting N high is advisable.
-o <pattern> 
Changes the name under which each file is saved on disk. Sometimes it may be convenient to rename files. The output-pattern must have at least as many possible combinations as the main pattern.
-s <pattern>
This switch tells the program to search the pages at the URLs given by the main-pattern for files that match this pattern.
It's possible to use wildcards in this pattern.
* matches any number of characters.
_ matches exactly one character.
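One way to picture the two wildcards is as a translation into a regular expression. The following is a minimal sketch under that assumption. HTTPget may well implement matching differently, and the class and method names are made up for the illustration. The sketch also assumes that * may match zero characters.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTTPget itself.
class WildcardMatcher {

    /** Translates a wildcard pattern into an equivalent regular expression. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");   // any number of characters
            } else if (c == '_') {
                regex.append('.');    // exactly one character
            } else {
                // Quote everything else so that '.' in ".gif" stays literal.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true if the whole candidate matches the wildcard pattern. */
    static boolean matches(String wildcard, String candidate) {
        return toRegex(wildcard).matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("bar*.gif", "bar542.gif")); // true
        System.out.println(matches("bar_.gif", "bar12.gif"));  // false
    }
}
```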

Examples of using HTTPget

In the following we'll be using an imaginary website called www.greatcomix.com as a target.
GC keeps its archive at the address http://www.greatcomix.com/archive/ . All its strips are in files named after the pattern gc[year][month][day].gif.

To download all of these, simply type:
    java -jar HTTPget.jar http://www.greatcomix.com/archive/gc[97..01][01..12][01..31].gif

Next we would like to rename all the downloaded files. To rename them so that gc is written in capitals, add the switch:

   -o GC[97..01][01..12][01..31].gif
Another comic is hosted at Greatcomix. It is not published on a daily basis, so the files are simply called bar1.gif, bar2.gif, bar3.gif ... bar542.gif. The annoying thing about downloading these is that when you view them in a sorted list, 10 comes immediately after 1, 100 comes after 10 and so on. To avoid this problem we pad the numbers in the file names with leading zeros as the files are downloaded.

    java -jar HTTPget.jar http://www.greatcomix.com/bar/bar[1..542].gif -o bar[001..542].gif
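The effect of that command on the file names can be sketched as follows. This assumes that the i-th name produced by the main pattern is saved under the i-th name produced by the output pattern, which is an assumption about how the two patterns are paired. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not part of HTTPget itself.
class ZeroPadRename {

    /** Maps barN.gif to a zero-padded name of the given width. */
    static Map<String, String> renameMap(int first, int last, int width) {
        Map<String, String> names = new LinkedHashMap<>();
        for (int i = first; i <= last; i++) {
            String source = "bar" + i + ".gif";
            String target = String.format("bar%0" + width + "d.gif", i);
            names.put(source, target);
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> names = renameMap(1, 542, 3);
        System.out.println(names.get("bar1.gif"));   // bar001.gif
        System.out.println(names.get("bar10.gif"));  // bar010.gif
        System.out.println(names.get("bar542.gif")); // bar542.gif
    }
}
```

With every number three digits wide, an alphabetical listing and a numerical listing of the saved files come out in the same order.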

After a while Greatcomix decides that too many people are downloading their comics without looking at their banner ads. They therefore give all their files names that have no relation to their publishing date, making it harder or even impossible to define a pattern for them and download them in the correct order. Usually these files will still be linked from HTML files whose addresses are easily locatable by date. We therefore ask HTTPget to search through the HTML files for the picture files.
 

    java -jar HTTPget.jar http://www.greatcomix.com/bar/[97..01][01..12][01..31]a.html -s bar*.gif -o bar[97..01][01..12][01..31].gif
Performance can be dramatically increased in all of the above examples by adding -t N, which tells the program to download N files simultaneously.
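The idea behind -t N can be sketched with a fixed-size thread pool. This is not HTTPget's own code: the use of ExecutorService, the class name and the fetcher parameter are all assumptions made for the illustration. The actual download is replaced by a stand-in function, so the example runs without a network connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration, not part of HTTPget itself.
class ParallelFetch {

    /** Applies fetcher to every URL using at most n threads; results keep input order. */
    static <R> List<R> fetchAll(List<String> urls, int n, Function<String, R> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(n);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<R> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task)); // at most n tasks run at once
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                try {
                    results.add(future.get()); // wait for this task to finish
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("Download task failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://www.greatcomix.com/bar/bar1.gif",
                "http://www.greatcomix.com/bar/bar2.gif",
                "http://www.greatcomix.com/bar/bar3.gif");
        // Stand-in for a real download: just report the length of each URL.
        List<Integer> lengths = fetchAll(urls, 2, String::length);
        System.out.println(lengths);
    }
}
```

While one thread waits for a slow response, the others keep transferring data, which is where the gain comes from.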

Disclaimer

I do not take any responsibility for other people's use of this program. Please note that some websites may not like it if you download their content using this program. Any problems (legal or otherwise) that arise from the use of this program are in no way the responsibility of the programmer.
Please remember that a lot of websites base their earnings on you watching and clicking on banner ads. When you use HTTPget you may cause a loss of income to the website.

Comments, suggestions and bug-reports to Henrik Aasted Sorensen.