Assimilator Web Crawler


(c)2002 Broken Toaster Software


Written By Nick Lott






NOTE: These documents are incomplete and may not reflect the current version of the application, as it is still under development.



Contents:

  1. Introduction
  2. Quick Analysis Tool
  3. Starting a new search
  4. Managing a search
  5. Downloading Files
  6. Exporting the file list
  7. Ideas for future versions
  8. Changes





  1. Introduction

    Assimilator is a handy tool for the automatic search and download of files from the web. The approach is akin to drift-net fishing for files. The idea is based on a sort of six-degrees-of-separation theory: if you know a site that contains information related to what you are looking for, chances are it will have a link to another related site that may have more of the information or files you are searching for. This documentation is still under construction!

    Future areas of development

    Bugs and Issues

    Current Status

    Completely re-written with support for the Mac. The crawler is now HTTP/1.1 compliant and supports transparent redirection and resuming of downloads (resume capability depends on the server). HTML parsing is greatly improved, but support for parsing JavaScript for links has yet to be added. The new version will search images, hyperlinks and frames for links, and can search for any sort of file as specified by the user. Other recently added features include using a Google search to generate the initial URL and HTML formatting of exported file lists.
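    The resume feature relies on standard HTTP/1.1 range requests. As a rough illustration of the idea (a Python sketch, not AWC's actual code), a download might be resumed like this when the server honours the Range header:

        import os
        import urllib.request

        def resume_download(url, path):
            # Ask the server for only the bytes we are still missing.
            start = os.path.getsize(path) if os.path.exists(path) else 0
            req = urllib.request.Request(url)
            if start:
                req.add_header("Range", "bytes=%d-" % start)
            with urllib.request.urlopen(req) as resp:
                # 206 Partial Content means the server honoured the Range
                # header; anything else means starting again from scratch.
                mode = "ab" if resp.status == 206 else "wb"
                with open(path, mode) as f:
                    while True:
                        chunk = resp.read(65536)
                        if not chunk:
                            break
                        f.write(chunk)

    Whether this works depends entirely on the server: one that ignores Range simply returns 200 and the whole file, which is why resume capability is described above as server-dependent.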

    How it works

    In order to get the best out of AWC it may help to understand the way in which it works. This section aims to give you an idea of the theory behind AWC and some insight into getting better results from the program. AWC extracts all the links from a web page and then parses them for files. Any files matching the type and name specifications are queued for downloading; any links that are not files are queued for parsing, and the process repeats with the next page in the queue.
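    To make the loop concrete, here is a minimal Python sketch of the same search-and-queue cycle. The file extensions and page limit are placeholders for the user's own criteria; this illustrates the approach rather than AWC's implementation:

        from collections import deque
        from html.parser import HTMLParser
        from urllib.parse import urljoin
        import urllib.request

        FILE_TYPES = (".zip", ".sit", ".mp3")    # stand-ins for the user's file types

        class LinkExtractor(HTMLParser):
            # Collect link targets from hyperlinks, images and frames.
            def __init__(self, base):
                super().__init__()
                self.base = base
                self.links = []

            def handle_starttag(self, tag, attrs):
                attrs = dict(attrs)
                if tag == "a" and "href" in attrs:
                    self.links.append(urljoin(self.base, attrs["href"]))
                elif tag in ("img", "frame") and "src" in attrs:
                    self.links.append(urljoin(self.base, attrs["src"]))

        def crawl(start_url, page_limit=100):
            search_queue = deque([start_url])
            file_queue, seen = [], {start_url}
            while search_queue and page_limit > 0:
                url = search_queue.popleft()
                page_limit -= 1
                try:
                    html = urllib.request.urlopen(url).read().decode("latin-1")
                except (OSError, ValueError):
                    continue                         # unreachable page: move on
                parser = LinkExtractor(url)
                parser.feed(html)
                for link in parser.links:
                    if link in seen:
                        continue
                    seen.add(link)
                    if link.lower().endswith(FILE_TYPES):
                        file_queue.append(link)      # matching files await download
                    else:
                        search_queue.append(link)    # other pages are parsed in turn
            return file_queue

    The two queues here correspond directly to the Search Queue and File Queue panels described under "Managing a search" below.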

  2. Quick Analysis Tool

    The quick analysis tool is used to gain a quick insight into the links contained on a certain page (or, at least, those detected by AWC). To check a URL, simply type it into the text box and hit return or click the "Get" button. You can abort an HTTP request at any stage by clicking the "Stop" button. Once the status text registers that the page has been successfully received, you can view any aspect of the page by clicking the appropriate button.


    These tools enable you to quickly check a URL to see if it is suitable for a search, and may help you diagnose problems with unexpected results from some pages.

  3. Starting a new search

    A new search is started by selecting the appropriate option from the File menu. This displays a dialog box for defining the search criteria. The first entry is the initial URL: the location of the web page on which the search will start. This entry is mandatory. To the right of this is a check box labeled "Google"; when it is enabled, the first entry instead becomes the keywords for a search on www.google.com, the result of which forms the initial URL for the search.

    In each of the fields of this window, multiple keywords or types are separated by a single space. If a keyword is preceded by a minus sign, that keyword must not be found in the search field, e.g. "findThis -notThis". To the right of some of the fields is an option labeled "All"; when it is enabled, all keywords in the corresponding field must be found (or not found, in the case of a negative keyword) for a match to be made.
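    As a rough sketch of these matching rules (a hypothetical Python helper, not AWC's own code), a keyword spec like the one above could be applied as follows:

        def matches(text, spec, match_all=False):
            # Keywords are separated by spaces; a leading '-' marks a
            # keyword that must NOT appear in the field being tested.
            words = spec.lower().split()
            text = text.lower()
            required = [w for w in words if not w.startswith("-")]
            forbidden = [w[1:] for w in words if w.startswith("-")]
            if any(w in text for w in forbidden):
                return False
            if not required:
                return True
            if match_all:
                # The "All" option: every positive keyword must appear.
                return all(w in text for w in required)
            # Without "All", any one keyword is assumed to be enough.
            return any(w in text for w in required)

        matches("a page about findThis", "findThis -notThis")      # True
        matches("findThis but also notThis", "findThis -notThis")  # False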


    Once the criteria have been appropriately specified, click "Go" to begin the search.

  4. Managing a search

    Once a search has begun, the search window will appear. This window is used to display the progress of the current search, as well as to manipulate the pages searched and the order in which they are searched. At the top of the window are the main operation buttons: Pause/Continue, Skip, Refine, Exit and More/Less. In the bottom half of the window are the various information panels: Search Queue, History, File Queue and Info. Each of these panels displays different information about the search currently underway.

    Searches need to be "managed" because the queues can quickly fill with irrelevant URLs; the buttons and panels below let you steer a search while it runs.

    Skip

    The Skip button is useful should the search fall into an infinite redirection loop, which can occur when a page tries to redirect the client to a page of a different protocol (this problem is being investigated and will hopefully be solved soon), or when the page currently being searched is known not to contain relevant hyperlinks. On pressing the Skip button, any transfer of information is abandoned and the file is removed from the search queue.
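    Until the protocol-change problem is fixed, such loops can be detected by remembering the URLs already visited in a redirect chain. Below is a minimal, plain-HTTP Python sketch of that idea (an illustration of the guard, not how AWC itself handles it):

        import http.client
        from urllib.parse import urljoin, urlsplit

        def fetch(url, max_redirects=5):
            seen = set()
            while True:
                if url in seen or len(seen) > max_redirects:
                    # Same place twice, or too many hops: give up, as Skip does.
                    raise RuntimeError("redirection loop at " + url)
                seen.add(url)
                parts = urlsplit(url)
                conn = http.client.HTTPConnection(parts.netloc)
                path = parts.path or "/"
                if parts.query:
                    path += "?" + parts.query
                conn.request("GET", path)
                resp = conn.getresponse()
                if resp.status in (301, 302, 303, 307, 308):
                    url = urljoin(url, resp.getheader("Location"))
                    conn.close()
                    continue
                return resp.read()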

    Pause/Continue

    The Pause button disables the automatic downloading and searching of pages from the search queue. When the Pause button is pressed, its caption changes to Continue to reflect its new operation. Once the search has been paused, the current page is still downloaded and searched, but no further pages are fetched. This can be useful for checking the contents of the search queue, which can quickly grow out of hand and fill with irrelevant URLs. While the search is paused you can easily browse the search queue and remove any URLs deemed irrelevant, without the queue growing as you do so.
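    Internally this amounts to the search loop checking a pause flag between pages, so the page in flight always completes. A tiny Python sketch of that behaviour (hypothetical names, not AWC's code):

        import threading

        resume = threading.Event()
        resume.set()                   # searching runs while the event is set

        def search_worker(search_queue, process_page):
            # search_queue is a deque of pages; process_page downloads
            # and parses one page in full before the flag is re-checked.
            while search_queue:
                resume.wait()          # blocks here once Pause clears the event
                process_page(search_queue.popleft())

        # Pause button:    resume.clear()   (the current page still completes)
        # Continue button: resume.set()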

    Refine

    The purpose of the Refine button is to allow the user to change the search criteria mid-search. The reasons a user may wish to do this are twofold: either the current criteria are too narrow and too few files and/or pages enter the queues, or the criteria are too wide and the queues quickly fill with inappropriate or irrelevant files.

    Search Queue Panel

    As a search progresses, pages and files that meet the appropriate criteria will be added to their respective queues. You may find as this happens that some URLs appear in the list that you do not wish to search, such as ads or unrelated sites, or files that you do not wish to download.

    When this happens you can remove pages by hand using the Delete button in the search panel, or refine the initial search criteria. Currently, Refine will only start a new search, but it is hoped that it will soon be possible to refine a search in progress and reapply the criteria to the existing queues.

    If a URL in the search queue is strongly suspected to contain relevant links, it can be promoted to the top of the queue to be searched next. The URL at the top of the search queue is requested and parsed for links; any links that meet the page criteria are added to the bottom of the search queue, and any that meet the file criteria are added to the file queue.

    If you wish to add a relevant page to the search queue, you can do so by typing it into the small text box provided and clicking the Add button.
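    Taken together, the queue operations described above (promote, delete and add) are simple list manipulations. A minimal Python sketch, illustrative only:

        from collections import deque

        search_queue = deque(["http://a.example/", "http://b.example/",
                              "http://c.example/"])

        def promote(queue, url):
            # Move a queued URL to the front so it is searched next.
            queue.remove(url)
            queue.appendleft(url)

        def delete(queue, url):
            # Drop an irrelevant URL from the queue by hand.
            queue.remove(url)

        def add(queue, url):
            # The Add button appends a page typed in by the user.
            queue.append(url)

        promote(search_queue, "http://c.example/")   # c.example is parsed next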

    History Panel

    Info Panel

    This panel contains all the search rules and criteria applied to the search currently in progress.

  5. Downloading Files

  6. Exporting the file list

  7. Ideas for future versions

    The following are ideas for the future development of AWC once the current version is up to scratch. After the basic functionality is complete and working bug-free, work will begin on expanding AWC to make it more powerful.