Overview
A web crawler is a tool designed to browse the internet methodically and systematically to gather information from web pages. When deploying a multithreaded web crawler, one can expect significant performance improvements due to the concurrent fetching of multiple web pages. However, concurrency introduces challenges, especially when multiple threads try to access shared resources like a queue containing URLs. The primary concern is avoiding duplicate work, where two or more threads might end up processing the same URL because of simultaneous access.
Step-by-step Algorithm
- Prepare a list of web pages to be visited, named the "To-Visit List".
- Maintain a record of web pages that have been visited, named the "Visited Record".
- Each crawler agent follows this routine (a code sketch follows this list):
  - Check if there are still web pages in the "To-Visit List".
  - If the list is empty, stop.
  - Secure exclusive access to the "To-Visit List".
  - Take a web page from the "To-Visit List".
  - Ensure the web page is not in the "Visited Record". If it is, release exclusive access and go back to the first step of this routine.
  - Mark this web page as visited by adding it to the "Visited Record".
  - Release exclusive access to the "To-Visit List".
  - Access and process the web page.
  - Identify new web pages linked from this page.
  - Secure exclusive access to the "To-Visit List".
  - Add these new web pages to the "To-Visit List", ensuring no duplicates.
  - Release exclusive access to the "To-Visit List".
- Start the routine for all crawler agents simultaneously.
- Continue until all agents stop.
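To make the routine concrete, here is a minimal C++17 sketch of one way to implement it. The names (toVisitList, visitedRecord, queueLock, visitedLock, crawlerAgent) anticipate the walkthrough below; getLinksFromPage is a hypothetical stub that fabricates a bounded set of fake links so the example terminates, where a real crawler would fetch the page and parse its links.

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

// Shared state; names mirror the walkthrough below.
std::queue<std::string> toVisitList;           // "To-Visit List"
std::unordered_set<std::string> visitedRecord; // "Visited Record"
std::mutex queueLock;                          // protects toVisitList
std::mutex visitedLock;                        // protects visitedRecord

// Hypothetical stub: a real crawler would fetch the page and parse its links.
// The length cutoff bounds the crawl so the sketch terminates.
std::vector<std::string> getLinksFromPage(const std::string& url) {
    if (url.size() > 24) return {};
    return { url + "/a", url + "/b" };
}

void crawlerAgent() {
    while (true) {
        std::string url;
        {   // Step 1: take a URL from the To-Visit List, or stop if it is empty.
            // (Following the routine above, an agent that sees an empty list stops.)
            std::lock_guard<std::mutex> guard(queueLock);
            if (toVisitList.empty()) return;
            url = toVisitList.front();
            toVisitList.pop();
        }
        {   // Step 2: skip the URL if already visited; otherwise mark it visited.
            std::lock_guard<std::mutex> guard(visitedLock);
            if (!visitedRecord.insert(url).second) continue; // already visited
        }
        // Step 3: process the page (simulated).
        std::cout << "Processing: " << url << "\n";
        std::this_thread::sleep_for(std::chrono::seconds(1));

        // Step 4: add newly discovered links, avoiding duplicates.
        std::vector<std::string> links = getLinksFromPage(url);
        std::scoped_lock both(queueLock, visitedLock);
        for (const std::string& link : links) {
            if (visitedRecord.find(link) == visitedRecord.end()) {
                toVisitList.push(link);
            }
        }
    }
}

int main() {
    toVisitList.push("http://example.com"); // seed URL
    std::vector<std::thread> agents;
    for (int i = 0; i < 5; ++i) agents.emplace_back(crawlerAgent);
    for (std::thread& t : agents) t.join(); // continue until all agents stop
}
```

One caveat of this sketch: following the routine exactly, an agent stops as soon as it observes an empty To-Visit List, even though another agent might be about to add links. A production crawler would also track in-flight work before declaring the queue drained.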
Algorithm Walkthrough
Let's walk through a dry run of the web crawler sketched above, paying special attention to the synchronization aspects.
Initialization
- toVisitList: Initialized with a seed URL "http://example.com".
- visitedRecord: Initialized as an empty set.
- queueLock: Mutex initialized to protect access to toVisitList.
- visitedLock: Mutex initialized to protect access to visitedRecord.
- Crawler Agents: 5 threads are created to act as crawler agents.
Execution
The execution for each crawler agent can be divided into several steps:
1. Get URL from Queue
   - Acquire queueLock to get exclusive access to toVisitList.
   - If toVisitList is not empty, pop a URL from the queue and release queueLock.
   - If toVisitList is empty, release queueLock and exit the loop (terminate the thread).
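Zooming in on this step as it appears in the sketch above: a scoped guard keeps the empty-check and the pop inside one critical section, and releases queueLock before the slow page processing begins, so the agents do not serialize on I/O.

```cpp
std::string url;
{   // The empty-check, front(), and pop() form one atomic step.
    std::lock_guard<std::mutex> guard(queueLock);
    if (toVisitList.empty()) return; // nothing left: this agent terminates
    url = toVisitList.front();
    toVisitList.pop();
}   // guard is destroyed here, releasing queueLock before any slow work
```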
2. Check if URL is Visited
   - Acquire visitedLock to check visitedRecord.
   - If the URL is in visitedRecord, release visitedLock and go back to step 1.
   - If the URL is not in visitedRecord, insert it and release visitedLock.
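The check and the insertion must happen under a single acquisition of visitedLock: if an agent released the lock between checking and inserting, two agents could both find the URL absent and both proceed to process it. The sketch above collapses the two operations into one call, since std::unordered_set::insert reports whether the element was newly added.

```cpp
{   // Check-and-mark as one atomic operation under visitedLock.
    std::lock_guard<std::mutex> guard(visitedLock);
    // insert() returns {iterator, bool}; the bool is false if url was already present.
    if (!visitedRecord.insert(url).second) continue; // already visited: back to step 1
}
```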
3. Process Page: Print "Processing: URL" and sleep for 1 second to simulate processing time.
4. Add New Links to Queue
   - Simulate fetching links from the page using getLinksFromPage.
   - Acquire queueLock.
   - For each link fetched, check that it is not in visitedRecord; if it is not, add it to toVisitList.
   - Release queueLock.
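One subtlety in this step: the duplicate check reads visitedRecord while the new links go into toVisitList, so it touches both shared structures. The sketch above takes the conservative route of holding both mutexes for the duration; std::scoped_lock acquires them together, which avoids the deadlock that inconsistent lock ordering could otherwise cause.

```cpp
// Dedup against visitedRecord while pushing into toVisitList.
std::vector<std::string> links = getLinksFromPage(url); // simulated fetch, outside any lock
std::scoped_lock both(queueLock, visitedLock);          // locks both, deadlock-free
for (const std::string& link : links) {
    if (visitedRecord.find(link) == visitedRecord.end()) {
        toVisitList.push(link);
    }
}
```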
5. Repeat: Go back to step 1.
Synchronization Aspects
- queueLock
  - Ensures that only one thread can access toVisitList at a time.
  - Prevents race conditions where two threads might try to pop the same URL.
  - Protects the integrity of the data in the queue.
- visitedLock
  - Ensures that only one thread can access visitedRecord at a time.
  - Prevents race conditions where two threads might process the same URL simultaneously.
  - Protects the integrity of the data in the set.
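To make the first point concrete, here is the interleaving that queueLock rules out if the empty-check, front(), and pop() from step 1 were to run unlocked:

```cpp
// Without queueLock, two agents can interleave on a one-element queue:
//
//   Agent A: toVisitList.empty()   -> false
//   Agent B: toVisitList.empty()   -> false
//   Agent A: front(); pop();       -> takes the last URL
//   Agent B: front(); pop();       -> undefined behavior: the queue is now empty
//
// Because each agent holds queueLock across the check and the pop, the two
// critical sections can only run one after the other, never interleaved:
{
    std::lock_guard<std::mutex> guard(queueLock);
    if (!toVisitList.empty()) { /* safe to front() and pop() here */ }
}
```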
Final Result
Through the use of mutex locks (queueLock and visitedLock), the program ensures that the shared resources (toVisitList and visitedRecord) are accessed in a thread-safe manner, preventing race conditions and preserving data integrity. The program simulates a web crawler in which multiple agents concurrently process URLs, fetch new links, and ensure that each URL is processed only once.