Grokking Multithreading and Concurrency for Coding Interviews
Problem 16: Scenario: Multithreaded Web Crawler

Overview

A web crawler is a tool that browses the internet methodically and systematically to gather information from web pages. A multithreaded web crawler can deliver significant performance improvements because multiple pages are fetched concurrently. However, concurrency introduces challenges, especially when multiple threads access shared resources such as the queue of URLs to visit. The primary concern is avoiding duplicate work, where two or more threads end up processing the same URL because of simultaneous access.
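
As a small illustration (not code from this lesson), the unsafe pattern below shows how duplicate work slips in: two threads can both pass the membership check before either one records the URL, so both go on to process the same page. The fix used in this lesson is to hold a mutex across the check and the update, as the walkthrough below does with visitedLock.

visitedRecord = set()   # shared Visited Record, not protected by any lock

def claim(url):
    # Check-then-act race: two threads can both evaluate
    # "url not in visitedRecord" as True before either adds the URL,
    # so both end up processing the same page.
    if url not in visitedRecord:
        visitedRecord.add(url)
        return True
    return False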

Step-by-step Algorithm

  • Prepare a list of web pages to be visited, named "To-Visit List".
  • Maintain a record of web pages that have been visited, named "Visited Record".
  • Each crawler agent follows this routine (a Python sketch of it appears after this list):
    • Check if there are still web pages in the "To-Visit List".
    • If the list is empty, stop.
    • Secure exclusive access to the "To-Visit List".
    • Take a web page from the "To-Visit List".
    • Ensure the web page is not in the "Visited Record". If it is, release exclusive access and go back to the first step of this routine.
    • Mark this web page as visited by adding it to the "Visited Record".
    • Release exclusive access to the "To-Visit List".
    • Access and process the web page.
    • Identify new web pages linked from this page.
    • Secure exclusive access to the "To-Visit List".
    • Add these new web pages to the "To-Visit List", ensuring no duplicates.
    • Release exclusive access to the "To-Visit List".
  • Start the routine for all crawler agents simultaneously.
  • Continue until all agents stop.
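
One way to express this routine in Python is sketched below. It is an illustration, not the lesson's reference implementation: it uses the two mutexes introduced in the walkthrough that follows (queueLock for the To-Visit List, visitedLock for the Visited Record), and getLinksFromPage is a stub standing in for real fetching and link extraction.

import threading
import time
from collections import deque

toVisitList = deque(["http://example.com"])   # To-Visit List, seeded
visitedRecord = set()                          # Visited Record
queueLock = threading.Lock()                   # guards toVisitList
visitedLock = threading.Lock()                 # guards visitedRecord

def getLinksFromPage(url):
    # Stub: a real crawler would fetch the page and extract its links.
    return []

def crawler_agent():
    while True:
        # Secure exclusive access to the To-Visit List and take a page.
        with queueLock:
            if not toVisitList:
                return                    # list is empty: this agent stops
            url = toVisitList.popleft()

        # Skip the page if it is already in the Visited Record,
        # otherwise mark it as visited.
        with visitedLock:
            if url in visitedRecord:
                continue                  # go back to the top of the loop
            visitedRecord.add(url)

        # Access and process the web page.
        print("Processing:", url)
        time.sleep(1)                     # simulated processing time

        # Identify new pages and add them, ensuring no duplicates.
        links = getLinksFromPage(url)
        with queueLock:
            for link in links:
                # Reading visitedRecord here without visitedLock is a benign
                # race: a duplicate would still be filtered out by the
                # visited check above before it is processed.
                if link not in visitedRecord and link not in toVisitList:
                    toVisitList.append(link)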

Algorithm Walkthrough

Let's walk through a dry run of this web crawler, paying special attention to the synchronization aspects.

Initialization

  • toVisitList: Initialized with a seed URL "http://example.com".
  • visitedRecord: Initialized as an empty set.
  • queueLock: Mutex initialized to protect access to toVisitList.
  • visitedLock: Mutex initialized to protect access to visitedRecord.
  • Crawler Agents: 5 threads are created to act as crawler agents.
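
Rendered in Python, this setup is the same shared state the sketch above begins with, plus the five agent threads (crawler_agent refers to the routine sketched earlier):

import threading
from collections import deque

toVisitList = deque(["http://example.com"])   # seeded To-Visit List
visitedRecord = set()                          # empty Visited Record
queueLock = threading.Lock()                   # protects toVisitList
visitedLock = threading.Lock()                 # protects visitedRecord

# Five crawler agents, each running the crawler routine.
agents = [threading.Thread(target=crawler_agent) for _ in range(5)]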

Execution

The execution of each crawler agent can be divided into several steps (Python sketches of the lock-sensitive steps follow this list):

  1. Get URL from Queue

    • Acquire queueLock to get exclusive access to toVisitList.
    • If toVisitList is not empty, pop a URL from the queue and release queueLock.
    • If toVisitList is empty, release queueLock and exit the loop (terminate the thread).
  2. Check if URL is Visited

    • Acquire visitedLock to check visitedRecord.
    • If URL is in visitedRecord, release visitedLock and go back to step 1.
    • If URL is not in visitedRecord, insert it and release visitedLock.
  3. Process Page: Print "Processing: URL" and sleep for 1 second to simulate processing time.

  4. Add New Links to Queue

    • Simulate fetching links from the page using getLinksFromPage.
    • Acquire queueLock.
    • For each link fetched:
      • Check if it is not in visitedRecord.
      • If it’s not, add it to toVisitList.
    • Release queueLock.
  5. Repeat: Go back to step 1.
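
The lock-sensitive steps (1, 2, and 4) can be isolated into small helpers. This is a sketch using the names from the Initialization section; the helpers are equivalent to the corresponding parts of the routine shown earlier.

def take_next_url():
    # Step 1: pop a URL under queueLock; None means the queue is empty
    # and the calling agent should terminate.
    with queueLock:
        if not toVisitList:
            return None
        return toVisitList.popleft()

def claim_url(url):
    # Step 2: check-and-insert under visitedLock; returns True only for
    # the first thread to claim this URL, so it is processed exactly once.
    with visitedLock:
        if url in visitedRecord:
            return False
        visitedRecord.add(url)
        return True

def enqueue_new_links(url):
    # Step 4: fetch the page's links, then append unseen ones while
    # holding queueLock so no other agent sees a half-updated queue.
    links = getLinksFromPage(url)
    with queueLock:
        for link in links:
            if link not in visitedRecord and link not in toVisitList:
                toVisitList.append(link)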

Synchronization Aspects

  • queueLock

    • Ensures that only one thread can access toVisitList at a time.
    • Prevents race conditions where two threads might try to pop the same URL.
    • Protects the integrity of the data in the queue.
  • visitedLock

    • Ensures that only one thread can access visitedRecord at a time.
    • Prevents race conditions where two threads might process the same URL simultaneously.
    • Protects the integrity of the data in the set.
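
The walkthrough phrases each of these as an explicit acquire/release pair. In Python, a with statement is the idiomatic way to express the same thing, and it guarantees the release even if the body raises (using the queueLock and toVisitList names from above):

# Explicit acquire/release, as described in the walkthrough:
queueLock.acquire()
try:
    url = toVisitList.popleft()
finally:
    queueLock.release()

# Equivalent, and harder to get wrong:
with queueLock:
    url = toVisitList.popleft()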

Final Result

Through the use of mutex locks (queueLock and visitedLock), the program ensures that the shared resources (toVisitList and visitedRecord) are accessed in a thread-safe manner, preventing race conditions and preserving data integrity. The program simulates a web crawler where multiple agents are concurrently processing URLs, fetching new links, and ensuring that each URL is only processed once.
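
Putting the pieces together, a small driver (a sketch reusing the crawler_agent routine and the agents list defined above) simply starts the agents and waits for them all to stop:

for agent in agents:
    agent.start()
for agent in agents:
    agent.join()

print("Finished; visited", len(visitedRecord), "page(s)")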

Flow Chart

Python3

. . . .

