Grokking Graph Algorithms for Coding Interviews
Ask Author
Back to course home

0% completed

Vote For New Content
Solution: Minimum Genetic Mutation
Table of Contents

Contents are not accessible

Contents are not accessible

Contents are not accessible

Contents are not accessible

Contents are not accessible

Problem Statement

You have a gene sequence represented by an 8-character string, containing only 'A', 'C', 'G', or 'T' characters. To change from one gene sequence to another, you can mutate one character at a time.

For example, "AAAAACCC" --> "AAAACCCC" is one mutation.

Each intermediate sequence must be a valid gene in the provided list (bank).

The gene bank bank records all the valid gene mutations. A gene must be in bank to make it a valid gene string.

Given the two gene strings startGene and endGene and the gene bank, return the minimum number of mutations needed to mutate from startGene to endGene. If there is no such a mutation, return -1.

Note: The starting point is assumed to be valid, so it might not be included in the bank.

Examples

Example 1:

  • Input: startGene: "AACCGGTT", endGene: "AAACGGTA", bank: ["AACCGGTA", "AACCGCTA", "AAACGGTA"]
  • Expected Output: 2
  • Justification:
    • Mutations: "AACCGGTT" → "AACCGGTA" → "AAACGGTA"

Example 2:

  • Input: startGene: "AAAAACCC", endGene: "AACCCCCC", bank: ["AAAACCCC", "AAACCCCC", "AACCCCCC"]
  • Expected Output: 3
  • Justification:
    • Mutations: "AAAAACCC" → "AAAACCCC" → "AAACCCCC" → "AACCCCCC"

Example 3:

  • Input: startGene: "AACCGGTT", endGene: "CCGGTTAA", bank: ["AACCGGTA", "AACCGCTA", "AAACGGTA", "CCGGTTAA"]
  • Expected Output: -1
  • Justification: We need to mutate a total of 8 characters, but the 'bank' array contains only 4 strings. So, it is not possible to perform 8 mutations.

Constraints:

  • 0 <= bank.length <= 10
  • startGene.length == endGene.length == bank[i].length == 8
  • startGene, endGene, and bank[i] consist of only the characters ['A', 'C', 'G', 'T'].

Solution

To solve this problem, we can use a breadth-first search (BFS) approach. BFS is ideal for finding the shortest path in an unweighted graph, which in this case, is represented by gene sequences as nodes and mutations as edges. Starting from startGene, we will explore all possible single mutations and use a queue to keep track of the current gene sequence and the number of mutations. We will use a set to store the bank for quick lookups and another set to track visited sequences to avoid cycles.

This approach ensures we explore all possible paths efficiently and find the shortest mutation sequence if one exists.

Step-by-Step Algorithm

  1. Initialization:

    • Convert the bank array into a set for O(1) look-up.
    • Create a queue for BFS and initialize it with a tuple containing the startGene and a mutation count of 0.
    • Create a set to keep track of visited genes to avoid revisiting them.
  2. BFS Traversal:

    • While the queue is not empty:
      • Dequeue the front element, which includes the current gene sequence and the mutation count.
      • If the current gene sequence matches the endGene, return the mutation count as the result.
      • For each position in the gene sequence:
        • Save the original character at this position.
        • Replace this character with each of 'A', 'C', 'G', and 'T':
          • Form a new gene sequence by replacing the character.
          • If the new gene sequence is in the bank set and has not been visited:
            • Mark this new gene sequence as visited.
            • Enqueue this new gene sequence along with an incremented mutation count.
        • Restore the original character at this position.
  3. Return Statement:

    • If the queue is exhausted and no mutation path is found, return -1.

Algorithm Walkthrough

Using the following input:

  • startGene: "AACCGGTT"
  • endGene: "AAACGGTA"
  • bank: ["AACCGGTA", "AACCGCTA", "AAACGGTA"]
  1. Initialization:

    • bankSet: {"AACCGGTA", "AACCGCTA", "AAACGGTA"}
    • queue: [("AACCGGTT", 0)]
    • visited: {}
  2. First Iteration:

    • Queue before processing: [("AACCGGTT", 0)]
    • Dequeue ("AACCGGTT", 0)
      • Current gene: "AACCGGTT"
      • Mutation count: 0
      • For each character in "AACCGGTT":
        • Position 0:
          • Try 'A', 'C', 'G', 'T' (no change for 'A')
        • Position 1:
          • Try 'A', 'C', 'G', 'T' (no change for 'A')
        • Position 2:
          • Try 'A', 'C', 'G', 'T' (change to "AAGCGGTT", "AAACGGTT", "AATCGGTT")
          • "AAGCGGTT", "AAACGGTT", "AATCGGTT" is not in the bankSet
        • Position 3:
          • Try 'A', 'C', 'G', 'T' (change to "AACCAGTT", "AACCCGTT", "AACCTGTT")
          • "AACCAGTT", "AACCCGTT", "AACCTGTT" is not in the bank set.
        • Position 4:
          • Try 'A', 'C', 'G', 'T' (no change for 'G')
        • Position 5:
          • Try 'A', 'C', 'G', 'T' (no change for 'G')
        • Position 6:
          • Try 'A', 'C', 'G', 'T' (no change for 'T')
        • Position 7:
          • Try 'A', 'C', 'G', 'T'
          • "AACCGGTA" is in the bankSet and not visited
          • Enqueue ("AACCGGTA", 1)
          • Mark "AACCGGTA" as visited
    • Queue after processing: [("AACCGGTA", 1)]
    • Visited: {"AACCGGTA"}
  3. Second Iteration:

    • Queue before processing: [("AACCGGTA", 1)]
    • Dequeue ("AACCGGTA", 1)
      • Current gene: "AACCGGTA"
      • Mutation count: 1
      • For each character in "AACCGGTA":
        • Position 0:
          • Try 'A', 'C', 'G', 'T' (no change for 'A')
        • Position 1:
          • Try 'A', 'C', 'G', 'T' (no change for 'A')
        • Position 2:
          • Try 'A', 'C', 'G', 'T' (change to "AAGCGGTA", "AAACGGTA", "AATCGGTA")
          • "AAACGGTA" is in the bankSet and not visited
          • Enqueue ("AAACGGTA", 2)
          • Mark "AAACGGTA" as visited
        • Position 3:
          • Try 'A', 'C', 'G', 'T' (no change for 'C')
        • Position 4:
          • Try 'A', 'C', 'G', 'T' (no change for 'G')
        • Position 5:
          • Try 'A', 'C', 'G', 'T' (no change for 'G')
        • Position 6:
          • Try 'A', 'C', 'G', 'T' (no change for 'T')
        • Position 7:
          • Try 'A', 'C', 'G', 'T' (no change for 'A')
    • Queue after processing: [("AAACGGTA", 2)]
    • Visited: {"AACCGGTA", "AAACGGTA"}
  4. Third Iteration:

    • Queue before processing: [("AAACGGTA", 2)]
    • Dequeue ("AAACGGTA", 2)
      • Current gene: "AAACGGTA"
      • Mutation count: 2
      • Check if the current gene matches the endGene:
        • It matches "AAACGGTA"
        • Return 2 as the result

The minimum number of mutations needed to change from "AACCGGTT" to "AAACGGTA" using the given bank is 2.

Code

Python3
Python3

. . . .

Complexity Analysis

Time Complexity

  • Initialization of the bank set and visited set: O(N), where N is the number of genes in the bank.
  • BFS Traversal:
    • Each gene string has a fixed length of 8 characters.
    • For each gene string, we try changing each of the 8 characters to 'A', 'C', 'G', or 'T' (4 possibilities).
    • Checking if a mutated gene is in the bank set takes O(1) time.
    • Therefore, each gene has O(8 * 4) = O(32) possible mutations, which simplifies to O(1) because it's a constant.
    • In the worst case, we might visit all N genes.
    • Hence, the total time complexity is O(N).

Space Complexity

  • Storage for the bank set and visited set: O(N)
  • Queue for BFS traversal: O(N)
  • Storage for the gene strings in the queue: O(N * 8), which simplifies to O(N) because the length of each gene is constant (8).
  • Therefore, the total space complexity is O(N).

.....

.....

.....

Like the course? Get enrolled and start learning!

Table of Contents

Contents are not accessible

Contents are not accessible

Contents are not accessible

Contents are not accessible

Contents are not accessible