Bitap algorithm is a string search algorithm that uses the parallelism of bitwise operations Also called Baeza-Yates-Gonnet algorithm, Shift-and algorithm and Shift-or algorithm ambiguous retrieval, which is a feature not found in other string search algorithms.
-
âbitapâ (bit-parallel approximate pattern matching)
-
Implemented in asearch.
Start with a simple âunambiguous searchâ explanation Use this kind of linear NFA to determine string acceptance. Pack 1 bit of whether or not you are in each state into an integer so you can use AND and shift operations. In the case of this figure
- Initial condition 100
- Acceptance state mask accept 001
- Mask corresponding to character a 100
- Mask corresponding to character b 010
- The bit stands in the root state where the arrow grows that transitions to the next state in that character.
- State update:
state = (state & mask) >> 1
. - Acceptance decision:
state & accept ! = 0
This pattern matches âabâ in regular expression. *a.*bâ, but you can easily change it to match â.
- Masking transitions with arbitrary characters eps 110
- Bits stand in the root state of the growing transition in any character.
- state update:
state = (state & eps) | (state & mask) >> 1
.- Once the standing state of 1 in the eps mask is 1, it will remain 1 for the rest of the time.
- The implementation of asearch uses 0x20 as the character meaning this transition
ambiguous retrieval
- Express a pattern that tolerates a single character error as a separate integer
- I was going to explain each of these transitions (1), (2), and (3), but there seems to be a discrepancy between the diagram in the explanation and the implementation.
- Regarding the transition in (3)
- The figure in [/masui/ambiguous search asearch](https://scrapbox.io/masui/ambiguous search asearch) is b
- However, asearchâs implementation does
i1 |= (i0 >> 1)
after updating i0, so the transition is at a. - The implementation of â*â transitions and asearch in this figure appear to be âtransitions with any one characterâ rather than âepsilon transitions that do not consume a characterâ.
- (Although the variable name epsilon is used in the asearch implementation.)
- hence
- If implemented as shown in the NFA diagram, âabâ accepts âbâ and does not accept âa
- The current implementation of asearch accepts âaâ and does not accept âb
- If the transition is an epsilon transition that does not consume a character, both âaâ and âbâ are accepted even without the Katsurabayashi transition (3).
- Not yet examined whether it would be easy to implement this in bitwise operations.
https://github.com/masui/asearch-ruby ruby
chars.each { |c|
mask = @shiftpat[c]
i3 = (i3 & @epsilon) | ((i3 & mask) >> 1) | (i2 >> 1) | i2
i2 = (i2 & @epsilon) | ((i2 & mask) >> 1) | (i1 >> 1) | i1
i1 = (i1 & @epsilon) | ((i1 & mask) >> 1) | (i0 >> 1) | i0
i0 = (i0 & @epsilon) | ((i0 & mask) >> 1)
i1 |= (i0 >> 1)
i2 |= (i1 >> 1)
i3 |= (i2 >> 1)
}
- Look at the definition of Levenshtein Distance and think about what transitions there should be.
-
- Inserting a single character is only required if a transition to the top occurs when any single character arrives, regardless of the character, so
i1 |= i0
- The one-character replacement is good if the transition to the upper right occurs when any one character arrives, regardless of the character, so
i1 |= (i0 >> 1)
- It is troublesome that one character deletion is an epsilon transition that does not consume a character.
- Transition to âwhen the previous character arrivesâ while you are at it.
- So after the character-consuming transition
i1 |= (i0 >> 1)
- Think of it this way, there is no problem with the transition part of asearchâs implementation
- Hereâs the problem. ruby
- Inserting a single character is only required if a transition to the top occurs when any single character arrives, regardless of the character, so
def initstate
[INITPAT, 0, 0, 0]
end
- The epsilon transition before the first character comes in, i1 to i3 should also have bits standing, but they are set to 0.
- As for the diagram, if you draw it using the epsilon transition, it means that the diagonal transition includes epsilon, but if you draw it without using it, that diagonal transition is a two way transition by the mechanism of [[Bitap algorithm#5ca57e73aff09e0000f994fd|5ca57e73aff09e0000f994fd]], so both a and b for the cinnabar jumping line Therefore, it seems correct to draw both a and b on the cinnabar jumping line.
Other Topics
- The mask (shiftpat) for each character indicates âwhich state is the transition source when the character arrivesâ, so for example, if you set up bits in both the mask for âaâ and the mask for âAâ, case-insensitive matching is possible.
- Not just case-insensitive, but any type of transition with multiple types of single characters. For example \d.
- On the other hand, âcharacter-by-character maskâ is required for each character, so it is necessary to devise a way to match a Japanese string.
- The asearch implementation splits a character into multiple bytes and matches byte by byte.
- cons: A difference of one character in Japanese is the same as a difference of two characters in fuzzy search
- If one of the two bytes happens to match, itâs a one-letter difference.
- cons: A difference of one character in Japanese is the same as a difference of two characters in fuzzy search
- In my implementation, I only take the lower 1 byte of information and discard the rest. :
- The asearch implementation splits a character into multiple bytes and matches byte by byte.
>>> ensure_bytes("Hello")
b'S\x93kao'
- We consider the probability that the lower 1 byte of a Japanese string of some length (from 5 characters) happens to be a string that makes sense in an ASCII string to be negligibly small.
- cons: If you use 0x20 for a special meaning character like in the asearch implementation, you will get unexpected behavior when a Japanese character happens to be that character.
- velocity
- I did an ambiguous search with a query of length 20 on a set of 25743 page titles obtained by crawling Scrapbox and found min 29, max 47, med 33 msec.
This page is auto-translated from /nishio/Bitapăąă«ăŽăȘășă using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. Iâm very happy to spread my thought to non-Japanese readers.