-
Algorithm that performs the equivalent of a tree search without constructing a tree
- Must be able to “determine if they are ancestors” and “obtain a recent common ancestor.”
- For suffix arrays
- Consider the position-length pair to be the top of the tree.
- Length comparisons are substituted to determine if they are ancestors or not,
- To obtain the most recent common ancestor, we can use the values previously calculated in the construction of the LCP array, which can be done in O(1).
-
I can do a post-order tour of the tree nodes.
- Because it cannot be traced back to the child,
- Information on “which document it appeared in” needs to be communicated to the parent during the patrol process.
- If the number of occurrences per document ID is calculated, it would be good to determine the keyword-like quality using the concentration of occurrences, and to cut off words with too high frequency using DF.
-
Now the word is an object, and it has a document ID and its own location.
- but
- This is then grouped together into multiple documents, converted into a word ID, and the document ID can be pulled by its position in the grouped column.
- Then we can SAIS that long string together and do a frequency count.
- I’d like to put the frequency counts aside and implement the same functionality as we have now and compare speeds.
- but
-
I wonder if it needs a twist that the existing algorithm is “statistics within a given set of documents” while this one is “statistics between new documents and the existing set of documents”.
- Maybe it would be better to convert it to a suffix tree?
- It seems to me that a top-down search would have a smaller search area?
- Can I mix and match and build SA and look at the nodes before and after?
This page is auto-translated from /nishio/linkSuggest0907 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thought to non-Japanese readers.