prev pVectorSearch2023-06-06

  • Next action: fetch the data directly from the public projects and index it, without going through an export file

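A minimal sketch of what "without an export file" could look like, assuming the public Scrapbox JSON API (`/api/pages/:project` for the paged page list, `/api/pages/:project/:title/text` for page bodies); the helper names and the politeness delay are mine, not from the original post:

```python
import time
from urllib.parse import quote

import requests

def fetch_all_pages(project: str, limit: int = 100):
    """Yield page metadata dicts for every page of a public Scrapbox project."""
    skip = 0
    while True:
        res = requests.get(
            f"https://scrapbox.io/api/pages/{project}",
            params={"skip": skip, "limit": limit},
        )
        res.raise_for_status()
        data = res.json()
        yield from data["pages"]  # each dict has "title", "updated", ...
        skip += limit
        if skip >= data["count"]:
            break
        time.sleep(0.5)  # be polite to the public API

def fetch_page_text(project: str, title: str) -> str:
    """Fetch the raw text of one page."""
    url = f"https://scrapbox.io/api/pages/{project}/{quote(title, safe='')}/text"
    res = requests.get(url)
    res.raise_for_status()
    return res.text
```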
Targets: /halsk (1281 pages), /yuiseki (2679 pages), /tkgshn (5648 pages). First up: /halsk.

  • 127 sec: two left.
  • 335 sec: done.

647a080aaff09e00006bd34e

The parallelization ended up at a different layer than originally planned.

  • I had assumed it would be the function that calls the embedding API.
  • It ended up as a class that represents an index, with a method that runs the embedding in batches (a sketch follows).

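A sketch of that shape, assuming the current openai-python client and text-embedding-ada-002 (inferred from the 8191-token error below); the class and method names are mine:

```python
from openai import OpenAI
from tqdm import tqdm

oai = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

class VectorIndex:
    """Represents an index over text chunks; embedding runs as a batch method."""

    def __init__(self, model: str = "text-embedding-ada-002"):
        self.model = model
        self.vectors: list[tuple[str, list[float]]] = []

    def embed_all(self, texts: list[str], batch_size: int = 50):
        """Embed all chunks, batch_size chunks per API call (50, as in the post)."""
        for i in tqdm(range(0, len(texts), batch_size)):
            batch = texts[i : i + batch_size]
            res = oai.embeddings.create(model=self.model, input=batch)
            self.vectors.extend((t, d.embedding) for t, d in zip(batch, res.data))
```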
It’s done, so we run it and go to lunch.

This model’s maximum context length is 8191 tokens, however you requested 9158 tokens (9158 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

  • oops

Stop in pdb and observe:

(Pdb) print(list(map(get_size, texts)))
[85, 122, 88, 118, 5, 106, 100, 100, 9158, 9152, 9164, 121, 113, 81, 456, 133, 45, 6, 6, 7, 514, 351, 278, 441, 322, 27, 210, 515, 517, 252, 23, 28, 363, 92, 486, 318, 350, 326, 276, 340, 418, 355, 309, 292, 366, 334, 273, 387, 349, 341]

Ah, I see, each of those huge chunks is a single line, so line-based splitting can't break it up.

I don't know how long this troubleshooting will take.
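The fix has to split any single line that exceeds the model's 8191-token limit. A sketch with tiktoken; get_size presumably does something like this, and the hard-split strategy is my assumption:

```python
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-ada-002")

def get_size(text: str) -> int:
    """Token count, matching what the embedding API will see."""
    return len(enc.encode(text))

def split_oversized(text: str, max_tokens: int = 8000) -> list[str]:
    """Hard-split one overlong line into chunks safely under the 8191-token limit."""
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```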

  • 238/238 [11:46<00:00, 2.97s/it]
    • This is the /tkgshn data, run after the fix.
    • The 5748-page Scrapbox is split into a little over 10,000 chunks and processed in 238 batches of up to 50 chunks each.
    • About 3 seconds per batch: 238 × 2.97 s ≈ 707 s, so roughly 12 minutes total.

Put in Qdrant

  • ResponseHandlingException: The write operation timed out
  • I could upsert without wait=True for /nishio alone, but when I tried to put three people's data in at once, 1.5 of them died.
    • Did the WAL overflow?
    • PS: it was conflicting with indexing.
  • With wait=True added, it looks like this (sketch after this list):
    • 117/117 [02:57<00:00, 1.51s/it]
    • 44/44 [00:54<00:00, 1.23s/it]
    • Well, it all goes in within a few minutes, so it's nothing to worry about.
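A sketch of that wait=True upsert with qdrant-client; the collection name, payload shape, ID scheme, and endpoint are my assumptions:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from tqdm import tqdm

qd = QdrantClient(url="http://localhost:6333")  # assumed endpoint

def upsert_all(vectors, collection: str = "scrapbox", batch_size: int = 100):
    """Upsert (text, embedding) pairs in batches; wait=True blocks until each
    write is applied, instead of letting unacknowledged writes pile up."""
    for i in tqdm(range(0, len(vectors), batch_size)):
        batch = vectors[i : i + batch_size]
        points = [
            PointStruct(id=i + j, vector=vec, payload={"text": text})
            for j, (text, vec) in enumerate(batch)
        ]
        qd.upsert(collection_name=collection, points=points, wait=True)
```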

Now we can cross-search.
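Cross search here just means all three projects live in one shared collection, so a single query hits all of them; a sketch using the same assumed names as above:

```python
from openai import OpenAI
from qdrant_client import QdrantClient

oai = OpenAI()
qd = QdrantClient(url="http://localhost:6333")  # assumed endpoint

def cross_search(query: str, top: int = 5):
    """Embed the query, then search the single shared collection."""
    vec = oai.embeddings.create(
        model="text-embedding-ada-002", input=[query]
    ).data[0].embedding
    for hit in qd.search(collection_name="scrapbox", query_vector=vec, limit=top):
        print(f"{hit.score:.3f}", hit.payload["text"][:80])
```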

  • What’s that? Why do I get hits only on yuiseki?
  • Is that possible?
  • Oh, I was doing client.recreate_collection (fix sketched below).
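client.recreate_collection drops and recreates the collection, so each project's indexing run wiped the previous one and only the last project (/yuiseki) survived. A sketch of a guard that creates the collection only when it is missing (1536 is text-embedding-ada-002's vector size; the other names are assumptions):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

qd = QdrantClient(url="http://localhost:6333")

def ensure_collection(name: str = "scrapbox", dim: int = 1536):
    """Create the collection only if it does not exist, instead of
    recreate_collection, which wipes all existing points every run."""
    existing = {c.name for c in qd.get_collections().collections}
    if name not in existing:
        qd.create_collection(
            collection_name=name,
            vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
        )
```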

Redoing it.

(image)

  • Plenty of room.

What we were able to verify this time

  • If the other party's project is a public Scrapbox project, no work is required on their side.
  • Neither the time cost nor the monetary cost is significant.

@yuiseki_: If the idea is to combine multiple people's personal Scrapboxes and make them vector-searchable to test their usefulness for cooperation, consensus building, and so on, I feel the priority of what and how much to write in my own Scrapbox exploding!

@nishio: in a little while, anyone can throw an agenda at this [/villagepump/Scrapbox ChatGPT Connector Roundtable Mode](https://scrapbox.io/villagepump/Scrapbox%20ChatGPT%20Connector%20Roundtable%20Mode)

I looked at it on my iPhone and it looked terrible.

  • (image)

@yuiseki_: added about 100 pages of important information for now!

  • We need to implement an update function (one possible shape is sketched below)…
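One plausible shape for that update function (my assumption, not the author's design): diff the `updated` timestamps that the page-list API already returns and re-embed only changed pages, reusing fetch_all_pages from the first sketch:

```python
def pages_to_update(project: str, last_indexed: dict[str, int]) -> list[str]:
    """Return titles whose 'updated' epoch is newer than the value recorded
    at indexing time (new pages count as updated too).

    last_indexed maps page title -> 'updated' value when it was last embedded;
    fetch_all_pages is the helper from the first sketch above.
    """
    return [
        page["title"]
        for page in fetch_all_pages(project)
        if page["updated"] > last_indexed.get(page["title"], 0)
    ]
```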

This page is auto-translated from /nishio/pVectorSearch2023-06-07 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.