OpenAI DevDay: I saw this graph on social media and was curious about the details, so I’m glad it’s available on video!
45%→65%
- Baseline was 45%.
- Tried HyDE first, which did not work for this use case.
- We also tried fine-tuning the embeddings.
- This worked well from an accuracy standpoint, but was too expensive and slow to adopt.
- Experimented with chunk size and delimitation method.
- That improved accuracy by 20 points, to 65%.
- Not yet at a level where we can give it to our clients.
- I’ve done 20 iterations so far.
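For reference, HyDE (the technique that did *not* help in this case) replaces the query embedding with the embedding of a hypothetical answer. A minimal sketch with stand-in callables; none of this is the speaker’s actual code:

```python
# Sketch of HyDE (Hypothetical Document Embeddings). All callables
# here are hypothetical stand-ins for an LLM, an embedding model,
# and a vector search backend.
from collections.abc import Callable

def hyde_search(
    query: str,
    generate_hypothetical_answer: Callable[[str], str],   # LLM call
    embed: Callable[[str], list[float]],                  # embedding model
    search_by_embedding: Callable[[list[float]], list[str]],  # vector search
) -> list[str]:
    # 1. Ask the LLM for a plausible (possibly wrong) answer.
    fake_doc = generate_hypothetical_answer(query)
    # 2. Embed that answer-shaped text instead of the raw query; the
    #    idea is it lands closer to real documents in embedding space.
    vector = embed(fake_doc)
    # 3. Run the usual vector search with that embedding.
    return search_by_embedding(vector)
```

Whether this helps is corpus-dependent, which matches the talk’s finding that it didn’t help for their use case.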
65%→85%
- Reranking, using either a cross-encoder or a rule-based approach.
- Example of a rule: use the latest documents.
- Significant performance improvement
- Classification.
- Classified domains and assigned different metadata accordingly.
- It’s not explained concretely, but in a Cybozu-like context it would be something like: “this is a schedule, so attach the participants’ information” or “this is a space conversation, so attach the thread title and the space name”.
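As a rough illustration of the rule-based reranking, here is a minimal sketch of a “use the latest” rule: among hits with roughly the same search score, newer documents win. The field names (`score`, `updated`) and the score bucketing are my own assumptions, not from the talk:

```python
# Hypothetical rule-based reranker sketch ("use the latest" rule).
# Field names and the one-decimal score bucketing are assumptions.
from datetime import date

def rerank_latest_first(hits: list[dict]) -> list[dict]:
    # Sort by search score bucketed to one decimal (descending),
    # breaking ties by document date (newest first).
    return sorted(
        hits,
        key=lambda h: (round(h["score"], 1), h["updated"]),
        reverse=True,
    )

hits = [
    {"id": "old", "score": 0.82, "updated": date(2021, 1, 1)},
    {"id": "new", "score": 0.80, "updated": date(2023, 6, 1)},
]
# Both hits tie at the 0.8 bucket, so the newer document ranks first.
print([h["id"] for h in rerank_latest_first(hits)])  # → ['new', 'old']
```

A cross-encoder would instead re-score each (query, document) pair with a model; the rule-based variant above is the cheap alternative the talk mentions alongside it.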
85%→98%
- Prompt engineering again
- Observed again which questions were failing.
- For example, for questions that require exact numbers, they stopped extracting them from the documentation and instead provided a tool that issues SQL.
- query expansion
- The technique was different from what I had imagined from the name (I thought it meant adding easy-to-match data on the search-target side).
- Split the user’s input into multiple queries, search them in parallel, and return a combined result.
- I think this is pretty use case dependent.
- No fine-tuning was used anywhere; the speaker made a point of emphasizing this.
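The query-expansion step described above can be sketched as follows. The naive splitter stands in for what would presumably be an LLM call, and the `search` callable is a placeholder; this is my own illustration, not the system from the talk:

```python
# Sketch of query expansion: split the user's input into multiple
# queries, search them in parallel, and merge the results.
from concurrent.futures import ThreadPoolExecutor

def split_into_queries(user_input: str) -> list[str]:
    # Hypothetical splitter: a real system would likely ask an LLM to
    # decompose the question; here we just split on " and ".
    return [q.strip() for q in user_input.split(" and ")]

def expanded_search(user_input: str, search) -> list[str]:
    queries = split_into_queries(user_input)
    # Run all sub-queries in parallel against the search backend.
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(search, queries))
    # Composite result: concatenate, dropping duplicates in order.
    seen, merged = set(), []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```

As the note says, how to split and how to merge are both quite use-case dependent.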
https://www.youtube.com/watch?v=ahnGLM-RC1Y
This page is auto-translated from [/nishio/A Survey of Techniques for Maximizing LLM Performance](https://scrapbox.io/nishio/A Survey of Techniques for Maximizing LLM Performance) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.