The sample code in the OpenAI API documentation contains this Python:
import openai

def get_embedding(text, model="text-embedding-ada-002"):
    # Replace each newline with a single space before requesting the embedding
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']
Why is this replacement necessary, when spaces and line breaks are simply different tokens? Is it really needed? I was skeptical, but then I found a statement saying that the replacement improves the results.
The statement, under the heading "Replace newlines with a single space", reads:
"Unless you’re embedding code, we suggest replacing newlines (\n) in your input with a single space, as we have observed inferior results when newlines are present."
I have not run a follow-up experiment to verify this.
- I doubt the same holds for Japanese text,
- but I’m inclined to follow the advice anyway.
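As a minimal sketch of the quoted advice (replace newlines unless the input is code), something like the following could be used before calling the embedding API. The helper name `prepare_for_embedding` and the `is_code` flag are hypothetical names chosen for illustration; they are not part of the OpenAI sample.

```python
def prepare_for_embedding(text: str, is_code: bool = False) -> str:
    """Replace newlines with single spaces unless the text is code.

    Hypothetical helper for illustration; not part of the OpenAI sample.
    """
    if is_code:
        # The quoted advice exempts code, so leave its newlines untouched.
        return text
    # Same replacement as the sample code: each "\n" becomes one space.
    return text.replace("\n", " ")


if __name__ == "__main__":
    print(prepare_for_embedding("first line\nsecond line"))
    print(prepare_for_embedding("def f():\n    return 1", is_code=True))
```

Note that consecutive newlines become consecutive spaces here, exactly as in the sample's `text.replace("\n", " ")`. Whether collapsing them further changes the embedding quality is something I have not checked.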
This page is auto-translated from /nishio/Embedding前に改行をスペースにするのはなぜ? using DeepL. If you find something interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I’m very happy to share my thoughts with non-Japanese readers.