The sample code in the OpenAI API documentation reads as follows (Python):

import openai

def get_embedding(text, model="text-embedding-ada-002"):
    # Replace line breaks with spaces before requesting the embedding.
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]
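For reference, here is a minimal usage sketch, assuming the legacy openai Python client (pre-1.0, where openai.Embedding still exists) and a placeholder API key:

openai.api_key = "sk-..."  # placeholder; set your own key

vec = get_embedding("Hello\nworld")  # the newline is replaced by a space first
print(len(vec))  # text-embedding-ada-002 returns a 1536-dimensional vector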

Why is this replacement necessary when a space and a line break are themselves distinct tokens? Is it really needed? I was skeptical, but I found a statement that the replacement is recommended because it improves the results.
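The premise that a line break and a space tokenize differently can be checked directly, for example with the tiktoken library (my own addition, not part of the original sample):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by text-embedding-ada-002
print(enc.encode("Hello\nworld"))  # tokenizes to a different ID sequence...
print(enc.encode("Hello world"))   # ...than the space-separated version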

I haven’t tried to replicate this myself (a sketch of such a test follows the list below).

  • I doubt whether the same holds for Japanese.
  • But I’m inclined to follow the recommendation anyway.
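If one wanted to run that replication test, a minimal sketch might look like this (embed and cosine are hypothetical helper names I chose; only the model name comes from the original):

import numpy as np
import openai

def embed(text, model="text-embedding-ada-002"):
    # Embed the text as-is, without replacing newlines.
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

text = "一行目\n二行目"  # a two-line Japanese sample, since the doubt is about Japanese
print(cosine(embed(text), embed(text.replace("\n", " "))))

A similarity noticeably below 1.0 would confirm that the newline changes the embedding; whether the replacement actually improves results would still need a task-level comparison.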

OpenAI API


This page is auto-translated from /nishio/Embedding前に改行をスペースにするのはなぜ? using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.