What is the shape of the distribution of text embedding vectors in the first place?
The data is 18,129 posts retrieved from X; 7,574 arguments were extracted from them by Talk to the City's extraction step.
- These are embedded into a 3072-dimensional vector space with `text-embedding-3-large`.
The embedding vectors returned by OpenAI are normalized so that every vector has length 1.
- So the data is distributed on the unit sphere in 3072 dimensions.

```python
import numpy as np

def p(x):
    print(f"{x.mean()} ± {x.std() * 3}")

# Confirm every embedding has (numerically) unit norm
norm = np.linalg.norm(embeddings_array, axis=1)
p(norm)  # -> 0.999999999999999 ± 1.2314317043463034e-15
```
The values on each axis cluster in a narrow range, with a standard deviation of about 0.01:

```python
import pandas as pd

# Distribution of the per-axis standard deviations
axis_std = embeddings_array.std(axis=0)
pd.DataFrame(axis_std).describe()
```
Of the 3072 dimensions, 80 dimensions carry 60% of the information.
- 90% of the information is spread across about 500 dimensions.
- The data is very high dimensional: a two-dimensional projection captures less than 10% of the information.
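These figures can be reproduced from PCA's cumulative explained variance. A minimal sketch, assuming `embeddings_array` holds the embeddings as in the snippets above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Cumulative explained variance ratio over the principal components
pca = PCA().fit(embeddings_array)
cum = np.cumsum(pca.explained_variance_ratio_)

print(np.searchsorted(cum, 0.6) + 1)  # dimensions needed for 60% of the variance
print(np.searchsorted(cum, 0.9) + 1)  # dimensions needed for 90% of the variance
print(cum[1])                         # share captured by the first two dimensions
```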
The inner product of each vector with the other vectors is about 0.1~0.7, with an average around 0.4.
- (Figure: histogram of inner products between adjacent vectors after shuffling the order, i.e. inner products of random pairs; see the sketch after this list)
- The values on each axis move in a narrow range of about SD 0.01, yet the inner products span a fairly wide range…
- Is this an intuition gap caused by the fact that a band around a higher-dimensional sphere is actually very wide? (On the unit sphere in 3072 dimensions a typical coordinate is only about 1/√3072 ≈ 0.018 in magnitude, so a per-axis SD of 0.01 is not small relative to the coordinates themselves.)
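A minimal sketch of the shuffled-neighbor measurement, again assuming `embeddings_array`: shuffling the row order and pairing each vector with the next one is a cheap way to sample random pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
shuffled = embeddings_array[rng.permutation(len(embeddings_array))]

# Inner products of adjacent rows after shuffling = random pairs
ip = np.einsum("ij,ij->i", shuffled[:-1], shuffled[1:])
print(f"min {ip.min():.2f}, max {ip.max():.2f}, mean {ip.mean():.2f}")
```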
- Converted to Euclidean distance, it looks like this:
- Most data points are about 0.8~1.4 away from most other data points.
- This is a pretty important property, because the default eps for DBSCAN is 0.5, which means that "almost no points are adjacent":
> eps : float, default=0.5
> The maximum distance between two samples for one to be considered as in the neighborhood of the other.
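Since all vectors are unit length, distance and inner product are tied by ||x − y||² = 2 − 2⟨x, y⟩, so inner products of 0.1~0.7 correspond to distances of roughly 1.34 down to 0.77. A minimal sketch (again assuming `embeddings_array`) checking both the identity and the eps problem:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# For unit vectors: ||x - y||^2 = 2 - 2 * <x, y>
for ip in (0.7, 0.1):
    print(f"inner product {ip} -> distance {np.sqrt(2 - 2 * ip):.2f}")
# 0.7 -> 0.77, 0.1 -> 1.34: matches the observed 0.8~1.4 range

# With the default eps=0.5 almost no point has a neighbor,
# so nearly everything is labeled as noise (-1).
labels = DBSCAN(eps=0.5).fit_predict(embeddings_array)
print((labels == -1).mean())  # expected to be close to 1.0
```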
HDBSCAN `condensed_tree_.plot`
- (Figure: condensed tree of the default HDBSCAN, min_cluster_size=5)
- (Figure: smoothed density estimated by HDBSCAN with min_cluster_size=30)
- So, roughly speaking, it’s “one big cluster with a little tiny bump on it.”
- In the region below lambda ≈ 1.0, where even thin density is accepted, all ~7,000 points are connected into a single component.
- Around lambda ≈ 1.2, small sets too small to count as clusters fall out of the cluster, leaving only about 1,000 points.
- That remaining set then splits into three parts.
- With min_cluster_size=5, more of those small sets qualify as clusters, so the tree is drawn in finer detail.
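A minimal sketch of producing these condensed tree plots with the hdbscan package, assuming `embeddings_array` as above:

```python
import hdbscan
import matplotlib.pyplot as plt

# Fit HDBSCAN and plot the condensed tree: lambda (density) runs down
# the vertical axis, and branches show where clusters split or points
# fall out as the density threshold rises.
clusterer = hdbscan.HDBSCAN(min_cluster_size=30)
clusterer.fit(embeddings_array)
clusterer.condensed_tree_.plot()
plt.show()

# min_cluster_size=5 (the default) keeps more small sets as clusters,
# so the same plot comes out in finer detail.
```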
This page is auto-translated from /nishio/テキスト埋め込みベクトルの分布 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.