
The world of machine learning is powered by vectors. Not just any vectors, but dense vectors capable of representing human meaning in numeric form. These meaningful vectors are quickly replacing more traditional forms of data as the digital world becomes more ML-powered and more people-centric.

We expect intuitive natural language search, intelligent recommendations, and much more by default. To achieve this, we need dense vectors, but we also need a database for these vectors.

That is where vector databases like Pinecone come in. The vector database enables scalable, super fast, and accurate retrieval of dense vectors. Already with Pinecone, we see customers searching through billions of vectors and returning results with sub-second latency.

Today, vector search just became up to 10x faster, easier to set up, and vertically scalable. In this article, we will show you how you can get started with the latest features in Pinecone, covering:

  • Vertical scaling for p1 and s1 pods.
  • Phase one of collections, enabling static snapshots of indexes.*
  • The latest graph-based p2 pod with up to 10x faster query times.

Although we won’t cover them in detail here, there are also three more upgrades to note:

  • p1 and s1 pods now have ~50% lower latency and ~50% more throughput per replica.
  • s1 pods are now available on the free Standard plan, meaning you get 5x greater capacity.
  • Updated pricing as of September 1st for new customers.

Without any further ado, let’s explore the latest features.

*Future updates to collections will allow import/export between S3 and GCS blob storage, write streaming, and bulk data upload directly to collections.


Vertical Scaling on p1 and s1

Pods are the hardware components that all of our vectors are stored in. Naturally, they have limits. A p1 pod is expected to hold ~1M 768-dimensional vectors.


The free Standard tier comes with access to one p1 pod, and as of today, your free capacity can now be increased 5x using the newly included s1 pod.


In the past, we had to know ahead of time how many pods we needed for our vector database. Unfortunately, this isn’t always realistic. Our data needs can change over time, and we often find ourselves outgrowing our initial pod confines or overprovisioning and wasting resources.

As a result, we would need to create a new index from scratch, which isn’t fun, especially when you have millions of vectors.

Fortunately, we now have vertical scaling. Every index using p1 or s1 pods can be scaled to two, four, or eight times its original size with zero downtime. Let’s see how.

In[1]:
import pinecone

pinecone.init(
    api_key="YOUR_API_KEY",
    environment="YOUR_ENV"  # find next to API key in console
)

index_name = "oscar-minilm"

index = pinecone.Index(index_name)
# index statistics (fullness, vector counts) come from the index object itself
index.describe_index_stats()
Out[1]:
{'dimension': 384,
 'index_fullness': 1.0,
 'namespaces': {'': {'vector_count': 1345350}},
 'total_vector_count': 1345350}

Starting with a very full index (see 'index_fullness'), we need to increase the index size to add more vectors and maintain reasonable latency. We use the new pinecone.configure_index method to do this.

In[2]:
# we can scale to x2, x4, or x8, e.g.:
# p1.x2 | s1.x2
# p1.x4 | s1.x4
# p1.x8 | s1.x8
pinecone.configure_index(index_name, pod_type="p1.x2")
In[3]:
pinecone.describe_index(index_name)
Out[3]:
IndexDescription(name='oscar-minilm', metric='cosine', replicas=1, dimension=384.0, shards=1, pods=2, pod_type='p1.x2', status={'ready': True, 'state': 'ScalingUpPodSize'}, metadata_config=None, source_collection='')

By default when creating an index with pod_type as p1 or s1, we are actually creating a p1.x1 or s1.x1 pod. From either of those, we can scale up to eight times. In this case, we scaled by x2, doubling our capacity.
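To make the sizing decision concrete, here is a minimal sketch of a helper that picks the smallest pod size for a given number of vectors. It is not part of the Pinecone client: the function name, the capacity table, and the assumption that capacity grows linearly with pod size are all illustrative. The base capacities are the approximate 768-dimensional figures quoted in this article, so vectors of other dimensionality will fit differently.

```python
# Approximate capacity of an x1 pod in 768-dimensional vectors (figures from this article).
BASE_CAPACITY = {"p1": 1_000_000, "s1": 5_000_000}
SIZES = (1, 2, 4, 8)  # x1, x2, x4, x8


def smallest_pod_type(family, n_vectors):
    """Return the smallest pod_type string (e.g. 'p1.x2') that fits n_vectors.

    Hypothetical helper: assumes capacity scales linearly with pod size.
    """
    base = BASE_CAPACITY[family]
    for size in SIZES:
        if n_vectors <= base * size:
            return f"{family}.x{size}"
    raise ValueError(f"{n_vectors} vectors exceed {family}.x8 capacity")


print(smallest_pod_type("p1", 1_500_000))  # p1.x2
```

The returned string could then be passed as pod_type to pinecone.configure_index, as in the cell above.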

Collections

Another major feature of this release is collections. In the past, after we created an index, we could only reuse those vectors by keeping a local copy or iteratively retrieving them all. Neither option is ideal. Collections are the solution to this. These are essentially static indexes that we can think of as the “source of truth” for our vector data.

We can create a collection using an existing index, like the oscar-minilm index we just scaled.

In[4]:
pinecone.list_collections()  # we have no collections right now
Out[4]:
[]
In[5]:
collection_name = "oscar-minilm-collection"

# we can create a collection from an existing index
pinecone.create_collection(collection_name, index_name)
In[6]:
pinecone.list_collections()  # now we can see the collection
Out[6]:
['oscar-minilm-collection']
In[7]:
collection_info = pinecone.describe_collection(collection_name)
# as with `describe_index`, describing the collection returns a description object
collection_info
Out[7]:
<pinecone.manage.CollectionDescription at 0x10cbad430>
In[8]:
print(collection_info)  # we can view it like so, or access the attributes with `.name` etc
Out[8]:
{'name': 'oscar-minilm-collection', 'status': 'Initializing', 'dimension': 384.0}

The syntax for creating and describing collections mirrors that of the same operations for indexes. We create a new collection with pinecone.create_collection("collection_name", "index_name"). To view collection information we describe it with pinecone.describe_collection("collection_name").

We will be able to see the existence of the collection immediately. However, the collection will take some time to fully initialize and be ready for use elsewhere. We can see the collection status after describing it via the status value.
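Rather than re-running the describe call by hand, we could poll until the status changes. The sketch below is a hypothetical helper, not a Pinecone API. It assumes only that the describe function returns an object with a status attribute that eventually reads "Ready", as the output above suggests. The describe function is passed in as an argument so the helper works with any such call.

```python
import time


def wait_until_ready(describe, name, timeout=3600.0, poll_interval=10.0, sleep=time.sleep):
    """Poll describe(name) until its status is 'Ready' or the timeout passes.

    Hypothetical helper: `describe` is any callable returning an object
    with a `.status` attribute.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = describe(name).status
        if status == "Ready":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{name!r} still {status!r} after {timeout} s")
        sleep(poll_interval)
```

For example, wait_until_ready(pinecone.describe_collection, collection_name) would block until the collection can be used as a source for a new index.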

Once the collection status switches to "Ready" we can use it to create new indexes. All we need is:

In[9]:
# check that the status is "Ready"
print(pinecone.describe_collection(collection_name))
Out[9]:
{'name': 'oscar-minilm-collection', 'status': 'Ready', 'size': 2357016615, 'dimension': 384.0}
In[10]:
print(pinecone.list_indexes())  # we only have one index right now

# create the new index
pinecone.create_index(
    "oscar-minilm-p2",
    pod_type="p2",  # using the new P2 pod type
    source_collection=collection_name,
    dimension=int(collection_info.dimension),  # optional
    pods=2  # two pods to align with p1.x2
)

# now we will have two indexes...
pinecone.list_indexes()  # it can take some time (23 min in this case)
Out[10]:
['oscar-minilm']
Out[10]:
['oscar-minilm', 'oscar-minilm-p2']
In[11]:
pinecone.describe_index("oscar-minilm-p2")
Out[11]:
IndexDescription(name='oscar-minilm-p2', metric='cosine', replicas=1, dimension=384.0, shards=2, pods=2, pod_type='p2.x1', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='oscar-minilm-collection')

Here we checked the collection status for "Ready". Then, using the same pinecone.create_index method we usually use, we initialized a p2 pod index and specified source_collection to build it from our oscar-minilm-collection collection. The creation time is not instant. In this case, it took 23 minutes, a typical time for a collection of this size with the p2 pod type. p1 and s1 index creation is faster (~5 minutes).


p2 Pods

We’ve already had a glimpse of the p2 pod when we initialized a new index from our collection. p2 pods are a new pod type that enables up to 10x faster search speeds by utilizing a graph-based index. There are both pros and cons: query throughput, measured in Queries Per Second (QPS), is much higher, but the vector ingestion rate is much slower.

                                        s1              p1              p2
Capacity (768-d vectors)                ~5M             ~1M             ~1M
Query latency at full capacity (p95)    <200ms          <50ms           <10ms
QPS at full capacity / replica          ~5 QPS          ~20 QPS         ~200 QPS
Ingestion rate / pod                    10k vectors/s   10k vectors/s   50 vectors/s

The decision between p1, s1, and p2 relies on your application priorities. p2 is ideal for minimal latency, high-throughput, and indexes with relatively low update rates.
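As a rough illustration of how the throughput figures quoted above feed into that decision, the sketch below estimates how many replicas each pod type would need to serve a target query load. The helper name is hypothetical, and it assumes throughput scales linearly with the number of replicas, which is a simplification.

```python
import math

# Approximate QPS per replica at full capacity, taken from the figures above.
QPS_PER_REPLICA = {"s1": 5, "p1": 20, "p2": 200}


def replicas_needed(pod_type, target_qps):
    """Estimate replicas required to serve target_qps (assumes linear scaling)."""
    return max(1, math.ceil(target_qps / QPS_PER_REPLICA[pod_type]))


for pod in ("s1", "p1", "p2"):
    print(pod, replicas_needed(pod, 100))
# s1 20
# p1 5
# p2 1
```

At 100 QPS, a single p2 replica covers what would take several p1 replicas, which is where the trade-off against the slower ingestion rate comes in.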

In[13]:
import numpy as np

np.random.seed(0)

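# random 384-dimensional query vector
xq = np.random.rand(384).tolist()

# connect to the p2 index created from the collection
index2 = pinecone.Index("oscar-minilm-p2")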
In[14]:
%%timeit

xc = index.query(xq, top_k=100)
Out[14]:
34.9 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[15]:
%%timeit

xc = index2.query(xq, top_k=100)
Out[15]:
14.9 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If we test our two indexes, the p1 index and the p2 index, with a random query, p2 cuts latency from ~35ms to just ~15ms (including network latency).
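Outside a notebook, where %%timeit is not available, a similar comparison can be made with a small timing helper. This is a minimal sketch using only the standard library. The function name is our own, and the p95 value uses the simple nearest-rank method.

```python
import math
import statistics
import time


def measure_latency(fn, runs=50):
    """Call fn() `runs` times and return (median, p95) latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # nearest-rank percentile: the value at position ceil(0.95 * n), 1-indexed
    rank = min(len(samples), math.ceil(0.95 * len(samples)))
    return statistics.median(samples), samples[rank - 1]
```

For example, measure_latency(lambda: index.query(xq, top_k=100)) would time the p1 index, and the same call on the p2 index gives the comparison. As with the numbers above, the results include network latency.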


That’s a quick rundown of the latest features in Pinecone. All of these are currently in public preview and are not yet covered by Pinecone’s standard SLAs. Therefore we do not recommend them for production use just yet.

These are the three key features, but there are other changes too. Both p1 and s1 pods now (on average) have 50% lower latency and 50% higher throughput. s1 pods have been added to the free Standard plan, meaning standard users can store and query up to 5M 768-dimensional vectors for free.

With all of these new features, there’s plenty to be excited about. As we all move towards an increasingly vector-centric future, there’s no better time to get started with vector search than today.

Learn more about these new features in our announcement.
