Inside vector DBs: A simple guide

Understand the fundamentals of vector databases

Oct 19, 2024

Vector databases are specialized databases designed to efficiently store, index, and retrieve vector embeddings. Traditional scalar-based databases struggle with the complexity of vector data, making vector databases crucial for tasks like real-time semantic search.

While standalone vector indices like FAISS are effective for similarity search, they lack the comprehensive data management capabilities of vector databases. Vector databases support CRUD operations, metadata filtering, scalability, real-time updates, backups, ecosystem integration, and robust data security, making them better suited for production environments than standalone indices.

How does a vector database work?

Think of how you usually search a database. You type in something specific, and the system spits out the exact match. That's how traditional databases work. Vector databases are different. Instead of perfect matches, we look for the closest neighbors of the query vector. Under the hood, a vector database uses Approximate Nearest Neighbor (ANN) algorithms to find these close neighbors.

ANN algorithms are not guaranteed to return the exact top matches for a given search, but exact nearest-neighbor algorithms, which compare the query against every stored vector, are too slow to work in practice at scale. Also, it has been shown empirically that using only approximations of the top matches for a given input query works well enough. Thus, the trade-off between accuracy and latency ultimately favors ANN algorithms.

This is a typical workflow of a vector database:

1. Indexing vectors: Vectors are indexed using data structures optimized for high-dimensional data. Common indexing techniques include hierarchical navigable small world (HNSW), random projection, product quantization (PQ), and locality-sensitive hashing (LSH).

2. Querying for similarity: During a search, the database queries the indexed vectors to find those most similar to the input vector. This process involves comparing vectors based on similarity measures such as cosine similarity, Euclidean distance, or dot product. Each has unique advantages and is suitable for different use cases.

3. Post-processing results: After identifying potential matches, the results undergo post-processing to refine accuracy, for example by re-ranking the candidates with a different similarity measure. This step ensures that the most relevant vectors are returned to the user.
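To make the similarity measures from step 2 concrete, here is a minimal sketch in plain Python. It uses exact (brute-force) search instead of an ANN index, and the toy vectors and the top_k helper are made up for illustration, not taken from any particular vector database.

```python
import math

def dot(a, b):
    # Dot product: sensitive to both direction and magnitude.
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Straight-line distance: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors: ignores magnitude.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query, vectors, k, score, higher_is_better=True):
    """Exact search: score every stored vector, sort, keep the best k ids."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: score(query, item[1]),
        reverse=higher_is_better,
    )
    return [vec_id for vec_id, _ in ranked[:k]]

# Hypothetical 2-dimensional "embeddings".
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [5.0, 0.25],  # same direction as the query, much larger magnitude
    "e": [2.0, 0.4],
}
query = [1.0, 0.05]

by_cosine = top_k(query, vectors, 2, cosine_similarity)
by_euclid = top_k(query, vectors, 2, euclidean_distance, higher_is_better=False)
by_dot = top_k(query, vectors, 2, dot)

print(by_cosine)  # ['d', 'a']
print(by_euclid)  # ['a', 'b']
print(by_dot)     # ['d', 'e']
```

The three measures rank the same data differently: cosine similarity rewards matching direction, Euclidean distance rewards being physically close, and the dot product also rewards large magnitudes. That is why the choice of measure usually follows how the embedding model was trained.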

Vector databases can filter results based on metadata before or after the vector search. Both approaches have trade-offs in terms of performance and accuracy. Because the query depends on the metadata as well as the vector index, the database also maintains a metadata index used for filtering operations.
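The difference between filtering before and after the vector search can be sketched as follows. This is a simplified illustration that uses brute-force search over a handful of made-up records. The record layout and the function names are assumptions, not the API of a specific database.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records: a vector plus one metadata field.
records = [
    {"id": 1, "vector": [0.0, 0.0], "lang": "en"},
    {"id": 2, "vector": [0.1, 0.0], "lang": "de"},
    {"id": 3, "vector": [0.2, 0.0], "lang": "de"},
    {"id": 4, "vector": [0.9, 0.0], "lang": "en"},
    {"id": 5, "vector": [1.0, 0.0], "lang": "en"},
]

def nearest(query, candidates, k):
    # Stand-in for the vector index: exact top-k by Euclidean distance.
    return sorted(candidates, key=lambda r: euclidean(query, r["vector"]))[:k]

def pre_filter_search(query, k, lang):
    # Filter on metadata first, then search only the matching records.
    candidates = [r for r in records if r["lang"] == lang]
    return [r["id"] for r in nearest(query, candidates, k)]

def post_filter_search(query, k, lang):
    # Search everything first, then drop hits that fail the filter.
    hits = nearest(query, records, k)
    return [r["id"] for r in hits if r["lang"] == lang]

query = [0.04, 0.0]
pre = pre_filter_search(query, 2, "en")
post = post_filter_search(query, 2, "en")

print(pre)   # [1, 4]
print(post)  # [1]
```

Pre-filtering always returns k results when enough records match, but the filter runs over the whole collection. Post-filtering searches once and filters a small result set, but it can return fewer than k results, as it does here.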

Algorithms for creating the vector index

Vector DBs use various algorithms to create the vector index and manage searching data efficiently:

Random projection: Random projection reduces the dimensionality of vectors by projecting them into a lower-dimensional space using a random matrix. This technique approximately preserves the relative distances between vectors, so searches can run faster on the smaller vectors with little loss in quality.
Product quantization (PQ): Product quantization compresses vectors by dividing them into smaller sub-vectors and then quantizing these sub-vectors into representative codes. This reduces memory usage and speeds up similarity searches.
Locality-sensitive hashing (LSH): Locality-sensitive hashing uses hash functions that are likely to map similar vectors into the same bucket. This method enables fast approximate nearest neighbor searches by comparing the query only against the vectors in its bucket, reducing the computational complexity.
Hierarchical Navigable Small World (HNSW): HNSW constructs a multi-layer graph where each node represents a vector. Similar vectors are connected, with sparse upper layers for long jumps and denser lower layers for fine-grained search, allowing the algorithm to navigate the graph and find the nearest neighbors efficiently.
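As an illustration of one of these techniques, here is a minimal sketch of locality-sensitive hashing using random hyperplanes, a common LSH family for cosine similarity. The function names, the number of hyperplanes, and the toy vectors are assumptions made for this example. Production systems typically use several hash tables to reduce missed neighbors.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_planes, dim, seed=0):
    # Each hyperplane is a random Gaussian normal vector.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def signature(vector, hyperplanes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    bits = []
    for plane in hyperplanes:
        projection = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if projection >= 0 else 0)
    return tuple(bits)

def build_index(vectors, hyperplanes):
    # Vectors with the same signature share a bucket.
    buckets = defaultdict(list)
    for vec_id, vec in vectors.items():
        buckets[signature(vec, hyperplanes)].append(vec_id)
    return buckets

def candidates(query, buckets, hyperplanes):
    # Only the query's own bucket is inspected, not the whole dataset.
    return buckets.get(signature(query, hyperplanes), [])

planes = make_hyperplanes(num_planes=8, dim=3)
vectors = {
    "x": [1.0, 2.0, 3.0],
    "x_scaled": [2.0, 4.0, 6.0],       # same direction as x
    "x_flipped": [-1.0, -2.0, -3.0],   # opposite direction
}
index = build_index(vectors, planes)
print(candidates([1.0, 2.0, 3.0], index, planes))  # ['x', 'x_scaled']
```

Vectors pointing in the same direction always share a signature, while the opposite vector lands on the other side of every hyperplane. Vectors that are merely close share a bucket with high probability, which is where the "approximate" in ANN comes from.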

These algorithms enable vector databases to efficiently handle complex and large-scale data, making them a perfect fit for a variety of AI and machine learning applications.
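To show how compression works in practice, here is a small sketch of the encode and decode steps of product quantization. In a real system the codebooks are learned from the data, typically with k-means. Here they are hand-picked, hypothetical values so the example stays short.

```python
# Hypothetical hand-picked codebooks: one list of centroids per sub-vector.
# Real codebooks are learned from data (for example with k-means).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],  # centroids for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],  # centroids for dimensions 2-3
]
SUB_DIM = 2

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(vector):
    """Replace each sub-vector with the index of its nearest centroid."""
    codes = []
    for i, codebook in enumerate(CODEBOOKS):
        sub = vector[i * SUB_DIM:(i + 1) * SUB_DIM]
        nearest = min(
            range(len(codebook)),
            key=lambda c: squared_distance(sub, codebook[c]),
        )
        codes.append(nearest)
    return codes

def decode(codes):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    approx = []
    for codebook, code in zip(CODEBOOKS, codes):
        approx.extend(codebook[code])
    return approx

codes = encode([0.9, 1.1, 0.1, 0.8])
approx = decode(codes)

print(codes)   # [1, 0]
print(approx)  # [1.0, 1.0, 0.0, 1.0]
```

The four floats of the original vector are stored as two small integers. The reconstruction is lossy, which is the price paid for the lower memory usage.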

Database operations

Vector databases also share common characteristics with standard databases to ensure high performance, fault tolerance, and ease of management in production environments.

Key operations include:

Sharding and replication: Data is partitioned (sharded) across multiple nodes so the database can scale beyond a single machine. Replicating each shard across nodes keeps the data available and intact in case of node failures.
Monitoring: Continuous monitoring of database performance, including query latency and resource usage (RAM, CPU, disk), helps maintain optimal operations and identify potential issues before they impact the system.
Access control: Implementing robust access control mechanisms ensures that only authorized users can access and modify data. This includes role-based access controls and other security protocols to protect sensitive information.
Backups: Regular database backups are critical for disaster recovery. Automated backup processes ensure that data can be restored to a previous state in case of corruption or loss.
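To illustrate how sharding interacts with search, here is a simplified scatter-gather sketch: every shard computes its local top matches, and the results are merged into a global answer. The modulo-based shard assignment, the in-memory shards, and the function names are assumptions for this example. Replication is left out.

```python
import heapq
import math

NUM_SHARDS = 3

def shard_for(vec_id):
    # Hypothetical placement rule: integer id modulo the number of shards.
    return vec_id % NUM_SHARDS

def build_shards(vectors):
    shards = [dict() for _ in range(NUM_SHARDS)]
    for vec_id, vec in vectors.items():
        shards[shard_for(vec_id)][vec_id] = vec
    return shards

def local_top_k(shard, query, k):
    # Each shard returns its own k closest vectors as (distance, id) pairs.
    scored = ((math.dist(query, vec), vec_id) for vec_id, vec in shard.items())
    return heapq.nsmallest(k, scored)

def search(shards, query, k):
    # Scatter the query to every shard, then merge the partial results.
    partial = []
    for shard in shards:
        partial.extend(local_top_k(shard, query, k))
    return [vec_id for _, vec_id in heapq.nsmallest(k, partial)]

vectors = {i: [float(i), 0.0] for i in range(10)}
shards = build_shards(vectors)
result = search(shards, [4.2, 0.0], 3)

print(result)  # [4, 5, 3]
```

Because each shard returns k candidates, the merged result matches what a single-node exact search would return, while each node only scans its own slice of the data.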

Popular vector databases

Popular options are Qdrant, Milvus, MongoDB, Redis, Weaviate, Pinecone, Chroma, and pgvector (a PostgreSQL extension for vector indexes).

Comparing all the vector databases in detail can be challenging, as each one has its advantages. There is no clear winner: the right choice depends on your use case.

Screenshot from the Vector DB Comparison resource from Superlinked.

For example, if you already depend heavily on PostgreSQL or MongoDB, it makes sense to use their vector search option. Otherwise, you might want to choose Qdrant.

We don’t want to do an in-depth comparison here.

Still, if you're curious, you can check the Vector DB Comparison resource from Superlinked. It compares all the top vector databases in every way, from the license and release year to database features, embedding models and frameworks supported.

Conclusion

In conclusion, vector databases are indispensable for efficiently managing and retrieving high-dimensional vector data in AI applications, and selecting the right one depends on your specific use case and requirements.

Images

If not otherwise stated, all images are created by the author.