
Qdrant Tutorial: Text Similarity Search

  • By aihubon
  • Dec 19, 2023 - 2 min read




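What is Qdrant?

Qdrant is a high-performance search engine and database built in Rust and designed for vector similarity search. It delivers fast, reliable performance even under high load, which makes it a good fit for applications that need speed and scalability. Qdrant can turn your embeddings or neural network encoders into full applications for use cases such as matching, search, recommendation, or other complex operations on large datasets. Its extended filtering support makes it well suited to faceted search and semantic matching. A user-friendly API simplifies working with Qdrant, and Qdrant Cloud offers a managed solution that needs minimal setup and maintenance, making it easy to deploy and manage applications.

For more information about Qdrant, check out our dedicated AI technology page.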

What are we going to do?


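In this tutorial, we will use the Qdrant vector database to store embeddings from a Cohere model and search them using cosine similarity. We will access the model through the Cohere SDK. So, without further ado, let's get started!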
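To make the search metric concrete, here is a minimal sketch of cosine similarity in plain Python. The function name cosine_similarity and the toy vectors are illustrative assumptions, not part of the tutorial's code; Qdrant computes the score internally over the stored embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed for illustration): same direction, different length
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Because the score depends only on direction and not on vector length, two texts with similar meaning should score close to 1 even if their embeddings differ in magnitude.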

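Prerequisites

I will use Qdrant Cloud to host my database. It is worth mentioning that Qdrant offers 1 GB of memory free forever, so go ahead and use Qdrant Cloud. You can learn how to set it up here.

Now let's create a new virtual environment in the project directory and install the required packages: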

python3 -m venv venv
source venv/bin/activate
pip install cohere qdrant-client python-dotenv

Please create a .py file for the project.

Data


We will store the data in JSON format. Feel free to copy it into a file named data.json, which the script reads later:

[
  { "key": "Lion", "desc": "Majestic big cat with golden fur and a loud roar." },
  { "key": "Penguin", "desc": "Flightless bird with a tuxedo-like black and white coat." },
  { "key": "Gorilla", "desc": "Intelligent primate with muscular build and gentle nature." },
  { "key": "Elephant", "desc": "Large mammal with a long trunk and gray skin." },
  { "key": "Koala", "desc": "Cute and cuddly marsupial with fluffy ears and a big nose." },
  { "key": "Dolphin", "desc": "Playful marine mammal known for its intelligence and acrobatics." },
  { "key": "Orangutan", "desc": "Shaggy-haired great ape found in the rainforests of Borneo and Sumatra." },
  { "key": "Giraffe", "desc": "Tallest land animal with a long neck and spots on its fur." },
  { "key": "Hippopotamus", "desc": "Large, semi-aquatic mammal with a wide mouth and stubby legs." },
  { "key": "Kangaroo", "desc": "Marsupial with powerful hind legs and a long tail for balance." },
  { "key": "Crocodile", "desc": "Large reptile with sharp teeth and a tough, scaly hide." },
  { "key": "Chimpanzee", "desc": "Closest relative to humans, known for its intelligence and tool use." },
  { "key": "Tiger", "desc": "Striped big cat with incredible speed and agility." },
  { "key": "Zebra", "desc": "Striped mammal with a distinctive mane and tail." },
  { "key": "Ostrich", "desc": "Flightless bird with long legs and a big, fluffy tail." },
  { "key": "Rhino", "desc": "Large, thick-skinned mammal with a horn on its nose." },
  { "key": "Cheetah", "desc": "Fastest land animal with a spotted coat and sleek build." },
  { "key": "Polar Bear", "desc": "Arctic bear with a thick white coat and webbed paws for swimming." },
  { "key": "Peacock", "desc": "Colorful bird with a vibrant tail of feathers." },
  { "key": "Kangaroo", "desc": "Marsupial with powerful hind legs and a long tail for balance." },
  { "key": "Octopus", "desc": "Intelligent sea creature with eight tentacles and the ability to change color." },
  { "key": "Whale", "desc": "Enormous marine mammal with a blowhole on top of its head." },
  { "key": "Sloth", "desc": "Slow-moving mammal found in the rainforests of South America." },
  { "key": "Flamingo", "desc": "Tall, pink bird with long legs and a curved beak." }
]
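Note that the list above contains the Kangaroo entry twice. Since every record is later indexed under a fresh random ID, both copies would be stored as separate points and could both show up in search results. A minimal sketch of an optional clean-up step follows; the helper name dedupe_records is a hypothetical addition, not part of the original tutorial.

```python
from typing import Dict, List


def dedupe_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record for each "key", preserving the original order."""
    seen = set()
    unique = []
    for record in records:
        if record["key"] in seen:
            continue  # a record with this key was already kept
        seen.add(record["key"])
        unique.append(record)
    return unique


# Small sample in the same shape as the JSON data
sample = [
    {"key": "Kangaroo", "desc": "Marsupial with powerful hind legs."},
    {"key": "Peacock", "desc": "Colorful bird with a vibrant tail of feathers."},
    {"key": "Kangaroo", "desc": "Marsupial with powerful hind legs."},
]
print([record["key"] for record in dedupe_records(sample)])
```

One way to use it would be to call dedupe_records on the loaded JSON list before passing it to the indexing step.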

Environment variables

Create a .env file and store your Cohere API key, Qdrant API key, and Qdrant host in it:

QDRANT_API_KEY=
QDRANT_HOST=
COHERE_API_KEY=

Importing libraries

import json
import os
import uuid
from typing import Dict, List

import cohere
from dotenv import load_dotenv
from qdrant_client import QdrantClient
from qdrant_client.http import models

Loading environment variables

load_dotenv()

QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
QDRANT_HOST = os.getenv("QDRANT_HOST")
COHERE_API_KEY = os.getenv("COHERE_API_KEY")
COHERE_SIZE_VECTOR = 4096  # Larger model

if not QDRANT_API_KEY:
    raise ValueError("QDRANT_API_KEY is not set")

if not QDRANT_HOST:
    raise ValueError("QDRANT_HOST is not set")

if not COHERE_API_KEY:
    raise ValueError("COHERE_API_KEY is not set")
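The three checks above follow the same pattern, so they could be folded into one small helper. This is only a sketch of an alternative; require_env is a hypothetical name that does not appear in the original code.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail loudly if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise ValueError(f"{name} is not set")
    return value
```

With this helper, each setting becomes a single line such as QDRANT_API_KEY = require_env("QDRANT_API_KEY"), placed after the load_dotenv() call so the .env values are already loaded.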

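How do we index data and search it later?

I will implement a SearchClient class that can index and access our data. The class will contain all the necessary functionality, such as indexing and searching, as well as converting the data into the required format.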

class SearchClient:
    def __init__(
        self,
        qdrant_api_key: str = QDRANT_API_KEY,
        qdrant_host: str = QDRANT_HOST,
        cohere_api_key: str = COHERE_API_KEY,
        collection_name: str = "animal",
    ):
        self.qdrant_client = QdrantClient(host=qdrant_host, api_key=qdrant_api_key)
        self.collection_name = collection_name

        # Drops any existing collection with this name and creates a fresh one.
        # The vector size must match the output dimension of the embedding model.
        self.qdrant_client.recreate_collection(
            collection_name=self.collection_name,
            vectors_config=models.VectorParams(
                size=COHERE_SIZE_VECTOR, distance=models.Distance.COSINE
            ),
        )

        self.co_client = cohere.Client(api_key=cohere_api_key)

    # Qdrant requires data in float format
    def _float_vector(self, vector: List[float]):
        return list(map(float, vector))

    # Embedding using Cohere Embed model (one request per text)
    def _embed(self, text: str):
        return self.co_client.embed(texts=[text]).embeddings[0]

    # Prepare Qdrant Points
    def _qdrant_format(self, data: List[Dict[str, str]]):
        points = [
            models.PointStruct(
                id=uuid.uuid4().hex,  # random ID, new on every run
                payload={"key": point["key"], "desc": point["desc"]},
                vector=self._float_vector(self._embed(point["desc"])),
            )
            for point in data
        ]
        return points

    # Index data
    def index(self, data: List[Dict[str, str]]):
        """
        data: list of dict with keys: "key" and "desc"
        """
        points = self._qdrant_format(data)
        result = self.qdrant_client.upsert(
            collection_name=self.collection_name, points=points
        )
        return result

    # Search using text query
    def search(self, query_text: str, limit: int = 3):
        query_vector = self._embed(query_text)
        return self.qdrant_client.search(
            collection_name=self.collection_name,
            query_vector=self._float_vector(query_vector),
            limit=limit,
        )
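The class above embeds one description per request. Since the embed call takes a list of texts, larger datasets could be embedded in batches to cut the number of requests. The sketch below shows one way to do that; the names chunked, embed_in_batches, and embed_fn are hypothetical, and the default batch size of 96 is an assumption that should be checked against Cohere's current limits.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 96,  # assumed limit, not taken from the tutorial
) -> List[List[float]]:
    """Embed texts batch by batch; the result order matches the input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_fn(list(batch)))
    return vectors


# Stand-in embedding function so the sketch runs without an API key
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

Inside SearchClient, embed_fn could be a small wrapper such as lambda batch: self.co_client.embed(texts=batch).embeddings, mirroring the call already used in _embed.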

Let's use our code!

Let's read the data from the data.json file, process it, and index it. Then we can search our database and get the top 3 results!

if __name__ == "__main__":
    client = SearchClient()

    # import data from data.json file
    with open("data.json", "r") as f:
        data = json.load(f)

    index_result = client.index(data)
    print(index_result)

    print("====")

    search_result = client.search(
        "Tallest animal in the world, quite long neck.",
    )
    print(search_result)

Results!

operation_id=0 status=
====
[ScoredPoint(id='d17eb61c-8764-4bdb-bb26-ac66c3ffa220', version=0, score=0.8677041, payload={'desc': 'Tallest land animal with a long neck and spots on its fur.', 'key': 'Giraffe'}, vector=None),
 ScoredPoint(id='4934a842-8c55-42bc-938f-a839be2505de', version=0, score=0.71296465, payload={'desc': 'Large, semi-aquatic mammal with a wide mouth and stubby legs.', 'key': 'Hippopotamus'}, vector=None),
 ScoredPoint(id='05d7e73c-a8bf-44f9-a8b4-af82e06719d0', version=0, score=0.69240415, payload={'desc': 'Large, thick-skinned mammal with a horn on its nose.', 'key': 'Rhino'}, vector=None)]

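As you can see in the first line, the indexing operation went well. As we defined, we got 3 results. The first one is (as expected) a giraffe. We also got a hippopotamus and a rhino. They are big too, but I think the giraffe is the tallest 😆.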
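The raw result objects are hard to scan. A minimal sketch for turning them into (key, score) pairs is shown here; summarize_hits is a hypothetical helper, and the sample objects are stand-ins that only mirror the payload and score fields visible in the output above, so the snippet runs without a Qdrant connection.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


def summarize_hits(hits: List[Any]) -> List[Tuple[str, float]]:
    """Return (key, score rounded to 3 decimals) for objects exposing .payload and .score."""
    return [(hit.payload["key"], round(hit.score, 3)) for hit in hits]


# Stand-ins carrying the keys and scores from the printed results
sample_hits = [
    SimpleNamespace(payload={"key": "Giraffe"}, score=0.8677041),
    SimpleNamespace(payload={"key": "Hippopotamus"}, score=0.71296465),
    SimpleNamespace(payload={"key": "Rhino"}, score=0.69240415),
]

for key, score in summarize_hits(sample_hits):
    print(f"{key}: {score}")
```

The gap between the first score and the other two makes the ranking easy to read: the giraffe description is a much closer match than the runners-up.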

I get it, so... what's next?


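To practice your Qdrant skills, I suggest building an API that lets your application index data, add new records, and search. I think you could use FastAPI for this!

If you want to try out your new skills, I suggest using them to build AI-based applications during this weekend's Cohere x Qdrant AI hackathon!

Join our community of innovators and creators shaping the future with AI! And check out our different events!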

Thank you!