An LLM semantic caching system that enhances user experience by reducing response time through cached query-result pairs.
Codefuse-ModelCache is a semantic cache for large language models (LLMs). By caching pre-generated model results, it reduces response time for similar requests and improves user experience.
This project aims to optimize LLM-based services by introducing a caching mechanism. It helps businesses and research institutions reduce the cost of inference deployment, improve model performance and efficiency, and provide scalable service for large models. By open-sourcing it, we hope to share and exchange technologies related to semantic caching for large models.
The project provides two startup scripts: flask4modelcache.py and flask4modelcache_demo.py. To start the demo service:
```bash
cd CodeFuse-ModelCache
pip install -r requirements.txt
python flask4modelcache_demo.py
```
Before starting the service, complete the following environment setup:

- Create the relational database tables with the SQL script reference_doc/create_table.sql (a minimal import sketch follows this list).
- Configure the vector database connection in modelcache/config/milvus_config.ini.
- Configure the MySQL connection in modelcache/config/mysql_config.ini.
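For reference, here is a minimal sketch of importing the table-creation script into MySQL. It is illustrative only: it assumes the `pymysql` package, a locally running MySQL instance, and a database named `modelcache`; the actual host, credentials, and database name should match your environment and modelcache/config/mysql_config.ini.

```python
# Illustrative helper: import reference_doc/create_table.sql into MySQL.
# Host, credentials, and database name are placeholders, not project defaults.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="root", password="your_password",
                       database="modelcache", charset="utf8mb4")
try:
    with open("reference_doc/create_table.sql", encoding="utf-8") as f:
        # Naive split on ';' works for plain CREATE TABLE scripts.
        statements = [s.strip() for s in f.read().split(";") if s.strip()]
    with conn.cursor() as cursor:
        for stmt in statements:
            cursor.execute(stmt)
    conn.commit()
finally:
    conn.close()
```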
The service currently provides three core functions through a RESTful API: cache writing, cache querying, and cache clearing. Demos:
Cache writing:

```python
import json
import requests

url = 'http://127.0.0.1:5000/modelcache'
type = 'insert'
scope = {"model": "CODEGPT-1008"}
chat_info = [{"query": [{"role": "system", "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."},
                        {"role": "user", "content": "Who are you?"}],
              "answer": "Hello, I am an intelligent assistant. How can I assist you?"}]
data = {'type': type, 'scope': scope, 'chat_info': chat_info}
headers = {"Content-Type": "application/json"}
res = requests.post(url, headers=headers, json=json.dumps(data))
```
Cache querying:

```python
import json
import requests

url = 'http://127.0.0.1:5000/modelcache'
type = 'query'
scope = {"model": "CODEGPT-1008"}
query = [{"role": "system", "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."},
         {"role": "user", "content": "Who are you?"}]
data = {'type': type, 'scope': scope, 'query': query}
headers = {"Content-Type": "application/json"}
res = requests.post(url, headers=headers, json=json.dumps(data))
```
Cache clearing:

```python
import json
import requests

url = 'http://127.0.0.1:5000/modelcache'
type = 'remove'
scope = {"model": "CODEGPT-1008"}
remove_type = 'truncate_by_model'
data = {'type': type, 'scope': scope, 'remove_type': remove_type}
headers = {"Content-Type": "application/json"}
res = requests.post(url, headers=headers, json=json.dumps(data))
```
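The three calls differ only in their payloads, so they are easy to wrap in a small client. The sketch below is illustrative and not part of the project; it reuses only the fields shown in the demos (type, scope, chat_info, query, remove_type) and assumes the service is running locally on port 5000.

```python
import json
import requests

class ModelCacheClient:
    """Minimal illustrative wrapper around the ModelCache REST endpoint."""

    def __init__(self, url='http://127.0.0.1:5000/modelcache', model='CODEGPT-1008'):
        self.url = url
        self.scope = {"model": model}
        self.headers = {"Content-Type": "application/json"}

    def _post(self, payload):
        # The demos send the payload as a JSON string via the `json=` argument,
        # so the same convention is kept here.
        res = requests.post(self.url, headers=self.headers, json=json.dumps(payload))
        return res.text

    def insert(self, query, answer):
        chat_info = [{"query": query, "answer": answer}]
        return self._post({'type': 'insert', 'scope': self.scope, 'chat_info': chat_info})

    def query(self, query):
        return self._post({'type': 'query', 'scope': self.scope, 'query': query})

    def clear(self):
        return self._post({'type': 'remove', 'scope': self.scope,
                           'remove_type': 'truncate_by_model'})
```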
Related article: https://mp.weixin.qq.com/s/ExIRu2o7yvXa6nNLZcCfhQ
In terms of functionality, we have made several changes relative to GPTCache. First, we addressed the network issues with Hugging Face and improved inference speed by introducing local inference for embedding models. Second, given the limitations of the SQLAlchemy framework, we completely revamped the module that interacts with relational databases, enabling more flexible database operations. In practice, LLM products often need to integrate with multiple users and multiple models, so we added multi-tenancy support to ModelCache and made preliminary compatibility adjustments for system commands and multi-turn dialogue.
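In the REST demos above, multi-tenancy is expressed through the model field of scope: requests carrying different model values read and write isolated caches. A minimal sketch of that pattern, reusing only the payload fields from the demos (the second model name is a made-up placeholder):

```python
import json
import requests

url = 'http://127.0.0.1:5000/modelcache'
headers = {"Content-Type": "application/json"}
query = [{"role": "user", "content": "Who are you?"}]

# The same query issued under two different `model` scopes hits two
# independent caches, so tenants/models never see each other's entries.
for model in ("CODEGPT-1008", "CODEGPT-2024"):
    data = {'type': 'query', 'scope': {"model": model}, 'query': query}
    res = requests.post(url, headers=headers, json=json.dumps(data))
    print(model, res.text)
```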
| Module | Function | ModelCache | GPTCache |
|---|---|---|---|
| Basic Interface | Data query interface | ☑ | ☑ |
| | Data writing interface | ☑ | ☑ |
| Embedding | Embedding model configuration | ☑ | ☑ |
| | Large model embedding layer | ☑ | |
| | BERT model long text processing | ☑ | |
| Large model invocation | Decoupling from large models | ☑ | |
| | Local loading of embedding model | ☑ | |
| Data isolation | Model data isolation | ☑ | ☑ |
| | Hyperparameter isolation | | |
| Databases | MySQL | ☑ | ☑ |
| | Milvus | ☑ | ☑ |
| | OceanBase | ☑ | |
| Session management | Single-turn dialogue | ☑ | ☑ |
| | System commands | ☑ | |
| | Multi-turn dialogue | ☑ | |
| Data management | Data persistence | ☑ | ☑ |
| | One-click cache clearance | ☑ | |
| Tenant management | Support for multi-tenancy | ☑ | |
| | Milvus multi-collection capability | ☑ | |
| Other | Long-short dialogue distinction | ☑ | |
In ModelCache, we adopted the main idea of GPTCache, which includes the core modules adapter, embedding, similarity, and data_manager. The adapter module handles the business logic of the various tasks and connects the embedding, similarity, and data_manager modules. The embedding module converts text into semantic vector representations, transforming user queries into vectors. The similarity (rank) module sorts the recalled vectors and evaluates their similarity. The data_manager module manages the databases. To better support industrial applications, we made the architectural and functional upgrades described above.
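To make the division of labor concrete, here is a toy, self-contained illustration of how those four roles fit together in a cache lookup. It is not ModelCache's actual code: the bag-of-characters "embedding", the in-memory store, and the 0.9 threshold are all stand-ins chosen only to keep the example runnable.

```python
# Toy illustration of the adapter / embedding / similarity / data_manager split.
import math
from collections import Counter

# embedding module stand-in: text -> vector (ModelCache uses real embedding models)
def embed(text: str) -> Counter:
    return Counter(text.lower())

# similarity (rank) module stand-in: score recalled candidates
def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# data_manager module stand-in: persistence of vectors and answers
class DataManager:
    def __init__(self):
        self.rows = []                      # (vector, answer) pairs

    def save(self, vec, answer):
        self.rows.append((vec, answer))

    def search(self, vec):
        return sorted(self.rows, key=lambda r: cosine(vec, r[0]), reverse=True)

# adapter module stand-in: glue the pieces together and apply the hit threshold
class CacheAdapter:
    def __init__(self, threshold=0.9):
        self.dm = DataManager()
        self.threshold = threshold

    def insert(self, query, answer):
        self.dm.save(embed(query), answer)

    def query(self, query):
        hits = self.dm.search(embed(query))
        if hits and cosine(embed(query), hits[0][0]) >= self.threshold:
            return hits[0][1]               # cache hit
        return None                         # cache miss: caller invokes the LLM

cache = CacheAdapter()
cache.insert("Who are you?", "Hello, I am an intelligent assistant.")
print(cache.query("who are you"))           # similar query -> cached answer
```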
This project has referenced the following open-source projects. We would like to express our gratitude to the projects and their developers for their contributions and research.
GPTCache
ModelCache is a captivating and valuable project. Whether you are an experienced developer or a novice just starting out, your contributions are warmly welcomed. Raising issues, offering suggestions, writing code, or adding documentation and examples all enhance the project's quality and make a significant contribution to the open-source community.