🦙 LLaVA - Large Language and Vision Assistant
An open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
Features:
- 💬 Text-based conversation
- 🖼️ Image understanding and description
- 🔧 API endpoint for integration
Adjustable generation parameters: max_tokens (1-2048) and temperature (0.1-2.0).
API Endpoint Usage
Endpoint: https://your-space-name.hf.space/api/predict
Method: POST
Request Format (the data array holds a single JSON-encoded string):
{
  "data": [
    "{\"message\": \"Describe this image in detail\", \"system_prompt\": \"You are a helpful assistant\", \"image_url\": \"https://example.com/image.jpg\", \"max_tokens\": 1024, \"temperature\": 0.7}"
  ]
}
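The trickiest part of this format is the double encoding: the inner request object must be serialized to a string before it is placed in the data list. A minimal sketch in Python, using only the fields documented above:

import json

request_fields = {
    "message": "Describe this image in detail",
    "system_prompt": "You are a helpful assistant",
    "image_url": "https://example.com/image.jpg",  # omit or set to None for text-only chat
    "max_tokens": 1024,
    "temperature": 0.7,
}
# The API expects one JSON-encoded string inside the "data" list.
payload = {"data": [json.dumps(request_fields)]}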
Response Format (data[0] in the reply is likewise a JSON-encoded string; it is shown decoded here for readability):
{
  "id": "chatcmpl-123456789",
  "object": "chat.completion",
  "created": 1683123456,
  "model": "llava-v1.5-7b",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "This image shows..."
      },
      "index": 0,
      "finish_reason": "stop"
    }
  ]
}
Python Client Example:
import requests
import json


def query_llava(message, image_url=None, system_prompt=""):
    # The Space expects a single JSON-encoded string inside the "data" list.
    payload = {
        "data": [json.dumps({
            "message": message,
            "image_url": image_url,
            "system_prompt": system_prompt,
            "max_tokens": 1024,
            "temperature": 0.7
        })]
    }
    response = requests.post(
        "https://your-space-name.hf.space/api/predict",
        json=payload,
        timeout=120
    )
    if response.status_code == 200:
        result = response.json()
        # data[0] is itself a JSON-encoded string; decode it before reading fields.
        api_response = json.loads(result["data"][0])
        return api_response["choices"][0]["message"]["content"]
    return f"Error: {response.status_code}"


# Example usage
result = query_llava(
    "What do you see in this image?",
    image_url="https://example.com/image.jpg"
)
print(result)
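As an alternative to hand-rolled HTTP calls, the official gradio_client package can call the same Space. This is a sketch under the assumption that the handler is exposed under the default /predict route and accepts the same JSON-encoded string; adjust the Space URL and api_name to match your deployment.

import json
from gradio_client import Client

client = Client("https://your-space-name.hf.space")  # or "username/space-name"
raw = client.predict(
    json.dumps({
        "message": "What do you see in this image?",
        "image_url": "https://example.com/image.jpg",
        "system_prompt": "",
        "max_tokens": 1024,
        "temperature": 0.7,
    }),
    api_name="/predict",  # assumption: single predict endpoint
)
# The Space returns a JSON-encoded completion object, same as the raw API.
reply = json.loads(raw)["choices"][0]["message"]["content"]
print(reply)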
About LLaVA
LLaVA (Large Language and Vision Assistant) is an open-source multimodal AI assistant that combines:
- 🧠 Language Understanding: Based on Vicuna/LLaMA architecture
- 👁️ Vision Capabilities: Uses CLIP vision encoder
- 🔗 Multimodal Integration: Connects vision and language seamlessly
Key Features:
- Visual Question Answering: Ask questions about images
- Image Description: Get detailed descriptions of uploaded images
- General Conversation: Chat about any topic
- API Integration: Easy integration with your applications
Model Information:
- Base Model: LLaVA-v1.5-7B
- Vision Encoder: CLIP ViT-L/14@336px
- Language Model: Vicuna-7B
- Training Data: LLaVA-Instruct-150K
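For reference, the same model can also be run locally through the transformers library instead of the hosted Space. A minimal sketch, assuming the community llava-hf/llava-1.5-7b-hf conversion of LLaVA-v1.5-7B and a GPU with enough memory:

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Example image URL; replace with your own.
image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)
prompt = "USER: <image>\nWhat do you see in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=256)
# The decoded text includes the prompt followed by the assistant's answer.
print(processor.decode(output[0], skip_special_tokens=True))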
Citation:
@misc{liu2023llava,
  title={Visual Instruction Tuning},
  author={Haotian Liu and Chunyuan Li and Qingyang Wu and Yong Jae Lee},
  year={2023},
  eprint={2304.08485},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}