# Vision

Image analysis and document processing for multimodal interactions

## Overview
The Vision system enables agents to process and analyze images, providing multimodal AI capabilities for richer interactions. It supports multiple image formats, offers various analysis modes, and integrates seamlessly with both OpenAI and local Ollama providers for flexible deployment options.
## Enabling Vision

Enable vision capabilities for an agent by setting the `vision` option to `true`:

```typescript
import { Agent } from '@astreus-ai/astreus';

const agent = await Agent.create({
  name: 'VisionAgent',
  model: 'gpt-4o', // Vision-capable model
  vision: true     // Enable vision capabilities (default: false)
});
```
## Attachment System

Astreus supports an intuitive attachment system for working with images:

```typescript
// Clean, modern attachment API
const response = await agent.ask("What do you see in this image?", {
  attachments: [
    { type: 'image', path: '/path/to/image.jpg', name: 'My Photo' }
  ]
});
```
The attachment system automatically:
- Detects the file type and selects appropriate tools
- Enhances the prompt with attachment information
- Enables tool usage when attachments are present
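Based on the examples in this guide, an image attachment can be described with a shape like the following. This is a sketch for orientation only; the `type`, `path`, and `name` fields appear in the examples above, but consult the library's exported type definitions for the authoritative interface:

```typescript
// Hypothetical attachment shape inferred from the examples in this guide --
// check @astreus-ai/astreus's exported types for the authoritative definition.
interface ImageAttachment {
  type: 'image';  // routes the file to the vision tools
  path: string;   // local path to the image file
  name?: string;  // human-readable label used when enhancing the prompt
}
```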
## Vision Capabilities
The vision system provides three core capabilities through built-in tools:
### 1. General Image Analysis

Analyze images with custom prompts and configurable detail levels:

```typescript
// Using attachments (recommended approach)
const response = await agent.ask("Please analyze this screenshot and describe the UI elements", {
  attachments: [
    { type: 'image', path: '/path/to/screenshot.png', name: 'UI Screenshot' }
  ]
});

// Using the analyze_image tool through conversation
const response2 = await agent.ask("Please analyze the image at /path/to/screenshot.png and describe the UI elements");

// Direct method call
const analysis = await agent.analyzeImage('/path/to/image.jpg', {
  prompt: 'What UI elements are visible in this interface?',
  detail: 'high',
  maxTokens: 1500
});
```
### 2. Image Description

Generate structured descriptions for different use cases:

```typescript
// Accessibility-friendly description
const description = await agent.describeImage('/path/to/image.jpg', 'accessibility');

// Available styles:
// - 'detailed': Comprehensive description of all visual elements
// - 'concise': Brief description of main elements
// - 'accessibility': Screen reader-friendly descriptions
// - 'technical': Technical analysis including composition and lighting
```
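To compare the styles side by side, you can loop over them with the same `describeImage` call shown above (a minimal sketch; the style names come from the list in the comment):

```typescript
// Generate one description per style for the same image.
const styles = ['detailed', 'concise', 'accessibility', 'technical'] as const;

for (const style of styles) {
  const desc = await agent.describeImage('/path/to/image.jpg', style);
  console.log(`--- ${style} ---\n${desc}`);
}
```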
### 3. Text Extraction (OCR)

Extract and transcribe text from images:

```typescript
// Extract text with a language hint
const text = await agent.extractTextFromImage('/path/to/document.jpg', 'english');

// The system maintains original formatting and structure
console.log(text);
```
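Since the extracted text comes back as a plain string, it can be piped straight into ordinary file handling. A minimal sketch using Node's `fs/promises`:

```typescript
import { writeFile } from 'node:fs/promises';

// OCR a scanned page and persist the transcription next to the image.
const pageText = await agent.extractTextFromImage('/path/to/scan.png', 'english');
await writeFile('/path/to/scan.txt', pageText, 'utf8');
```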
## Supported Formats

The vision system supports these image formats:

- JPEG (`.jpg`, `.jpeg`)
- PNG (`.png`)
- GIF (`.gif`)
- BMP (`.bmp`)
- WebP (`.webp`)
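When processing arbitrary files, it can help to filter on these extensions before handing files to the vision tools. A hypothetical helper (the extension list mirrors the formats above):

```typescript
// Hypothetical helper: accept only the image formats listed above.
const SUPPORTED_EXTENSIONS = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp'];

function isSupportedImage(filePath: string): boolean {
  const dot = filePath.lastIndexOf('.');
  if (dot === -1) return false; // no extension at all
  return SUPPORTED_EXTENSIONS.includes(filePath.slice(dot).toLowerCase());
}
```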
## Input Sources

### File Paths

Analyze images from the local file system:

```typescript
const result = await agent.analyzeImage('/path/to/image.jpg');
```
### Base64 Data

Analyze images from base64-encoded data:

```typescript
const base64Image = 'data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQ...';
const result = await agent.analyzeImageFromBase64(base64Image);
```
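Base64 input is handy when the image never touches disk, but you can also build the data URL from a local file yourself. A sketch using Node's `fs/promises` (assumes a JPEG; adjust the MIME type to match your file):

```typescript
import { readFile } from 'node:fs/promises';

// Read a local file and wrap it in a data URL before analysis.
const buffer = await readFile('/path/to/image.jpg');
const dataUrl = `data:image/jpeg;base64,${buffer.toString('base64')}`;
const result = await agent.analyzeImageFromBase64(dataUrl);
```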
## Configuration

### Vision Model Configuration

Specify the vision model directly in the agent configuration:

```typescript
const agent = await Agent.create({
  name: 'VisionAgent',
  model: 'gpt-4o',
  visionModel: 'gpt-4o', // Specify vision model here
  vision: true
});
```
### Environment Variables

```bash
# API keys (auto-detected based on model)
OPENAI_API_KEY=your_openai_key         # For OpenAI models
ANTHROPIC_API_KEY=your_anthropic_key   # For Claude models
GOOGLE_API_KEY=your_google_key         # For Gemini models

# Ollama configuration (local)
OLLAMA_BASE_URL=http://localhost:11434 # Default if not set
```

The vision system automatically selects the appropriate provider based on the `visionModel` specified in the agent configuration.
### Analysis Options

Configure analysis behavior with these options:

```typescript
interface AnalysisOptions {
  prompt?: string;          // Custom analysis prompt
  maxTokens?: number;       // Response length limit (default: 1000)
  detail?: 'low' | 'high';  // Analysis detail level (OpenAI only)
}
```
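For example, passing every field explicitly (the values here are illustrative):

```typescript
const result = await agent.analyzeImage('/path/to/photo.jpg', {
  prompt: 'List the objects visible in this photo', // overrides the default prompt
  maxTokens: 500,                                   // cap the response length
  detail: 'low'                                     // faster and cheaper on OpenAI models
});
```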
## Usage Examples

### Screenshot Analysis

```typescript
const agent = await Agent.create({
  name: 'UIAnalyzer',
  model: 'gpt-4o',
  vision: true
});

// Analyze a UI screenshot
const analysis = await agent.analyzeImage('/path/to/app-screenshot.png', {
  prompt: 'Analyze this mobile app interface. Identify key UI components, layout structure, and potential usability issues.',
  detail: 'high'
});

console.log(analysis);
```
### Document Processing

```typescript
// Extract text from scanned documents
const documentText = await agent.extractTextFromImage('/path/to/scanned-invoice.jpg', 'english');

// Generate accessible descriptions
const accessibleDesc = await agent.describeImage('/path/to/chart.png', 'accessibility');
```
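The two calls compose naturally into a small document pipeline, for example OCR followed by a question about the extracted text (a sketch reusing only the methods shown above):

```typescript
// OCR the scan, then reason over the transcription in a second step.
const invoiceText = await agent.extractTextFromImage('/path/to/scanned-invoice.jpg', 'english');
const summary = await agent.ask(
  `Summarize the key fields (vendor, date, total) in this invoice text:\n\n${invoiceText}`
);
```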
### Multimodal Conversations

```typescript
// Using attachments for a cleaner API
const response = await agent.ask("I'm getting an error. Can you analyze this screenshot and help me fix it?", {
  attachments: [
    { type: 'image', path: '/Users/john/Desktop/error.png', name: 'Error Screenshot' }
  ]
});

// Multiple attachments
const response2 = await agent.ask("Compare these UI mockups and suggest improvements", {
  attachments: [
    { type: 'image', path: '/designs/mockup1.png', name: 'Design A' },
    { type: 'image', path: '/designs/mockup2.png', name: 'Design B' }
  ]
});

// Traditional approach (still works)
const response3 = await agent.ask(
  "Please analyze the error screenshot at /Users/john/Desktop/error.png and suggest how to fix the issue"
);
```
## Provider Comparison

| Feature          | OpenAI (gpt-4o) | Ollama (llava)   |
|------------------|-----------------|------------------|
| Analysis Quality | Excellent       | Good             |
| Processing Speed | Fast            | Variable         |
| Cost             | Pay-per-use     | Free (local)     |
| Privacy          | Cloud-based     | Local processing |
| Detail Levels    | Low/High        | Standard         |
| Language Support | Extensive       | Good             |
### OpenAI Provider

- Best for: Production applications requiring high accuracy
- Default Model: `gpt-4o`
- Features: Detail level control, excellent text recognition
### Ollama Provider (Local)

- Best for: Privacy-sensitive applications or development
- Default Model: `llava`
- Features: Local processing, no API costs, offline capability
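A local setup might look like this (a sketch; it assumes the `llava` model has been pulled into your Ollama instance and `OLLAMA_BASE_URL` points at it):

```typescript
// Fully local vision agent backed by Ollama -- no API key required.
const localAgent = await Agent.create({
  name: 'LocalVisionAgent',
  model: 'llava',
  visionModel: 'llava', // routes vision calls to the Ollama provider
  vision: true
});
```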
## Batch Processing

Process multiple images efficiently:

```typescript
const images = [
  '/path/to/image1.jpg',
  '/path/to/image2.png',
  '/path/to/image3.gif'
];

// Process all images in parallel
const results = await Promise.all(
  images.map(imagePath =>
    agent.describeImage(imagePath, 'concise')
  )
);

console.log('Analysis results:', results);

// Or use task attachments for batch processing
const batchTask = await agent.createTask({
  prompt: 'Analyze all these images and provide a comparative report',
  attachments: images.map(path => ({
    type: 'image',
    path,
    name: path.split('/').pop() ?? path // fall back to the full path if no basename
  }))
});

const batchResult = await agent.executeTask(batchTask.id);
```
## Built-in Vision Tools

When vision is enabled, these tools are automatically available:

### `analyze_image`

Parameters:

- `image_path` (string, required): Path to image file
- `prompt` (string, optional): Custom analysis prompt
- `detail` (string, optional): `'low'` or `'high'` detail level

### `describe_image`

Parameters:

- `image_path` (string, required): Path to image file
- `style` (string, optional): Description style (`'detailed'`, `'concise'`, `'accessibility'`, `'technical'`)

### `extract_text_from_image`

Parameters:

- `image_path` (string, required): Path to image file
- `language` (string, optional): Language hint for better OCR accuracy
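Because these tools are registered automatically, the agent can select them itself mid-conversation; you can also name a tool explicitly in your prompt, as in the conversational examples earlier in this guide:

```typescript
// Nudge the agent toward a specific built-in tool by naming it.
const receiptText = await agent.ask(
  "Use extract_text_from_image on /path/to/receipt.jpg with the language hint 'english'"
);
```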