@ViliTajnic

Summary

Adds OCI Object Storage refresh capability for vector stores, with ETag-based duplicate detection and per-file metadata tracking.

Feature

OCI Vector Store Refresh synchronizes vector stores with documents in OCI Object Storage buckets. It automatically detects new and modified files, processes only the changes, and avoids re-embedding unchanged documents.

How It Works

  1. Compares OCI bucket contents with existing vector store metadata
  2. Identifies new and modified files using ETags and timestamps
  3. Downloads and embeds only changed files
  4. Updates vector store incrementally while preserving existing content
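
The comparison in steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `detect_changed_objects()`; the `ObjectMeta` record, the `detect_changes` name, and the rule of falling back to the timestamp only when an ETag is missing are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectMeta:
    """Hypothetical per-object record holding the fields used for comparison."""
    name: str
    etag: Optional[str] = None
    time_modified: Optional[str] = None  # kept as a string for simplicity


def detect_changes(bucket_objects, processed):
    """Split bucket objects into (new, modified) lists.

    `processed` maps object name -> ObjectMeta recorded when the file was embedded.
    Objects that match their recorded metadata appear in neither list.
    """
    new, modified = [], []
    for obj in bucket_objects:
        known = processed.get(obj.name)
        if known is None:
            new.append(obj)  # never embedded before
        elif obj.etag and known.etag:
            if obj.etag != known.etag:
                modified.append(obj)  # content changed since last embed
        elif obj.time_modified != known.time_modified:
            # Assumed fallback: compare timestamps when an ETag is missing
            modified.append(obj)
    return new, modified
```

Only the objects returned in `new` and `modified` would then be downloaded and embedded (steps 3 and 4).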

Key Benefits

  • Efficiency: Only processes changed files, not the entire bucket
  • Duplicate Prevention: Skips already-embedded files using ETag comparison
  • Metadata Tracking: Stores file size, modified date, and ETag with chunks
  • File Listing: View all embedded files with statistics

Usage

  1. Select vector store and OCI bucket
  2. Click "Refresh from OCI"
  3. View results showing new/updated files and chunks processed

ViliTajnic and others added 30 commits September 18, 2025 16:05

Add comprehensive functionality to automatically refresh vector stores when
documents are added or modified in OCI Object Storage buckets while preserving
original embedding parameters.

Key features:
- Change detection using object metadata (etag, time_modified)
- Parameter preservation from existing vector stores
- Incremental processing of only new/modified files
- New REST API endpoint for refresh operations
- Comprehensive status reporting

Files modified:
- common/schema.py: Add VectorStoreRefreshRequest and VectorStoreRefreshStatus schemas
- server/api/utils/oci.py: Add get_bucket_objects_with_metadata() and detect_changed_objects()
- server/api/utils/embed.py: Add refresh functionality with get_vector_store_by_alias(),
  get_processed_objects_metadata(), and refresh_vector_store_from_bucket()
- server/api/v1/embed.py: Add POST /v1/embed/refresh endpoint
to use last minimal SpringBoot version and the sys prompt defined for vector search.
Signed-off-by: Christopher Jones <christopher.jones@oracle.com>
* Added Unit Tests
* Updated Documents
* Updated Images
* Shift pyproject.toml
* Expose FastMCP endpoints
Add a reference to the new feature that auto-refreshes vector stores from OCI Object Storage buckets to the AI Optimizer Features section.
- Implemented get_processed_objects_metadata() to retrieve metadata from vector stores
- Added ETag-based change detection for OCI Object Storage files
- Support for both new metadata format (filename/etag) and legacy format (source)
- Added get_total_chunks_count() helper function
- Updated refresh_vector_store endpoint to skip already-processed files
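
As an illustration of the two metadata formats mentioned above, the sketch below builds a name-to-ETag map from chunk metadata. It is not the PR's `get_processed_objects_metadata()`; the `processed_objects` name, the dict-shaped metadata, and the assumption that a legacy `source` value is a path whose basename identifies the file are all hypothetical.

```python
import os


def processed_objects(chunk_metadata):
    """Map each already-embedded file name to its recorded ETag (or None).

    New-format chunks carry `filename` and `etag`; legacy chunks carry only
    `source`, so no ETag is available for them.
    """
    processed = {}
    for meta in chunk_metadata:
        if "filename" in meta:
            name, etag = meta["filename"], meta.get("etag")
        elif "source" in meta:
            # Assumption: legacy `source` is a path; use its basename as the key
            name, etag = os.path.basename(meta["source"]), None
        else:
            continue  # chunk has no usable file identity
        # Keep the first non-empty ETag seen for a file
        if processed.get(name) is None:
            processed[name] = etag
    return processed
```

Files that resolve to `None` here have no ETag to compare against, so a refresh would need some other rule (for example, a timestamp check) to decide whether to skip them.
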
- Fix vector store refresh endpoint: replace core_oci.get_oci() with utils_oci.get()
  - core_oci module was removed in main branch (PR #312)
  - Updated refresh_vector_store in src/server/api/v1/embed.py

- Fix ValueError in client when vector store no longer exists
  - Added validation check in src/client/utils/st_common.py
  - Handles case where previously selected vector store is filtered out
  - Resets to empty selection instead of crashing
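
A minimal sketch of the validation described above, assuming the crash came from looking up a previously selected name in a list that no longer contains it. The `resolve_selection` helper is hypothetical and is not the code in `st_common.py`.

```python
def resolve_selection(options, previous):
    """Return `previous` if it is still a valid option, else an empty selection."""
    if previous in options:
        return previous
    # The previously selected vector store was deleted or filtered out
    return ""
```
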
This commit merges the latest changes from main branch and adds
extensive documentation for IDE integration to address issue #299.

Changes from main merge:
- Updated embed.py utility functions
- Added webscrape.py for web scraping functionality
- Updated embed API endpoints
- Resolved .gitignore conflict

New IDE Integration Documentation:
- Created comprehensive guide: docs/content/advanced/ide_integration.md
- Covers OpenAI-compatible REST API integration
- Includes MCP (Model Context Protocol) integration details
- Provides setup guides for:
  * Continue.dev
  * Cline
  * Cursor
  * Aider
  * Custom integrations
- Includes API reference, examples, and troubleshooting
- Documents RAG-powered development workflow
- Covers SelectAI integration for IDEs
- Provides cURL, Python, and Node.js examples

Addresses: #299 (IDE Integration)
The IDE integration documentation has been removed from this branch.

Addresses: #299 (IDE Integration)
- Add validation to prevent empty vector store names
- Implement file listing endpoint to view embedded files
- Display file metadata (name, chunks, size, modified date)
- Enhance metadata capture for all file sources (local, SQL, web, OCI)
- Add orphaned chunk detection and reporting
- Auto-hide empty columns in file list display
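
The file listing, orphaned chunk count, and column hiding could be sketched like this. The function names, the metadata keys, and the definition of an orphaned chunk as one with no `filename` are assumptions for illustration; the PR's endpoint may define them differently.

```python
def summarize_files(chunk_metadata):
    """Group chunk metadata by file; return (file rows, orphaned chunk count)."""
    files = {}
    orphaned = 0
    for meta in chunk_metadata:
        name = meta.get("filename")
        if not name:
            orphaned += 1  # assumed definition: chunk with no file identity
            continue
        row = files.setdefault(
            name,
            {"filename": name, "chunks": 0, "size": None, "time_modified": None},
        )
        row["chunks"] += 1
        # Take file-level fields from the first chunk that provides them
        for key in ("size", "time_modified"):
            if row[key] is None and meta.get(key) is not None:
                row[key] = meta[key]
    return sorted(files.values(), key=lambda r: r["filename"]), orphaned


def visible_columns(rows, columns):
    """Keep only the columns that have a value in at least one row."""
    return [c for c in columns if any(r.get(c) is not None for r in rows)]
```
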
Fixed issue where files refreshed from OCI Object Storage buckets
were not showing size and modified date metadata in the file listing.

Root cause: The OCI list_objects API call was not requesting metadata
fields. By default, it only returns object names without size, etag,
or modification time.

Changes:
- Added 'fields' parameter to list_objects API call in oci.py to
  explicitly request name, size, etag, timeModified, and md5 fields
- Enhanced refresh_vector_store_from_bucket() in embed.py to build
  file_metadata dict from bucket objects and pass to document loader
- Updated process_metadata() to add etag field to chunk metadata
- Fixed Decimal to int conversion for size field in get_vector_store_files()
- Added summary logging for OCI metadata retrieval
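
A rough sketch of the first change, requesting metadata fields when listing objects. It assumes `client` is an `oci.object_storage.ObjectStorageClient` (or anything with the same `list_objects` interface); the `list_objects_with_metadata` name and the returned dict layout are hypothetical and do not reproduce the PR's `get_bucket_objects_with_metadata()`.

```python
# Fields to request; per the fix above, the listing otherwise returns names only
FIELDS = "name,size,etag,timeModified,md5"


def list_objects_with_metadata(client, namespace, bucket):
    """List every object in a bucket together with its metadata fields."""
    objects = []
    start = None
    while True:
        kwargs = {"fields": FIELDS}
        if start:
            kwargs["start"] = start  # continue from the previous page
        response = client.list_objects(namespace, bucket, **kwargs)
        for obj in response.data.objects:
            objects.append(
                {
                    "name": obj.name,
                    "size": obj.size,
                    "etag": obj.etag,
                    "time_modified": obj.time_modified,
                    "md5": obj.md5,
                }
            )
        start = response.data.next_start_with
        if not start:
            break  # no further pages
    return objects
```

Passing the client in as an argument keeps the function easy to exercise with a stub in unit tests, without real OCI credentials.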

Testing: Verified that files refreshed from OCI buckets now display
correct size and modification date in the file listing UI.
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Nov 4, 2025
ViliTajnic and others added 8 commits November 4, 2025 14:30
Changed unused col1 variable to underscore to indicate it's
intentionally unused (only col2 is used for the refresh button).
Tests for new functionality:
- OCI bucket object metadata retrieval with fields parameter
- Change detection for new and modified files
- File listing from vector stores with metadata
- Oracle Decimal to int conversion
- Orphaned chunk detection
- Old metadata format fallback

Test coverage:
- 10 tests for OCI refresh functions (get_bucket_objects_with_metadata, detect_changed_objects)
- 6 tests for file listing (get_vector_store_files)
- 3 integration tests for new API endpoints

All 21 unit tests passing with comprehensive edge case coverage.
Improvements to the Split/Embed tool UI:

- Add toggle control to switch between "Create New Vector Store" (default)
  and "Use Existing Vector Store" modes
- When creating new VS: show simple text input for vector store name,
  display all configuration options (chunk size, overlap, distance metric,
  index type)
- When using existing VS: hide configuration options (already defined by VS),
  filter dropdown to show only vector stores created with the same embedding
  model to prevent mixing embeddings
- Show full vector store table name in both modes
- Improved validation messages and help text
- Prevents potential issues with mixing embeddings from different models
  in the same vector store

This simplifies the UI and makes the distinction between creating new vs
using existing vector stores much clearer.
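
The dropdown filter for "Use Existing Vector Store" mode amounts to something like the sketch below. The dict shape and the `model` and `alias` keys are assumptions for illustration, not the client's actual data model.

```python
def compatible_vector_stores(vector_stores, embedding_model):
    """Return only the vector stores built with the given embedding model."""
    return [vs for vs in vector_stores if vs.get("model") == embedding_model]
```

Offering only these entries in the dropdown is what prevents embeddings from different models being mixed in one store.
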
Addresses pylint warnings (R0912: too-many-branches, R0915: too-many-statements)
by extracting helper functions to improve code organization and maintainability.

Changes:
- Extract _render_create_new_vs_input() for create new mode UI
- Extract _render_use_existing_vs_input() for use existing mode UI
- Extract _validate_vector_store_alias() for validation logic
- Extract _display_vector_store_info() for VS display and file list
- Refactor main _render_vector_store_section() to use helpers

Result: Main function reduced from 129 lines to 45 lines, with only
2 branches instead of 14, making it more maintainable and passing
pylint complexity checks.


4 participants