Contributor

Copilot AI commented Oct 23, 2025

Add comprehensive .github/copilot-instructions.md for repository onboarding

  • Explore repository structure and documentation
  • Test CMake build configuration and verify it works
  • Run tests to understand test infrastructure (Catch2)
  • Check formatting tools (clang-format, pre-commit)
  • Review CI workflows and build pipelines
  • Understand Python bindings structure
  • Create .github/copilot-instructions.md with all required sections
  • Verify instructions are accurate by following them
  • Test build commands work correctly
  • Validate file meets 2-page limit (~1,589 words)
  • Fix: Revert accidental clang-format changes to 86 files
  • Address review feedback:
    • Added type-erasure mention for orchestrators
    • Updated extensions description to explain svs_invoke system
    • Removed specific line number references
    • Removed SVS_EXPERIMENTAL_LEANVEC from all build commands
    • Updated cmake commands in Quick Reference to use full CI config
    • Added guidance about limiting formatting to modified files only
    • Added requirement for tests with all features and bugfixes
  • Deep dive into codebase architecture and enhance descriptions based on understanding
  • Add AI-generated content attribution policy

Summary

Created comprehensive copilot-instructions.md (313 lines, ~2 pages) with verified build/test instructions.

Latest updates: Enhanced all descriptions based on deep architectural understanding of the codebase:

  • Explained layered architecture: low-level templated index implementations → type-erased orchestrators → customization point extensions
  • Clarified ISA dispatching as runtime CPU feature detection (SSE/AVX2/AVX512) with dispatch to optimized code paths
  • Detailed type-erasure pattern with concrete examples (e.g., svs::Vamana wrapping templated VamanaIndex)
  • Explained svs_invoke as customization point objects enabling type-specific behavior specialization without modifying core algorithms
  • Enhanced directory annotations with architectural context (e.g., "templated implementation" vs "type-erased APIs")
  • Added performance focus explanation about compile-time dispatch and template metaprogramming
  • Detailed CI matrix builds, Python version support (3.9-3.12), and time estimates
  • Added AI-generated content attribution policy: agents must add an attribution line to file headers when creating content on behalf of Intel employees

The instructions now reflect genuine understanding of design patterns and architectural decisions, not just surface-level feature listings.
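
To make the ISA-dispatching bullet above concrete, here is a minimal sketch of runtime CPU feature detection followed by dispatch through a function pointer. It is not taken from the SVS sources: the `sketch` namespace, `Isa`, `detect_isa`, `select_l2`, and `l2_squared` are hypothetical names, the detection relies on the GCC/Clang `__builtin_cpu_supports` builtin (x86 only, with a generic fallback elsewhere), and every kernel is the same scalar loop standing in for what would be SSE/AVX2/AVX512 code in a real library.

```cpp
#include <cstddef>

namespace sketch {

// Instruction-set levels named in the discussion above (hypothetical enum).
enum class Isa { Generic, Sse42, Avx2, Avx512 };

// Query the CPU at runtime. On non-x86 targets or other compilers,
// fall back to the generic code path.
inline Isa detect_isa() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) { return Isa::Avx512; }
    if (__builtin_cpu_supports("avx2")) { return Isa::Avx2; }
    if (__builtin_cpu_supports("sse4.2")) { return Isa::Sse42; }
#endif
    return Isa::Generic;
}

// One kernel per ISA level. In this sketch all levels share a scalar body;
// a real library would compile each level with its own intrinsics or flags.
template <Isa Level>
float l2_squared_kernel(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

using L2Fn = float (*)(const float*, const float*, std::size_t);

// Map a detected ISA level to the matching kernel.
inline L2Fn select_l2(Isa isa) {
    switch (isa) {
        case Isa::Avx512: return &l2_squared_kernel<Isa::Avx512>;
        case Isa::Avx2: return &l2_squared_kernel<Isa::Avx2>;
        case Isa::Sse42: return &l2_squared_kernel<Isa::Sse42>;
        case Isa::Generic: break;
    }
    return &l2_squared_kernel<Isa::Generic>;
}

// Public entry point: detection runs once, later calls reuse the pointer.
inline float l2_squared(const float* a, const float* b, std::size_t n) {
    static const L2Fn fn = select_l2(detect_isa());
    return fn(a, b, n);
}

} // namespace sketch
```

Callers only ever use `sketch::l2_squared(a, b, n)`; the same binary picks the highest supported level on whatever machine it runs on, which is the property the summary attributes to the project's dispatching.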

Original prompt

Your task is to "onboard" this repository to Copilot coding agent by adding a .github/copilot-instructions.md file in the repository that contains information describing how a coding agent seeing it for the first time can work most efficiently.

You will do this task only one time per repository and doing a good job can SIGNIFICANTLY improve the quality of the agent's work, so take your time, think carefully, and search thoroughly before writing the instructions.

  • Reduce the likelihood of a coding agent pull request getting rejected by the user due to generating code that fails the continuous integration build, fails a validation pipeline, or misbehaves.
  • Minimize bash command and build failures.
  • Allow the agent to complete its task more quickly by minimizing the need for exploration using grep, find, str_replace_editor, and code search tools.
  • Instructions must be no longer than 2 pages.
  • Instructions must not be task specific.

Add the following high level details about the codebase to reduce the amount of searching the agent has to do to understand the codebase each time:

  • A summary of what the repository does.
  • High level repository information, such as the size of the repo, the type of the project, the languages, frameworks, or target runtimes in use.

Additional details that you may incorporate:

  • The repository provides the ScalableVectorSearch (SVS) public library.
  • It implements a state-of-the-art vamana-graph-based approximate nearest neighbor search implementation.
  • It's optimized for Intel x86, but it runs on other hardware too.
  • The repository uses C++20 and modern concepts to enable the best compiler optimizations.
  • The project can be used as a header-only dependency.
  • The project also provides Python bindings, which have to make some assumptions about data shapes (esp. data dimensionality), to specialize templated compute operations.
  • Runtime architecture-aware ISA dispatching allows the project to be built and deployed on different CPU architectures. The ISA dispatching will use the highest available instruction set (SSE vs AVX2 vs AVX512) on the system it's being executed.

Add information about how to build and validate changes so the agent does not need to search and find it each time.

  • For each of bootstrap, build, test, run, lint, and any other scripted step, document the sequence of steps to take to run it successfully as well as the versions of any runtime or build tools used.
  • Each command should be validated by running it to ensure that it works correctly as well as any preconditions and postconditions.
  • Try cleaning the repo and environment and running commands in different orders, and document errors and misbehavior observed as well as any steps used to mitigate the problem.
  • Run the tests and document the order of steps required to run the tests.
  • Make a change to the codebase. Document any unexpected build issues as well as the workarounds.
  • Document environment setup steps that seem optional but that you have validated are actually required.
  • Document the time required for commands that failed due to timing out.
  • When you find a sequence of commands that work for a particular purpose, document them in detail.
  • Use language to indicate when something should always be done. For example: "always run npm install before building".
  • Record any validation steps from documentation.

Additional details you may incorporate:

  • For linting, clang-format is used. Currently "clang-format>=15,<16", as defined in .pre-commit-config.yaml.
  • Build uses cmake, with options defined in cmake/options.cmake. It is recommended to stay close to the CI configuration, defined in build-linux.yml. Currently it is `cmake -DCMAKE_BUILD_TYPE=RelWithDebugInfo -DSVS_BUILD_BINARIES=YES -DSVS_BUILD_TESTS=YES -DSVS_BUILD_EXAMPLES=YES -DSVS_EXPERIMENTAL_LEANVEC=YES -DSVS_NO_AVX512=NO -DSVS_EXPERIMENTAL_ENABLE_IVF=ON`.
  • Tests can be executed from within the build directory. The executable is `./tests/tests`. Filters on CATCH_TEST_CASE names can be applied, e.g. `./tests/tests "[integration][build]"` (quoted so the shell does not treat the brackets as a glob pattern).

List key facts about the layout and architecture of the codebase to help the agent find where to make changes with minimal searching.

  • A description of the major architectural elements of the project, including the relative paths to the main project files, the location of configuration files for linting, compilation, testing, and preferences.
  • A description of the checks run prior to check in, including any GitHub workflows, continuous integration builds, or other validation pipelines.
  • Document the steps so that the agent can replicate these itself.
  • Any explicit validation steps that the agent can consider to have further confidence in its changes.
  • Dependencies that aren't obvious from the layout or file structure.
  • ...

Copilot AI and others added 2 commits October 23, 2025 09:24
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add Copilot instructions for efficient onboarding" to "Add comprehensive .github/copilot-instructions.md for repository onboarding" Oct 23, 2025
Copilot finished work on behalf of ahuber21 October 23, 2025 09:38
Copilot AI requested a review from ahuber21 October 23, 2025 09:38
@ahuber21
Contributor

The many code changes you added suggest a problem with clang-format. @copilot fix

Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Contributor Author

Copilot AI commented Oct 23, 2025

The many code changes you added suggest a problem with clang-format. @copilot fix

Fixed in commit 540068f. Reverted the accidental formatting changes to 86 files. The PR now contains only the new .github/copilot-instructions.md file as intended.

Copilot finished work on behalf of ahuber21 October 23, 2025 10:31
Contributor

@ahuber21 ahuber21 left a comment

@copilot address these comments to improve the instructions.

│ │ ├── vamana/ # Vamana graph index
│ │ ├── flat/ # Flat (brute-force) index
│ │ └── inverted/ # Inverted index (IVF)
│ ├── orchestrators/ # High-level APIs
Contributor

@copilot It may be useful to mention that type-erasure is used to provide simple and consistent interfaces.

Contributor Author

Added in commit bd187f6. Updated orchestrators description to mention that they use type-erasure to provide simple and consistent interfaces.
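
As an illustration of the type-erasure idea discussed in this thread, the sketch below hides a templated implementation behind a single non-templated handle. It is an assumption-laden example, not SVS code: `FlatIndexImpl` and `AnyIndex` are hypothetical names, and the brute-force search is only a stand-in for a real index implementation.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical templated implementation: the dimensionality is a
// compile-time parameter so the inner loop can be specialized.
template <std::size_t Dim> class FlatIndexImpl {
  public:
    explicit FlatIndexImpl(std::vector<float> data) : data_{std::move(data)} {}

    // Return the id of the stored vector closest (squared L2) to `query`.
    std::size_t nearest(const float* query) const {
        std::size_t best = 0;
        float best_dist = -1.0f;
        for (std::size_t i = 0; i < data_.size() / Dim; ++i) {
            float dist = 0.0f;
            for (std::size_t d = 0; d < Dim; ++d) {
                const float diff = data_[i * Dim + d] - query[d];
                dist += diff * diff;
            }
            if (best_dist < 0.0f || dist < best_dist) {
                best = i;
                best_dist = dist;
            }
        }
        return best;
    }

  private:
    std::vector<float> data_;
};

// Hypothetical type-erased handle: one concrete type with one interface,
// regardless of which template instantiation it wraps.
class AnyIndex {
  public:
    template <typename Impl>
    explicit AnyIndex(Impl impl)
        : self_{std::make_unique<Model<Impl>>(std::move(impl))} {}

    std::size_t nearest(const float* query) const { return self_->nearest(query); }

  private:
    struct Concept {
        virtual ~Concept() = default;
        virtual std::size_t nearest(const float* query) const = 0;
    };
    template <typename Impl> struct Model final : Concept {
        explicit Model(Impl impl) : impl_{std::move(impl)} {}
        std::size_t nearest(const float* query) const override {
            return impl_.nearest(query);
        }
        Impl impl_;
    };
    std::unique_ptr<Concept> self_;
};
```

An `AnyIndex` built from `FlatIndexImpl<2>` and one built from `FlatIndexImpl<3>` have the same type, so callers (or language bindings) can hold them uniformly while the hot loop stays specialized inside the wrapped implementation.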

│ │ └── inverted/ # Inverted index (IVF)
│ ├── orchestrators/ # High-level APIs
│ ├── quantization/ # Vector quantization
│ └── extensions/ # ISA-specific optimizations
Contributor

@copilot extensions are more general than ISA-specific optimizations. They provide svs_invoke overloads/specializations to hook into core SVS routines. I suggest you investigate this behavior more carefully and include it in the instructions.

The concept is similar to std::invoke.

Contributor Author

Updated in commit bd187f6. Changed the extensions description to explain that they provide svs_invoke overloads/specializations to hook into core SVS routines (similar to std::invoke). Also added this as a key principle in the Important Notes section.
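
The hook mechanism described in this thread can be pictured with a small customization-point sketch. This is a generic illustration of the pattern and may differ from how `svs_invoke` is actually implemented: `my_invoke`, `ComputeNorm2Tag`, `compute_norm2`, and `ScaledVector` are all hypothetical names. The core namespace defines a callable tag object and a default behaviour; an extension adds an overload for its own type without touching the core code.

```cpp
#include <cstddef>
#include <vector>

namespace core {

// Customization point object: calling it forwards to whichever
// `my_invoke` overload argument-dependent lookup finds for the argument.
struct ComputeNorm2Tag {
    template <typename T>
    auto operator()(const T& data) const -> decltype(my_invoke(*this, data)) {
        return my_invoke(*this, data);
    }
};
inline constexpr ComputeNorm2Tag compute_norm2{};

// Default behaviour for any range of floats.
template <typename T> float my_invoke(ComputeNorm2Tag, const T& data) {
    float sum = 0.0f;
    for (float x : data) {
        sum += x * x;
    }
    return sum;
}

} // namespace core

namespace ext {

// Hypothetical extension type: 8-bit codes with a shared scale factor.
struct ScaledVector {
    std::vector<signed char> codes;
    float scale;
};

// The extension hooks into the core routine by overloading `my_invoke`
// for its own type; the core namespace is left unmodified.
inline float my_invoke(core::ComputeNorm2Tag, const ScaledVector& v) {
    float sum = 0.0f;
    for (signed char c : v.codes) {
        const float x = static_cast<float>(c) * v.scale;
        sum += x * x;
    }
    return sum;
}

} // namespace ext
```

Generic code calls `core::compute_norm2(x)` for every type; plain float ranges take the default path, while `ext::ScaledVector` is routed to the extension's overload because the non-template overload is the better match.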

2. **pre-commit.yml**: Verifies code formatting with clang-format 15
3. **cibuildwheel.yml**: Builds Python wheels (uses custom manylinux2014 container)

**To replicate CI locally**: Use the exact cmake command from `build-linux.yml` (lines 70-77).
Contributor

It's not safe to reference exact lines as files can change.

Contributor Author

Fixed in commit bd187f6. Removed all specific line number references from the instructions.


### Build Issues

1. **Problem**: CMake configuration warns about unused `SVS_EXPERIMENTAL_LEANVEC` variable
Contributor

SVS_EXPERIMENTAL_LEANVEC is part of the closed-source part of the repository. We can completely eliminate it from this repository to avoid confusion.

Contributor Author

Removed in commit bd187f6. Eliminated all references to SVS_EXPERIMENTAL_LEANVEC from build commands and explanatory notes.

cd build/tests && ./tests --list-tags

# Clean and rebuild
rm -rf build && mkdir build && cd build && cmake .. && make -j$(nproc)
Contributor

This drops the cmake configuration and uses default values. You should update it to use the exact cmake configuration.

Contributor Author

Fixed in commit bd187f6. Updated the Quick Reference section to use the exact cmake configuration with all required flags.

1. **Trust these instructions first** - Only search the repository if information here is incomplete or incorrect
2. **Always build out-of-source** - Use a `build/` directory, never configure CMake in the repository root
3. **Follow the CI configuration** - Use the same cmake flags as `.github/workflows/build-linux.yml` for consistency
4. **Format before committing** - Run `./tools/clang-format.sh clang-format` to avoid CI failures
Contributor

Add the following (you are allowed to reword): Limit formatting to modified files, don't include formatting changes from other files.

Contributor Author

Added in commit bd187f6. Updated the formatting guidance to emphasize: "Only format files you modify; do not include formatting changes from other files in your PR".

7. **ISA dispatching** - Runtime dispatch means the same binary runs on different CPU architectures
8. **Test filters are your friend** - Use Catch2 tags to run subsets of tests during development
9. **Python bindings are specialized** - Changes to template parameters may require Python binding updates
10. **Version is synchronized** - Keep version in sync across `CMakeLists.txt` (line 26), `setup.py` (line 43), and test files
Contributor

Add that new features and bugfixes must be accompanied by tests.
Especially a bug must first be reproduced in a unit test, and then addressed and fixed in a code change.

Contributor Author

Added in commit bd187f6. Added a new principle: "Tests are required - New features and bugfixes must be accompanied by tests. For bugs, first reproduce the issue in a unit test, then fix it in the code".

Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Contributor Author

Copilot AI commented Oct 23, 2025

@copilot address these comments to improve the instructions.

All review comments addressed in commit bd187f6. Changes include:

  • Added type-erasure explanation for orchestrators
  • Updated extensions to describe svs_invoke system
  • Removed specific line references
  • Eliminated SVS_EXPERIMENTAL_LEANVEC
  • Fixed cmake commands in Quick Reference
  • Added formatting constraint guidance
  • Added testing requirements for all changes

Copilot AI requested a review from ahuber21 October 23, 2025 10:44
Copilot finished work on behalf of ahuber21 October 23, 2025 10:44
@ahuber21
Contributor

@copilot it feels like you took my suggestions verbatim. Did you reference them against your understanding of the project? Can you improve the description further? Remember, doing a good job can SIGNIFICANTLY improve the quality of the agent's work.

Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Contributor Author

Copilot AI commented Oct 23, 2025

@copilot it feels like you took my suggestions verbatim. Did you reference them against your understanding of the project? Can you improve the description further? Remember, doing a good job can SIGNI...

Enhanced in commit d4f1061. I studied the codebase architecture more deeply and rewrote descriptions based on actual understanding rather than surface-level observations:

Key improvements:

  • Explained the layered architecture: low-level templated index implementations → type-erased orchestrators → customization point extensions
  • Clarified ISA dispatching as runtime CPU feature detection (SSE/AVX2/AVX512) with dispatch to optimized paths
  • Described type-erasure pattern with concrete examples (e.g., svs::Vamana wrapping templated VamanaIndex)
  • Explained svs_invoke as customization point objects enabling type-specific behavior specialization
  • Enhanced directory annotations with architectural context (e.g., "templated implementation" vs "type-erased APIs")
  • Added performance focus explanation about compile-time dispatch and template metaprogramming
  • Detailed CI matrix builds and Python version support

The instructions now reflect understanding of design patterns and architectural decisions, not just listing features.

Copilot finished work on behalf of ahuber21 October 23, 2025 11:02
@ahuber21 ahuber21 marked this pull request as ready for review October 23, 2025 11:09
@ahuber21 ahuber21 requested review from aguerreb and ibhati and removed request for homksei and yuejiaointel October 23, 2025 11:09
@ahuber21
Contributor

@copilot add another policy: if content was added on behalf of Intel employees, the following line must be added in the file header

These contents may have been developed with support from one or more Intel-operated generative artificial intelligence solutions.

Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Contributor Author

Copilot AI commented Oct 23, 2025

@copilot add another policy: if content was added on behalf of Intel employees, the following line must be added in the file header

...

Added in commit a8853bb. The policy is now included as item #7 in the Important Notes for Coding Agents section, instructing agents to add the attribution line to file headers when content is created on behalf of Intel employees.

Copilot finished work on behalf of ahuber21 October 23, 2025 11:32
@napetrov
Contributor

It might be worth considering, instead of a legacy monolithic instruction file, scope-limited individual instructions in an instructions folder. This can reduce context use for a specific topic, and it can be done per folder or per type: for instance, the Python API, the build system, and C++ could each get a separate instruction file, and the rest of CI and validation could be separate as well.

Member

@ethanglaser ethanglaser left a comment

A couple minor comments, looks great though


### Prerequisites
- CMake 3.21 or higher
- C++20 compiler: GCC 11+, GCC 12+, or Clang 15+
Member

Suggested change (the first line is the current text, the second is the proposed replacement):
- C++20 compiler: GCC 11+, GCC 12+, or Clang 15+
- C++20 compiler: GCC 11+ or Clang 15+

│ │ ├── build-linux.yml # Main build & test (Ubuntu 22.04, g++/clang)
│ │ ├── pre-commit.yml # Format checking
│ │ ├── cibuildwheel.yml # Python wheel building
│ │ └── build-*.y{a}ml # macOS, ARM builds
Member

I didn't notice there were inconsistencies here. Maybe it makes sense to change to .yml and revise the file name (build-macos.yml) in this PR.
