Enhancement: SharePoint Connector
🎯 Overview
Add support for ingesting documents and content from Microsoft SharePoint Online and SharePoint On-Premises environments, enabling organizations to include their SharePoint-based knowledge repositories in their AI-powered development workflows.
📋 Problem Statement
Many organizations store critical documentation, policies, procedures, and knowledge assets in Microsoft SharePoint. Currently, qdrant-loader cannot access this content, creating gaps in organizational knowledge bases and limiting the effectiveness of AI-powered development tools that rely on comprehensive documentation.
Common SharePoint content includes:
- Technical documentation and wikis
- Project documentation and specifications
- Policies and procedures
- Training materials and guides
- Meeting notes and decisions
- File attachments (PDFs, Office documents, etc.)
🚀 Proposed Solution
Implement a SharePoint connector that integrates with both SharePoint Online (Office 365) and SharePoint On-Premises environments using the Office365-REST-Python-Client library.
Key Features
- Multi-Environment Support: SharePoint Online and SharePoint On-Premises (2013 and later)
- Multiple Authentication Methods: App-Only, username/password, certificate-based, interactive
- Document Libraries: Process files from SharePoint document libraries
- SharePoint Lists: Extract content from custom lists and built-in lists
- Rich Metadata: Capture SharePoint-specific metadata (author, created/modified dates, custom columns)
- File Conversion Integration: Leverage existing file conversion for Office documents, PDFs, etc.
- Incremental Updates: Change detection for efficient synchronization
- Attachment Processing: Handle file attachments with parent-child relationships
🏗️ Technical Implementation
Configuration Structure
sources:
  sharepoint:
    company-intranet:
      base_url: "https://company.sharepoint.com/sites/intranet"
      source: "company-intranet"
      source_type: "sharepoint"
      # Authentication
      authentication_method: "client_credentials" # or "username_password", "certificate", "interactive"
      client_id: "${SHAREPOINT_CLIENT_ID}"
      client_secret: "${SHAREPOINT_CLIENT_SECRET}"
      tenant_id: "${SHAREPOINT_TENANT_ID}"
      # Content Selection
      document_libraries:
        - "Documents"
        - "Shared Documents"
        - "Policies"
      sharepoint_lists:
        - "Announcements"
        - "Project Updates"
        - "Knowledge Base"
      # File Processing
      enable_file_conversion: true
      download_attachments: true
      file_extensions: [".docx", ".pdf", ".xlsx", ".pptx", ".txt", ".md"]
      max_file_size: 52428800 # 50MB
      # Filtering
      exclude_paths:
        - "Forms/"
        - "_catalogs/"
      include_content_types:
        - "Document"
        - "Page"
        - "List Item"
      # Metadata
      custom_columns:
        - "Department"
        - "Category"
        - "Tags"

Authentication Methods
1. App-Only (Client Credentials) - Recommended for Production
# Azure AD app registration required
from office365.runtime.auth.client_credential import ClientCredential
from office365.sharepoint.client_context import ClientContext
client_credentials = ClientCredential(client_id, client_secret)
ctx = ClientContext(site_url).with_credentials(client_credentials)

2. Username/Password - Development & Testing
from office365.runtime.auth.user_credential import UserCredential
user_credentials = UserCredential(username, password)
ctx = ClientContext(site_url).with_credentials(user_credentials)

3. Certificate-Based - Enterprise Security
# For high-security environments; the certificate must be registered on the Azure AD app
ctx = ClientContext(site_url).with_client_certificate(tenant_id, client_id=client_id, thumbprint=thumbprint, cert_path=cert_path)

4. Interactive - Development
# Browser-based authentication for development
ctx = ClientContext(site_url).with_interactive(tenant_id, client_id)

Content Processing Strategy
Document Libraries
- Enumerate all files in configured document libraries
- Extract file metadata (title, author, created/modified dates, custom properties)
- Download and convert files using existing file conversion pipeline
- Create parent-child relationships for files with attachments
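The file selection implied by the `file_extensions`, `max_file_size`, and `exclude_paths` settings could be sketched roughly as follows. `FileFilter` and `accepts` are hypothetical names, not part of qdrant-loader, and the sketch assumes paths are given relative to the document library root.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class FileFilter:
    """Hypothetical helper mirroring the file-processing and filtering config keys."""

    file_extensions: set = field(
        default_factory=lambda: {".docx", ".pdf", ".xlsx", ".pptx", ".txt", ".md"}
    )
    max_file_size: int = 52_428_800  # 50 MB, as in the configuration example
    exclude_paths: tuple = ("Forms/", "_catalogs/")

    def accepts(self, relative_path: str, size_bytes: int) -> bool:
        """Return True if a library file should be downloaded and converted."""
        path = relative_path.lstrip("/")
        # Reject files that sit inside an excluded folder at any depth.
        if any(f"/{prefix}" in f"/{path}" for prefix in self.exclude_paths):
            return False
        # Compare extensions case-insensitively (e.g. "Report.PDF").
        if PurePosixPath(path).suffix.lower() not in self.file_extensions:
            return False
        # Enforce the size cap before any download is attempted.
        return size_bytes <= self.max_file_size
```

Checking the cheap conditions (path and extension) before size keeps the filter usable on listing metadata alone, so nothing is downloaded for files that would be skipped anyway.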
SharePoint Lists
- Process list items as structured documents
- Extract list item fields as metadata
- Handle rich text fields and attachments
- Support custom content types
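As a minimal sketch of treating a list item as a structured document, the snippet below strips HTML from a rich text field and copies selected columns into metadata. It assumes the item has already been fetched as a plain dict; the field names `Id`, `Title`, and `Body` and the function names are illustrative assumptions, and the returned dict is a stand-in for the project's actual Document model.

```python
from html.parser import HTMLParser

# Tags that should introduce whitespace when removed, so paragraphs do not run together.
_BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and inserts spaces around block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value: str) -> str:
    """Reduce a rich text field to plain text with normalized whitespace."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    return " ".join("".join(parser.parts).split())


def list_item_to_document(item: dict, body_field: str = "Body", custom_columns=()) -> dict:
    """Map one list item (as a dict) to a content/metadata pair."""
    metadata = {"title": item.get("Title"), "list_item_id": item.get("Id")}
    # Only copy the custom columns that are actually present on this item.
    for column in custom_columns:
        if column in item:
            metadata[column] = item[column]
    return {"content": strip_html(item.get(body_field) or ""), "metadata": metadata}
```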
Pages and Wiki Content
- Extract SharePoint pages and wiki content
- Process web parts and embedded content
- Maintain page hierarchy and navigation structure
📦 Dependencies
Primary Library
- Office365-REST-Python-Client (>=2.6.0)
  - Actively maintained and comprehensive
  - Supports both SharePoint REST API and Microsoft Graph API
  - Multiple authentication methods
  - Python 3.12 compatible
Integration Points
- Existing File Conversion: Leverage markitdown for Office documents, PDFs
- Document Model: Extend existing Document model with SharePoint-specific metadata
- State Management: Use existing change detection and state tracking
- Configuration System: Follow existing Pydantic-based configuration pattern
🔧 Implementation Plan
Phase 1: Core Infrastructure (Week 1-2)
- Create SharePoint connector package structure
- Implement SharePointConfig with authentication options
- Set up basic SharePoint connection and authentication
- Add Office365-REST-Python-Client dependency
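One possible shape for `SharePointConfig` is sketched below, with per-method validation of the authentication fields. To stay self-contained it uses a standard-library dataclass as a stand-in for the project's Pydantic-based models. The field names `username`, `password`, `thumbprint`, and `cert_path` do not appear in the configuration example and are assumptions taken from the variable names in the authentication snippets.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Fields each authentication_method needs (assumed mapping, for illustration only).
REQUIRED_AUTH_FIELDS = {
    "client_credentials": ("client_id", "client_secret"),
    "username_password": ("username", "password"),
    "certificate": ("tenant_id", "client_id", "thumbprint", "cert_path"),
    "interactive": ("tenant_id", "client_id"),
}


@dataclass
class SharePointConfig:
    """Dataclass stand-in for a Pydantic model; validates auth fields on creation."""

    base_url: str
    authentication_method: str = "client_credentials"
    client_id: Optional[str] = None
    client_secret: Optional[str] = None
    tenant_id: Optional[str] = None
    username: Optional[str] = None
    password: Optional[str] = None
    thumbprint: Optional[str] = None
    cert_path: Optional[str] = None
    document_libraries: List[str] = field(default_factory=list)
    sharepoint_lists: List[str] = field(default_factory=list)

    def __post_init__(self):
        required = REQUIRED_AUTH_FIELDS.get(self.authentication_method)
        if required is None:
            raise ValueError(
                f"Unsupported authentication_method: {self.authentication_method!r}"
            )
        # Report every missing field at once rather than failing on the first.
        missing = [name for name in required if not getattr(self, name)]
        if missing:
            raise ValueError(
                f"{self.authentication_method} requires: {', '.join(missing)}"
            )
```

Failing at configuration load time, rather than on the first SharePoint request, should make misconfigured sources easier to diagnose.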
Phase 2: Document Library Support (Week 3-4)
- Implement document library enumeration
- Add file download and metadata extraction
- Integrate with existing file conversion pipeline
- Implement basic error handling and logging
Phase 3: SharePoint Lists Support (Week 5)
- Add SharePoint list processing
- Extract list item content and metadata
- Handle rich text fields and attachments
- Support custom content types
Phase 4: Advanced Features (Week 6)
- Implement incremental updates and change detection
- Add support for SharePoint pages and wiki content
- Enhance metadata extraction with custom columns
- Add comprehensive filtering options
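The incremental update step in this phase could, in its simplest form, compare the last-modified value recorded for each item in the previous run with the value SharePoint reports now. The sketch below is illustrative only: `ChangeSet` and `detect_changes` are hypothetical names, and the actual connector would presumably plug into qdrant-loader's existing state tracking rather than use plain dicts.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ChangeSet:
    added: List[str]
    updated: List[str]
    deleted: List[str]


def detect_changes(previous: Dict[str, str], current: Dict[str, str]) -> ChangeSet:
    """Diff two {item_path: last_modified} snapshots.

    An item counts as updated when its recorded modification marker differs;
    the marker is compared for equality only, so an ETag would work equally well.
    """
    added = sorted(path for path in current if path not in previous)
    updated = sorted(
        path for path in current if path in previous and current[path] != previous[path]
    )
    deleted = sorted(path for path in previous if path not in current)
    return ChangeSet(added=added, updated=updated, deleted=deleted)
```

Only items in `added` and `updated` would need to be downloaded and converted again; items in `deleted` would be removed from the index.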
Phase 5: Testing and Documentation (Week 7)
- Add comprehensive unit and integration tests
- Create documentation and configuration examples
- Add error handling for common SharePoint scenarios
- Performance optimization and testing
🎯 Benefits
For Organizations
- Unified Knowledge Base: Include SharePoint content in AI-powered development workflows
- Comprehensive Search: Search across SharePoint and other sources simultaneously
- Existing Investment: Leverage existing SharePoint content without migration
- Enterprise Integration: Native support for enterprise authentication and security
For Developers
- Contextual AI Assistance: Access SharePoint documentation through Cursor, Windsurf, etc.
- Cross-Platform Search: Find information across Git, Confluence, JIRA, and SharePoint
- Rich Metadata: Leverage SharePoint's rich metadata for better search results
- File Format Support: Process Office documents, PDFs, and other SharePoint files
🔍 Use Cases
- Enterprise Documentation: Access company policies, procedures, and guidelines
- Project Knowledge: Include project documentation stored in SharePoint
- Training Materials: Process training content and knowledge base articles
- Cross-Team Collaboration: Search across team sites and document libraries
- Compliance Documentation: Include regulatory and compliance documents
- Meeting Notes: Process meeting minutes and decision records
📊 Success Criteria
- Successfully connect to SharePoint Online and On-Premises environments
- Support all major authentication methods (App-Only, username/password, certificate)
- Process document libraries with file conversion integration
- Extract and process SharePoint list content
- Maintain rich metadata from SharePoint (author, dates, custom columns)
- Implement efficient change detection and incremental updates
- Handle file attachments with parent-child relationships
- Achieve processing time < 30 seconds for typical documents
- Test coverage > 90% for new components
- Zero breaking changes to existing functionality
🔒 Security Considerations
- Authentication: Support enterprise-grade authentication methods
- Permissions: Respect SharePoint permissions and access controls
- Data Privacy: Handle sensitive content according to organizational policies
- Secure Storage: Store credentials securely using environment variables
- Audit Trail: Log access and processing activities for compliance
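To illustrate the environment-variable approach to credential storage, the sketch below expands `${NAME}` placeholders such as `${SHAREPOINT_CLIENT_SECRET}` when a configuration value is read. `resolve_env` is a hypothetical helper; qdrant-loader's own substitution mechanism may work differently.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value: str, env=None) -> str:
    """Replace ${NAME} placeholders with values from the environment.

    Raises KeyError for unset variables so a missing secret fails loudly
    instead of being sent to SharePoint as a literal "${...}" string.
    """
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"Environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_substitute, value)
```

Resolved secrets should stay in memory only and be kept out of log output, in line with the audit and privacy points above.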
📚 Configuration Examples
Basic SharePoint Online Setup
sources:
  sharepoint:
    main-site:
      base_url: "https://company.sharepoint.com/sites/main"
      source: "main-site"
      source_type: "sharepoint"
      authentication_method: "client_credentials"
      client_id: "${SHAREPOINT_CLIENT_ID}"
      client_secret: "${SHAREPOINT_CLIENT_SECRET}"
      document_libraries: ["Documents"]
      enable_file_conversion: true

Advanced Multi-Library Configuration
sources:
  sharepoint:
    knowledge-base:
      base_url: "https://company.sharepoint.com/sites/kb"
      source: "knowledge-base"
      source_type: "sharepoint"
      authentication_method: "client_credentials"
      client_id: "${SHAREPOINT_CLIENT_ID}"
      client_secret: "${SHAREPOINT_CLIENT_SECRET}"
      document_libraries:
        - "Technical Documentation"
        - "Policies and Procedures"
        - "Training Materials"
      sharepoint_lists:
        - "FAQ"
        - "Best Practices"
        - "Announcements"
      enable_file_conversion: true
      download_attachments: true
      custom_columns: ["Department", "Category", "Priority"]
      exclude_paths:
        - "Archive/"
        - "Templates/"

🏷️ Related Issues
This enhancement complements existing features:
- File conversion support (Add File Conversion Support for PDF, Office Documents, and Other Formats #16) - for processing SharePoint documents
- Multiple projects support (Support Multiple Projects in Ingestion and MCP Server #20) - for organizing SharePoint content by project
📖 References
- Office365-REST-Python-Client Documentation
- SharePoint REST API Reference
- Microsoft Graph SharePoint API
- Azure AD App Registration Guide
Labels: enhancement, feature-request, connector, sharepoint, enterprise
Priority: High - addresses common enterprise requirement
Effort: Medium - leverages existing architecture and proven libraries