
Add SharePoint Connector for Microsoft SharePoint Online and On-Premises #22

@martin-papy

Enhancement: SharePoint Connector

🎯 Overview

Add support for ingesting documents and content from Microsoft SharePoint Online and SharePoint On-Premises environments, enabling organizations to include their SharePoint-based knowledge repositories in their AI-powered development workflows.

📋 Problem Statement

Many organizations store critical documentation, policies, procedures, and knowledge assets in Microsoft SharePoint. Currently, qdrant-loader cannot access this content, creating gaps in organizational knowledge bases and limiting the effectiveness of AI-powered development tools that rely on comprehensive documentation.

Common SharePoint content includes:

  • Technical documentation and wikis
  • Project documentation and specifications
  • Policies and procedures
  • Training materials and guides
  • Meeting notes and decisions
  • File attachments (PDFs, Office documents, etc.)

🚀 Proposed Solution

Implement a SharePoint connector that integrates with both SharePoint Online (Office 365) and SharePoint On-Premises environments using the Office365-REST-Python-Client library.

Key Features

  • Multi-Environment Support: SharePoint Online and SharePoint On-Premises (2013 and later)
  • Multiple Authentication Methods: App-Only, username/password, certificate-based, interactive
  • Document Libraries: Process files from SharePoint document libraries
  • SharePoint Lists: Extract content from custom lists and built-in lists
  • Rich Metadata: Capture SharePoint-specific metadata (author, created/modified dates, custom columns)
  • File Conversion Integration: Leverage existing file conversion for Office documents, PDFs, etc.
  • Incremental Updates: Change detection for efficient synchronization
  • Attachment Processing: Handle file attachments with parent-child relationships

🏗️ Technical Implementation

Configuration Structure

sources:
  sharepoint:
    company-intranet:
      base_url: "https://company.sharepoint.com/sites/intranet"
      source: "company-intranet"
      source_type: "sharepoint"
      
      # Authentication
      authentication_method: "client_credentials"  # or "username_password", "certificate", "interactive"
      client_id: "${SHAREPOINT_CLIENT_ID}"
      client_secret: "${SHAREPOINT_CLIENT_SECRET}"
      tenant_id: "${SHAREPOINT_TENANT_ID}"
      
      # Content Selection
      document_libraries:
        - "Documents"
        - "Shared Documents"
        - "Policies"
      
      sharepoint_lists:
        - "Announcements"
        - "Project Updates"
        - "Knowledge Base"
      
      # File Processing
      enable_file_conversion: true
      download_attachments: true
      file_extensions: [".docx", ".pdf", ".xlsx", ".pptx", ".txt", ".md"]
      max_file_size: 52428800  # 50 MiB (50 * 1024 * 1024 bytes)
      
      # Filtering
      exclude_paths:
        - "Forms/"
        - "_catalogs/"
      include_content_types:
        - "Document"
        - "Page"
        - "List Item"
      
      # Metadata
      custom_columns:
        - "Department"
        - "Category"
        - "Tags"
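
The configuration above refers to secrets through `${...}` placeholders. How qdrant-loader resolves these is not shown in this issue, so the following is only a minimal sketch of one way to expand them, assuming the YAML has already been parsed into plain dicts, lists, and strings. The `expand_env` helper is a hypothetical name.

```python
import os
import re

# Matches ${VAR_NAME} placeholders such as ${SHAREPOINT_CLIENT_ID}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Recursively replace ${VAR} placeholders in strings, lists, and dicts.

    Hypothetical helper: raises KeyError when a referenced variable is unset,
    so a missing secret fails at load time rather than at connection time.
    """
    env = os.environ if env is None else env
    if isinstance(value, str):
        def substitute(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _PLACEHOLDER.sub(substitute, value)
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    return value  # numbers, booleans, and None pass through unchanged
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.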

Authentication Methods

1. App-Only (Client Credentials) - Recommended for Production

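# Azure AD App registration required
from office365.runtime.auth.client_credential import ClientCredential
from office365.sharepoint.client_context import ClientContext
client_credentials = ClientCredential(client_id, client_secret)
ctx = ClientContext(site_url).with_credentials(client_credentials)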

2. Username/Password - Development & Testing

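from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext
user_credentials = UserCredential(username, password)
ctx = ClientContext(site_url).with_credentials(user_credentials)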

3. Certificate-Based - Enterprise Security

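# For high-security environments
from office365.sharepoint.client_context import ClientContext
# The library method is with_client_certificate, and it also needs the tenant
ctx = ClientContext(site_url).with_client_certificate(tenant=tenant_id, client_id=client_id, thumbprint=thumbprint, cert_path=cert_path)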

4. Interactive - Development

# Browser-based authentication for development
from office365.sharepoint.client_context import ClientContext
ctx = ClientContext(site_url).with_interactive(tenant_id, client_id)
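
Each of the four methods needs a different set of settings, so the connector could check the configuration before attempting a connection. The sketch below is one possible shape for that check. The key names `username`, `password`, `cert_path`, and `thumbprint` are assumptions, since the configuration example in this issue only shows the client-credentials fields.

```python
# Settings each authentication_method needs. "username", "password",
# "cert_path", and "thumbprint" are assumed key names, not taken from the
# configuration example in this issue.
REQUIRED_AUTH_FIELDS = {
    "client_credentials": ("client_id", "client_secret"),
    "username_password": ("username", "password"),
    "certificate": ("tenant_id", "client_id", "thumbprint", "cert_path"),
    "interactive": ("tenant_id", "client_id"),
}


def missing_auth_fields(config):
    """Return the required settings that are absent or empty for the chosen method."""
    method = config.get("authentication_method")
    if method not in REQUIRED_AUTH_FIELDS:
        raise ValueError(f"unsupported authentication_method: {method!r}")
    return [name for name in REQUIRED_AUTH_FIELDS[method] if not config.get(name)]
```

An empty result means the configuration is complete for the selected method. A non-empty result can be reported to the user in a single error message instead of failing on the first missing value.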

Content Processing Strategy

Document Libraries

  • Enumerate all files in configured document libraries
  • Extract file metadata (title, author, created/modified dates, custom properties)
  • Download and convert files using existing file conversion pipeline
  • Create parent-child relationships for files with attachments
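
Before a file is downloaded, the `file_extensions`, `max_file_size`, and `exclude_paths` settings from the configuration can be applied to it. The following is a minimal sketch of that filter. It assumes paths are given relative to the library root and that excluded paths are matched as folder prefixes, neither of which is specified in this issue. `should_process` is a hypothetical name.

```python
from pathlib import PurePosixPath


def should_process(path, size_bytes, file_extensions, max_file_size, exclude_paths):
    """Decide whether a library file passes the path, extension, and size filters.

    Assumes `path` is relative to the library root, e.g. "Guides/setup.pdf".
    """
    normalized = path.lstrip("/")
    # Skip anything under an excluded folder prefix such as "Forms/".
    if any(normalized.startswith(prefix) for prefix in exclude_paths):
        return False
    # Compare extensions case-insensitively so "REPORT.PDF" matches ".pdf".
    if PurePosixPath(normalized).suffix.lower() not in file_extensions:
        return False
    return size_bytes <= max_file_size
```

Checking the cheap path and extension rules first means the size limit is only consulted for files that would otherwise be ingested.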

SharePoint Lists

  • Process list items as structured documents
  • Extract list item fields as metadata
  • Handle rich text fields and attachments
  • Support custom content types
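
To illustrate the list handling described above, here is a sketch that turns one list item into a document with plain-text content and a metadata dict. It assumes the item has already been fetched as a plain dict of field values and that rich text fields contain HTML. The function names and the output shape are hypothetical and use only the standard library.

```python
from html.parser import HTMLParser

_BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3"}


class _TextExtractor(HTMLParser):
    """Collects the text of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append(" ")

    def handle_endtag(self, tag):
        # Keep words from adjacent blocks from running together.
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the visible text of a rich text field with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def list_item_to_document(item, body_field, metadata_fields):
    """Split a list item (a dict of field values) into content and metadata."""
    return {
        "title": item.get("Title", ""),
        "content": strip_html(item.get(body_field) or ""),
        "metadata": {name: item[name] for name in metadata_fields if name in item},
    }
```

For example, an item with `Body` set to `<p>Install the <b>client</b>.</p><p>Then sign in.</p>` yields the content `Install the client. Then sign in.`.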

Pages and Wiki Content

  • Extract SharePoint pages and wiki content
  • Process web parts and embedded content
  • Maintain page hierarchy and navigation structure

📦 Dependencies

Primary Library

  • Office365-REST-Python-Client (>=2.6.0)
    • Actively maintained and comprehensive
    • Supports both SharePoint REST API and Microsoft Graph API
    • Multiple authentication methods
    • Python 3.12 compatible

Integration Points

  • Existing File Conversion: Leverage markitdown for Office documents, PDFs
  • Document Model: Extend existing Document model with SharePoint-specific metadata
  • State Management: Use existing change detection and state tracking
  • Configuration System: Follow existing Pydantic-based configuration pattern
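
The existing Document model is not shown in this issue, so the sketch below uses a hypothetical stand-in dataclass to show which SharePoint-specific fields an extension might carry and how an attachment could point back to its parent. All class, field, and function names here are assumptions, and the real model is described above as Pydantic-based.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SharePointDocument:
    """Hypothetical stand-in for the loader's Document model."""

    id: str
    title: str
    url: str
    author: Optional[str] = None
    created: Optional[str] = None   # timestamp string
    modified: Optional[str] = None  # timestamp string
    custom_columns: dict = field(default_factory=dict)
    parent_id: Optional[str] = None  # set only for attachments


def make_attachment(parent, file_name, url):
    """Build a child document that points back to its parent item."""
    return SharePointDocument(
        id=f"{parent.id}/attachments/{file_name}",
        title=file_name,
        url=url,
        author=parent.author,
        parent_id=parent.id,
    )
```

Storing only `parent_id` on the child keeps the relationship one-directional, so a parent does not need to be rewritten when attachments are added or removed.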

🔧 Implementation Plan

Phase 1: Core Infrastructure (Weeks 1-2)

  • Create SharePoint connector package structure
  • Implement SharePointConfig with authentication options
  • Set up basic SharePoint connection and authentication
  • Add Office365-REST-Python-Client dependency

Phase 2: Document Library Support (Weeks 3-4)

  • Implement document library enumeration
  • Add file download and metadata extraction
  • Integrate with existing file conversion pipeline
  • Implement basic error handling and logging

Phase 3: SharePoint Lists Support (Week 5)

  • Add SharePoint list processing
  • Extract list item content and metadata
  • Handle rich text fields and attachments
  • Support custom content types

Phase 4: Advanced Features (Week 6)

  • Implement incremental updates and change detection
  • Add support for SharePoint pages and wiki content
  • Enhance metadata extraction with custom columns
  • Add comprehensive filtering options
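
Incremental updates come down to comparing what was seen on the previous run with what SharePoint reports now. The project's existing state tracking is not shown in this issue, so the following is only an independent illustration of the idea. It assumes each run can produce a mapping from item id to last-modified timestamp.

```python
def detect_changes(previous, current):
    """Compare two {item_id: modified_timestamp} snapshots.

    Returns (added, updated, deleted) as sorted lists of item ids.
    """
    added = sorted(key for key in current if key not in previous)
    updated = sorted(
        key for key in current if key in previous and current[key] != previous[key]
    )
    deleted = sorted(key for key in previous if key not in current)
    return added, updated, deleted
```

Only the `added` and `updated` items need to be downloaded and converted again, and the `deleted` ids can be removed from the index.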

Phase 5: Testing and Documentation (Week 7)

  • Add comprehensive unit and integration tests
  • Create documentation and configuration examples
  • Add error handling for common SharePoint scenarios
  • Performance optimization and testing

🎯 Benefits

For Organizations

  • Unified Knowledge Base: Include SharePoint content in AI-powered development workflows
  • Comprehensive Search: Search across SharePoint and other sources simultaneously
  • Existing Investment: Leverage existing SharePoint content without migration
  • Enterprise Integration: Native support for enterprise authentication and security

For Developers

  • Contextual AI Assistance: Access SharePoint documentation through Cursor, Windsurf, etc.
  • Cross-Platform Search: Find information across Git, Confluence, JIRA, and SharePoint
  • Rich Metadata: Leverage SharePoint's rich metadata for better search results
  • File Format Support: Process Office documents, PDFs, and other SharePoint files

🔍 Use Cases

  1. Enterprise Documentation: Access company policies, procedures, and guidelines
  2. Project Knowledge: Include project documentation stored in SharePoint
  3. Training Materials: Process training content and knowledge base articles
  4. Cross-Team Collaboration: Search across team sites and document libraries
  5. Compliance Documentation: Include regulatory and compliance documents
  6. Meeting Notes: Process meeting minutes and decision records

📊 Success Criteria

  • Successfully connect to SharePoint Online and On-Premises environments
  • Support all major authentication methods (App-Only, username/password, certificate)
  • Process document libraries with file conversion integration
  • Extract and process SharePoint list content
  • Maintain rich metadata from SharePoint (author, dates, custom columns)
  • Implement efficient change detection and incremental updates
  • Handle file attachments with parent-child relationships
  • Achieve processing time < 30 seconds for typical documents
  • Test coverage > 90% for new components
  • Zero breaking changes to existing functionality

🔒 Security Considerations

  • Authentication: Support enterprise-grade authentication methods
  • Permissions: Respect SharePoint permissions and access controls
  • Data Privacy: Handle sensitive content according to organizational policies
  • Secure Storage: Store credentials securely using environment variables
  • Audit Trail: Log access and processing activities for compliance
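
Logging for an audit trail should not leak the credentials it is meant to protect. As a small sketch of that point, the helper below masks secret values before a source configuration is written to a log. The set of secret key names is an assumption (`password` does not appear in the configuration example), and `redact` is a hypothetical name.

```python
# Keys whose values must never appear in logs. "password" is an assumed key
# name for the username/password method.
SECRET_KEYS = frozenset({"client_secret", "password"})


def redact(config):
    """Return a shallow copy of config with non-empty secret values masked."""
    return {
        key: "***" if key in SECRET_KEYS and value else value
        for key, value in config.items()
    }
```

The original dict is left untouched, so the connector can keep using the real values while only the redacted copy is passed to the logger.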

📚 Configuration Examples

Basic SharePoint Online Setup

sources:
  sharepoint:
    main-site:
      base_url: "https://company.sharepoint.com/sites/main"
      source: "main-site"
      source_type: "sharepoint"
      authentication_method: "client_credentials"
      client_id: "${SHAREPOINT_CLIENT_ID}"
      client_secret: "${SHAREPOINT_CLIENT_SECRET}"
      document_libraries: ["Documents"]
      enable_file_conversion: true

Advanced Multi-Library Configuration

sources:
  sharepoint:
    knowledge-base:
      base_url: "https://company.sharepoint.com/sites/kb"
      source: "knowledge-base"
      source_type: "sharepoint"
      authentication_method: "client_credentials"
      client_id: "${SHAREPOINT_CLIENT_ID}"
      client_secret: "${SHAREPOINT_CLIENT_SECRET}"
      
      document_libraries:
        - "Technical Documentation"
        - "Policies and Procedures"
        - "Training Materials"
      
      sharepoint_lists:
        - "FAQ"
        - "Best Practices"
        - "Announcements"
      
      enable_file_conversion: true
      download_attachments: true
      custom_columns: ["Department", "Category", "Priority"]
      
      exclude_paths:
        - "Archive/"
        - "Templates/"

🏷️ Related Issues

This enhancement complements existing features:

📖 References


Labels: enhancement, feature-request, connector, sharepoint, enterprise
Priority: High - addresses common enterprise requirement
Effort: Medium - leverages existing architecture and proven libraries
