Skip to content

aybanda/Unsiloed

Repository files navigation

Logo

πŸ“„ Unsiloed AI Document Data extractor

A super simple way to extract text from documents for for intelligent document processing, extraction, and chunking with multi-threaded processing capabilities.

πŸš€ Features

πŸ“Š Document Chunking

  • Supported File Types: PDF, DOCX, PPTX
  • Chunking Strategies:
    • Fixed Size: Splits text into chunks of specified size with optional overlap
    • Page-based: Splits PDF by pages (PDF only, falls back to paragraph for other file types)
    • Semantic: Uses Multi-Modal Model to identify meaningful semantic chunks
    • Paragraph: Splits text by paragraphs
    • Heading: Splits text by identified headings

πŸ”§ Technical Details

🧠 OpenAI Integration

  • Uses OpenAI GPT-4o for semantic chunking
  • Handles authentication via API key from environment variables
  • Implements automatic retries and timeout handling
  • Provides structured JSON output for semantic chunks

πŸ”„ Parallel Processing

  • Multi-threaded processing for improved performance
  • Parallel page extraction from PDFs
  • Distributes processing of large documents across multiple threads

πŸ“ Document Processing

  • Extracts text from PDF, DOCX, and PPTX files
  • Handles image encoding for vision-based models
  • Generates extraction prompts for structured data extraction

βš™οΈ Configuration

Environmental Variables

  • OPENAI_API_KEY: Your OpenAI API key

πŸ›‘ Constraints & Limitations

File Handling

  • Temporary files are created during processing and deleted afterward
  • Files are processed in-memory where possible

Text Processing

  • Long text (>25,000 characters) is automatically split and processed in parallel for semantic chunking
  • Maximum token limit of 4000 for OpenAI responses

API Constraints

  • Request timeout set to 60 seconds
  • Maximum of 3 retries for OpenAI API calls

πŸ“‹ Request Parameters

Document Chunking Endpoint

  • document_file: The document file to process (PDF, DOCX, PPTX)
  • strategy: Chunking strategy to use (default: "semantic")
    • Options: "fixed", "page", "semantic", "paragraph", "heading"
  • chunk_size: Size of chunks for fixed strategy in characters (default: 1000)
  • overlap: Overlap size for fixed strategy in characters (default: 100)

πŸ“¦ Installation

Using pip

pip install unsiloed

Requirements

Unsiloed requires Python 3.8 or higher and has the following dependencies:

  • openai
  • PyPDF2
  • python-docx
  • python-pptx
  • fastapi
  • python-multipart

πŸ”‘ Environment Setup

Before using Unsiloed, set up your OpenAI API key:

Using environment variables

# Linux/macOS
export OPENAI_API_KEY="your-api-key-here"

# Windows (Command Prompt)
set OPENAI_API_KEY=your-api-key-here

# Windows (PowerShell)
$env:OPENAI_API_KEY="your-api-key-here"

Using a .env file

Create a .env file in your project directory:

OPENAI_API_KEY=your-api-key-here

Then in your Python code:

from dotenv import load_dotenv
load_dotenv()  # This loads the variables from .env

πŸ” Usage Example

Python

import os
import Unsiloed

# Example 1: Semantic chunking (default)
result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "semantic",
    "chunkSize": 1000,
    "overlap": 100
})

# Print the result
print(result)

# Example 2: Fixed-size chunking
fixed_result = Unsiloed.process_sync({
    "filePath": "./test.pdf", #path to your file
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "fixed",
    "chunkSize": 1500,
    "overlap": 150
})

# Example 3: Page-based chunking (PDF only)
page_result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "page"
})

# Example 4: Paragraph chunking
paragraph_result = Unsiloed.process_sync({
    "filePath": "./document.docx",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "paragraph"
})

# Example 5: Heading chunking
heading_result = Unsiloed.process_sync({
    "filePath": "./presentation.pptx",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "heading"
})

πŸ› οΈ Development Setup

Prerequisites

  • Python 3.8 or higher
  • pip (Python package installer)
  • git

Setting Up Local Development Environment

  1. Clone the repository:
git clone https://github.com/Unsiloed-opensource/Unsiloed.git
cd Unsiloed
  1. Create a virtual environment:
# Using venv
python -m venv venv

# Activate the virtual environment
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate
  1. Install dependencies:
pip install -r requirements.txt
  1. Set up your environment variables:
# Create a .env file
echo "OPENAI_API_KEY=your-api-key-here" > .env
  1. Run the FastAPI server locally:
uvicorn Unsiloed.app:app --reload
  1. Access the API documentation: Open your browser and go to http://localhost:8000/docs

πŸ‘¨β€πŸ’» Contributing

We welcome contributions to Unsiloed! Here's how you can help:

Setting Up Development Environment

  1. Fork the repository and clone your fork:
git clone https://github.com/YOUR_USERNAME/Unsiloed.git
cd Unsiloed
  1. Install development dependencies:
pip install -r requirements.txt

Making Changes

  1. Create a new branch for your feature:
git checkout -b feature/your-feature-name
  1. Make your changes and write tests if applicable

  2. Commit your changes:

git commit -m "Add your meaningful commit message here"
  1. Push to your fork:
git push origin feature/your-feature-name
  1. Create a Pull Request from your fork to the main repository

Code Style and Standards

  • We follow PEP 8 for Python code style
  • Use type hints where appropriate
  • Document functions and classes with docstrings
  • Write tests for new features

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

🌐 Community and Support

Join the Community

  • GitHub Discussions: For questions, ideas, and discussions
  • Issues: For bug reports and feature requests
  • Pull Requests: For contributing to the codebase

Staying Updated

  • Star the repository to show support
  • Watch for notification on new releases

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages