
parseflow

io.github.Libres-coder/parseflow

PDF parsing server with text extraction, metadata, search, images, and TOC via MCP

📄 ParseFlow

Universal document parsing library for PDF, Word, and Excel files


ParseFlow is a comprehensive document parsing solution that supports PDF, Word (docx), and Excel (xlsx/xls) files. It provides both a standalone library and an MCP (Model Context Protocol) server for AI assistants.

Chinese documentation | Examples | GitHub


✨ Features

📄 PDF Support

  • ✅ Text extraction with multiple strategies (raw, formatted, clean)
  • ✅ Page-specific and range-based extraction
  • ✅ Metadata retrieval (title, author, dates, page count)
  • ✅ Full-text search with context
  • ✅ Image extraction (placeholder)
  • ✅ Table of contents (TOC) extraction (placeholder)

📝 Word (docx) Support

  • ✅ Text extraction
  • ✅ HTML conversion
  • ✅ Metadata retrieval
  • ✅ Text search with context

📊 Excel (xlsx/xls) Support

  • ✅ Multi-sheet data extraction
  • ✅ Multiple output formats (JSON, CSV, Text)
  • ✅ Sheet-specific extraction
  • ✅ Cell-based search
  • ✅ Range extraction
  • ✅ Workbook metadata

🤖 MCP Server

  • ✅ 9 tools for AI assistants (5 PDF + 2 Word + 2 Excel)
  • ✅ Works with Claude Desktop and other MCP clients
  • ✅ Path security with allowlist support
📦 Installation

Core Library

npm install parseflow-core

MCP Server (Global)

npm install -g parseflow-mcp-server

Or use with npx:

npx parseflow-mcp-server

🚀 Quick Start

PDF Parsing

import { PDFParser } from 'parseflow-core';

const parser = new PDFParser();

// Extract all text
const text = await parser.extractText('document.pdf');

// Extract specific page
const page5 = await parser.extractPage('document.pdf', 5);

// Search
const results = await parser.search('document.pdf', 'keyword');

// Get metadata
const metadata = await parser.getMetadata('document.pdf');
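
The shape of the value returned by `search` is not shown above. As a rough illustration of what "full-text search with context" can mean, the sketch below scans text that has already been extracted and returns each hit with a window of surrounding characters. The names `SearchHit`, `searchWithContext`, and `contextChars` are hypothetical and are not part of parseflow-core.

```typescript
// Hypothetical result shape; parseflow-core may return something different.
interface SearchHit {
  index: number;   // character offset of the match in the text
  context: string; // the match plus up to `contextChars` characters on each side
}

// Case-insensitive substring search over already-extracted text.
// Assumes lowercasing does not change the string length (true for ASCII text).
function searchWithContext(text: string, keyword: string, contextChars = 20): SearchHit[] {
  const hits: SearchHit[] = [];
  if (keyword.length === 0) return hits;

  const haystack = text.toLowerCase();
  const needle = keyword.toLowerCase();

  let from = 0;
  while (true) {
    const index = haystack.indexOf(needle, from);
    if (index === -1) break;

    // Clamp the context window to the bounds of the text.
    const start = Math.max(0, index - contextChars);
    const end = Math.min(text.length, index + needle.length + contextChars);
    hits.push({ index, context: text.slice(start, end) });

    from = index + needle.length; // continue after this match
  }
  return hits;
}
```

Returning offsets alongside the context lets a caller map a hit back to its position in the extracted text.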

Word Parsing

import { WordParser } from 'parseflow-core';

const parser = new WordParser();

// Extract text
const result = await parser.extractText('report.docx');
console.log(result.text);

// Convert to HTML
const html = await parser.extractHTML('report.docx');

// Search
const matches = await parser.searchText('report.docx', 'budget');

Excel Parsing

import { ExcelParser } from 'parseflow-core';

const parser = new ExcelParser();

// Extract all sheets (JSON format)
const data = await parser.extractData('spreadsheet.xlsx');

// Extract specific sheet
const sales = await parser.extractData('data.xlsx', {
  sheetName: 'Q4 Sales',
  format: 'json'
});

// Search in cells
const results = await parser.searchText('data.xlsx', 'revenue');
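
The structure of the data returned by `extractData` is not documented in this README. To make the difference between the JSON and CSV output formats concrete, here is a minimal sketch that converts a sheet, represented as rows of cells, into both. The `Cell` type and the helpers `rowsToRecords`, `toCsvField`, and `rowsToCsv` are assumptions for illustration, not parseflow-core APIs.

```typescript
// Hypothetical cell type for a sheet that has been read into memory.
type Cell = string | number | boolean | null;

// JSON-style output: treat the first row as headers, the rest as records.
function rowsToRecords(rows: Cell[][]): Record<string, Cell>[] {
  if (rows.length === 0) return [];
  const [header, ...body] = rows;
  // Fall back to a positional name when a header cell is empty.
  const keys = header.map((h, i) => (h === null || h === "" ? `column${i + 1}` : String(h)));
  return body.map((row) => {
    const record: Record<string, Cell> = {};
    keys.forEach((key, i) => {
      record[key] = row[i] ?? null; // short rows are padded with null
    });
    return record;
  });
}

// CSV-style output: quote a field only when it contains a comma, a double
// quote, or a line break, and double any embedded quotes (RFC 4180 style).
function toCsvField(value: Cell): string {
  const s = value === null ? "" : String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

function rowsToCsv(rows: Cell[][]): string {
  return rows.map((row) => row.map(toCsvField).join(",")).join("\n");
}
```

The record form is easier to query by column name, while the CSV form keeps the original row order and is easier to hand to other tools.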

🛠️ MCP Server Usage

Configuration for Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "parseflow": {
      "command": "npx",
      "args": ["-y", "parseflow-mcp-server"],
      "env": {
        "PARSEFLOW_ALLOWED_PATHS": "C:\\Documents;D:\\Projects"
      }
    }
  }
}
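
How the server enforces `PARSEFLOW_ALLOWED_PATHS` is not described here. The sketch below shows one common way such an allowlist can be checked: split the semicolon-separated value, resolve each entry, and accept a file only if it resolves to a location inside one of those directories. The helpers `parseAllowedPaths` and `isPathAllowed` are hypothetical. `path.win32` is used so the example matches the Windows-style value above; on other platforms `path` would take its place.

```typescript
import * as path from "node:path";

// Windows path semantics, chosen to match the example configuration.
const winPath = path.win32;

// Split a value such as "C:\\Documents;D:\\Projects" into absolute directories.
function parseAllowedPaths(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(";")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => winPath.resolve(entry));
}

// A file is allowed if it is one of the listed directories or lies beneath one.
function isPathAllowed(filePath: string, allowedDirs: string[]): boolean {
  const resolved = winPath.resolve(filePath); // collapses ".." segments
  return allowedDirs.some((dir) => {
    const rel = winPath.relative(dir, resolved);
    // The target escapes `dir` if the relative path climbs upward or
    // points at a different drive (in which case it stays absolute).
    const escapes =
      rel === ".." || rel.startsWith(".." + winPath.sep) || winPath.isAbsolute(rel);
    return !escapes;
  });
}
```

Comparing with `relative` rather than a plain string prefix avoids treating a sibling such as `C:\DocumentsEvil` as if it were inside `C:\Documents`.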

Available Tools

PDF Tools

  • extract_text - Extract text from PDF files
  • search_pdf - Search for keywords in PDF
  • get_metadata - Get PDF metadata
  • extract_images - Extract images from PDF
  • get_toc - Get table of contents

Word Tools

  • extract_word - Extract text/HTML from Word documents
  • search_word - Search in Word documents

Excel Tools

  • extract_excel - Extract data from Excel spreadsheets
  • search_excel - Search in Excel cells

Example Usage in Claude

"Please read the contents of report.docx"
→ Uses extract_word tool

"Find 'Product A' in sales.xlsx"
→ Uses search_excel tool

"Extract the metadata of document.pdf"
→ Uses get_metadata tool
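
When an assistant picks a tool, the MCP client sends the server a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request for `search_excel`. The envelope follows the MCP specification, but the argument names `path` and `query` are assumptions; the real input schema for each tool is whatever the server reports in its `tools/list` response.

```typescript
// JSON-RPC 2.0 envelope that MCP uses for tool invocations.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Hypothetical helper that wraps a tool name and its arguments in the envelope.
function buildToolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++, // each request gets a fresh id
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// `path` and `query` are hypothetical argument names, not confirmed by this README.
const request = buildToolCall("search_excel", { path: "sales.xlsx", query: "Product A" });
console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client library builds and sends this message, so the sketch is mainly useful for reading server logs or debugging a tool call by hand.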

📚 Documentation


๐Ÿ—๏ธ Project Structure

ParseFlow/
โ”œโ”€โ”€ packages/
โ”‚   โ”œโ”€โ”€ pdf-parser-core/      # Core library (parseflow-core)
โ”‚   โ”‚   โ”œโ”€โ”€ src/
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ parser.ts     # PDF parser
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ WordParser.ts # Word parser
โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ ExcelParser.ts # Excel parser
โ”‚   โ”‚   โ””โ”€โ”€ package.json
โ”‚   โ””โ”€โ”€ mcp-server/           # MCP server (parseflow-mcp-server)
โ”‚       โ”œโ”€โ”€ src/
โ”‚       โ”‚   โ”œโ”€โ”€ index.ts      # Server entry
โ”‚       โ”‚   โ””โ”€โ”€ tools/        # MCP tools
โ”‚       โ””โ”€โ”€ package.json
โ”œโ”€โ”€ docs/                     # Documentation
โ”œโ”€โ”€ examples/                 # Usage examples
โ”œโ”€โ”€ tests/                    # Test files
โ””โ”€โ”€ scripts/                  # Build scripts

🧪 Testing

# Run all tests
pnpm test

# Test coverage
pnpm test:coverage

# Run specific test
pnpm test parser.test.ts

Test Files

  • Wordๆต‹่ฏ•ๆ–‡ไปถ.docx - Word test document
  • Excelๆต‹่ฏ•ๆ–‡ไปถ.xlsx - Excel test workbook (3 sheets)
  • PDFๆต‹่ฏ•ๆ–‡ๆกฃ.pdf - PDF test document

🔧 Development

# Install dependencies
pnpm install

# Build all packages
pnpm build

# Watch mode
pnpm dev

# Lint
pnpm lint

# Type check
pnpm type-check

📈 Roadmap

v1.1.0 (Current)

  • ✅ Word (docx) support
  • ✅ Excel (xlsx/xls) support
  • ✅ 9 MCP tools

v1.2.0 (Planned)

  • Encrypted PDF support
  • OCR text recognition
  • PowerPoint (pptx) support
  • Batch processing optimization

v2.0.0 (Future)

  • Plugin system
  • More document formats (CSV, TXT, RTF)
  • Advanced table extraction
  • Document conversion

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

Ways to Contribute

  • 🐛 Report bugs
  • 💡 Suggest features
  • 📝 Improve documentation
  • 🔧 Submit pull requests

📦 Packages

Package               Version  Description
parseflow-core        1.0.1    Core parsing library
parseflow-mcp-server  1.0.2    MCP server for AI

🔗 Links


📄 License

MIT License - see LICENSE file for details.


🙏 Acknowledgments

  • pdf-parse - PDF parsing
  • pdf-lib - PDF manipulation
  • mammoth - Word document parsing
  • xlsx - Excel spreadsheet parsing
  • MCP SDK - Model Context Protocol

📊 Stats

  • Test Coverage: 83%+
  • Supported Formats: 3 (PDF, Word, Excel)
  • MCP Tools: 9
  • Dependencies: Minimal and well-maintained

💬 Community


Made with ❤️ by Libres-coder

Status: 🎉 Production Ready (v1.1.0)