POST /memory/documents/extract
Upload a document to have it extracted, chunked, and ingested into a memory collection in a single request. 60db’s document extraction engine handles 91+ file formats and includes built-in OCR for scanned PDFs and images: just POST the raw file, and the server handles format detection, OCR, chunking, and ingestion.
Supported formats (partial list — 91 total):
  • Documents: PDF, DOCX, DOC, ODT, RTF, TXT, MD, HTML, EPUB
  • Spreadsheets: XLSX, XLS, CSV, ODS
  • Presentations: PPTX, PPT, ODP
  • Email: EML, MSG, PST, MBOX
  • Images (OCR): PNG, JPG, JPEG, TIFF, BMP, GIF
  • Code & structured: JSON, XML, YAML, LaTeX, Markdown variants
  • Archives: ZIP, TAR, GZIP, 7Z (extracted recursively)

Request

Headers

Authorization
string
required
Bearer token with your API key
Content-Type
string
multipart/form-data

Body (multipart/form-data)

file
file
required
The document to extract. Max 200 MB per file.
collection
string
Collection ID to store the extracted chunks in. Defaults to the caller’s personal collection.
type
string
default:"knowledge"
Memory type for the ingested chunks. One of: user, knowledge, hive. For document uploads, knowledge is almost always the right choice.
title
string
Display title for the document. Defaults to the uploaded filename. When the document produces multiple chunks, each chunk is labeled "{title} (part N/M)".
chunk_size
integer
default:"1500"
Maximum characters per chunk. Larger chunks preserve more context but are less precise for recall. Minimum 200, maximum 8000.
chunk_overlap
integer
default:"200"
Characters of overlap between adjacent chunks. Helps preserve sentences that span chunk boundaries. Must be less than chunk_size.

Response

success
boolean
true on success.
data
object
Extraction and ingestion summary: collection, chunk and character counts, per-chunk queue results, and extraction metadata. See the example response for the full shape.
Examples

curl -X POST https://api.60db.ai/memory/documents/extract \
  -H "Authorization: Bearer your-api-key" \
  -F "[email protected]" \
  -F "collection=company_handbook" \
  -F "type=knowledge" \
  -F "title=Q4 2026 Report"
{
  "success": true,
  "data": {
    "collection_id": "company_handbook",
    "collection_label": "Company Handbook",
    "filename": "quarterly-report.pdf",
    "chunks": 18,
    "characters": 24680,
    "memory_type": "knowledge",
    "total_queued": 18,
    "results": [
      { "id": "mem_01HV8K...", "status": "pending", "message": "Queued for processing" },
      { "id": "mem_01HV8L...", "status": "pending", "message": "Queued for processing" }
    ],
    "metadata": {
      "source": "document_upload",
      "filename": "quarterly-report.pdf",
      "mime_type": "application/pdf",
      "page_count": 24,
      "detected_languages": ["eng"],
      "total_chunks": 18
    }
  }
}

Pipeline

When you POST a file, 60db runs it through this pipeline:
  1. Validate — file present, type allowed, under 200 MB, collection accessible.
  2. Extract — the document extraction engine detects the format (PDF, DOCX, image, etc.) and returns plain text plus metadata (mime_type, page_count, tables, quality_score). OCR is applied automatically for scanned PDFs and images.
  3. Chunk — split the extracted text into overlapping segments of chunk_size characters with chunk_overlap character overlap.
  4. Register collection — ensure the target collection is ready (idempotent, cached).
  5. Ingest — stream all chunks into the memory layer in a single batch.
  6. Return — the response includes one {id, status, message} entry per chunk. Processing continues asynchronously.
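Step 3 can be pictured in a few lines of Python. This is a mental model of fixed-stride character chunking under the documented chunk_size and chunk_overlap parameters, not the server's exact algorithm:

```python
def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list[str]:
    # Each chunk starts (chunk_size - chunk_overlap) characters after the
    # previous one, so adjacent chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With the defaults, a 3,000-character document produces three chunks of 1,500, 1,500, and 400 characters, each adjacent pair sharing 200 characters.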
For very large documents, prefer a higher chunk_size (e.g. 3000) to keep the per-chunk memory count low. The endpoint rejects uploads that produce more than 100 chunks with a TOO_MANY_CHUNKS error — split such files before upload or use a larger chunk size.
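Under the same fixed-stride assumption, you can estimate the chunk count up front and pick a chunk_size that stays under the 100-chunk cap. This is an approximation, not the server's exact count:

```python
import math

def estimated_chunks(total_chars: int, chunk_size: int = 1500,
                     chunk_overlap: int = 200) -> int:
    # Approximate chunk count for a fixed stride of (chunk_size - chunk_overlap).
    stride = chunk_size - chunk_overlap
    return max(1, math.ceil((total_chars - chunk_overlap) / stride))
```

A 200,000-character extraction works out to roughly 154 chunks at the defaults (which would be rejected with TOO_MANY_CHUNKS) but about 72 at chunk_size=3000.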

Tuning

Document type             | Recommended chunk_size | chunk_overlap | Notes
Technical docs / API refs | 1500                   | 200           | Default. Balances recall precision and context.
Long-form prose / books   | 2500                   | 300           | Fewer chunks, more context per result.
FAQs / short snippets     | 800                    | 100           | Higher precision; each Q&A becomes its own chunk.
Spreadsheet exports       | 3000                   | 0             | Tables should stay contiguous; overlap hurts.
Scanned PDFs (OCR)        | 2000                   | 250           | OCR adds whitespace noise; slightly longer chunks help.

Billing

Document upload is two-stage billing — you pay for the extraction and for the resulting ingest.
Stage       | Rate                                   | When charged                    | Refund on failure
Extract fee | $0.003 per MB                          | Before extraction runs          | Yes, auto
Ingest fee  | $0.0001 per 1,000 extracted characters | After extraction, before ingest | Yes, auto
The two charges are separate rows in transaction_log (MEMORY_EXTRACT and MEMORY_INGEST) so you can distinguish extraction cost from storage cost in your reporting. Example — uploading a 2 MB PDF that extracts to 50,000 characters of text:
Extract fee: 2 × $0.003         = $0.006
Ingest fee:  (50,000 / 1000) × $0.0001 = $0.005
Total:                            $0.011
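The same arithmetic as a small helper, with the rates copied from the billing table (the function itself is illustrative, not an SDK method):

```python
def upload_cost(file_mb: float, extracted_chars: int) -> dict[str, float]:
    extract_fee = file_mb * 0.003                 # $0.003 per MB, charged up front
    ingest_fee = extracted_chars / 1000 * 0.0001  # $0.0001 per 1,000 extracted characters
    return {"extract": round(extract_fee, 6),
            "ingest": round(ingest_fee, 6),
            "total": round(extract_fee + ingest_fee, 6)}
```

For the 2 MB / 50,000-character example above this returns extract 0.006, ingest 0.005, total 0.011.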
Response headers on success:
Header                 | Meaning
x-credit-balance       | Wallet balance after the extract fee was deducted (set before ingest runs)
x-credit-charged       | Just the extract fee
x-credit-charged-total | Extract fee + ingest fee combined
x-billing-tx           | UUID of the extract audit row (the ingest row is linked via metadata)
Special failure case — if extraction succeeds but your wallet can’t cover the post-extraction ingest charge, the extract fee is automatically refunded and the response is 402 INSUFFICIENT_CREDITS with details.extract_fee_refunded populated. You pay nothing for the failed attempt. See Pricing & Billing for the full policy.

Error responses

Status | Code                          | Meaning
400    | NO_FILE                       | No file was attached to the request.
400    | INVALID_TYPE                  | type must be user, knowledge, or hive.
402    | INSUFFICIENT_CREDITS          | Wallet cannot cover the extract fee (pre-charge) OR the post-extraction ingest fee (in which case details.extract_fee_refunded is populated).
403    | POLICY_DENY                   | Your role (e.g. viewer) is not allowed to create memories.
404    | COLLECTION_NOT_FOUND          | The specified collection doesn’t exist or you can’t access it.
413    | TOO_MANY_CHUNKS               | Document produced more than 100 chunks. Increase chunk_size or split the file. Full auto-refund.
422    | EMPTY_EXTRACTION              | Document contains no extractable text. Full auto-refund.
422    | EMPTY_CHUNKS                  | Chunking produced zero segments. Full auto-refund.
422    | EXTRACTION_FAILED             | Extraction engine rejected the file. Full auto-refund.
503    | MEMORY_INFRA_NOT_READY        | The workspace’s memory layer is still provisioning. Retry in ~10s. Full auto-refund.
503    | EXTRACTION_SERVICE_UNAVAILABLE | Document extraction is temporarily unavailable. Full auto-refund.
202    | MEMORY_QUEUED                 | Memory layer is temporarily unreachable; chunks were queued and will retry automatically. Both extract and ingest fees are refunded because the work won’t actually happen.

Checking ingestion status

The endpoint returns immediately once chunks are queued — full embedding/indexing happens asynchronously. Poll GET /memory/:id/status with any of the returned chunk IDs to check progress:
curl https://api.60db.ai/memory/mem_01HV8K.../status \
  -H "Authorization: Bearer sk_abc123"
Statuses: pending → processing → ready (or failed).
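A polling loop can wrap that endpoint. A sketch in Python — the fetch_status callable is a placeholder for your own GET /memory/:id/status call, and the timeout/interval defaults are arbitrary choices, not documented limits:

```python
import time
from typing import Callable

def wait_until_done(fetch_status: Callable[[], str],
                    timeout_s: float = 120.0, interval_s: float = 2.0) -> str:
    # fetch_status() should return the "status" field from GET /memory/:id/status.
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in ("ready", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"memory still {status!r} after {timeout_s}s")
        time.sleep(interval_s)
```

Injecting the fetcher keeps the retry logic testable and independent of whichever HTTP client you use.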

Size limits

  • Per file: 200 MB
  • Chunks per document: 100 (use a larger chunk_size to fit bigger files)
  • Chunk text length: 100,000 characters
  • Supported languages for OCR: 100+ languages including English, Spanish, French, German, Chinese, Japanese, Arabic, Hindi, and more
  • Rate limit: Same as POST /memory/ingest/batch (30 uploads/min per workspace on default plans)