Local Processing · Privacy-First · Open Source

AI-Powered Research Paper Downloader

An intelligent, local-first system that discovers, downloads, and filters research papers from open-access repositories, using Qwen3:8B served by Ollama for relevance analysis. Paper text is analysed on your machine; only search and download requests reach the repositories.

Ollama Qwen3:8B · arXiv · DOAJ · PubMed · Local Inference · Privacy-First · Open Source
airpd · session
$ python airpd.py
# query: "transformer attention mechanisms"
→ Searching arXiv API...
→ Searching DOAJ API...
→ Searching PubMed Central...
→ Searching PLOS ONE...
Found 48 candidate papers
→ Downloading PDFs...
→ Extracting text (5 pages/doc)...
→ Ollama Qwen3:8B relevance check...
✓ RELEVANT attention_is_all_you_need.pdf
✓ RELEVANT flash_attention_v2.pdf
✗ FILTERED unrelated_paper_23.pdf
→ Organizing by query folder...
✓ Complete 31 relevant / 48 total
$
  • 4 Academic Sources
  • 5 Pages Extracted per PDF
  • 5000-Character LLM Context Limit
  • 0 KB Data Sent to Cloud
Processing Flow

5-Stage Pipeline

A fixed, automated workflow that turns a research query into a filtered, organised, relevance-scored paper library, with text extraction and relevance analysis running entirely on local infrastructure.

Query → Discovery → Download → AI Relevance Check → Organised Output
01 💬

Query Input

User submits a research topic or keyword query to initiate paper discovery across all sources.

02 🔍

Multi-Source Search

Simultaneous API queries to arXiv, DOAJ, PubMed Central, and PLOS ONE for comprehensive discovery.
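A minimal sketch of the simultaneous search, assuming each source is wrapped in a function that takes the query and returns a list of candidate records. The `search_all` helper and that source-function interface are illustrative names, not taken from the project's code. `arxiv_query_url` targets arXiv's public query endpoint, and standard-library threads stand in for whatever concurrency the project actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"


def arxiv_query_url(query, max_results=10):
    """Build an arXiv API search URL for a free-text query."""
    params = {"search_query": f"all:{query}", "start": 0, "max_results": max_results}
    return f"{ARXIV_ENDPOINT}?{urlencode(params)}"


def search_all(query, sources):
    """Run every source function concurrently and merge the results.

    `sources` maps a source name to a callable (hypothetical interface).
    A failing source is recorded and skipped instead of aborting the run.
    """
    results, errors = [], {}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                results.extend(future.result())
            except Exception as exc:  # network errors, malformed responses, etc.
                errors[name] = str(exc)
    return results, errors
```

Collecting errors per source means one unreachable repository does not discard the candidates found by the others.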

03 📥

Download & Extract

PDFs are retrieved with Requests and text is extracted with PyPDF2, limited to 5 pages and 5000 characters per document.
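The page and character limits could be applied roughly as follows. This is a sketch under assumptions: it takes the first five pages (the source does not say which pages are sampled), and the function names are hypothetical. PyPDF2 is imported inside the function so the pure clipping helper can be used and tested without it.

```python
def clip_pages(page_texts, max_pages=5, max_chars=5000):
    """Join up to `max_pages` page texts and truncate to `max_chars`."""
    kept = [t.strip() for t in page_texts[:max_pages] if t and t.strip()]
    return "\n".join(kept)[:max_chars]


def extract_pdf_text(pdf_path, max_pages=5, max_chars=5000):
    """Extract text from the leading pages of a PDF using PyPDF2."""
    from PyPDF2 import PdfReader  # lazy import; requires `pip install PyPDF2`

    reader = PdfReader(pdf_path)
    texts = []
    for index in range(min(max_pages, len(reader.pages))):
        # extract_text() can yield nothing for scanned or image-only pages
        texts.append(reader.pages[index].extract_text() or "")
    return clip_pages(texts, max_pages, max_chars)
```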

04 🧠

AI Relevance Check

Ollama Qwen3:8B analyses each paper's extracted text and scores it against the original query intent.
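A relevance check against a local Ollama server might look like the sketch below. The endpoint is Ollama's default local address; the prompt wording, the RELEVANT / NOT_RELEVANT reply format, and the function names are assumptions for illustration, since the project's actual prompt is not shown.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_prompt(query, text, max_chars=5000):
    """Hypothetical prompt format, capped at `max_chars` of paper text."""
    return (
        f"Research query: {query}\n\n"
        f"Paper excerpt:\n{text[:max_chars]}\n\n"
        "Is this paper relevant to the query? Answer on the first line with "
        "RELEVANT or NOT_RELEVANT, then give a one-sentence reason."
    )


def parse_verdict(reply):
    """Return (is_relevant, reason) from the model's reply."""
    # Qwen3 may emit a <think>...</think> block before its answer; drop it.
    answer = re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL).strip()
    first, _, rest = answer.partition("\n")
    is_relevant = first.strip().upper().startswith("RELEVANT")
    return is_relevant, rest.strip() or first.strip()


def check_relevance(query, text, model="qwen3:8b"):
    """Ask the local model for a verdict; needs a running Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(query, text), "stream": False}
    )
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.loads(response.read())["response"]
    return parse_verdict(reply)
```

Keeping the reason alongside the verdict is what makes a rejection log possible later in the pipeline.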

05 📂

Filter & Save

Relevant papers saved to query-named folders; rejected papers logged with reasons for transparency.
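Saving into query-named folders can be sketched as below. The slug rule, the `papers` root directory, and both function names are assumptions; the source only says that folders are named after the query.

```python
import re
from pathlib import Path


def query_folder_name(query):
    """Turn a free-text query into a filesystem-safe folder name."""
    slug = re.sub(r"[^a-z0-9]+", "_", query.lower()).strip("_")
    return slug or "untitled_query"


def save_relevant(pdf_bytes, filename, query, root="papers"):
    """Write a relevant PDF under <root>/<query folder>/ and return its path."""
    folder = Path(root) / query_folder_name(query)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / Path(filename).name  # drop any directory part of the name
    target.write_bytes(pdf_bytes)
    return target
```

For example, the query "transformer attention mechanisms" maps to the folder `transformer_attention_mechanisms`.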

Infrastructure

Technology Stack

Chosen for local-first, privacy-preserving operation. Processing and inference run on-device, with no API keys and no cloud inference service; the only network traffic is search and download requests to the open-access repositories.

Core Language & Libraries

Python Runtime

Primary language driving the full pipeline, from API calls to PDF processing and file organisation.

  • Python: Core runtime and orchestration
  • Requests: HTTP library for API calls and PDF retrieval with error handling
  • BeautifulSoup: HTML/XML parsing for structured API response extraction
  • PyPDF2: PDF text extraction with efficient page-limit processing
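To illustrate the structured response extraction, here is a sketch that pulls titles and PDF links out of an arXiv Atom feed. It deliberately uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra installs; the function name and the shape of the returned records are assumptions.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # Atom XML namespace


def parse_arxiv_feed(xml_text):
    """Return the title and PDF link of each entry in an arXiv Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM)
        title = " ".join(raw_title.split())  # collapse line breaks in titles
        pdf_url = None
        for link in entry.findall("atom:link", ATOM):
            if link.get("type") == "application/pdf" or link.get("title") == "pdf":
                pdf_url = link.get("href")
        papers.append({"title": title, "pdf_url": pdf_url})
    return papers
```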
AI & Local Inference

Ollama Qwen3:8B

Advanced language model for relevance classification and content analysis — deployed locally with no external inference calls.

  • Qwen3:8B: 8B-parameter language model that reasons about each document's relevance to the query
  • Ollama Runtime: Local model serving with zero cloud dependency
  • Context Limit: At most 5000 characters of extracted text per document are passed to the model
  • Page Sampling: 5-page extraction for representative content coverage
Academic Source APIs

Research Repositories

Four open-access academic APIs providing comprehensive, legally accessible paper discovery across scientific domains.

  • arXiv API: Multi-disciplinary scientific preprints and papers
  • DOAJ API: Directory of Open Access Journals
  • PubMed Central: Biomedical and life sciences literature
  • PLOS ONE: Peer-reviewed open-access publications
Operational Parameters

Performance Specifications

Calibrated defaults for balanced throughput, LLM context quality, and API rate-limit compliance.

  • Pages per PDF (5): Extracted per document for representative content coverage
  • Character Limit (5000 characters): Maximum text passed to Qwen3:8B for relevance analysis
  • Paper Delay (2 s): Processing pause between consecutive papers
  • Query Delay (5 s): Pause between query batches for API rate-limit compliance
  • Parallel Sources (4): Simultaneous searches across academic repositories
  • Cloud API Calls (0): All inference runs locally via Ollama, with zero external calls
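These defaults could be collected in a single configuration object, with the two delays applied by a small pacing helper. The class, field, and function names below are hypothetical; only the values come from the specifications above.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    max_pages: int = 5          # pages extracted per PDF
    max_chars: int = 5000       # characters passed to the model
    paper_delay_s: float = 2.0  # pause between consecutive papers
    query_delay_s: float = 5.0  # pause between query batches


def paced(items, delay_s, sleep=time.sleep):
    """Yield items, sleeping `delay_s` seconds between consecutive ones."""
    for index, item in enumerate(items):
        if index:
            sleep(delay_s)  # no pause before the first item
        yield item
```

A run would then nest the two loops, for example `for query in paced(queries, cfg.query_delay_s)` around `for paper in paced(papers, cfg.paper_delay_s)`. Passing `sleep` as a parameter lets the pacing be tested without real waiting.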
Capabilities

Key Features

Six capabilities that distinguish aiRPD from generic paper scrapers and cloud-dependent research tools.

🧠

AI-Powered Filtering

Qwen3:8B intelligently filters papers based on semantic relevance to your query — not just keyword matching — saving time and storage.

🌐

Multi-Source Discovery

Searches arXiv, DOAJ, PubMed Central, and PLOS ONE simultaneously for the broadest possible open-access coverage.

🔒

Privacy-First Design

All text extraction and AI inference run locally. Document text and relevance results are never sent to external servers; only search queries and download requests go to the open-access repositories.

⚙️

Configurable Parameters

Adjustable page extraction limits, character limits for model input, and processing delays to match your hardware and throughput requirements.

📊

Detailed Rejection Logging

Every filtered paper is logged with the AI-generated reason for rejection, providing a full audit trail for transparency and review.
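One simple form such an audit trail could take is an append-only JSON Lines file, sketched below. The file layout, field names, and function names are assumptions; the source does not specify the log format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_rejection(log_path, filename, query, reason):
    """Append one rejection record as a JSON line and return the record."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "file": filename,
        "query": query,
        "reason": reason,  # the model-generated explanation
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_rejections(log_path):
    """Load all rejection records for review."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

One record per line keeps the log easy to append to during a run and easy to filter afterwards.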

📂

Organised Storage

Relevant papers are automatically sorted into query-named folders with clear naming conventions for easy retrieval and citation.

Data Sources

Academic Repositories

Four open-access APIs covering preprints, peer-reviewed journals, biomedical literature, and multidisciplinary publications.

  • arXiv (Preprints & CS / Physics / Math): Comprehensive repository of scientific preprints across computer science, physics, mathematics, and engineering disciplines.
  • DOAJ (Open Access Journals): Directory of Open Access Journals, indexing diverse peer-reviewed academic publications across all subject areas.
  • PubMed Central (Biomedical & Life Sciences): Free full-text archive of biomedical and life sciences journal literature maintained by the NIH National Library of Medicine.
  • PLOS ONE (Multidisciplinary Peer Review): Rigorously peer-reviewed open-access scientific publication covering all disciplines with transparent methodology standards.