Local Processing · Privacy-First · Open Source

AI-Powered Research Paper Downloader

An intelligent, local-first system that discovers, downloads, and filters research papers from open-access repositories, using Qwen3:8B served by Ollama for AI-powered relevance analysis. Search queries go to the repository APIs; paper text and model inference never leave your machine.

airpd · session

$ python airpd.py
# query: "transformer attention mechanisms"
→ Searching arXiv API...
→ Searching DOAJ API...
→ Searching PubMed Central...
→ Searching PLOS ONE...
Found 48 candidate papers
→ Downloading PDFs...
→ Extracting text (5 pages/doc)...
→ Ollama Qwen3:8B relevance check...
✓ RELEVANT attention_is_all_you_need.pdf
✓ RELEVANT flash_attention_v2.pdf
✗ FILTERED unrelated_paper_23.pdf
→ Organizing by query folder...
✓ Complete 31 relevant / 48 total

Academic Sources: 4

Pages Extracted per PDF: 5

LLM Context Limit: 5000 characters

Data Sent to Cloud: 0 KB

Processing Flow

5-Stage Pipeline

An automated workflow that turns a research query into a filtered, organised, relevance-scored paper library, with PDF processing and inference running entirely on local infrastructure.

Query → Discovery → Download → AI Relevance Check → Organised Output

01 💬 Query Input

User submits a research topic or keyword query to initiate paper discovery across all sources.

02 🔍 Multi-Source Search

Simultaneous API queries to arXiv, DOAJ, PubMed Central, and PLOS ONE for comprehensive discovery.
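
One way to run the searches side by side is a thread pool with one worker per source. The sketch below is illustrative rather than aiRPD's actual code: `search_all`, the `Searcher` type, and the stub searchers are hypothetical names, and the stubs stand in for real API clients.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List

# Hypothetical type: a searcher takes the query and returns a list of paper records.
Searcher = Callable[[str], List[dict]]


def search_all(query: str, searchers: Dict[str, Searcher]) -> Dict[str, List[dict]]:
    """Run every searcher concurrently; a failing source yields an empty list."""
    results: Dict[str, List[dict]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(searchers))) as pool:
        futures = {pool.submit(fn, query): name for name, fn in searchers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # one broken API should not sink the whole batch
                print(f"{name} search failed: {exc}")
                results[name] = []
    return results


# Stub searchers (assumptions) used only to demonstrate the call shape.
def fake_arxiv(query: str) -> List[dict]:
    return [{"title": f"arXiv hit for {query}"}]


def broken_source(query: str) -> List[dict]:
    raise TimeoutError("no response")


if __name__ == "__main__":
    found = search_all(
        "transformer attention mechanisms",
        {"arXiv": fake_arxiv, "DOAJ": broken_source},
    )
    print({name: len(papers) for name, papers in found.items()})
```

Isolating failures per source means a timeout at one repository still leaves the candidates from the others available for download.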

03 📥 Download & Extract

PDF retrieval via Requests, text extraction with PyPDF2 — 5 pages and 5000 characters per document.
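
The two limits can be applied in a small helper that is independent of the PDF library. This is a minimal sketch, assuming the 5 pages are the first 5 in the file (the page does not say which pages are sampled); `limit_text` and `extract_pdf_text` are hypothetical names.

```python
from itertools import islice
from typing import Iterable, Optional

MAX_PAGES = 5      # pages sampled per PDF, as stated above
MAX_CHARS = 5000   # characters passed on to the model, as stated above


def limit_text(
    page_texts: Iterable[Optional[str]],
    max_pages: int = MAX_PAGES,
    max_chars: int = MAX_CHARS,
) -> str:
    """Join the first max_pages page texts and cut the result to max_chars."""
    selected = [text or "" for text in islice(page_texts, max_pages)]
    return "\n".join(selected)[:max_chars]


def extract_pdf_text(path: str) -> str:
    """Read a PDF with PyPDF2 and apply the page and character limits."""
    from PyPDF2 import PdfReader  # lazy import so limit_text works without PyPDF2

    reader = PdfReader(path)
    # The generator is consumed lazily, so pages past the limit are never parsed.
    return limit_text(page.extract_text() for page in reader.pages)
```

Because `islice` stops after the page limit, long PDFs cost no more to process than short ones.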

04 🧠 AI Relevance Check

Ollama Qwen3:8B analyses each paper's extracted text and scores it against the original query intent.
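
A relevance check against a local Ollama server could look roughly like the sketch below. It posts to Ollama's `/api/generate` endpoint on the default port 11434 with streaming disabled. The prompt wording, the `RELEVANT` / `NOT_RELEVANT` reply format, and all function names are assumptions made for illustration; the prompt aiRPD really uses is not shown on this page. The standard library's `urllib` is used here instead of Requests to keep the example dependency-free.

```python
import json
import re
import urllib.request
from typing import Tuple

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen3:8b"


def build_prompt(query: str, text: str) -> str:
    """Hypothetical prompt; not the wording used by the real tool."""
    return (
        f"Research query: {query}\n\n"
        f"Paper excerpt:\n{text[:5000]}\n\n"
        "Is this paper relevant to the query? "
        "Answer on one line as 'RELEVANT: <reason>' or 'NOT_RELEVANT: <reason>'."
    )


def parse_verdict(reply: str) -> Tuple[bool, str]:
    """Return (is_relevant, reason); an unparseable reply counts as not relevant."""
    # Qwen3 may emit a <think>...</think> block before its answer; drop it first.
    reply = re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL).strip()
    match = re.search(r"\b(NOT_RELEVANT|RELEVANT)\s*:\s*(.*)", reply)
    if not match:
        return False, "unparseable model reply"
    return match.group(1) == "RELEVANT", match.group(2).strip()


def check_relevance(query: str, text: str, timeout: float = 120.0) -> Tuple[bool, str]:
    """Ask the local model for a verdict; requires a running Ollama server."""
    payload = json.dumps(
        {"model": MODEL, "prompt": build_prompt(query, text), "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return parse_verdict(body.get("response", ""))
```

Treating unparseable replies as rejections is a conservative choice; a real pipeline might prefer to retry or keep the paper for manual review.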

05 📂 Filter & Save

Relevant papers saved to query-named folders; rejected papers logged with reasons for transparency.
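
Saving into query-named folders mostly comes down to turning the query into a safe directory name. The sketch below shows one possible scheme; the slug rules, `query_folder_name`, and `save_relevant` are assumptions, since the page does not describe the actual naming convention.

```python
import re
import shutil
from pathlib import Path


def query_folder_name(query: str) -> str:
    """Turn a free-text query into a filesystem-safe folder name."""
    slug = re.sub(r"[^a-z0-9]+", "_", query.lower()).strip("_")
    return slug or "untitled_query"


def save_relevant(pdf_path: Path, query: str, library_root: Path) -> Path:
    """Move a PDF judged relevant into library_root/<query folder>/ and return its new path."""
    target_dir = library_root / query_folder_name(query)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / pdf_path.name
    shutil.move(str(pdf_path), str(target))
    return target


if __name__ == "__main__":
    print(query_folder_name("transformer attention mechanisms"))
```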

Infrastructure

Technology Stack

Chosen for local-first, privacy-preserving operation. PDF processing and model inference run on-device with no API keys and no cloud inference; the only outbound traffic is search and download requests to the open-access repositories.

Core Language & Libraries

Python Runtime

Primary language driving the full pipeline, from API calls to PDF processing and file organisation.

Python          Core runtime and orchestration
Requests        HTTP library for API calls and PDF retrieval with error handling
BeautifulSoup   HTML/XML parsing for structured API response extraction
PyPDF2          PDF text extraction with efficient page-limit processing

AI & Local Inference

Ollama Qwen3:8B

Advanced language model for relevance classification and content analysis — deployed locally with no external inference calls.

Qwen3:8B         Scientific knowledge base and advanced reasoning for document analysis
Ollama Runtime   Local model serving with zero cloud dependency
Context Limit    5000-character cap on the text sent to the model per document
Page Sampling    5-page extraction for representative content coverage

Academic Source APIs

Research Repositories

Four open-access academic APIs providing comprehensive, legally accessible paper discovery across scientific domains.

arXiv API        Multi-disciplinary scientific preprints and papers
DOAJ API         Directory of Open Access Journals
PubMed Central   Biomedical and life sciences literature
PLOS ONE         Peer-reviewed open-access publications

Operational Parameters

Performance Specifications

Calibrated defaults for balanced throughput, LLM context quality, and API rate-limit compliance.

Pages per PDF: 5
Extracted per document for representative content coverage

Character Limit: 5000 characters
Maximum text passed to Qwen3:8B for relevance analysis

Paper Delay
Processing pause between consecutive papers

Query Delay
Pause between query batches for API rate-limit compliance

Parallel Sources: 4
Simultaneous searches across academic repositories

Cloud API Calls: 0
All inference runs locally via Ollama, with zero external inference calls

Capabilities

Key Features

Six capabilities that distinguish aiRPD from generic paper scrapers and cloud-dependent research tools.

🧠 AI-Powered Filtering

Qwen3:8B intelligently filters papers based on semantic relevance to your query — not just keyword matching — saving time and storage.

🌐 Multi-Source Discovery

Searches arXiv, DOAJ, PubMed Central, and PLOS ONE simultaneously for the broadest possible open-access coverage.

🔒 Privacy-First Design

All PDF processing and model inference run locally. Document text is never sent to an external service; the only outbound requests are searches and downloads against the open-access repositories.

⚙️ Configurable Parameters

Adjustable page extraction limits, character context windows, and processing delays to match your hardware and throughput requirements.
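
These settings could be gathered into one configuration object. The sketch below is hypothetical: `PipelineConfig` and `process_paced` are illustrative names, and the two delay fields are left without defaults because this page does not state their values.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PipelineConfig:
    paper_delay_s: float        # pause between papers; value not stated on this page
    query_delay_s: float        # pause between query batches; value not stated on this page
    pages_per_pdf: int = 5      # default stated on this page
    char_limit: int = 5000      # default stated on this page
    parallel_sources: int = 4   # one worker per repository


def process_paced(
    papers: Iterable[str],
    handle: Callable[[str], None],
    config: PipelineConfig,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call handle on each paper, sleeping paper_delay_s between consecutive papers."""
    count = 0
    for paper in papers:
        if count:  # no pause before the first paper
            sleep(config.paper_delay_s)
        handle(paper)
        count += 1
    return count


if __name__ == "__main__":
    # The delay values here are arbitrary examples, not the tool's defaults.
    cfg = PipelineConfig(paper_delay_s=0.1, query_delay_s=0.5)
    process_paced(["a.pdf", "b.pdf"], print, cfg)
```

Passing `sleep` as a parameter keeps the pacing logic testable without real waiting.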

📊 Detailed Rejection Logging

Every filtered paper is logged with the AI-generated reason for rejection, providing a full audit trail for transparency and review.
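
An audit trail like this is often kept as an append-only JSON Lines file, one record per rejected paper. The sketch below assumes that format; the field names and the `log_rejection` / `read_rejections` helpers are hypothetical, not aiRPD's documented log schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def log_rejection(log_path: Path, query: str, filename: str, reason: str) -> dict:
    """Append one rejection record to a JSON Lines file and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "file": filename,
        "reason": reason,  # the model-generated explanation
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_rejections(log_path: Path) -> List[dict]:
    """Load every record back for review."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Appending one line per decision means an interrupted run still leaves a readable log of everything processed so far.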

📂 Organised Storage

Relevant papers are automatically sorted into query-named folders with clear naming conventions for easy retrieval and citation.

Data Sources

Academic Repositories

Four open-access APIs covering preprints, peer-reviewed journals, biomedical literature, and multidisciplinary publications.

Preprints & CS / Physics / Math

arXiv

Comprehensive repository of scientific preprints across computer science, physics, mathematics, and engineering disciplines.
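
For illustration, the arXiv API is queried over HTTP at `export.arxiv.org/api/query` and answers with an Atom feed. The sketch below builds a query URL and pulls titles and PDF links out of a feed. It parses with the standard library's `xml.etree.ElementTree` rather than BeautifulSoup to stay dependency-free, and the `all:` search field and helper names are choices made for this example, not necessarily what aiRPD does.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import List

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def arxiv_query_url(query: str, max_results: int = 10) -> str:
    """Build a search URL that matches the query against all arXiv fields."""
    params = {"search_query": f"all:{query}", "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_arxiv_feed(feed_xml: str) -> List[dict]:
    """Extract the title and PDF link from each Atom entry."""
    papers = []
    for entry in ET.fromstring(feed_xml).findall("atom:entry", ATOM):
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM)
        title = " ".join(raw_title.split())  # titles may wrap across lines
        pdf_url = None
        for link in entry.findall("atom:link", ATOM):
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        papers.append({"title": title, "pdf_url": pdf_url})
    return papers


if __name__ == "__main__":
    print(arxiv_query_url("transformer attention mechanisms", max_results=5))
```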

Open Access Journals

DOAJ

Directory of Open Access Journals — diverse peer-reviewed academic publications across all subject areas.

Biomedical & Life Sciences

PubMed Central

Free full-text archive of biomedical and life sciences journal literature maintained by the NIH National Library of Medicine.
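
PubMed Central can be searched through the NCBI E-utilities `esearch` endpoint with `db=pmc`. The sketch below builds such a request and reads the ID list from a JSON response. Whether aiRPD uses E-utilities is an assumption, and the helper names are hypothetical.

```python
import json
import urllib.parse
from typing import List

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pmc_search_url(query: str, max_results: int = 10) -> str:
    """Build an ESearch URL against the PubMed Central database."""
    params = {"db": "pmc", "term": query, "retmax": max_results, "retmode": "json"}
    return f"{ESEARCH}?{urllib.parse.urlencode(params)}"


def parse_pmc_ids(response_text: str) -> List[str]:
    """Return the article IDs from an ESearch JSON response, or an empty list."""
    data = json.loads(response_text)
    return data.get("esearchresult", {}).get("idlist", [])


if __name__ == "__main__":
    print(pmc_search_url("transformer attention mechanisms", max_results=5))
```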

Multidisciplinary Peer Review

PLOS ONE

Peer-reviewed open-access journal that publishes research across science, engineering, and medicine and evaluates submissions on methodological rigour.
