Specify a GitHub or local repo, GitHub pull request, arXiv or Sci-Hub paper, YouTube transcript, or documentation URL on the web, and scrape it into a text file and your clipboard for easier LLM ingestion
A CLI utility for aggregating and structuring multi-source data for LLM context.
A common task when using Large Language Models is providing them with sufficient context about a complex topic, such as a software project, a research paper, or technical documentation. This often involves manually gathering content from multiple sources like code files, web pages, and API responses, then copy-pasting them into a prompt.
**OneFileLLM** is a command-line tool that automates this data aggregation process. It accepts multiple input sources of various types, determines how to process each one, fetches the content, and combines it all into a single, structured XML output. The result is then copied to the system clipboard, ready for use.
For example, instead of a multi-step manual process:
# Before: Manual gathering
cat ./src/main.py
cat ./src/utils.py
# (manually open browser to view GitHub issue, copy content)
# (manually combine and paste everything into LLM)
OneFileLLM provides a direct, single-command alternative:
# After: Automated aggregation
python onefilellm.py ./src/ https://github.com/user/project/issues/123
# (manually paste into LLM)
1. Clone the repository:

```bash
git clone https://github.com/jimmc414/onefilellm.git
cd onefilellm
```

2. Install the dependencies:

```bash
pip install -r requirements.txt
```

3. (Recommended) Set the `GITHUB_TOKEN` environment variable to prevent API rate-limiting and to access private repositories:

```bash
export GITHUB_TOKEN="your_personal_access_token"
```
The tool is invoked by providing a list of input sources. It supports a variety of source types, which it detects automatically.
Synopsis:
python onefilellm.py [OPTIONS] [INPUT_SOURCES...]
Supported Input Sources:
- Local files and directories: `path/to/file.py`, `path/to/project/`
- GitHub repositories: `https://github.com/user/repo`
- GitHub issues: `https://github.com/user/repo/issues/1`
- GitHub pull requests: `https://github.com/user/repo/pull/1`
- Documentation URLs (for crawling): `https://docs.example.com/`
- arXiv papers: `https://arxiv.org/abs/1706.03762`
- DOIs (via Sci-Hub): `10.1038/s41586-021-03819-2`
- YouTube transcripts: `https://www.youtube.com/watch?v=...`
- Standard input: `cat file.txt | python onefilellm.py -`
- Clipboard: `python onefilellm.py --clipboard`
The tool can process and aggregate any number of different sources in a single command, including a mix of live inputs and pre-defined aliases. It processes inputs concurrently and combines the structured results into one XML document.
Example: Combine a local directory, a specific GitHub issue, and a documentation page.
```bash
python onefilellm.py ./src/ https://github.com/jimmc414/onefilellm/issues/1 https://react.dev/
```

This allows for the creation of rich, multi-faceted contexts for an LLM.
### Creating Workflow Aliases
The aliasing system allows you to save and re-use complex data aggregation commands. This is useful for recurring tasks, such as gathering the context for a specific project or technology stack.
**Creating an Alias:**
The `--alias-add` command defines a new alias. The alias name is the first argument, followed by a space-separated list of sources.
```bash
# Alias for a project's key components
python onefilellm.py --alias-add project-context src/ docs/README.md https://github.com/user/project/issues
# Alias for an entire SDK (repository + documentation)
python onefilellm.py --alias-add vercel-ai-sdk https://github.com/vercel/ai https://sdk.vercel.ai/docs
# Alias for a technical specification (Model Context Protocol)
python onefilellm.py --alias-add mcp-spec https://modelcontextprotocol.io/llms-full.txt https://github.com/modelcontextprotocol/python-sdk
```
**Using an Alias:**
You can then invoke an alias by its name.
python onefilellm.py project-context
**Aliases with Placeholders:**
Aliases can include a `{}` placeholder for dynamic input. This is useful for creating searchable shortcuts.
# Create an alias to search a GitHub organization
python onefilellm.py --alias-add search-msft "https://github.com/microsoft/{}"
# Use the alias with a search term
python onefilellm.py search-msft "terminal"
**Managing Aliases:**
- `--alias-list`: View all defined aliases.
- `--alias-remove <name>`: Remove a user-defined alias.

### Combining Multiple Aliases for Cross-Stack Analysis
Aggregate the context from several pre-defined aliases to analyze interactions between different technology stacks.
# Prerequisite: Create aliases for each stack
python onefilellm.py --alias-add k8s-stack https://github.com/kubernetes/kubernetes https://kubernetes.io/docs/
python onefilellm.py --alias-add istio-stack https://github.com/istio/istio https://istio.io/latest/docs/
# Combine aliases in a single command
python onefilellm.py k8s-stack istio-stack
Outcome: Provides an LLM with the source code and documentation for both Kubernetes and Istio, enabling questions about their integration, configuration, and comparative features.
### In-depth and Respectful Documentation Crawling
Perform a deep crawl of a large documentation site while filtering noise and respecting the server’s load.
python onefilellm.py https://kubernetes.io/docs/concepts/ \
--crawl-max-depth 5 \
--crawl-restrict-path \
--crawl-delay 0.5 \
--crawl-exclude-pattern ".(jpg|png|svg)$"
Outcome: Gathers a comprehensive snapshot of the Kubernetes concepts documentation.
- `--crawl-restrict-path`: Ensures the crawler does not leave the `/docs/concepts/` section.
- `--crawl-delay 0.5`: Adds a 500 ms delay between requests to avoid overloading the server.
- `--crawl-exclude-pattern`: Prevents the crawler from downloading image files.

### Aggregating Industry Specifications and SDKs
Gather the complete context for a technical standard by combining its specification file, reference implementation, and a related tool.
# Prerequisite: Create an alias for the Model Context Protocol
python onefilellm.py --alias-add mcp-spec https://modelcontextprotocol.io/llms-full.txt https://github.com/modelcontextprotocol/python-sdk
# Combine the alias with another relevant SDK
python onefilellm.py mcp-spec https://github.com/anthropics/anthropic-sdk-python
Outcome: Creates a prompt context containing the MCP specification, its Python SDK, and the official Anthropic Python SDK, allowing for detailed analysis of implementation and compatibility.
### Codebase Review with Full Context
Aggregate a local directory of code changes, a relevant GitHub Pull Request, and its corresponding issue to provide a complete context for code review.
python onefilellm.py ./my-feature-branch/ https://github.com/user/repo/pull/42 https://github.com/user/repo/issues/35
Outcome: Provides the LLM with the PR diff, discussion, related issue, and the current state of local code for a comprehensive review.
### Pipeline Integration with other CLI tools
Use `onefilellm` as the data aggregation stage in a larger command-line workflow, piping its output to another LLM tool for analysis.
python onefilellm.py k8s-stack | llm -m claude-3-opus "Summarize the key architectural patterns."
Outcome: Demonstrates how the tool can function as a modular component in automated, script-driven AI workflows.
Aliases are stored in `~/.onefilellm_aliases/aliases.json`. This file can be edited directly:

{
  "project-context": "src/ docs/README.md https://github.com/user/project/issues",
  "search-msft": "https://github.com/microsoft/{}",
  "mcp-spec": "https://modelcontextprotocol.io/llms-full.txt https://github.com/modelcontextprotocol/python-sdk"
}
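Because the file is plain JSON, it can also be maintained from a script. Below is a minimal sketch, assuming the default `~/.onefilellm_aliases/aliases.json` location shown above; the `notes-stack` alias name and its sources are purely illustrative:

```python
import json
from pathlib import Path

# Default alias file location, as documented above.
alias_file = Path.home() / ".onefilellm_aliases" / "aliases.json"

# Load existing aliases, or start with an empty mapping if the file is missing.
aliases = json.loads(alias_file.read_text()) if alias_file.exists() else {}

# Add or update an alias: the value is a space-separated list of sources,
# exactly as it would appear on the command line. ("notes-stack" is hypothetical.)
aliases["notes-stack"] = "docs/ https://github.com/user/project/issues"

alias_file.parent.mkdir(parents=True, exist_ok=True)
alias_file.write_text(json.dumps(aliases, indent=2))
```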
- `GITHUB_TOKEN`: A GitHub Personal Access Token. Used to access private repositories and to avoid public API rate limits.

---

*Old readme.md below*
OneFileLLM is a command-line tool designed to streamline the creation of information-dense prompts for large language models (LLMs). It aggregates and preprocesses data from a variety of sources, compiling them into a single text file that is automatically copied to your clipboard for quick use.
Install the required dependencies:
pip install -U -r requirements.txt
To access private GitHub repositories, generate a personal access token as described in the 'Obtaining a GitHub Personal Access Token' section below.
Clone the repository or download the source code.
Run the script using the following command:
python onefilellm.py
You can pass a single URL or path as a command line argument for faster processing:
python onefilellm.py https://github.com/jimmc414/1filellm
OneFileLLM now supports processing text directly from standard input (stdin) or the system clipboard:
Use the `-` argument to process text from standard input:
# Process text from a file via pipe
cat README.md | python onefilellm.py -
# Process output from another command
git diff | python onefilellm.py -
Use the `--clipboard` or `-c` argument to process text from the system clipboard:
# Copy text to clipboard first, then run:
python onefilellm.py --clipboard
# Or using the short form:
python onefilellm.py -c
OneFileLLM automatically detects the format of input text (plain text, Markdown, JSON, HTML, YAML) and processes it accordingly. You can override this detection with the `--format` or `-f` option:
# Force processing as JSON
cat data.txt | python onefilellm.py - --format json
# Force processing clipboard content as Markdown
python onefilellm.py --clipboard -f markdown
Supported format types: `text`, `markdown`, `json`, `html`, `yaml`, `doculing`, `markitdown`.
OneFileLLM supports processing multiple inputs at once. Simply provide multiple paths or URLs as command line arguments:
python onefilellm.py https://github.com/jimmc414/1filellm test_file1.txt test_file2.txt
When multiple inputs are provided, OneFileLLM processes each source and combines the results into a single document wrapped in an `<onefilellm_output>` root tag, which is copied to the clipboard and saved to `output.xml`.
OneFileLLM includes a powerful alias system that allows you to create shortcuts for frequently used commands with support for placeholders and advanced management:
Create, list, and remove aliases using the new alias management commands:
# Add or update an alias
python onefilellm.py --alias-add myrepo "https://github.com/user/repo"
# Add alias with flags and options
python onefilellm.py --alias-add deepcrawl "https://docs.example.com --crawl-max-depth 4 --crawl-include-pattern '/docs/'"
# List all aliases (core and user-defined)
python onefilellm.py --alias-list
# List only pre-shipped core aliases
python onefilellm.py --alias-list-core
# Remove a user-defined alias
python onefilellm.py --alias-remove myrepo
Aliases support a `{}` placeholder that gets replaced with your input:
# Create alias with placeholder
python onefilellm.py --alias-add github_search "https://github.com/search?q={}"
# Use with replacement value
python onefilellm.py github_search "onefilellm"
# Expands to: https://github.com/search?q=onefilellm
# Create complex alias with multiple flags
python onefilellm.py --alias-add crawl_site "{} --crawl-max-depth 3 --crawl-respect-robots"
# Use it
python onefilellm.py crawl_site "https://docs.python.org"
# Expands to: https://docs.python.org --crawl-max-depth 3 --crawl-respect-robots
OneFileLLM comes with useful pre-shipped aliases:
- `ofl_repo` - OneFileLLM GitHub repository
- `ofl_readme` - OneFileLLM README file
- `gh_search` - GitHub search with placeholder: `https://github.com/search?q={}`
- `arxiv_search` - ArXiv search with placeholder

User-defined aliases are stored in `~/.onefilellm_aliases/aliases.json`.
Use aliases just like any other input:
# Use a simple alias
python onefilellm.py ofl_repo
# Use alias with placeholder
python onefilellm.py gh_search "python"
# Mix aliases with direct inputs
python onefilellm.py ofl_repo local_file.txt
# Combine multiple aliases and arguments
python onefilellm.py ofl_repo github_search "machine learning" --format markdown
OneFileLLM features a powerful asynchronous web crawler with extensive configuration options for precise control over content extraction:
# Basic web crawl (default: 3 levels deep, up to 1000 pages)
python onefilellm.py https://docs.example.com
# Custom depth and page limits
python onefilellm.py https://example.com --crawl-max-depth 5 --crawl-max-pages 200
# Include only specific URL patterns
python onefilellm.py https://docs.example.com --crawl-include-pattern "/docs/"
# Exclude specific patterns (CSS, JS, images)
python onefilellm.py https://example.com --crawl-exclude-pattern "\.(css|js|png|jpg|gif)$"
# Restrict crawling to paths under the start URL
python onefilellm.py https://example.com/docs --crawl-restrict-path
# Include images and code blocks
python onefilellm.py https://example.com --crawl-include-images
# Exclude code blocks from output
python onefilellm.py https://example.com --crawl-no-include-code
# Disable heading extraction
python onefilellm.py https://example.com --crawl-no-extract-headings
# Include PDF files in crawl
python onefilellm.py https://example.com --crawl-include-pdfs
# Follow external links (default: stay on same domain)
python onefilellm.py https://example.com --crawl-follow-links
# Disable readability cleaning (keep raw HTML structure)
python onefilellm.py https://example.com --crawl-no-clean-html
# Keep JavaScript and CSS code
python onefilellm.py https://example.com --crawl-no-strip-js --crawl-no-strip-css
# Keep HTML comments
python onefilellm.py https://example.com --crawl-no-strip-comments
# Custom user agent
python onefilellm.py https://example.com --crawl-user-agent "MyBot/1.0"
# Delay between requests (seconds)
python onefilellm.py https://example.com --crawl-delay 1.0
# Request timeout (seconds)
python onefilellm.py https://example.com --crawl-timeout 30
# Concurrent requests (default: 3)
python onefilellm.py https://example.com --crawl-concurrency 5
# Respect robots.txt (default: ignore for backward compatibility)
python onefilellm.py https://example.com --crawl-respect-robots
# Comprehensive crawl with custom settings
python onefilellm.py https://docs.example.com \
--crawl-max-depth 4 \
--crawl-max-pages 500 \
--crawl-include-pattern "/docs/|/api/" \
--crawl-exclude-pattern "\.(pdf|zip|exe)$" \
--crawl-include-images \
--crawl-delay 0.5 \
--crawl-concurrency 2 \
--crawl-respect-robots \
--crawl-restrict-path
| Option | Type | Default | Description |
|---|---|---|---|
| `--crawl-max-depth` | int | 3 | Maximum crawl depth from start URL |
| `--crawl-max-pages` | int | 1000 | Maximum number of pages to crawl |
| `--crawl-user-agent` | str | OneFileLLMCrawler/1.1 | User agent string for requests |
| `--crawl-delay` | float | 0.25 | Delay between requests in seconds |
| `--crawl-include-pattern` | str | None | Regex pattern for URLs to include |
| `--crawl-exclude-pattern` | str | None | Regex pattern for URLs to exclude |
| `--crawl-timeout` | int | 20 | Request timeout in seconds |
| `--crawl-include-images` | flag | False | Include image URLs in output |
| `--crawl-no-include-code` | flag | False | Exclude code blocks from output |
| `--crawl-no-extract-headings` | flag | False | Exclude heading extraction |
| `--crawl-follow-links` | flag | False | Follow links to external domains |
| `--crawl-no-clean-html` | flag | False | Disable readability cleaning |
| `--crawl-no-strip-js` | flag | False | Keep JavaScript code |
| `--crawl-no-strip-css` | flag | False | Keep CSS styles |
| `--crawl-no-strip-comments` | flag | False | Keep HTML comments |
| `--crawl-respect-robots` | flag | False | Respect robots.txt files |
| `--crawl-concurrency` | int | 3 | Number of concurrent requests |
| `--crawl-restrict-path` | flag | False | Restrict crawl to paths under start URL |
| `--crawl-no-include-pdfs` | flag | False | Skip PDF files during crawl |
| `--crawl-no-ignore-epubs` | flag | False | Include EPUB files in crawl |
The tool also accepts text from standard input (`cat file.txt | python onefilellm.py -`) and from the system clipboard (`python onefilellm.py --clipboard`).

The tool supports the following input options, with their corresponding output actions. Note that the input file extensions are selected based on the following section of code (applicable to repos only):
allowed_extensions = ['.xyz', '.pdq', '.example']
The output for all options is encapsulated in LLM prompt-appropriate XML and automatically copied to the clipboard.
| Input Type | Example |
|---|---|
| Local file path | `C:\documents\report.pdf` |
| Local directory path | `C:\projects\research` |
| GitHub repository URL | `https://github.com/jimmc414/onefilellm` |
| GitHub pull request URL | `https://github.com/dear-github/dear-github/pull/102` |
| GitHub issue URL | `https://github.com/isaacs/github/issues/1191` |
| ArXiv paper URL | `https://arxiv.org/abs/2401.14295` |
| YouTube video URL | `https://www.youtube.com/watch?v=KZ_NlnmPQYk` |
| Webpage URL | `https://llm.datasette.io/en/stable/` |
| Sci-Hub Paper DOI | `10.1053/j.ajkd.2017.08.002` |
| Sci-Hub Paper PMID | `29203127` |
| Standard Input | `cat file.txt \| python onefilellm.py -` |
| Clipboard | `python onefilellm.py --clipboard` |
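Since the aggregated XML for every input type also lands on the system clipboard, it can be consumed programmatically. Below is a minimal sketch using `pyperclip` (already among the tool's dependencies); the question text and output filename are only illustrative:

```python
import pyperclip

# Grab the XML context that onefilellm just copied to the clipboard.
context = pyperclip.paste()

# Prepend an instruction and save the combined prompt for your LLM workflow of choice.
prompt = "Summarize the main components of this project.\n\n" + context
with open("prompt.txt", "w", encoding="utf-8") as f:
    f.write(prompt)

print(f"Prompt written with {len(prompt)} characters.")
```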
The script generates the following output files:
- `output.xml`: The full XML-structured output, automatically copied to the clipboard.
- `compressed_output.txt`: Cleaned and compressed text (when NLTK processing is enabled).
- `processed_urls.txt`: A list of all processed URLs during web crawling.

To modify the allowed file types for repository processing, update the `allowed_extensions` list in the code:
allowed_extensions = ['.py', '.txt', '.js', '.rst', '.sh', '.md', '.pyx', '.html', '.yaml','.json', '.jsonl', '.ipynb', '.h', '.c', '.sql', '.csv']
Web crawling behavior is now controlled through command-line arguments rather than hardcoded values. You can configure:
- `--crawl-max-depth N` (default: 3)
- `--crawl-max-pages N` (default: 1000)
- `--crawl-include-pattern` and `--crawl-exclude-pattern`
- `--crawl-include-images`, `--crawl-no-include-code`
- `--crawl-delay`, `--crawl-timeout`, `--crawl-concurrency`
- `--crawl-follow-links`, `--crawl-restrict-path`
- `--crawl-respect-robots`
The tool supports environment variables for configuration:
- `GITHUB_TOKEN`: Your GitHub personal access token
- `RUN_INTEGRATION_TESTS`: Set to `true` to enable integration tests
- `RUN_SLOW_TESTS`: Set to `true` to enable slow tests

You can also use a `.env` file in the project root directory to set these variables:

# .env file
GITHUB_TOKEN=your_github_token_here
RUN_INTEGRATION_TESTS=false
RUN_SLOW_TESTS=false
To access private GitHub repositories, you need a personal access token. When generating the token, select the necessary scopes (at least `repo` for private repositories). In the `onefilellm.py` script, replace `GITHUB_TOKEN` with your actual token, or set it as an environment variable:
For Windows:
setx GITHUB_TOKEN "YourGitHubToken"
For Linux:
echo 'export GITHUB_TOKEN="YourGitHubToken"' >> ~/.bashrc
source ~/.bashrc
All output is encapsulated in XML tags. This structure was implemented based on evaluations showing that LLMs perform better with prompts structured in XML. The general structure of the output is as follows:
<onefilellm_output>
<source type="[source_type]" [additional_attributes]>
<[content_type]>
[Extracted content]
</[content_type]>
</source>
</onefilellm_output>
For multiple inputs, each source appears as its own `<source>` element:
<onefilellm_output>
<source type="[source_type_1]" [additional_attributes]>
<[content_type]>
[Extracted content 1]
</[content_type]>
</source>
<source type="[source_type_2]" [additional_attributes]>
<[content_type]>
[Extracted content 2]
</[content_type]>
</source>
<!-- Additional sources as needed -->
</onefilellm_output>
Where `[source_type]` could be one of: "github_repository", "github_pull_request", "github_issue", "arxiv_paper", "youtube_transcript", "web_documentation", "sci_hub_paper", "local_directory", "local_file", "stdin", or "clipboard".
This XML structure provides clear delineation of different content types and sources, improving the LLM’s understanding and processing of the input.
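If you want to inspect the aggregated context outside of a prompt, the saved `output.xml` can be walked with the standard library. A minimal sketch, assuming the output is well-formed XML following the structure above:

```python
import xml.etree.ElementTree as ET

# output.xml is written alongside the clipboard copy (see the output files above).
root = ET.parse("output.xml").getroot()  # <onefilellm_output>

# List each aggregated source with its type and approximate content size.
for source in root.findall("source"):
    text = "".join(source.itertext())
    print(f"{source.get('type')}: {len(text)} characters")
```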
+--------------------------------+
| External Services |
|--------------------------------|
| GitHub API | YouTube API |
| Sci-Hub | ArXiv |
+--------------------------------+
|
|
v
+----------------------+ +---------------------+ +----------------------+
| | | | | |
| User | | Command Line Tool | | External Libraries |
|----------------------| |---------------------| |----------------------|
| - Provides input URL |--------->| - Handles user input| | - Requests |
| - Provides text via | | - Detects source |<--------| - BeautifulSoup |
| pipe or clipboard | | type | | - PyPDF2 |
| - Receives text | | - Calls appropriate | | - Tiktoken |
| in clipboard |<---------| - processing modules| | - NLTK |
| | | - Preprocesses text | | - Nbformat |
+----------------------+ | - Generates output | | - Nbconvert |
| files | | - YouTube Transcript |
| - Copies text to | | API |
| clipboard | | - Pyperclip |
| - Reports token | | - Wget |
| count | | - Tqdm |
+---------------------+ | - Rich |
| | - PyYAML |
| +----------------------+
v
+---------------------+
| Source Type |
| Detection |
|---------------------|
| - Determines type |
| of source |
+---------------------+
|
v
+---------------------+
| Processing Modules |
|---------------------|
| - GitHub Repo Proc |
| - Local Dir Proc |
| - YouTube Transcript|
| Proc |
| - ArXiv PDF Proc |
| - Sci-Hub Paper Proc|
| - Webpage Crawling |
| Proc |
| - Text Stream Proc |
+---------------------+
|
v
+---------------------+
| Text Preprocessing |
|---------------------|
| - Stopword removal |
| - Lowercase |
| conversion |
| - Text cleaning |
+---------------------+
|
v
+---------------------+
| Output Generation |
|---------------------|
| - Compressed text |
| file output |
| - Uncompressed text |
| file output |
+---------------------+
|
v
+---------------------+
| Token Count |
| Reporting |
|---------------------|
| - Report token count|
| |
| - Copies text to |
| clipboard |
+---------------------+
Recent changes:

2025-06-01:
- Alias management system (aliases stored in `~/.onefilellm_aliases/aliases.json`)
- `{}` token for dynamic command substitution
- `--alias-add`, `--alias-remove`, `--alias-list`, `--alias-list-core` commands

2025-05-30:
- Web crawler configuration (`--crawl-*` options)

2025-05-14:
- `--format TYPE` or `-f TYPE` flags

2025-05-10:

2025-05-07:
- Root tag changed from `<combined_sources>` to `<onefilellm_output>`

2025-05-03:

2025-01-20:

2025-01-17:

2024-07-29:

2024-05-17: Added ability to pass path or URL as command line argument.

2024-05-16: Updated text colors.

2024-05-11:
- Script renamed to `onefilellm.py`.

2024-04-04:

2024-04-03:
- Updated `onefilellm.py` to return an error when Sci-Hub is inaccessible or no document is found.

Notes:
- Modify the `allowed_extensions` list in the code to add or remove file types: `allowed_extensions = ['.py', '.txt', '.js', '.rst', '.sh', '.md', '.pyx', '.html', '.yaml','.json', '.jsonl', '.ipynb', '.h', '.c', '.sql', '.csv']`
- Modify the `excluded_patterns` list to customize which files are filtered out
- Modify the `EXCLUDED_DIRS` list to customize which directories are skipped
- Set crawl depth with the `--crawl-max-depth N` command-line option (default: 3)
- Limit pages with the `--crawl-max-pages N` command-line option (default: 1000)
- Use `--crawl-include-pattern` and `--crawl-exclude-pattern` for precise control
- Use `--crawl-*` flags for images, code blocks, headings, etc.
- Use `--crawl-concurrency N` and `--crawl-delay X` for request management
- Use `--crawl-respect-robots` to honor robots.txt files
- Use a `.env` file for easier configuration