github_fetcher
GitHub file extraction and synchronization utilities for documentation packaging.
This module provides tools for retrieving, processing, and exporting files from GitHub
repositories, with a focus on preparing content for AI knowledge bases and documentation
systems. It includes capabilities for file filtering with glob patterns, metadata
enrichment, XML serialization, and structured export. The module’s core components are
the GitHubFile class, which represents individual repository files with their content
and metadata, and the GitHubPipeline class, which orchestrates the entire process of
extracting files matching specific criteria and exporting them to a target location.
The resulting exported files preserve both content and contextual information, making
them suitable for knowledge extraction, documentation generation, and AI context building.
- docpack.github_fetcher.extract_domain(url: str) -> str
Extract the domain part from a URL.
This function takes a URL as input and returns just the domain name, removing any protocol prefixes (http://, https://) and any paths or parameters that might follow the domain.
- Parameters:
url – A URL string (e.g., “https://github.com/abc-team/xyz-project”)
- Returns:
The domain part of the URL (e.g., “github.com”)
- Examples:
>>> extract_domain("https://github.com/abc-team/xyz-project")
'github.com'
>>> extract_domain("http://github.com")
'github.com'
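The behavior described above can be approximated with the standard library. This is a minimal sketch, not the module's actual implementation: the name `extract_domain_sketch` is hypothetical, and the handling of inputs without a scheme is an assumption.

```python
from urllib.parse import urlparse


def extract_domain_sketch(url: str) -> str:
    """Return the network location of a URL (hypothetical stand-in)."""
    # urlparse only fills ``netloc`` when a scheme or a leading "//" is
    # present, so prefix bare inputs such as "github.com/abc-team".
    if "://" not in url:
        url = "//" + url
    return urlparse(url).netloc


print(extract_domain_sketch("https://github.com/abc-team/xyz-project"))
```

Note that `netloc` keeps any port or user-info component, so a stricter version would need to strip those separately.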
- docpack.github_fetcher.get_github_url(domain: str, account: str, repo: str, branch: str, path_parts: tuple[str, ...]) -> str
Generate a GitHub URL for a file in a repository.
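The exact URL layout is not stated here. The sketch below assumes the `/blob/<branch>/<path>` form that github.com uses for file views; `build_github_url` is a hypothetical name, and the real function may build the URL differently.

```python
def build_github_url(
    domain: str,
    account: str,
    repo: str,
    branch: str,
    path_parts: tuple[str, ...],
) -> str:
    """Assemble a file URL (assumed layout, not confirmed by the source)."""
    path = "/".join(path_parts)
    return f"https://{domain}/{account}/{repo}/blob/{branch}/{path}"


print(build_github_url("github.com", "abc-team", "xyz-project", "main", ("docs", "index.md")))
```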
- class docpack.github_fetcher.GitHubFile(*, domain: str, account: str, repo: str, branch: str, github_url: str, path_parts: tuple[str, ...], title: str, description: str, content: str)
A data container representing a file in a GitHub repository with metadata and content.
This class provides utilities for working with GitHub files, including methods for serializing to an LLM-friendly XML format, generating unique identifiers based on the file path, and exporting the file data to disk.
- Parameters:
domain – The domain name of the GitHub instance (e.g., ‘github.com’)
account – The GitHub account or organization name
repo – The name of the GitHub repository
branch – The branch name (e.g., ‘main’, ‘master’) or tag name.
github_url – The full URL to the file on GitHub; this is usually a calculated value.
path_parts – The file path broken into components
title – An optional title for the file
description – An optional description of the file
content – The raw content of the file
- property path: str
Get the relative path of the file from the repository root.
- Returns:
The path as a string with components joined by ‘/’
- to_xml(wanted_fields: list[str] | None = None) -> str
Serialize the file data to XML format.
This method generates an XML representation of the file including its GitHub metadata and content, suitable for document storage or AI context input.
- property uri_hash: str
Generate a short hash identifier for the file.
Creates a unique identifier based on the file’s GitHub location including domain, account, repo, branch, and path. This hash can be used for creating unique filenames or identifiers.
- Returns:
A 7-character hash string derived from the file’s URI
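The hash algorithm is not named here. One way to obtain a short, stable identifier is to hash the joined location and truncate the hex digest. The sketch below assumes SHA-256 and a `/`-joined URI, both of which are illustrative choices rather than the library's confirmed behavior.

```python
import hashlib


def uri_hash_sketch(
    domain: str,
    account: str,
    repo: str,
    branch: str,
    path_parts: tuple[str, ...],
) -> str:
    """Return a 7-character hex digest of the file location (assumed scheme)."""
    uri = "/".join([domain, account, repo, branch, *path_parts])
    return hashlib.sha256(uri.encode("utf-8")).hexdigest()[:7]


print(uri_hash_sketch("github.com", "abc-team", "xyz-project", "main", ("docs", "index.md")))
```

A 7-character prefix is short enough for filenames but is not collision-proof across very large file sets.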
- property breadcrumb_path: str
Create a flattened representation of the file path.
Converts the hierarchical path structure into a single string with path components joined by ‘~’ characters. This format is useful for creating filesystem-safe filenames that preserve path information.
- Returns:
The path with components joined by ‘~’ instead of ‘/’
- export_to_file(dir_out: Path, wanted_fields: list[str] | None = None) -> Path
Export the file data as an XML document to the specified directory.
Creates an XML file in the specified directory with a filename that combines the breadcrumb path and URI hash to ensure uniqueness.
- Parameters:
dir_out – The directory where the XML file should be saved
- Returns:
The path to the created XML file
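To illustrate how a breadcrumb path and a short hash can combine into a unique filename, here is a hedged sketch. The `~` separator between breadcrumb and hash, the `.xml` suffix, the SHA-256 digest, and both function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def export_name_sketch(uri: str, path_parts: tuple[str, ...]) -> str:
    """Build '<breadcrumb>~<hash>.xml' (assumed naming scheme)."""
    breadcrumb = "~".join(path_parts)
    short_hash = hashlib.sha256(uri.encode("utf-8")).hexdigest()[:7]
    return f"{breadcrumb}~{short_hash}.xml"


def export_sketch(dir_out: Path, uri: str, path_parts: tuple[str, ...], xml: str) -> Path:
    """Write the XML text into dir_out and return the created path."""
    dir_out.mkdir(parents=True, exist_ok=True)
    path_out = dir_out / export_name_sketch(uri, path_parts)
    path_out.write_text(xml, encoding="utf-8")
    return path_out
```

Because the breadcrumb keeps every path component, two files named `index.md` in different folders end up with different output names even before the hash is appended.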
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- docpack.github_fetcher.sort_github_files(github_file_list: list[GitHubFile]) -> list[GitHubFile]
Sort GitHub files by their relative path within the repository.
This function takes a list of GitHubFile objects and returns a new list sorted alphabetically by their path property. Sorting helps maintain consistent ordering when processing or displaying files.
- Parameters:
github_file_list – A list of GitHubFile objects to sort
- Returns:
A new list containing the same GitHubFile objects but sorted by their paths
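Sorting by the path property can be sketched with `sorted` and a key function. `FileStub` below is a hypothetical stand-in for GitHubFile that carries only the path components; it is not the real class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStub:
    """Hypothetical stand-in holding only the path components."""

    path_parts: tuple[str, ...]

    @property
    def path(self) -> str:
        return "/".join(self.path_parts)


def sort_by_path(files: list[FileStub]) -> list[FileStub]:
    # sorted() returns a new list and leaves the input untouched.
    return sorted(files, key=lambda f: f.path)


files = [FileStub(("docs", "b.md")), FileStub(("README.md",)), FileStub(("docs", "a.md"))]
print([f.path for f in sort_by_path(files)])
```

Plain string comparison places uppercase letters before lowercase ones, so `README.md` sorts ahead of `docs/a.md` in this sketch.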
- docpack.github_fetcher.find_matching_github_files_from_cloned_folder(domain: str, account: str, repo: str, branch: str, dir_repo: Path, include: list[str], exclude: list[str]) -> list[GitHubFile]
Find and process files from a local clone of a GitHub repository.
This function scans a local directory containing a Git repository clone, matches files based on include/exclude patterns, and converts matching files into GitHubFile objects with appropriate metadata. The function uses the find_matching_files utility to apply pattern filtering.
- Parameters:
domain – The domain of the GitHub instance (e.g., ‘github.com’)
account – The GitHub account or organization name
repo – The name of the GitHub repository
branch – The branch name (e.g., ‘main’, ‘master’) or tag name.
dir_repo – Path to the root of the cloned repository
include – List of glob patterns specifying which files to include (e.g., ["*.py", "docs/**/*.md"])
exclude – List of glob patterns specifying which files to exclude (e.g., ["**/__pycache__/**", "**/.git/**"])
- Returns:
A sorted list of GitHubFile objects representing the matching files from the repository
Note
This function uses get_web_url from git_web_url.api to generate the GitHub URL for each file based on its local path.
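The include/exclude filtering can be approximated with `pathlib` alone. This sketch does not use the module's find_matching_files utility, whose pattern semantics may differ from `Path.glob`; `find_matching_sketch` is a hypothetical name, and it returns relative paths rather than GitHubFile objects.

```python
from pathlib import Path


def find_matching_sketch(dir_repo: Path, include: list[str], exclude: list[str]) -> list[str]:
    """Return sorted relative paths matched by include but not by exclude."""

    def expand(patterns: list[str]) -> set[Path]:
        # Path.glob treats "**" as "zero or more directories".
        return {p for pattern in patterns for p in dir_repo.glob(pattern) if p.is_file()}

    kept = expand(include) - expand(exclude)
    return sorted(p.relative_to(dir_repo).as_posix() for p in kept)
```

For example, calling it with `include=["**/*.py"]` and `exclude=["**/__pycache__/*"]` keeps Python sources while dropping anything stored directly inside a `__pycache__` folder.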
- class docpack.github_fetcher.GitHubPipeline(*, domain: str, account: str, repo: str, branch: str, dir_repo: Path, include: list[str], exclude: list[str], dir_out: Path, wanted_fields: list[str] | None = None)
A data pipeline that extracts and synchronizes files from a GitHub repository to a target location.
GitHubPipeline provides an abstraction for defining a GitHub repository source and a set of file filters, then synchronizing the matching files to a specified output directory. This pipeline handles the entire workflow from selecting files to saving them as structured XML documents that preserve both content and metadata.
- Parameters:
domain – The domain of the GitHub instance (e.g., ‘github.com’)
account – The GitHub account or organization name
repo – The name of the GitHub repository
branch – The branch name (e.g., ‘main’, ‘master’) or tag name.
dir_repo – Path to the root of the cloned repository
include – List of glob patterns specifying which files to include (e.g., ["*.py", "docs/**/*.md"])
exclude – List of glob patterns specifying which files to exclude (e.g., ["**/__pycache__/**", "**/.git/**"])
dir_out – The directory where the XML files should be exported.
- model_post_init(_GitHubPipeline__context: Any) -> None
Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.
- fetch()
Execute the pipeline to extract and export GitHub files to the target directory.
This method performs the complete workflow:
1. Finds all files in the local repository that match the include/exclude patterns
2. Converts each file to a GitHubFile object with metadata
3. Exports each file as an XML document to the specified output directory
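The workflow above can be sketched end to end in a few lines. This is an illustrative approximation under stated assumptions, not the pipeline's implementation: `fetch_sketch` is a hypothetical name, plain dictionaries stand in for GitHubFile objects, exclusion and URI hashing are omitted, and the XML layout and output filenames are invented for the example.

```python
from pathlib import Path
from xml.sax.saxutils import escape


def fetch_sketch(dir_repo: Path, include: list[str], dir_out: Path) -> list[Path]:
    """Find files, wrap them with metadata, and export each as XML."""
    dir_out.mkdir(parents=True, exist_ok=True)
    # Step 1: collect matching files eagerly, in a stable order.
    matched = sorted({p for pat in include for p in dir_repo.glob(pat) if p.is_file()})
    written = []
    for p in matched:
        # Step 2: build a record holding the metadata and the content.
        rel_parts = p.relative_to(dir_repo).parts
        record = {"path": "/".join(rel_parts), "content": p.read_text(encoding="utf-8")}
        # Step 3: serialize the record and write it to the output directory.
        body = "\n".join(f"<{k}>{escape(v)}</{k}>" for k, v in record.items())
        out = dir_out / ("~".join(rel_parts) + ".xml")
        out.write_text(f"<document>\n{body}\n</document>\n", encoding="utf-8")
        written.append(out)
    return written
```

Keeping `dir_out` outside `dir_repo` avoids re-reading exported files on a later run of the sketch.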
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.