github_fetcher

GitHub file extraction and synchronization utilities for documentation packaging.

This module provides tools for retrieving, processing, and exporting files from GitHub repositories, with a focus on preparing content for AI knowledge bases and documentation systems. It includes capabilities for file filtering with glob patterns, metadata enrichment, XML serialization, and structured export. The module’s core components are the GitHubFile class, which represents individual repository files with their content and metadata, and the GitHubPipeline class, which orchestrates the entire process of extracting files matching specific criteria and exporting them to a target location. The resulting exported files preserve both content and contextual information, making them suitable for knowledge extraction, documentation generation, and AI context building.

docpack.github_fetcher.extract_domain(url: str) → str

Extract the domain part from a URL.

This function takes a URL as input and returns just the domain name, removing any protocol prefixes (http://, https://) and any paths or parameters that might follow the domain.

Parameters:

url – A URL string (e.g., “https://github.com/abc-team/xyz-project”)

Returns:

The domain part of the URL (e.g., “github.com”)

Examples:
>>> extract_domain("https://github.com/abc-team/xyz-project")
'github.com'
>>> extract_domain("http://github.com")
'github.com'
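
The behaviour described above can be approximated with the standard library. This is a minimal sketch, not the module's actual implementation: the name extract_domain_sketch is hypothetical, and the handling of scheme-less input is an assumption.

```python
from urllib.parse import urlparse


def extract_domain_sketch(url: str) -> str:
    """Return the host part of a URL, dropping scheme, path, and query.

    Hypothetical stand-in for extract_domain; the real function may
    treat ports, credentials, or scheme-less input differently.
    """
    # urlparse only fills netloc when a scheme (or a leading "//") is
    # present, so prepend "//" for bare inputs like "github.com/abc-team".
    parsed = urlparse(url if "://" in url else f"//{url}")
    # netloc keeps any port or user info that the URL contains.
    return parsed.netloc


print(extract_domain_sketch("https://github.com/abc-team/xyz-project"))  # github.com
print(extract_domain_sketch("http://github.com"))  # github.com
```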

docpack.github_fetcher.get_github_url(domain: str, account: str, repo: str, branch: str, path_parts: tuple[str, ...]) → str

Generate a GitHub URL for a file in a repository.
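
The reference does not state the URL layout that get_github_url produces. The sketch below assumes the usual GitHub "blob" layout for file pages; build_file_url is a hypothetical name used only for illustration.

```python
def build_file_url(
    domain: str,
    account: str,
    repo: str,
    branch: str,
    path_parts: tuple,
) -> str:
    """Compose a file URL from its parts.

    Assumption: files are addressed as
    https://<domain>/<account>/<repo>/blob/<branch>/<path>,
    which is how GitHub displays a file on a branch or tag.
    """
    path = "/".join(path_parts)
    return f"https://{domain}/{account}/{repo}/blob/{branch}/{path}"


url = build_file_url(
    "github.com", "abc-team", "xyz-project", "main", ("docs", "index.md")
)
print(url)  # https://github.com/abc-team/xyz-project/blob/main/docs/index.md
```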

class docpack.github_fetcher.GitHubFile(*, domain: str, account: str, repo: str, branch: str, github_url: str, path_parts: tuple[str, ...], title: str, description: str, content: str)

A data container representing a file in a GitHub repository with metadata and content.

This class provides utilities for working with GitHub files: serializing to an LLM-friendly XML format, generating a unique identifier from the file's location, and exporting the file data to disk.

Parameters:
  • domain – The domain name of the GitHub instance (e.g., ‘github.com’)

  • account – The GitHub account or organization name

  • repo – The name of the GitHub repository

  • branch – The branch name (e.g., ‘main’, ‘master’) or tag name.

  • github_url – The full URL to the file on GitHub. This is usually a calculated value.

  • path_parts – The file path broken into components

  • title – An optional title for the file

  • description – An optional description of the file

  • content – The raw content of the file

property path: str

Get the relative path of the file from the repository root.

Returns:

The path as a string with components joined by ‘/’

to_xml(wanted_fields: list[str] | None = None) → str

Serialize the file data to XML format.

This method generates an XML representation of the file including its GitHub metadata and content, suitable for document storage or AI context input.
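
The element names and the effect of wanted_fields are not documented here. As an illustration only, the sketch below assumes a flat document with one child element per field and treats wanted_fields as an allow-list; file_to_xml and the tag names are hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def file_to_xml(fields: dict[str, str], wanted_fields: list[str] | None = None) -> str:
    """Serialize field name/value pairs as children of a <document> root.

    Assumption: wanted_fields=None means "emit every field"; otherwise
    only the listed fields are emitted.
    """
    root = ET.Element("document")
    for name, value in fields.items():
        if wanted_fields is None or name in wanted_fields:
            # ElementTree escapes characters such as "<" and "&" in text.
            ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = file_to_xml(
    {"github_url": "https://github.com/abc-team/xyz-project", "content": "a < b"},
    wanted_fields=["content"],
)
print(xml_text)  # <document><content>a &lt; b</content></document>
```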

property uri_hash: str

Generate a short hash identifier for the file.

Creates a unique identifier based on the file’s GitHub location including domain, account, repo, branch, and path. This hash can be used for creating unique filenames or identifiers.

Returns:

A 7-character hash string derived from the file’s URI
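
The hash algorithm and the exact string being hashed are not specified. The sketch below assumes SHA-256 over the location components joined with "/", truncated to seven hexadecimal characters; short_uri_hash is a hypothetical name.

```python
import hashlib


def short_uri_hash(
    domain: str,
    account: str,
    repo: str,
    branch: str,
    path_parts: tuple,
    length: int = 7,
) -> str:
    """Derive a short, deterministic identifier from a file's location.

    Assumptions: SHA-256 as the digest and "/" as the separator. The
    real property may use a different digest or URI format.
    """
    uri = "/".join([domain, account, repo, branch, *path_parts])
    return hashlib.sha256(uri.encode("utf-8")).hexdigest()[:length]


file_id = short_uri_hash(
    "github.com", "abc-team", "xyz-project", "main", ("docs", "index.md")
)
print(len(file_id))  # 7
```

A seven-character prefix keeps filenames short, at the cost of a small chance of collisions in very large file sets.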

property breadcrumb_path: str

Create a flattened representation of the file path.

Converts the hierarchical path structure into a single string with path components joined by ‘~’ characters. This format is useful for creating filesystem-safe filenames that preserve path information.

Returns:

The path with components joined by ‘~’ instead of ‘/’
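
For illustration, the relationship between path_parts, path, and breadcrumb_path can be shown with plain string joins. The helper names below are hypothetical stand-ins for the two properties.

```python
def join_path(path_parts: tuple) -> str:
    """Mirror of the documented path property: components joined by '/'."""
    return "/".join(path_parts)


def join_breadcrumb(path_parts: tuple) -> str:
    """Mirror of the documented breadcrumb_path: components joined by '~'."""
    return "~".join(path_parts)


parts = ("docs", "guide", "intro.md")
print(join_path(parts))        # docs/guide/intro.md
print(join_breadcrumb(parts))  # docs~guide~intro.md
```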

export_to_file(dir_out: Path, wanted_fields: list[str] | None = None) → Path

Export the file data as an XML document to the specified directory.

Creates an XML file in the specified directory with a filename that combines the breadcrumb path and URI hash to ensure uniqueness.

Parameters:

dir_out – The directory where the XML file should be saved

Returns:

The path to the created XML file
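
The exact filename format is not given. The sketch below assumes the breadcrumb path and the URI hash are joined with "~" and given an .xml extension; export_xml and its parameters are hypothetical.

```python
import tempfile
from pathlib import Path


def export_xml(dir_out: Path, breadcrumb_path: str, uri_hash: str, xml_text: str) -> Path:
    """Write xml_text into dir_out and return the path of the new file.

    Assumption: the filename is "<breadcrumb_path>~<uri_hash>.xml".
    """
    dir_out.mkdir(parents=True, exist_ok=True)
    path_out = dir_out / f"{breadcrumb_path}~{uri_hash}.xml"
    path_out.write_text(xml_text, encoding="utf-8")
    return path_out


with tempfile.TemporaryDirectory() as tmp:
    written = export_xml(Path(tmp), "docs~index.md", "abc1234", "<document/>")
    print(written.name)  # docs~index.md~abc1234.xml
```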

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

docpack.github_fetcher.sort_github_files(github_file_list: list[GitHubFile]) → list[GitHubFile]

Sort GitHub files by their relative path within the repository.

This function takes a list of GitHubFile objects and returns a new list sorted alphabetically by their path property. Sorting helps maintain consistent ordering when processing or displaying files.

Parameters:

github_file_list – A list of GitHubFile objects to sort

Returns:

A new list containing the same GitHubFile objects but sorted by their paths
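
A sketch of the sorting behaviour, using a hypothetical FileStub class in place of GitHubFile so the example runs without docpack installed:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FileStub:
    """Hypothetical stand-in for GitHubFile exposing only path."""

    path_parts: tuple[str, ...]

    @property
    def path(self) -> str:
        return "/".join(self.path_parts)


def sort_by_path(files: list[FileStub]) -> list[FileStub]:
    # sorted() returns a new list, so the input list is left untouched.
    return sorted(files, key=lambda f: f.path)


files = [FileStub(("src", "b.py")), FileStub(("README.md",)), FileStub(("docs", "a.md"))]
# Plain string comparison puts uppercase letters before lowercase ones.
print([f.path for f in sort_by_path(files)])  # ['README.md', 'docs/a.md', 'src/b.py']
```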

docpack.github_fetcher.find_matching_github_files_from_cloned_folder(domain: str, account: str, repo: str, branch: str, dir_repo: Path, include: list[str], exclude: list[str]) → list[GitHubFile]

Find and process files from a local clone of a GitHub repository.

This function scans a local directory containing a Git repository clone, matches files based on include/exclude patterns, and converts matching files into GitHubFile objects with appropriate metadata. The function uses the find_matching_files utility to apply pattern filtering.

Parameters:
  • domain – The domain of the GitHub instance (e.g., ‘github.com’)

  • account – The GitHub account or organization name

  • repo – The name of the GitHub repository

  • branch – The branch name (e.g., ‘main’, ‘master’) or tag name.

  • dir_repo – Path to the root of the cloned repository

  • include – List of glob patterns specifying which files to include (e.g., [“.py”, “docs/*/*.md”])

  • exclude – List of glob patterns specifying which files to exclude (e.g., [“/__pycache__/”, “/.git/”])

Returns:

A sorted list of GitHubFile objects representing the matching files from the repository

Note

This function uses get_web_url from git_web_url.api to generate the GitHub URL for each file based on its local path.
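
The matching rules of the find_matching_files utility are not described here. The sketch below illustrates include/exclude filtering with the standard library's fnmatch, which is an assumption and may differ from the real glob semantics (for example, fnmatch lets "*" match across "/"). The function name and the example patterns are hypothetical.

```python
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path


def find_matching_paths(dir_root: Path, include: list, exclude: list) -> list:
    """Return sorted relative POSIX paths that match include but not exclude."""
    matched = []
    for p in dir_root.rglob("*"):
        if not p.is_file():
            continue
        rel = p.relative_to(dir_root).as_posix()
        # Keep the file only if at least one include pattern matches...
        if not any(fnmatchcase(rel, pattern) for pattern in include):
            continue
        # ...and no exclude pattern matches.
        if any(fnmatchcase(rel, pattern) for pattern in exclude):
            continue
        matched.append(rel)
    return sorted(matched)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ["a.py", "docs/guide/x.md", "build/__pycache__/b.py", "notes.txt"]:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("x", encoding="utf-8")
    print(find_matching_paths(root, ["*.py", "docs/*.md"], ["*__pycache__*"]))
    # ['a.py', 'docs/guide/x.md']
```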

class docpack.github_fetcher.GitHubPipeline(*, domain: str, account: str, repo: str, branch: str, dir_repo: Path, include: list[str], exclude: list[str], dir_out: Path, wanted_fields: list[str] | None = None)

A data pipeline that extracts and synchronizes files from a GitHub repository to a target location.

GitHubPipeline provides an abstraction for defining a GitHub repository source and a set of file filters, then synchronizing the matching files to a specified output directory. This pipeline handles the entire workflow from selecting files to saving them as structured XML documents that preserve both content and metadata.

Parameters:
  • domain – The domain of the GitHub instance (e.g., ‘github.com’)

  • account – The GitHub account or organization name

  • repo – The name of the GitHub repository

  • branch – The branch name (e.g., ‘main’, ‘master’) or tag name.

  • dir_repo – Path to the root of the cloned repository

  • include – List of glob patterns specifying which files to include (e.g., [“.py”, “docs/*/*.md”])

  • exclude – List of glob patterns specifying which files to exclude (e.g., [“/__pycache__/”, “/.git/”])

  • dir_out – The directory where the XML files should be exported.

model_post_init(_GitHubPipeline__context: Any) → None

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

fetch()

Execute the pipeline to extract and export GitHub files to the target directory.

This method performs the complete workflow:

  1. Finds all files in the local repository that match the include/exclude patterns

  2. Converts each file to a GitHubFile object with metadata

  3. Exports each file as an XML document to the specified output directory
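
The three steps above can be sketched end to end with the standard library. MiniPipeline is a hypothetical, simplified stand-in for GitHubPipeline: it uses fnmatch for filtering, a two-element XML layout, and a filename without the URI hash, all of which are assumptions. Returning the list of written paths is a convenience of the sketch; the documented fetch() does not state a return value.

```python
import tempfile
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import Path
from xml.sax.saxutils import escape


@dataclass
class MiniPipeline:
    dir_repo: Path
    include: list
    exclude: list
    dir_out: Path

    def fetch(self) -> list:
        self.dir_out.mkdir(parents=True, exist_ok=True)
        exported = []
        for p in sorted(self.dir_repo.rglob("*")):
            if not p.is_file():
                continue
            rel = p.relative_to(self.dir_repo).as_posix()
            # Step 1: apply the include/exclude patterns.
            if not any(fnmatchcase(rel, pat) for pat in self.include):
                continue
            if any(fnmatchcase(rel, pat) for pat in self.exclude):
                continue
            # Step 2: pair the file content with its metadata (path only here).
            content = p.read_text(encoding="utf-8")
            xml_text = (
                "<document>\n"
                f"<path>{escape(rel)}</path>\n"
                f"<content>{escape(content)}</content>\n"
                "</document>\n"
            )
            # Step 3: write one XML document per file, flattening the path.
            path_out = self.dir_out / (rel.replace("/", "~") + ".xml")
            path_out.write_text(xml_text, encoding="utf-8")
            exported.append(path_out)
        return exported


with tempfile.TemporaryDirectory() as tmp:
    repo, out = Path(tmp) / "repo", Path(tmp) / "out"
    (repo / "docs").mkdir(parents=True)
    (repo / "docs" / "a.md").write_text("# Title", encoding="utf-8")
    (repo / "b.py").write_text("print('hi')", encoding="utf-8")
    written = MiniPipeline(repo, ["*.md"], [], out).fetch()
    print([w.name for w in written])  # ['docs~a.md.xml']
```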

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.