confluence_fetcher

Confluence page fetching and processing utilities.

class docpack.confluence_fetcher.ConfluencePage(*, page_data: dict[str, Any], site_url: str, id_path: str | None, position_path: str | None, breadcrumb_path: str | None)

A data container for Confluence pages that enriches the API response data with hierarchical metadata and navigation properties.

This class wraps the raw page data returned by Confluence’s get pages API and adds additional attributes for working with page hierarchies and navigation.

Parameters:
  • page_data – The raw item response from the Confluence.get_pages API call

  • site_url – Base URL of the Confluence site

  • id_path – Hierarchical ID-based path (e.g., “/parent_id/child_id”) for filtering with glob patterns

  • position_path – Position-based path (e.g., “/1/3/2”) used for hierarchical sorting

  • breadcrumb_path – Human-readable title hierarchy (e.g., “|| Parent || Child || Page”) similar to UI breadcrumbs

The class assumes the page body is in Atlas Doc Format.

Properties like id, title, parent_id provide convenient access to commonly used attributes from the raw page data.

to_xml(wanted_fields: list[str] | None = None) -> str

Serialize the page data to XML format.

This method generates an XML representation of the page, including its Confluence metadata and content, suitable for document storage or AI context input.
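
As an illustration of field-selective XML serialization, here is a minimal sketch using only the standard library. The function name `page_to_xml`, the `document` root element, and the sample fields are assumptions made for this example; they are not docpack's actual output schema.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Any


def page_to_xml(fields: dict[str, Any], wanted_fields: list[str] | None = None) -> str:
    """Serialize selected fields into an XML string (hypothetical schema)."""
    root = ET.Element("document")
    for name, value in fields.items():
        # None means "keep every field"; otherwise keep only the listed ones.
        if wanted_fields is not None and name not in wanted_fields:
            continue
        child = ET.SubElement(root, name)
        child.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")


sample = {"id": "123456", "title": "Q&A", "breadcrumb_path": "|| Parent || Q&A"}
xml_text = page_to_xml(sample, wanted_fields=["id", "title"])
print(xml_text)
```

Passing `wanted_fields=None` in this sketch keeps all three sample fields.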

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

docpack.confluence_fetcher.fetch_raw_pages_from_space(confluence: Confluence, space_id: int) -> list[ConfluencePage]

Crawls and retrieves all pages from a Confluence space using pagination.

This function fetches raw page data from the Confluence API, converts each page to a ConfluencePage object with minimal initialization, and returns the complete collection without processing hierarchical relationships.

Parameters:
  • confluence – Authenticated Confluence API client

  • space_id – ID of the Confluence space to crawl

Returns:

List of ConfluencePage objects with initialized page_data and site_url, but without hierarchy information (id_path, position_path, breadcrumb_path)
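
The pagination loop described above can be sketched generically. This is not the library's implementation: `collect_all_pages`, the `fetch_batch` callable, and the cursor values are hypothetical stand-ins for whatever the Confluence client returns, and a fake in-memory source replaces the real API.

```python
from __future__ import annotations

from typing import Any, Callable, Optional

# One batch: the items on this "page" of results plus the cursor for the next one.
Batch = tuple[list[dict[str, Any]], Optional[str]]


def collect_all_pages(fetch_batch: Callable[[Optional[str]], Batch]) -> list[dict[str, Any]]:
    """Follow cursors until the source reports that no further batch exists."""
    pages: list[dict[str, Any]] = []
    cursor: Optional[str] = None
    while True:
        items, cursor = fetch_batch(cursor)
        pages.extend(items)
        if cursor is None:  # no next cursor means the last batch was reached
            break
    return pages


# Fake two-batch source standing in for the real API (assumption for the demo).
_FAKE_BATCHES: dict[Optional[str], Batch] = {
    None: ([{"id": "1"}, {"id": "2"}], "cursor-1"),
    "cursor-1": ([{"id": "3"}], None),
}


def fake_fetch_batch(cursor: Optional[str]) -> Batch:
    return _FAKE_BATCHES[cursor]


all_pages = collect_all_pages(fake_fetch_batch)
```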

docpack.confluence_fetcher.enrich_pages_with_hierarchy_data(raw_pages: list[ConfluencePage]) -> list[ConfluencePage]

Enriches Confluence page objects with hierarchical relationship information.

This function processes a list of raw ConfluencePage objects to:

  1. Create ID-based paths (id_path) representing the page hierarchy

  2. Generate position-based paths (position_path) for correct sorting

  3. Build human-readable title hierarchies (breadcrumb_path) for display

The function creates a complete hierarchy tree by iteratively processing pages for up to 20 levels of depth, starting with parent pages and moving to children.

Parameters:

raw_pages – List of ConfluencePage objects with basic data but no hierarchy info

Returns:

List of ConfluencePage objects enriched with hierarchy data and sorted by their position in the hierarchy
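
To make the three path types concrete, the following sketch builds them for plain dictionaries. It is an assumed reconstruction of the idea, not docpack's code: the `build_hierarchy` name and the `parent_id`, `position`, and `title` keys are illustrative, and pages whose parent is absent from the input are treated as roots.

```python
from __future__ import annotations

MAX_DEPTH = 20  # mirrors the depth limit mentioned above


def build_hierarchy(pages: list[dict]) -> list[dict]:
    known_ids = {p["id"] for p in pages}
    resolved: dict[str, dict] = {}
    pending = list(pages)
    # Each pass resolves pages whose parent is already resolved; pages that
    # remain unresolved after MAX_DEPTH passes (e.g. cycles) are dropped.
    for _ in range(MAX_DEPTH):
        if not pending:
            break
        unresolved = []
        for page in pending:
            parent_id = page.get("parent_id")
            if parent_id not in known_ids:
                # Root page: start all three paths from an empty prefix.
                prefix = {"id_path": "", "position_path": "", "breadcrumb_path": ""}
            elif parent_id in resolved:
                prefix = resolved[parent_id]
            else:
                unresolved.append(page)  # parent not ready yet, retry next pass
                continue
            resolved[page["id"]] = {
                **page,
                "id_path": f"{prefix['id_path']}/{page['id']}",
                "position_path": f"{prefix['position_path']}/{page['position']}",
                "breadcrumb_path": f"{prefix['breadcrumb_path']} || {page['title']}".strip(),
            }
        pending = unresolved
    # Sort numerically by position path so that "/1/10" comes after "/1/2".
    return sorted(
        resolved.values(),
        key=lambda p: [int(n) for n in p["position_path"].split("/")[1:]],
    )


enriched = build_hierarchy([
    {"id": "30", "parent_id": "20", "title": "Page", "position": 2},
    {"id": "20", "parent_id": "10", "title": "Child", "position": 1},
    {"id": "10", "parent_id": None, "title": "Parent", "position": 1},
    {"id": "21", "parent_id": "10", "title": "Sibling", "position": 0},
])
```

Comparing position paths as lists of integers rather than as strings is a deliberate choice in this sketch; a plain string sort would order "/1/10" before "/1/2".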

docpack.confluence_fetcher.load_or_build_page_hierarchy(confluence: Confluence, space_id: int, cache: Cache, cache_key: str, expire: int = 86400) -> list[ConfluencePage]

Retrieves a complete Confluence page hierarchy with caching support.

This function either:

  1. Returns a cached page hierarchy if available

  2. Or fetches pages, builds their hierarchy, and caches the result

The function uses a composite cache key consisting of the Confluence URL, space ID, and provided cache key to ensure proper cache isolation. Results are compressed with gzip before caching to reduce storage usage.

Parameters:
  • confluence – Authenticated Confluence API client

  • space_id – ID of the Confluence space to crawl

  • cache_key – Additional key component for cache differentiation (e.g., to cache different point-in-time snapshots of the same space)

Returns:

List of ConfluencePage objects with complete hierarchy data, sorted by their hierarchical position
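
The cache-or-build pattern with a composite key and gzip compression could look roughly like this. The sketch makes several assumptions: a plain dict stands in for the `Cache` object, expiry is omitted, JSON is used as the serialization format, and the key layout in `make_cache_key` is invented for illustration.

```python
from __future__ import annotations

import gzip
import json
from typing import Any, Callable


def make_cache_key(confluence_url: str, space_id: int, cache_key: str) -> str:
    # Combine site, space, and user-supplied key so entries never collide
    # across sites, spaces, or snapshots (hypothetical key layout).
    return f"{confluence_url}|{space_id}|{cache_key}"


def load_or_build(cache: dict[str, bytes], key: str, build: Callable[[], Any]) -> Any:
    blob = cache.get(key)
    if blob is not None:
        # Cache hit: decompress and deserialize the stored hierarchy.
        return json.loads(gzip.decompress(blob).decode("utf-8"))
    value = build()
    # Cache miss: store a gzip-compressed copy to reduce storage usage.
    cache[key] = gzip.compress(json.dumps(value).encode("utf-8"))
    return value


build_calls = []


def build_pages() -> list[dict]:
    build_calls.append(1)  # count how often the expensive build actually runs
    return [{"id": "10", "id_path": "/10"}]


store: dict[str, bytes] = {}
key = make_cache_key("https://example.atlassian.net", 42, "snapshot-1")
first = load_or_build(store, key, build_pages)
second = load_or_build(store, key, build_pages)  # served from the cache
```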

docpack.confluence_fetcher.extract_id(url_or_id: str) -> str

Extract the page ID from a Confluence URL or return the ID if directly provided.

This function handles different Confluence URL formats and extracts the page ID. It also handles cases where the URL has a trailing /* or when just the ID is provided.

Parameters:

url_or_id – A Confluence page URL or direct page ID. Example: “https://example.atlassian.net/wiki/spaces/BD/pages/123456/Value+Proposition” or just “123456”

Returns:

The extracted page ID as a string
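
A regex-based sketch of this kind of extraction follows. It is an independent illustration rather than the function's source: `extract_page_id` is a hypothetical name, and the two URL shapes it recognizes (`/pages/<id>/...` and `?pageId=<id>`) are assumptions about which formats matter.

```python
import re

# Matches ".../pages/123456/..." and "...?pageId=123456" style URLs.
_PAGE_ID_IN_URL = re.compile(r"/pages/(\d+)|[?&]pageId=(\d+)")


def extract_page_id(url_or_id: str) -> str:
    value = url_or_id.strip()
    if value.endswith("/*"):
        value = value[:-2]  # drop a trailing wildcard before parsing
    if value.isdigit():
        return value  # already a bare page ID
    match = _PAGE_ID_IN_URL.search(value)
    if match is None:
        raise ValueError(f"No page ID found in {url_or_id!r}")
    # Exactly one of the two groups is set, depending on the URL shape.
    return match.group(1) or match.group(2)


url = "https://example.atlassian.net/wiki/spaces/BD/pages/123456/Value+Proposition"
print(extract_page_id(url))
print(extract_page_id(url + "/*"))
print(extract_page_id("123456"))
```

Raising `ValueError` on unrecognized input is a choice made for this sketch; the documented function does not state how it handles such input.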

docpack.confluence_fetcher.process_include_exclude(include: list[str], exclude: list[str]) -> tuple[list[str], list[str]]

Process include and exclude patterns for Confluence page IDs or URLs.

This function takes lists of include and exclude patterns that might be Confluence page URLs or IDs, extracts the page IDs from them, and preserves any trailing wildcards (/*). It normalizes all inputs to a consistent format of either just the ID or ID with wildcard.

Parameters:
  • include – List of Confluence page URLs or IDs to include. Items can be full URLs, page IDs, or patterns with a /* suffix.

  • exclude – List of Confluence page URLs or IDs to exclude. Items can be full URLs, page IDs, or patterns with a /* suffix.

Returns:

A tuple of two lists:

  1. Normalized include patterns with extracted IDs

  2. Normalized exclude patterns with extracted IDs
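
The normalization described here can be illustrated with a short sketch. The helper names `normalize_pattern` and `normalize_patterns` are hypothetical, and the sketch only recognizes the `/pages/<id>/` URL shape; anything else is passed through as if it were already an ID.

```python
from __future__ import annotations

import re


def normalize_pattern(pattern: str) -> str:
    """Reduce a URL or ID pattern to "<id>" or "<id>/*"."""
    wildcard = pattern.endswith("/*")
    base = pattern[:-2] if wildcard else pattern
    match = re.search(r"/pages/(\d+)", base)
    page_id = match.group(1) if match else base  # assume a bare ID otherwise
    return f"{page_id}/*" if wildcard else page_id


def normalize_patterns(include: list[str], exclude: list[str]) -> tuple[list[str], list[str]]:
    return (
        [normalize_pattern(p) for p in include],
        [normalize_pattern(p) for p in exclude],
    )


base_url = "https://example.atlassian.net/wiki/spaces/BD/pages"
normalized = normalize_patterns(
    include=[f"{base_url}/123456/Value+Proposition/*", "789"],
    exclude=[f"{base_url}/555/Drafts"],
)
```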

docpack.confluence_fetcher.is_matching(page_mapping: dict[str, ConfluencePage], page: ConfluencePage, include: List[str], exclude: List[str]) -> bool

Determine if a Confluence page matches the include/exclude filtering criteria.

This function implements filtering logic similar to gitignore patterns, where:

  • A page is included if it matches any include pattern

  • A page is excluded if it matches any exclude pattern

  • Patterns with /* suffix match the specified page and all its descendants

  • If no include patterns are provided, all pages are initially included (before exclusions)

Parameters:
  • page_mapping – Dictionary mapping page IDs to their ConfluencePage objects for efficient parent-child relationship lookups

  • page – The ConfluencePage object to check against the filters

  • include – List of normalized page IDs or page ID patterns (with /* suffix) to include in results. This is the processed include list returned by process_include_exclude().

  • exclude – List of normalized page IDs or page ID patterns (with /* suffix) to exclude from results. This is the processed exclude list returned by process_include_exclude().

Returns:

True if the page should be included in the results, False otherwise
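
The rules above can be expressed compactly if each page carries an ID path such as "/10/20/30". The sketch below takes that route instead of walking a page mapping, so it differs from the documented signature; `matches_pattern` and `should_include` are hypothetical names, and exclusion is assumed to take precedence over inclusion.

```python
from __future__ import annotations


def matches_pattern(id_path: str, pattern: str) -> bool:
    """Check one normalized pattern ("<id>" or "<id>/*") against an ID path."""
    lineage = id_path.strip("/").split("/")  # ancestors first, the page itself last
    if pattern.endswith("/*"):
        # Wildcard: match the page itself or any of its descendants.
        return pattern[:-2] in lineage
    return lineage[-1] == pattern  # exact match on the page only


def should_include(id_path: str, include: list[str], exclude: list[str]) -> bool:
    # An empty include list means every page starts out included.
    included = not include or any(matches_pattern(id_path, p) for p in include)
    excluded = any(matches_pattern(id_path, p) for p in exclude)
    return included and not excluded


print(should_include("/10/20/30", include=["10/*"], exclude=[]))
print(should_include("/10/20/30", include=["10/*"], exclude=["20/*"]))
print(should_include("/10/20/30", include=["10/*"], exclude=["20"]))
```

In this sketch, excluding "20" without the wildcard removes only page 20 and leaves its descendant "/10/20/30" in place.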

docpack.confluence_fetcher.find_matching_pages(sorted_pages: list[ConfluencePage], include: List[str], exclude: List[str])

Filter Confluence pages based on include/exclude patterns similar to gitignore.

This function lets you specify which pages to include or exclude using either direct page IDs or hierarchical patterns. It accepts URLs or IDs and supports a /* suffix that matches a page and all its descendants (like a folder).

Filtering logic follows these rules:

  1. First, normalize all URL or ID patterns to a consistent format

  2. Pages matching any include pattern are considered (or all pages if no include patterns are given)

  3. Then, any page matching an exclude pattern is filtered out

  4. Patterns with /* match the specified page and all its descendants

Parameters:
  • sorted_pages – List of ConfluencePage objects sorted by hierarchy (typically from enrich_pages_with_hierarchy_data)

  • include – List of Confluence page URLs or IDs to include. Can be full URLs, page IDs, or patterns with a /* suffix.

  • exclude – List of Confluence page URLs or IDs to exclude. Can be full URLs, page IDs, or patterns with a /* suffix.

Returns:

Filtered list of ConfluencePage objects that match the criteria

class docpack.confluence_fetcher.ConfluencePipeline(*, confluence: Confluence, space_id: int | str, include: list[str], exclude: list[str], dir_out: Path, cache_key: str, cache_expire: int = 86400, cache_path: str = '/home/docs/docpack/.cache', wanted_fields: list[str] | None = None)

A data pipeline that extracts and synchronizes Confluence pages to a target location.

ConfluencePipeline provides an abstraction for defining a Confluence space source and filtering criteria, then exporting the matching pages to a specified output directory as structured XML documents that preserve both content and metadata.

The pipeline handles the complete workflow from authentication to content extraction, hierarchical processing, filtering, and file export with metadata preservation.

Example:

confluence_pipeline = ConfluencePipeline(
    confluence=confluence,
    space_id=space_id,
    # dir_out has no default in the signature: the directory that receives the XML files
    dir_out=dir_out,
    # Use a cache key to avoid re-fetching the same page hierarchy.
    # All pages are stored in the cache and reused for filtering
    # when you change the include/exclude patterns.
    cache_key=cache_key,
    include=[
        # include this page and all its descendants
        f"{confluence.url}/wiki/spaces/{space_key}/pages/{page_id}/{page_title}/*",
        # include only this page, not its descendants
        f"{confluence.url}/wiki/spaces/{space_key}/pages/{page_id}/{page_title}",
    ],
    exclude=[
        # exclude this page and all its descendants
        f"{confluence.url}/wiki/spaces/{space_key}/pages/{page_id}/{page_title}/*",
        # exclude only this page, not its descendants
        f"{confluence.url}/wiki/spaces/{space_key}/pages/{page_id}/{page_title}",
    ],
)

Parameters:
  • confluence – Authenticated Confluence API client instance

  • space_id – Space ID (int) or space key (str) of the Confluence space to process

  • include – List of patterns (URLs or IDs) specifying which pages to include. Use Page URL + /* to include all children of a page.

  • exclude – List of patterns (URLs or IDs) specifying which pages to exclude. Use Page URL + /* to exclude all children of a page.

  • dir_out – The directory where the XML files should be exported

  • cache_key – Key for caching and retrieving page hierarchies

  • cache_expire – Cache expiration time in seconds (default: 24 hours)

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

post_process_confluence_page(confluence_page: ConfluencePage) -> ConfluencePage

Post-process the ConfluencePage object after fetching it.

Users can override this method to add custom processing logic.

post_process_path_out(confluence_page: ConfluencePage, path_out: Path)

Post-process the output path after exporting a Confluence page.

fetch()

Execute the pipeline to extract and export Confluence pages to the target directory.

This method performs the complete workflow:

  1. Lists all pages in the given Confluence space that match the include/exclude patterns

  2. Converts each page to a ConfluencePage object with metadata

  3. Exports each page as an XML document to the specified output directory