{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "82fd6ca995305743",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# GitHub Document Fetching Tutorial\n",
    "\n",
    "This tutorial demonstrates how to use the [docpack](https://github.com/MacHu-GWU/docpack-project) library to efficiently extract and process documentation from GitHub repositories. Whether you're building an AI knowledge base, centralizing documentation across multiple projects, or preparing content for analysis, ``docpack`` provides a streamlined way to fetch, transform, and organize GitHub files with rich metadata. By following this notebook, you'll learn how to set up document extraction pipelines that can filter files using glob patterns, preserve contextual information like repository structure, and output consistently formatted files ready for integration with AI systems or documentation platforms. This approach eliminates manual document gathering and ensures your knowledge base stays synchronized with your source repositories."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "681098edf559852c",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Setting Up the GitHub Fetching Pipeline\n",
    "\n",
    "Before diving into the actual implementation, let's understand how to use the ``GitHubPipeline`` class to extract documents from GitHub repositories. This approach allows you to selectively fetch files while preserving their original context and metadata.\n",
    "The code below demonstrates setting up a pipeline that fetches selected files from a GitHub repository. The pipeline uses glob patterns to determine which files to include and exclude - a powerful pattern-matching approach similar to ``.gitignore`` files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a7687c01351275b8",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-03-03T04:20:46.651804Z",
     "start_time": "2025-03-03T04:20:46.546678Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "import shutil\n",
    "from pathlib import Path\n",
    "\n",
    "from docpack.github_fetcher import GitHubPipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1ebc280ae54d83ae",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-03-03T04:20:46.708388Z",
     "start_time": "2025-03-03T04:20:46.704086Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dir_tmp = /Users/sanhehu/Documents/GitHub/docpack-project/docs/source/02-GitHub-Fetcher/tmp\n",
      "dir_repo = /Users/sanhehu/Documents/GitHub/docpack-project\n"
     ]
    }
   ],
   "source": [
    "dir_here = Path.cwd().absolute()\n",
    "dir_tmp = dir_here / \"tmp\"\n",
    "print(f\"{dir_tmp = !s}\")\n",
    "shutil.rmtree(dir_tmp, ignore_errors=True)\n",
    "\n",
    "dir_repo = Path.cwd().parent.parent.parent\n",
    "print(f\"{dir_repo = !s}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "498ee7ce87d2153f",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "**Key Points About File Filtering**\n",
    "\n",
    "Include-Exclude Pattern Matching:\n",
    "\n",
    "1. The pipeline uses glob patterns to filter files\n",
    "    - If a file matches ANY pattern in the include list, it's initially considered for inclusion\n",
    "    - If a file matches ANY pattern in the exclude list, it's excluded regardless of inclusion rules\n",
    "    - If both include and exclude patterns match a file, exclusion takes precedence\n",
    "2. Understanding Glob Patterns:\n",
    "    - Patterns are relative to the repository root\n",
    "    - ``**/*.py`` means \"any Python file in any directory or subdirectory\"\n",
    "    - ``/**/`` indicates recursive matching through all subdirectories\n",
    "    - For expression in ``exclude``, it is not exactly the glob pattern. For example ``exclude = [\"docpack/tests/**\", ...]`` means exclude any file in ``docpack/tests`` directory but not in subdirectories, in order to exclude all files in ``docpack/tests`` directory and subdirectories, you have to do both.\n",
    "    - Simple patterns like ``README.rst`` match specific files\n",
    "3. ``dir_out`` specifies the directory where the AI friendly XML files should be exported.\n",
    "\n",
    "When the ``fetch()`` method is called, the pipeline processes all files matching the include patterns (but not matching exclude patterns) and exports them to the specified output directory with their metadata intact."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "f23aaeb74c4f377f",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-03-03T04:20:47.585425Z",
     "start_time": "2025-03-03T04:20:47.575874Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "gh_pipeline = GitHubPipeline(\n",
    "    domain=\"github.com\",\n",
    "    account=\"MacHu-GWU\",\n",
    "    repo=\"docpack-project\",\n",
    "    branch=\"main\",\n",
    "    dir_repo=dir_repo,\n",
    "    include=[\n",
    "        f\"README.rst\",\n",
    "        \"docpack/**/*.py\",\n",
    "        \"tests/**/*.py\",\n",
    "        \"docs/source/**/index.rst\",\n",
    "    ],\n",
    "    exclude=[\n",
    "        \"docpack/tests/**\",\n",
    "        \"docpack/tests/**/*.*\",\n",
    "        \"docpack/vendor/**\",\n",
    "        \"docpack/vendor/**/*.*\",\n",
    "        \"tests/all.py\",\n",
    "        \"tests/**/all.py\",\n",
    "        \"docs/source/index.rst\",\n",
    "    ],\n",
    "    dir_out=dir_tmp,\n",
    ")\n",
    "gh_pipeline.fetch()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fa182f37af44f82",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Exploring the Extracted Documents\n",
    "\n",
    "After running the pipeline, let's examine the output files to understand what docpack produces. The code below displays the first 500 characters from five randomly selected output files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9cc4b0a7c9dda37",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-03-03T04:22:33.364078Z",
     "start_time": "2025-03-03T04:22:33.361267Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#============================================================\n",
      "| content of 'docpack~paths.py~e12af5d.xml'\n",
      "#============================================================\n",
      "<document>\n",
      "  <source_type>GitHub Repository</source_type>\n",
      "  <github_url>https://github.com/MacHu-GWU/docpack-project/blob/main/docpack/paths.py</github_url>\n",
      "  <account>MacHu-GWU</account>\n",
      "  <repo>docpack-project</repo>\n",
      "  <branch>main</branch>\n",
      "  <path>docpack/paths.py</path>\n",
      "  <content>\n",
      "# -*- coding: utf-8 -*-\n",
      "\n",
      "from pathlib import Path\n",
      "\n",
      "dir_here = Path(__file__).absolute().parent\n",
      "dir_package = dir_here\n",
      "PACKAGE_NAME = dir_here.name\n",
      "dir_home = Path.home()\n",
      "\n",
      "dir_project_root = dir_here.parent\n",
      "dir_tmp\n",
      "...\n",
      "#============================================================\n",
      "| content of 'tests~test_api.py~9fb2de7.xml'\n",
      "#============================================================\n",
      "<document>\n",
      "  <source_type>GitHub Repository</source_type>\n",
      "  <github_url>https://github.com/MacHu-GWU/docpack-project/blob/main/tests/test_api.py</github_url>\n",
      "  <account>MacHu-GWU</account>\n",
      "  <repo>docpack-project</repo>\n",
      "  <branch>main</branch>\n",
      "  <path>tests/test_api.py</path>\n",
      "  <content>\n",
      "# -*- coding: utf-8 -*-\n",
      "\n",
      "from docpack import api\n",
      "\n",
      "\n",
      "def test():\n",
      "    _ = api\n",
      "\n",
      "\n",
      "if __name__ == \"__main__\":\n",
      "    from docpack.tests import run_cov_test\n",
      "\n",
      "    run_cov_test(__file__, \"docpack.api\", preview=False)\n",
      "\n",
      "  </c\n",
      "...\n",
      "#============================================================\n",
      "| content of 'docpack~__init__.py~98a2489.xml'\n",
      "#============================================================\n",
      "<document>\n",
      "  <source_type>GitHub Repository</source_type>\n",
      "  <github_url>https://github.com/MacHu-GWU/docpack-project/blob/main/docpack/__init__.py</github_url>\n",
      "  <account>MacHu-GWU</account>\n",
      "  <repo>docpack-project</repo>\n",
      "  <branch>main</branch>\n",
      "  <path>docpack/__init__.py</path>\n",
      "  <content>\n",
      "# -*- coding: utf-8 -*-\n",
      "\n",
      "\"\"\"\n",
      "DocPack efficiently consolidates documentation from GitHub, Confluence, and files into a structured knowledge base for AI access.\n",
      "\"\"\"\n",
      "\n",
      "from ._version import __version__\n",
      "\n",
      "__short_\n",
      "...\n",
      "#============================================================\n",
      "| content of 'tests~test_github_fetcher.py~b480033.xml'\n",
      "#============================================================\n",
      "<document>\n",
      "  <source_type>GitHub Repository</source_type>\n",
      "  <github_url>https://github.com/MacHu-GWU/docpack-project/blob/main/tests/test_github_fetcher.py</github_url>\n",
      "  <account>MacHu-GWU</account>\n",
      "  <repo>docpack-project</repo>\n",
      "  <branch>main</branch>\n",
      "  <path>tests/test_github_fetcher.py</path>\n",
      "  <content>\n",
      "# -*- coding: utf-8 -*-\n",
      "\n",
      "import shutil\n",
      "from docpack.github_fetcher import (\n",
      "    extract_domain,\n",
      "    GitHubPipeline,\n",
      ")\n",
      "from docpack.paths import (\n",
      "    dir_project_root,\n",
      "    dir_tmp,\n",
      "    PACK\n",
      "...\n",
      "#============================================================\n",
      "| content of 'docpack~github_fetcher.py~6458e19.xml'\n",
      "#============================================================\n",
      "<document>\n",
      "  <source_type>GitHub Repository</source_type>\n",
      "  <github_url>https://github.com/MacHu-GWU/docpack-project/blob/main/docpack/github_fetcher.py</github_url>\n",
      "  <account>MacHu-GWU</account>\n",
      "  <repo>docpack-project</repo>\n",
      "  <branch>main</branch>\n",
      "  <path>docpack/github_fetcher.py</path>\n",
      "  <content>\n",
      "# -*- coding: utf-8 -*-\n",
      "\n",
      "\"\"\"\n",
      "GitHub file extraction and synchronization utilities for documentation packaging.\n",
      "\n",
      "This module provides tools for retrieving, processing, and exporting files from Git\n",
      "...\n"
     ]
    }
   ],
   "source": [
    "for path in list(dir_tmp.glob(\"*.xml\"))[:5]:\n",
    "    print(\"#\" + \"=\" * 60)\n",
    "    print(f\"| content of {path.name!r}\")\n",
    "    print(\"#\" + \"=\" * 60)\n",
    "    print(path.read_text()[:500] + \"\\n...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "121bd2ee3ced0836",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Each extracted file is saved in a structured XML format that preserves both the content and rich contextual metadata from the original GitHub repository. The XML structure includes:\n",
    "\n",
    "- ``<source_type>``: Identifies the origin as a GitHub repository\n",
    "- ``<github_url>``: Direct link to view the file in GitHub's web interface\n",
    "- ``<account>``, ``<repo>``, ``<branch>``: Repository metadata for complete context\n",
    "- ``<path>``: The file's location within the repository structure\n",
    "- ``<content>``: The actual file content, preserved exactly as in the original\n",
    "\n",
    "This format is particularly valuable for AI-powered documentation systems and knowledge bases because it maintains the complete context of each file. When using this data with an AI assistant, the assistant can not only access the file content but also provide source links for verification, understand the file's position in the project hierarchy, and maintain references to the original repository.\n",
    "\n",
    "Notice how the filenames contain both the path (with '~' replacing '/') and a unique hash, ensuring each file has a distinct, identifiable name while preserving its original location information."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}