=====================================
``libs.file`` Subpackage
=====================================

This subpackage provides convenient file and directory manipulation
utilities, including functions for:

- Splitting text or token streams into chunks
- Reading/writing files
- Listing or filtering directory contents
- Copying files
- Processing entire directories concurrently for chunked ingestion

--------------------------------------
1) ``chunk.py``: Splitting into Chunks
--------------------------------------

.. module:: lionagi.libs.file.chunk

Contains utilities for splitting text or token lists into smaller
sections, with optional overlap. It also supports bundling metadata with
each chunk and can optionally tokenize text using a custom tokenizer.

.. function:: chunk_by_chars(text: str, chunk_size=2048, overlap=0, threshold=256) -> list[str]

   Split a *string* into multiple chunks of roughly ``chunk_size``
   characters. If the text is short enough to fit in a single chunk, the
   entire text is returned as one chunk. Overlap is optional, expressed
   as a fraction of ``chunk_size``. If the last chunk is shorter than
   ``threshold``, it is merged into the previous chunk.

   **Parameters**:

   - **text** (str): The entire text to chunk.
   - **chunk_size** (int): Target length of each chunk in characters.
   - **overlap** (float): Fraction of chunk size to overlap at chunk edges.
   - **threshold** (int): Minimum size of the last chunk (avoids a tiny
     leftover chunk).

   **Returns**:

   - list of str: The resulting chunks.

   **Example**::

      >>> text = "This is a sample text for chunking."
      >>> chunks = chunk_by_chars(text, 10, 0.2)
      >>> chunks
      ['This is a ', 'a sample t', ... ]

.. function:: chunk_by_tokens(tokens: list[str], chunk_size=1024, overlap=0, threshold=128, return_tokens=False) -> list[str | list[str]]

   Split a list of tokens into multiple chunks of roughly ``chunk_size``
   tokens. Overlap is optional, again expressed as a fraction of the
   chunk size. If the final chunk is shorter than ``threshold``, it is
   merged into the previous chunk.

   **Parameters**:

   - **tokens** (list[str]): The token list to chunk.
   - **chunk_size** (int): Maximum tokens per chunk.
   - **overlap** (float): Fraction of chunk size to overlap.
   - **threshold** (int): Minimum token count for the last chunk.
   - **return_tokens** (bool): If True, return lists of tokens;
     otherwise, return joined strings.

   **Returns**:

   - list of either str or list[str]

   **Example**::

      >>> tokens = ["This", "is", "a", "sample", "text"]
      >>> chunk_by_tokens(tokens, 3, 0.2)
      ['This is a', 'a sample text']
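To make the overlap and threshold semantics concrete, here is a minimal,
self-contained sketch of the sliding-window logic described above. The
function name ``naive_chunk_by_chars`` is hypothetical; this is an
illustration of the documented behavior (windows advanced by
``chunk_size`` minus the overlap, with an undersized tail merged into
its predecessor), not lionagi's actual implementation, whose exact
boundary handling may differ:

.. code-block:: python

   # Illustrative sketch only; not the lionagi implementation.
   def naive_chunk_by_chars(
       text: str,
       chunk_size: int = 2048,
       overlap: float = 0.0,
       threshold: int = 256,
   ) -> list[str]:
       """Sliding-window chunking with fractional overlap."""
       if len(text) <= chunk_size:
           return [text]

       # Each chunk starts `chunk_size - overlap_chars` after the previous
       # one, so adjacent chunks share `overlap_chars` characters.
       overlap_chars = int(chunk_size * overlap)
       step = max(1, chunk_size - overlap_chars)
       starts = list(range(0, len(text), step))
       chunks = [text[s : s + chunk_size] for s in starts]

       # Merge an undersized trailing chunk into its predecessor by
       # extending the previous window to the end of the text.
       if len(chunks) > 1 and len(chunks[-1]) < threshold:
           chunks.pop()
           chunks[-1] = text[starts[len(chunks) - 1] :]
       return chunks

   print(naive_chunk_by_chars("This is a sample text for chunking.", 10, 0.2))
   # ['This is a ', 'a sample t', ' text for ', 'r chunking.']

With ``chunk_size=10`` and ``overlap=0.2``, each window starts 8
characters after the previous one, so adjacent chunks share 2
characters; the 3-character tail falls under the default ``threshold``
and is folded into the last window.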
.. function:: chunk_content(content: str, chunk_by="chars", tokenizer=str.split, chunk_size=1024, overlap=0, threshold=256, metadata=None, return_tokens=False, **kwargs) -> list[dict[str, Any]]

   High-level function to chunk a large string, either by characters or
   by tokens (using a given tokenizer), and attach optional metadata.
   Returns a list of dictionaries, one per chunk, each containing:

   - ``"chunk_content"``: The chunk string or token list
   - ``"chunk_id"``, ``"total_chunks"``
   - ``"chunk_size"``: The length of the chunk
   - Additional fields from *metadata*

   **Parameters**:

   - **content** (str): The content to chunk.
   - **chunk_by** ({"chars", "tokens"}): Splitting method.
   - **tokenizer** (Callable): A function that splits the text into
     tokens (only used if ``chunk_by="tokens"``).
   - **chunk_size** (int): The nominal chunk length in characters or tokens.
   - **overlap** (float): Fraction of chunk size for overlap.
   - **threshold** (int): Minimum size for the final chunk.
   - **metadata** (dict | None): Additional data to attach to each chunk.
   - **return_tokens** (bool): If True and ``chunk_by="tokens"``, store
     token lists instead of joined strings.

   **Returns**:

   - list of dict: Each dict describes a chunk plus its metadata.

-----------------------------------
2) ``ops.py``: File-level Utilities
-----------------------------------

.. module:: lionagi.libs.file.ops

General-purpose file reading, copying, and listing:

.. function:: copy_file(src, dest) -> None

   Copy a single file from *src* to *dest*, preserving metadata. Raises
   an error if the source file does not exist or permissions are
   insufficient.

.. function:: get_file_size(path) -> int

   Return the size (in bytes) of a single file, or the total size of all
   files under a directory path. Raises an exception if the path is
   invalid or inaccessible.

.. function:: list_files(dir_path, extension=None) -> list[Path]

   Recursively list all files in *dir_path*. If *extension* is given,
   only include files with the matching suffix.

.. function:: read_file(path) -> str

   Read the contents of *path* (UTF-8) and return the text. Raises
   ``FileNotFoundError`` or ``PermissionError`` as appropriate.
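A short sketch of composing these helpers, using only the signatures
documented above. The ``docs`` and ``backup`` paths are hypothetical
placeholders:

.. code-block:: python

   from lionagi.libs.file.ops import copy_file, get_file_size, list_files, read_file

   # Hypothetical paths; substitute your own.
   md_files = list_files("docs", extension="md")  # recursive, filtered by suffix
   total = get_file_size("docs")                  # total bytes under the directory
   print(f"{len(md_files)} markdown files, {total} bytes total")

   if md_files:
       text = read_file(md_files[0])  # UTF-8 read
       # Metadata-preserving copy; assumes the backup/ directory exists.
       copy_file(md_files[0], "backup/" + md_files[0].name)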
------------------------------------------
3) ``dir_process.py``: Directory Handling
------------------------------------------

.. module:: lionagi.libs.file.dir_process

Tools for processing entire directories in a concurrent or chunked
manner.

.. function:: dir_to_files(directory, file_types=None, max_workers=None, ignore_errors=False, verbose=False) -> list[Path]

   Recursively discover all files in *directory* (and its
   subdirectories). Optionally filter by a list of extensions. Uses a
   ``ThreadPoolExecutor`` to handle concurrency. If *ignore_errors* is
   True, file-access issues are logged as warnings instead of raised.

   **Returns**:

   - list[Path]: The discovered file paths.

.. function:: file_to_chunks(file_path, chunk_func, chunk_size=1500, overlap=0.1, threshold=200, encoding="utf-8", custom_metadata=None, output_dir=None, verbose=False, timestamp=True, random_hash_digits=4) -> list[dict[str, Any]]

   Read the text from *file_path*, then call *chunk_func* to split it
   into smaller chunks. Each chunk is returned as a dictionary with
   metadata including the file name, size, and so on. If *output_dir* is
   given, each chunk is also written to a separate JSON file.

   **Parameters**:

   - **file_path** (str | Path): The file to process.
   - **chunk_func** (Callable): A function for chunking the text (e.g.,
     :func:`chunk_by_chars`).
   - **chunk_size**, **overlap**, **threshold**: Passed to the chunker.
   - **encoding** (str): File read encoding.
   - **custom_metadata** (dict | None): Additional metadata to attach to
     each chunk.
   - **output_dir** (Path | None): If not None, writes each chunk to a
     JSON file in that directory.
   - **timestamp** (bool), **random_hash_digits** (int): Used when naming
     chunk files.

   **Returns**:

   - list of dict: The chunk definitions.

-------------------------------------------
4) ``save.py``: Saving Text or Chunk Files
-------------------------------------------

.. module:: lionagi.libs.file.save

Utilities to save string or chunk data to disk, often used after
chunking.

.. function:: save_to_file(text, directory, filename, extension=None, timestamp=False, dir_exist_ok=True, file_exist_ok=False, time_prefix=False, timestamp_format=None, random_hash_digits=0, verbose=True) -> Path

   Create a path via :func:`lionagi.utils.create_path` and write *text*
   to it using UTF-8. Logs the resulting path if *verbose* is True.

   **Parameters**:

   - **text** (str): The text to save.
   - **directory** (str | Path): Directory in which to place the file.
   - **filename** (str): Base name (without extension, unless specified).
   - **extension** (str | None): If given, appended to the filename with
     a dot.
   - **timestamp** (bool): If True, embed the time in the filename.
   - **random_hash_digits** (int): Add a short random suffix.
   - **verbose** (bool): Print/log the file path after saving.

   **Returns**:

   - Path: The final path that was written.

.. function:: save_chunks(chunks, output_dir, verbose, timestamp, random_hash_digits)

   Helper to save chunk dictionaries to JSON files, each with a name like
   ``chunk_1_.json``. Each chunk is written as pretty-printed JSON.

-------------
Usage Example
-------------

Below is a demonstration of how you might combine modules from this
subpackage:

.. code-block:: python

   from lionagi.libs.file.chunk import chunk_by_chars, chunk_content
   from lionagi.libs.file.ops import read_file, list_files
   from lionagi.libs.file.dir_process import file_to_chunks

   # 1) List .txt files in a directory
   text_files = list_files("my_dir", extension="txt")

   # 2) Read the first file
   content = read_file(text_files[0])

   # 3) Chunk by characters
   chunks = chunk_by_chars(content, chunk_size=500, overlap=0.1)

   # 4) Alternatively, chunk with metadata
   meta_chunks = chunk_content(
       content,
       chunk_by="chars",
       chunk_size=500,
       overlap=0.1,
       metadata={"source": "my_dir/myfile.txt"},
   )

   # 5) Optionally store chunked results
   from lionagi.libs.file.save import save_chunks
   save_chunks(
       meta_chunks,
       "output_chunk_dir",
       verbose=True,
       timestamp=True,
       random_hash_digits=2,
   )

   # 6) Or process an entire file in one go:
   results = file_to_chunks(
       "my_dir/myfile.txt",
       chunk_func=chunk_by_chars,
       chunk_size=500,
       overlap=0.1,
       output_dir="output_chunk_dir",
   )

All together, the modules in ``lionagi.libs.file`` provide consistent,
straightforward manipulation of file data, particularly in multi-file
contexts where chunk-based ingestion is needed.
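Finally, while the usage example above persists chunks via
:func:`save_chunks` and :func:`file_to_chunks`, one-off text artifacts
can be written with :func:`save_to_file`. A minimal sketch, using only
the parameters documented above; the ``logs`` directory and
``ingest_report`` filename are hypothetical:

.. code-block:: python

   from lionagi.libs.file.save import save_to_file

   # Hypothetical report text and output location.
   path = save_to_file(
       "chunking run complete",
       directory="logs",
       filename="ingest_report",
       extension="txt",       # appended with a dot
       timestamp=True,        # embed the run time in the filename
       random_hash_digits=4,  # short random suffix to avoid collisions
   )
   print(path)  # the final Path that was written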