libs.file Subpackage¶
This subpackage provides convenient file and directory manipulation utilities, including functions for:
- Splitting text or token streams into chunks
- Reading/writing files
- Listing or filtering directory contents
- Copying files
- Processing entire directories concurrently for chunked ingestion
1) chunk.py: Splitting into Chunks¶
Contains utilities for splitting text or token lists into smaller sections, with optional overlap. It also supports bundling metadata with each chunk, and can optionally tokenize text using a custom tokenizer.
- lionagi.libs.file.chunk.chunk_by_chars(text: str, chunk_size=2048, overlap=0, threshold=256) → list[str] ¶
Split a string into multiple chunks of roughly chunk_size characters. If the text is short enough for only one chunk, the entire text is returned as a single chunk. Overlap is optional, expressed as a fraction of chunk_size. If the last chunk is under the threshold, it merges with the previous chunk.
Parameters:
- text (str): The entire text to chunk.
- chunk_size (int): Target length of each chunk in characters.
- overlap (float): Fraction of chunk_size to overlap between adjacent chunks.
- threshold (int): Minimum size for the last chunk (avoids a tiny leftover chunk).
Returns:
- list of str: The split chunks.
Example:
>>> text = "This is a sample text for chunking."
>>> chunks = chunk_by_chars(text, 10, 0.2)
>>> chunks
['This is a ', 'a sample t', ... ]
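For intuition, here is a minimal sketch of the behavior described above. It is not the library's actual implementation; it only assumes the documented semantics (overlap as a fraction of chunk_size, and a final chunk below threshold being merged into its predecessor):

def chunk_by_chars_sketch(text, chunk_size=2048, overlap=0.0, threshold=256):
    # Illustrative only; see chunk_by_chars for the real function.
    if len(text) <= chunk_size:
        return [text]
    # Each chunk starts `step` characters after the previous one,
    # so consecutive chunks share int(chunk_size * overlap) characters.
    step = max(chunk_size - int(chunk_size * overlap), 1)
    starts = list(range(0, len(text), step))
    # Merge a too-small trailing chunk into the previous one.
    if len(starts) > 1 and len(text) - starts[-1] < threshold:
        starts.pop()
        return [text[i : i + chunk_size] for i in starts[:-1]] + [text[starts[-1]:]]
    return [text[i : i + chunk_size] for i in starts]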
- lionagi.libs.file.chunk.chunk_by_tokens(tokens: list[str], chunk_size=1024, overlap=0, threshold=128, return_tokens=False) → list[str | list[str]] ¶
Split a list of tokens into multiple chunks of roughly chunk_size tokens each. Overlap is optional, again as a fraction of chunk_size. If the final chunk is under threshold tokens, it merges with the previous chunk.
Parameters:
- tokens (list[str]): The token list to chunk.
- chunk_size (int): Max tokens per chunk.
- overlap (float): Fraction of chunk_size to overlap.
- threshold (int): Minimum token count for the last chunk.
- return_tokens (bool): If True, returns lists of tokens; otherwise, returns joined strings.
Returns:
- list of either str or list[str]
Example:
>>> tokens = ["This", "is", "a", "sample", "text"]
>>> chunk_by_tokens(tokens, 3, 0.2)
['This is a', 'a sample text']
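If you need the raw token lists rather than joined strings, pass return_tokens=True. With the same input as above, the result would look roughly like:

>>> chunk_by_tokens(tokens, 3, 0.2, return_tokens=True)
[['This', 'is', 'a'], ['a', 'sample', 'text']]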
- lionagi.libs.file.chunk.chunk_content(content: str, chunk_by='chars', tokenizer=str.split, chunk_size=1024, overlap=0, threshold=256, metadata=None, return_tokens=False, **kwargs) → list[dict[str, Any]] ¶
High-level function to chunk a big string, either by characters or by tokens (using a given tokenizer), and attach optional metadata. Returns a list of dictionaries, one per chunk, each containing:
- "chunk_content": The chunk string or token list
- "chunk_id", "total_chunks": The chunk's position and the overall chunk count
- "chunk_size": The length of the chunk
- Additional fields from metadata
Parameters:
- content (str): The content to chunk.
- chunk_by ({"chars", "tokens"}): Splitting method.
- tokenizer (Callable): A function that splits the text into tokens (only used if chunk_by="tokens").
- chunk_size (int): The nominal chunk length in chars or tokens.
- overlap (float): Fraction of chunk size for overlap.
- threshold (int): Minimum size for the final chunk.
- metadata (dict | None): Additional data to attach to each chunk.
- return_tokens (bool): If True and chunk_by="tokens", store token lists instead of joined strings.
Returns:
- list of dict: Each dict describes a chunk plus its metadata.
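For example, chunking a short string with attached metadata yields dictionaries shaped roughly like the following (the field values shown in the comment are illustrative):

from lionagi.libs.file.chunk import chunk_content

chunks = chunk_content(
    "This is a sample text for chunking.",
    chunk_by="chars",
    chunk_size=20,
    overlap=0.1,
    threshold=5,
    metadata={"source": "example.txt"},
)
# Each entry bundles the chunk with its bookkeeping fields, e.g.:
# {"chunk_content": "This is a sample tex", "chunk_id": 1,
#  "total_chunks": 2, "chunk_size": 20, "source": "example.txt"}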
2) ops.py: File-level Utilities¶
General file reading, copying, listing:
- lionagi.libs.file.ops.copy_file(src, dest) → None ¶
Copy a single file from src to dest, preserving metadata. Raises errors if the file doesn’t exist or permissions are invalid.
- lionagi.libs.file.ops.get_file_size(path) → int ¶
Returns the size (in bytes) of a single file or total size of all files under a directory path. Raises exceptions if path is invalid or there’s no permission.
- lionagi.libs.file.ops.list_files(dir_path, extension=None) → list[Path] ¶
Recursively list all files in dir_path. If extension is given, only include those with the matching suffix.
- lionagi.libs.file.ops.read_file(path) → str ¶
Read the contents of path (UTF-8) and return the text. Raises FileNotFoundError or PermissionError as needed.
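A quick sketch of these utilities used together (the directory and file names are hypothetical, and the backup/ directory is assumed to already exist):

from pathlib import Path
from lionagi.libs.file.ops import copy_file, get_file_size, list_files, read_file

md_files = list_files("docs", extension="md")   # all .md files under docs/
text = read_file(md_files[0])                   # UTF-8 contents of the first one
total_bytes = get_file_size("docs")             # total size of everything under docs/
copy_file(md_files[0], Path("backup") / md_files[0].name)  # back up one file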
3) dir_process.py: Directory Handling¶
Tools for processing entire directories in a concurrent or chunked manner.
- lionagi.libs.file.dir_process.dir_to_files(directory, file_types=None, max_workers=None, ignore_errors=False, verbose=False) → list[Path] ¶
Recursively discover all files in directory (and subdirs). Optionally filter by a list of extensions. Uses a ThreadPoolExecutor to handle concurrency. If ignore_errors is True, logs warnings instead of raising on file access issues.
Returns:
- list[Path]: The discovered file paths.
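For instance (the directory name is hypothetical, and whether extensions are given with a leading dot is an assumption here):

from lionagi.libs.file.dir_process import dir_to_files

# Collect every .md and .txt file under docs/, logging warnings
# instead of raising on unreadable entries.
files = dir_to_files(
    "docs",
    file_types=[".md", ".txt"],
    max_workers=4,
    ignore_errors=True,
    verbose=True,
)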
- lionagi.libs.file.dir_process.file_to_chunks(file_path, chunk_func, chunk_size=1500, overlap=0.1, threshold=200, encoding='utf-8', custom_metadata=None, output_dir=None, verbose=False, timestamp=True, random_hash_digits=4) → list[dict[str, Any]] ¶
Reads the text from file_path, then calls chunk_func to split it into smaller chunks. Each chunk is returned as a dictionary with metadata including the file name, size, etc. If output_dir is given, it also writes each chunk to a separate JSON file.
Parameters:
- file_path (str | Path): The file to process.
- chunk_func (Callable): A function for chunking the text (e.g., chunk_by_chars()).
- chunk_size, overlap, threshold: Passed to the chunker.
- encoding (str): File read encoding.
- custom_metadata (dict | None): Additional metadata to attach to chunks.
- output_dir (Path | None): If not None, writes each chunk to JSON in that directory.
- timestamp (bool), random_hash_digits (int): For naming chunk files.
Returns:
- list of dict: The chunk definitions.
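A minimal call that attaches custom metadata to every chunk (the file path and metadata fields are hypothetical):

from lionagi.libs.file.chunk import chunk_by_chars
from lionagi.libs.file.dir_process import file_to_chunks

chunks = file_to_chunks(
    "notes/meeting.txt",
    chunk_func=chunk_by_chars,
    chunk_size=1000,
    overlap=0.1,
    threshold=150,
    custom_metadata={"project": "demo", "lang": "en"},
)
# Each dict carries the chunk text, file metadata (name, size, etc.),
# and the custom fields above.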
4) save.py: Saving Text or Chunk Files¶
Utilities to save string or chunk data to disk, often used after chunking.
- lionagi.libs.file.save.save_to_file(text, directory, filename, extension=None, timestamp=False, dir_exist_ok=True, file_exist_ok=False, time_prefix=False, timestamp_format=None, random_hash_digits=0, verbose=True) → Path ¶
Create a path via lionagi.utils.create_path() and write text to it using UTF-8. Logs the resulting path if verbose is True.
Parameters:
- text (str): The text to save.
- directory (str | Path): Directory to place the file.
- filename (str): Base name (without extension, unless specified).
- extension (str | None): If given, appended to filename with a dot.
- timestamp (bool): If True, embed the time in the filename.
- random_hash_digits (int): Add a short random suffix.
- verbose (bool): Print/log the file path after saving.
Returns:
- Path: The final path that was written.
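For example (the directory, filename, and the printed path format are illustrative):

from lionagi.libs.file.save import save_to_file

path = save_to_file(
    "Hello, world!",
    directory="out",
    filename="greeting",
    extension="txt",
    timestamp=True,        # embed a timestamp in the filename
    random_hash_digits=4,  # short random suffix to avoid name collisions
)
print(path)  # e.g. out/greeting_20250101120000_ab12.txt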
- lionagi.libs.file.save.save_chunks(chunks, output_dir, verbose, timestamp, random_hash_digits)¶
Helper to save chunk dictionaries to JSON files, each with a name like chunk_1_<timestamp>.json. Each chunk is written as pretty-printed JSON.
Usage Example¶
Below is a demonstration of how you might combine modules from this subpackage:
from lionagi.libs.file.chunk import chunk_by_chars, chunk_content
from lionagi.libs.file.ops import read_file, list_files
from lionagi.libs.file.dir_process import file_to_chunks
# 1) List .txt files in a directory
text_files = list_files("my_dir", extension="txt")
# 2) Read the first file
content = read_file(text_files[0])
# 3) Chunk by characters
chunks = chunk_by_chars(content, chunk_size=500, overlap=0.1)
# 4) Alternatively, chunk with metadata
meta_chunks = chunk_content(
    content,
    chunk_by="chars",
    chunk_size=500,
    overlap=0.1,
    metadata={"source": "my_dir/myfile.txt"},
)
# 5) Optionally store chunked results
from lionagi.libs.file.save import save_chunks
save_chunks(meta_chunks, "output_chunk_dir", verbose=True, timestamp=True, random_hash_digits=2)
# 6) Or process an entire file in one go:
results = file_to_chunks(
    "my_dir/myfile.txt",
    chunk_func=chunk_by_chars,
    chunk_size=500,
    overlap=0.1,
    output_dir="output_chunk_dir",
)
Taken together, the modules in lionagi.libs.file facilitate consistent, straightforward manipulation of file data, particularly in multi-file contexts where chunk-based ingestion is needed.