init: first upload

2026-01-04 09:07:25 +08:00
commit 29f6e25f70
9 changed files with 2598 additions and 0 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,28 @@
+# Repository Guidelines
+
+## Project Structure & Module Organization
+The repository centers on four Python entry points under `scripts/`: `quick_start.py`, `match_cas_numbers.py`, `keyword_matcher.py`, and `expand_keywords_with_llm.py`. Source data lives in `data/input/`, generated spreadsheets land in `data/output/`, and supporting evidence files reside in `data/images/`. Keep API credentials in `.env` files that copy the keys documented in `config.env.example`.
+
+## Build, Test, and Development Commands
+Set up a sandboxed interpreter before running anything:
+```bash
+python3 -m venv .venv && source .venv/bin/activate
+pip install pandas openpyxl pyahocorasick
+```
+Core routines:
+- `cd scripts && python3 quick_start.py` validates the entire ingest → match → export flow with bundled sample sheets.
+- `python3 match_cas_numbers.py` reads `data/input/clickin_text_img.xlsx` and writes normalized CAS matches to `data/output/cas_matched_results_final.xlsx`.
+- `python3 keyword_matcher.py` orchestrates the three detection modes (CAS column, exact text column, fuzzy tolerant matching) and writes per-mode reports such as `keyword_matched_results_cas.xlsx`; install `pyahocorasick` for the fast exact path and `rapidfuzz` for the fuzzy path.
+- `python3 expand_keywords_with_llm.py ../data/input/keywords.xlsx -m` mocks the LLM expansion; remove `-m` only after exporting `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`.
+
+## Coding Style & Naming Conventions
+Follow PEP 8: four-space indentation, `snake_case` functions, `UPPER_SNAKE_CASE` constants. Module-level configuration such as column names or separators (`SEPARATOR = "|||"`) should be defined once and imported where needed. Preserve spreadsheet column spelling because pandas filters depend on exact casing. When expanding functionality, keep CLI argument names lowercase with hyphenated long options for consistency.
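+
+A minimal sketch of these conventions, assuming a shared module and a small `argparse` CLI. Only `SEPARATOR = "|||"` comes from the guideline above; `CAS_COLUMN`, `--output-dir`, `--fuzzy-threshold`, and both function names are hypothetical and are not taken from the repository's scripts.
+
+```python
+import argparse
+
+# Module-level constants: defined once, imported wherever they are needed.
+SEPARATOR = "|||"     # value quoted in the guideline above
+CAS_COLUMN = "CAS"    # hypothetical column name; must match the sheet's exact casing
+
+
+def build_parser() -> argparse.ArgumentParser:
+    """Build a parser whose long options are lowercase and hyphenated."""
+    parser = argparse.ArgumentParser(description="Illustrative matcher CLI")
+    parser.add_argument("input_path", help="path to the keyword spreadsheet")
+    parser.add_argument("--output-dir", default="../data/output",
+                        help="directory for generated reports")
+    parser.add_argument("--fuzzy-threshold", type=int, default=90,
+                        help="hypothetical similarity cutoff for fuzzy mode")
+    return parser
+
+
+def join_keywords(keywords: list[str]) -> str:
+    """Join keywords with the shared separator instead of a local literal."""
+    return SEPARATOR.join(keywords)
+
+
+if __name__ == "__main__":
+    args = build_parser().parse_args()
+    # argparse exposes --output-dir as args.output_dir (hyphens become underscores).
+    print(args.input_path, args.output_dir, args.fuzzy_threshold)
+```
+
+Hyphenated long options stay readable on the command line, while the parsed attributes remain valid `snake_case` names inside the code.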
+
+## Testing Guidelines
+Testing is empirical: rerun `quick_start.py` and the specific script you edited, then compare row counts, unique IDs, and timing stats against previous outputs. Use lightweight fixtures copied from `data/input/` to isolate regressions, and treat script warnings or pandas `SettingWithCopyWarning` notices as failures until they are explained.
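+
+One way to make that comparison repeatable is sketched below using only the standard library. It assumes rows are available as dictionaries (for example after exporting a sheet); `summarize`, `diff_summaries`, and `run_strict` are hypothetical helpers, not functions from the repository. Because pandas emits its notices through Python's `warnings` machinery, promoting warnings to errors should surface them as well.
+
+```python
+import warnings
+from typing import Callable, Iterable
+
+
+def summarize(rows: Iterable[dict], id_column: str) -> dict:
+    """Return the row count and the number of distinct IDs."""
+    rows = list(rows)
+    return {
+        "rows": len(rows),
+        "unique_ids": len({row[id_column] for row in rows}),
+    }
+
+
+def diff_summaries(baseline: dict, current: dict) -> dict:
+    """Return only the metrics that changed, as {name: (before, after)}."""
+    return {
+        key: (baseline[key], current.get(key))
+        for key in baseline
+        if baseline[key] != current.get(key)
+    }
+
+
+def run_strict(step: Callable[[], list]) -> list:
+    """Run one pipeline step with every warning promoted to an exception."""
+    with warnings.catch_warnings():
+        warnings.simplefilter("error")
+        return step()
+```
+
+An empty result from `diff_summaries` means the edited script reproduced the earlier counts; any non-empty result is a change that needs an explanation before merging.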
+
+## Commit & Pull Request Guidelines
+Git history is not distributed with this bundle, so default to Conventional Commit subjects (`feat: add cas normalizer`, `fix: guard empty rows`). Each PR should list the commands executed, describe the input data used, and reference any README or `data/` updates. Link tracking tickets, attach screenshots of spreadsheet diffs when UI proof is needed, and keep binary artifacts out of the diff by adding them to `.gitignore` if necessary.
+
+## Security & Configuration Tips
+Never commit real API keys or sensitive spreadsheets; point reviewers to sanitized snippets instead. Load secrets with `export $(cat .env | xargs)` or a dedicated `.env` loader rather than embedding them in code; `config.env.example` only documents the key names and should hold placeholders. Generated Excel files may contain investigative evidence, so confine them to `data/output/` and scrub PII when sharing externally.
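+
+A dedicated loader can be as small as the sketch below, which assumes a plain `KEY=VALUE` file with optional `#` comments and optional quotes around values. `parse_env` and `load_env` are hypothetical names; the repository's scripts may load configuration differently.
+
+```python
+import os
+from pathlib import Path
+
+
+def parse_env(text: str) -> dict:
+    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
+    values = {}
+    for raw in text.splitlines():
+        line = raw.strip()
+        if not line or line.startswith("#") or "=" not in line:
+            continue
+        key, _, value = line.partition("=")
+        # Drop surrounding whitespace and one layer of quote characters.
+        values[key.strip()] = value.strip().strip("'\"")
+    return values
+
+
+def load_env(path: str = ".env") -> None:
+    """Copy parsed values into os.environ without overriding existing ones."""
+    for key, value in parse_env(Path(path).read_text(encoding="utf-8")).items():
+        os.environ.setdefault(key, value)
+```
+
+Using `setdefault` keeps variables already exported in the shell authoritative, so a stale `.env` file cannot silently replace them.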