Repository Guidelines
Project Structure & Module Organization
The repository centers on four Python entry points under scripts/: quick_start.py, match_cas_numbers.py, keyword_matcher.py, and expand_keywords_with_llm.py. Source data lives in data/input/, generated spreadsheets land in data/output/, and supporting evidence files reside in data/images/. Keep API credentials in .env files that copy the keys documented in config.env.example.
Build, Test, and Development Commands
Create and activate an isolated virtual environment before running anything:
python3 -m venv .venv && source .venv/bin/activate
pip install pandas openpyxl pyahocorasick
Core routines:
cd scripts && python3 quick_start.py: validates the entire ingest → match → export flow with the bundled sample sheets.
python3 match_cas_numbers.py: reads data/input/clickin_text_img.xlsx and writes normalized CAS matches to data/output/cas_matched_results_final.xlsx.
python3 keyword_matcher.py: orchestrates the three detection modes (CAS column, exact-text column, fuzzy tolerance) and writes per-mode reports such as keyword_matched_results_cas.xlsx; install pyahocorasick for the fast exact path and rapidfuzz for the fuzzy path.
python3 expand_keywords_with_llm.py ../data/input/keywords.xlsx -m: mocks the LLM expansion; remove -m only after exporting OPENAI_API_KEY or ANTHROPIC_API_KEY.
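As a rough illustration of what "normalized CAS matches" can involve, the sketch below tidies a candidate string and verifies the standard CAS Registry Number check digit. The helper names normalize_cas and is_valid_cas are hypothetical, and this is not necessarily how match_cas_numbers.py works.

```python
import re

# Hypothetical helpers; only the CAS format and check-digit rule are standard.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

# Unicode dashes that spreadsheets often substitute for the ASCII hyphen.
_DASHES = ("\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2212")


def normalize_cas(raw: str) -> str:
    """Trim whitespace and map common Unicode dashes to an ASCII hyphen."""
    cleaned = raw.strip()
    for dash in _DASHES:
        cleaned = cleaned.replace(dash, "-")
    return cleaned


def is_valid_cas(raw: str) -> bool:
    """Return True if the string has CAS shape and a correct check digit."""
    match = CAS_PATTERN.match(normalize_cas(raw))
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit
```

For example, is_valid_cas("7732-18-5") accepts the CAS number for water, while changing the last digit makes the check fail.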
Coding Style & Naming Conventions
Follow PEP 8: four-space indentation, snake_case functions, UPPER_SNAKE_CASE constants. Module-level configuration such as column names or separators (SEPARATOR = "|||") should be defined once and imported where needed. Preserve spreadsheet column spelling because pandas filters depend on exact casing. When expanding functionality, keep CLI argument names lowercase with hyphenated long options for consistency.
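The conventions above might look like this in practice. This is a minimal sketch: KEYWORD_COLUMN, the option names, and the helper functions are hypothetical and only illustrate the naming style; SEPARATOR is the one value taken from this guide.

```python
import argparse

# Module-level configuration: define once, import elsewhere.
SEPARATOR = "|||"
KEYWORD_COLUMN = "keyword"  # hypothetical column name; match the sheet's exact casing


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with lowercase, hyphenated long options."""
    parser = argparse.ArgumentParser(description="Hypothetical matcher CLI")
    parser.add_argument("input_path", help="path to the source workbook")
    parser.add_argument("--output-dir", default="../data/output",
                        help="directory for generated reports")
    parser.add_argument("--min-score", type=int, default=90,
                        help="fuzzy-match threshold (hypothetical option)")
    return parser


def join_keywords(keywords: list[str]) -> str:
    """Join keyword variants using the shared separator constant."""
    return SEPARATOR.join(keywords)
```

argparse exposes hyphenated options with underscores, so --min-score is read as args.min_score.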
Testing Guidelines
Testing is empirical: rerun quick_start.py and the specific script you edited, then compare row counts, unique IDs, and timing stats against previous outputs. Use lightweight fixtures copied from data/input/ to isolate regressions, and treat script warnings or pandas SettingWithCopyWarning messages as failures until they are explained.
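One way to make the row-count and unique-ID comparison repeatable is a small helper like the sketch below. The function names and the "id" column are assumptions; rows are plain dictionaries, which a pandas DataFrame can produce via df.to_dict("records").

```python
def summarize(rows, id_column):
    """Count rows and distinct IDs in a list of row dictionaries."""
    ids = [row[id_column] for row in rows]
    return {"row_count": len(ids), "unique_ids": len(set(ids))}


def compare_runs(previous, current, id_column="id"):
    """Report counts for both runs plus IDs that disappeared or appeared."""
    prev_ids = {row[id_column] for row in previous}
    curr_ids = {row[id_column] for row in current}
    return {
        "previous": summarize(previous, id_column),
        "current": summarize(current, id_column),
        "missing_ids": sorted(prev_ids - curr_ids),  # present before, gone now
        "new_ids": sorted(curr_ids - prev_ids),      # absent before, present now
    }
```

A non-empty missing_ids list after an edit is a quick signal that a match was lost and needs an explanation before merging.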
Commit & Pull Request Guidelines
Git history is not distributed with this bundle, so default to Conventional Commit subjects (feat: add cas normalizer, fix: guard empty rows). Each PR should list the commands executed, describe the input data used, and reference any README or data/ updates. Link tracking tickets, attach screenshots of spreadsheet diffs when UI proof is needed, and keep binary artifacts out of the diff by adding them to .gitignore if necessary.
Security & Configuration Tips
Never commit real API keys or sensitive spreadsheets; point reviewers to sanitized snippets instead. Load secrets from your local .env (not from config.env.example, which holds only placeholders) with export $(cat .env | xargs) or a dedicated .env loader rather than embedding them in code; the export one-liner only handles simple KEY=value lines without comments or spaces. Generated Excel files may contain investigative evidence, so confine them to data/output/ and scrub PII when sharing externally.
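A dedicated loader can be very small. The sketch below uses only the standard library; load_env_file is a hypothetical helper and not part of this repository. It skips comments and blank lines and, by default, does not overwrite variables that are already set.

```python
import os


def load_env_file(path, override=False):
    """Load KEY=value lines from a file into os.environ.

    Returns a dict of the variables this call actually set.
    """
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            # Drop surrounding quotes so KEY="a b" yields: a b
            value = value.strip().strip('"').strip("'")
            if override or key not in os.environ:
                os.environ[key] = value
                loaded[key] = value
    return loaded
```

Calling load_env_file(".env") at the top of a script keeps keys such as OPENAI_API_KEY out of the source while still making them visible through os.environ.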