Repository Guidelines

Project Structure & Module Organization

The repository centers on four Python entry points under scripts/: quick_start.py, match_cas_numbers.py, keyword_matcher.py, and expand_keywords_with_llm.py. Source data lives in data/input/, generated spreadsheets land in data/output/, and supporting evidence files reside in data/images/. Keep API credentials in a local .env file whose keys mirror those documented in config.env.example.

Build, Test, and Development Commands

Create an isolated virtual environment before running anything:

python3 -m venv .venv && source .venv/bin/activate
pip install pandas openpyxl pyahocorasick rapidfuzz

Core routines:

cd scripts && python3 quick_start.py: validates the entire ingest → match → export flow with the bundled sample sheets.
python3 match_cas_numbers.py: reads data/input/clickin_text_img.xlsx and writes normalized CAS matches to data/output/cas_matched_results_final.xlsx.
python3 keyword_matcher.py: orchestrates the three detection modes (CAS column, exact text column, fuzzy tolerant matching) and writes per-mode reports such as keyword_matched_results_cas.xlsx. Install pyahocorasick for the fast exact path and rapidfuzz for the fuzzy path.
python3 expand_keywords_with_llm.py ../data/input/keywords.xlsx -m: runs the LLM expansion in mock mode. Remove -m only after exporting OPENAI_API_KEY or ANTHROPIC_API_KEY.
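
The internals of match_cas_numbers.py are not shown in this guide, so the following is only a minimal sketch of what "normalized CAS matches" could involve. It assumes the standard CAS Registry Number layout (2 to 7 digits, 2 digits, 1 check digit) and the usual check-digit rule; the names CAS_PATTERN, cas_checksum_ok, and find_cas_numbers are hypothetical and not taken from the repository.

```python
import re

# Hypothetical pattern: 2-7 digits, 2 digits, 1 check digit, hyphen-separated,
# not embedded in a longer run of digits.
CAS_PATTERN = re.compile(r"(?<!\d)(\d{2,7})-(\d{2})-(\d)(?!\d)")


def cas_checksum_ok(body: str, check_digit: int) -> bool:
    """Apply the CAS check-digit rule to the digits before the check digit.

    Digits are weighted 1, 2, 3, ... from right to left; the weighted sum
    modulo 10 must equal the check digit.
    """
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


def find_cas_numbers(text: str) -> list[str]:
    """Return checksum-valid CAS numbers found in text, in order of appearance."""
    # Normalize common Unicode dash variants to an ASCII hyphen first.
    normalized = re.sub(r"[\u2010-\u2015\u2212\uFF0D]", "-", text)
    found = []
    for match in CAS_PATTERN.finditer(normalized):
        first, second, check = match.groups()
        if cas_checksum_ok(first + second, int(check)):
            found.append(f"{first}-{second}-{check}")
    return found
```

For example, find_cas_numbers("water 7732-18-5, typo 7732-18-4") keeps the first candidate and drops the second, because only the first passes the check-digit test.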

Coding Style & Naming Conventions

Follow PEP 8: four-space indentation, snake_case functions, UPPER_SNAKE_CASE constants. Module-level configuration such as column names or separators (SEPARATOR = "|||") should be defined once and imported where needed. Preserve spreadsheet column spelling because pandas filters depend on exact casing. When expanding functionality, keep CLI argument names lowercase with hyphenated long options for consistency.
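
As an illustration of these conventions, here is a small sketch using argparse from the standard library. Only SEPARATOR = "|||" comes from the guideline above; DEFAULT_KEYWORD_COLUMN, the option names, and the idea that SEPARATOR splits multi-value cells are assumptions made for the example.

```python
import argparse

# Module-level configuration, defined once and imported where needed.
SEPARATOR = "|||"
DEFAULT_KEYWORD_COLUMN = "keyword"  # hypothetical column name


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI whose long options are lowercase and hyphenated."""
    parser = argparse.ArgumentParser(description="Hypothetical matcher CLI")
    parser.add_argument("input_path", help="path to the input workbook")
    parser.add_argument(
        "--keyword-column",
        default=DEFAULT_KEYWORD_COLUMN,
        help="exact, case-sensitive spreadsheet column to read",
    )
    parser.add_argument(
        "--min-score",
        type=int,
        default=90,
        help="hypothetical fuzzy-match threshold",
    )
    return parser


def split_keywords(cell: str) -> list[str]:
    """Split a multi-value cell on SEPARATOR, dropping empty parts."""
    return [part.strip() for part in cell.split(SEPARATOR) if part.strip()]
```

argparse maps --keyword-column to the attribute args.keyword_column, so hyphenated options on the command line still yield snake_case names in code.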

Testing Guidelines

Testing is empirical: rerun quick_start.py and the specific script you edited, then compare row counts, unique IDs, and timing stats against previous outputs. Use lightweight fixtures copied from data/input/ to isolate regressions, and treat script warnings, including pandas SettingWithCopyWarning, as failures until they are explained.
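
The comparison step can be sketched roughly as follows. This assumes each output has already been loaded into a list of row dictionaries (for instance via pandas DataFrame.to_dict("records")); summarize, diff_summaries, and the "id" column name are hypothetical helpers for illustration, not part of the repository.

```python
def summarize(rows: list[dict], id_column: str) -> dict[str, int]:
    """Collect the headline numbers worth comparing between two runs."""
    ids = [row[id_column] for row in rows]
    return {"rows": len(rows), "unique_ids": len(set(ids))}


def diff_summaries(before: dict[str, int], after: dict[str, int]) -> dict[str, tuple]:
    """Return only the metrics that changed, as (before, after) pairs."""
    return {
        key: (before[key], after.get(key))
        for key in before
        if before[key] != after.get(key)
    }


if __name__ == "__main__":
    baseline = [{"id": 1}, {"id": 2}, {"id": 2}]
    candidate = [{"id": 1}, {"id": 2}]
    changes = diff_summaries(summarize(baseline, "id"), summarize(candidate, "id"))
    # An empty dict means the two runs agree on these metrics.
    print(changes or "no differences")
```

To surface warnings as hard failures while rerunning a script, Python's -W error flag (for example python3 -W error quick_start.py) turns them into exceptions.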

Commit & Pull Request Guidelines

Git history is not distributed with this bundle, so default to Conventional Commit subjects (feat: add cas normalizer, fix: guard empty rows). Each PR should list the commands executed, describe the input data used, and reference any README or data/ updates. Link tracking tickets, attach screenshots of spreadsheet diffs when UI proof is needed, and keep binary artifacts out of the diff by adding them to .gitignore if necessary.

Security & Configuration Tips

Never commit real API keys or sensitive spreadsheets; point reviewers to sanitized snippets instead. Load secrets from your local .env (not from the config.env.example template) with export $(grep -v '^#' .env | xargs) or a dedicated .env loader rather than embedding them in code. Generated Excel files may contain investigative evidence, so confine them to data/output/ and scrub PII when sharing externally.
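
For the "dedicated .env loader" option, a standard-library sketch might look like the one below. It assumes a simple KEY=VALUE file format with # comments; parse_env_text and load_env_file are hypothetical names, and a maintained package may be preferable for anything beyond this basic format.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional matching-style quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env", override: bool = False) -> dict[str, str]:
    """Load variables from path into os.environ and return what was parsed.

    Existing environment variables win unless override is True.
    A missing file is treated as "nothing to load".
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        if override or key not in os.environ:
            os.environ[key] = value
    return values
```

Calling load_env_file() once at the top of a script, before any client is constructed, keeps keys such as OPENAI_API_KEY out of the source code while still letting an exported shell variable take precedence.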
