Schema Reference
This page documents the data shapes that make the archive reusable. It is descriptive only; no formal validator exists yet. The rule is simple: external tools may consume these files, but the review-state fields still decide whether a record can be cited as verified fact.
Data Files
| ID | Purpose | Path | Exists | Representative Fields |
|---|---|---|---|---|
source_catalog | Seeded source records and custody pointers. | sources/source_catalog.json | yes | internet_archive_id, local_raw_file, processed_status, site_path, source_id, source_status, title, year |
research_index | Corpus-wide processing inventory. | processed/research_index.json | yes | quality_note, source_count, sources |
evidence_ledger | Traceability records for source/candidate/promoted items. | processed/evidence_ledger.json | yes | collection, confidence, evidence, id, label, person, record_id, record_type |
chapter_atlas | Theme routing records for processed sections. | processed/chapter_atlas.json | yes | id, kind, line_end, line_start, sequence, source_id, source_title, status |
book_coverage_atlas | Book-level coverage records for processed sources and sections. | processed/book_coverage_atlas.json | yes | generated_at, quality_note, source_count, sources, total_sections |
chapter_workbench | Section-level research workbench records. | processed/chapter_workbench.json | yes | concept_hits, equation_count, equations, figure_count, figures, glossary_hits, id, kind |
concept_concordance | Source-text concept hit records. | processed/concept_concordance.json | yes | collection, concepts, generated_at, person, quality_note, section_count, source_count, total_concepts |
canonical_equations | First equation canon and review state. | processed/canonical_equations.json | yes | id, modern_form, original_form, site_path, source_id, source_ref, source_title, status |
completion_audit | Source-by-source readiness gates. | processed/completion_audit.json | yes | curated_public_pages, gates, has_ocr_seed, has_source_manifest, links, next_actions, original_crop_manifests, processed_status |
citation_index | Project and source citation records. | processed/citation_index.json | yes | author, id, issued, recommended_citation, site_url, source_url, title, type |
notation_ledger | Equation notation and translation ledger. | processed/notation_ledger.json | yes | equation_id, modern_form, original_form, review_actions, site_path, source_id, source_ref, source_title |
diagram_provenance_ledger | Original crop and redraw provenance ledger. | processed/diagram_provenance_ledger.json | yes | asset_type, crop_box_pixels, height, id, manifest_path, output_path, public_url, quality_note |
schema_reference | Machine-readable schema/reference guide. | processed/schema_reference.json | yes | files, generated_at, quality_note |
expert_review_packets | Review bundles for experts and contributors. | processed/expert_review_packets.json | yes | artifact_links, id, ready_count, reviewer_profile, scope, tasks, title |
release_readiness | Named publication release levels and readiness states. | processed/release_readiness.json | yes | generated_at, levels, quality_note |
accessibility_audit | Automated accessibility-readiness scan and manual review gates. | processed/accessibility_audit.json | yes | gates, generated_at, html_table_count, iframe_count, image_tag_count, image_tag_missing_alt, issue_pages, long_table_pages |
edition_comparison_index | Edition collation queue for seeded sources. | processed/edition_comparison_index.json | yes | edition_review_status, internet_archive_id, local_raw_file, priority, processed_status, review_actions, source_id, title |
patent_theory_bridge | Seeded bridge from patents to concepts and theory-review targets. | processed/patent_theory_bridge.json | yes | bridge_status, concept_links, diagram_targets, domain_tags, patent_number, patent_url, pdf_url, publication_date |
canonical_verification_workbench | Top-level queue index for canonical verification work. | processed/canonical_verification_workbench.json | yes | generated_at, quality_note, queues, summary |
equation_verification_queue | Equation scan-check queue with OCR line snippets. | processed/equation_verification_queue.json | yes | candidate_status, chapter_id, chapter_refs, chapter_title, id, line_anchors, line_ranges, links |
figure_verification_queue | Original figure crop verification queue. | processed/figure_verification_queue.json | yes | crop_box_pixels, id, links, manifest_path, output_path, public_url, review_actions, sha256 |
patent_verification_queue | Patent authority PDF, claim, drawing, and theory bridge queue. | processed/patent_verification_queue.json | yes | concept_links, diagram_targets, domain_tags, links, patent_number, patent_url, pdf_url, publication_date |
claim_attribution_ledger | Source-isolation ledger for fact, candidate, translation, patent, diagram, and interpretation layers. | processed/claim_attribution_ledger.json | yes | allowed_use, claim_type, collection, confidence, id, interpretation_layer, label, person |
Review-State Fields
When building external tools, preserve fields such as status, verification, confidence, quality_note, source_ref, and review_state. Removing them makes candidate OCR look more certain than it is.
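One way an external tool might honor this is to carry the review-state fields through every export, whatever subset of fields the caller asks for. The following is a minimal sketch, not the project's own tooling: the helper name `project_record` and the sample record are hypothetical, and the sample values are illustrative only.

```python
# Review-state fields named in this section; carried through every projection.
REVIEW_STATE_FIELDS = (
    "status",
    "verification",
    "confidence",
    "quality_note",
    "source_ref",
    "review_state",
)


def project_record(record, wanted_fields):
    """Return a copy with the wanted fields plus any review-state fields present.

    Hypothetical helper: fields absent from the record are skipped, not invented.
    """
    keep = list(wanted_fields)
    keep += [name for name in REVIEW_STATE_FIELDS if name not in keep]
    return {key: record[key] for key in keep if key in record}


# Hypothetical record; the values are illustrative, not taken from the archive.
sample = {
    "id": "eq-001",
    "original_form": "E = IR",
    "modern_form": "V = IR",
    "source_id": "src-001",
    "source_ref": "p. 12",
    "status": "candidate",
    "site_path": "/equations/eq-001/",
}

# The caller asks for two fields; status and source_ref come along anyway.
slim = project_record(sample, ["id", "modern_form"])
print(slim)
```

The design choice here is that dropping a review-state field requires deliberately editing `REVIEW_STATE_FIELDS`, so a narrow export cannot strip the candidate marker by accident.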
Future Formalization
The next step is formal JSON Schema files in pipeline/schemas/, versioned export contracts, and validation in CI.
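Until those schema files exist, a consumer could run a much weaker check of its own. The sketch below is a standard-library stand-in, not JSON Schema: it only reports missing keys. It assumes the representative fields listed in the table for `schema_reference` and `release_readiness` are top-level keys, which this page does not confirm, and the names `EXPECTED_KEYS` and `missing_keys` and the sample payload are hypothetical.

```python
import json

# Assumption: the representative fields listed for these files are top-level
# keys. The table does not say whether they are required or exhaustive.
EXPECTED_KEYS = {
    "schema_reference": {"files", "generated_at", "quality_note"},
    "release_readiness": {"generated_at", "levels", "quality_note"},
}


def missing_keys(file_id, document):
    """Return the expected top-level keys absent from a parsed JSON object."""
    return sorted(EXPECTED_KEYS[file_id] - set(document))


# Hypothetical payload standing in for processed/release_readiness.json.
payload = json.loads('{"generated_at": "2024-01-01T00:00:00Z", "levels": []}')

problems = missing_keys("release_readiness", payload)
if problems:
    print("missing:", ", ".join(problems))
```

A check like this says nothing about value types or nested records, which is the gap formal schemas and CI validation would close.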