Files
yanting/report-notebooklm-api/docs/CONTENT_PIPELINE.md
T

143 lines
6.0 KiB
Markdown

# Content Pipeline Handoff
This is a handoff snapshot, not the product SSOT.
Product SSOT: mall-docs report-notebooklm docs, snapshot date: 2026-06-03.
## Content Principle
Use NotebookLM as a source-driven research engine, not as a generic rewriting model.
The pipeline may orchestrate, clean, validate, map, and review NotebookLM-native artifacts. It must not silently replace missing NotebookLM artifacts with locally rewritten publishable content.
## Source Inputs
Phase 1 content is based on public or authorized institutional research reports. Priority source categories:
- Official public sources.
- Authorized partner sources.
- Gray broker public sources, with stricter review and source display handling.
Vision source lists, tiering, and historical source-health experience may be used as reference material. Production data must not depend on a local Vision runtime, local path, local cache, or local account state.
## NotebookLM Workflow
Recommended report run order:
1. Inspect the source PDF: title, institution, date, page count, size, and report type.
2. Create or reuse one notebook for one report source unless a multi-report synthesis is explicitly planned.
3. Upload the report source.
4. Generate the P0 text package:
- source description
- native Briefing Doc
- native Blog Post
- data table
- query dimensions
- query key data
- query divergence
- query weaknesses
5. Generate useful P1 artifacts:
- query timeline
- query related sources
- Study Guide
- mind map, if download succeeds
6. Generate P2 artifacts asynchronously:
- infographic candidate
- audio brief
- research discovery
7. Persist every artifact status in a manifest.
8. Deterministically assemble display modules from reviewed artifacts.
9. Run human review before publishing.
## Artifact Types
The Phase 1 schema supports these NotebookLM artifact types:
| Artifact type | Purpose | Publish blocking | Human review |
|---|---|---:|---:|
| `source_summary` | Source-level summary. | No | No |
| `notebook_summary` | Notebook-level summary. | No | No |
| `native_briefing_doc` | Native briefing document. | Yes | No |
| `native_blog_post` | Native blog post. | Yes | No |
| `native_study_guide` | FAQ, study guide, glossary. | No | No |
| `data_table` | Structured table data. | Yes | No |
| `mind_map` | Mind map or graph source. | No | No |
| `query_dimensions` | Analysis dimensions. | Yes | No |
| `query_key_data` | Key data points. | Yes | No |
| `query_divergence` | Views that diverge from consensus. | No | No |
| `query_weaknesses` | Weaknesses and open questions. | No | No |
| `query_timeline` | Timeline and turning points. | No | No |
| `query_related_sources` | Related source candidates. | No | Yes |
| `research_discovery` | Enrichment queue. | No | Yes |
| `infographic` | Candidate public image. | No | Yes |
| `audio_brief` | Listening preview or audio source. | No | No |
Artifact records should keep status, object reference, format, size, hash, generated time, error, and review flags. Raw payloads should stay in object storage and remain internal.
## Module Mapping
| Product module | Primary artifact sources | Notes |
|---|---|---|
| `basic_info` | Source metadata and source summary. | P0, inline. |
| `executive_overview` | Briefing Doc and Blog Post. | P0, heavy card plus page. |
| `core_insights` | Briefing Doc and query dimensions. | P0, inline with optional detail page. |
| `key_data` | Data table and query key data. | P0, heavy card plus page. |
| `source_compliance` | Source metadata and review notes. | P0, inline, must include disclaimer. |
| `institution` | Institution record. | P0, inline. |
| `differentiated_view` | Query divergence. | P1, optional. |
| `weaknesses` | Query weaknesses. | P1, optional, avoid investment-advice wording. |
| `timeline` | Query timeline. | P1, optional. |
| `study_guide` | Native Study Guide. | P1, optional, replaces legacy `faq`. |
| `structure_graph` | Mind map or deterministic fallback. | P1, optional. |
| `related_sources` | Related-source query and review queue. | P1, review required before display. |
| `infographic` | Infographic candidate. | P2, review required before display. |
| `audio` | Audio brief or reviewed audio asset. | P2, not required for text publish. |
| `research_discovery` | Research discovery queue. | P2, internal or reviewed only. |
## Publish Gates
Blocking before public release:
- Source upload succeeded and is traceable.
- Required P0 text artifacts exist and have usable content.
- `basic_info`, `executive_overview`, `core_insights`, `key_data`, and `source_compliance` are present unless a product decision allows a partial report.
- Display artifact is reviewed and approved.
- Source attribution and risk disclaimer are present.
- No raw artifact payload, local path, private notebook ID, or account information appears in public responses.
Non-blocking:
- Mind map.
- Study guide.
- Timeline.
- Related-source candidates.
- Research discovery.
- Infographic.
- Audio.
If optional artifacts fail, record the failure and continue without inventing fallback public copy. Deterministic fallback is allowed for structure graph from already available artifacts.
## Cadence Notes
NotebookLM operations should be conservative by default:
- One active NotebookLM operation per account.
- Text artifacts first.
- Media artifacts after text success.
- Heavy media should not block publishable text.
- On transient failure, retry once; if an optional artifact fails again, mark it failed and continue.
The seed importer is not a production runner. A production runner should persist manifests after every operation and support resumable review/import.
## Human Review
Review is mandatory for:
- Gray broker sources.
- Related-source candidate display.
- Infographic or generated media.
- Any content where citations/page labels are ambiguous.
- Any copy that could be interpreted as investment advice.
Do not display raw NotebookLM page labels until they are normalized against verifiable source pages or sections.