How Does the RSS.app Generator Work?
The Generator Pipeline
The RSS.app Generator operates as a repeatable pipeline rather than a single action. Each refresh cycle executes the following stages: fetching the page, extracting content, detecting changes, normalizing data, and producing formatted outputs.
Understanding this pipeline explains how the Generator maintains accuracy over time, handles edge cases, and delivers consistent results across different source pages. Each stage has specific responsibilities and failure modes.
Page Fetching
The pipeline begins by requesting the source URL. This is not a simple HTTP GET; the Generator employs multiple strategies to retrieve content reliably:
- Browser rendering. For JavaScript-heavy pages, a headless browser renders the page and waits for dynamic content to load.
- Header management. Appropriate headers ensure the request appears as a standard browser visit, avoiding blocks or degraded responses.
- Timeout handling. Configurable timeouts prevent indefinite waits on slow or unresponsive pages.
- Retry logic. Transient failures trigger automatic retries with backoff intervals.
The output of this stage is the fully rendered HTML of the source page, ready for content extraction.
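The exact fetch layer is internal to RSS.app, but the timeout and retry behavior described above follows a well-known pattern. The sketch below illustrates it in TypeScript (Node 18+, global fetch); the function name, header value, and retry parameters are all illustrative assumptions, and JavaScript-heavy pages would go through a headless browser rather than a plain fetch:

```typescript
// Sketch only: timeout, header management, and retry-with-backoff for one
// fetch. Names and parameter values are illustrative, not RSS.app's actual
// code. JS-heavy pages would instead be rendered in a headless browser.
async function fetchWithRetry(
  url: string,
  retries = 3,
  timeoutMs = 15_000,
  backoffMs = 1_000,
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    // Abort the request if it exceeds the configured timeout.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetch(url, {
        signal: controller.signal,
        // Present the request as a standard browser visit.
        headers: { "User-Agent": "Mozilla/5.0 (compatible; feed-fetcher)" },
      });
      if (res.ok) return await res.text();
      if (res.status < 500 && res.status !== 429) {
        // Client errors other than 429 will not recover on retry.
        throw new Error(`permanent: HTTP ${res.status}`);
      }
      lastError = new Error(`transient: HTTP ${res.status}`);
    } catch (err) {
      if (err instanceof Error && err.message.startsWith("permanent")) throw err;
      lastError = err; // network error or timeout: eligible for retry
    } finally {
      clearTimeout(timer);
    }
    // Exponential backoff between attempts: 1s, 2s, 4s, ...
    if (attempt < retries) {
      await new Promise((r) => setTimeout(r, backoffMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```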
Content Extraction
With the HTML available, the Generator identifies content items on the page. This involves pattern recognition to find repeating structures:
Container detection. The Generator locates the parent element containing all content items: a list, grid, or table that holds articles, posts, or entries.
Item identification. Within the container, individual items are identified by their structural similarity. Each card, row, or entry becomes a potential feed item.
Field mapping. For each item, the Generator extracts standard feed fields:
- title: The headline or name of the item
- link: The URL to the full content
- description: Summary text or excerpt
- pubDate: Publication or modification timestamp
- image: Associated thumbnail or featured image
- author: Creator attribution when available
The extraction uses heuristics refined across millions of pages. For non-standard layouts, the RSS Builder provides manual control over field selection.
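The detection heuristics themselves are not public, but field mapping with a known item selector is straightforward to illustrate. The sketch below uses the cheerio HTML parser; the article, h2/h3, and other selectors are hypothetical stand-ins for whatever structure detection finds:

```typescript
import * as cheerio from "cheerio";

interface FeedItem {
  title: string;
  link: string;
  description?: string;
  image?: string;
}

// Minimal field-mapping sketch: given a known item selector, extract the
// standard feed fields from each repeating structure. Real extraction would
// detect the container and selectors heuristically rather than hard-code them.
function extractItems(html: string, baseUrl: string): FeedItem[] {
  const $ = cheerio.load(html);
  const items: FeedItem[] = [];
  $("article").each((_, el) => { // hypothetical item selector
    const anchor = $(el).find("a").first();
    const href = anchor.attr("href");
    if (!href) return; // an item without a link cannot become a feed entry
    items.push({
      title: $(el).find("h2, h3").first().text().trim(),
      link: new URL(href, baseUrl).href, // resolve relative URLs early
      description: $(el).find("p").first().text().trim() || undefined,
      image: $(el).find("img").first().attr("src"),
    });
  });
  return items;
}
```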
Change Comparison
Detecting new content requires comparing current extraction results against previous runs. The Generator maintains state between refresh cycles:
GUID assignment. Each item receives a globally unique identifier based on its URL and a content hash. This GUID persists across refreshes.
Seen-state tracking. The Generator remembers which GUIDs it has encountered. Items with new GUIDs are marked as new; items with known GUIDs are skipped or updated.
Content updates. If an existing item's content changes (title edited, description modified), the Generator can detect this through hash comparison and update the feed accordingly.
This comparison ensures feed consumers receive new items exactly once, without duplicates or missed updates.
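A minimal sketch of the GUID and seen-state idea, assuming a SHA-256 content hash and in-memory state (the actual hashing recipe and persistence layer are not public):

```typescript
import { createHash } from "node:crypto";

// GUID sketch: a stable identifier derived from the item URL plus a short
// hash of its visible content. Illustrates the idea only.
function makeGuid(link: string, title: string, description = ""): string {
  const contentHash = createHash("sha256")
    .update(title + "\n" + description)
    .digest("hex")
    .slice(0, 12);
  return `${link}#${contentHash}`;
}

// Seen-state tracking across refresh cycles. A real system would persist
// this map between runs rather than hold it in memory.
const seen = new Map<string, string>(); // link -> last content hash

function classify(link: string, guid: string): "new" | "updated" | "unchanged" {
  const hash = guid.split("#").pop()!;
  const prev = seen.get(link);
  seen.set(link, hash);
  if (prev === undefined) return "new"; // never encountered: emit exactly once
  return prev === hash ? "unchanged" : "updated"; // content edited in place
}
```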
Why this matters: Reliable change detection is what separates a feed generator from a simple scraper. Consumers can trust that new items in the feed are genuinely new, and that they will not miss updates between refreshes.
Data Normalization
Raw extracted content requires cleanup before becoming a valid feed. Normalization handles:
Date parsing. Source pages express dates in countless formats—relative ("2 hours ago"), localized ("4 février"), or non-standard layouts. The Generator normalizes these to RFC 822 timestamps required by RSS.
URL resolution. Relative links are converted to absolute URLs. Protocol-relative URLs receive explicit schemes. Malformed URLs are corrected or excluded.
Text cleaning. Excessive whitespace, HTML entities, and encoding issues are resolved. Descriptions are truncated to reasonable lengths if needed.
Image extraction. If images exist but are not directly associated with items, the Generator attempts to match images to their corresponding content.
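As a rough illustration of the first two normalization steps, the sketch below handles one relative-date pattern and resolves URLs against the page address; the patterns and the fall-back-to-fetch-time behavior are assumptions for the example, not documented RSS.app rules:

```typescript
// Date normalization sketch. toUTCString() emits the four-digit-year RFC 822
// variant (RFC 1123) that RSS 2.0 requires, e.g. "Wed, 04 Feb 2026 10:00:00 GMT".
const UNITS_MS: Record<string, number> = {
  minute: 60_000,
  hour: 3_600_000,
  day: 86_400_000,
};

function toRfc822(raw: string, now: Date = new Date()): string {
  // Handle a simple relative form like "2 hours ago" (illustrative only).
  const rel = raw.match(/^(\d+)\s+(minute|hour|day)s?\s+ago$/i);
  if (rel) {
    const ms = UNITS_MS[rel[2].toLowerCase()];
    return new Date(now.getTime() - Number(rel[1]) * ms).toUTCString();
  }
  const parsed = new Date(raw);
  // Assumed fallback: use the fetch time when the source date cannot be parsed.
  return isNaN(parsed.getTime()) ? now.toUTCString() : parsed.toUTCString();
}

// Relative and protocol-relative links become absolute against the page URL;
// the URL constructor throws on malformed input, so such items can be excluded.
function resolveUrl(href: string, pageUrl: string): string {
  return new URL(href, pageUrl).href;
}

// resolveUrl("/posts/42", "https://example.com/blog")
//   -> "https://example.com/posts/42"
```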
The result is clean, consistent data ready for format-specific output.
Format-Specific Output
Normalized data is serialized into multiple formats simultaneously:
XML (RSS 2.0)
The standard syndication format. Items become <item> elements inside a <channel> with proper feed metadata, publication dates, and GUIDs. The output validates against the RSS 2.0 specification.
JSON
A structured JSON document following the JSON Feed specification, with each item as a typed object in the top-level items array. This format integrates directly with JavaScript applications and REST APIs.
CSV
Tabular export with one row per item. Column headers match feed fields. This format imports directly into spreadsheets and database systems.
All formats share the same underlying data. Choosing a format is purely about consumer compatibility. The information content is identical.
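To make "same data, different serialization" concrete, the sketch below renders one normalized item as an RSS 2.0 <item> and as a JSON Feed item. Escaping is deliberately minimal and the field set is trimmed for brevity:

```typescript
// Escape the characters XML treats specially (minimal, for the sketch only).
const esc = (s: string) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

interface Item { title: string; link: string; guid: string; pubDate: string }

// RSS 2.0: isPermaLink="false" because the GUID is not a plain URL here.
function toRssItem(i: Item): string {
  return [
    "<item>",
    `  <title>${esc(i.title)}</title>`,
    `  <link>${esc(i.link)}</link>`,
    `  <guid isPermaLink="false">${esc(i.guid)}</guid>`,
    `  <pubDate>${i.pubDate}</pubDate>`,
    "</item>",
  ].join("\n");
}

// JSON Feed (https://jsonfeed.org) expects date_published in RFC 3339 form.
function toJsonFeedItem(i: Item) {
  return {
    id: i.guid,
    url: i.link,
    title: i.title,
    date_published: new Date(i.pubDate).toISOString(),
  };
}
```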
Refresh Cycle Behavior
The pipeline runs on a configurable schedule. Each cycle:
- Fetches the source page fresh (no caching of source content)
- Extracts and compares against known state
- Updates output files only if changes are detected
- Records metadata about the refresh (timestamp, items found, status)
If no new content exists, outputs remain unchanged. Consumers fetching the feed URL receive cached results until the next successful update. This minimizes unnecessary processing while ensuring freshness when content changes.
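Wiring the stages together, a single cycle might look like the sketch below, which reuses the hypothetical helpers from the earlier sketches (fetchWithRetry, extractItems, makeGuid, classify); writeOutputs stands in for the serialization step:

```typescript
// Hypothetical stand-in for the format-specific output stage.
async function writeOutputs(items: FeedItem[]): Promise<void> {
  // Serialize to XML / JSON / CSV and persist; omitted in this sketch.
}

// One refresh cycle: fetch fresh, extract, compare, write only on change,
// and record metadata about the run. Illustrative wiring only.
async function refreshOnce(sourceUrl: string) {
  const startedAt = new Date().toISOString();
  const html = await fetchWithRetry(sourceUrl); // source fetched fresh each cycle
  const items = extractItems(html, sourceUrl);
  const changed = items.filter(
    (i) => classify(i.link, makeGuid(i.link, i.title, i.description)) !== "unchanged",
  );
  if (changed.length > 0) {
    await writeOutputs(items); // outputs rewritten only when something changed
  }
  return {
    startedAt,
    itemsFound: items.length,
    newOrUpdated: changed.length,
    status: "ok" as const,
  };
}
```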
Failure Handling and Stability
The Generator implements multiple resilience patterns:
Transient failure recovery. Network timeouts, temporary 5xx errors, and rate limits trigger automatic retries. The Generator distinguishes between recoverable and permanent failures.
Content preservation. If a refresh fails, existing feed content is preserved. Consumers continue receiving the last successful data rather than errors or empty feeds.
Layout change adaptation. Minor source page changes (CSS updates, element reordering) are handled automatically. The extraction heuristics tolerate reasonable variation without breaking.
Alerting. Persistent failures or dramatic content changes can trigger notifications, allowing intervention before consumers are affected.
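The content-preservation and alerting behavior can be layered on top of the same cycle. In this sketch, refreshOnce is the function from the previous example and notifyOwner is a hypothetical notification hook; the five-failure threshold is an assumption:

```typescript
// Hypothetical notification hook; a real system would page or email.
function notifyOwner(sourceUrl: string, err: unknown): void {
  console.error(`feed for ${sourceUrl} is failing persistently:`, err);
}

let consecutiveFailures = 0;

async function refreshSafely(sourceUrl: string) {
  try {
    const meta = await refreshOnce(sourceUrl);
    consecutiveFailures = 0; // healthy again
    return meta;
  } catch (err) {
    consecutiveFailures++;
    // Existing feed output is left untouched: consumers keep receiving the
    // last successful data instead of errors or an empty feed.
    if (consecutiveFailures >= 5) notifyOwner(sourceUrl, err);
    return { status: "failed" as const, error: String(err) };
  }
}
```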
Infrastructure mindset: The Generator is designed as infrastructure that runs unattended. Failures are expected and handled. The goal is continuous, reliable operation without manual intervention.
Frequently Asked Questions
How does the Generator know what content to extract?
The Generator uses pattern recognition to identify repeating content structures on a page: article cards, list items, table rows. It looks for common elements like titles, links, dates, and images within these structures and maps them to feed fields.
What happens if the source page is temporarily unavailable?
The Generator retries failed fetches with exponential backoff. If a page remains unavailable, the existing feed content is preserved. Consumers continue receiving the last successful data until the source recovers.
How are duplicate items prevented?
Each item receives a unique identifier (GUID) based on its URL and content hash. When the Generator detects an item it has seen before, it skips adding a duplicate. This ensures feed consumers do not receive the same item twice.
Can the Generator handle pages with infinite scroll?
For pages using infinite scroll, the Generator captures content visible in the initial viewport and any content that loads within the configured timeout. Enabling JavaScript rendering and adjusting timeouts can improve coverage.
How quickly do new items appear in the feed?
New items appear after the next refresh cycle. The interval depends on your plan, ranging from hourly to near-real-time. Once detected, new items are immediately available in all output formats.
Does the Generator preserve the order of items?
Yes. Items appear in the feed in the same order they appear on the source page, typically with newest content first. The Generator does not reorder items unless the source itself changes their sequence.