Methodology
ResumaryAI Hiring Index Methodology: How We Track 800,000+ Roles and 10,000+ Employers in Our Commercial Job Warehouse
A transparent look at our data collection, deduplication, and freshness assurance pipeline.
Note on headline numbers: These figures match public copy on our homepage (e.g. 10,000+ employers in the index and 800,000+ open roles tracked over time). They are refreshed as the warehouse grows; the live count of is_active rows in public.jobs changes with every ingestion cycle.
Introduction: Why We Built Our Own Hiring Index
Job market data is notoriously difficult to trust. Government labor reports lag by 30–60 days. Aggregated job boards are flooded with "ghost postings"—roles that remain listed long after they've been filled. And most AI resume tools rely on generic language models that have no connection to what companies are actually hiring for right now.
At ResumaryAI, we maintain a commercial job warehouse (Supabase companies + jobs) that backs homepage hiring signals, Career Match Live, and related product surfaces—on the order of 10,000+ employers and 800,000+ roles tracked over time. This article explains our design intent: sources, ingestion patterns, how we think about uniqueness and freshness, and where the live product differs (e.g. AI features that use web search rather than a SQL join to jobs).
We believe transparency is the foundation of trust. If you use Candidate Fit, Company Research, or Career Match, you deserve to know which results are driven by our warehouse and which come from an AI model combined with web search.
1. Where Does the Data Come From?
Our warehouse aggregates postings from multiple connector families. The in-repo verification stack includes adapters for Greenhouse, Lever, Ashby, Workable, Workday, generic careers-page HTML, Teamtailor, Talent Brew / iCIMS, and an aggregate path we label jsearch_aggregate (public/API-sourced listings), among others. The mix shifts as we add employers.
| Source Type | Share (indicative) | Description |
|---|---|---|
| ATS & career endpoints | Majority | Structured JSON from public ATS APIs (e.g. Greenhouse boards API) and similar feeds, plus HTML career pages where no API exists. |
| Aggregators & niche boards | Substantial minority | Rows ingested via permitted aggregate/API paths; we aim to respect site policies and rate limits on each connector. |
| Structured markup | Where available | Some sources expose JobPosting JSON-LD or other structured fields we parse when present. |
What we avoid: We do not target LinkedIn or Indeed in ways that require bypassing login walls or violating their Terms of Service. If a source forbids automated access, we do not use it in production ingestion.
2. How We Identify a Unique Job Posting (Deduplication)
In our published schema, job_url is the global uniqueness key—the same posting URL is not stored twice (UNIQUE (job_url) on public.jobs). That is the authoritative identity for "one listing" in the warehouse.
The same role can still appear under different URLs across mirrors (boards, aggregators, locale variants). For those cases we use normalization and matching heuristics during ingestion (company keys, titles, locations) to avoid double-counting in analytics. The following hash is an illustrative deterministic fingerprint we may combine with URL-first identity—not a promise that every deployment hashes exactly these fields:
# Illustrative only: one way to express the fingerprint in Python (hashlib is stdlib)
import hashlib

def fingerprint(company_name, job_title, city, region):
    def norm(s):  # lowercase and collapse whitespace for comparison
        return " ".join(s.lower().split())
    key = "|".join([norm(company_name), norm(job_title), norm(city), norm(region)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
Normalization (conceptual): Common title suffixes and synonyms (e.g. "Software Engineer" vs. "SWE") may be collapsed for comparison.
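As a concrete sketch of this title normalization, here is a minimal version in Python. The synonym table below is hypothetical, invented for illustration; production mappings are maintained and tuned per connector.

```python
import re

# Hypothetical synonym table; not our production mapping.
TITLE_SYNONYMS = {
    "swe": "software engineer",
    "sr.": "senior",
    "sr": "senior",
}

def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, expand synonyms."""
    tokens = re.sub(r"[^\w\s.]", " ", title.lower()).split()
    tokens = [TITLE_SYNONYMS.get(t, t) for t in tokens]
    return " ".join(tokens)

# "Sr. SWE" and "Senior Software Engineer" now compare equal
assert normalize_title("Sr. SWE") == normalize_title("Senior Software Engineer")
```

The point is that comparison happens on the normalized form, so spelling variants collapse before any fingerprinting or matching step.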
Collisions: When two candidate rows disagree, we prefer the most recently observed upstream version and keep audit fields such as first_seen_at / last_seen_at.
These rules materially reduce double-counting versus raw crawl volume. Exact savings depend on the employer mix; we tune the pipeline over time.
3. Ingestion Infrastructure: APIs First, Browsers Last
High-volume hiring data requires disciplined ingestion: prefer stable APIs, fall back to HTML, and only then consider a browser.
3.1 Throttling & backoff
Connectors watch HTTP status and latency. On 429 responses or sustained slowness, we reduce concurrency or delay retries for that host so we stay within reasonable operational and policy bounds.
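A minimal version of this policy can be sketched as follows. The status codes match the description above; the base delay, cap, and slowness threshold are illustrative values, not our production configuration.

```python
import random

def retry_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter for a throttled host (illustrative values)."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.0)  # jitter avoids synchronized retries

def should_back_off(status: int, latency_s: float, slow_threshold_s: float = 5.0) -> bool:
    """Back off on rate-limit/overload responses or sustained slowness."""
    return status in (429, 503) or latency_s > slow_threshold_s
```

Per-host state (not shown) tracks the current attempt count so concurrency drops and delays grow only for the affected origin, not the whole fleet.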
3.2 Headless browser fallback (Playwright)
Primary path: Structured JSON from ATS public endpoints (e.g. Greenhouse board APIs) or other machine-readable feeds.
Fallback: For heavily client-rendered career sites, our stack can use Playwright (Chromium) to render the page. In production serverless environments this path is often disabled or limited because it requires a full browser runtime; most volume is designed to flow through API/HTML fetchers instead.
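The "APIs first, browsers last" ordering amounts to a routing decision made per source. A simplified sketch of that decision (the flags and tier names are invented for illustration):

```python
from enum import Enum

class FetchPath(Enum):
    ATS_API = "ats_api"   # structured JSON from a public ATS endpoint (preferred)
    HTML = "html"         # static career-page HTML
    BROWSER = "browser"   # Playwright render (last resort)

def choose_path(has_api: bool, renders_client_side: bool, browser_enabled: bool) -> FetchPath:
    """Prefer stable APIs, fall back to HTML, use a browser only where enabled."""
    if has_api:
        return FetchPath.ATS_API
    if renders_client_side and browser_enabled:
        return FetchPath.BROWSER
    return FetchPath.HTML  # also the fallback when the browser path is disabled
```

In serverless deployments where the browser runtime is unavailable, `browser_enabled` is effectively false and heavily client-rendered sites degrade to the HTML path or are queued for a worker that does have a browser.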
3.3 Network & headers
Outbound requests use identifiable, conventional HTTP clients with valid User-Agent strings. We honor X-Robots-Tag and similar signals returned by the origin. Specific egress topology (e.g. fixed vs. rotating IPs) is an operational detail that may change; we do not rely on "stealth" as a substitute for permissioned data access.
3.4 Robots.txt & terms
Where our connectors implement robots policy checks, we follow the declared rules for the declared crawler identity. Domains that disallow automation are skipped or queued for manual review. Exact cache TTLs for robots files vary by worker.
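Where a connector does enforce robots policy, the check itself is standard. A sketch using Python's stdlib parser; the robots.txt body and crawler name below are examples for illustration, not any real site's policy or our actual user agent.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body; in production this is fetched and cached per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /careers/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def allowed(url: str, agent: str = "ExampleCrawler") -> bool:
    """True when the declared crawler identity may fetch this URL."""
    return rp.can_fetch(agent, url)
```

Domains whose rules disallow the declared identity are skipped or routed to manual review, per the policy above.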
4. Keeping Data Fresh: last_seen_at and Active Rows
Job postings have a short half-life. A listing that was active yesterday might be filled today.
Timestamps in public.jobs: Each row carries first_seen_at and last_seen_at (timestamptz). When a listing is observed again on a successful ingest, last_seen_at advances.
Active vs. removed: Open roles are those with is_active = true. When a posting disappears from upstream sources, ingestion marks it inactive and sets removed_at (soft delete) so history remains auditable. Public "open role" counts and APIs that filter to active jobs exclude inactive rows.
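The soft-delete step reduces to a set difference between what the warehouse holds and what the latest successful crawl observed. A pure-logic sketch (field names mirror the schema described above; the dict-based rows stand in for database rows):

```python
from datetime import datetime, timezone

def reconcile(stored: dict, seen_urls: set) -> dict:
    """Refresh last_seen_at for rows observed upstream; soft-delete the rest.

    stored maps job_url -> row dict with is_active / last_seen_at / removed_at.
    """
    now = datetime.now(timezone.utc)
    for url, row in stored.items():
        if url in seen_urls:
            row["last_seen_at"] = now
        elif row["is_active"]:
            row["is_active"] = False
            row["removed_at"] = now  # soft delete: history stays auditable
    return stored
```

Because rows are never hard-deleted, "open role" counts filter on is_active while historical analyses can still see the full lifecycle of each posting.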
Scheduling: Employer-level cadence is driven by fields such as next_crawl_at on public.companies (priority, volume, and source reliability)—not a single fixed calendar for every brand.
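One way such a cadence can be expressed is as a priority-to-interval mapping; the tiers and hour values below are invented for illustration, since real cadences are tuned per employer and stored on the companies rows.

```python
from datetime import datetime, timedelta

# Hypothetical cadence tiers; production values are tuned per employer.
CRAWL_INTERVAL_HOURS = {"high": 6, "medium": 24, "low": 72}

def next_crawl_at(priority: str, last_crawled: datetime) -> datetime:
    """Compute the next scheduled crawl from a priority tier and last crawl time."""
    hours = CRAWL_INTERVAL_HOURS.get(priority, 72)  # default to the slowest tier
    return last_crawled + timedelta(hours=hours)
```

A scheduler then simply selects companies whose next_crawl_at has passed, which is why no single fixed calendar applies to every brand.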
Homepage radar: Pre-aggregates such as country_top_roles (via RPC like get_top_roles_by_country) power hiring "radar" style UI; they reflect batch jobs over the warehouse, not a live crawl during page load.
4.1 Market snapshot tables & the growth_rate field
Where the UI reads from: Job Search "Market hiring snapshot" calls Supabase RPCs get_top_roles_by_country and get_function_insights_by_geo. Those return rows from public.country_top_roles and public.country_function_insights for the latest computed_date in each table, including numeric columns job_count, company_count, and growth_rate.
Where growth_rate is written: The repository's migrations define the columns, but the values are produced by an offline batch / ETL job that upserts these tables. Internal notes in the migrations refer to a generator script named top_roles_generator.py; that script is not shipped in this workspace, so the exact formula must be confirmed in the environment where the batch runs (or in the data team's repo).
Why every bucket might look "up": Uniform growth across buckets is not guaranteed to be economically meaningful. Common causes include: (1) the previous computed_date had thin or partial backfill while the latest run is fuller, inflating sequential deltas; (2) the warehouse's coverage grew monotonically for that market; (3) the pipeline stores the ratio in one convention (e.g. already scaled) while a consumer assumes another. Always reconcile against raw counts.
How to verify in SQL (Supabase SQL editor): Pick your market string (the same value as country / geo_market, e.g. 'United States') and inspect the last few computed dates for one bucket:
-- Role bucket: compare job_count and stored growth_rate across dates
SELECT computed_date, role_category, job_count, company_count, growth_rate
FROM public.country_top_roles
WHERE country = 'United States' AND role_category = 'Software Engineer'
ORDER BY computed_date DESC
LIMIT 7;
-- Function bucket
SELECT computed_date, function_l2, job_count, company_count, growth_rate
FROM public.country_function_insights
WHERE geo_market = 'United States' AND function_l2 = 'Engineering'
ORDER BY computed_date DESC
LIMIT 7;
-- Manual implied growth from last two rows (if growth_rate is (curr-prev)/prev):
WITH x AS (
SELECT computed_date, job_count,
LAG(job_count) OVER (ORDER BY computed_date) AS prev_j
FROM public.country_function_insights
WHERE geo_market = 'United States' AND function_l2 = 'Engineering'
)
SELECT computed_date, job_count, prev_j,
CASE WHEN prev_j IS NULL OR prev_j = 0 THEN NULL
ELSE (job_count::float - prev_j) / prev_j END AS implied_ratio
FROM x
ORDER BY computed_date DESC
LIMIT 5;
If implied_ratio does not match growth_rate (within rounding), your ETL is using a different definition—or a different pair of dates—than a simple day-over-day ratio on job_count.
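The same reconciliation can be done client-side once the rows are fetched. A sketch, assuming growth_rate is intended to be the simple (curr - prev) / prev ratio on job_count; if your ETL uses another definition, adjust accordingly.

```python
def implied_ratios(rows):
    """rows: list of (computed_date, job_count) sorted ascending by date.

    Returns (date, implied_ratio) pairs; ratio is None when the previous
    count is missing or zero, mirroring the NULL guard in the SQL above.
    """
    out = []
    prev = None
    for date, count in rows:
        ratio = None if not prev else (count - prev) / prev
        out.append((date, ratio))
        prev = count
    return out

def matches_stored(implied, stored, tol=1e-6):
    """True when an implied ratio agrees with the stored growth_rate within tol."""
    return implied is not None and stored is not None and abs(implied - stored) <= tol
```

Running this over a week of rows and comparing against the stored column makes definition mismatches obvious without touching the ETL itself.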
5. Known Limitations & Biases (What We Don't Claim)
We are transparent about the blind spots in our methodology:
| Limitation | Explanation |
|---|---|
| Small Business Underrepresentation | Companies without a dedicated careers page or ATS feed (e.g., local restaurants, boutique consultancies) are not captured. Our index skews toward employers with formalized hiring processes. |
| Private / Hidden Listings | Roles filled exclusively through executive search firms or internal referrals are invisible to our crawler. |
| Geographic Gaps | Our coverage is strongest in North America and Europe. Emerging markets may have thinner coverage due to differing ATS ecosystems. |
| Linguistic Limitations | We process English-language job descriptions most reliably. Support for other languages varies by source. |
We encourage users to treat the Hiring Index as a directional signal—a high-quality sample of the active labor market—rather than a census of every available job on earth.
6. How This Warehouse Relates to Product Features
The job warehouse and the AI layer serve different roles. Below is how each major feature actually uses data in the current architecture:
| Feature | Relationship to jobs / index |
|---|---|
| Company Research | The production API (/api/company/research) is built around AI plus web search over public pages (careers sites, reviews, news). It does not stream a guaranteed join against every row in public.jobs for each request. Insights may still align with what you would see on careers pages because both draw from the same public hiring ecosystem. |
| Candidate Fit | Multi-stage AI: resume profiling, market/role intelligence (web research when enabled), and synthesis grounded in your pasted job description and resume. It is not a single opaque ATS keyword score. Optional retrieval or similarity steps may support consistency; user-facing copy stays in coaching language. |
| Career Match (Live) | Database-first: POST /api/career-match/live resolves AI-suggested employers to rows in companies, then loads active listings through get_company_jobs on public.jobs. It does not re-crawl ATS endpoints during the user session. Trend-style UI elsewhere may use aggregates such as country_top_roles. |
7. Future Improvements
We are actively working on:
- Expanded ATS connector coverage for international platforms.
- Improved deduplication for roles with ambiguous titles (e.g., "Customer Success Manager" vs. "Client Partner").
- Public API access for academic researchers and career services.
8. Questions or Feedback?
We welcome scrutiny. If you're a researcher, journalist, or curious user who wants to understand more about our data pipeline, reach out to us at help@resumaryai.com.
This methodology document was last updated on April 10, 2026.