Methodology
ResumaryAI Hiring Index Methodology: How We Track 800,000+ Roles and 10,000+ Employers in Our Commercial Job Warehouse
A transparent look at our data collection, deduplication, and freshness assurance pipeline.
Note on headline numbers: These figures match public copy on our homepage (e.g. 10,000+ employers in the index and 800,000+ open roles tracked over time). They are refreshed as the warehouse grows; the live count of is_active rows in public.jobs changes with every ingestion cycle.
Introduction: Why We Built Our Own Hiring Index
Job market data is notoriously difficult to trust. Government labor reports lag by 30–60 days. Aggregated job boards are flooded with "ghost postings"—roles that remain listed long after they've been filled. And most AI resume tools rely on generic language models that have no connection to what companies are actually hiring for right now.
At ResumaryAI, we maintain a commercial job warehouse (Supabase companies + jobs) that backs homepage hiring signals, Career Match Live, and related product surfaces—on the order of 10,000+ employers and 800,000+ roles tracked over time. This article explains our design intent: sources, ingestion patterns, how we think about uniqueness and freshness, and where the live product differs (e.g. AI features that use web search rather than a SQL join to jobs).
We believe transparency is the foundation of trust. If you use Candidate Fit, Company Research, or Career Match, you deserve to know which results are driven by our warehouse and which come from an AI model combined with web search.
1. Where Does the Data Come From?
Our warehouse aggregates postings from multiple connector families. The in-repo verification stack includes adapters for Greenhouse, Lever, Ashby, Workable, Workday, generic careers-page HTML, Teamtailor, Talent Brew / iCIMS, and an aggregate path we label jsearch_aggregate (public/API-sourced listings), among others. The mix shifts as we add employers.
| Source Type | Share (indicative) | Description |
|---|---|---|
| ATS & career endpoints | Majority | Structured JSON from public ATS APIs (e.g. Greenhouse boards API) and similar feeds, plus HTML career pages where no API exists. |
| Aggregators & niche boards | Substantial minority | Rows ingested via permitted aggregate/API paths; we aim to respect site policies and rate limits on each connector. |
| Structured markup | Where available | Some sources expose JobPosting JSON-LD or other structured fields we parse when present. |
What we avoid: We do not target LinkedIn or Indeed in ways that require bypassing login walls or violating their Terms of Service. If a source forbids automated access, we do not use it in production ingestion.
2. How We Identify a Unique Job Posting (Deduplication)
In our published schema, job_url is the global uniqueness key—the same posting URL is not stored twice (UNIQUE (job_url) on public.jobs). That is the authoritative identity for "one listing" in the warehouse.
The same role can still appear under different URLs across mirrors (boards, aggregators, locale variants). For those cases we use normalization and matching heuristics during ingestion (company keys, titles, locations) to avoid double-counting in analytics. The following hash is an illustrative deterministic fingerprint we may combine with URL-first identity—not a promise that every deployment hashes exactly these fields:
# Illustrative only: one way to express the fingerprint in Python (hashlib is stdlib)
import hashlib

def fingerprint(company_name, job_title, city, region):
    def norm(s):  # lowercase and collapse whitespace for comparison
        return " ".join(s.lower().split())
    key = "|".join([norm(company_name), norm(job_title), norm(city), norm(region)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
Normalization (conceptual): Common title suffixes and synonyms (e.g. "Software Engineer" vs. "SWE") may be collapsed for comparison.
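As a concrete sketch of this title normalization, here is a minimal version in Python. The synonym table below is hypothetical, invented for illustration; production mappings are maintained and tuned per connector.

```python
import re

# Hypothetical synonym table; not our production mapping.
TITLE_SYNONYMS = {
    "swe": "software engineer",
    "sr.": "senior",
    "sr": "senior",
}

def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, expand synonyms."""
    tokens = re.sub(r"[^\w\s.]", " ", title.lower()).split()
    tokens = [TITLE_SYNONYMS.get(t, t) for t in tokens]
    return " ".join(tokens)

# "Sr. SWE" and "Senior Software Engineer" now compare equal
assert normalize_title("Sr. SWE") == normalize_title("Senior Software Engineer")
```

The point is that comparison happens on the normalized form, so spelling variants collapse before any fingerprinting or matching step.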
Collisions: When two candidate rows disagree, we prefer the most recently observed upstream version and keep audit fields such as first_seen_at / last_seen_at.
These rules materially reduce double-counting versus raw crawl volume. Exact savings depend on the employer mix; we tune the pipeline over time.
3. Ingestion Infrastructure: APIs First, Browsers Last
High-volume hiring data requires disciplined ingestion: prefer stable APIs, fall back to HTML, and only then consider a browser.
3.1 Throttling & backoff
Connectors watch HTTP status and latency. On 429 responses or sustained slowness, we reduce concurrency or delay retries for that host so we stay within reasonable operational and policy bounds.
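A minimal version of this policy can be sketched as follows. The status codes match the description above; the base delay, cap, and slowness threshold are illustrative values, not our production configuration.

```python
import random

def retry_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter for a throttled host (illustrative values)."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.0)  # jitter avoids synchronized retries

def should_back_off(status: int, latency_s: float, slow_threshold_s: float = 5.0) -> bool:
    """Back off on rate-limit/overload responses or sustained slowness."""
    return status in (429, 503) or latency_s > slow_threshold_s
```

Per-host state (not shown) tracks the current attempt count so concurrency drops and delays grow only for the affected origin, not the whole fleet.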
3.2 Headless browser fallback (Playwright)
Primary path: Structured JSON from ATS public endpoints (e.g. Greenhouse board APIs) or other machine-readable feeds.
Fallback: For heavily client-rendered career sites, our stack can use Playwright (Chromium) to render the page. In production serverless environments this path is often disabled or limited because it requires a full browser runtime; most volume is designed to flow through API/HTML fetchers instead.
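The "APIs first, browsers last" ordering amounts to a routing decision made per source. A simplified sketch of that decision (the flags and tier names are invented for illustration):

```python
from enum import Enum

class FetchPath(Enum):
    ATS_API = "ats_api"   # structured JSON from a public ATS endpoint (preferred)
    HTML = "html"         # static career-page HTML
    BROWSER = "browser"   # Playwright render (last resort)

def choose_path(has_api: bool, renders_client_side: bool, browser_enabled: bool) -> FetchPath:
    """Prefer stable APIs, fall back to HTML, use a browser only where enabled."""
    if has_api:
        return FetchPath.ATS_API
    if renders_client_side and browser_enabled:
        return FetchPath.BROWSER
    return FetchPath.HTML  # also the fallback when the browser path is disabled
```

In serverless deployments where the browser runtime is unavailable, `browser_enabled` is effectively false and heavily client-rendered sites degrade to the HTML path or are queued for a worker that does have a browser.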
3.3 Network & headers
Outbound requests use identifiable, conventional HTTP clients with valid User-Agent strings. We honor X-Robots-Tag and similar signals returned by the origin. Specific egress topology (e.g. fixed vs. rotating IPs) is an operational detail that may change; we do not rely on "stealth" as a substitute for permissioned data access.
3.4 Robots.txt & terms
Where our connectors implement robots policy checks, we follow the declared rules for the declared crawler identity. Domains that disallow automation are skipped or queued for manual review. Exact cache TTLs for robots files vary by worker.
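Where a connector does enforce robots policy, the check itself is standard. A sketch using Python's stdlib parser; the robots.txt body and crawler name below are examples for illustration, not any real site's policy or our actual user agent.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body; in production this is fetched and cached per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /careers/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def allowed(url: str, agent: str = "ExampleCrawler") -> bool:
    """True when the declared crawler identity may fetch this URL."""
    return rp.can_fetch(agent, url)
```

Domains whose rules disallow the declared identity are skipped or routed to manual review, per the policy above.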
4. Keeping Data Fresh: last_seen_at and Active Rows
Job postings have a short half-life. A listing that was active yesterday might be filled today.
Timestamps in public.jobs: Each row carries first_seen_at and last_seen_at (timestamptz). When a listing is observed again on a successful ingest, last_seen_at advances.
Active vs. removed: Open roles are those with is_active = true. When a posting disappears from upstream sources, ingestion marks it inactive and sets removed_at (soft delete) so history remains auditable. Public "open role" counts and APIs that filter to active jobs exclude inactive rows.
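The soft-delete step reduces to a set difference between what the warehouse holds and what the latest successful crawl observed. A pure-logic sketch (field names mirror the schema described above; the dict-based rows stand in for database rows):

```python
from datetime import datetime, timezone

def reconcile(stored: dict, seen_urls: set) -> dict:
    """Refresh last_seen_at for rows observed upstream; soft-delete the rest.

    stored maps job_url -> row dict with is_active / last_seen_at / removed_at.
    """
    now = datetime.now(timezone.utc)
    for url, row in stored.items():
        if url in seen_urls:
            row["last_seen_at"] = now
        elif row["is_active"]:
            row["is_active"] = False
            row["removed_at"] = now  # soft delete: history stays auditable
    return stored
```

Because rows are never hard-deleted, "open role" counts filter on is_active while historical analyses can still see the full lifecycle of each posting.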
Scheduling: Employer-level cadence is driven by fields such as next_crawl_at on public.companies (priority, volume, and source reliability)—not a single fixed calendar for every brand.
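One way such a cadence can be expressed is as a priority-to-interval mapping; the tiers and hour values below are invented for illustration, since real cadences are tuned per employer and stored on the companies rows.

```python
from datetime import datetime, timedelta

# Hypothetical cadence tiers; production values are tuned per employer.
CRAWL_INTERVAL_HOURS = {"high": 6, "medium": 24, "low": 72}

def next_crawl_at(priority: str, last_crawled: datetime) -> datetime:
    """Compute the next scheduled crawl from a priority tier and last crawl time."""
    hours = CRAWL_INTERVAL_HOURS.get(priority, 72)  # default to the slowest tier
    return last_crawled + timedelta(hours=hours)
```

A scheduler then simply selects companies whose next_crawl_at has passed, which is why no single fixed calendar applies to every brand.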
Homepage radar: Pre-aggregates such as country_top_roles (via RPC like get_top_roles_by_country) power hiring "radar" style UI; they reflect batch jobs over the warehouse, not a live crawl during page load.
4.1 Market snapshot tables & the growth_rate field
Where the UI reads from: Job Search "Market hiring snapshot" calls Supabase RPCs get_top_roles_by_country and get_function_insights_by_geo. Those return rows from public.country_top_roles and public.country_function_insights for the latest computed_date in each table, including numeric columns job_count, company_count, and growth_rate.
Where growth_rate is written: The repository's migrations define the columns, but the values are produced by an offline batch / ETL job that upserts these tables. Internal notes in the migrations refer to a generator script named top_roles_generator.py; that script is not shipped in this workspace, so the exact formula must be confirmed in the environment where the batch runs (or in the data team's repo).
Why every bucket might look "up": Uniform growth across buckets is not guaranteed to be economically meaningful. Common causes include: (1) the previous computed_date had thin or partial backfill while the latest run is fuller, inflating sequential deltas; (2) the warehouse's coverage grew monotonically for that market; (3) the pipeline stores the ratio in one convention (e.g. already scaled) while a consumer assumes another. Always reconcile against raw counts.
How to verify in SQL (Supabase SQL editor): Pick your market string (the same value as country / geo_market, e.g. 'United States') and inspect the last few computed dates for one bucket:
-- Role bucket: compare job_count and stored growth_rate across dates
SELECT computed_date, role_category, job_count, company_count, growth_rate
FROM public.country_top_roles
WHERE country = 'United States' AND role_category = 'Software Engineer'
ORDER BY computed_date DESC
LIMIT 7;
-- Function bucket
SELECT computed_date, function_l2, job_count, company_count, growth_rate
FROM public.country_function_insights
WHERE geo_market = 'United States' AND function_l2 = 'Engineering'
ORDER BY computed_date DESC
LIMIT 7;
-- Manual implied growth from last two rows (if growth_rate is (curr-prev)/prev):
WITH x AS (
SELECT computed_date, job_count,
LAG(job_count) OVER (ORDER BY computed_date) AS prev_j
FROM public.country_function_insights
WHERE geo_market = 'United States' AND function_l2 = 'Engineering'
)
SELECT computed_date, job_count, prev_j,
CASE WHEN prev_j IS NULL OR prev_j = 0 THEN NULL
ELSE (job_count::float - prev_j) / prev_j END AS implied_ratio
FROM x
ORDER BY computed_date DESC
LIMIT 5;
If implied_ratio does not match growth_rate (within rounding), your ETL is using a different definition—or a different pair of dates—than a simple day-over-day ratio on job_count.
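The same reconciliation can be done client-side once the rows are fetched. A sketch, assuming growth_rate is intended to be the simple (curr - prev) / prev ratio on job_count; if your ETL uses another definition, adjust accordingly.

```python
def implied_ratios(rows):
    """rows: list of (computed_date, job_count) sorted ascending by date.

    Returns (date, implied_ratio) pairs; ratio is None when the previous
    count is missing or zero, mirroring the NULL guard in the SQL above.
    """
    out = []
    prev = None
    for date, count in rows:
        ratio = None if not prev else (count - prev) / prev
        out.append((date, ratio))
        prev = count
    return out

def matches_stored(implied, stored, tol=1e-6):
    """True when an implied ratio agrees with the stored growth_rate within tol."""
    return implied is not None and stored is not None and abs(implied - stored) <= tol
```

Running this over a week of rows and comparing against the stored column makes definition mismatches obvious without touching the ETL itself.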
5. Known Limitations & Biases (What We Don't Claim)
We are transparent about the blind spots in our methodology:
| Limitation | Explanation |
|---|---|
| Small Business Underrepresentation | Companies without a dedicated careers page or ATS feed (e.g., local restaurants, boutique consultancies) are not captured. Our index skews toward employers with formalized hiring processes. |
| Private / Hidden Listings | Roles filled exclusively through executive search firms or internal referrals are invisible to our crawler. |
| Geographic Gaps | Our coverage is strongest in North America and Europe. Emerging markets may have thinner coverage due to differing ATS ecosystems. |
| Linguistic Limitations | We process English-language job descriptions most reliably. Support for other languages varies by source. |
We encourage users to treat the Hiring Index as a directional signal—a high-quality sample of the active labor market—rather than a census of every available job on earth.
6. How This Warehouse Relates to Product Features
The job warehouse and the AI layer serve different roles. Below is how each major feature actually uses data in the current architecture:
| Feature | Relationship to jobs / index |
|---|---|
| Company Research | The production API (/api/company/research) is built around AI plus web search over public pages (careers sites, reviews, news). It does not stream a guaranteed join against every row in public.jobs for each request. Insights may still align with what you would see on careers pages because both draw from the same public hiring ecosystem. |
| Candidate Fit | Multi-stage AI: resume profiling, market/role intelligence (web research when enabled), and synthesis grounded in your pasted job description and resume. It is not a single opaque ATS keyword score. Optional retrieval or similarity steps may support consistency; user-facing copy stays in coaching language. |
| Career Match (Live) | Database-first: POST /api/career-match/live resolves AI-suggested employers to rows in companies, then loads active listings through get_company_jobs on public.jobs. It does not re-crawl ATS endpoints during the user session. Trend-style UI elsewhere may use aggregates such as country_top_roles. |
7. Future Improvements
We are actively working on:
- Expanded ATS connector coverage for international platforms.
- Improved deduplication for roles with ambiguous titles (e.g., "Customer Success Manager" vs. "Client Partner").
- Public API access for academic researchers and career services.
8. Questions or Feedback?
We welcome scrutiny. If you're a researcher, journalist, or curious user who wants to understand more about our data pipeline, reach out to us at help@resumaryai.com.
This methodology document was last updated on April 10, 2026.