Problem Statement
The goal of this study is to evaluate the potential biases in a proprietary CV parser, which automatically extracts skills from anonymized, raw resumes.
Note: We did not have access to the proprietary parser itself, so our analysis covers only the parser's input (the raw CV text) and its output (the extracted skills).
Specifically, we aim to answer:
- Candidate Representation:
  - How are candidates' profiles represented in the parsing outputs?
  - Are there systematic differences in the number or types of skills extracted across CVs?
- Gendered Skill Association:
  - Do certain skills appear significantly more often in profiles inferred to be male versus female?
  - Could these patterns reflect underlying parser assumptions or training data imbalances?
- Cultural and Geographical Bias:
  - Are skill categories (`skill_type`) that align with specific cultural or regional backgrounds overrepresented?
- Hard vs. Soft Skills Distribution:
  - What is the relative frequency of hard skills versus soft skills in the parsed output?
  - Does the parser under-detect one category, potentially skewing candidate profiles?
- Demographic Underrepresentation:
  - Are certain demographic groups (e.g., inferred by language, region, or other proxies) underrepresented in the overall skill set?
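
As an illustration of how the gendered skill association question could be made testable, the sketch below compares how often one skill appears in two groups of profiles using a two-proportion z-test. This is one possible operationalization, not necessarily the method used in this study. The record layout, the group labels, and the sample data are hypothetical, and the normal approximation is unreliable for small counts like those in the toy data.

```python
from collections import Counter
from math import erfc, sqrt

# Hypothetical records: (inferred_gender, set of skills extracted by the parser).
records = [
    ("female", {"python", "communication"}),
    ("female", {"communication", "excel"}),
    ("male", {"python", "sql"}),
    ("male", {"python", "communication"}),
]


def skill_rates(records, skill):
    """Return {group: (profiles_with_skill, total_profiles)} for one skill."""
    totals, hits = Counter(), Counter()
    for group, skills in records:
        totals[group] += 1
        hits[group] += skill in skills  # True counts as 1, False as 0
    return {group: (hits[group], totals[group]) for group in totals}


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for equal proportions; returns (z, p_value)."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Skill is present in all profiles or in none: no detectable difference.
        return 0.0, 1.0
    z = (hits_a / n_a - hits_b / n_b) / se
    # Two-sided p-value under the standard normal distribution.
    return z, erfc(abs(z) / sqrt(2))


rates = skill_rates(records, "python")
z, p = two_proportion_z_test(*rates["female"], *rates["male"])
print(f"python: z={z:.3f}, p={p:.3f}")
```

When the same test is repeated for every skill in the parsed output, the resulting p-values would need a multiple-comparison correction before any single skill is reported as significantly associated with one group.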