Results

In this section we walk through our key findings:

1. Error Detection

As explained on the Pipeline page, our study first set out to highlight the errors the parser makes in three categories: Driving License, Language, and Job Title.

For Driving License, we treat the task as a binary has_any_driving_license label for each candidate; for the other two categories we use a finer granularity and examine the agreement between our method and the parser on every individual skill.
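For the binary Driving License case, the comparison can be sketched as follows. This is a minimal illustration, assuming that the label from our method serves as the reference and the parser output as the prediction; the function name `confusion_counts` and the toy labels are hypothetical and not taken from the study.

```python
def confusion_counts(reference: list[bool], parsed: list[bool]) -> dict[str, int]:
    """Count TP/FP/TN/FN for one binary label, e.g. has_any_driving_license.

    reference: label per candidate from the reference method (assumption).
    parsed:    label per candidate as extracted by the parser.
    """
    if len(reference) != len(parsed):
        raise ValueError("both label lists must cover the same candidates")

    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for ref, par in zip(reference, parsed):
        if ref and par:
            counts["TP"] += 1  # both agree the candidate has a license
        elif par:
            counts["FP"] += 1  # parser reports a license the reference does not
        elif ref:
            counts["FN"] += 1  # parser misses a license the reference found
        else:
            counts["TN"] += 1  # both agree there is no license
    return counts


# Toy labels for five hypothetical candidates.
reference_labels = [True, True, False, False, True]
parser_labels = [True, False, False, True, False]
print(confusion_counts(reference_labels, parser_labels))
```

For the per-skill categories, the same counting would be applied to each (candidate, skill) pair instead of each candidate.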

Results

| Skill | TP | FP | TN | FN | Precision | Recall |
|---|---|---|---|---|---|---|
| Driving License | 1689 | 643 | 2927 | 2445 | 0.83 | 0.41 |
| Language skills | 10389 | 2392 | 132 | 5438 | 0.81 | 0.66 |
| Job titles | 7838 | 14976 | 326 | 6394 | 0.34 | 0.55 |
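
Precision and recall follow directly from the counts in each row. As a minimal sketch (the helper name `precision_recall` is hypothetical), the Language skills row of the table above can be checked like this:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) computed from raw confusion counts."""
    # Guard against empty denominators so an empty category yields NaN
    # instead of raising ZeroDivisionError.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Counts copied from the Language skills row.
precision, recall = precision_recall(tp=10389, fp=2392, fn=5438)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.81 recall=0.66
```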

Demo

Here is the demo of our bias-detection dashboard, where you can filter by driving license, language skills, or job titles and immediately see extraction disagreements:

2. Key Bias Metrics

Given the number of parser errors, we focused our analysis on the previously discussed demographic groups. To do so, we used several metrics that explicitly account for false negatives, since these are identified through exact matches and are therefore highly reliable.

To assess the model’s fairness, we employed the following bias detection metrics:

| Metric | Formula | Interpretation |
|---|---|---|
| Equality of Opportunity (TPR parity) | \(\text{TPR}_g = \frac{TP_g}{TP_g + FN_g}\) | \(\text{TPR}_g\) equal for every \(g\) ensures that every individual who truly qualifies for a positive outcome has the same chance of being correctly identified, regardless of group membership. |
| Calibration (NPV) | \(\text{NPV}_g = \frac{TN_g}{TN_g + FN_g}\) | \(\text{NPV}_g\) parity for every \(g\) ensures that when the model predicts a negative outcome, the probability of being correct is the same for every group. |
| Selection Rate | \(\text{SR}_g = \frac{TP_g + FP_g}{TP_g + FP_g + TN_g + FN_g}\) | Share of individuals in group \(g\) predicted positive (selected). |
| Disparate Impact (DI) | \(DI = \frac{\text{SR}_{\text{target}}}{\text{SR}_{\text{reference}}}\) | Ratio of selection rates; values < 0.80 (four-fifths rule) indicate potential adverse impact against the target group. |
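
The four definitions above translate directly into code. The following is a minimal sketch; the `Counts` class, the function names, and both sets of counts are hypothetical and chosen only for illustration. They are not taken from our data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one group g (hypothetical container)."""
    tp: int
    fp: int
    tn: int
    fn: int


def _ratio(num: float, den: float) -> float:
    # Return NaN instead of raising when a denominator is zero.
    return num / den if den else float("nan")


def tpr(c: Counts) -> float:
    """TPR_g = TP_g / (TP_g + FN_g)."""
    return _ratio(c.tp, c.tp + c.fn)


def npv(c: Counts) -> float:
    """NPV_g = TN_g / (TN_g + FN_g)."""
    return _ratio(c.tn, c.tn + c.fn)


def selection_rate(c: Counts) -> float:
    """SR_g = (TP_g + FP_g) / (TP_g + FP_g + TN_g + FN_g)."""
    return _ratio(c.tp + c.fp, c.tp + c.fp + c.tn + c.fn)


def disparate_impact(target: Counts, reference: Counts) -> float:
    """DI = SR_target / SR_reference."""
    return _ratio(selection_rate(target), selection_rate(reference))


# Made-up counts for two groups, NOT taken from the study.
reference = Counts(tp=40, fp=10, tn=30, fn=20)
target = Counts(tp=30, fp=5, tn=40, fn=25)

for name, c in (("reference", reference), ("target", target)):
    print(f"{name}: TPR={tpr(c):.2f} NPV={npv(c):.2f} SR={selection_rate(c):.2f}")
print(f"DI={disparate_impact(target, reference):.2f}")
```

With these made-up counts the target group is selected at 0.35 against 0.50 for the reference group, so the DI falls below the 0.80 threshold. By construction, the reference group always has DI = 1.00.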

We computed all of these metrics for every group to detect and quantify possible bias in the selection process.

Driving License:

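| Gender | TPR | NPV | DI |
|---|---|---|---|
| Male | 0.42 | 0.53 | 1.00 |
| Female | 0.39 | 0.56 | 0.88 |

| Region | TPR | NPV | DI |
|---|---|---|---|
| North | 0.40 | 0.54 | 1.00 |
| Center | 0.43 | 0.55 | 1.08 |
| South | 0.40 | 0.56 | 0.97 |

| Length | TPR | NPV | DI |
|---|---|---|---|
| Long | 0.38 | 0.46 | 1.00 |
| Medium | 0.44 | 0.60 | 0.99 |
| Short | 0.47 | 0.79 | 0.75 |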

Language skills:

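| Gender | TPR | NPV | DI |
|---|---|---|---|
| Male | 0.67 | 0.03 | 1.00 |
| Female | 0.64 | 0.02 | 0.95 |

| Region | TPR | NPV | DI |
|---|---|---|---|
| North | 0.66 | 0.03 | 1.00 |
| Center | 0.66 | 0.01 | 1.01 |
| South | 0.65 | 0.02 | 0.99 |

| Length | TPR | NPV | DI |
|---|---|---|---|
| Long | 0.63 | 0.01 | 1.00 |
| Medium | 0.69 | 0.03 | 1.11 |
| Short | 0.69 | 0.11 | 1.19 |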

Job titles:

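| Gender | TPR | NPV | DI |
|---|---|---|---|
| Male | 0.53 | 0.06 | 1.00 |
| Female | 0.56 | 0.03 | 0.97 |

| Region | TPR | NPV | DI |
|---|---|---|---|
| North | 0.55 | 0.05 | 1.00 |
| Center | 0.56 | 0.04 | 0.99 |
| South | 0.53 | 0.06 | 0.97 |

| Length | TPR | NPV | DI |
|---|---|---|---|
| Long | 0.56 | 0.01 | 1.00 |
| Medium | 0.53 | 0.07 | 0.99 |
| Short | 0.51 | 0.28 | 0.97 |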

3. Summary of Findings

Overall, the parser exhibits high error rates: recall is low in every category (0.41 for Driving License, 0.66 for Language skills, and 0.55 for Job titles), and the large number of false negatives indicates that the system misses many true skills.

When we examine the metrics across demographic groups, no strong bias emerges; nonetheless, a few observations merit discussion:

  • Minimal gender disparity in Driving License: the DI for females is 0.88, slightly below that of males but still above the critical 0.80 threshold defined by the four-fifths rule.

  • Length-based imbalances: “Short” CVs are disadvantaged in driving license extraction (DI = 0.75, below the 0.80 threshold) and simultaneously advantaged in language skill extraction (DI = 1.19). These opposite effects suggest that the parser’s performance varies substantially with document length; they deserve deeper analysis to uncover the root causes and guide mitigation strategies.
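
The four-fifths check behind these observations can be reproduced from the reported DI values. A minimal sketch, where the dictionary layout and the `below_threshold` helper are hypothetical names introduced for illustration:

```python
FOUR_FIFTHS_THRESHOLD = 0.80

# DI values copied from the tables in Section 2 (reference groups have DI = 1.00).
reported_di = {
    "Driving License": {
        "Male": 1.00, "Female": 0.88,
        "North": 1.00, "Center": 1.08, "South": 0.97,
        "Long": 1.00, "Medium": 0.99, "Short": 0.75,
    },
    "Language skills": {
        "Male": 1.00, "Female": 0.95,
        "North": 1.00, "Center": 1.01, "South": 0.99,
        "Long": 1.00, "Medium": 1.11, "Short": 1.19,
    },
    "Job titles": {
        "Male": 1.00, "Female": 0.97,
        "North": 1.00, "Center": 0.99, "South": 0.97,
        "Long": 1.00, "Medium": 0.99, "Short": 0.97,
    },
}


def below_threshold(di_by_category, threshold=FOUR_FIFTHS_THRESHOLD):
    """Return (category, group, DI) for every group whose DI is under the threshold."""
    return [
        (category, group, di)
        for category, groups in di_by_category.items()
        for group, di in groups.items()
        if di < threshold
    ]


for category, group, di in below_threshold(reported_di):
    print(f"{category} / {group}: DI={di:.2f} is below {FOUR_FIFTHS_THRESHOLD:.2f}")
# prints: Driving License / Short: DI=0.75 is below 0.80
```

Only the “Short” group in Driving License is flagged; the female DI of 0.88 stays above the threshold.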