Results
In this section we walk through our key findings:
1. Error detection
As previously explained on the Pipeline page, our study first highlights the errors the parser makes in three categories: Driving License, Language, and Job Title.
For Driving License we treat the task as a single binary has_any_driving_license label per candidate, whereas for the other two categories we work at a finer granularity, checking the agreement between our method and the parser on every individual skill.
| Skill | TP | FP | TN | FN | Precision | Recall |
|---|---|---|---|---|---|---|
| Driving License | 1689 | 643 | 2927 | 2445 | 0.83 | 0.41 |
| Language skills | 10389 | 2392 | 132 | 5438 | 0.81 | 0.66 |
| Job titles | 7838 | 14976 | 326 | 6394 | 0.34 | 0.55 |
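Precision and recall follow directly from the confusion counts; a minimal sketch of the computation, with the counts copied from the table above:

```python
# Derive precision and recall from the per-category confusion counts.
counts = {
    "Driving License": {"TP": 1689, "FP": 643, "TN": 2927, "FN": 2445},
    "Language skills": {"TP": 10389, "FP": 2392, "TN": 132, "FN": 5438},
    "Job titles": {"TP": 7838, "FP": 14976, "TN": 326, "FN": 6394},
}

results = {}
for skill, c in counts.items():
    precision = c["TP"] / (c["TP"] + c["FP"])  # of predicted positives, how many are correct
    recall = c["TP"] / (c["TP"] + c["FN"])     # of true positives, how many are found
    results[skill] = (round(precision, 2), round(recall, 2))
    print(f"{skill}: precision={precision:.2f}, recall={recall:.2f}")
```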
Demo
Here is the demo of our bias-detection dashboard, where you can filter by driving license, language skills, or job titles and immediately see where the extractions disagree:
2. Key Bias Metrics
Because the parser produces numerous errors, we focused our analysis on the previously discussed demographic groups. To do so, we relied on metrics that specifically account for false negatives, since false negatives come from exact matches and are therefore highly reliable.
To assess the model’s fairness, we employed the following bias detection metrics:
| Metric | Formula | Interpretation |
|---|---|---|
| Equality of Opportunity (TPR parity) | \[\text{TPR}_g = \frac{TP_g}{TP_g + FN_g} \] | \(\text{TPR}_g\) equal for every \(g\) ensures that every individual who truly qualifies for a positive outcome has the same chance of being correctly identified, regardless of group membership. |
| Calibration (NPV) | \[\text{NPV}_g = \frac{TN_g}{TN_g + FN_g}\] | \(\text{NPV}_g\) parity for every \(g\) ensures that when the model predicts a negative outcome, the probability of being correct is the same for every group. |
| Selection Rate | \[\text{SR}_g = \frac{TP_g + FP_g}{TP_g + FP_g + TN_g + FN_g}\] | Share of individuals in group \(g\) predicted positive (selected). |
| Disparate Impact (DI) | \[DI = \frac{\text{SR}_{\text{target}}}{\text{SR}_{\text{reference}}}\] | Ratio of selection rates; values < 0.80 (four-fifths rule) indicate potential adverse impact against the target group. |
All these metrics were computed for all the groups to detect and quantify possible bias in the selection process.
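The four metrics above can all be computed from per-group confusion counts; a minimal sketch (the counts below are illustrative placeholders, not the study's actual numbers):

```python
# Compute TPR, NPV, and Selection Rate for each group, then
# Disparate Impact as the ratio of selection rates.
def group_metrics(tp, fp, tn, fn):
    tpr = tp / (tp + fn)                  # Equality of Opportunity (TPR)
    npv = tn / (tn + fn)                  # Calibration (NPV)
    sr = (tp + fp) / (tp + fp + tn + fn)  # Selection Rate
    return {"TPR": tpr, "NPV": npv, "SR": sr}

# Placeholder (TP, FP, TN, FN) counts for a reference and a target group.
groups = {
    "reference": (420, 120, 800, 560),
    "target": (180, 60, 430, 300),
}

metrics = {g: group_metrics(*c) for g, c in groups.items()}
di = metrics["target"]["SR"] / metrics["reference"]["SR"]  # Disparate Impact
print(f"DI = {di:.2f}")
```

In our tables, DI is always reported relative to the first group listed (e.g. Male, North, Long), which is why that group's DI is 1.00 by construction.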
Driving License:
| Gender | TPR | NPV | DI |
|---|---|---|---|
| Male | 0.42 | 0.53 | 1.00 |
| Female | 0.39 | 0.56 | 0.88 |
| Region | TPR | NPV | DI |
|---|---|---|---|
| North | 0.40 | 0.54 | 1.00 |
| Center | 0.43 | 0.55 | 1.08 |
| South | 0.40 | 0.56 | 0.97 |
| Length | TPR | NPV | DI |
|---|---|---|---|
| Long | 0.38 | 0.46 | 1.00 |
| Medium | 0.44 | 0.60 | 0.99 |
| Short | 0.47 | 0.79 | 0.75 |
Language skills:
| Gender | TPR | NPV | DI |
|---|---|---|---|
| Male | 0.67 | 0.03 | 1.00 |
| Female | 0.64 | 0.02 | 0.95 |
| Region | TPR | NPV | DI |
|---|---|---|---|
| North | 0.66 | 0.03 | 1.00 |
| Center | 0.66 | 0.01 | 1.01 |
| South | 0.65 | 0.02 | 0.99 |
| Length | TPR | NPV | DI |
|---|---|---|---|
| Long | 0.63 | 0.01 | 1.00 |
| Medium | 0.69 | 0.03 | 1.11 |
| Short | 0.69 | 0.11 | 1.19 |
Job titles:
| Gender | TPR | NPV | DI |
|---|---|---|---|
| Male | 0.53 | 0.06 | 1.00 |
| Female | 0.56 | 0.03 | 0.97 |
| Region | TPR | NPV | DI |
|---|---|---|---|
| North | 0.55 | 0.05 | 1.00 |
| Center | 0.56 | 0.04 | 0.99 |
| South | 0.53 | 0.06 | 0.97 |
| Length | TPR | NPV | DI |
|---|---|---|---|
| Long | 0.56 | 0.01 | 1.00 |
| Medium | 0.53 | 0.07 | 0.99 |
| Short | 0.51 | 0.28 | 0.97 |
3. Summary of Findings
Overall, the parser exhibits very high error rates, with low recall across all categories and a large number of false negatives, indicating that many true skills are missed by the system.
When we examine the metrics across demographic groups, no strong bias emerges; nonetheless, a few observations merit discussion:
- Minimal gender disparity in Driving License: the DI for females is 0.88, slightly below that of males but still above the critical 0.80 threshold defined by the four-fifths rule.
- Length-based imbalances: we observe that “Short” CVs are disadvantaged in driving license extraction (DI = 0.75) and simultaneously advantaged in language skill extraction (DI = 1.19). These opposite effects suggest the parser’s performance varies significantly with document length and deserve a deeper analysis to uncover the root causes of these imbalances and guide mitigation strategies.
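Screening the reported DI values against the four-fifths rule can be automated; a small sketch using the driving-license DIs from the tables above (reference groups with DI = 1.00 omitted):

```python
# Flag groups whose Disparate Impact falls below the four-fifths
# (0.80) threshold, using the driving-license DI values reported above.
di_by_group = {
    "Female": 0.88,
    "Center": 1.08,
    "South": 0.97,
    "Medium": 0.99,
    "Short": 0.75,
}

FOUR_FIFTHS = 0.80
flagged = [g for g, di in di_by_group.items() if di < FOUR_FIFTHS]
print(flagged)  # only "Short" CVs fall below the threshold
```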