%load_ext autoreload
%autoreload 2
import json
import os
import polars as pl
from huggingface_hub import login
from hiring_cv_bias.bias_detection.fuzzy.matcher import SemanticMatcher
from hiring_cv_bias.bias_detection.fuzzy.parser import JobParser
from hiring_cv_bias.bias_detection.rule_based.data import (
add_demographic_info,
)
from hiring_cv_bias.bias_detection.rule_based.evaluation.compare_parser import (
compute_candidate_coverage,
)
from hiring_cv_bias.bias_detection.rule_based.extractors import (
extract_driver_license,
extract_languages,
norm_driver_license,
norm_languages,
)
from hiring_cv_bias.bias_detection.rule_based.patterns import (
driver_license_pattern_eng,
jobs_pattern,
languages_pattern_eng,
normalized_jobs,
)
from hiring_cv_bias.bias_detection.rule_based.utils import (
print_highlighted_cv,
print_report,
)
from hiring_cv_bias.config import (
CANDIDATE_CVS_TRANSLATED_CLEANED_PATH,
CLEANED_REVERSE_MATCHING_PATH,
CLEANED_SKILLS,
DRIVING_LICENSE_FALSE_NEGATIVES_PATH,
JOB_TITLE_FALSE_NEGATIVES_PATH,
LANGUAGE_SKILL_FALSE_NEGATIVES_PATH,
)
from hiring_cv_bias.utils import load_data
pl.Config.set_tbl_cols(-1)
pl.Config.set_tbl_width_chars(200);
Bias Detection
Main Objective:
Detect errors and biases introduced by the CV parser when extracting skills from raw CV text. We perform this inspection using both rule-based (regular expressions) and semantic techniques.
- Error detection steps:
  - Identify errors in candidates' Driving Licenses and Language Skills using regex.
  - Uncover errors in candidates' Job Experience using exact matching and a semantic approach.
- Bias detection:
  - Analyze the errors identified in Step 1 for the groups previously examined (see distribution_analysis.ipynb) to determine whether the parser has disadvantaged or advantaged any of them.
os.environ["TOKENIZERS_PARALLELISM"] = "True"
with open("token.json", "r") as token:
login(token=json.load(token)["token"])
!python -m spacy download en_core_web_sm
Collecting en-core-web-sm==3.8.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Load the data
df_cv_raw = load_data(CANDIDATE_CVS_TRANSLATED_CLEANED_PATH)
df_skills = load_data(CLEANED_SKILLS)
df_info_candidates = load_data(CLEANED_REVERSE_MATCHING_PATH)
df_info_candidates = df_info_candidates.with_columns(
pl.when(pl.col("LATITUDE") > 44.5)
.then(pl.lit("NORTH"))
.when(pl.col("LATITUDE") < 42)
.then(pl.lit("SOUTH"))
.otherwise(pl.lit("CENTER"))
.alias("Location")
)
df_cv_raw = df_cv_raw.with_columns(
pl.when(pl.col("len_anon") < 1000)
.then(pl.lit("SHORT"))
.when(pl.col("len_anon") < 2500)
.then(pl.lit("MEDIUM"))
.otherwise(pl.lit("LONG"))
.alias("length")
)
Bias detection for Driver Licenses
Pre-processing step -> driving licence flag
We call add_demographic_info() to add a Boolean column, has_driving_license, to the CV dataframe. This flag will help us compare what the regex detects in the raw CV text with what the parser extracted as the driving licence for each candidate, allowing us to identify potential omissions in the parsing step.
How the flag is generated
- A single case-insensitive regex (driver_license_pattern_eng) looks for common phrases such as "driving license B", "C1 driving licence" or even "own car".
- The helper function extract_driver_license(text) returns True if the regex matches anywhere in the CV text.

Resulting columns in df_cv:
- Same as before
- Gender, Location -> from the candidates sheet
- has_driving_license -> True if any licence is mentioned, otherwise False
Note
For now we only care whether a candidate has any licence, given that the driver license type column contains a handful of null values (see data_cleaning.ipynb). The same regex already captures specific categories (A, B, C…), so the analysis could be extended later if we want to explore potential biases tied to particular licence types.
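To make the mechanism concrete, here is a minimal sketch of how such a flag can be derived. The pattern below is illustrative only; the real driver_license_pattern_eng and extract_driver_license live in the rule_based patterns and extractors modules and cover more variants.

```python
import re

# Illustrative pattern only; the project's driver_license_pattern_eng is more complete.
example_pattern = re.compile(
    r"\b(?:driving licen[cs]e|driver'?s licen[cs]e|own car)\b",
    flags=re.IGNORECASE,
)

def extract_driver_license_sketch(text: str) -> bool:
    """Return True if any driving-licence mention appears anywhere in the CV text."""
    return example_pattern.search(text) is not None

print(extract_driver_license_sketch("In possession of a B driving licence and own car"))  # True
print(extract_driver_license_sketch("Fluent in English and French"))                      # False
```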
df_cv = add_demographic_info(df_cv_raw, df_info_candidates)
df_cv.head()
| CANDIDATE_ID | CV_text_anon | Translated_CV | len_anon | length | Gender | Location | has_driving_license |
|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | str | str | str | bool |
| 7990324 | "CV anonimizzato: """ PROFILO D… | " profile graduated from elsa m… | 1445 | "MEDIUM" | "Female" | "NORTH" | true |
| 7974050 | "CV anonimizzato: """ Curricul… | " curriculum vitae personal inf… | 2148 | "MEDIUM" | "Female" | "CENTER" | true |
| 7965670 | "CV anonimizzato: """ ESPERIENZ… | " work experience 03/27/2023 – … | 4911 | "LONG" | "Female" | "NORTH" | true |
| 7960501 | "CV anonimizzato: """ Esperienz… | " work experience waiter and ba… | 680 | "SHORT" | "Male" | "NORTH" | false |
| 7960052 | "CV anonimizzato: """ Dat… | " date of birth: 03/26/1996 nat… | 5913 | "LONG" | "Female" | "CENTER" | false |
Comparing parser output and regex extraction
The compute_candidate_coverage() function evaluates how well the parsing system detects a specific category of skills by comparing its output with our own extraction.
In this case, the chosen category is "DRIVERSLIC" and we use a custom regex-based extractor applied directly to the raw CV text.
This step is crucial for measuring the parser's coverage by quantifying false negatives (skills that are mentioned in the CV but missed by the parser).
Output breakdown:
- Regex positive candidates: number of unique candidates flagged by our rule-based extractor.
- Parser positive candidates: number of unique candidates flagged by the parser.
- Both regex & parser: candidates detected by both methods.
- Only regex: candidates our regex caught but the parser missed.
- Only parser: candidates the parser flagged but our regex did not.
Then print_report() displays the overall confusion matrix and derived metrics (accuracy, precision, recall, F1).
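For a presence/absence skill like the driving licence, these candidate-level counts reduce to simple set operations. The toy sketch below (illustrative IDs, not the project's data or the actual compute_candidate_coverage implementation) shows how the confusion-matrix cells relate to the breakdown above, with the regex extraction treated as ground truth.

```python
# Toy illustration of the candidate-level comparison for a presence/absence skill.
all_candidates = {1, 2, 3, 4, 5, 6}
regex_positive = {1, 2, 3, 4}   # flagged by the rule-based extractor (treated as ground truth)
parser_positive = {3, 4, 5}     # flagged by the CV parser

tp = len(regex_positive & parser_positive)                    # "Both regex & parser"
fn = len(regex_positive - parser_positive)                    # "Only regex" (parser missed it)
fp = len(parser_positive - regex_positive)                    # "Only parser"
tn = len(all_candidates - regex_positive - parser_positive)   # flagged by neither
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")                     # TP=2 FP=1 TN=1 FN=2
```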
res_dl = compute_candidate_coverage(
df_cv=df_cv,
df_parser=df_skills,
skill_type="DRIVERSLIC",
extractor=extract_driver_license,
norm=norm_driver_license,
)
print("Confusion matrix:", res_dl.conf)Regex positive candidates : 4134
Parser positive candidates: 2032
- Both regex & parser : 1689
- Only regex : 2445
- Only parser : 343
Confusion matrix:
Counts -> TP:1689 FP:343 TN:2927 FN:2445
Metrics -> Precision:0.83 Recall:0.409 F1:0.548 Acc:0.623
The parser achieves high precision (~83 %) but low recall (~41 %).
In other words, when it flags a skill it is usually correct, yet it misses more than half of the skills that the regex finds.
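For reference, these figures follow directly from the counts above:

\[
\text{Precision} = \frac{TP}{TP + FP} = \frac{1689}{1689 + 343} \approx 0.831,
\qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{1689}{1689 + 2445} \approx 0.409
\]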
Let’s dive into false negatives (FN)
* We’ll highlight in red the exact terms captured by the regex directly inside the CV text, making it easy to verify their presence at a glance.
df_fn = pl.DataFrame(res_dl.fn_rows)
sample = df_fn.sample(n=2, shuffle=True)
for row in sample.to_dicts():
print_highlighted_cv(row, pattern=driver_license_pattern_eng)
CANDIDATE ID: 7090785 - GENERE: Male
Reason: Rule-based extractor found skill but parser missed it.
--------------------------------------------------------------------------------
lift driver in possession of a forklift driving license: use of electric and diesel forklifts s
--------------------------------------------------------------------------------
CANDIDATE ID: 5108733 - GENERE: Male
Reason: Rule-based extractor found skill but parser missed it.
--------------------------------------------------------------------------------
stems. technical skills and competences b driving license or driving licenses I authorize the pro ls and competences b driving license or driving licenses I authorize the processing of personal
--------------------------------------------------------------------------------
print(f"False negatives matching snippet pattern: {df_fn.height}")
df_fn.write_csv(DRIVING_LICENSE_FALSE_NEGATIVES_PATH, separator=";")
print("Saved filtered false negatives!")False negatives matching snippet pattern: 2445
Saved filtered false negatives!
Bias Detection Metrics
To assess the model’s fairness, we employed the following bias detection metrics:
| Metric | Formula | Interpretation |
|---|---|---|
| Equality of Opportunity (TPR parity) | \(\text{TPR}_g = \frac{TP_g}{TP_g + FN_g}\) | \(\text{TPR}_g\) equal for every \(g\) ensures that every individual who truly qualifies for a positive outcome has the same chance of being correctly identified, regardless of group membership. |
| Calibration (NPV) | \(\text{NPV}_g = \frac{TN_g}{TN_g + FN_g}\) | \(\text{NPV}_g\) parity for every \(g\) ensures that when the model predicts a negative outcome, the probability of being correct is the same for every group. |
| Selection Rate | \(\text{SR}_g = \frac{TP_g + FP_g}{TP_g + FP_g + TN_g + FN_g}\) | Share of individuals in group \(g\) predicted positive (selected). |
| Disparate Impact (DI) | \(DI = \frac{\text{SR}_{\text{target}}}{\text{SR}_{\text{reference}}}\) | Ratio of selection rates; values < 0.80 (four-fifths rule) indicate potential adverse impact against the target group. |
All these metrics were computed for the Gender, Location and CV length groups to detect and quantify possible bias in the extraction process.
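The per-group computation behind print_report() can be sketched as follows; the helper below is hypothetical (not the project's API) and simply applies the formulas above to the Gender counts reported further down for the driving-licence case.

```python
# Hedged sketch of the per-group fairness metrics defined above; group_metrics is an
# illustrative helper, not the project's print_report implementation.
def group_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    total = tp + fp + tn + fn
    return {
        "equality_of_opportunity": tp / (tp + fn),   # TPR_g
        "calibration_npv": tn / (tn + fn),           # NPV_g
        "selection_rate": (tp + fp) / total,         # SR_g
    }

# Driving-licence counts by Gender, taken from the report below.
male = group_metrics(tp=966, fp=189, tn=1513, fn=1316)
female = group_metrics(tp=723, fp=154, tn=1414, fn=1129)

# Disparate Impact of the Female group against the Male reference group.
di_female = female["selection_rate"] / male["selection_rate"]
print(round(male["equality_of_opportunity"], 3), round(di_female, 3))  # 0.423 0.885
```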
print_report(
result=res_dl,
df_population=df_cv,
reference_col="Male",
group_col="Gender",
metrics=[
"equality_of_opportunity",
"calibration_npv",
],
)
TP: 1689, FP: 343, TN: 2927, FN: 2445
Accuracy: 0.623, Precision: 0.831, Recall: 0.409, F1: 0.548
Error and rates by Gender:
| Gender | total | tp | fp | fn | tn | total_skills | fp_rate | fn_rate | equality_of_opportunity | calibration_npv | disparate_impact |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | u32 | u32 | u32 | u32 | f64 | f64 | f64 | f64 | f64 |
| "Male" | 3984 | 966 | 189 | 1316 | 1513 | 3984 | 0.04744 | 0.330321 | 0.423313 | 0.534818 | 1.0 |
| "Female" | 3420 | 723 | 154 | 1129 | 1414 | 3420 | 0.045029 | 0.330117 | 0.390389 | 0.556036 | 0.884526 |
print_report(
result=res_dl,
df_population=df_cv,
reference_col="NORTH",
group_col="Location",
metrics=[
"equality_of_opportunity",
"calibration_npv",
],
)
TP: 1689, FP: 343, TN: 2927, FN: 2445
Accuracy: 0.623, Precision: 0.831, Recall: 0.409, F1: 0.548
Error and rates by Location:
| Location | total | tp | fp | fn | tn | total_skills | fp_rate | fn_rate | equality_of_opportunity | calibration_npv | disparate_impact |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | u32 | u32 | u32 | u32 | f64 | f64 | f64 | f64 | f64 |
| "SOUTH" | 1060 | 229 | 51 | 341 | 439 | 1060 | 0.048113 | 0.321698 | 0.401754 | 0.562821 | 0.968674 |
| "NORTH" | 5343 | 1217 | 240 | 1789 | 2097 | 5343 | 0.044919 | 0.334831 | 0.404857 | 0.539629 | 1.0 |
| "CENTER" | 1001 | 243 | 52 | 315 | 391 | 1001 | 0.051948 | 0.314685 | 0.435484 | 0.553824 | 1.080721 |
print_report(
result=res_dl,
df_population=df_cv,
reference_col="LONG",
group_col="length",
metrics=[
"equality_of_opportunity",
"calibration_npv",
],
)
TP: 1689, FP: 343, TN: 2927, FN: 2445
Accuracy: 0.623, Precision: 0.831, Recall: 0.409, F1: 0.548
Error and rates by length:
| length | total | tp | fp | fn | tn | total_skills | fp_rate | fn_rate | equality_of_opportunity | calibration_npv | disparate_impact |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | u32 | u32 | u32 | u32 | f64 | f64 | f64 | f64 | f64 |
| "LONG" | 3789 | 901 | 161 | 1470 | 1257 | 3789 | 0.042491 | 0.387965 | 0.380008 | 0.460946 | 1.0 |
| "SHORT" | 610 | 90 | 39 | 103 | 378 | 610 | 0.063934 | 0.168852 | 0.466321 | 0.785863 | 0.754501 |
| "MEDIUM" | 3005 | 698 | 143 | 872 | 1292 | 3005 | 0.047587 | 0.290183 | 0.444586 | 0.597043 | 0.998508 |
Bias detection for Language Skill
Whereas we applied a simple presence check for driving licences, we handle languages with a more granular, ad hoc normalization and extraction pipeline that recognizes each specific language individually rather than just flagging "has any language".
The main steps are:
- Build a reverse lookup map (_reverse_language_map) by iterating over each language code in LANGUAGE_VARIANTS (populated from pycountry with English name variants and their alpha_2 codes) and all its known name variants, storing entries like "english" -> "en", "italian" -> "it", etc.
- Apply norm_languages to every mention extracted by our regex-based extractor, so that each occurrence like "English B2" is mapped to a clean ISO code "en".
- Once all language mentions have been normalized to ISO codes via norm_languages, we invoke the coverage routine to quantify how well the parser matches our "ground truth" extractions.
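A minimal sketch of this reverse-lookup idea, assuming plain pycountry English names (the real LANGUAGE_VARIANTS also includes additional name variants), might look like this:

```python
import pycountry

# Map lowercase English language names to ISO 639-1 codes, e.g. "english" -> "en".
reverse_language_map = {
    lang.name.lower(): lang.alpha_2
    for lang in pycountry.languages
    if hasattr(lang, "alpha_2")
}

def norm_language_sketch(mention: str) -> str | None:
    """Normalize a raw mention like 'English B2' to an ISO code (illustrative only)."""
    for name, code in reverse_language_map.items():
        if name in mention.lower():
            return code
    return None

print(norm_language_sketch("English B2"))             # "en"
print(norm_language_sketch("Italian mother tongue"))  # "it"
```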
res_lg = compute_candidate_coverage(
df_cv=df_cv,
df_parser=df_skills,
skill_type="Language_Skill",
extractor=extract_languages,
norm=norm_languages,
)
print("Confusion matrix:", res_lg.conf)Regex positive candidates : 6387
Parser positive candidates: 6500
- Both regex & parser : 5615
- Only regex : 772
- Only parser : 885
Confusion matrix:
Counts -> TP:10389 FP:2392 TN:132 FN:5438
Metrics -> Precision:0.81 Recall:0.656 F1:0.726 Acc:0.573
df_fn = pl.DataFrame(res_lg.fn_rows)
sample = df_fn.sample(n=2, shuffle=True)
for row in sample.to_dicts():
print_highlighted_cv(row, pattern=languages_pattern_eng)
CANDIDATE ID: 5184724 - GENERE: Female
Reason: Rule-based extractor found skill but parser missed it.
--------------------------------------------------------------------------------
tute adele pellitteri contacts teaching Italian to foreign students and observation of on of actual improvements. nationality: italian palermo, italy gender: female 09/19/201 y doctorate in linguistic mediation and italian as l2 university of palermo 09/20/2004 china certificate of attendance of the chinese language course dalian university of fo age course dalian university of foreign languages chinese language courselanguage skills native l courselanguage skills native language: italian english listening reading speaking oral anguage skills native language: italian english listening reading speaking oral interac oral interaction writing c1 c1 c1 c1 c1 chinese listening reading speaking oral interac oral interaction writing b1 b1 b1 b1 b1 french listening reading speaking oral interac
--------------------------------------------------------------------------------
CANDIDATE ID: 1087369 - GENERE: Female
Reason: Rule-based extractor found skill but parser missed it.
--------------------------------------------------------------------------------
nuovo di napoli (italy) personal skills italian mother tongue production foreign langua eraction oral production a2 a2 a2 a2 a2 english 24/1/19 © european union, 2002-2019 | h ge 2 / 3curriculum vitae a1 a1 a1 a1 a1 french levels: a1 and a2: basic user - b1 and
--------------------------------------------------------------------------------
print(f"False negatives matching snippet pattern: {df_fn.height}")
df_fn.write_csv(LANGUAGE_SKILL_FALSE_NEGATIVES_PATH, separator=";")
print("Saved filtered false negatives to false_negative.csv")False negatives matching snippet pattern: 5438
Saved filtered false negatives to false_negative.csv
print_report(
result=res_lg,
df_population=df_cv,
reference_col="Male",
group_col="Gender",
metrics=[
"equality_of_opportunity",
"calibration_npv",
],
)
TP: 10389, FP: 2392, TN: 132, FN: 5438
Accuracy: 0.573, Precision: 0.813, Recall: 0.656, F1: 0.726
Error and rates by Gender:
| Gender | total | tp | fp | fn | tn | total_skills | fp_rate | fn_rate | equality_of_opportunity | calibration_npv | disparate_impact |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | u32 | u32 | u32 | u32 | f64 | f64 | f64 | f64 | f64 |
| "Female" | 3420 | 4969 | 1141 | 2830 | 55 | 8995 | 0.126848 | 0.314619 | 0.637133 | 0.019064 | 0.952663 |
| "Male" | 3984 | 5420 | 1251 | 2608 | 77 | 9356 | 0.133711 | 0.278752 | 0.675137 | 0.028678 | 1.0 |
print_report(
result=res_lg,
df_population=df_cv,
reference_col="NORTH",
group_col="Location",
metrics=[
"equality_of_opportunity",
"calibration_npv",
],
)
TP: 10389, FP: 2392, TN: 132, FN: 5438
Accuracy: 0.573, Precision: 0.813, Recall: 0.656, F1: 0.726
Error and rates by Location:
| Location | total | tp | fp | fn | tn | total_skills | fp_rate | fn_rate | equality_of_opportunity | calibration_npv | disparate_impact |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | u32 | u32 | u32 | u32 | f64 | f64 | f64 | f64 | f64 |
| "SOUTH" | 1060 | 1474 | 348 | 791 | 14 | 2627 | 0.13247 | 0.301104 | 0.650773 | 0.017391 | 0.997271 |
| "CENTER" | 1001 | 1431 | 343 | 733 | 10 | 2517 | 0.136273 | 0.29122 | 0.661275 | 0.013459 | 1.013434 |
| "NORTH" | 5343 | 7484 | 1701 | 3914 | 108 | 13207 | 0.128795 | 0.296358 | 0.656606 | 0.026852 | 1.0 |
print_report(
result=res_lg,
df_population=df_cv,
reference_col="LONG",
group_col="length",
metrics=[
"equality_of_opportunity",
"calibration_npv",
],
)
TP: 10389, FP: 2392, TN: 132, FN: 5438
Accuracy: 0.573, Precision: 0.813, Recall: 0.656, F1: 0.726
Error and rates by length:
| length | total | tp | fp | fn | tn | total_skills | fp_rate | fn_rate | equality_of_opportunity | calibration_npv | disparate_impact |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | u32 | u32 | u32 | u32 | f64 | f64 | f64 | f64 | f64 |
| "LONG" | 3789 | 5715 | 925 | 3371 | 43 | 10054 | 0.092003 | 0.335289 | 0.62899 | 0.012595 | 1.0 |
| "SHORT" | 610 | 489 | 421 | 219 | 27 | 1156 | 0.364187 | 0.189446 | 0.690678 | 0.109756 | 1.19194 |
| "MEDIUM" | 3005 | 4185 | 1046 | 1848 | 62 | 7141 | 0.146478 | 0.258787 | 0.693685 | 0.032461 | 1.109166 |
Bias detection for Job Title
Loading the list of jobs from ESCO and filtering out those that are too specific (length > 3).
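One possible reading of that filtering step, assuming "length" refers to the number of words in a title (the actual curated list is provided by normalized_jobs in the patterns module), is sketched below:

```python
# Illustrative only: keep ESCO occupation titles of at most three words.
esco_titles = [
    "waiter",
    "secondary school teacher",
    "air traffic safety electronics technician",
]
short_titles = [title for title in esco_titles if len(title.split()) <= 3]
print(short_titles)  # the five-word title is considered too specific and dropped
```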
df_skills_cleaned = df_skills.with_columns(
pl.col("Skill")
.str.to_lowercase()
.str.replace_all("(m/f)", "", literal=True)
.str.strip_chars()
.alias("Skill")
)
The bias detection pipeline for job titles consists of two main components:
- JobParser: a class that extracts job experiences listed in ESCO from raw CV texts, using spaCy's PhraseMatcher.
- SemanticMatcher: once job experiences are extracted, the SemanticMatcher (all-MiniLM-L6-v2) is used exclusively to determine which of these experiences match the parser-extracted skills, via semantic embeddings. Pairwise cosine similarity is computed between the embeddings of JobParser skills and parser skills, and a match is established when this similarity exceeds a specified threshold. This matching step is used only to compute metrics.
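The sketch below illustrates both components on toy data; it is not the project's JobParser or SemanticMatcher code, the ESCO subset and parser skills are made up, and the 0.6 threshold is an assumption for demonstration only.

```python
import spacy
from spacy.matcher import PhraseMatcher
from sentence_transformers import SentenceTransformer, util

# 1) PhraseMatcher-style extraction of ESCO job titles from raw CV text.
nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
esco_titles = ["waiter", "receptionist", "forklift driver"]  # toy subset of ESCO
phrase_matcher.add("JOB_TITLE", [nlp.make_doc(title) for title in esco_titles])

doc = nlp("work experience: waiter and barman, then receptionist at a hotel")
extracted = sorted({doc[start:end].text for _, start, end in phrase_matcher(doc)})

# 2) Semantic matching of extracted experiences against parser-extracted skills.
model = SentenceTransformer("all-MiniLM-L6-v2")
parser_skills = ["bartender", "front office assistant"]
similarity = util.cos_sim(model.encode(extracted), model.encode(parser_skills))

threshold = 0.6  # illustrative value, not the project's actual setting
matches = [
    (extracted[i], parser_skills[j])
    for i in range(len(extracted))
    for j in range(len(parser_skills))
    if similarity[i, j] > threshold
]
print(extracted, matches)
```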
parser = JobParser(normalized_jobs)
matcher = SemanticMatcher()
res_job = compute_candidate_coverage(
df_cv,
df_skills_cleaned,
"Job_title",
parser.parse_with_n_grams,
matcher=matcher.semantic_comparison,
)
print("Confusion matrix:", res_job.conf)Regex positive candidates : 5614
Parser positive candidates: 6763
- Both regex & parser : 5299
- Only regex : 315
- Only parser : 1464
Confusion matrix:
Counts -> TP:7838 FP:14976 TN:326 FN:6394
Metrics -> Precision:0.34 Recall:0.551 F1:0.423 Acc:0.276
df_fn = pl.DataFrame(res_job.fn_rows)
sample = df_fn.sample(n=2, shuffle=True)
for row in sample.to_dicts():
print_highlighted_cv(row, pattern=jobs_pattern)
CANDIDATE ID: 6646832 - GENERE: Female
Reason: Rule-based extractor found skill but parser missed it.
--------------------------------------------------------------------------------
, milan fashion week participation as a dancer for the zegna promotional video under t hotel serena majestic (pe) work with a hostess and dancer contract. further informatio a majestic (pe) work with a hostess and dancer contract. further information • equippe
--------------------------------------------------------------------------------
CANDIDATE ID: 6712718 - GENERE: Female
Reason: Rule-based extractor found skill but parser missed it.
--------------------------------------------------------------------------------
nd date of birth] professional profile: secretary, receptionist and front office assistan birth] professional profile: secretary, receptionist and front office assistant. I obtained nt of the service. may - september 2016 babysitter for private individuals in orio al seri of the service. October 2005 - May 2007 babysitter for private individuals in Stezzano dur
--------------------------------------------------------------------------------
print(f"False negatives matching snippet pattern: {df_fn.height}")
df_fn.write_csv(JOB_TITLE_FALSE_NEGATIVES_PATH, separator=";")
print("Saved filtered false negatives to false_negative.csv")False negatives matching snippet pattern: 6394
Saved filtered false negatives to false_negative.csv
print_report(
result=res_job,
df_population=df_cv,
reference_col="Male",
group_col="Gender",
metrics=[
"equality_of_opportunity",
"calibration_npv",
],
)
TP: 7838, FP: 14976, TN: 326, FN: 6394
Accuracy: 0.276, Precision: 0.344, Recall: 0.551, F1: 0.423
Error and rates by Gender:
| Gender | total | tp | fp | fn | tn | total_skills | fp_rate | fn_rate | equality_of_opportunity | calibration_npv | disparate_impact |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | u32 | u32 | u32 | u32 | f64 | f64 | f64 | f64 | f64 |
| "Female" | 3420 | 4265 | 6534 | 3273 | 111 | 14183 | 0.460692 | 0.230769 | 0.5658 | 0.032801 | 0.972811 |
| "Male" | 3984 | 3573 | 8442 | 3121 | 215 | 15351 | 0.549932 | 0.203309 | 0.533762 | 0.064448 | 1.0 |
print_report(
result=res_job,
df_population=df_cv,
reference_col="NORTH",
group_col="Location",
metrics=[
"equality_of_opportunity",
"calibration_npv",
],
)
TP: 7838, FP: 14976, TN: 326, FN: 6394
Accuracy: 0.276, Precision: 0.344, Recall: 0.551, F1: 0.423
Error and rates by Location:
| Location | total | tp | fp | fn | tn | total_skills | fp_rate | fn_rate | equality_of_opportunity | calibration_npv | disparate_impact |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | u32 | u32 | u32 | u32 | f64 | f64 | f64 | f64 | f64 |
| "NORTH" | 5343 | 5549 | 10890 | 4519 | 228 | 21186 | 0.514019 | 0.213301 | 0.551152 | 0.04803 | 1.0 |
| "SOUTH" | 1060 | 1078 | 1956 | 937 | 62 | 4033 | 0.484999 | 0.232333 | 0.534988 | 0.062062 | 0.969529 |
| "CENTER" | 1001 | 1211 | 2130 | 938 | 36 | 4315 | 0.493627 | 0.217381 | 0.563518 | 0.036961 | 0.997859 |
print_report(
result=res_job,
df_population=df_cv,
reference_col="LONG",
group_col="length",
metrics=[
"equality_of_opportunity",
"calibration_npv",
],
)
TP: 7838, FP: 14976, TN: 326, FN: 6394
Accuracy: 0.276, Precision: 0.344, Recall: 0.551, F1: 0.423
Error and rates by length:
| length | total | tp | fp | fn | tn | total_skills | fp_rate | fn_rate | equality_of_opportunity | calibration_npv | disparate_impact |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | u32 | u32 | u32 | u32 | f64 | f64 | f64 | f64 | f64 |
| "MEDIUM" | 3005 | 2360 | 5122 | 2075 | 167 | 9724 | 0.526738 | 0.21339 | 0.532131 | 0.074487 | 0.991761 |
| "SHORT" | 610 | 266 | 783 | 254 | 97 | 1400 | 0.559286 | 0.181429 | 0.511538 | 0.276353 | 0.965788 |
| "LONG" | 3789 | 5212 | 9071 | 4065 | 62 | 18410 | 0.492721 | 0.220804 | 0.56182 | 0.015023 | 1.0 |
%%bash --bg
cd ..
# for Unix users
.venv/bin/python -m streamlit run hiring_cv_bias/bias_detection/rule_based/app/fn_app.py
# for Windows users
#.venv/Scripts/python.exe -m streamlit run hiring_cv_bias/bias_detection/rule_based/app/fn_app.py
Summary of Findings
Overall, the parser exhibits very high error rates, with low recall across all categories and a large number of false negatives, indicating that many true skills are missed by the system.
When we examine the metrics across demographic groups, no strong bias emerges; nonetheless, a few observations merit discussion:
- Minimal gender disparity in Driving License: the DI for females is 0.88, slightly below that of males but still above the critical 0.80 threshold defined by the four-fifths rule.
- Length-based imbalances: "Short" CVs are disadvantaged in driving-license extraction (DI = 0.75) yet advantaged in language-skill extraction (DI = 1.19). These opposite effects suggest that the parser's performance varies significantly with document length and deserve a deeper analysis to uncover the root causes of these imbalances and guide mitigation strategies.