Data Cleaning

Main Objective:
In this notebook we inspect and clean missing or duplicate values in the three main datasets:

Raw CVs (original, unstructured CV text)
Parsed Skills (the cv_skills DataFrame that feeds the matcher)
Matcher Results (output CSV from the matching algorithm)

%load_ext autoreload 
%autoreload 2

import polars as pl

from hiring_cv_bias.cleaning.common import (
    cramers_v,
    filter_out_candidate_ids,
    find_garbage_skill_rows,
    inspect_missing,
    plot_cramer_matrix,
)
from hiring_cv_bias.cleaning.raw_cv import (
    add_length_column,
    assess_translation_completeness,
    detect_corrupted_cvs,
    detect_repetitive_cvs,
    detect_vocab_sparsity,
    filter_placeholder_tails,
    find_and_print_short_cvs,
    is_this_language,
    plot_length_histogram,
)
from hiring_cv_bias.config import (
    CANDIDATE_CVS_TRANSLATED_PATH,
    PARSED_DATA_PATH,
    REVERSE_MATCHING_PATH,
)
from hiring_cv_bias.utils import (
    filter_unknown_and_other_rows,
    load_data,
    load_excel_sheets,
)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

1. Raw CVs Cleaning Steps

1. Schema & Load

Define expected columns and data types
Load CSV into Polars DataFrame, verify row count matches the number of uniques Candidate IDs.

raw_cv = load_data(CANDIDATE_CVS_TRANSLATED_PATH)
print("DataFrame schema:", raw_cv.schema)
print(f"Loaded: {raw_cv.height} CVs")

################

total_rows = raw_cv.height
unique_ids = raw_cv.select(pl.col("CANDIDATE_ID")).unique().height
print(f"Unique CANDIDATE_ID: {unique_ids}")
display(raw_cv.sample(5))

DataFrame schema: Schema({'CANDIDATE_ID': Int64, 'CV_text_anon': String, 'Translated_CV': String})
Loaded: 7769 CVs
Unique CANDIDATE_ID: 7769

shape: (5, 3)

CANDIDATE_ID	CV_text_anon	Translated_CV
i64	str	str
6892877	"CV anonimizzato: """ Traduttri…	" translator en<>it, es&l…
5216553	"CV anonimizzato: """ Formato …	" European format for CV person…
6965925	"CV anonimizzato: """ ㅡ Possied…	" ㅡ I have a certificate of low…
7329099	"CV anonimizzato: """ F O…	" europeanformatforcurriculumvi…
1698394	"CV anonimizzato: """ Operaia I…	" worker contact information da…

2. Missing Value Inspection

Count nulls per column, compute percentage of missing values and report any columns with > 0% missing.

inspect_missing(raw_cv)

Missing value summary:

shape: (3, 3)

column	n_missing	pct_missing
str	u32	f64
"CANDIDATE_ID"	0	0.0
"CV_text_anon"	0	0.0
"Translated_CV"	0	0.0


No missing values detected in any column.

3. Short or Empty CV Detection

Under this step we will:

Empty / Whitespace-Only/ Very Short Text
- Filter out any records where CV_text_anon is below a chosen threshold (e.g. 300 characters), showing their IDs and snippets for manual review.
High Repetition
- Split each CV into non-empty lines, count n_lines and unique_lines.
- Compute repetition_ratio = 1 − (unique_lines / n_lines).
- Flag and drop any CVs with repetition_ratio above a threshold (e.g. > 0.7).
- These checks ensure we remove any CVs that are dominated by boilerplate before proceeding with further analysis.

plot_length_histogram(raw_cv, text_col="CV_text_anon", bin_size=300, max_bin=3000)

Remove Empty / Whitespace-Only/ Very Short Text

raw_cv = add_length_column(raw_cv, text_col="CV_text_anon", length_col="len_anon")


too_short_df = find_and_print_short_cvs(
    raw_cv,
    length_col="len_anon",
    threshold=300,
    id_col="CANDIDATE_ID",
    text_col="CV_text_anon",
)

Found 77 CVs with CV_text_anon < 300 chars. Showing up to 10:
 1. ID 7560889: 263 chars
 2. ID 7456101: 29 chars
 3. ID 7437776: 279 chars
 4. ID 7414638: 110 chars
 5. ID 7373536: 58 chars
 6. ID 7364629: 202 chars
 7. ID 7347338: 82 chars
 8. ID 7330315: 237 chars
 9. ID 7230306: 238 chars
 10. ID 7163148: 253 chars

Example full CV for ID 5475487:
'CV anonimizzato:\n"""\n03/2006 - 01/2010\n02/1995 - 01/2006\n04/1987 - 01/1995\n10/1981 - 01/1987Autorizzo il trattamento dei miei dati personali presenti nel curriculum vitae ai sensi del Decreto Legislativo 30 giugno 2003, n. 196 e del GDPR (Regolamento UE 2016/679).\n"""'

short_cvs_ids = too_short_df.select(pl.col("CANDIDATE_ID")).to_series().to_list()
raw_cv_cleaned = filter_out_candidate_ids(
    raw_cv, short_cvs_ids, df_name="CVs", description="under 300 characters"
)

CVs with under 300 characters: 77 out of 7769 total
That is 0.99% of all cvs
CVs -> Original rows: 7769
CVs -> Dropped rows:  77 (IDs in provided list)
CVs -> Remaining rows: 7692

Remove High Repetition / Low Structure CVs

repetitive_cvs = detect_repetitive_cvs(
    raw_cv_cleaned, text_col="CV_text_anon", max_repetition=0.5
)
print(f"Found {repetitive_cvs.height} repetitive CVs:")
display(repetitive_cvs.select(["CANDIDATE_ID", "n_lines", "repetition_ratio"]))

Found 39 repetitive CVs:

shape: (39, 3)

CANDIDATE_ID	n_lines	repetition_ratio
i64	i64	f64
7259455	244	0.553279
7026944	105	0.971429
6981854	134	0.835821
6876738	195	0.548718
6788013	103	0.514563
…	…	…
164616	401	0.992519
153773	58	0.706897
125099	172	0.523256
122003	205	0.965854
121482	151	0.695364

repetitive_cvs_ids = repetitive_cvs.select(pl.col("CANDIDATE_ID")).to_series().to_list()
raw_cv_cleaned = filter_out_candidate_ids(
    raw_cv_cleaned, repetitive_cvs_ids, df_name="CVs", description="repetitive CVs"
)

CVs with repetitive CVs: 39 out of 7692 total
That is 0.51% of all cvs
CVs -> Original rows: 7692
CVs -> Dropped rows:  39 (IDs in provided list)
CVs -> Remaining rows: 7653

4. Text Quality Checks

Furthermore we filter out “bad” or malformed CVs, for doing that we will perform:

Token & Vocabulary Richness
- Calculate the unique words / total words ratio to measure lexical variety.
- Filter out CVs with very low number of unique words (e.g < 20) or very low ratio (e.g. < 0.2).
Detect Placeholder Tails
- Flag any CV whose text ends with a long run of the same character (e.g. “XXXXXXXXXX…”), seen after manual inspection. These indicate no usable content and should be excluded.
Detect Corrupted CVs
- Compute the fraction of “unusual” characters in each CV. Flag and remove any CV where this fraction exceeds a small threshold (e.g. 3%), catching heavily distorted documents.

These additional checks will help us catch CVs that are too short, overly repetitive, or structurally invalid.

Token & Vocabulary Richness

sparse_cvs = detect_vocab_sparsity(
    raw_cv_cleaned, text_col="CV_text_anon", min_words=30, min_ttr=0.3
)

print(f"CVs to discard based on vocabulary sparsity: {sparse_cvs.height}")
display(
    sparse_cvs.select(["CANDIDATE_ID", "total_words", "unique_words", "ttr"]).sort(
        pl.col("ttr"), descending=False
    )
)

CVs to discard based on vocabulary sparsity: 48

shape: (48, 4)

CANDIDATE_ID	total_words	unique_words	ttr
i64	i64	i64	f64
6206994	1291	242	0.187452
7080072	755	149	0.197351
6015186	931	191	0.205156
6809297	462	101	0.218615
6987708	1621	375	0.231339
…	…	…	…
6501990	17	16	0.941176
4479547	17	16	0.941176
3583383	22	21	0.954545
6056802	27	26	0.962963
6099154	14	14	1.0

sparse_ids = sparse_cvs.select(pl.col("CANDIDATE_ID")).to_series().to_list()
raw_cv_cleaned = filter_out_candidate_ids(
    raw_cv_cleaned, sparse_ids, df_name="CVs", description="low lexical variety"
)

CVs with low lexical variety: 48 out of 7653 total
That is 0.63% of all cvs
CVs -> Original rows: 7653
CVs -> Dropped rows:  48 (IDs in provided list)
CVs -> Remaining rows: 7605

Detect Placeholder Tails

placeholder_cvs = filter_placeholder_tails(
    raw_cv_cleaned, text_col="CV_text_anon", char="X", min_run=20
)
print(f"Found {placeholder_cvs.height} CVs with trailing X placeholders:")
display(placeholder_cvs.select(["CANDIDATE_ID"]))

Found 52 CVs with trailing X placeholders:

shape: (52, 1)

CANDIDATE_ID
i64
7471471
7423990
7386710
7295445
7228164
…
166297
156390
151354
146771
118642

placeholder_ids = placeholder_cvs.select(pl.col("CANDIDATE_ID")).to_series().to_list()
raw_cv_cleaned = filter_out_candidate_ids(
    raw_cv_cleaned,
    placeholder_ids,
    df_name="CVs",
    description="trailing X placeholders",
)

CVs with trailing X placeholders: 52 out of 7605 total
That is 0.68% of all cvs
CVs -> Original rows: 7605
CVs -> Dropped rows:  52 (IDs in provided list)
CVs -> Remaining rows: 7553

Detect Corrupted CVs

corrupted_cvs = detect_corrupted_cvs(
    raw_cv_cleaned, text_col="CV_text_anon", max_unusual_frac=0.02
)
print(f"Corrupted CVs to discard: {corrupted_cvs.height}")
display(
    corrupted_cvs.select(["CANDIDATE_ID", "unusual_frac"]).sort(
        "unusual_frac", descending=True
    )
)

Corrupted CVs to discard: 14

shape: (14, 2)

CANDIDATE_ID	unusual_frac
i64	f64
7265646	0.71831
7442026	0.683837
2963734	0.218973
4624374	0.044884
6745849	0.043478
…	…
7242049	0.021448
7133277	0.021265
6899819	0.021021
5400351	0.020385
6056462	0.020049

corrupted_ids = corrupted_cvs.select(pl.col("CANDIDATE_ID")).to_series().to_list()
raw_cv_cleaned = filter_out_candidate_ids(
    raw_cv_cleaned, corrupted_ids, df_name="CVs", description="corrupted symbols"
)

CVs with corrupted symbols: 14 out of 7553 total
That is 0.19% of all cvs
CVs -> Original rows: 7553
CVs -> Dropped rows:  14 (IDs in provided list)
CVs -> Remaining rows: 7539

5. Translation Completeness

Filter Out Poor Translations
Compute len_ratio = len(Translated_CV) / len(CV_text_anon) and flag any CV with
len_ratio < 0.7 as an empty/missing translation.

Since these cases are very rare, we drop them outright instead of attempting a re-translation.

translation_stats_df = assess_translation_completeness(raw_cv_cleaned)
display(
    translation_stats_df.filter(pl.col("len_ratio") < 0.7)
    .select(["CANDIDATE_ID", "orig_len", "trans_len", "len_ratio"])
    .sort("len_ratio", descending=False)
)

shape: (18, 4)

CANDIDATE_ID	orig_len	trans_len	len_ratio
i64	i64	i64	f64
7164534	6328	107	0.016909
158927	2700	139	0.051481
127945	1736	798	0.459677
6661398	3130	1731	0.553035
7303615	1134	661	0.582892
…	…	…	…
5964952	2017	1325	0.656916
5871851	5450	3636	0.667156
6278055	330	223	0.675758
1009832	4821	3332	0.691143
6200024	1442	997	0.691401

low_translation_ids = (
    translation_stats_df.filter(pl.col("len_ratio") < 0.7)
    .select("CANDIDATE_ID")
    .to_series()
    .to_list()
)
raw_cv_cleaned = filter_out_candidate_ids(
    raw_cv_cleaned,
    low_translation_ids,
    df_name="CVs",
    description="low translation completeness",
)

CVs with low translation completeness: 18 out of 7539 total
That is 0.24% of all cvs
CVs -> Original rows: 7539
CVs -> Dropped rows:  18 (IDs in provided list)
CVs -> Remaining rows: 7521

Since our analysis relies exclusively on the English translations, we will now verify that the Translated_CV column truly contains English text before proceeding with our analyses.

not_english_df = (
    raw_cv_cleaned.with_columns(
        [
            pl.col("Translated_CV")
            .map_elements(
                lambda s: not is_this_language(s or "", "en"),
                return_dtype=pl.Boolean,
            )
            .alias("not_english")
        ]
    )
    .select(["CANDIDATE_ID", "Translated_CV", "not_english"])
    .filter(pl.col("not_english"))
)
print(f"Found {not_english_df.height} CVs not in English.")
display(not_english_df.head())

Found 16 CVs not in English.

shape: (5, 3)

CANDIDATE_ID	Translated_CV	not_english
i64	str	bool
7546519	" gender male birthdate 18/10/1…	true
7292989	" work experience 10/15/2015 – …	true
7282933	" f o r m a t o e u r o p e o p…	true
7130948	" professional experience ener …	true
6871809	" experience bartender/barista …	true

print("Sample CVs not in English:")
print(not_english_df.sample(1)["Translated_CV"].item())

Sample CVs not in English:
 work experience 10/15/2015 – 01/18/2016 Vercelli, Italy Hertz mechanic 1. tire fitter 2. coupons address via fagiolini 11, 13100, Vercelli, Italy 09/01/2016 – 12/01/2016 Vercelli, Italy Hertz mechanic 1. tire fitter date of birth: 09/09/1995 2. coupons nationality: Italian address via fagiolini 11, 13100, Vercelli, Italy gender: male 04/02/2017 – 05/31/2017 Crescentino, Italy tire fitter Pneuscar sas di Mauro d&#39;Altoè e c. contacts 1. tire dealer address via giotto 54, 13044, crescentino, italy via umberto i 22  27030 palestro, italy (other) 01/07/2017 – 15/06/2018 saluggia, italy manuelboni31@gmail.com  electrician electrician momo massimo (+39) 3515326965 1. electrician  2. removal worker address via taiola, 13040, saluggia, italy 22/10/2018 – 07/12/2018 crescentino, italy tire dealer pneuscar sas di mauro d&#39;altoè e c. 1. tire dealer address via giotto 54, 13044, crescentino, italy 04/19/2019 – 06/15/2019 crescentino, italy tire dealer pneuscar sas di mauro d&#39;altoè e c. 1. tire dealer address via giotto 54, 13044, crescentino, italy 07/14/2020 – 10/18/2022 desana, italy blacksmith il ferraccio di balbo fabrizio 1. metalworker 2. maintenance and repairer of agricultural vehicles at &quot;claas agricoltura vercelli&quot; address via gioberti 64, 13034, desana, italy 10/19/2022 – 12/31/2022 vercelli, italy worker ecam srl 1. worker 2. assistant mechanical repairer of trucks 1 / 23. blacksmith address via f. donato 11/a, 13100, vercelli, italy education and training 09/2006 – 06/2012 vercelli, italy middle school diploma state school of first degree &quot;l. verga&quot; address via trino 32, vercelli, italy language skills native language: italian additional information driving license 05/25/2015 – 09/09/2025 driving license: b I authorize the processing of my personal data in the CV pursuant to art. 13 legislative decree 30 June 2003 n. 196 - &quot;personal data protection code&quot; and art. 13 gdpr 679/16 - &quot;european regulation on the protection of personal data&quot;. 2 / 2

We found only 16 records flagged as non-English. After a manual review, we discovered that these CVs are indeed written in English but include many Italian place names, addresses and organization titles which skew the language detector’s statistics. Since the underlying text is English and these cases are few, we will not drop them.

Note: going forward, all regex or pattern-based quality checks should be applied only to the English Translated_CV field.

6. Handling Empty CV Records

During our inspection we found many corrupted CVs.
In each dedicated section, to ensure data quality and consistency across all our analyses, we will:

Exclude their parsed skills
Filter out any rows in our parsed skills DataFrame (cv_skills) corresponding to those same CANDIDATE_IDs.
Omit their entries in the Reverse Matching results
Drop records in the Reverse Matching dataset (ReverseMatching.xlsx) for those candidate IDs.

2. Parsed Skills

1. Schema & Load

raw_cv = load_data(CANDIDATE_CVS_TRANSLATED_PATH)
raw_skills = load_data(PARSED_DATA_PATH)
print("DataFrame schema:", raw_skills.schema)
print(f"Loaded: {raw_skills.height} Skills")

################

total_rows = raw_skills.height
unique_ids = raw_skills.select(pl.col("CANDIDATE_ID")).unique().height
print(f"Unique CANDIDATE_ID: {unique_ids}")

DataFrame schema: Schema({'CANDIDATE_ID': Int64, 'Skill': String, 'Skill_Type': String})
Loaded: 132985 Skills
Unique CANDIDATE_ID: 7769

Exclude parsed skills of the candidates with corrupted CVs.

raw_cv_ids = raw_cv.select(pl.col("CANDIDATE_ID")).to_series().to_list()
print("Number of CVs loaded ->", len(raw_cv_ids))

raw_cv_cleaned_ids = raw_cv_cleaned.select(pl.col("CANDIDATE_ID")).to_series().to_list()
print(
    "Number of CVs kept after cleaning ->",
    len(raw_cv_cleaned_ids),
)

raw_cv_deleted_ids = (
    raw_cv.filter(~pl.col("CANDIDATE_ID").is_in(raw_cv_cleaned_ids))
    .select(pl.col("CANDIDATE_ID"))
    .to_series()
    .to_list()
)
print(
    "Number of CVs deleted ->",
    len(raw_cv_deleted_ids),
)

Number of CVs loaded -> 7769
Number of CVs kept after cleaning -> 7521
Number of CVs deleted -> 248

raw_skills_cleaned = filter_out_candidate_ids(
    raw_skills, raw_cv_deleted_ids, df_name="Skills", description="CVs deleted"
)

##################

# Sanity check on the new number of unique candidate IDs.

total_rows = raw_skills_cleaned.height
unique_ids = raw_skills_cleaned.select(pl.col("CANDIDATE_ID")).unique().height
print(f"Unique CANDIDATE_ID: {unique_ids}")

Skills with CVs deleted: 4086 out of 132985 total
That is 3.07% of all skills
Skills -> Original rows: 132985
Skills -> Dropped rows:  4086 (IDs in provided list)
Skills -> Remaining rows: 128899
Unique CANDIDATE_ID: 7521

2. Missing Value Inspection

inspect_missing(raw_skills_cleaned)

Missing value summary:

shape: (3, 3)

column	n_missing	pct_missing
str	u32	f64
"CANDIDATE_ID"	0	0.0
"Skill"	82	0.06
"Skill_Type"	0	0.0


Columns with > 0% missing values:

shape: (1, 3)

column	n_missing	pct_missing
str	u32	f64
"Skill"	82	0.06

Inspecting missing values we’ve found that 82 rows are missing a Skill value. Diving into the data, we see that these rows are all associated with the value DRIVERSLIC in the Skill_Type column.
This indicates that these rows are likely placeholders for driver license information, which is however useful to our analysis.
We will not drop these rows from the dataset.

Remove Missing or Null Skills

invalid_rows = raw_skills_cleaned.filter(
    pl.col("Skill").is_null() & (pl.col("Skill_Type") != "DRIVERSLIC")
)
print(f"Found {invalid_rows.height} rows with missing Skill (excluding DRIVERSLIC):")
display(invalid_rows)

##################

raw_skills_cleaned = raw_skills_cleaned.filter(
    ~(pl.col("Skill").is_null() & (pl.col("Skill_Type") != "DRIVERSLIC"))
)
print(f"After dropping invalid rows, {raw_skills_cleaned.height} skill records remain.")

Found 1 rows with missing Skill (excluding DRIVERSLIC):

shape: (1, 3)

CANDIDATE_ID	Skill	Skill_Type
i64	str	str
5107653	null	"Language_Skill"

After dropping invalid rows, 128898 skill records remain.

3. Find exact duplicate

total_before = raw_skills_cleaned.height
unique_before = raw_skills_cleaned.unique(subset=["CANDIDATE_ID", "Skill"]).height

print(f"Total rows: {total_before}")
print(f"Unique (ID, Skill, Skill_Type): {unique_before}")

##################

if total_before != unique_before:
    raw_skills_cleaned = raw_skills_cleaned.unique(subset=["CANDIDATE_ID", "Skill"])
    print(f"Dropped {total_before - unique_before} duplicates.")
else:
    print("No duplicates found, skipping deduplication.")

Total rows: 128898
Unique (ID, Skill, Skill_Type): 128898
No duplicates found, skipping deduplication.

4. Validate Skill Types

Ensure Skill_Type only takes one of your known categories (IT_Skill, Job_title, Language_Skill, Professional_Skill, DRIVERSLIC).

VALID_TYPES = [
    "IT_Skill",
    "Job_title",
    "Language_Skill",
    "Professional_Skill",
    "DRIVERSLIC",
]

all_types = raw_skills_cleaned.select("Skill_Type").unique().to_series().to_list()
invalid_types = [t for t in all_types if t not in VALID_TYPES]

##############

if invalid_types:
    print("Found invalid Skill_Type values:")
    for t in invalid_types:
        print("  -", t)
else:
    print("All Skill_Type values are valid.")

All Skill_Type values are valid.

5. Filter Out Garbage Skill Tokens

Drop any Skill entries that are obviously malformed (e.g. length < 2 or > 100, only punctuation or digits).
Optionally remove entries matching placeholder patterns (like a long run of X).

garbage_skills = find_garbage_skill_rows(raw_skills)
print(f"Dropped {garbage_skills.height} skill rows:")
display(garbage_skills.head(10))

Dropped 88 skill rows:

shape: (10, 3)

CANDIDATE_ID	Skill	Skill_Type
i64	str	str
7747553	null	"DRIVERSLIC"
7583771	null	"DRIVERSLIC"
7527802	null	"DRIVERSLIC"
7484890	null	"DRIVERSLIC"
7456557	null	"DRIVERSLIC"
7422185	null	"DRIVERSLIC"
7413665	null	"DRIVERSLIC"
7347469	null	"DRIVERSLIC"
7164548	null	"DRIVERSLIC"
7099403	null	"DRIVERSLIC"

display(garbage_skills.filter(pl.col("Skill_Type") == "DRIVERSLIC"))

shape: (87, 3)

CANDIDATE_ID	Skill	Skill_Type
i64	str	str
7747553	null	"DRIVERSLIC"
7583771	null	"DRIVERSLIC"
7527802	null	"DRIVERSLIC"
7484890	null	"DRIVERSLIC"
7456557	null	"DRIVERSLIC"
…	…	…
124012	null	"DRIVERSLIC"
123094	null	"DRIVERSLIC"
117943	null	"DRIVERSLIC"
117855	null	"DRIVERSLIC"
42289	null	"DRIVERSLIC"

All the garbage_skills would be DRIVERSLIC, so we will not drop any rows from the skills, as previously discussed.

3. Matcher Results

1. Schema & Load

reversed_skills_matching_dict = load_excel_sheets(
    REVERSE_MATCHING_PATH, sheets=["Candidates"]
)
reversed_skills_matching = reversed_skills_matching_dict["Candidates"]

print("DataFrame schema:", reversed_skills_matching.schema)
print(f"Loaded: {reversed_skills_matching.height} candidates")

################

total_rows = reversed_skills_matching.height
unique_ids = reversed_skills_matching.select(pl.col("CANDIDATE_ID")).unique().height
print(f"Unique CANDIDATE_ID: {unique_ids}")

DataFrame schema: Schema({'CANDIDATE_ID': Int64, 'Gender': String, 'Age_bucket': String, 'professional_categories_int': String, 'matterknowledges': String, 'languages': String, 'regulatedtrainings': String, 'candidatecity': String, 'drivinglicenses': String, 'LONGITUDE': Float64, 'LATITUDE': Float64, 'experience': String})
Loaded: 7769 candidates
Unique CANDIDATE_ID: 7769

reversed_skills_matching = filter_out_candidate_ids(
    reversed_skills_matching,
    raw_cv_deleted_ids,
    df_name="Reverse Matchhing Candidates",
    description="CVs deleted",
)

##################

# Sanity check on the new number of unique candidate IDs.

total_rows = reversed_skills_matching.height
unique_ids = reversed_skills_matching.select(pl.col("CANDIDATE_ID")).unique().height
print(f"Unique CANDIDATE_ID: {unique_ids}")

Reverse Matchhing Candidates with CVs deleted: 248 out of 7769 total
That is 3.19% of all reverse matchhing candidates
Reverse Matchhing Candidates -> Original rows: 7769
Reverse Matchhing Candidates -> Dropped rows:  248 (IDs in provided list)
Reverse Matchhing Candidates -> Remaining rows: 7521
Unique CANDIDATE_ID: 7521

2. Missing Value Inspection

inspect_missing(reversed_skills_matching)

Missing value summary:

shape: (12, 3)

column	n_missing	pct_missing
str	u32	f64
"CANDIDATE_ID"	0	0.0
"Gender"	0	0.0
"Age_bucket"	0	0.0
"professional_categories_int"	7	0.09
"matterknowledges"	3818	50.76
…	…	…
"candidatecity"	0	0.0
"drivinglicenses"	3586	47.68
"LONGITUDE"	0	0.0
"LATITUDE"	0	0.0
"experience"	2	0.03


Columns with > 0% missing values:

shape: (6, 3)

column	n_missing	pct_missing
str	u32	f64
"professional_categories_int"	7	0.09
"matterknowledges"	3818	50.76
"languages"	4011	53.33
"regulatedtrainings"	4143	55.09
"drivinglicenses"	3586	47.68
"experience"	2	0.03

3. Analyze demographic data

Next, we examine the demographic columns (age, gender, and location) to spot correlations and identify any redundant fields.

reversed_skills_matching = reversed_skills_matching.with_columns(
    pl.when(pl.col("LATITUDE") > 44.5)
    .then(pl.lit("NORTH"))
    .when(pl.col("LATITUDE") < 42)
    .then(pl.lit("SOUTH"))
    .otherwise(pl.lit("CENTER"))
    .alias("Location")
)

After classifying candidates as North, Center, or South by latitude, we select the columns of interest and examine their inter-correlations to decide which ones to analyze next.

reversed_skills_matching_filtered = filter_unknown_and_other_rows(
    reversed_skills_matching
)
reversed_skills_matching_filtered = reversed_skills_matching_filtered.select(
    "Gender",
    "Age_bucket",
    "Location",
)
reversed_skills_matching_filtered = reversed_skills_matching_filtered.drop_nulls()
display(reversed_skills_matching_filtered)

shape: (7_404, 3)

Gender	Age_bucket	Location
str	str	str
"Female"	"55-74"	"NORTH"
"Female"	"55-74"	"CENTER"
"Female"	"55-74"	"NORTH"
"Male"	"25-34"	"NORTH"
"Female"	"55-74"	"CENTER"
…	…	…
"Female"	"55-74"	"NORTH"
"Male"	"25-34"	"CENTER"
"Male"	"25-34"	"SOUTH"
"Male"	"25-34"	"NORTH"
"Female"	"55-74"	"NORTH"

Building a Cramér’s V Association Matrix

To find inter correlations, we compute a square matrix whose (i, j) entry is the Cramér’s V between column i and column j in reversed_skills_matching. (computed with polars‑ds)

It normalises the χ² test of independence to the range [0, 1]:

\[ V \;=\; \sqrt{\frac{\chi^{2}}{\,n\bigl(k_{\min}-1\bigr)}} \]

where

χ² is the chi-square statistic,
n is the number of rows,
\(k_{\min}\) is the smaller of the two variables’ unique value counts.

Thus V = 0 means no association, V = 1 means a perfect link.

cramer_corrs = [
    [
        cramers_v(reversed_skills_matching_filtered, c1, c2)
        for c2 in reversed_skills_matching_filtered.columns
    ]
    for c1 in reversed_skills_matching_filtered.columns
]

cramer_matrix = (
    pl.DataFrame(cramer_corrs, schema=reversed_skills_matching_filtered.columns)
    .with_columns(pl.Series("index", reversed_skills_matching_filtered.columns))
    .unpivot(index="index")
    .pivot(index="index", on="variable", values="value")
    .sort("index")
)

Attention!
The heat-map reveals a striking finding: Age Bucket and Gender are perfectly correlated (Cramér’s V = 1.0).
Since either variable fully determines the other, we will keep only one in subsequent analyses to avoid redundancy.

plot_cramer_matrix(cramer_matrix)

The tables below makes this explicit: every female candidate is aged 55 – 74, while every male candidate is aged 25 – 34.

female_candidates = reversed_skills_matching_filtered.filter(
    pl.col("Gender") == "Female"
)
display(female_candidates["Age_bucket"].value_counts())

shape: (1, 2)

Age_bucket	count
str	u32
"55-74"	3420

male_candidates = reversed_skills_matching_filtered.filter(pl.col("Gender") == "Male")
display(male_candidates["Age_bucket"].value_counts())

shape: (1, 2)

Age_bucket	count
str	u32
"25-34"	3984

Exporting the Cleaned Data

Export the cleaned CVs, the cleaned parsed skills, the cleaned matcher result to a new CSV file, ensuring the schema and data types are preserved.

raw_cv_cleaned.write_csv(
    "../data/Adecco_Dataset_cleaned/CV_translated_cleaned.csv", separator=";"
)
raw_skills_cleaned.write_csv(
    "../data/Adecco_Dataset_cleaned/Skills_cleaned.csv", separator=";"
)
reversed_skills_matching.write_csv(
    "../data/Adecco_Dataset_cleaned/reversed_skills_matching_candidate.csv",
    separator=";",
)