Resume Parsing in 2024 / by Owein Reese

You'd think that with over 1,000 ATS (Applicant Tracking System) vendors operating in the world today, the same file formats (PDF, TXT, DOCX, etc.) in use for over 20 years, and the rise of AI, resume parsing would be a solved problem by now. You would be wrong. It's much, much harder than you'd believe.

The advice most often given to developers submitting a resume to an agency, recruiter, or job site is to use a PDF. PDFs are great at formatting and arranging images, video, audio, text, interactive elements, etc. in a document. They focus on layout and spacing. They do not embed semantic linkages between text elements, which layout does not need but a parser requires to extract information properly. All the clues are visual.

Hence, even LLMs struggle to decipher the contents of a PDF. The best technique is still a rule engine with heuristics that group PDF data. As a result, we wind up with “Clark Shipping, MA” as the address and a blank for the company name, and still call it a win.
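
To make the heuristic approach concrete, here is a minimal sketch of one such grouping rule: fragments that sit on roughly the same baseline are treated as one line. Everything in it is an assumption for illustration. The `Fragment` type, the function names, the tolerances, and the coordinates are hypothetical, and real PDF libraries expose positioned text in their own ways.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A positioned piece of text (hypothetical shape, not a real library type)."""
    text: str
    x: float      # left edge, in points
    y: float      # baseline; PDF coordinates grow upward from the page bottom
    width: float


def group_into_lines(fragments, y_tolerance=2.0):
    """Heuristic: fragments whose baselines are within y_tolerance share a line."""
    lines = []
    # Visit fragments top to bottom (largest y first), then left to right.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, restore left-to-right reading order.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def line_text(line, gap_threshold=1.0):
    """Join fragments, inserting a space wherever the horizontal gap is large enough."""
    parts = [line[0].text]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        parts.append((" " if gap > gap_threshold else "") + cur.text)
    return "".join(parts)


# Made-up fragments: a company name and a state that happen to share a baseline.
fragments = [
    Fragment("MA", 160.0, 700.4, 16.0),
    Fragment("Clark", 72.0, 700.0, 28.0),
    Fragment("Shipping,", 104.0, 700.0, 50.0),
    Fragment("Operations", 72.0, 686.0, 55.0),
    Fragment("Manager", 131.0, 686.0, 45.0),
]

lines = [line_text(line) for line in group_into_lines(fragments)]
print(lines)
```

Nothing in the positions says that "Clark Shipping," is a company and "MA" is part of an address. The rule only sees that the three fragments are close together, so it merges them into one string, and a later rule has to guess what that string means.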