Reading & Sorting PDFs

This problem arose to help students prepare for their exams.  The idea was to create a collection of problems sorted by topic and sub-topic.  So for example the folder  Electromagnetism / Magnets would contain all problems about drawing magnetic fields, hard and soft magnetic materials, etc.  But how do we extract a given exercise out of a PDF and find the corresponding folder?

Available were 22 PDFs with exams from the last 10 years.  Each file containing about 25 pages.  So all in all 550 pages.   The job was to break-up those 550 pages into 1,2 or 3 page sub-documents, and save them to one of 20 folders corresponding to their topic.

The automation was not very sophisticated, but effective enough for its purpose & budget (pro-bono!).  The robot would do the following:

  • Open the PDF & turn it into just text
  • Find where problems start and end
  • Split the PDF into 10 to 15 new PDFs (one per problem)
  • Check a list of  keywords to give it a score:
for example: If it contains the words "Force", "Acceleration", "accelerates", "distance-time" it probably belongs in Mechanics.

Then store it in the corresponding folder.  For more sophisticated jobs with tens of thousands of documents, a Naive Bayesian Classifier is a more accurate and effective approach.

The original solution to this problem can be found on Github and a more up-to-date version on repl.it.

Show Comments