A PDF file defines instructions to place characters at precise x,y coordinates relative to the bottom-left corner of the page. Words are simulated by placing some characters closer than others. Spaces are simulated by placing words relatively far apart. And finally tables are simulated by placing words as they would appear in a spreadsheet. The format has no internal representation of a table structure.
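To make the idea of "simulated" words concrete, here is a minimal sketch of how positioned characters on one line might be grouped back into words. The `Glyph` type, the `group_into_words` helper, and the 2-point gap threshold are all assumptions made for illustration; they are not how Excalibur or Camelot do it internally.

```python
from typing import List, NamedTuple, Optional


class Glyph(NamedTuple):
    """A single positioned character (hypothetical type for this sketch)."""
    char: str
    x: float      # left edge, in points from the left of the page
    y: float      # baseline, in points from the bottom of the page
    width: float  # horizontal space the character occupies, in points


def group_into_words(glyphs: List[Glyph], gap: float = 2.0) -> List[str]:
    """Group glyphs that share a line into words.

    Glyphs are read left to right; a new word starts whenever the empty
    space since the previous glyph exceeds `gap` points (an assumed value).
    """
    words: List[str] = []
    current = ""
    prev_end: Optional[float] = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev_end is not None and g.x - prev_end > gap:
            words.append(current)
            current = ""
        current += g.char
        prev_end = g.x + g.width
    if current:
        words.append(current)
    return words


# Eight characters on the same baseline; only the gap after "F" is wide.
line = [
    Glyph("P", 72.0, 700.0, 6.0),
    Glyph("D", 78.0, 700.0, 6.0),
    Glyph("F", 84.0, 700.0, 6.0),
    Glyph("t", 96.0, 700.0, 3.0),
    Glyph("a", 99.0, 700.0, 5.0),
    Glyph("b", 104.0, 700.0, 5.0),
    Glyph("l", 109.0, 700.0, 2.0),
    Glyph("e", 111.0, 700.0, 5.0),
]
print(group_into_words(line))  # ['PDF', 'table']
```

The same gap-based reasoning, applied to whole words instead of characters, is what lets a tool guess where one table column ends and the next begins.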
The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs, and getting tables out for analysis is a pain. A simple copy-and-paste doesn't work. Excalibur makes PDF table extraction easy by automatically detecting tables in PDFs and letting you save them as CSV and Excel files through a web interface.
There are both open-source and closed-source tools that are widely used for PDF table extraction. They either give a nice output or fail miserably. Excalibur is powered by Camelot, which gives users additional settings to tweak table extraction and get the best results. You can see how it performs better than other open-source tools and libraries in this comparison.
You get complete control over your data, since all file storage and processing happens on your own local or remote machine. Excalibur can also be configured with MySQL and Celery to execute table extraction jobs in a parallel and distributed manner. By default, jobs are executed sequentially.
You can upload a PDF using the web interface. You can also interact with previous uploads.
You can guide the tool by drawing table areas and column separators in cases where the tables are buried deep inside the text and autodetection fails.
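For readers who prefer scripting, the same kind of guidance can be passed to Camelot directly. The sketch below assumes Camelot's `read_pdf` function with its `flavor`, `table_areas`, and `columns` arguments; the helper names `build_stream_kwargs` and `extract_tables`, the file name, and all coordinates are hypothetical. Each area is given as `x1,y1,x2,y2` (top-left and bottom-right corners) in PDF coordinate space, and each entry in `columns` lists the x positions of the separators for the matching area.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (x1, y1, x2, y2): top-left and bottom-right corners, in PDF points.
Area = Tuple[float, float, float, float]


def build_stream_kwargs(
    areas: Sequence[Area],
    column_xs: Optional[Sequence[Sequence[float]]] = None,
) -> Dict[str, object]:
    """Turn numeric areas and separators into the comma-separated strings
    that Camelot's read_pdf accepts (hypothetical helper for this sketch)."""
    kwargs: Dict[str, object] = {
        "flavor": "stream",
        "table_areas": [",".join(str(v) for v in area) for area in areas],
    }
    if column_xs is not None:
        # One list of separator x-coordinates per table area.
        kwargs["columns"] = [",".join(str(x) for x in xs) for xs in column_xs]
    return kwargs


def extract_tables(pdf_path: str, pages: str, areas: Sequence[Area],
                   column_xs: Optional[Sequence[Sequence[float]]] = None):
    """Run Camelot on the given areas; requires the camelot package."""
    import camelot  # imported lazily so the helper above works without it

    return camelot.read_pdf(pdf_path, pages=pages,
                            **build_stream_kwargs(areas, column_xs))


# Example arguments only; the coordinates are made up.
print(build_stream_kwargs([(316, 499, 566, 337)], [[380, 450, 510]]))
```

With Camelot installed, a call such as `extract_tables("report.pdf", "1", [(316, 499, 566, 337)])` would return a list of tables; each table exposes a pandas DataFrame through `.df` and can be written out with `.to_csv(...)`.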
You can save table extraction settings for a PDF once, and apply them on new PDFs to extract tables with similar structures.
Copyright © Camelot Dev 2018
Made in New Delhi, India