PDF To XML Conversion

This tool converts PDF files to structured XHTML. Basic knowledge of HTML is required. Note that this is a proof of concept.

Please read the limitations below before you proceed. Uploads are limited to 25 MB and 400 pages.
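As a rough illustration, the upload limits could be checked locally before sending a file. This is only a sketch: the function name is hypothetical, "25 MB" is assumed to mean 25 × 1024 × 1024 bytes, and the page count is assumed to be supplied by the caller (for example from a PDF library of your choice). It does not describe how the service itself enforces the limits.

```python
from typing import List

# Assumption: "25 MB" is interpreted as 25 MiB; the service may count differently.
MAX_BYTES = 25 * 1024 * 1024
MAX_PAGES = 400


def check_upload_limits(size_bytes: int, page_count: int) -> List[str]:
    """Return a list of problems; an empty list means the file is within limits."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    if page_count > MAX_PAGES:
        problems.append(f"document has {page_count} pages, limit is {MAX_PAGES}")
    return problems
```

The file size can be obtained with `os.path.getsize(path)`; how the page count is determined is left open here.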

Privacy

The uploaded PDF files and related data remain completely private. Everything is destroyed around 5 hours after the last activity, unless you explicitly choose to share your file with the development team to report a bug or ask for an improvement.

This site does not use cookies.

What's New?

09/09/2020 Table management: column widths, detection of row and column spans.

08/01/2020 Various bug fixes. Some help content added.

07/14/2020 The first version is online.

01/02/2021 The server burned. All is lost.

Limitations

THIS WEB SERVICE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

What to Expect?

This will never be a perfect tool, but very good results can be obtained when specific conditions are met. In most cases, it helps with converting a document to structured XML, but a lot of manual work will still be necessary. In some cases, a usable result is simply impossible.

Beta Version

There is room for many improvements.

Audience

This is a conversion tool intended for technical users. It takes many configuration steps, and sometimes many attempts, before a usable result can be obtained (when such a result is possible).

Technical Limitations

Some PDF files are built in a way that makes conversion impossible. For example, the characters may be stored as drawings instead of character codes. In that case, the resulting document will simply be empty.
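One way to spot this case after a conversion is to check whether the resulting XHTML contains any text at all. The sketch below uses only the Python standard library; the function name is hypothetical, and it assumes the output is well-formed XML without undeclared named entities such as `&nbsp;`. Text in the document head, such as a title, also counts as text here.

```python
import xml.etree.ElementTree as ET


def xhtml_has_text(xhtml: str) -> bool:
    """Return True if any element in the document carries non-whitespace text."""
    root = ET.fromstring(xhtml)
    # itertext() yields every text fragment in document order, across all elements.
    return any(chunk.strip() for chunk in root.itertext())


# Two tiny illustrative documents (not real output of the tool).
empty = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p> </p></body></html>'
filled = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hello</p></body></html>'

print(xhtml_has_text(empty))
print(xhtml_has_text(filled))
```

If the check reports no text for a PDF that visibly contains text, the file probably falls into the category described above, and further configuration attempts are unlikely to help.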