Text mining in Python or R [closed]

1

I'm trying to extract information from PDF files to populate a table without having to read each PDF myself. I cannot find any reference on how to do this.

I need, for example, to find the authors and date of publication of this article:

link

I would like suggestions for packages or functions in Python or R.

    
asked by anonymous 09.10.2018 / 18:50

1 answer

1

PDF files can have special metadata fields to store this data, such as author and date, but I opened the PDF you sent and those fields are not filled in.

So there's no magic: you'll need to parse the text and extract the data directly, since the PDF does not provide this data in an organized way.

If you do not know the exact text to search for, you can list several possibilities and have your program try each one until it finds one that yields the data.
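A minimal sketch of that "try every possibility" idea, assuming the PDF text has already been extracted into a string by whatever library you pick. The patterns and the name `first_match` are hypothetical, chosen only for illustration; you would replace them with whatever wording your own PDFs actually use.

```python
import re

# Hypothetical candidate patterns for a publication date, tried in order.
# Each one captures the date itself in group 1.
DATE_PATTERNS = [
    re.compile(r"Published(?: online)?:\s*(\d{1,2} \w+ \d{4})"),
    re.compile(r"Received:\s*(\d{1,2} \w+ \d{4})"),
    re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    re.compile(r"\b(\d{2}/\d{2}/\d{4})\b"),
]


def first_match(text, patterns):
    """Return group 1 of the first pattern that matches, or None if none do."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None
```

Ordering the list from most specific to most generic keeps a loose pattern from grabbing an unrelated date before a stricter one gets its turn.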

For example, in the linked PDF, you can try comparing each line to the PDF file name to find the full title, and treat the next line as the author.
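One way this heuristic could look, as a sketch rather than a definitive implementation. It assumes the text is already available as a list of lines, and it uses the standard library's `difflib` for fuzzy comparison, which is my choice and not something the heuristic requires. The function names and the `threshold` value are hypothetical.

```python
import difflib
import re
from pathlib import Path


def normalize(s):
    """Lowercase and collapse punctuation, underscores and spaces to one space."""
    return re.sub(r"[\W_]+", " ", s.lower()).strip()


def guess_title_and_author(lines, pdf_path, threshold=0.6):
    """Return (title, author) guessed from the file name, or None if no line is close enough."""
    stem = normalize(Path(pdf_path).stem)
    best_index, best_score = None, 0.0
    for i, line in enumerate(lines):
        score = difflib.SequenceMatcher(None, stem, normalize(line)).ratio()
        if score > best_score:
            best_index, best_score = i, score
    if best_index is None or best_score < threshold:
        return None
    title = lines[best_index].strip()
    # Assumption: the author line is the next non-empty line after the title.
    for following in lines[best_index + 1:]:
        if following.strip():
            return title, following.strip()
    return title, None
```

This only works when the file name resembles the title; for files named with opaque identifiers it returns None and another strategy has to take over.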

Another option is to search the text for the ISSN acronym and, if you find it, take the number and look it up on sites like link, extracting the data you want from the site rather than from the PDF.
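A sketch of the first half of that approach: locating an ISSN in already-extracted text. The regular expression and function names are assumptions for illustration. The check-digit test follows the standard ISSN rule (weights 8 down to 2 over the first seven digits, modulo 11, with X standing for 10), which helps reject numbers that merely look like an ISSN. The lookup step is left out, since it depends on which site you query.

```python
import re

# "ISSN", up to 20 non-digit characters (e.g. ": " or " (online) "),
# then four digits, an optional hyphen, three digits and a check character.
ISSN_PATTERN = re.compile(r"ISSN[^0-9]{0,20}(\d{4})-?(\d{3}[\dXx])")


def is_valid_issn(chars):
    """Validate the eighth character of an 8-character ISSN string."""
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7].upper() == expected


def find_issn(text):
    """Return the first valid ISSN in text as 'NNNN-NNNC', or None."""
    for match in ISSN_PATTERN.finditer(text):
        chars = match.group(1) + match.group(2)
        if is_valid_issn(chars):
            return f"{chars[:4]}-{chars[4:].upper()}"
    return None
```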

    
09.10.2018 / 19:05