Import and manipulate JSON in Python


I'm trying to import a .json file with the following structure:

short_description:She left her husband. He killed their children. Just another day in America.
headline:There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV
date:2018-05-26
link:https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89
authors:Melissa Jeltsen
category:CRIME

But apparently the JSON is not formatted properly (the file is here), so I was not able to import it using pandas like this:

df = pd.read_json('../input/news-category-dataset/News_Category_Dataset.json', lines=True)

I was able to do this:

import json

data = []
with open("News_Category_Dataset.json", "r") as f:  # closes the file when done
    for line in f:
        data.append(json.loads(line))  # one JSON object per line

But from what I understand, this way it is read like any ordinary file and the JSON structure is lost (is that right?). So I would like to understand whether the structure is really wrong, whether it can be read with pandas anyway, and/or whether data read this way, as a plain file, is easy to handle.

EDIT: A longer section of the file

{"short_description": "She left her husband. He killed their children. Just another day in America.", "headline": "There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89", "authors": "Melissa Jeltsen", "category": "CRIME"}
{"short_description": "Of course it has a song.", "headline": "Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/will-smith-joins-diplo-and-nicky-jam-for-the-official-2018-world-cup-song_us_5b09726fe4b0fdb2aa541201", "authors": "Andy McDonald", "category": "ENTERTAINMENT"}
{"short_description": "The actor and his longtime girlfriend Anna Eberstein tied the knot in a civil ceremony.", "headline": "Hugh Grant Marries For The First Time At Age 57", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/hugh-grant-marries_us_5b09212ce4b0568a880b9a8c", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}
{"short_description": "The actor gives Dems an ass-kicking for not fighting hard enough against Donald Trump.", "headline": "Jim Carrey Blasts 'Castrato' Adam Schiff And Democrats In New Artwork", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/jim-carrey-adam-schiff-democrats_us_5b0950e8e4b0fdb2aa53e675", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}
{"short_description": "The \"Dietland\" actress said using the bags is a \"really cathartic, therapeutic moment.\"", "headline": "Julianna Margulies Uses Donald Trump Poop Bags To Pick Up After Her Dog", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/julianna-margulies-trump-poop-bag_us_5b093ec2e4b0fdb2aa53df70", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}
{"short_description": "\"It is not right to equate horrific incidents of sexual assault with misplaced compliments or humor,\" he said in a statement.", "headline": "Morgan Freeman 'Devastated' That Sexual Harassment Claims Could Undermine Legacy", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/morgan-freeman-devastated-sexual-misconduct_us_5b096319e4b0802d69cba298", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}
{"short_description": "It's catchy, all right.", "headline": "Donald Trump Is Lovin' New McDonald's Jingle In 'Tonight Show' Bit", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/donald-trump-mcondalds-tonight-show_us_5b093561e4b0fdb2aa53daba", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}
{"short_description": "There's a great mini-series joining this week.", "headline": "What To Watch On Amazon Prime That\u2019s New This Week", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/amazon-prime-what-to-watch_us_5b044625e4b0c0b8b23ec14f", "authors": "Todd Van Luling", "category": "ENTERTAINMENT"}
{"short_description": "Myer's kids may be pushing for a new \"Powers\" film more than anyone.", "headline": "Mike Myers Reveals He'd 'Like To' Do A Fourth Austin Powers Film", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/mike-myers-reveals-he-wants-to-do-a-fourth-austin-powers-film_us_5b096198e4b0802d69cb9f15", "authors": "Andy McDonald", "category": "ENTERTAINMENT"}
{"short_description": "You're getting a recent Academy Award-winning movie.", "headline": "What To Watch On Hulu That\u2019s New This Week", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/hulu-what-to-watch_us_5b0445bae4b0c0b8b23ec046", "authors": "Todd Van Luling", "category": "ENTERTAINMENT"}
{"short_description": "The pop star also wore a \"Santa Fe Strong\" shirt at his show in Houston.", "headline": "Justin Timberlake Visits Texas School Shooting Victims", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/justin-timberlake-visits-texas-school-shooting-victims_us_5b098161e4b0fdb2aa54167e", "authors": "Sebastian Murdock", "category": "ENTERTAINMENT"}
{"short_description": "The two met to pave the way for a summit between North Korean and the U.S.", "headline": "South Korean President Meets North Korea's Kim Jong Un To Talk Trump Summit", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/south-korean-president-meets-north-koreas-kim-jong-un_us_5b094ebae4b0fdb2aa53e504", "authors": "", "category": "WORLD NEWS"}
{"short_description": "The revolution is coming to rural New Brunswick.", "headline": "With Its Way Of Life At Risk, This Remote Oyster-Growing Region Called In Robots", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/remote-oyster-growing-region-called-in-robots_us_5b083658e4b0fdb2aa53415d", "authors": "Karen Pinchin", "category": "IMPACT"}
{"short_description": "Last month a Health and Human Services official revealed the government was unable to locate nearly 1,500 children who had been released from its custody.", "headline": "Trump's Crackdown On Immigrant Parents Puts More Kids In An Already Strained System", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/immigrant-children-separated-from-parents_us_5b087b90e4b0802d69cb4070", "authors": "Elise Foley and Roque Planas", "category": "POLITICS"}
{"short_description": "The wiretaps feature conversations between Alexander Torshin and Alexander Romanov, a convicted Russian money launderer.", "headline": "'Trump's Son Should Be Concerned': FBI Obtained Wiretaps Of Putin Ally Who Met With Trump Jr.", "date": "2018-05-26", "link": "https://www.huffingtonpost.com/entry/fbi-wiretaps-putin-ally-trump-jr_us_5b08bf56e4b0568a880b7859", "authors": "Michael Isikoff, Yahoo News", "category": "POLITICS"}
    
asked by anonymous 30.10.2018 / 00:15

1 answer


Your file's structure is a variant of JSON called JSON Lines. The file extension should be .jsonl.

It's a very simple format, exactly like JSON, except that instead of a single JSON document for the whole file, it holds one JSON object per line. You can read it in several ways: using pandas or, as in your example, reading each line of the file separately and decoding it with the standard json module. There are also libraries dedicated to reading this format.
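To make the format concrete, here is a minimal sketch that decodes JSON Lines text using only the standard library. The helper name read_json_lines is made up for this illustration, and the sample is a shortened, in-memory stand-in for the real file (io.StringIO is used so the snippet runs without the dataset):

```python
import io
import json

# Two records in JSON Lines form: one complete JSON object per line.
# Shortened from the data in the question; held in memory for the demo.
sample = io.StringIO(
    '{"headline": "Hugh Grant Marries For The First Time At Age 57", "category": "ENTERTAINMENT"}\n'
    '{"headline": "Justin Timberlake Visits Texas School Shooting Victims", "category": "ENTERTAINMENT"}\n'
)


def read_json_lines(handle):
    """Decode every non-empty line of a file-like object as one JSON value."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            records.append(json.loads(line))
    return records


records = read_json_lines(sample)
print(len(records), records[0]["category"])
```

With the real file, the same function should work on the object returned by open("News_Category_Dataset.json"), assuming every non-empty line is valid JSON.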

  

I could not import using Pandas

I downloaded the complete file (registering on the site was required) and then imported it into pandas without trouble, using lines=True, the pandas parameter that enables reading JSON Lines:

>>> df = pd.read_json('News_Category_Dataset.json', lines=True)
>>> df.describe()
       authors  category        ...                                                      link short_description
count   124989    124989        ...                                                    124989            124989
unique   19250        31        ...                                                    124964            103905
top             POLITICS        ...         https://www.huffingtonpost.comhttps://www.publ...                  
freq     14151     32739        ...                                                         2             19590
first      NaN       NaN        ...                                                       NaN               NaN
last       NaN       NaN        ...                                                       NaN               NaN

No problem here, as you can see above. If you cannot read it with pandas, I suggest editing the question and adding the complete error message, including the traceback, because something else must be wrong.

  

But from what I understand, this way it's just like any file and json's structure is lost (is that right?)

This question is confusing. A JSON file is also just "any file"; after all, every file is. The structure is not lost, because the data is still read in a structured way: each line is decoded into a dictionary, so you can still access, for example, the description separately from the category.
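As a small illustration of the structure surviving a line-by-line read, the sketch below decodes three made-up lines (the one-letter headlines are placeholders, not real data) and then works with individual fields:

```python
import json
from collections import Counter

# Placeholder lines in the same shape as the question's records.
lines = [
    '{"headline": "A", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}',
    '{"headline": "B", "authors": "", "category": "WORLD NEWS"}',
    '{"headline": "C", "authors": "Ron Dicker", "category": "ENTERTAINMENT"}',
]

# Same idea as the loop in the question: each line becomes a dict.
data = [json.loads(line) for line in lines]

# Fields stay individually addressable, so grouping and filtering work.
category_counts = Counter(record["category"] for record in data)
headlines_by_ron = [r["headline"] for r in data if r["authors"] == "Ron Dicker"]

print(category_counts)
print(headlines_by_ron)
```

Each element of data is an ordinary dict, so anything you would do with parsed JSON (lookups by key, filtering, counting) still applies.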

The only difference is that, instead of using the ready-made function that comes with pandas to interpret the format, you are doing part of the interpretation yourself. Most of the time, using an existing implementation from a well-known library is the better solution; however, for a specific use case it may be better to read the file manually. It all depends on what you intend to do with the structure afterwards, i.e. how you are going to manipulate the data.
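One case where reading manually might be preferable is when you only need some of the records and would rather not hold the whole file in memory. The following is only a sketch under that assumption: iter_category is a hypothetical helper, and the sample lines are placeholders standing in for the real file.

```python
import io
import json


def iter_category(handle, category):
    """Yield only the records of one category, decoding one line at a time."""
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("category") == category:
            yield record


# Placeholder data in memory; with the real file, pass an open file object.
sample = io.StringIO(
    '{"headline": "A", "category": "POLITICS"}\n'
    '{"headline": "B", "category": "CRIME"}\n'
    '{"headline": "C", "category": "POLITICS"}\n'
)

politics = list(iter_category(sample, "POLITICS"))
print([record["headline"] for record in politics])
```

Because iter_category is a generator, only one line is decoded at a time, which is the kind of control the ready-made pandas call does not give you by default. If you end up needing the whole table anyway, pd.read_json with lines=True remains the simpler choice.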

    
30.10.2018 / 04:02