Adding a grouped total to a new PySpark DataFrame column

0

I have a dataframe with the following columns:

COL1    COL2    COL3    NEW_COL*
A       asd      1         8
B       adf      2         9
A       adg      8         1
B       adh      9         2
C       adj      7         7
D       adk      1         1

where NEW_COL = (total sum of COL1 by type - the value of COL3) / (total count of COL1 by type - 1)

This is the column I need help with. Does anyone know how to compute it in a PySpark DataFrame?

Thanks!

    
asked by anonymous 30.05.2018 / 23:32

1 answer

1

Adriana, I did not understand the calculation for your new column. If NEW_COL = (sum total of col1 by type - the value of col3) / (total qtd of col1 by type - 1), then for the first row of NEW_COL we would have:

  • Total sum of col1 by type = 2, since there are two occurrences of A in col1
  • Value of col3: 1
  • Total count of col1 per type: 2, since there are two occurrences of A in col1

So the first row would be (2-1) / (2-1) = 1, which is why I do not understand how the result is 8. Could you explain the calculation with a more detailed example?
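
One reading that does seem to reproduce the 8, offered only as a guess until you confirm it: "sum by type" may mean the sum of COL3 within each COL1 group rather than the number of occurrences. For A that gives (9-1) / (2-1) = 8. The plain-Python sketch below checks this reading against every row of your table. It rests on two assumptions that are mine, not yours: the sum is taken over COL3, and a group with a single row (C, D) keeps its own COL3 value, since the formula would otherwise be 0/0. The helper name new_col is hypothetical.

```python
from collections import defaultdict

# Rows copied from the question: (COL1, COL2, COL3, expected NEW_COL)
rows = [
    ("A", "asd", 1, 8),
    ("B", "adf", 2, 9),
    ("A", "adg", 8, 1),
    ("B", "adh", 9, 2),
    ("C", "adj", 7, 7),
    ("D", "adk", 1, 1),
]

# Assumption: "sum by type" means the sum of COL3 within each COL1 group.
group_sum = defaultdict(int)
group_count = defaultdict(int)
for col1, _, col3, _ in rows:
    group_sum[col1] += col3
    group_count[col1] += 1


def new_col(col1, col3):
    """Mean of the other COL3 values in the same COL1 group (hypothetical helper)."""
    n = group_count[col1]
    if n == 1:
        # Assumption: a single-row group keeps its own value (the formula gives 0/0).
        return float(col3)
    return (group_sum[col1] - col3) / (n - 1)


computed = [new_col(col1, col3) for col1, _, col3, _ in rows]

# Compare the computed values with the NEW_COL values given in the question.
for (col1, col2, col3, expected), value in zip(rows, computed):
    print(col1, col2, col3, value, "matches" if value == expected else "differs")
```

If this is the intended calculation, the same per-group sum and count could be obtained in PySpark with window aggregates, for example F.sum("COL3") and F.count("COL3") applied over Window.partitionBy("COL1"), then combined in withColumn, with F.when(...).otherwise(...) covering the single-row groups.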

    
01.06.2018 / 01:16