Using Group By in SAS Guide

1

I'm having trouble using the command GROUP BY , in SAS . By definition, SAS only accepts GROUP BY if a summary function is used in SELECT . Until then, okay. Turning Proc below, the result exits correctly:

proc sql;
    create table tabela_nova as
    select  nr_prpt_vcld,
            (max(ult_acl_atv)) as ult_acl_atv format ddmmyy10.
    from    tabela_base
    group by nr_prpt_vcld
    ;
quit;

nr_prpt_vcld ult_acl_atv
23157 02/01/2014
27175 07/03/2014
29350 2/17/2014

However, adding two more fields to SELECT , (as below) to Proc returns the following result:

proc sql;
    create table tabela_nova as
    select  nr_prpt_vcld,
            nr_cvn_srvc,
            cd_cli_srvc,
            (max(ult_acl_atv)) as ult_acl_atv format ddmmyy10.
    from    tabela_base
    group by nr_prpt_vcld
    ;
quit;

nr_prpt_vcld nrcvn_srvc cd_cli_srvc ult_acl_atv
23157 140594 504217876 02/01/2014
23157 147914 510003171 02/01/2014
23157 153120 511284856 02/01/2014
27175 140676 203940185 07/03/2014
27175 141805 931007593 07/03/2014
27175 150291 510955695 07/03/2014
29350 137242 508122116 2/17/2014
29350 145502 30139405 2/17/2014

I've tried adding new fields to GROUP BY , using HAVING , but nothing solved. I still can not figure out how GROUP BY in SAS works. Can anyone help me?

The expected result is as follows:

nr_prpt_vcld   nr_cvn_srvc       cd_cli_srvc ult_acl_atv 
    23157           153120        511284856  02/01/2014 
    27175           150291        510955695  03/07/2014 
    29350           137242        508122116  17/02/2014 

That is, the values nr_cvn_srvc and cd_cli_srvc that are on the most recent date (ult_acl_atv), grouped by nr_prpt_vcld.

    
asked by anonymous 31.07.2014 / 17:35

1 answer

2

TL; DR - (Workaround) use subqueries:

select  t1.nr_prpt_vcld                               as nr_prpt_vcld,
        (select t2.nrcvn_srvc
             from tabela_base t2
             where t2.nr_prpt_vcld = t1.nr_prpt_vcld
             order by t2.ult_acl_atv desc
             limit 1
        )                                             as nrcvn_srvc,
        (select t2.cd_cli_srvc
             from tabela_base t2
             where t2.nr_prpt_vcld = t1.nr_prpt_vcld
             order by t2.ult_acl_atv desc
             limit 1
        )                                             as cd_cli_srvc,
        (max(t1.ult_acl_atv))                         as ult_acl_atv
    from    tabela_base t1
    group by t1.nr_prpt_vcld

Example in SQLFiddle.

I have little familiarity with SAS, but as far as I understand, it has compliance with SQL standards. I will respond in a more didactic way, so that you can identify where you may be wrong and what to do to correct:

SQL queries overview

In an abstract way, it can be said that a query performs the following steps, in that order:

  • Find the table (s) to be consulted, making the applicable joins;

    Eg: from tabela_base

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv foo
    23157           140594        504217876  30/12/2013   bar
    23157           147914        510003171  01/01/2014   bar
    23157           153120        511284856  02/01/2014   bar
    27175           140676        203940185  01/07/2014   bar
    27175           141805        931007593  02/07/2014   bar
    27175           150291        510955695  03/07/2014   bar
    29350           137242        508122116  16/02/2014   bar
    29350           145502        30139405   17/02/2014   bar
    99999           999999        999999999  18/02/2014   baz
    
  • Filters the returned rows, according to condition where ;

    Eg: where foo = bar

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv foo
    23157           140594        504217876  31/12/2013   bar
    23157           147914        510003171  01/01/2014   bar
    23157           153120        511284856  02/01/2014   bar
    27175           140676        203940185  01/07/2014   bar
    27175           141805        931007593  02/07/2014   bar
    27175           150291        510955695  03/07/2014   bar
    29350           137242        508122116  16/02/2014   bar
    29350           145502        30139405   17/02/2014   bar
    -----           ------        --------   ----------   baz (X)
    
  • Group according to the criterion of the group by clause

    Eg: group by no_prpt_vcld

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv foo  tupla de agrupamento
    
    23157           140594        504217876  31/12/2013   bar \
    23157           147914        510003171  01/01/2014   bar  (23157)
    23157           153120        511284856  02/01/2014   bar /
    
    27175           140676        203940185  01/07/2014   bar \
    27175           141805        931007593  02/07/2014   bar  (27175)
    27175           150291        510955695  03/07/2014   bar /
    
    29350           137242        508122116  16/02/2014   bar \(29350)
    29350           145502        30139405   17/02/2014   bar /
    
  • "Achata" groups, each transforming into a single row, adding columns according to the aggregation functions

    Eg: (max(ult_acl_atv)) as ult_acl_atv

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv foo
    23157          [Achatado]   [Achatado]    02/01/2014  [Achatado]
    27175          [Achatado]   [Achatado]    03/07/2014  [Achatado]
    29350          [Achatado]   [Achatado]    17/02/2014  [Achatado]
    
  • Filter again, now the "flat" lines (not the originals)

    Ex: having ult_acl_atv < '06/06/2014' (imprecise syntax)

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv foo
    23157          [Achatado]   [Achatado]    02/01/2014  [Achatado]
    -----          ----------   ----------    03/07/2014  ---------- (X)
    29350          [Achatado]   [Achatado]    17/02/2014  [Achatado]
    
  • Choose the columns to return; you can only use grouped columns (i.e. that were part of group by or aggregations of other columns (i.e. that had some value defined after "flattening").)

    Eg: select nr_prpt_vcld, (...) as ult_acl_atv

    nr_prpt_vcld   ult_acl_atv
    23157          02/01/2014
    29350          17/02/2014
    
  • How does this apply to you?

    As you can see, you can not add columns nrcvn_srvc and cd_cli_srvc because they have been "flattened" when you get several rows from the table and transform into one. The most you can do is add them too, if that's what you want. Ex.:

    select  nr_prpt_vcld,
            (min(nrcvn_srvc)) as nrcvn_srvc,
            (sum(cd_cli_srvc)) as cd_cli_srvc,
            (max(ult_acl_atv)) as ult_acl_atv format ddmmyy10.
    
        nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv
        23157          140594       1525505903    02/01/2014
        29350          137242       538261521     17/02/2014
    

    Or , add one (or more) of them to the group - but doing so, the grouping will be different. Ex.:

    group by nr_prpt_vcld, nrcvn_srvc
    
        nr_prpt_vcld   nrcvn_srvc     cd_cli_srvc ult_acl_atv  foo           tupla de agrupamento
    
        23157           140594        [Achatado]  31/12/2013   [Achatado] => (23157,140594)
    
        23157           147914        [Achatado]  01/01/2014   [Achatado] => (23157,147914)
    
        23157           153120        [Achatado]  02/01/2014   [Achatado] => (23157,153120)
    
        27175           140676        [Achatado]  01/07/2014   [Achatado] => (27175,140676)
    
        27175           141805        [Achatado]  02/07/2014   [Achatado] => (27175,141805)
    
        27175           150291        [Achatado]  03/07/2014   [Achatado] => (27175,150291)
    
        29350           137242        [Achatado]  16/02/2014   [Achatado] => (29350,137242)
    
        29350           145502        [Achatado]  17/02/2014   [Achatado] => (29350,145502)
    

    Etc. If you add other fields in the group by clause, it will group by tuple fields - and each different combination will generate a grouping, which will later be flattened and may have its aggregator values.

    Typically, a query like the one you did would generate an error in SQL. But apparently , what SAS did was to implicitly add the other fields in the grouping condition, so that their query was done as if group by was like this:

    group by nr_prpt_vcld, nrcvn_srvc, cd_cli_srvc
    

    So only rows that have these 3 identical values would be part of the same grouping. If there is no more than one row with these conditions, the result is as if it had not grouped anything, simply returned each row separately ...

    Workaround

    I do not know if what you want to do is possible within the group by clause itself: get the value of a column corresponding to the maximum of other column. A workaround would use two subqueries, one for each column; in each of them, the correct column value corresponding to the maximum date value is taken, for a specific code:

    select t2.nrcvn_srvc
        from tabela_base t2
        where t2.nr_prpt_vcld = ### um código fixo ###
        order by t2.ult_acl_atv desc
        limit 1
    

    This is done in the select section - when the grouping has already been done and the row is being assembled with the results. The rows are sorted in descending order by date (so the first one is the most recent), and the amount of results is limited to 1 , so only the largest of them will be returned. Get the field you want and use it as the column value:

    select  t1.nr_prpt_vcld       as nr_prpt_vcld,
            (select ...)          as nrcvn_srvc,
            (select ...)          as cd_cli_srvc,
            (max(t1.ult_acl_atv)) as ult_acl_atv
    

    Final result (note that I gave the original table a nickname):

    select  t1.nr_prpt_vcld                               as nr_prpt_vcld,
            (select t2.nrcvn_srvc
                 from tabela_base t2
                 where t2.nr_prpt_vcld = t1.nr_prpt_vcld
                 order by t2.ult_acl_atv desc
                 limit 1
            )                                             as nrcvn_srvc,
            (select t2.cd_cli_srvc
                 from tabela_base t2
                 where t2.nr_prpt_vcld = t1.nr_prpt_vcld
                 order by t2.ult_acl_atv desc
                 limit 1
            )                                             as cd_cli_srvc,
            (max(t1.ult_acl_atv))                         as ult_acl_atv
        from    tabela_base t1
        group by t1.nr_prpt_vcld
    
        
    31.07.2014 / 20:02