GROUP BY in Python

When it comes to a certain class of data problem, my mind reaches for SQL… which is a problem when the data’s in Python. Obviously I could create an in-memory sqlite database just to store the data and then retrieve it with SQL, but that would be mild overkill. One such problem is grouping data by, say, the first letter. I’ve known about the itertools.groupby function for a while, but for some reason whenever I came to look at it, it never quite clicked. The thing to remember is that groupby only groups adjacent items, so the data has to be sorted by the grouping key first. Having now made the breakthrough, I’m noting it here for future reference:

SELECT
  LEFT (words.word, 1),
  COUNT (*)
FROM
(
  SELECT DISTINCT
    word = LOWER (w.word)
  FROM
    words AS w
) AS words
WHERE
  LEN (words.word) >= 2
GROUP BY
  LEFT (words.word, 1)

translates to

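import os, sys
import itertools
import operator
import re

# Key function: the first character of each word
first_letter = operator.itemgetter (0)

# LICENSE.txt lives under sys.prefix on typical Windows installs;
# point this at any text file otherwise
with open (os.path.join (sys.prefix, "LICENSE.txt")) as f:
  text = f.read ()

# Distinct lowercased words of two or more characters
words = set (w.lower () for w in re.findall (r"\w{2,}", text))
# groupby only merges adjacent items, so sort before grouping
groups = itertools.groupby (sorted (words), first_letter)

for k, v in groups:
  print (k, "=>", len (list (v)))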
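
For a version that doesn’t depend on a file being present, here is a minimal sketch along the same lines. The sample word list and the count_by_first_letter name are made up for illustration. It also shows what happens if you forget the sort: groupby starts a new group every time the key changes, so the same letter can turn up more than once.

```python
import itertools
import operator

first_letter = operator.itemgetter (0)

def count_by_first_letter (words):
  """Return [(letter, count), ...] for the distinct words, like GROUP BY + COUNT (*)."""
  # Sort first: groupby only merges runs of adjacent items with equal keys
  ordered = sorted (set (words))
  return [(k, len (list (v))) for k, v in itertools.groupby (ordered, first_letter)]

# Hypothetical sample data, not taken from LICENSE.txt
sample = ["apple", "bob", "avocado", "banana", "cherry", "apple"]

print (count_by_first_letter (sample))

# Without the sort, the same key can appear in several separate groups
unsorted_keys = [k for k, v in itertools.groupby (sample, first_letter)]
print (unsorted_keys)
```

The set call plays the part of DISTINCT and the sort stands in for the ordering a database does behind the scenes when it groups rows.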

4 Comments so far »

  1. Abe said,

    Wrote on April 6, 2009 @ 8:32 pm

    Please provide full content of your posts to planet.python.org.

  2. tim said,

    Wrote on April 7, 2009 @ 1:16 pm

    @Abe: I’ve done that, Abe. If it feels too intrusive (given my prolix style) I may turn it off again later :).

  3. http://gedmin.as/ said,

    Wrote on April 7, 2009 @ 4:11 pm

    Spaces in front of opening parentheses look weird in Python code (and PEP-8 recommends against them).

    I’d like to second the request to provide full entries on Planet Python.

  4. tim said,

    Wrote on April 7, 2009 @ 5:31 pm

    @gedmin.as
    Thanks for the comments. I’ve switched to full posts on planet.python now. (Doesn’t seem to be retrospective, tho’. Not sure there’s anything I can do about that).

    re opening brackets: this seems to be one of those things you either like or don’t. I’ve had this discussion with a dozen people since I’ve been posting code. (And I’ve posted a lot of code). Frankly, *not* having spaces looks weird to me. PEP 8 applies to Python core code, not to random code someone wants to post on the internet except as a general “if you’ve no other plan, here’s a guideline”.

    If I contribute patches to Python, or to any project, I’ll naturally follow their style guidelines. If I post my own code, I’ll naturally follow the style I’ve used across several languages over more than twenty years of writing code. Feel free to disagree.

    FWIW I have *occasionally* switched styles when I felt it politic to do so. My WMI tutorial, for example, removes spaces before brackets. Feels cramped to me, but someone asked nicely so I decided to go for it.
