#Get Comfortable with List Comprehensions and Regex

##List Comprehensions

List comprehensions are based on set builder notation:

Given [a for b in list]:
a = the output function (default is append)
b = the variable
c = the input set

Examples:

1
2
3
4
5
6
7
8
9
10
11
12
13
	list = []
	for i in range(100):
		list.append(i)
	
	# becomes:
	list = [i for i in range(100)]

	list = [1, 2, 3, 4]
	for n in list:
		double_list.append(n*2)
	
	# becomes:
	double_list = [x*2 for x in list]

Add an if clause:

1
2
3
4
5
6
7
	double_list = []
	for n in list:
		if (n > 1):
			double_list.append(n * 2)
	
	# becomes:
	double_list = [n*2 for n in list if (n > 1)]

Loops of loops:

1
2
3
4
5
6
7
8
9
10
	list_one = [a, b]
	list_two = [1, 2]
	list_three = []
	for l in list_one:
		for m in list_two:
			list_three.append(l, m)
	== [(a, 1), (a, 2), (b, 1), (b, 2)]
	
	As a list comprehension:
	list_three = [(l, m) for l in list_one for m in list_two]   

You probably don’t want to get any fancier than that with your list comprehensions…no matter the advantages listed below. Otherwise, your code will become too difficult to read.

###So why use list comprehensions?

List comprehensions aren’t just a fancy way of abbreviating code–they also have some tangible benefits:

i. Conciseness

ii. Provide a kind of introduction to using generators (we’ll address these in a later workshop)

iii. They have an impact on the order of magnitude (O(n)), a shorthand for calculating the relative performance of different Python data structures and algorithms.

For example, a list comprehension

1
[item for item in items]

is up to twice as fast as the equivalent

1
for item in items: append(item)


###Big-O values for some common python list functions (from lightest–heaviest):

###And for dictionaries:

For more on this topic, see http://interactivepython.org’s Data and Algorithms course.


##Regular Expressions

Regular expressions are a tool for matching text patterns in strings of varying length and content. Regexes give you the flexibility to run searches on/match patterns beyond literal fixed characters.

The Python module that provides Regex support is called “re”. Search with the re.search() method:

1
text_to_match = re.search(pattern, string)

…the method will take the pattern you give it and will search against the string you’ve passed in.

Search patterns are often appended by an “r”:

1
re.search(r'pattern', string)

…to denote that the string is “raw”, meaning that nothing in the string should be escaped. You should generally use the “r” in your Regexes to avoid parsing issues.

Searching and matching works by looking for the complete pattern in each string; running through from start to finish; and stopping as soon as a match is found, returning the match object (or None if not found).

###Basic patterns:

Square Brackets

Square brackets can be used to search for one character from amongst a group (ex. [xyz] matches x or y or z), or to exclude characters (if beginning with a “^”; [^ xyz] means any character but x, y, or z). Square brackets also have slightly different rules, in that a period inside them == a literal period (and not a meta-character); a “-“ dash at the beginning or end is treated literally; and backslash escapes are not allowed.

For example, you might use square brackets to search for an email address:

1
email = re.search(r'[\w.]+@[\w.][com|org|net|edu]', string)

…or, since email addresses can get notoriously convoluted, just check for an @ and a “.”:

1
email = re.search(r'/.+@.+\..+/i', string)

Let’s break this last one down:

  1. r = raw
  2. / = delimits the regex (so that the “.+” wildcard that appears next won’t search an infinite number of characters
  3. .+ = any number of any kind of character for email name
  4. \@ = literal \@
  5. .+ = any number of characters for email service
  6. . = escaped period==literal period
  7. .+ = any characters of any kind (for com, edu, net, org, etc.)
  8. /i = the whole regex is case-insensitive

###Other methods besides search():

group() : won’t change matching behavior, but gives you the ability to extract the pattern captured inside as a logical group ex:

1
email = re.search(r'(/.+)@(.+\..+)/1', string)

would give you the ability to extract either the email name or email host from the pattern match, where:

1
2
3
4
5
6
7
	print match.group() = whole email address

	print match.group(1) = email name

	print match.group(2) = email host

	# note that the outer parentheses are treated as the first, default grouping   

.

findall() : a very useful re module function that finds all of the matches and returns them as a list of strings

ex:

1
emails = re.findall(r'/.+@.+\..+/i', string)

Then, you can iterate over the list to do something with the emails, like:

1
2
for email in emails:
	print email

You can even use this function to find all occurrences of a pattern in a file:

1
2
f = open('file.txt', 'r')
strings = re.findall(r'text pattern', f.read())

You can combine re.findall() and groups() together for even more granular data manipulation

sub() : is another useful function in the re module that allows you to substitute in values:

ex:

1
re.sub(pattern, replacement, string)
will return a new string with each pattern match replaced by the replacement value

…see also the workshop notes and code files and exercises (link coming soon)