Monday, March 11, 2013

The 3 things you absolutely need to know about regexp

1. By default all quantifiers are greedy, which means that they match as much text as they can! However, if you want to match several separate instances of the same pattern, add the ? modifier after the quantifier to make it lazy (for example .*? instead of .*).
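A small sketch of the difference, using a made-up sample string (the text and the tag names are only for illustration):

```python
import re

# Hypothetical sample text with two occurrences of the same pattern.
text = "<b>first</b> and <b>second</b>"

# Greedy: .* runs on to the LAST </b>, so both tags end up in one match.
greedy = re.findall(r"<b>.*</b>", text)

# Lazy: .*? stops at the FIRST </b>, so each tag is matched separately.
lazy = re.findall(r"<b>.*?</b>", text)

print(greedy)  # ['<b>first</b> and <b>second</b>']
print(lazy)    # ['<b>first</b>', '<b>second</b>']
```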


2. By default the dot (.) does not match newlines in standard regexps, so you can:

either remove the newlines with

re.findall(r'\\begin\{itemize\}.*?\\end\{itemize\}', page.replace('\n', ''))

or let the dot match newlines too with the re.DOTALL flag

re.findall(r'\\begin\{itemize\}.*?\\end\{itemize\}', page, re.DOTALL)
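To see both options side by side, here is a minimal sketch on an invented LaTeX snippet (the contents of page are an assumption, not a real document). The pattern is written as a raw string with escaped backslashes, since otherwise \b would be read as a backspace or a word boundary rather than a literal backslash followed by b.

```python
import re

# Hypothetical page content: two itemize environments spread over several lines.
page = r"""intro
\begin{itemize}
\item one
\end{itemize}
middle
\begin{itemize}
\item two
\end{itemize}
"""

pattern = r"\\begin\{itemize\}.*?\\end\{itemize\}"

# Without any help the dot stops at each newline, so nothing matches.
no_flag = re.findall(pattern, page)

# Option 1: strip the newlines first (the matches lose their line breaks).
flattened = re.findall(pattern, page.replace("\n", ""))

# Option 2: keep the text intact and let the dot match newlines as well.
dotall = re.findall(pattern, page, re.DOTALL)

print(len(no_flag), len(flattened), len(dotall))  # 0 2 2
```

The re.DOTALL version is probably the safer default, because removing newlines also glues together words that were only separated by a line break.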

3. To get a number (float or integer) you can use ? again, but this time to make a character optional:
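The original example is not available, so the following is only one possible sketch. It assumes plain decimal numbers (no exponents or thousands separators) and uses ? twice: once to make the minus sign optional, and once to make the whole fractional part optional.

```python
import re

# Hypothetical sample text mixing integers and floats.
text = "width 42, ratio 3.14, offset -7"

# -?           optional minus sign
# \d+          one or more digits
# (?:\.\d+)?   optional fractional part: a dot followed by digits
number = r"-?\d+(?:\.\d+)?"

found = re.findall(number, text)
values = [float(s) for s in found]

print(found)   # ['42', '3.14', '-7']
print(values)  # [42.0, 3.14, -7.0]
```

The group is written as (?:...) so that re.findall returns the full match instead of only the captured fractional part.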


