Saturday, November 3, 2018

Data Science with Functional Programming in Python

Data Science with Functional Programming

Overview

In this article, I’ll show a functional programming approach to data science with Python. With a functional approach, some pre-processing steps can be written concisely. Especially in situations where you are reluctant to use the pandas library, this kind of approach can make the code more readable.




Library

For the functional programming approach in Python, several libraries are available. In this article, I’ll use one of them, funcy. You can install it simply with pip.
pip install funcy
funcy has official documentation.
If you prefer, you can check just its cheat sheet.

Let’s try!

To get a feel for how to use funcy, I’ll show some examples.
As data, I’ll use the text below, taken from one of my blog articles. The original text was a bit too long, so this is a shortened version, and the content itself doesn’t carry any meaning.

text = """The image is from https://arxiv.org/abs/1512.03385. \n
But, the way of being merged in Dense block is different from the one in Residual module. In Dense block, it is by concatenation that the data which skipped layers is merged to the layer's input. \n
improved flow of information and gradients, making it easy to train the model\n
working well with few number of parameters\n
"""

The text above consists of several lines. Our goal is to extract features from them. To do that, first split the text into lines.

data = text.split('\n')

The text is now split. However, the list contains empty elements, and some lines end with a space.

data
['The image is from https://arxiv.org/abs/1512.03385. ',
 '',
 "But, the way of being merged in Dense block is different from the one in Residual module. In Dense block, it is by concatenation that the data which skipped layers is merged to the layer's input. ",
 '',
 'improved flow of information and gradients, making it easy to train the model',
 '',
 'working well with few number of parameters',
 '',
 '']

From now on, I’ll clean the data with a functional approach.

import funcy

As we saw, the data has some empty elements, which we don’t need.

The filter() function from funcy extracts the elements that satisfy a condition.

list(funcy.filter(lambda x: x != '', data))

In the case above, funcy.filter() picks up the elements that are not empty. On Python 3 it returns an iterator, so the result is wrapped in list(). The output is as below.

['The image is from https://arxiv.org/abs/1512.03385. ',
 "But, the way of being merged in Dense block is different from the one in Residual module. In Dense block, it is by concatenation that the data which skipped layers is merged to the layer's input. ",
 'improved flow of information and gradients, making it easy to train the model',
 'working well with few number of parameters']
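For comparison, here is a minimal sketch of the same selection in plain Python, without funcy. The sample list `lines` is made up for illustration. The built-in filter() is lazy in Python 3, which is why it is wrapped in list(), just as with funcy.filter() above.

```python
# Made-up sample data: two real lines and two empty elements.
lines = ['first line ', '', 'second line', '']

# filter() returns a lazy iterator; list() materializes it.
non_empty = list(filter(lambda x: x != '', lines))

# The same selection written as a list comprehension.
non_empty_comprehension = [x for x in lines if x != '']

print(non_empty)  # ['first line ', 'second line']
```

Note that the trailing space of 'first line ' is still there, since filtering only selects elements and does not change them.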

Next, some lines of the data end with a space, which we don’t want. With funcy, we can strip those spaces with the code below.

funcy.walk(lambda x: x.strip(), data)

With this, the trailing space is removed from each line. The empty elements are still there because funcy.walk() was applied to the original data.

['The image is from https://arxiv.org/abs/1512.03385.',
 '',
 "But, the way of being merged in Dense block is different from the one in Residual module. In Dense block, it is by concatenation that the data which skipped layers is merged to the layer's input.",
 '',
 'improved flow of information and gradients, making it easy to train the model',
 '',
 'working well with few number of parameters',
 '',
 '']
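What funcy.walk() does to a list can be pictured with the small sketch below. simple_walk is a hypothetical helper written only for this illustration and is not part of funcy. It covers simple containers such as list, tuple and set, while the real funcy.walk() also handles other collection types such as dicts.

```python
def simple_walk(func, coll):
    """Apply func to every element and rebuild the same container type."""
    # type(coll) is list for a list, tuple for a tuple, and so on.
    return type(coll)(map(func, coll))


stripped_list = simple_walk(lambda x: x.strip(), ['a ', '', ' b'])
stripped_tuple = simple_walk(lambda x: x.strip(), ('a ', ' b'))

print(stripped_list)   # ['a', '', 'b']
print(stripped_tuple)  # ('a', 'b')
```

The point of the sketch is that the result has the same type as the input, which is the difference from a plain map().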

In those examples, I used lambda expressions (anonymous functions). Of course, they can be replaced with named functions.

def original_strip(sentence):
    return sentence.strip()

funcy.walk(original_strip, data)

So far, we have seen funcy.filter() and funcy.walk(). I think those two, especially funcy.walk(), will be the most frequently used.

Anyway, we now know how to drop empty elements and how to strip trailing spaces from elements. Doing both in one line looks like the following.

list(funcy.walk(original_strip, funcy.filter(lambda x: x != '', data)))

The output is below. You can see that there are no empty elements and that no line ends with a space.

['The image is from https://arxiv.org/abs/1512.03385.',
 "But, the way of being merged in Dense block is different from the one in Residual module. In Dense block, it is by concatenation that the data which skipped layers is merged to the layer's input.",
 'improved flow of information and gradients, making it easy to train the model',
 'working well with few number of parameters']

After this cleaning, suppose you want to convert each line into a list of its words. Adding a function for that, the code becomes the following.

list(funcy.walk(lambda x: x.split(' '), funcy.walk(original_strip, funcy.filter(lambda x: x != '', data))))

Of course, this works, but it is visually noisy. Especially when you need many sequential steps, code written this way becomes very hard to read. There are some solutions.
One is to do the processing step by step.

without_empty = funcy.filter(lambda x: x != '', data)
without_space_on_end = funcy.walk(original_strip, without_empty)
list(funcy.walk(lambda x: x.split(' '), without_space_on_end))

funcy also has a function for composing functions. In this case, original_strip() and lambda x: x.split(' ') are applied sequentially. With funcy.rcompose(), we can combine them into one function that applies them from left to right.

do_everything = funcy.rcompose(original_strip, lambda x: x.split(' '))

list(funcy.walk(do_everything, funcy.filter(lambda x: x != '', data)))
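To make the left-to-right order concrete, the idea behind this kind of composition can be sketched with functools.reduce from the standard library. simple_rcompose is a hypothetical name used only for this illustration, and the sketch is an assumption about the behavior, not funcy's actual implementation.

```python
from functools import reduce


def simple_rcompose(*funcs):
    """Return a function that applies funcs one after another, left to right."""
    def composed(value):
        # Feed the result of each function into the next one.
        return reduce(lambda acc, f: f(acc), funcs, value)
    return composed


strip_then_split = simple_rcompose(lambda x: x.strip(), lambda x: x.split(' '))
words = strip_then_split('  few number of parameters ')

print(words)  # ['few', 'number', 'of', 'parameters']
```

If the two functions were applied in the opposite order, strip() would be called on a list and fail, so the order of the arguments matters.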

Using what we have seen so far, let’s clean the data.

cleaning = funcy.rcompose(lambda x: x.strip(' .'), lambda x: x.split(' '))
cleaned_data = list(funcy.walk(cleaning, funcy.filter(lambda x: x != '', data)))

cleaned_data looks like this.

[['The', 'image', 'is', 'from', 'https://arxiv.org/abs/1512.03385'],
 ['But,',
  'the',
  'way',
  'of',
  'being',
  'merged',
  'in',
  'Dense',
  'block',
  'is',
  'different',
  'from',
  'the',
  'one',
  'in',
  'Residual',
  'module.',
  'In',
  'Dense',
  'block,',
  'it',
  'is',
  'by',
  'concatenation',
  'that',
  'the',
  'data',
  'which',
  'skipped',
  'layers',
  'is',
  'merged',
  'to',
  'the',
  "layer's",
  'input'],
 ['improved',
  'flow',
  'of',
  'information',
  'and',
  'gradients,',
  'making',
  'it',
  'easy',
  'to',
  'train',
  'the',
  'model'],
 ['working', 'well', 'with', 'few', 'number', 'of', 'parameters']]

Now imagine that, after building this nested list, you notice that you need to make all the letters lowercase. But the data is a nested list, so a single funcy.walk() only reaches the inner lists, not the words.

funcy.walk() walks over the elements of an iterable. So, by nesting funcy.walk(), we can process the nested elements.

really_cleaned_data = list(funcy.walk(lambda x: funcy.walk(lambda y: y.lower(), x), cleaned_data))

There are some other ways to do this, which I’ll show later in the appendix.
Anyway, we now have the cleaned data. Let’s make features from it.

length = funcy.walk(len, really_cleaned_data)
is_there_of = funcy.walk(lambda x: 1 if 'of' in x else 0, really_cleaned_data)

I simply made two features. length is the number of words in each data point. is_there_of indicates whether the data point contains the word ‘of’. The features look like this.

print('length: {}'.format(length))
print('is_there_of: {}'.format(is_there_of))
length: [5, 36, 13, 7]
is_there_of: [0, 1, 1, 1]

As you remember, funcy.rcompose() combines functions into one sequential process. If we instead want to apply several functions to the same data and get the output of each, we can use funcy.juxt(). In the code below, all_function is built from two functions, len() and lambda x: 1 if 'of' in x else 0, and applied to the data. The output is a list of generator objects.

all_function = funcy.juxt(len, lambda x: 1 if 'of' in x else 0)

funcy.walk(all_function, really_cleaned_data)
[<generator object juxt.<locals>.<lambda>.<locals>.<genexpr> at 0x1054c63b8>,
 <generator object juxt.<locals>.<lambda>.<locals>.<genexpr> at 0x1054c6518>,
 <generator object juxt.<locals>.<lambda>.<locals>.<genexpr> at 0x1054c6360>,
 <generator object juxt.<locals>.<lambda>.<locals>.<genexpr> at 0x1054c64c0>]

To make the output readable, I appended tuple() after all_function() with funcy.rcompose().

all_function_tuple = funcy.rcompose(all_function, tuple)
funcy.walk(all_function_tuple, really_cleaned_data)
[(5, 0), (36, 1), (13, 1), (7, 1)]

We can see the output of each function.
By combining these funcy functions, we can clean the data and extract features.
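The idea of applying several functions to the same input can be sketched in plain Python as follows. simple_juxt is a hypothetical helper for illustration only. Unlike funcy.juxt(), which returned generator objects in the output above, this sketch builds the tuple directly. The sample word lists are made up.

```python
def simple_juxt(*funcs):
    """Return a function that calls every func on the same value."""
    def apply_all(value):
        # One result per function, in the order the functions were given.
        return tuple(f(value) for f in funcs)
    return apply_all


extract = simple_juxt(len, lambda words: 1 if 'of' in words else 0)

# Made-up sample data in the same shape as the cleaned data.
samples = [['flow', 'of', 'information'], ['the', 'image']]
features = [extract(words) for words in samples]

print(features)  # [(3, 1), (2, 0)]
```

Each tuple holds one value per feature, so adding a new feature only means passing one more function.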

Appendix

As an appendix, I’ll touch on two topics: currying and dealing with nested iterables.
First, currying. I won’t go into detail here; I’ll just show how to do it with funcy.
Let's look at the example below. We have simple data and a function. Unlike the functions used so far, which all accept just one argument, this one accepts two.

easy_data = ['a', 'b', 'c']

def repeat(letter, times):
    return letter * times


funcy.walk() expects a function that takes just one argument. So, if we use repeat() as before, it raises an error.

funcy.walk(repeat, easy_data)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-67d377e71732> in <module>
----> 1 funcy.walk(repeat, easy_data)

~/.pyenv/versions/3.6.5/envs/blog/lib/python3.6/site-packages/funcy/colls.py in walk(f, coll)
    137     """Walks the collection transforming its elements with f.
    138        Same as map, but preserves coll type."""
--> 139     return _factory(coll)(xmap(f, iteritems(coll)))
    140 
    141 def walk_keys(f, coll):

TypeError: repeat() missing 1 required positional argument: 'times'

To use this kind of function, we need to derive a one-argument function from the original by currying. funcy supplies some functions for that. In the case below, curried_repeat(times=2) returns a function that already holds the value of times.

curried_repeat = funcy.autocurry(repeat)
funcy.walk(curried_repeat(times=2), easy_data)
['aa', 'bb', 'cc']
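When full currying is not needed, fixing a single argument can also be done with functools.partial from the standard library. This is an alternative technique shown for comparison, not what funcy.autocurry() does internally.

```python
from functools import partial


def repeat(letter, times):
    return letter * times


# partial() returns a new function with times already fixed to 2.
repeat_twice = partial(repeat, times=2)

result = list(map(repeat_twice, ['a', 'b', 'c']))
print(result)  # ['aa', 'bb', 'cc']
```

The difference is that partial() fixes the arguments in one step, while a curried function can receive its arguments over several calls.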

Second, dealing with nested items. Let’s assume that we have the following data and want to make every element of the lists uppercase.

easy_nested = [['a', 'b'], ['c', 'd']]

If we don’t care about the visual complexity, we can use funcy.walk() inside funcy.walk().

funcy.walk(lambda x: funcy.walk(lambda y: y.upper(), x), easy_nested)
[['A', 'B'], ['C', 'D']]
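For a nesting depth of two, the same result can also be written as a nested list comprehension in plain Python. This is a small sketch for comparison and does not use funcy.

```python
easy_nested = [['a', 'b'], ['c', 'd']]

# The outer comprehension walks the inner lists, the inner one walks the letters.
upper_nested = [[letter.upper() for letter in inner] for inner in easy_nested]

print(upper_nested)  # [['A', 'B'], ['C', 'D']]
```
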

As another solution, we can use a decorator.

def adapt_to_list(func):
    # Wrap func so that it is applied to every element of a collection.
    def wrapper(coll):
        return funcy.walk(func, coll)

    return wrapper

With this, we can adapt the function to lists without visual noise.

@adapt_to_list
def make_it_upper(target):
    return target.upper()

funcy.walk(make_it_upper, easy_nested)
[['A', 'B'], ['C', 'D']]
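One caveat of a hand-written decorator is that the wrapper hides the name and docstring of the decorated function. A possible refinement is functools.wraps from the standard library. The sketch below uses a plain list comprehension instead of funcy.walk() so that it runs without extra dependencies; apart from that, it follows the same idea.

```python
from functools import wraps


def adapt_to_list(func):
    @wraps(func)  # keep the original function's name and docstring
    def wrapper(items):
        # Apply func to every element of the given list.
        return [func(item) for item in items]
    return wrapper


@adapt_to_list
def make_it_upper(target):
    """Return the uppercase version of target."""
    return target.upper()


result = [make_it_upper(inner) for inner in [['a', 'b'], ['c', 'd']]]
print(result)                  # [['A', 'B'], ['C', 'D']]
print(make_it_upper.__name__)  # make_it_upper
```
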