2Answers
  • 6
name

A PHP Error was encountered

Severity: Notice

Message: Undefined index: userid

Filename: views/question.php

Line Number: 191

Backtrace:

File: /home/prodcxja/public_html/questions/application/views/question.php
Line: 191
Function: _error_handler

File: /home/prodcxja/public_html/questions/application/controllers/Questions.php
Line: 433
Function: view

File: /home/prodcxja/public_html/questions/index.php
Line: 315
Function: require_once

name Punditsdkoslkdosdkoskdo

How to subclass pandas DataFrame?

Subclassing pandas classes seems a common need but I could not find references on the subject. (It seems that pandas developers are still working on it: https://github.com/pydata/pandas/issues/60).

There are some SO threads on the subject, but I am hoping that someone here can provide a more systematic account on currently the best way to subclass pandas.DataFrame that satisfies two, I think, general requirements:

import numpy as np
import pandas as pd

class MyDF(pd.DataFrame):
    # how to subclass pandas DataFrame?
    pass

mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print type(mydf)  # <class '__main__.MyDF'>

# Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
# should produce instances of MyDF.
mydf_sub = mydf[['A','C']]
print type(mydf_sub)  # <class 'pandas.core.frame.DataFrame'>

# Requirement 2: Attributes attached to instances of MyDF, when calling standard 
# methods of DataFrame, should still attach to the output.
mydf.myattr = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print hasattr(mydf_cp1, 'myattr')  # False
print hasattr(mydf_cp2, 'myattr')  # False

And is there any significant differences for subclassing pandas.Series? Thank you.

There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.

The guide is available here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas

The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py

As in HYRY's answer, it seems there are two things you're trying to accomplish:

  1. When calling methods on an instance of your class, return instances of the correct type (your type). For this, you can just add the _constructor property which should return your type.
  2. Adding attributes which will be attached to copies of your object. To do this, you need to store the names of these attributes in a list, as the special _metadata attribute.

Here's an example:

class SubclassedDataFrame(DataFrame):
    _metadata = ['added_property']
    added_property = 1  # This will be passed to copies

    @property
    def _constructor(self):
        return SubclassedDataFrame
  • 26
Reply Report

For Requirement 1, just define _constructor:

import pandas as pd
import numpy as np

class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDF


mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print type(mydf)

mydf_sub = mydf[['A','C']]
print type(mydf_sub)

I think there is no simple solution for Requirement 2, I think you need define __init__, copy, or do something in _constructor, for example:

import pandas as pd
import numpy as np

class MyDF(pd.DataFrame):
    _attributes_ = "myattr1,myattr2"

    def __init__(self, *args, **kw):
        super(MyDF, self).__init__(*args, **kw)
        if len(args) == 1 and isinstance(args[0], MyDF):
            args[0]._copy_attrs(self)

    def _copy_attrs(self, df):
        for attr in self._attributes_.split(","):
            df.__dict__[attr] = getattr(self, attr, None)

    @property
    def _constructor(self):
        def f(*args, **kw):
            df = MyDF(*args, **kw)
            self._copy_attrs(df)
            return df
        return f

mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print type(mydf)

mydf_sub = mydf[['A','C']]
print type(mydf_sub)

mydf.myattr1 = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print mydf_cp1.myattr1, mydf_cp2.myattr1
  • 14
Reply Report
      • 1
    • It seems to me that you'd often what to have a corresponding subclass of Series at the same time (i.e. have them MyDF and MyS link in some way so e.g. mydf.sum() returns a MyS...)

Trending Tags