Skip to content

issue-62 [DDFG] Complete MAR missingness generation#87

Open
jeanielim wants to merge 2 commits into
eltonlaw:masterfrom
jeanielim:master
Open

issue-62 [DDFG] Complete MAR missingness generation#87
jeanielim wants to merge 2 commits into
eltonlaw:masterfrom
jeanielim:master

Conversation

@jeanielim
Copy link
Copy Markdown

naive implementation of corruptor.mar() & test case

Copy link
Copy Markdown
Owner

@eltonlaw eltonlaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test looks good, added a few comments.

model = sm.OLS(y, x).fit()

# Use model to get predictions for where data is missing
corrupt_data = pd.DataFrame(corrupt_data)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not averse to adding more dependencies to the project if required, but I don't think Pandas is needed in this case. We have a python implementation to get the indices of all NaN's take a look at impyute.util.find_null

x = np.array(complete_data_mar[:, 0]).reshape((-1, 1))
y = np.array(complete_data_mar[:, 1])
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the comment on the use of pandas, statsmodel isn't a dependency, though we do use scikit-learn which has an OLS implementation as well.

import statsmodels.api as sm

def test_corrupt():
# Generate data
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add all this data generation stuff to conftest.py? Check out pytest fixtures

from impyute.dataset.corrupt import Corruptor
import statsmodels.api as sm

def test_corrupt():
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstring please

full_model_coef = [full_model.params[1]-full_model.bse[1], full_model.params[1]+full_model.bse[1]]
imp_model_coef = imp_model.params[1]

assert full_model_coef[0] <= imp_model_coef <= full_model_coef[1], "MAR coefficient not within true values"
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a pretty neat test!

pass
output = self.data.copy()
nrow, ncol = self.data.shape
for i in range(0, nrow):
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a blocker, since I guess it works but two nested for loops is pretttttty messy, You can achieve the same thing if you put the percentile & assign stuff in a separate function and do a numpy.apply_along_axis

In [4]: np.apply_along_axis(lambda x: x+1, -1, np.array([[1, 2, 3], [4, 5, 6]]))
Out[4]: 
array([[2, 3, 4],
       [5, 6, 7]])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants