Hangs with large files #14

@baranberkay96

  • occupationcoder version: 0.2.0
  • Python version: Python 3.9.5
  • Operating System: MacOS Big Sur Version 11.2.3

Description

Output of pip3 freeze:

alabaster==0.7.12
appdirs==1.4.4
Babel==2.9.1
beautifulsoup4==4.9.3
bleach==3.3.0
bump2version==1.0.1
certifi==2020.12.5
chardet==4.0.0
click==8.0.1
cloudpickle==1.6.0
colorama==0.4.4
coverage==5.5
dask==2021.5.0
distlib==0.3.1
docutils==0.16
filelock==3.0.12
flake8==3.9.0
fsspec==2021.5.0
idna==2.10
imagesize==1.2.0
importlib-metadata==4.3.0
Jinja2==3.0.1
joblib==1.0.1
keyring==23.0.1
locket==0.2.1
MarkupSafe==2.0.1
mccabe==0.6.1
nltk==3.6.2
numpy==1.20.3
occupationcoder @ file:///Users/baranberkaybarakcin/Documents/learning/occupation-coder/occupationcoder/dist/occupationcoder-0.2.0.tar.gz
packaging==20.9
pandas==1.2.4
partd==1.2.0
pkginfo==1.7.0
pluggy==0.13.1
py==1.10.0
pycodestyle==2.7.0
pyflakes==2.3.1
Pygments==2.9.0
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2021.1
PyYAML==5.4.1
readme-renderer==29.0
regex==2021.4.4
requests==2.25.1
requests-toolbelt==0.9.1
rfc3986==1.5.0
scikit-learn==0.24.2
scipy==1.6.3
six==1.16.0
snowballstemmer==2.1.0
soupsieve==2.2.1
Sphinx==3.5.4
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
threadpoolctl==2.1.0
toml==0.10.2
toolz==0.11.1
tox==3.23.0
tqdm==4.61.0
twine==3.4.1
urllib3==1.26.5
virtualenv==20.4.7
watchdog==2.0.2
webencodings==0.5.1
zipp==3.4.1

We run this snippet:

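import pandas as pd
from occupationcoder.coder import coder

# The coder is created at module level, as in the run that hangs
myCoder = coder.Coder()

if __name__ == '__main__':
    df = pd.read_csv('construction.csv')
    # Every record in this file belongs to the same sector
    df['job_sector'] = "Construction & Property"
    df = myCoder.codedataframe(df)
    # Print explicitly: a bare df.head() shows nothing when run as a script
    print(df.head())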

construction.csv is a relatively large file, with approximately 40,000 rows.

When we run the code on construction.csv, it hangs and never finishes. I suspect this is related to the number of threads dask uses, but I couldn't find a solution. I'd be glad if you could help. Have a nice day :)
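One way to narrow this down is to code the file in slices and time each slice, which shows whether the run is truly stuck or only slow, and roughly which rows are involved. The following is a minimal sketch, not part of occupationcoder: chunk_bounds and code_in_chunks are hypothetical helper names, the chunk size of 1000 is arbitrary, and it assumes codedataframe accepts any slice of the DataFrame and returns a DataFrame.

```python
import time


def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row offsets that cover n_rows in steps of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def code_in_chunks(df, code_fn, chunk_size=1000):
    """Apply code_fn to consecutive slices of df and print the time per slice.

    code_fn is assumed to take a DataFrame and return a DataFrame,
    for example myCoder.codedataframe from the snippet above.
    """
    import pandas as pd  # imported here so chunk_bounds has no dependencies

    parts = []
    for start, stop in chunk_bounds(len(df), chunk_size):
        began = time.perf_counter()
        # Copy the slice so code_fn can add columns without touching df
        parts.append(code_fn(df.iloc[start:stop].copy()))
        elapsed = time.perf_counter() - began
        print(f"rows {start}-{stop - 1}: {elapsed:.1f}s", flush=True)
    # pd.concat raises on an empty list, so return the input unchanged
    return pd.concat(parts) if parts else df
```

In the snippet above, this would replace the direct call with df = code_in_chunks(df, myCoder.codedataframe). If every slice finishes but the total time is very long, the problem is throughput rather than a deadlock. To test the dask suspicion separately, dask.config.set(scheduler="synchronous") forces dask's single-threaded scheduler, assuming occupationcoder does not pass a scheduler of its own.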
