BUG: String Columns with only numeric values are converted to integer in read_file() #3431

Open
1 of 3 tasks
SeppBerkner opened this issue Sep 27, 2024 · 7 comments

Comments

@SeppBerkner

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of geopandas.

  • (optional) I have confirmed this bug exists on the main branch of geopandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

from owslib.wfs import WebFeatureService
import geopandas as gpd

wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)

response = wfs.getfeature(
                typename="ave:VerwaltungsEinheit"
            )

gpd.read_file(response)

Problem description

I have a column that stores the 8-digit German administrative code ("Amtlicher Gemeindeschlüssel", AGS), which consists only of digits. Depending on the German state (which is encoded in the first two digits), the code may start with a leading zero (e.g. 01 for the state Schleswig-Holstein). Because all values of this string field in the WFS are numeric, the read_file() function converts the field into an integer column. This strips the leading zero and thus yields the wrong identifier for my data.

Expected Output

Geopandas would recognize the type as a string (QGIS recognizes it as a string, and the XML response still contains the leading 0).

As an alternative, it would at least be great to be able to pass a dtype parameter to read_file() so that fiona / pyogrio assigns a specific dtype to a specific column.

Output of geopandas.show_versions()

SYSTEM INFO

python : 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:11:10) [Clang 16.0.6 ]
executable : /Users/jbe/anaconda3/envs/lbd_data_2_env/bin/python
machine : macOS-14.7-x86_64-i386-64bit

GEOS, GDAL, PROJ INFO

GEOS : 3.12.2
GEOS lib : None
GDAL : 3.9.2
GDAL data dir: /Users/jbe/anaconda3/envs/lbd_data_2_env/share/gdal/
PROJ : 9.4.1
PROJ data dir: /Users/jbe/anaconda3/envs/lbd_data_2_env/share/proj

PYTHON DEPENDENCIES

geopandas : 1.0.1
numpy : 2.1.0
pandas : 2.2.2
pyproj : 3.6.1
shapely : 2.0.6
pyogrio : 0.9.0
geoalchemy2: None
geopy : None
matplotlib : 3.9.2
mapclassify: 2.8.0
fiona : None
psycopg : None
psycopg2 : 2.9.9 (dt dec pq3 ext lo64)
pyarrow : 17.0.0

@theroggy
Member

theroggy commented Sep 27, 2024

For GML (~WFS), GDAL automatically tries to determine the data types, but sometimes this goes ~wrong. The easiest option is to disable this by specifying the configuration option GML_FIELDTYPES=ALWAYS_STRING.

Sample script:

import os
from owslib.wfs import WebFeatureService
import geopandas as gpd

wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)
response = wfs.getfeature(typename="ave:VerwaltungsEinheit")
os.environ["GML_FIELDTYPES"] = "ALWAYS_STRING"
print(gpd.read_file(response).dtypes)

Result:

gml_id          object
identifier      object
oid             object
aktualit        object
art             object
name            object
rs              object
ags             object
uebobjekt       object
ueboname        object
geometry      geometry
dtype: object
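
If the option should only apply to a single read rather than to the whole process, one possibility is to set it temporarily. The following is a minimal sketch, assuming GDAL picks the option up from the environment at read time (as in the script above); `temporary_env` is a hypothetical helper, not part of geopandas or GDAL.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set an environment variable for the duration of a `with` block."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the previous state, whether or not the variable existed.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Intended usage (not executed here, it needs the WFS response):
# with temporary_env("GML_FIELDTYPES", "ALWAYS_STRING"):
#     gdf = gpd.read_file(response)
```

Sources that should keep their automatically detected integer and float columns can then be read outside the `with` block.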

@SeppBerkner
Author

Hi @theroggy 👋

Thank you for the explanation! Now I know where the issue is coming from.

But the solution you suggest feels like it would create more issues elsewhere (e.g. for columns that really are integers or floats). I would then need to convert more types afterwards (the example above is only one of almost 100 sources in my scripts). So if there are other options, that would be really great... :)

@theroggy
Member

theroggy commented Sep 27, 2024

Another option that I'm aware of (but never really used) that might be closer to what you are looking for is to use the system of GFS files. It is meant to be able to specify e.g. column types for GML files.

If you open a .gml file, GDAL will create such a file with the default settings, but if they don't suit you, you can edit the file to change the settings.

The following script illustrates this. The first time you run it, GDAL will create a VerwaltungsEinheit.gfs file alongside the VerwaltungsEinheit.gml file. You can open and edit that file, e.g. by changing the data type of the ags column to String. If you run the script again, the change will be taken into account by read_file...

from pathlib import Path
from owslib.wfs import WebFeatureService
import geopandas as gpd

# Save the WFS response to a .gml file so that GDAL writes a .gfs file
# next to it the first time the file is read with `read_file`.
path = Path("c:/temp/VerwaltungsEinheit.gml")
if not path.exists():
    wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)
    response = wfs.getfeature(typename="ave:VerwaltungsEinheit")
    with open(path, "wb") as f:
        f.write(response.getbuffer())

# Read the file on every run so that edits to the .gfs file are picked up.
print(gpd.read_file(path).dtypes)
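
Editing the .gfs file by hand does not scale well to many sources, so the change could also be scripted. The following is a rough sketch only: `set_gfs_field_type` is a hypothetical helper, and `SAMPLE_GFS` is a hand-written stand-in for the layout of a generated .gfs file (a real file will contain more elements), so check it against an actual file before relying on it.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a generated .gfs file (assumed layout).
SAMPLE_GFS = """<GMLFeatureClassList>
  <GMLFeatureClass>
    <Name>VerwaltungsEinheit</Name>
    <ElementPath>VerwaltungsEinheit</ElementPath>
    <PropertyDefn>
      <Name>ags</Name>
      <ElementPath>ags</ElementPath>
      <Type>Integer</Type>
    </PropertyDefn>
  </GMLFeatureClass>
</GMLFeatureClassList>"""


def set_gfs_field_type(gfs_text, field_name, new_type="String"):
    """Return the .gfs XML with the type of `field_name` set to `new_type`."""
    root = ET.fromstring(gfs_text)
    changed = False
    for prop in root.iter("PropertyDefn"):
        if prop.findtext("Name") == field_name:
            type_el = prop.find("Type")
            if type_el is None:
                type_el = ET.SubElement(prop, "Type")
            type_el.text = new_type
            changed = True
    if not changed:
        raise KeyError(f"no PropertyDefn named {field_name!r}")
    return ET.tostring(root, encoding="unicode")


updated = set_gfs_field_type(SAMPLE_GFS, "ags")
```

The returned text would then be written back to the .gfs file before calling read_file again.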

Depending on your exact needs, you can also pass something like GFS_TEMPLATE="VerwaltungsEinheit.gfs" as extra parameter to read_file, and then that configuration file is also used, without having to save the file first.

# Read directly from the WFS response, but specify a .gfs template to specify
# the data types.
from owslib.wfs import WebFeatureService
import geopandas as gpd

wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)
response = wfs.getfeature(typename="ave:VerwaltungsEinheit")

# Reuse the response fetched above instead of requesting the features again.
print(
    gpd.read_file(
        response,
        GFS_TEMPLATE="C:/Temp/VerwaltungsEinheit.gfs",
    ).dtypes
)

@SeppBerkner
Author

I see. Thank you so much for taking the time for this; it is great to know. But I guess the overhead it creates is larger than that of transforming the data back to its correct state in later steps of the pipeline. With several hundred tables (and counting), I try to add as few interventions to my pipelines as possible. For now, I just use a function that checks and restores the ags columns in my data.
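
Such a function might look roughly like the sketch below. It is not the actual function from the pipeline: `restore_ags` is a hypothetical name, and the default width of 8 is taken from the 8-digit code described in the issue.

```python
def restore_ags(value, width=8):
    """Return an AGS code as a zero-padded string of `width` digits."""
    # Pass missing values through (None, or NaN which is not equal to itself).
    if value is None or (isinstance(value, float) and value != value):
        return None
    text = str(value).strip()
    # Integer columns with missing values are often read as floats.
    if text.endswith(".0"):
        text = text[:-2]
    if not text.isdigit():
        raise ValueError(f"not a numeric code: {value!r}")
    return text.zfill(width)


codes = [restore_ags(v) for v in (1001000, "01001000", 1001000.0, None)]
```

On a GeoDataFrame this could be applied with something like `gdf["ags"] = gdf["ags"].map(restore_ags)`.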

@theroggy
Member

theroggy commented Oct 1, 2024

I was planning to do some more follow up on this to explore your first possible solution as well (have it automagically be correct) but - what are the odds - apparently a very similar question was asked on the same day via a "parallel track".

And... that already led to a change in GDAL that should solve your problem: OSGeo/gdal#10902

The change will be backported to GDAL 3.9 as well...

@SeppBerkner
Author

Haha! What a coincidence, truly :) The person who opened the issue is also German. Perhaps the European High Value Datasets have led to a rise in the use of government data, so many people in Germany may be running into this issue now. Happy it is going to be resolved soon!

@jratike80

WFS users can read the schema with DescribeFeatureType. For example
https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD?service=WFS&version=2.0.0&typenames=ave:VerwaltungsEinheit&request=DescribeFeatureType

...
<sequence>
<element name="oid" type="string"/>
<element name="aktualit" type="date"/>
<element name="geometrie" type="gml:MultiSurfacePropertyType"/>
<element name="art" type="ave:Verwaltungseinheit_ArtType"/>
<element name="name" type="string"/>
<element minOccurs="0" name="rs" type="string"/>
<element minOccurs="0" name="ags" type="string"/>
<element minOccurs="0" name="uebobjekt" type="string"/>
<element minOccurs="0" name="ueboname" type="string"/>
</sequence>
...

Resolving the schema for a GML file that was downloaded from a WFS and saved to disk may be complicated, because the DescribeFeatureType link is not always included in the file, or the user may not have an internet connection to read the schema. But if DescribeFeatureType is available and its contents are correct, and the user simply does not read it, I would say the user shares the blame.
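
As an illustration, the declared types could be read from such a schema roughly as follows. This is a minimal sketch: `SAMPLE_SCHEMA` is a shortened, hand-written stand-in for a DescribeFeatureType response (assuming the usual XML Schema namespace), and `declared_field_types` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

# Shortened stand-in for a DescribeFeatureType response (assumed structure).
SAMPLE_SCHEMA = """<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <complexType name="VerwaltungsEinheitType">
    <sequence>
      <element name="oid" type="string"/>
      <element name="aktualit" type="date"/>
      <element minOccurs="0" name="ags" type="string"/>
    </sequence>
  </complexType>
</schema>"""


def declared_field_types(schema_text):
    """Map each named element in the schema to its declared type."""
    root = ET.fromstring(schema_text)
    return {
        el.get("name"): el.get("type")
        for el in root.iter(f"{XSD_NS}element")
        if el.get("name") and el.get("type")
    }


types = declared_field_types(SAMPLE_SCHEMA)
# Fields declared as strings, ignoring any namespace prefix such as "xsd:".
string_fields = [name for name, t in types.items() if t.split(":")[-1] == "string"]
```

The resulting list of string fields could then be used to decide which columns must not be interpreted as numbers.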
