BUG: String Columns with only numeric values are converted to integer in read_file() #3431

SeppBerkner · 2024-09-27T10:13:09Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of geopandas.
(optional) I have confirmed this bug exists on the main branch of geopandas.


Code Sample, a copy-pastable example

from owslib.wfs import WebFeatureService
import geopandas as gpd

wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)

response = wfs.getfeature(typename="ave:VerwaltungsEinheit")

gpd.read_file(response)

Problem description

I have a column that stores the 8-digit German administrative code ("Amtlicher Gemeindeschlüssel"), which is a numeric code. Depending on the German state (which is encoded in the first two digits), the code may have a leading zero (e.g. 01 for the state of Schleswig-Holstein). Because this string field in the WFS contains only digits, read_file() converts it to an integer field. This removes the leading zero and therefore produces the wrong identifier for my data.

Expected Output

GeoPandas should recognize the type as a string (QGIS recognizes it as a string, and the XML response still contains the leading 0).

Alternatively, it would be great to at least be able to pass a dtype parameter to read_file() that tells fiona / pyogrio which dtype to use for a specific column.

Output of `geopandas.show_versions()`

SYSTEM INFO

python : 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:11:10) [Clang 16.0.6 ]
executable : /Users/jbe/anaconda3/envs/lbd_data_2_env/bin/python
machine : macOS-14.7-x86_64-i386-64bit

GEOS, GDAL, PROJ INFO

GEOS : 3.12.2
GEOS lib : None
GDAL : 3.9.2
GDAL data dir: /Users/jbe/anaconda3/envs/lbd_data_2_env/share/gdal/
PROJ : 9.4.1
PROJ data dir: /Users/jbe/anaconda3/envs/lbd_data_2_env/share/proj

PYTHON DEPENDENCIES

geopandas : 1.0.1
numpy : 2.1.0
pandas : 2.2.2
pyproj : 3.6.1
shapely : 2.0.6
pyogrio : 0.9.0
geoalchemy2: None
geopy : None
matplotlib : 3.9.2
mapclassify: 2.8.0
fiona : None
psycopg : None
psycopg2 : 2.9.9 (dt dec pq3 ext lo64)
pyarrow : 17.0.0


theroggy · 2024-09-27T16:03:15Z

For GML (and therefore WFS responses), GDAL tries to determine the data types automatically, but this sometimes goes wrong. The easiest option is to disable the detection by setting the configuration option GML_FIELDTYPES=ALWAYS_STRING.

Sample script:

import os
from owslib.wfs import WebFeatureService
import geopandas as gpd

wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)
response = wfs.getfeature(typename="ave:VerwaltungsEinheit")
os.environ["GML_FIELDTYPES"] = "ALWAYS_STRING"
print(gpd.read_file(response).dtypes)

Result:

gml_id          object
identifier      object
oid             object
aktualit        object
art             object
name            object
rs              object
ags             object
uebobjekt       object
ueboname        object
geometry      geometry
dtype: object
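
Setting the variable with `os.environ` leaves it active for every later read in the same process. A minimal sketch of limiting it to a single read follows, assuming GDAL picks the option up from the environment at read time, as the script above relies on. The helper name `temporary_env` is hypothetical and not part of geopandas or GDAL.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set an environment variable only for the duration of a ``with`` block."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the earlier state so that later reads use the default detection.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Usage (not executed here; `response` is the WFS response from the script above):
# with temporary_env("GML_FIELDTYPES", "ALWAYS_STRING"):
#     gdf = gpd.read_file(response)
```

This keeps the all-strings behaviour confined to the sources that need it, so other reads in the same pipeline still get typed columns.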

SeppBerkner · 2024-09-27T19:35:02Z

Hi @theroggy 👋

Thank you for the explanation! Now I know where the issue is coming from.

But the solution you suggest feels like it would create new issues elsewhere (e.g. for integer or float columns): I may need to convert more types afterwards, and the example above is only one of almost 100 sources in my scripts. So if there are other options, that would be really great... :)
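
One way to limit that follow-up work, sketched here as an assumption rather than as anything geopandas or GDAL provides, is to read everything as strings and convert back only the columns whose values survive an integer round trip. The function name `looks_like_integer_column` is hypothetical, and missing values are assumed to be `None`.

```python
def looks_like_integer_column(values):
    """Return True if every non-missing value round-trips through int() unchanged."""
    seen_value = False
    for value in values:
        if value is None:
            continue
        text = str(value).strip()
        try:
            round_trip = str(int(text))
        except ValueError:
            return False
        # "01051001" becomes "1051001", so a code with a leading zero stays text.
        if round_trip != text:
            return False
        seen_value = True
    # A column with no values at all is left as it is.
    return seen_value
```

Columns that pass the check could then be cast, for example with `gdf[col].astype("int64")`, while identifier columns such as ags stay untouched.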

theroggy · 2024-09-27T20:57:22Z

Another option that I'm aware of (but have never really used), and that might be closer to what you are looking for, is the system of GFS files. It is meant for specifying things such as column types for GML files.

If you open a .gml file, GDAL will create such a file with the default settings. If those don't suit you, you can edit the file to change them.

The following script illustrates this. The first time you run it, GDAL will create a .gfs file alongside the .gml file (here VerwaltungsEinheit.gfs next to VerwaltungsEinheit.gml). You can open and edit that file, e.g. by changing the data type of the ags column to String. If you run the script again, the change will be taken into account by read_file...

from pathlib import Path
from owslib.wfs import WebFeatureService
import geopandas as gpd

# Save the WFS response to a .gml file so that GDAL writes a .gfs file
# next to it the first time it is read with `read_file`.
path = Path("c:/temp/VerwaltungsEinheit.gml")
if not path.exists():
    wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)
    response = wfs.getfeature(typename="ave:VerwaltungsEinheit")
    with open(path, "wb") as f:
        f.write(response.getbuffer())

# Read on every run so that edits to the .gfs file show up in the dtypes.
print(gpd.read_file(path).dtypes)
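
Editing the .gfs file by hand does not scale to many sources, so the change could also be scripted. The sketch below assumes the layout GDAL usually writes, with one `<PropertyDefn>` per column containing `<Name>` and `<Type>` children; check a generated .gfs file before relying on it. The helper `set_gfs_field_type` is hypothetical.

```python
import xml.etree.ElementTree as ET


def set_gfs_field_type(gfs_path, field_name, new_type="String"):
    """Set the <Type> of one <PropertyDefn> in a .gfs file; return True if found."""
    tree = ET.parse(gfs_path)
    changed = False
    for prop in tree.getroot().iter("PropertyDefn"):
        if prop.findtext("Name") == field_name:
            type_element = prop.find("Type")
            if type_element is None:
                type_element = ET.SubElement(prop, "Type")
            type_element.text = new_type
            changed = True
    if changed:
        # Only rewrite the file when a matching field was actually updated.
        tree.write(gfs_path, encoding="utf-8")
    return changed


# Usage (not executed here):
# set_gfs_field_type("c:/temp/VerwaltungsEinheit.gfs", "ags", "String")
```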

Depending on your exact needs, you can also pass something like GFS_TEMPLATE="VerwaltungsEinheit.gfs" as an extra parameter to read_file. That configuration file is then used as well, without having to save the GML file first.

# Read directly from the WFS response, but specify a .gfs template to specify
# the data types.
from owslib.wfs import WebFeatureService
import geopandas as gpd

wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)
response = wfs.getfeature(typename="ave:VerwaltungsEinheit")

# Reuse the response fetched above instead of requesting the features a second time.
print(
    gpd.read_file(
        response,
        GFS_TEMPLATE="C:/Temp/VerwaltungsEinheit.gfs",
    ).dtypes
)

SeppBerkner · 2024-09-30T12:04:04Z

I see. Thank you so much for taking the time to look into this; it is great to know. But I think the overhead this creates is larger than that of transforming the data back to its correct state in a later step of the pipeline. With several hundred tables (and counting), I try to avoid adding interventions to my pipelines wherever possible. For now I just use a function that checks and re-transforms the ags columns in my data.
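
The function itself is not shown in the thread. A minimal sketch of what such a repair step could look like follows, assuming the code is always 8 digits wide as described in the problem description. The name `restore_ags` and its behaviour for missing values are hypothetical.

```python
def restore_ags(value, width=8):
    """Return the AGS as a zero-padded string, or None for missing values."""
    # `value != value` is only true for NaN, which pandas uses for missing numbers.
    if value is None or value != value:
        return None
    text = str(value).strip()
    # A float such as 1051001.0 can appear when the column contained missing values.
    if text.endswith(".0"):
        text = text[:-2]
    if not text.isdigit() or len(text) > width:
        raise ValueError(f"not a valid AGS: {value!r}")
    return text.zfill(width)
```

Applied to a column, this could be written as `gdf["ags"] = gdf["ags"].map(restore_ags)`. Raising on unexpected values, instead of padding them silently, makes it visible when a source delivers something other than an AGS in that column.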

theroggy · 2024-10-01T15:21:32Z

I was planning to do some more follow up on this to explore your first possible solution as well (have it automagically be correct) but - what are the odds - apparently a very similar question was asked on the same day via a "parallel track".

And... that already led to a change in GDAL that should solve your problem: OSGeo/gdal#10902

The change will be backported to GDAL 3.9 as well...

SeppBerkner · 2024-10-01T15:31:44Z

Haha! What a coincidence, truly :) The person who opened that issue is also German. Perhaps the European High Value Datasets have led to more people using government data, so many people in Germany may be running into this issue now. Happy it is going to be resolved soon!

jratike80 · 2024-10-08T06:40:11Z

WFS users can read the schema with DescribeFeatureType. For example
https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD?service=WFS&version=2.0.0&typenames=ave:VerwaltungsEinheit&request=DescribeFeatureType

...
<sequence>
<element name="oid" type="string"/>
<element name="aktualit" type="date"/>
<element name="geometrie" type="gml:MultiSurfacePropertyType"/>
<element name="art" type="ave:Verwaltungseinheit_ArtType"/>
<element name="name" type="string"/>
<element minOccurs="0" name="rs" type="string"/>
<element minOccurs="0" name="ags" type="string"/>
<element minOccurs="0" name="uebobjekt" type="string"/>
<element minOccurs="0" name="ueboname" type="string"/>
</sequence>
...

Resolving the schema for a GML file that was downloaded from a WFS can be complicated, because the DescribeFeatureType link is not always included in the file, or the user may not have an internet connection for reading the schema. But if DescribeFeatureType is available and its contents are correct, and the user simply does not read it, I would say the user shares the blame.
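
As a rough illustration of using that schema, the sketch below collects the declared type of each element from a DescribeFeatureType response with the standard library. The function name `schema_field_types` is hypothetical, and the parsing assumes a flat list of `element` declarations with `name` and `type` attributes, as in the excerpt above.

```python
import xml.etree.ElementTree as ET


def schema_field_types(xsd_text):
    """Map element names to their declared types in a DescribeFeatureType response."""
    fields = {}
    for node in ET.fromstring(xsd_text).iter():
        # Namespaced tags look like "{namespace}element"; compare only the local name.
        local_name = node.tag.rsplit("}", 1)[-1]
        if local_name == "element" and "name" in node.attrib and "type" in node.attrib:
            fields[node.attrib["name"]] = node.attrib["type"]
    return fields
```

Columns that the schema declares as `string` but that arrive as integers after `read_file` are the ones to convert back. Note that a real response also declares the feature type itself as an `element`, so the result can contain one entry that is not a column.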

SeppBerkner added the "bug" and "needs triage" labels on Sep 27, 2024

theroggy mentioned this issue Oct 7, 2024

Way to override field definitions at the OGR driver level? OSGeo/gdal#10943

Open
