BUG: String Columns with only numeric values are converted to integer in read_file() #3431
Comments
For GML (~WFS), GDAL automatically tries to determine the data types, but sometimes this goes wrong. The easiest option is to disable this by specifying the configuration option GML_FIELDTYPES=ALWAYS_STRING. Sample script:
import os
from owslib.wfs import WebFeatureService
import geopandas as gpd
wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)
response = wfs.getfeature(typename="ave:VerwaltungsEinheit")
os.environ["GML_FIELDTYPES"] = "ALWAYS_STRING"
print(gpd.read_file(response).dtypes)
Result:
|
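Setting the option through os.environ, as in the script above, leaves it active for every later read in the same process. A minimal sketch of scoping it to a single read follows; the helper name temporary_env is hypothetical, and it assumes GDAL picks up the environment variable at read time, as the script above relies on.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set an environment variable for the duration of a with-block."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the earlier state so later reads use GDAL's type detection again.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


with temporary_env("GML_FIELDTYPES", "ALWAYS_STRING"):
    # A call such as gpd.read_file(response) would go here.
    inside = os.environ["GML_FIELDTYPES"]
```

Only the sources that need every column as a string would be read inside the with-block; all other reads keep the default behaviour.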
Hi @theroggy 👋 Thank you for the explanation! Now I know where the issue is coming from. But the solution you suggest feels like it would create new issues elsewhere (e.g. for integer or float columns): I may need to convert more types afterwards, and the example above is only one of almost 100 sources in my scripts. So if there are other options, that would be really great... :) |
Another option that I'm aware of (but have never really used), and that might be closer to what you are looking for, is the system of GFS files. It is meant for specifying things like column types for GML files. If you open a .gml file, GDAL will create such a file with the default settings, but if they don't suit you, you can edit the file to change them. The following script illustrates this. If you run it the first time, GDAL will create a response.gfs file alongside the response.gml file. You can open and edit that file, e.g. by changing the data type of the ags column to String. If you run the script again, the change will be taken into account.
from pathlib import Path
from owslib.wfs import WebFeatureService
import geopandas as gpd
# Save the WFS response to a .gml file so a .gfs file is saved once
# it is read with `read_file`.
path = Path("c:/temp/VerwaltungsEinheit.gml")
if not path.exists():
    wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)
    response = wfs.getfeature(typename="ave:VerwaltungsEinheit")
    with open(path, "wb") as f:
        f.write(response.getbuffer())
print(gpd.read_file(path).dtypes)
Depending on your exact needs, you can also pass a GFS_TEMPLATE option to read_file, something like:
# Read directly from the WFS response, but specify a .gfs template to specify
# the data types.
from owslib.wfs import WebFeatureService
import geopandas as gpd
wfs = WebFeatureService(url="https://dienste.gdi-sh.de/WFS_SH_ALKIS_VWG_OpenGBD", version="2.0.0", timeout=180)
response = wfs.getfeature(typename="ave:VerwaltungsEinheit")
print(
    gpd.read_file(
        # Reuse the response fetched above instead of requesting it a second time.
        response,
        GFS_TEMPLATE="C:/Temp/VerwaltungsEinheit.gfs",
    ).dtypes,
)
|
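The manual edit of the .gfs file described above could also be scripted. The following is only a sketch: the sample content is a hypothetical, shortened .gfs file (a file generated by GDAL contains more entries), the name column is made up for illustration, and force_string_type is a helper name chosen here, not part of GDAL or geopandas.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened .gfs content for illustration only.
SAMPLE_GFS = """<GMLFeatureClassList>
  <GMLFeatureClass>
    <Name>VerwaltungsEinheit</Name>
    <ElementPath>VerwaltungsEinheit</ElementPath>
    <PropertyDefn>
      <Name>ags</Name>
      <ElementPath>ags</ElementPath>
      <Type>Integer</Type>
    </PropertyDefn>
    <PropertyDefn>
      <Name>name</Name>
      <ElementPath>name</ElementPath>
      <Type>String</Type>
    </PropertyDefn>
  </GMLFeatureClass>
</GMLFeatureClassList>
"""


def force_string_type(gfs_text, column):
    """Return the .gfs XML with the <Type> of the named property set to String."""
    root = ET.fromstring(gfs_text)
    for prop in root.iter("PropertyDefn"):
        if prop.findtext("Name") == column:
            type_elem = prop.find("Type")
            if type_elem is None:
                type_elem = ET.SubElement(prop, "Type")
            type_elem.text = "String"
    return ET.tostring(root, encoding="unicode")


patched = force_string_type(SAMPLE_GFS, "ags")
```

The patched text would then be written back to the .gfs file, or to a template file, before the next read.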
I see. Thank you so much for taking the time for this. It is great to know. But I guess the overhead it creates is larger than that of transforming the data back to its correct state in later steps of the pipeline. With several hundred tables (and increasing), I try to avoid adding interventions to my pipelines wherever possible. For now I just use a function that checks and retransforms the ags columns in my data. |
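A check-and-retransform helper of the kind mentioned above might look roughly like this. It is a sketch, not the function actually used in that pipeline: the name restore_ags is hypothetical, and the default width of 8 is taken from the 8-digit code in the problem description.

```python
import math


def restore_ags(value, width=8):
    """Return an AGS code as a zero-padded string, or None for missing values."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if isinstance(value, float):
        # Integer columns with missing values are often read as floats.
        if not value.is_integer():
            raise ValueError(f"not an integer code: {value!r}")
        value = int(value)
    return str(value).strip().zfill(width)


codes = [restore_ags(v) for v in (1001000, "01001000", 1001000.0)]
```

For a (Geo)DataFrame column, the same function could be applied with something like gdf["ags"].map(restore_ags).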
I was planning to do some more follow up on this to explore your first possible solution as well (have it automagically be correct) but - what are the odds - apparently a very similar question was asked on the same day via a "parallel track". And... that already led to a change in GDAL that should solve your problem: OSGeo/gdal#10902 The change will be backported to GDAL 3.9 as well... |
Haha! What a coincidence, truly :) The person who opened the issue is also German. Perhaps the European High Value Datasets have increased the use of government data, so many people in Germany may be running into this issue now. Happy it is going to be resolved soon! |
WFS users can read the schema with DescribeFeatureType. For example
Resolving the schema from a GML file that was downloaded from a WFS may be complicated, because the DescribeFeatureType link is not always included in the file, or the user may not have an internet connection for reading the schema. But if DescribeFeatureType is available and its contents are correct, and the user simply does not read it, I would say the user shares the blame. |
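To illustrate what reading the schema could give you, here is a sketch that extracts the declared column types from an XML Schema document of the kind a DescribeFeatureType request returns. The sample schema is hypothetical and heavily shortened (only the ags column comes from this issue), and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical, shortened schema; a real DescribeFeatureType response is longer.
SAMPLE_XSD = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="VerwaltungsEinheitType">
    <xsd:sequence>
      <xsd:element name="ags" type="xsd:string"/>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="flaeche" type="xsd:double"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def declared_types(xsd_text):
    """Map element names to their declared XSD types."""
    root = ET.fromstring(xsd_text)
    return {
        elem.get("name"): elem.get("type")
        for elem in root.iter(XSD_NS + "element")
        if elem.get("name") and elem.get("type")
    }


def string_columns(xsd_text):
    """List the columns that the schema declares as strings."""
    return [
        name
        for name, xsd_type in declared_types(xsd_text).items()
        if xsd_type.split(":")[-1] == "string"
    ]


text_columns = string_columns(SAMPLE_XSD)
```

The resulting list could then drive a targeted conversion of only those columns back to strings.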
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of geopandas.
(optional) I have confirmed this bug exists on the main branch of geopandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Problem description
I have a column that stores the 8-digit German administrative code ("Amtlicher Gemeindeschlüssel"), which is a numeric code. Depending on the German state (which is encoded in the first two digits), a leading zero may exist (e.g. 01 for the state Schleswig-Holstein). Because of the numeric nature of this string field in the WFS, the read_file() function transforms this field into an integer field. This also removes the leading zero and thus produces the wrong identifier for my data.
Expected Output
Geopandas would actually recognize the type as a string (in QGIS it is recognized as a string, and the XML response also still stores the 0). As an alternative, it would at least be great to be able to pass a dtype parameter to the read_file() function to tell fiona / pyogrio to use a specific dtype for a specific column.
Output of geopandas.show_versions()
SYSTEM INFO
python : 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:11:10) [Clang 16.0.6 ]
executable : /Users/jbe/anaconda3/envs/lbd_data_2_env/bin/python
machine : macOS-14.7-x86_64-i386-64bit
GEOS, GDAL, PROJ INFO
GEOS : 3.12.2
GEOS lib : None
GDAL : 3.9.2
GDAL data dir: /Users/jbe/anaconda3/envs/lbd_data_2_env/share/gdal/
PROJ : 9.4.1
PROJ data dir: /Users/jbe/anaconda3/envs/lbd_data_2_env/share/proj
PYTHON DEPENDENCIES
geopandas : 1.0.1
numpy : 2.1.0
pandas : 2.2.2
pyproj : 3.6.1
shapely : 2.0.6
pyogrio : 0.9.0
geoalchemy2: None
geopy : None
matplotlib : 3.9.2
mapclassify: 2.8.0
fiona : None
psycopg : None
psycopg2 : 2.9.9 (dt dec pq3 ext lo64)
pyarrow : 17.0.0