After playing around a bit with gastrodon, I think I have bumped into some problems regarding column types. To reproduce:
Preliminaries
from gastrodon import RemoteEndpoint, inline
import pandas as pd
prefixes = inline("""
@prefix : <http://dbpedia.org/resource/> .
@prefix dbp: <http://dbpedia.org/ontology/> .
@prefix pr: <http://dbpedia.org/property/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
""").graph
endpoint = RemoteEndpoint(
"http://dbpedia.org/sparql/"
,default_graph="http://dbpedia.org"
,prefixes=prefixes
,base_uri="http://dbpedia.org/resource/"
)
Error with dates
endpoint.select("""
SELECT DISTINCT ?personName ?bDay
WHERE {
?person a dbp:Person .
?person foaf:name ?personName .
?person dbp:birthDate ?bDay .
}
LIMIT 10
""")
Output:
Traceback (most recent call last):
File "<stdin>", line 9, in <module>
File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 502, in select
frame=self._dataframe(result)
File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 397, in _dataframe
column[key] = self._normalize_column_type(column[key])
File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 376, in _normalize_column_type
return [None if x==None else int(x) for x in column]
File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 376, in <listcomp>
return [None if x==None else int(x) for x in column]
TypeError: int() argument must be a string, a bytes-like object or a number, not 'datetime.date'
Issue (casting floats)
endpoint.select("""
SELECT DISTINCT ?starName ?mass
WHERE {
?star a dbp:Star .
?star foaf:name ?starName .
?star pr:mass ?mass
} LIMIT 1
""")
Output:
starName mass
0 61 Cygni 0
Expected output:
starName mass
0 61 Cygni 0.63
(see this)
Possible cause
I believe both problems come from _normalize_column_type:
- Date values are not handled, so when int(x) is called on an x that is a datetime.date, the error above is raised.
- If all elements in the column are of type float, they are silently cast to int, as shown in the float-casting issue above.
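Both symptoms can be reproduced without querying DBpedia. The following is a minimal sketch that applies the list comprehension shown in the traceback to hand-made columns. The function name `cast_column_to_int` and the sample values are illustrative only. The condition gastrodon uses to decide when to apply this cast does not appear in the traceback, so it is not modelled here.

```python
import datetime


def cast_column_to_int(column):
    # Hypothetical helper: same expression as the line reported at
    # gastrodon/__init__.py:376 in the traceback, written with "is None".
    return [None if x is None else int(x) for x in column]


# Floats are truncated without any warning: 0.63 becomes 0.
truncated = cast_column_to_int([0.63, None, 1.12])
print(truncated)

# A datetime.date cannot be passed to int(), so the cast raises TypeError.
try:
    cast_column_to_int([datetime.date(1990, 1, 1)])  # made-up date
    date_error = None
except TypeError as exc:
    date_error = exc
print(repr(date_error))
```

If this matches what happens inside gastrodon, the float case is the more dangerous of the two, because it changes the data without raising anything.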
My question now is: is it really necessary to normalize the columns? pandas is usually smart enough to accommodate column types and cast if needed.
If I skip the _normalize_column_type() call in the code, the mass in the stars example above is no longer cast to int, and when a cast to str is needed, pandas does it:
endpoint.select("""
SELECT DISTINCT ?starName ?mass
WHERE {
?star a dbp:Star .
?star foaf:name ?starName .
?star pr:mass ?mass
} LIMIT 100
""").head()
_.mass.dtype
Output:
starName mass
0 61 Cygni 0.63
1 61 Cygni 0.7
2 70 Virginis 1.12
3 70 Virginis >7.49 ± 0.61
4 Albireo 3.2
dtype('O')
Python 3.6.6
gastrodon 0.9.3
pandas 0.23.4
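The claim that pandas can pick suitable column types by itself can be checked in isolation. This is a small sketch with made-up values (the column names and dates are assumptions, and the mixed column only mimics the mass results above). It does not involve gastrodon at all.

```python
import datetime

import pandas as pd

# Illustrative values only; none of this comes from a real query.
frame = pd.DataFrame({
    "all_floats": [0.63, 0.7, 1.12],
    "floats_and_text": [0.63, ">7.49 ± 0.61", 3.2],
    "dates": [datetime.date(1990, 1, 1), None, datetime.date(1985, 6, 15)],
})

# pandas infers float64 for the purely numeric column and falls back to
# object for the mixed and date columns, so no value is altered.
print(frame.dtypes)

# An explicit cast is still available when a uniform type is wanted.
mass_as_text = frame["floats_and_text"].astype(str)
print(mass_as_text.tolist())
```

In this sketch the floats keep their fractional part and the date column is stored as-is, which is the behaviour the question above is asking for.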