Problems with column types #6

Open
@iipr

After playing around a bit with gastrodon, I think I have bumped into some problems regarding column types. To reproduce:

Preliminaries

from gastrodon import RemoteEndpoint, inline
import pandas as pd

prefixes = inline("""
    @prefix : <http://dbpedia.org/resource/> .
    @prefix dbp: <http://dbpedia.org/ontology/> .
    @prefix pr: <http://dbpedia.org/property/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
""").graph
endpoint = RemoteEndpoint(
    "http://dbpedia.org/sparql/"
    ,default_graph="http://dbpedia.org"
    ,prefixes=prefixes
    ,base_uri="http://dbpedia.org/resource/"
)

Error with dates

endpoint.select("""
SELECT DISTINCT ?personName ?bDay
WHERE {
    ?person a dbp:Person .
    ?person foaf:name ?personName .
    ?person dbp:birthDate ?bDay .
    }
    LIMIT 10
""")

Output:

Traceback (most recent call last):
  File "<stdin>", line 9, in <module>
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 502, in select
    frame=self._dataframe(result)
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 397, in _dataframe
    column[key] = self._normalize_column_type(column[key])
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 376, in _normalize_column_type
    return [None if x==None else int(x) for x in column]
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 376, in <listcomp>
    return [None if x==None else int(x) for x in column]
TypeError: int() argument must be a string, a bytes-like object or a number, not 'datetime.date'

Issue (casting floats)

endpoint.select("""
SELECT DISTINCT ?starName ?mass
WHERE {
    ?star a dbp:Star .
    ?star foaf:name ?starName .
    ?star pr:mass ?mass
} LIMIT 1
""")

Output:

    starName  mass
0  61 Cygni     0

Expected output:

    starName  mass
0  61 Cygni   0.63

(see this)

Possible cause

I believe both problems come from _normalize_column_type:

  1. datetime.date values are not handled, so int(x) raises the TypeError above when x is a datetime.date.
  2. If every element in the column is a float, the values are silently cast to int, as shown in the float example above.
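
Both failure modes can be reproduced without gastrodon or an endpoint. The following is only a sketch: `normalize_as_int` is a hypothetical stand-in built from the list comprehension shown in the traceback (using `is None` instead of `==None`), and I have not checked under which conditions gastrodon actually decides to apply it to a column.

```python
import datetime


def normalize_as_int(column):
    # Hypothetical stand-in for the comprehension at line 376 of the traceback;
    # this is not gastrodon's real function, just the same cast in isolation.
    return [None if x is None else int(x) for x in column]


# Floats are truncated toward zero without any warning.
print(normalize_as_int([0.63, 1.12]))  # prints [0, 1]

# A datetime.date cannot be passed to int(), so the cast raises TypeError.
try:
    normalize_as_int([datetime.date(1990, 1, 1)])
except TypeError as exc:
    print("TypeError:", exc)
```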

My question now is: is it really necessary to normalize the columns?
pandas is usually smart enough to infer column types and cast when needed.
If I skip the _normalize_column_type() call, the mass in the stars example above is no longer cast to int, and when a column has to hold strings, pandas handles that as well:

endpoint.select("""
SELECT DISTINCT ?starName ?mass
WHERE {
    ?star a dbp:Star .
    ?star foaf:name ?starName .
    ?star pr:mass ?mass
} LIMIT 100
""").head()

_.mass.dtype

Output:

      starName          mass
0     61 Cygni          0.63
1     61 Cygni           0.7
2  70 Virginis          1.12
3  70 Virginis  >7.49 ± 0.61
4      Albireo           3.2

dtype('O')
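
The same inference can be seen with pandas alone. This is a small sketch with hand-written values copied from the output above (not fetched from DBpedia), assuming the column is handed to pandas as a plain Python list:

```python
import pandas as pd

# Only floats: pandas infers a float column, nothing is truncated.
all_float = pd.DataFrame({"mass": [0.63, 0.7, 1.12]})

# Floats mixed with a string: pandas falls back to the generic object dtype.
mixed = pd.DataFrame({"mass": [0.63, 0.7, ">7.49 ± 0.61"]})

print(all_float["mass"].dtype)  # prints float64
print(mixed["mass"].dtype)      # prints object
```

So at least for these two cases, leaving the column untouched gives a sensible dtype without any extra normalization step.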

Python 3.6.6
gastrodon 0.9.3
pandas 0.23.4
