
Investigate deployment-level models for mode inference based on trajectory characteristics #1100


Open
shankari opened this issue Dec 11, 2024 · 34 comments

Comments

@shankari
Contributor

In e-mission, we currently use two methods for determining the mode automatically:

  • a user-level model to infer the (rich) mode from the user's prior travel history
  • an ad-hoc set of rules to infer a simplified set of base modes (walk, bike, car, bus, train). We do this by using the motion activity to distinguish between non-motorized (walk, bike) and motorized (car, bus, train) modes, and then a GIS integration to split the bus and train modes based on OSM bus routes (sketched below).
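A schematic sketch of that rule cascade as described above (the matches_osm_route helper and the exact activity labels are illustrative placeholders, not the actual server code):

def matches_osm_route(section, route_type):
    # Hypothetical GIS helper: does the section's trajectory follow an OSM
    # route of the given type (e.g. a bus route)? Stubbed out here.
    return False

def infer_base_mode(section):
    # Non-motorized modes come straight from the phone's motion activity
    if section.motion_activity in ("walking", "on_foot"):
        return "walk"
    if section.motion_activity == "cycling":
        return "bike"
    # Motorized modes: split bus/train from car via GIS matching on OSM routes
    if matches_osm_route(section, route_type="bus"):
        return "bus"
    if matches_osm_route(section, route_type="train"):
        return "train"
    return "car"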

However, as we have deployments with more complex modes, neither of these is sufficient. For example, in Laos, we want to be able to distinguish between cars and motorcycles, both of which are motorized modes without fixed routes. To this end, we want the ability to build a deployment-specific model that uses sensor-level data to predict the rich modes that are relevant to that deployment.

The steps to this are fairly simple:

  • Pick a featurization that depends on the sensor characteristics and not the user
  • Create training and test sets.
    • Note that we should have at least one scenario in which the test set does not contain any user overlap with the training set
  • Build an RF model
  • Cross-validate (see the sketch after this list)
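A minimal sketch of these steps with scikit-learn, assuming a feature matrix X, a label vector y, and a parallel array user_ids recording which user produced each section (all placeholder names); GroupShuffleSplit gives the scenario with no user overlap between the training and test sets:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, cross_val_score

def train_and_validate(X, y, user_ids):
    X, y, user_ids = np.asarray(X), np.asarray(y), np.asarray(user_ids)

    # Hold out ~20% of users (not sections), so the test set shares no
    # users with the training set
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(splitter.split(X, y, groups=user_ids))

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    print("held-out-user accuracy:", clf.score(X[test_idx], y[test_idx]))

    # Cross-validate on the training portion, again grouping folds by user
    scores = cross_val_score(clf, X[train_idx], y[train_idx],
                             groups=user_ids[train_idx],
                             cv=GroupKFold(n_splits=5))
    print("grouped 5-fold CV accuracy:", scores.mean())
    return clf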

The initial featurization we should use is from

Zheng, Y., Li, Q., Chen, Y., Xie, X., Ma, W.-Y., 2008. Understanding mobility based on GPS data. In: Proceedings of the 10th International Conference on Ubiquitous Computing, Association for Computing Machinery, Seoul, South Korea, pp. 312–321.
https://dl.acm.org/doi/10.1145/1409635.1409677

I implemented this featurization in https://github.com/e-mission/e-mission-server/tree/random-forest-mode-detection,
and in https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/pipeline.py in particular.

Please put the initial results here, and then we can decide how to proceed.

@shankari
Contributor Author

@iantei @Abby-Wheelis for visibility

@iantei
Contributor

iantei commented Dec 16, 2024

I went through the paper, which describes the features, and we have incorporated all of them, with the addition of a few more considerations in https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/pipeline.py:

  1. both start and end close to bus stop
  2. both start and end close to train station
  3. both start and end close to airport
    alongside others.

I followed the instructions mentioned in https://github.com/e-mission/e-mission-docs/blob/master/docs/install/manual_install.md to run the server.

source setup/activate.sh
export DB_HOST=mongodb://db/openpath_prod_usaid_laos_ev
./e-mission-py.bash bin/<script_name.py>

while running docker-compose for the public dashboard with MongoDB's port 27017 exposed.

@iantei
Contributor

iantei commented Dec 16, 2024

My initial idea was to run the script with a different dataset, i.e. the Laos EV dataset, since it would have both car and motorcycle data.

However, I came across the following error while executing the script intake_multiprocess.py:

Elaborated call stack -

(emission) ashrest2-41625s:e-mission-server ashrest2$ export DB_HOST=mongodb://db/openpath_prod_usaid_laos_ev
(emission) ashrest2-41625s:e-mission-server ashrest2$ ./e-mission-py.bash bin/intake_multiprocess.py 4
storage not configured, falling back to sample, default configuration
URL not formatted, defaulting to "Stage_database"
Connecting to database URL localhost
analysis.debug.conf.json not configured, falling back to sample, default configuration
google maps key not configured, falling back to nominatim
nominatim not configured either, place decoding must happen on the client
[... Zen of Python banner elided ...]
analysis.trip_model.conf.json not configured, falling back to sample, default configuration
expectations.conf.json not configured, falling back to sample, default configuration
ERROR:root:habitica not configured, game functions not supported
Traceback (most recent call last):
  File "/Users/ashrest2/NREL/2024_RF_Server/e-mission-server/emission/net/ext_service/habitica/proxy.py", line 22, in <module>
    key_file = open('conf/net/ext_service/habitica.json')
FileNotFoundError: [Errno 2] No such file or directory: 'conf/net/ext_service/habitica.json'
[... the same configuration warnings, Zen of Python banner, and habitica traceback are repeated for each of the four worker processes ...]
(emission) ashrest2-41625s:e-mission-server ashrest2$ 

It complains about habitica.json not being available:

FileNotFoundError: [Errno 2] No such file or directory: 'conf/net/ext_service/habitica.json'
ERROR:root:habitica not configured, game functions not supported

Also, I am not sure whether it is picking up the right Laos database instead of Stage_database.
We do have 'habitica.json.sample' in that directory inside the repo.

Now, in the following file -
https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/pipeline.py
inside the main function,

  modeInferPipeline = ModeInferencePipeline()
  modeInferPipeline.runPipeline()

There is no runPipeline() function in the class ModeInferencePipeline().

@shankari Could you please help me understand the following:

  1. Am I loading the right MongoDB database?
  2. Am I using the right script to run the inference pipeline for a new dataset?

It seems the above pipeline uses the model trained in
https://github.com/e-mission/e-mission-server/blob/52adee205f686d87e167bd4b1d166098938870c6/emission/analysis/classification/inference/mode/seed/pipeline.py

which needs to incorporate the training/test data split and cross-validation aspects.

Can I build the above requirements on top of this pipeline code, once I am able to run the pipeline with the Laos EV dataset?

@iantei iantei moved this to Questions for Shankari in OpenPATH Tasks Overview Dec 17, 2024
@iantei
Contributor

iantei commented Dec 19, 2024

There needs to be a slight modification to the approach for loading the MongoDB database correctly.
After loading the mongo dump into the public dashboard, and with the public-dashboard container running, MongoDB is exposed on port 27017, so we need to point DB_HOST at localhost:27017 instead of db.

source setup/activate.sh
export DB_HOST=mongodb://localhost:27017/openpath_prod_usaid_laos_ev
./e-mission-py.bash bin/<script_name.py>

Running with this change, it loads the right database, i.e. openpath_prod_usaid_laos_ev. [Tested these changes with the latest server code; additional changes are needed in the existing RF branch, likely introducing the db.conf file, which is missing.]

@iantei
Contributor

iantei commented Dec 19, 2024

Can I build the above requirements on top of this pipeline code, once I am able to run the pipeline with the Laos EV dataset?

This is an old branch, which doesn't have the updates in the latest master branch.

I am inclined towards understanding the workflow in which the RF model pipeline is trained, the model saved and run with the new dataset, and then porting these changes to master and working in a new PR altogether.
Reason: there are likely changes that need to be made which are apparently already fixed in the master branch, so re-fixing them on this old branch seems redundant.

@iantei
Contributor

iantei commented Dec 19, 2024

We are not using habitica.json in the project anymore, so the error/warning can be ignored. Alternatively, any code that still uses it can be analysed and removed.

@shankari
Contributor Author

I am inclined towards understanding the workflow in which the RF model pipeline is trained, the model saved and run with the new dataset, and then porting these changes to master and working in a new PR altogether.

I am not sure where model training and evaluation are planned here. I do not anticipate significant changes for porting - the data model has been unchanged for the past 5 years.

I would encourage you to build and evaluate the model before porting so we can get the impactful results quickly and don't spin our wheels on porting something we won't use.

@iantei
Contributor

iantei commented Dec 19, 2024

I am not sure where model training and evaluation are planned here.

Yes, the model training is not done in https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/pipeline.py

Rather, it just loads the model saved here -
https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/seed/pipeline.py

where training of the model is being done:

  def buildModelStep(self):
    from sklearn import ensemble
    forestClf = ensemble.RandomForestClassifier()
    model = forestClf.fit(self.selFeatureMatrix, self.cleanedResultVector)
    return model

I couldn't find a place where the model evaluation is done. But it does refer to 70% accuracy in one of the comments in mode/pipeline.py:

# we documented to have ~ 70% accuracy in the 2014 e-mission paper.

@shankari
Contributor Author

Yes, the model training is not done in https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/pipeline.py

I know that we don't do model training in the code. My comment was more around the model training step in your plan. Can you write out your proposed plan of action in bullet points (e.g. 1. I will do this first, 2. I will do this next, ...)?

I couldn't find a place where the model evaluation is done. But it does refer to 70% accuracy in one of the comments in mode/pipeline.py

We generally don't do model evaluation in a production system (!!); that is done offline. The 70% accuracy is on the ad-hoc dataset that I collected as part of a class project in 2014 - I had just started the PhD program. The model accuracy will depend on the input dataset, which will depend on the deployment.

You will need to evaluate this on a few deployments, starting with Laos.

@iantei
Contributor

iantei commented Dec 19, 2024

Here's my proposed plan of actions (on a high level):

Work on the random-forest-mode-detection branch.

  1. Understand how to run model/pipeline.py from the pre-existing scripts. Likely intake_multiprocess.py.
  2. Write an offline model evaluation script and run the existing model with the Laos dataset (sketched below).
  3. Document the evaluation results.
  4. Discuss the evaluation results, and decide whether it is necessary to re-train the model.
  5. After the model is deemed right, port these changes to the master branch.
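A minimal sketch of what the offline evaluation script in step 2 could look like, assuming we can reuse the feature matrix and result vector produced by the seed pipeline, plus a per-section array of user ids (all placeholder names):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GroupKFold, cross_val_predict

def evaluate_offline(feature_matrix, result_vector, user_ids):
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    # Out-of-fold predictions, with folds grouped by user so that one user's
    # sections never appear in both the training and the test folds
    predicted = cross_val_predict(clf, feature_matrix, result_vector,
                                  groups=user_ids, cv=GroupKFold(n_splits=5))
    print(classification_report(result_vector, predicted))
    print(confusion_matrix(result_vector, predicted))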

Please let me know about your thoughts on this.

Note: I am trying to figure out the issue with the db.conf file (config file issues), referring to the initial PR we discussed in the op-admin repository.

@iantei
Contributor

iantei commented Dec 19, 2024

Probably not an elegant solution to the issue, but I updated db.conf.sample to the following:

{
    "timeseries": {
        "url": "mongodb://localhost:27017/openpath_prod_usaid_laos_ev",
        "result_limit": 250000
    }
}

This change helped me use the right mongo db. I am not looking for a clean fix here, since there is no such issue in the updated master branch, where we can `export DB_HOST=...` and it works as expected.

@iantei
Contributor

iantei commented Dec 19, 2024

What happens when we run the intake_multiprocess.py script, after configuring db.conf.sample as mentioned above:
(emission) $ ./e-mission-py.bash bin/intake_multiprocess.py 4

Documenting the steps so that they are easier for me to recall later.

import emission.pipeline.scheduler as eps
import emission.pipeline.intake_stage as epi
import emission.analysis.classification.inference.mode.pipeline as eacimp

1.
eps.dispatch(split_lists, args.skip_if_no_new_data)
    dispatch():
          p = ctx.Process(target=epi.run_intake_pipeline, args=(pid, uuid_list, skip_if_no_new_data))

2.
epi.run_intake_pipeline():
    with ect.Timer() as crt:
        logging.info("*" * 10 + "UUID %s: inferring transportation mode" % uuid + "*" * 10)
        print(str(arrow.now()) + "*" * 10 + "UUID %s: inferring transportation mode" % uuid + "*" * 10)
        eacimp.predict_mode(uuid)

3.
eacimp.predict_mode():
    mip = ModeInferencePipeline()
    mip.runPredictionPipeline(user_id, time_query)
        runPredictionPipeline(self, user_id, timerange):
            self.loadModelStage()

4.
loadModelStage():
    import emission.analysis.classification.inference.mode.seed.pipeline as seedp
    self.model = seedp.ModeInferencePipelineMovesFormat.loadModel()

5.
SAVED_MODEL_FILENAME = 'seed_model.json'

  @staticmethod
  def loadModel():
    fd = open(SAVED_MODEL_FILENAME, "r")
    model_rep = fd.read()
    fd.close()
    return jpickle.loads(model_rep)

Apparently, we do not have seed_model.json.

This led to some interesting error logs in the terminal, but it continued and completed the script execution without crashing:

/Users/ashrest2/NREL/2024_RF_Server/e-mission-server/emission/analysis/intake/cleaning/cleaning_methods/speed_outlier_detection.py:27: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  quartile_vals = df_to_use.quantile([0.25, 0.75]).speed
2024-12-19T15:21:45.188388-07:00**********UUID <UUID_ID>: inferring transportation mode**********
Error while inferring modes, timestamp is unchanged
Traceback (most recent call last):
  File "/Users/ashrest2/NREL/2024_RF_Server/e-mission-server/emission/analysis/classification/inference/mode/pipeline.py", line 41, in predict_mode
    mip.runPredictionPipeline(user_id, time_query)
  File "/Users/ashrest2/NREL/2024_RF_Server/e-mission-server/emission/analysis/classification/inference/mode/pipeline.py", line 139, in runPredictionPipeline
    self.loadModelStage()
  File "/Users/ashrest2/NREL/2024_RF_Server/e-mission-server/emission/analysis/classification/inference/mode/pipeline.py", line 156, in loadModelStage
    self.model = seedp.ModeInferencePipelineMovesFormat.loadModel()
  File "/Users/ashrest2/NREL/2024_RF_Server/e-mission-server/emission/analysis/classification/inference/mode/seed/pipeline.py", line 93, in loadModel
    fd = open(SAVED_MODEL_FILENAME, "r")
FileNotFoundError: [Errno 2] No such file or directory: 'seed_model.json'
...
While getting section summary, section length = 0. This should never happen, but let's not crash if it does
While getting section summary, section length = 0. This should never happen, but let's not crash if it does
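For context on the missing file: loadModel() implies a matching save step that writes seed_model.json in the first place. A minimal sketch of that round-trip with jsonpickle (the actual saveModelStep in seed/pipeline.py may differ; sklearn models also need jsonpickle's numpy extension):

import jsonpickle as jpickle
import jsonpickle.ext.numpy as jpickle_numpy

jpickle_numpy.register_handlers()  # serialize the numpy arrays inside the model

SAVED_MODEL_FILENAME = 'seed_model.json'

def save_model(model):
    # loadModel() above reverses this with jpickle.loads()
    with open(SAVED_MODEL_FILENAME, "w") as fd:
        fd.write(jpickle.dumps(model))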

@iantei
Contributor

iantei commented Dec 20, 2024

Well, I tried to run another script bin/analysis/create_static_model.py.

(emission) ashrest2-41625s:e-mission-server ashrest2$ ./e-mission-py.bash bin/analysis/create_static_model.py
storage not configured, falling back to sample, default configuration
Connecting to database URL mongodb://localhost:27017/openpath_prod_usaid_laos_ev
analysis.debug.conf.json not configured, falling back to sample, default configuration
DEBUG:root:START TRAINING DATA STEP
DEBUG:root:Section data set size = 0
DEBUG:root:Getting dataset size took 0.0019478797912597656
DEBUG:root:Querying confirmedSections 2024-12-19 16:53:19.226176
DEBUG:root:Querying confirmedSection took 0.0004892349243164062
DEBUG:root:Querying stage modes 2024-12-19 16:53:19.226729
DEBUG:root:Querying stage modes took 0.0005838871002197266
DEBUG:root:Section query with ground truth 2024-12-19 16:53:19.227361
DEBUG:root:Training set total size = 0
DEBUG:root:Getting section query with ground truth took 0.00034809112548828125
DEBUG:root:confirmedSectionCount = 0
INFO:root:initial loadTrainingDataStep DONE
DEBUG:root:finished loading current training set, now loading from backup!
DEBUG:root:START TRAINING DATA STEP
DEBUG:root:Section data set size = 0
DEBUG:root:Getting dataset size took 0.003607034683227539
DEBUG:root:Querying confirmedSections 2024-12-19 16:53:19.231980
DEBUG:root:Querying confirmedSection took 0.0003788471221923828
DEBUG:root:Querying stage modes 2024-12-19 16:53:19.232415
DEBUG:root:Querying stage modes took 0.0003960132598876953
DEBUG:root:Section query with ground truth 2024-12-19 16:53:19.232852
DEBUG:root:Training set total size = 0
DEBUG:root:Getting section query with ground truth took 0.0006999969482421875
INFO:root:loadTrainingDataStep DONE
DEBUG:root:Trying to find cluster locations for 0 trips
DEBUG:root:No points found in cluster input, nothing to fit..
DEBUG:root:Trying to find cluster locations for 0 trips
DEBUG:root:No points found in cluster input, nothing to fit..
DEBUG:root:Trying to find cluster locations for 0 trips
DEBUG:root:No points found in cluster input, nothing to fit..
INFO:root:generateBusAndTrainStopStep DONE
DEBUG:root:created data structures of size 0
INFO:root:generateFeatureMatrixAndResultVectorStep DONE
DEBUG:root:Stripped trips with mode: run 0, transport 0, mixed 0, unknown 0 unstripped 0
DEBUG:root:Stripping out distanceOutliers (array([], dtype=int64),), speedOutliers (array([], dtype=int64),), speedMeanOutliers (array([], dtype=int64),), speedVarianceOutliers (array([], dtype=int64),), maxSpeedOutliers (array([], dtype=int64),)
DEBUG:root:nonOutlierIndices.shape = 0
INFO:root:cleanDataStep DONE
DEBUG:root:generic features = [0, 1, 4, 5, 6, 7, 8]
DEBUG:root:advanced features = [10, 11, 12]
DEBUG:root:location features = [13, 14, 15, 16]
DEBUG:root:time features = [17, 18]
DEBUG:root:bus train features = [19, 20, 21]
INFO:root:selectFeatureIndicesStep DONE
Traceback (most recent call last):
  File "/Users/ashrest2/NREL/2024_RF_Server/e-mission-server/bin/analysis/create_static_model.py", line 8, in <module>
    seed_pipeline.runPipeline()
  File "/Users/ashrest2/NREL/2024_RF_Server/e-mission-server/emission/analysis/classification/inference/mode/seed/pipeline.py", line 67, in runPipeline
    self.model = self.buildModelStep()
  File "/Users/ashrest2/NREL/2024_RF_Server/e-mission-server/emission/analysis/classification/inference/mode/seed/pipeline.py", line 328, in buildModelStep
    model = forestClf.fit(self.selFeatureMatrix, self.cleanedResultVector)
  File "/Users/ashrest2/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 345, in fit
    X, y = self._validate_data(
  File "/Users/ashrest2/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/sklearn/base.py", line 565, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/Users/ashrest2/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/sklearn/utils/validation.py", line 1106, in check_X_y
    X = check_array(
  File "/Users/ashrest2/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/sklearn/utils/validation.py", line 931, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 13)) while a minimum of 1 is required by RandomForestClassifier.
(emission) ashrest2-41625s:e-mission-server ashrest2$

Well, the commit message states it is used to create and save a model based on old-style data.
Given that the training set total size is 0, I see two possibilities: either the script is not loading the proper database, or the data structure has changed and this script is outdated (see the quick check sketched below).
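A quick way to distinguish between the two, using direct pymongo access (a hypothetical check script, not part of the branch):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["openpath_prod_usaid_laos_ev"]

# If the database loaded correctly, its collections should be listed here
print(db.list_collection_names())
# If the data structure changed, this moves-era query should return 0
print(db.Stage_Sections.count_documents({'type': 'move'}))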

@shankari
Contributor Author

shankari commented Dec 20, 2024

Well, the commit message states it is used to create and save a model based on old-style data.

This is the key. Back in 2014, we first started by integrating with an existing app called Moves, that did the data collection and we pulled the data using OAuth. That is how I was able to set up a basic data collection platform as part of a class project. However, we quickly realized that Moves had made decisions around data collection frequency and accuracy and we did not have the ability to control it - which is always a challenge with closed source. So we wrote our own and came up with (IMHO) a better data model.

As an aside, Moves was acquired by Facebook in 2014 and shut down in 2018, so that was a good choice.

So the original model was built using moves-format data; the model building now needs to move to openpath-format data.
Note that the model application was already working with openpath-style data because I ported it over.
I just didn't have new labeled data for model building at the time, so I did not port over the model building.
You need to port over the model building and build a model for Laos (and then for other projects).

Apparently, we do not have seed_model.json.

Note also that this branch should have another file called seed_model....json which you can use temporarily to verify that the pipeline works. I think I have a copy of the trained model from 2014 as well, but I cannot publish it because RF can theoretically leak data, and I do not have permission to share the data that we collected for the class project (no IRB).
Once you have verified that the pipeline works, you should retrain the model.

@iantei
Contributor

iantei commented Dec 20, 2024

Replacing seed_model.json with seed_model_from_test_data.json seems to make the pipeline work.
However, the script has been running for more than 4 hours at this point, and doesn't seem close to completion.

The pipeline logged a warning stating:

/Users/ashrest2/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/sklearn/base.py:299: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.2 when using version 1.2.1. This might lead to breaking code or invalid results. Use at your own risk.

This makes sense considering there was a change to setup/environment36.yml:
from -scikit-learn=0.23.2 to -scikit-learn=1.2.1 last year: e-mission/e-mission-server@fd8526e

I will wait a couple more hours for this pipeline to complete. If it doesn't finish in a reasonable time, I will try a smaller dataset, just to see the pipeline script run to completion.

@iantei
Contributor

iantei commented Dec 24, 2024

Trying the above with seed_model_from_test_data.json and using the smaller WashingtonCommons dataset (~50 MB), the pipeline ran successfully and completed within ~3 minutes.

There is still the warning -

/Users/ashrest2/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/sklearn/base.py:299: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.2 when using version 1.2.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

@iantei
Contributor

iantei commented Dec 24, 2024

Well, using scikit-learn=0.23.2 throws up the following error:

AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

This issue is likely because we are using -numpy=1.24.2 in setup/environment36.yml referring to changes done here - e-mission/e-mission-server@fd8526e

As we only wanted to see whether the pipeline would run fine with seed_model_from_test_data.json on a smaller dataset, this should be fine, given that we will be training the model with the newer scikit-learn version anyway.

The script (emission) $ ./e-mission-py.bash bin/intake_multiprocess.py 4 ran pretty much successfully. That likely validates that database access is working?

The next task would be to figure out the changes required to load the new version of the dataset and train the model.

@iantei
Contributor

iantei commented Dec 25, 2024

What happens when we run the script - bin/analysis/create_static_model.py:

import emission.analysis.classification.inference.mode.seed.pipeline as pipeline

1.
    seed_pipeline = pipeline.ModeInferencePipelineMovesFormat()
    seed_pipeline.runPipeline()

2.
    allConfirmedTripsQuery = ModeInferencePipelineMovesFormat.getSectionQueryWithGroundTruth({'$ne': ''})

  (self.confirmedSectionCount, self.confirmedSections) = self.loadTrainingDataStep(allConfirmedTripsQuery)

    backupSections = MongoClient(edb.url).Backup_database.Stage_Sections
    (self.backupSectionCount, self.backupConfirmedSections) = self.loadTrainingDataStep(allConfirmedTripsQuery, backupSections)

    (self.bus_cluster, self.train_cluster) = self.generateBusAndTrainStopStep() 

    (self.featureMatrix, self.resultVector) = self.generateFeatureMatrixAndResultVectorStep()

    (self.cleanedFeatureMatrix, self.cleanedResultVector) = self.cleanDataStep()

    self.selFeatureIndices = self.selectFeatureIndicesStep()

    self.selFeatureMatrix = self.cleanedFeatureMatrix[:,self.selFeatureIndices]
    self.model = self.buildModelStep()

    self.saveModelStep()

Focusing first on self.buildModelStep(), since the error log traces back to it.

  def buildModelStep(self):
    from sklearn import ensemble
    forestClf = ensemble.RandomForestClassifier()
    model = forestClf.fit(self.selFeatureMatrix, self.cleanedResultVector)

Now it's time to trace back how selFeatureMatrix and cleanedResultVector are built.

@iantei
Contributor

iantei commented Dec 25, 2024

Details of exploration of forestClf.fit(self.selFeatureMatrix, self.cleanedResultVector)
(Inside - emission/analysis/classification/inference/mode/seed/pipeline.py/ModeInferencePipelineMovesFormat)

import emission.analysis.config as each

For `selFeatureMatrix`:
self.selFeatureMatrix = self.cleanedFeatureMatrix[:,self.selFeatureIndices]

Looking at what `selFeatureIndices` is composed of:
1.
self.selFeatureIndices = self.selectFeatureIndicesStep()

2.
selectFeatureIndicesStep()
Based on whether the config needs the advanced feature indices or the bus/train feature indices, it chooses which retIndices to return:

DEBUG:root:generic features = [0, 1, 4, 5, 6, 7, 8]
DEBUG:root:advanced features = [10, 11, 12]
DEBUG:root:location features = [13, 14, 15, 16]
DEBUG:root:time features = [17, 18]
DEBUG:root:bus train features = [19, 20, 21]

retIndices = genericFeatureIndices (+ advancedFeatureIndices) (+ busTrainFeatureIndices) - see the sketch below
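Roughly, reconstructed from the debug output above (the flag names are assumptions, not the actual config keys):

def select_feature_indices(use_advanced=True, use_bus_train=True):
    # Index groups from the debug output above; location and time features
    # are apparently not selected (7 + 3 + 3 = the 13 columns in the error)
    generic = [0, 1, 4, 5, 6, 7, 8]
    advanced = [10, 11, 12]
    bus_train = [19, 20, 21]
    ret = list(generic)
    if use_advanced:
        ret += advanced
    if use_bus_train:
        ret += bus_train
    return ret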


Looking at what cleanedFeatureMatrix is composed of:
(self.cleanedFeatureMatrix, self.cleanedResultVector) = self.cleanDataStep()

It's the first element of the cleanDataStep() return value. The second element, cleanedResultVector, is also of use to us in buildModelStep.

Looking at what both cleanedFeatureMatrix and cleanedResultVector are composed of - the way to do that is to understand what happens in cleanDataStep():

  • Pretty much all the values showing up in the logs indicate that no data was processed at all:
DEBUG:root:Stripped trips with mode: run 0, transport 0, mixed 0, unknown 0 unstripped 0
DEBUG:root:Stripping out distanceOutliers (array([], dtype=int64),), speedOutliers (array([], dtype=int64),), speedMeanOutliers (array([], dtype=int64),), speedVarianceOutliers (array([], dtype=int64),), maxSpeedOutliers (array([], dtype=int64),)
DEBUG:root:nonOutlierIndices.shape = 0

Let's start working our way backwards.

@iantei
Contributor

iantei commented Dec 25, 2024

Tracing from the start of runPipeline():

    allConfirmedTripsQuery = ModeInferencePipelineMovesFormat.getSectionQueryWithGroundTruth({'$ne': ''})

It looks for type:move and applies getModeQuery($ne='').
getModeQuery returns 'confirmed_mode': {$ne: ''} - basically it matches all documents whose confirmed_mode is not empty.
Is confirmed_mode guaranteed to exist? It looks so.
I haven't observed corrected_mode in MongoDB before, but I might have missed it.


Let's start looking at loadTrainingDataStep(allConfirmedTripsQuery):

  logging.debug("Section data set size = %s" % sectionDb.count_documents({'type': 'move'}))

DEBUG:root:Section data set size = 0

Tracing back to this code:

 def getSectionQueryWithGroundTruth(groundTruthMode):
    return {"$and": [{'type': 'move'},
                     ModeInferencePipelineMovesFormat.getModeQuery(groundTruthMode)]}
Either we do not have 'type':'move' in MongoDB anymore, or something else is happening.

We were able to run bin/intake_multiprocess.py, which likely means database access was successful.

a. Need to validate whether we are able to load the database properly or not.
b. If yes, look at the structure of the MongoDB documents and see whether there is a 'type':'move' or not.

@iantei
Contributor

iantei commented Dec 25, 2024

Looking further inside: mode/seed/pipeline.py/ModeInferencePipelineMovesFormat:

def loadTrainingDataStep(self, sectionQuery, sectionDb = None)
sectionDb = self.Sections

Looking into what self.Sections refers to.

import emission.core.get_database as edb
self.Sections = edb.get_section_db()

Inside emission/core/get_database.py:

_current_db = MongoClient(url, uuidRepresentation='pythonLegacy')[db_name]

def _get_current_db():
    return _current_db

def get_section_db():
    Sections= _get_current_db().Stage_Sections
    return Sections

Basically, it's looking up for Collection in MongoDB with the name Stage_Sections.

Looked up the collection in MongoDB via the Mongo shell:

em-public-dashboard ashrest2$ docker exec -it em-public-dashboard-db-1 mongo
use <database_name>
show collections
  • Could not find a collection with the name Stage_Sections

Let's see if we can print out the list of collections from the logging mechanism, so we can be more certain that the server is connecting to the database properly.

After modifying get_section_db() to list the db name and collections, I could see the exact collections from the washingtoncommons db properly, but couldn't find a Stage_Sections collection.


def get_section_db():
    # Debug: print the database name and its collections
    db = _get_current_db()
    print(db.name)
    print("DB collections : ", db.list_collection_names())
    return db.Stage_Sections

Do we need the Stage_Sections at this stage?

@iantei
Contributor

iantei commented Dec 25, 2024

Also, I looked up db.Stage_analysis_timeseries.find() to check the data in the dataset. We are able to fetch all the data from the database.
The connection from the server to the database is working fine.

Need to understand how this Stage_Sections is useful or redundant in the new training with the newer version of the dataset.

@iantei
Contributor

iantei commented Dec 25, 2024

AS: Either we do not have 'type':'move' in MongoDB anymore, or something else is happening.

KS: So the original model was built using moves-format data; the model building now needs to move to openpath-format data.
Note that the model application was already working with openpath-style data because I ported it over.

The return for getSectionQueryWithGroundTruth:

{'$and': [{'type': 'move'}, {'$or': [{'$and': [{'corrected_mode': {'$exists': True}}, {'corrected_mode': {'$ne': ''}}]}, {'confirmed_mode': {'$ne': ''}}]}]}

It looks for
'type':'move' AND 'corrected_mode'
OR
'type':'move' AND 'confirmed_mode'

Let's remove the 'type':'move' and see what happens.
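Dropping that filter would leave just the ground-truth clause, i.e. (sketch):

# The same query with the moves-era {'type': 'move'} filter removed
query = {'$or': [{'$and': [{'corrected_mode': {'$exists': True}},
                           {'corrected_mode': {'$ne': ''}}]},
                 {'confirmed_mode': {'$ne': ''}}]}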

@iantei
Contributor

iantei commented Dec 26, 2024

Let's remove the 'type':'move' and see what happens.

This seems like the right call, but what we don't have is the Stage_Sections collection required by the call to loadTrainingDataStep(allConfirmedTripsQuery, backupSections).

KS: So the original model was built using moves-format data; the model building now needs to move to openpath-format data.

Finally coming to this part now; let's see how we can configure the model training to work with the openpath-format data.

@iantei
Contributor

iantei commented Jan 2, 2025

Well, going through the code flow in /seed/pipeline.py:

At a very high level, we have the following steps inside runPipeline():

  1. loadTrainingDataStep() - we load the dataset from the database through calls to specific collections
  2. generateBusAndTrainStopStep() - bus_cluster and train_cluster are returned, using DBSCAN
  3. generateFeatureMatrixAndResultVectorStep() - creates a featureMatrix, a two-dimensional NumPy array with dimensions number_of_sections x number_of_features,
    and a resultVector, a one-dimensional NumPy array of length number_of_sections.
    We fill in the featureMatrix with self.updateFeatureMatrixRowWithSection(), which updates the corresponding elements in the featureMatrix,
    and we fill up the resultVector with self.getGroundTruthMode(section)
  4. cleanDataStep() - removes specific outliers corresponding to feature characteristics like excess speed, among others
  5. selectFeatureIndicesStep() - based on the availability of AdvancedFeaturesIndices or BusTrainFeatureIndices from the config, choose the indices
  6. self.selFeatureMatrix = self.cleanedFeatureMatrix[:, self.selFeatureIndices]
  7. buildModelStep() - uses selFeatureMatrix and cleanedResultVector
  8. save the model

We have tight coupling with the old Moves data model. There are multiple usages of Sections, which I would presume to be a subdivision of a trip.

@iantei
Contributor

iantei commented Jan 2, 2025

What are the current features used in the above model creation?

Features are:
1. distance
2. duration
3. first filter mode
4. sectionId
5. avg speed 
6. speed EV
7. speed variance
8. max speed
9. max accel
10. isCommute
11. heading change rate (currently unfilled)
12. stop rate (currently unfilled)
13. velocity change rate (currently unfilled)
14. start lat
15. start lng
16. stop lat
17. stop lng
18. start hour
19. end hour
20. both start and end close to bus stop
21. both start and end close to train station
22. both start and end close to airport

The features listed in the paper are:

1. Distance of a segment (Dist)
2. The ith maximum velocity of a segment (MaxVi)
3. The ith maximum acceleration of a segment (MaxAi)
4. Average velocity of a segment (AV)
5. Expectation of velocity of GPS points in a segment (EV)
6. Variance of velocity of GPS points in a segment (DV)
7. Heading change rate (HCR)
8. Stop Rate (SR)
9. Velocity Change Rate (VCR)

[Quoted from Understanding Mobility Based on GPS Data]
What is Stop Rate (SR) ?

People walking on a route would become more likely than
other modes to stop somewhere for many reasons, such as
talking with passer-by, attracted by surrounding
environments, waiting for a bus, etc.
The SR stands for the number of GPS points with velocity 
below a certain threshold within a unit distance. 

SR (Walk) > SR (Bus) > SR (Driving).

What is Heading change rate (HCR)?

heading directions of different
transportation modes differ greatly in being constrained by
the real route while being independent of traffic conditions.

What is Velocity Change Rate (VCR)?

VCR as the number of GPS points with
a velocity change percentage above a certain threshold
within a unit distance.
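A sketch of how these three rates could be computed from a section's point series (the thresholds are illustrative placeholders - the paper tunes them empirically - and heading wraparound at 360° is ignored for brevity):

import numpy as np

def rate_features(speeds, headings, distance_m,
                  stop_speed=0.6, heading_thresh=19.0, vcr_thresh=0.7):
    # speeds: per-point speeds (m/s); headings: per-point bearings (degrees);
    # distance_m: total section length (m); returns HCR, SR, VCR per km
    v = np.asarray(speeds, dtype=float)
    h = np.asarray(headings, dtype=float)
    dist_km = max(distance_m / 1000.0, 1e-6)

    # HCR: points whose heading changes by more than the threshold, per km
    hcr = np.count_nonzero(np.abs(np.diff(h)) > heading_thresh) / dist_km
    # SR: points slower than the "stopped" threshold, per km
    sr = np.count_nonzero(v < stop_speed) / dist_km
    # VCR: points whose relative velocity change exceeds the threshold, per km
    prev = v[:-1]
    valid = prev > 0
    vcp = np.zeros_like(prev)
    vcp[valid] = np.abs(np.diff(v))[valid] / prev[valid]
    vcr = np.count_nonzero(vcp > vcr_thresh) / dist_km
    return hcr, sr, vcr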

Note: We have pretty much all of the features listed in the paper included in the existing model-training featurization, with a few notable additions like start and end close to a bus station/train station/airport.

Before exploring this aspect of featurization further, I would like to focus on using the new data model for model training instead of the Moves-specific data model. Part of my rationale is that the current featurization seems good.

@shankari
Contributor Author

shankari commented Jan 2, 2025

Finally coming to this part now; let's see how we can configure the model training to work with the openpath-format data.
We have tight coupling with the old Moves data model. There are multiple usages of Sections, which I would presume to be a subdivision of a trip.

Note that the model application has already been ported over to openpath. So the functions to extract features from openpath data format all exist. You just need to change the model training to use them as well.

@iantei
Contributor

iantei commented Jan 3, 2025

Identified the functions to extract features from the openpath data format in /decorations/analysis_timeseries_queries.py,

which are used in the model application /mode/pipeline.py.

@iantei
Contributor

iantei commented Jan 9, 2025

I tried to port the functional changes from /mode/pipeline.py into /mode/seed/pipeline.py,

but got stuck on an unexpected, and likely silly, issue. I have been trying to resolve it since yesterday and feel a bit stuck in a loop.

How did it begin?
After incorporating some changes into mode/seed/pipeline.py, I tried to run the ./e-mission-py.bash bin/intake_multiprocess.py script, expecting things to move forward.
Well, I came across this error:

(emission) ashrest2-41625s:e-mission-server ashrest2$ ./e-mission-py.bash bin/intake_multiprocess.py 
storage not configured, falling back to sample, default configuration
URL not formatted, defaulting to "Stage_database"
Connecting to database URL localhost
Traceback (most recent call last):
  File "/Users/ashrest2/NREL/2025_2/e-mission-server/bin/intake_multiprocess.py", line 14, in <module>
    import emission.pipeline.scheduler as eps
  File "/Users/ashrest2/NREL/2025_2/e-mission-server/emission/pipeline/scheduler.py", line 16, in <module>
    import emission.storage.timeseries.aggregate_timeseries as estag
  File "/Users/ashrest2/NREL/2025_2/e-mission-server/emission/storage/timeseries/aggregate_timeseries.py", line 15, in <module>
    import emission.storage.timeseries.builtin_timeseries as bits
  File "/Users/ashrest2/NREL/2025_2/e-mission-server/emission/storage/timeseries/builtin_timeseries.py", line 16, in <module>
    import emission.core.wrapper.entry as ecwe
  File "/Users/ashrest2/NREL/2025_2/e-mission-server/emission/core/wrapper/entry.py", line 11, in <module>
    import emission.core.wrapper.wrapperbase as ecwb
  File "/Users/ashrest2/NREL/2025_2/e-mission-server/emission/core/wrapper/wrapperbase.py", line 9, in <module>
    import attrdict as ad
  File "/Users/ashrest2/miniconda-23.1.0/envs/emission/lib/python3.11/site-packages/attrdict/__init__.py", line 5, in <module>
    from attrdict.mapping import AttrMap
  File "/Users/ashrest2/miniconda-23.1.0/envs/emission/lib/python3.11/site-packages/attrdict/mapping.py", line 4, in <module>
    from collections import Mapping
ImportError: cannot import name 'Mapping' from 'collections' (/Users/ashrest2/miniconda-23.1.0/envs/emission/lib/python3.11/collections/__init__.py)

Potential reason for the above error: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working.
It was supposed to be python 3.9, but it's using 3.11.

But that made me wonder how it was working just fine earlier and stopped working now.
I cloned the main server code and tried to follow the same steps - we are using python 3.9 inside the emission venv there, which does not result in the above issue.

What did I try?

  • I tried to re-install miniconda-23.1.0 by removing the directory /Users/<user_name>/miniconda-23.1.0
  • Landed on this error: CondaValueError: You have chosen a non-default solver backend (libmamba) but it was not recognized. Choose one of: classic
  • Solution to the above CondaValueError - https://stackoverflow.com/questions/77617946/solve-conda-libmamba-solver-libarchive-so-19-error-after-updating-conda-to-23
  • Tried it, then re-ran source setup/setup.sh
  • Then tried conda activate emission
  • Ended up with the same Python 3.11 and a similar error.
  • Tried to re-clone this branch, but that didn't fix the issue, which likely arises from something in miniconda.

@shankari Any inputs would be appreciated.

@Abby-Wheelis
Member

Potential reason for the above error: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working.
It was supposed to be python 3.9, but it's using 3.11.

I could be missing something here, but have you tried changing the import style as the error message indicates? Maybe that would let you get past this error and then resolve the difference in python versions as a separate issue?
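For reference, the fix the error message points to would be, in attrdict/mapping.py (a third-party file, so patching it locally would only be a stopgap):

# attrdict/mapping.py currently does:
#   from collections import Mapping
# On Python 3.10+ the ABCs live in collections.abc, so it would become:
from collections.abc import Mapping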

@shankari
Contributor Author

I am not sure what the issue is

It was supposed to be python 3.9, but it's using 3.11.

What was supposed to be python 3.9 and what is using 3.11?
It seems pretty clear what the error is - the error said that the code would stop working in 3.11 and it did
you can either upgrade python or change the code

we are using python 3.9 inside the emission venv there, which does not result in the above issue

If you are using python 3.9 within the emission venv, and it is not resulting in the above issue, then what is the problem?

@shankari shankari moved this from Questions for Shankari to Issues being worked on in OpenPATH Tasks Overview Jan 10, 2025
@iantei
Contributor

iantei commented Jan 10, 2025

Apologies for the ambiguous issue update.

There was some issue with the miniconda version (23.1.0) on my local system, which we are using in random-forest-mode-detection.

Since we are using miniconda version (23.5.2) in the main branch and weren't observing any issues there, I decided to use miniconda version (23.5.2) instead of (23.1.0) for the random-forest-mode-detection branch changes as well, and it seems to be working just fine.

I could be missing something here, but have you tried changing the import style as the error message indicates? Maybe that would let you get past this error and then resolve the difference in python versions as a separate issue?

I deemed changing the import style not to be the right solution, since we have identical code in the main branch and are not facing any issues there. Making these changes would be futile, as we would have to roll them back when we port this branch to the main branch.

@iantei
Contributor

iantei commented Jan 17, 2025

Documenting my improved understanding of trips, after a discussion with Jack.

  • We have multiple sections in a trip, which can arise when the user changes the mode of commute within a trip.
  • Also, we do not have the concept of a confirmed_section in OpenPATH; rather, we just have confirmed_trip, because we ask users to label a trip after its completion. However, if we have to get the confirmed_section, we would need to look up the confirmed_trip associated with the cleaned_section and derive the confirmed_section from it (rough sketch below).
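A rough sketch of that lookup (a hypothetical helper, not existing server code; it assumes confirmed trips and cleaned sections both expose data.start_ts / data.end_ts):

import emission.storage.decorations.analysis_timeseries_queries as esda

def confirmed_trip_for_section(user_id, cleaned_section, time_query=None):
    # Find the confirmed_trip whose time span contains this cleaned_section;
    # its user label would then apply to the derived "confirmed section"
    trips = esda.get_entries("analysis/confirmed_trip", user_id,
                             time_query=time_query)
    for trip in trips:
        if (trip.data.start_ts <= cleaned_section.data.start_ts
                and trip.data.end_ts >= cleaned_section.data.end_ts):
            return trip
    return None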

I also observed this when I looked up the functions to retrieve these.
It seems we have two approaches - using the functions available in:

  • emission/storage/decorations/analysis_timeseries_queries (esda) - get_entries()
  • emission/storage/timeseries/builtin_timeseries/BuiltinTimeSeries/find_entries() via esta.TimeSeries.get_aggregate_time_series()

The available keys in the case of esda are just 'analysis/confirmed_trip', 'analysis/cleaned_section' and 'analysis/inferred_section', amongst others. We do not have a key for analysis/confirmed_section.

Moreover, we represent the trip-related information about inferred trips rather than about the sections within a trip.

I feel it might be a good idea to infer the trip information on trip level rather than section level.


We have a model application (https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/pipeline.py) which uses the following approach -

This is not utilized in the current master branch of e-mission-server:

self.toPredictSections = esda.get_entries(esda.CLEANED_SECTION_KEY, user_id, 
        time_query=timerange)

@shankari
Could you please let me know whether it would be a good idea to predict the mode of commute at the trip level, rather than the section level? Also, please let me know if my above understanding is correct.

@iantei iantei moved this from Issues being worked on to Questions for Shankari in OpenPATH Tasks Overview Jan 17, 2025
@shankari
Contributor Author

@iantei at this stage, we don't really have an option. It would be great to re-center multi-modality in our work, but that is beyond the scope of this task.

@shankari shankari moved this from Questions for Shankari to Issues being worked on in OpenPATH Tasks Overview Jan 21, 2025