Investigate deployment-level models for mode inference based on trajectory characteristics #1100
@iantei @Abby-Wheelis for visibility
I went through the paper that discusses the features, and we have incorporated all of them, with the addition of a few more considerations, in https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/pipeline.py:
I followed the instructions mentioned in https://github.com/e-mission/e-mission-docs/blob/master/docs/install/manual_install.md to run the server.
while loading the docker-compose for the public-dashboard with port 27017 exposed.
My initial idea was to run the script with a different dataset, i.e. the Laos EV dataset, since it would have data for both cars and motorcycles. However, I came across the error below while executing the script intake_multiprocess.py. Elaborated call stack -
It complains about habitica.json not being available.
Also, I am not sure whether it is using the right Laos database instead of Stage_database. Now, in the following file -
There is no runPipeline() function in the class ModeInferencePipeline(). @shankari Could you please help me understand the following:
It seems the above pipeline uses the trained model, and we need to incorporate the training/test split and cross-validation aspects. Can I build up these requirements on top of the above pipeline code, once I am able to run the pipeline with the Laos EV dataset?
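For reference, the training/test split and cross-validation could be wrapped around the model-building step roughly as sketched below. This is a minimal sketch, not the project's actual code: the variable names `cleanedFeatureMatrix` / `cleanedResultVector` only mirror the seed pipeline's naming, and the data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-ins for the cleaned feature matrix and mode labels
rng = np.random.default_rng(42)
cleanedFeatureMatrix = rng.normal(size=(200, 5))    # 200 sections, 5 features
cleanedResultVector = rng.integers(0, 3, size=200)  # 3 hypothetical mode labels

# Hold out 30% of the data for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    cleanedFeatureMatrix, cleanedResultVector, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation on the training portion only
scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training set and evaluate on the held-out test set
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

With random labels the accuracy here is meaningless; the point is only the shape of the split/CV workflow around the existing `buildModelStep`-style fit.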
There needs to be a slight modification in the approach to load the MongoDB database correctly - since MongoDB is exposed on port 27017, we need to point the host at localhost:27017 instead of db.
Running with this change, it loads the right database, i.e. openpath_prod_usaid_laos_ev. [Tested these changes with the latest server code; there need to be additional changes in the existing RF branch, likely introducing the db.conf file, which is missing]
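The host override discussed here can be sketched as a small helper: inside docker-compose the hostname is `db`, but when the port is published to the host machine we need `localhost:27017` instead. The helper name and the fallback behaviour are illustrative, not the server's actual config code.

```python
import os

def get_db_host(default="localhost:27017"):
    # Hypothetical helper: prefer the DB_HOST environment variable
    # (as with `export DB_HOST=...` on the master branch), falling
    # back to the locally exposed MongoDB port.
    return os.environ.get("DB_HOST", default)

# Build a MongoDB connection URL from the resolved host
url = "mongodb://%s/" % get_db_host()
```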
This is an old branch, which doesn't have the updates from the latest master branch. I am inclined towards the idea of understanding the workflow in which the RF model pipeline is trained, the model is saved and run with the new dataset, and then porting these changes to master and working in a new PR altogether.
We are not using habitica.json in the project anymore, so the error/warning can be ignored. Alternatively, any code which specifies its usage can be analysed and removed.
I am not sure where model training and evaluation is planned here. I do not anticipate significant changes for porting - the data model has been unchanged for the past 5 years. I would encourage you to build and evaluate the model before porting, so we can get impactful results quickly and don't spin our wheels on porting something we won't use.
Yes, the model training is not done in https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/pipeline.py. Rather, it just loads the model saved here - where the training of the model is done:
I couldn't find a place where the model evaluation is done, but it does refer to 70% accuracy in one of the comments in mode/pipeline.py.
I know that we don't do model training in the code. My comment was more around the model training step in your plan. Can you write out your proposed plan of action in bullet points (e.g., 1. I will do this first, 2. I will do this next, ...)
We generally don't do model evaluation in a production system (!!) - that is done offline. The 70% accuracy is on the ad-hoc dataset that I collected as part of a class project in 2014; I had just started the PhD program. The model accuracy will depend on the input dataset, which will depend on the deployment. You will need to evaluate this on a few deployments, starting with Laos.
Here's my proposed plan of action (at a high level): Work on the
Please let me know your thoughts on this. Note: I am trying to figure out the issue with the db.conf file (config file issues), referring to the initial PR we had some discussion about in the op-admin repository.
Probably not an elegant solution to the issue, but I updated the
This change helped me use the right mongo db. I am not looking for a clean fix here, since there is no issue with the updated master branch, where we can `export DB_HOST=` and it works as expected.
What happens when we run the script? Documenting the steps so that they help me recall things better and more easily later, too.
Apparently, we do not have that available. This led to some interesting error logs in the terminal, but the script continued and completed execution without crashing:
Well, I tried to run another script
Well, the commit message states it is used to create and save a model based on old-style data.
This is the key. Back in 2014, we first started by integrating with an existing app called Moves, which did the data collection; we pulled the data using OAuth. That is how I was able to set up a basic data collection platform as part of a class project. However, we quickly realized that Moves had made decisions around data collection frequency and accuracy and we did not have the ability to control them - which is always a challenge with closed source. So we wrote our own and came up with (IMHO) a better data model. As an aside, Moves was acquired by Facebook in 2014 and shut down in 2018, so that was a good choice. So the original model was built using moves-format data; the model building now needs to move to openpath-format data.
Note also that this branch should have another file called
After replacing it, the pipeline logged a warning stating
This makes sense considering the change there. I will wait a couple more hours for this pipeline to complete. If it doesn't finish in a reasonable time, I will try with a smaller dataset, just to see the pipeline execution script run to completion.
Trying the above with a smaller dataset. There is still the warning -
Well, using
This issue is likely because of what we are using. As we only wanted to see if the pipeline would run fine with it, the script served its purpose. Next task: figure out the changes required to load the new version of the dataset and conduct the training of the model.
What happens when we run the script -
Let me start by focusing on
Now, it's time to trace back how
Details of exploration of
Looking at what this is: it's the first return value of cleanDataStep(). The second return value, cleanedResultVector, is also of use to us in buildModelStep(). Looking at what both are:
Let's start working our way backwards.
Tracing from the beginning of
It looks for type:move and for getModeQuery($ne=''). Let's start looking at the
Tracing back to this code:
We were able to run the script. a. Need to validate whether we are able to load the database properly or not.
Looking further inside: mode/seed/pipeline.py/ModeInferencePipelineMovesFormat:
Looking into what
Inside
Basically, it's looking up a collection in MongoDB by that name. I then looked up the collection in MongoDB via the Mongo shell:
Let's see if we can print out the list of collections via the logging mechanism, so we can be more certain we are connecting the server to the database properly. After modifying get_section_db() to list the db name and collections, I could see the exact collections from the washingtoncommons db properly, but couldn't find the collection named
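The collection check described above can be sketched with a small pure helper plus a pymongo lookup. The helper and function names are illustrative; the pymongo part assumes a MongoDB instance reachable on localhost:27017, as in the exposed-port setup above, and is not exercised here.

```python
def find_matching_collections(names, fragment):
    # Pure helper: filter a list of collection names for a fragment,
    # so we can check whether e.g. a sections-style collection exists.
    return [n for n in names if fragment.lower() in n.lower()]

def list_section_collections(host="localhost", port=27017, fragment="section"):
    # Requires pymongo and a reachable MongoDB server; shown for
    # illustration only. Logs which databases contain matching collections.
    import pymongo
    client = pymongo.MongoClient(host, port)
    return {
        db_name: find_matching_collections(
            client[db_name].list_collection_names(), fragment)
        for db_name in client.list_database_names()
    }
```

Calling `list_section_collections()` against the running server would show, per database, which collections match, making it easy to confirm whether the expected collection exists in the database the server actually connected to.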
Do we need the
Also, looked it up. Need to understand and see how this works.
The return for
It looks for that. Let's remove the 'type':'move' filter and see what happens.
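The query change being tried here can be sketched as below. The field name `confirmed_mode` is illustrative (standing in for whatever getModeQuery($ne='') matches on); the point is that the 'type': 'move' filter comes from the old Moves-format entries, so dropping it lets the same query match openpath-format sections.

```python
def get_training_query(include_moves_type=True):
    # Sections with a non-empty ground-truth mode; 'confirmed_mode'
    # is a hypothetical field name for illustration.
    query = {"confirmed_mode": {"$ne": ""}}
    if include_moves_type:
        # Old Moves-format entries carried 'type': 'move'; openpath
        # sections do not, so this filter excludes them.
        query["type"] = "move"
    return query
```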
This seems like the right call, but what we don't have is
Finally getting through this part; let's see how we can configure the model training to adhere to the openpath-format data.
Well, going through the code flow in /seed/pipeline.py, at a very high level we have the following steps:
We have tight coupling with the old data model of
What are the current features used in the above model creation?
The features enlisted in the paper are below:
[Quoted from Understanding Mobility Based on GPS Data]
What is Heading change rate (HCR)?
What is Velocity Change Rate (VCR)?
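Per the paper, HCR counts the heading changes above some threshold and normalises by trip distance, and VCR does the same for relative velocity changes. A minimal sketch of both, with illustrative threshold values and input shapes (headings in degrees, speeds in m/s, distance in metres):

```python
def heading_change_rate(headings_deg, distance_m, threshold_deg=19.0):
    # Count consecutive heading changes greater than the threshold,
    # handling wrap-around at 360 degrees, then normalise by distance.
    changes = sum(
        1 for a, b in zip(headings_deg, headings_deg[1:])
        if min(abs(b - a), 360 - abs(b - a)) > threshold_deg)
    return changes / distance_m

def velocity_change_rate(speeds_mps, distance_m, threshold=0.26):
    # Count points where the relative speed change exceeds the
    # threshold, then normalise by distance.
    changes = sum(
        1 for v1, v2 in zip(speeds_mps, speeds_mps[1:])
        if v1 > 0 and abs(v2 - v1) / v1 > threshold)
    return changes / distance_m
```

The intuition is that walking and cycling produce frequent heading changes per unit distance, while cars and buses on roads change heading less often; VCR similarly separates stop-and-go modes from smoother ones.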
Note: We have pretty much all of the features listed in the paper included in the existing model-training featurization, with a few notable additions like start and end close to a bus station/train station/airport. Before exploring this aspect of featurization, I would like to explore the use of the new data model for model training over the Moves-specific data model. Part of my rationale is that the current featurization seems good.
Note that the model application has already been ported over to openpath. So the functions to extract features from the openpath data format all exist. You just need to change the model training to use them as well.
Identified the functions to extract features from the openpath data format in /decorations/analysis_timeseries_queries.py; they are used in the model application /mode/pipeline.py.
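For a sense of what section-level featurization against openpath-style entries looks like, here is a minimal sketch. The field names (`distance` in metres, `duration` in seconds, nested under `data`) are assumptions about the section data model, and the function itself is illustrative rather than the project's actual extraction code.

```python
def extract_basic_features(section):
    # 'section' is assumed to be an openpath-style entry dict with a
    # nested 'data' document; field names here are assumptions.
    d = section["data"]
    distance = d["distance"]   # metres
    duration = d["duration"]   # seconds
    avg_speed = distance / duration if duration else 0.0
    return [distance, duration, avg_speed]

# Example with a synthetic section: 1 km covered in 100 s -> 10 m/s
features = extract_basic_features({"data": {"distance": 1000.0, "duration": 100.0}})
```

In practice the real helpers in analysis_timeseries_queries.py would supply these values, and the richer features (speed percentiles, HCR, VCR, and so on) would be appended to the same per-section vector.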
I tried to port the functional changes over, but got stuck on an unexpected and likely silly issue. I have been trying to resolve it since yesterday, and feel a bit stuck in a loop. How did it begin?
Potential reason for the above error: but that made me wonder how it was working just fine earlier and stopped working now. What did I try?
@shankari Any inputs would be appreciated.
I could be missing something here, but have you tried changing the import style as the error message indicates? Maybe that would let you get past this error and then resolve the difference in python versions as a separate issue?
I am not sure what the issue is
What was supposed to be python 3.9 and what is using 3.11?
If you are using python 3.9 within the emission venv, and it is not resulting in the above issue, then what is the problem?
Apologies for the ambiguous issue update. There was some issue with the miniconda version (23.1.0) in my local system, which we were using. Since we are using miniconda version (23.5.2) in the
I deemed changing the import style not to be the right solution, since we have identical code in the main branch and we are not facing any issues there. Making these changes would be futile, as we would have to roll them back when we port this branch to the main branch.
Documenting my understanding about the trip better, after discussion with Jack.
I also observed this when I looked up the functions to retrieve these.
The available keys differ in this case. Moreover, we represent the trip-related information about inferred trips rather than about sections in a trip. I feel it might be a good idea to infer the mode at the trip level rather than the section level. We have a model application (https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/pipeline.py) which uses the following approach - this is not utilized in the current master branch for
@shankari
@iantei at this stage, we don't really have an option. It would be great to re-center multi-modality in our work, but that is beyond the scope of this task. |
In e-mission, we currently use two methods for determining the mode automatically:
However, as we have deployments with more complex modes, neither of these are sufficient. For example, in Laos, we want to be able to distinguish between cars and motorcycles, both of which are motorized modes without fixed routes. To this effect, we want to have the ability to build a deployment-specific model that uses sensor-level data to predict the rich modes that are relevant to this deployment.
The steps to this are fairly simple:
The initial featurization we should use is from
Zheng, Y., Li, Q., Chen, Y., Xie, X., Ma, W.-Y., 2008. Understanding mobility based on GPS data. In: Proceedings of the 10th international conference on Ubiquitous computing, Association for Computing Machinery, Association for Computing Machinery, Seoul, South Korea. pp. 312–321.
https://dl.acm.org/doi/10.1145/1409635.1409677
I implemented this featurization in https://github.com/e-mission/e-mission-server/tree/random-forest-mode-detection
and in https://github.com/e-mission/e-mission-server/blob/random-forest-mode-detection/emission/analysis/classification/inference/mode/pipeline.py in particular
Please put the initial results here and then we can decide how to proceed