
Commit 2406b90

Authored by xingyaoww, libowen2121, huybery, bartshappee, and rbren
feat(SWE-Bench environment) integrate SWE-Bench sandbox (All-Hands-AI#1468)
* add draft dockerfile for build all
* add rsync for build
* add all-in-one docker
* update prepare scripts
* Update swe_env_box.py
* Add swe_entry.sh (buggy now)
* Parse the test command in swe_entry.sh
* Update README for instance eval in sandbox
* revert specialized config
* replace run_as_devin as an init arg
* set container & run_as_root via args
* update swe entry script
* update env
* remove mounting
* allow error after swe_entry
* update swe_env_box
* move file
* update gitignore
* get swe_env_box a working demo
* support faking user response & provide sandbox ahead of time; also return state for controller
* tweak main to support adding controller kwargs
* add module
* initialize plugin for provided sandbox
* add pip cache to plugin & fix jupyter kernel waiting
* better print Observation output
* add run infer scripts
* update readme
* add utility for getting diff patch
* use get_diff_patch in infer
* update readme
* support cost tracking for codeact
* add swe agent edit hack
* disable color in git diff
* fix git diff cmd
* fix state return
* support limit eval
* increase timeout and export pip cache
* add eval limit config
* return state when hit turn limit
* save log to file; allow agent to give up
* run eval with max 50 turns
* add outputs to gitignore
* save swe_instance & instruction
* add uuid to swebench
* add streamlit dep
* fix save series
* fix the issue where session id might be duplicated
* allow setting temperature for llm (use 0 for eval)
* Get report from agent running log
* support evaluating task success right after inference
* remove extra log
* comment out prompt for baseline
* add visualizer for eval
* use plaintext for instruction
* reduce timeout for all; only increase timeout for init
* reduce timeout for all; only increase timeout for init
* ignore sid for swe env
* close sandbox in each eval loop
* update visualizer instruction
* increase max chars
* add finish action to history too
* show test result in metrics
* add sidebars for visualizer
* also visualize swe_instance
* cleanup browser when agent controller finish running
* do not mount workspace for swe-eval to avoid accidentally overwrite files
* Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files" (this reverts commit 8ef7739)
* Revert "Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files"" (this reverts commit 016cfbb)
* run jupyter command via copy to, instead of cp to mount
* only print mixin output when failed
* change ssh box logging
* add visualizer for pass rate
* add instance id to sandbox name
* only remove container we created
* use opendevin logger in main
* support multi-processing infer
* add back metadata, support keyboard interrupt
* remove container with startswith
* make pbar behave correctly
* update instruction w/ multi-processing
* show resolved rate by repo
* rename tmp dir name
* attempt to fix racing for copy to ssh_box
* fix script
* bump swe-bench-all version
* fix ipython with self-contained commands
* add jupyter demo to swe_env_box
* make resolved count two column
* increase height
* do not add glob to url params
* analyze obs length
* print instance id prior to removal handler
* add gold patch in visualizer
* fix interactive git by adding a git --no-pager as alias
* increase max_char to 10k to cover 98% of swe-bench obs cases
* allow parsing note
* prompt v2
* add iteration reminder
* adjust user response
* adjust order
* fix return eval
* fix typo
* add reminder before logging
* remove other resolve rate
* re-adjust to new folder structure
* support adding eval note
* fix eval note path
* make sure first log of each instance is printed
* add eval note
* fix the display for visualizer
* tweak visualizer for better git patch reading
* exclude empty patch
* add retry mechanism for swe_env_box start
* fix ssh timeout issue
* add stat field for apply test patch success
* add visualization for fine-grained report
* attempt to support monologue agent by constraining it to single thread
* also log error msg when stopped
* save error as well
* override WORKSPACE_MOUNT_PATH and WORKSPACE_BASE for monologue to work in mp
* add retry mechanism for sshbox
* remove retry for swe env box
* try to handle loop state stopped
* Add get report scripts
* Add script to convert agent output to swe-bench format
* Merge fine grained report for visualizer
* Update eval readme
* Update README.md
* Add CodeAct gpt4-1106 output and eval logs on swe-bench-lite
* Update the script to get model report
* Update get_model_report.sh
* Update get_agent_report.sh
* Update report merge script
* Add agent output conversion script
* Update swe_lite_env_setup.sh
* Add example swe-bench output files
* Update eval readme
* Remove redundant scripts
* set iteration count down to false by default
* fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model exception out of litellm (All-Hands-AI#1666)
* fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model exception out of litellm
* Review Feedback
* Missing None Check
* Review feedback and improved error handling
Co-authored-by: Robert Brennan <[email protected]>
* fix prepare_swe_util scripts
* update builder images
* update setup script
* remove swe-bench build workflow
* update lock
* remove experiments since they are moved to hf
* remove visualizer (since it is moved to hf repo)
* simplify jupyter execution via heredoc
* update ssh_box
* add initial docker readme
* add pkg-config as dependency
* add script for swe_bench all-in-one docker
* add rsync to builder
* rename var
* update commit
* update readme
* update lock
* support specify timeout for long running tasks
* fix path
* separate building of all deps and files
* support returning states at the end of controller
* remove return None
* support specify timeout for long running tasks
* add timeout for all existing sandbox impl
* fix swe_env_box for new codebase
* update llm config in config.py
* support pass sandbox in
* remove force set
* update eval script
* fix issue of overriding final state
* change default eval output to hf demo
* change default eval output to hf demo
* fix config
* only close it when it is NOT external sandbox
* add scripts
* tweak config
* only put in history when state has history attr
* fix agent controller on the case of run out interaction budget
* always assume state is always not none
* remove print of final state
* catch all exception when cannot compute completion cost
* Update README.md
* save source into json
* fix path
* update docker path
* return the final state on close
* merge AgentState with State
* fix integration test
* merge AgentState with State
* fix integration test
* add ChangeAgentStateAction to history in attempt to fix integration
* add back set agent state
* update tests
* update tests
* move scripts for setup
* update script and readme for infer
* do not reset logger when n processes == 1
* update eval_infer scripts and readme
* simplify readme
* copy over dir after eval
* copy over dir after eval
* directly return get state
* update lock
* fix output saving of infer
* replace print with logger
* update eval_infer script
* add back the missing .close
* increase timeout
* copy all swe_bench_format file
* attempt to fix output parsing
* log git commit id as metadata
* fix eval script
* update lock
* update unit tests
* fix argparser unit test
* fix lock
* the deps are now lightweight enough to be included in make build
* add spaces for tests
* add eval outputs to gitignore
* remove git submodule
* readme
* tweak git email
* update upload instruction
* bump codeact version for eval

---------

Co-authored-by: Bowen Li <[email protected]>
Co-authored-by: huybery <[email protected]>
Co-authored-by: Bart Shappee <[email protected]>
Co-authored-by: Robert Brennan <[email protected]>
1 parent a84d19f commit 2406b90


48 files changed (+2321 / −708 lines)

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -202,6 +202,8 @@ cache
 
 # configuration
 config.toml
-
+evaluation/swe_bench/eval_workspace
+evaluation/outputs
+evaluation/evaluation_outputs
 test_results*
 /_test_files_tmp/

Makefile

Lines changed: 1 addition & 1 deletion
@@ -135,7 +135,7 @@ install-python-dependencies:
 	export HNSWLIB_NO_NATIVE=1; \
 	poetry run pip install chroma-hnswlib; \
 	fi
-	@poetry install --without evaluation
+	@poetry install
 	@if [ -f "/etc/manjaro-release" ]; then \
 		echo "$(BLUE)Detected Manjaro Linux. Installing Playwright dependencies...$(RESET)"; \
 		poetry run pip install playwright; \

agenthub/codeact_agent/codeact_agent.py

Lines changed: 4 additions & 1 deletion
@@ -276,7 +276,10 @@ def search_memory(self, query: str) -> list[str]:
         raise NotImplementedError('Implement this abstract method')
 
     def log_cost(self, response):
-        cur_cost = self.llm.completion_cost(response)
+        try:
+            cur_cost = self.llm.completion_cost(response)
+        except Exception:
+            cur_cost = 0
         self.cost_accumulator += cur_cost
         logger.info(
             'Cost: %.2f USD | Accumulated Cost: %.2f USD',
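The related commit message (All-Hands-AI#1666) explains that litellm raises an "Undefined Model" exception when asked to price a local model, so cost logging becomes best-effort here. Below is a minimal standalone sketch of the same guard pattern; the helper name `safe_completion_cost` and the logging setup are illustrative, not the repository's API.

```python
# Sketch: treat cost computation as best-effort, since litellm has no pricing
# data for local/unknown models and raises when asked to price them.
import logging

import litellm  # assumed available; pricing lookup is model-dependent

logger = logging.getLogger(__name__)


def safe_completion_cost(response) -> float:
    """Return the USD cost of a completion, or 0.0 if it cannot be computed."""
    try:
        return litellm.completion_cost(completion_response=response)
    except Exception:
        # Cost tracking should never break the agent loop.
        logger.debug('Could not compute completion cost', exc_info=True)
        return 0.0
```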

evaluation/README.md

Lines changed: 9 additions & 64 deletions
@@ -4,76 +4,21 @@ This folder contains code and resources to run experiments and evaluations.
 
 ## Logistics
 To better organize the evaluation folder, we should follow the rules below:
-- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/SWE-bench` should contain
+- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
 all the preprocessing/evaluation/analysis scripts.
-- Raw data and experimental records should not be stored within this repo (e.g. Google Drive or Hugging Face Datasets).
+- Raw data and experimental records should not be stored within this repo.
+- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization.
 - Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
 
-## Roadmap
+## Supported Benchmarks
 
-- Sanity check. Reproduce Devin's scores on SWE-bench using the released outputs to make sure that our harness pipeline works.
-- Open source model support.
-  - Contributors are encouraged to submit their commits to our [forked SEW-bench repo](https://github.com/OpenDevin/SWE-bench).
-  - Ensure compatibility with OpenAI interface for inference.
-  - Serve open source models, prioritizing high concurrency and throughput.
+- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
 
-## SWE-bench
-- notebooks
-  - `devin_eval_analysis.ipynb`: notebook analyzing devin's outputs
-- scripts
-  - `prepare_devin_outputs_for_evaluation.py`: script fetching and converting [devin's output](https://github.com/CognitionAI/devin-swebench-results/tree/main) into the desired json file for evaluation.
-    - usage: `python prepare_devin_outputs_for_evaluation.py <setting>` where setting can be `passed`, `failed` or `all`
-- resources
-  - Devin related SWE-bench test subsets
-    - [🤗 OpenDevin/SWE-bench-devin-passed](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-passed)
-    - [🤗 OpenDevin/SWE-bench-devin-full-filtered](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-full-filtered)
-  - Devin's outputs processed for evaluations is available on [Huggingface](https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output)
-    - get predictions that passed the test: `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_passed.json`
-    - get all predictions `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json`
+### Result Visualization
 
-See [`SWE-bench/README.md`](./SWE-bench/README.md) for more details on how to run SWE-Bench for evaluation.
+Check [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization of existing experimental results.
 
-### Results
 
-We have refined the original SWE-bench evaluation pipeline to enhance its efficiency and reliability. The updates are as follows:
-- Reuse testbeds and Conda environments.
-- Additionally try `patch` command for patch application if `git apply` command fails.
+### Upload your results
 
-#### Results on SWE-bench-devin-passed
-
-[🤗 OpenDevin/SWE-bench-devin-passed](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-passed)
-
-| Model/Agent | #instances | #init | #apply | #resolve |
-|------------------------|------------|-------|--------|----------|
-| Gold | 79 | 79 | 79 | 79 |
-| Devin | 79 | 79 | 76 | 76 |
-
-#init: number of instances where testbeds have been successfully initialized.
-
-In the 3 Devin-failed instances (see below), Devin has made changes to the tests, which are incompatible with the provided test patch and causes failures during patch application. The evaluation adopted by Devin does not seem to align with the original SWE-bench evaluation.
-
-```shell
-django__django-11244
-scikit-learn__scikit-learn-10870
-sphinx-doc__sphinx-9367
-```
-
-#### Results on SWE-bench-devin-failed
-
-| Model/Agent | #instances | #init | #apply | #resolve |
-|------------------------|------------|-------|--------|----------|
-| Gold | 491 | 491 | 491 | 371 |
-| Devin | 491 | 491 | 463 | 7 |
-
-Devin **passes** 7 instances on the `SWE-bench-devin-failed` subset. SWE-bench dataset appears to be noisy, evidenced by 120 instances where gold patches do not pass.
-
-We have filtered out the problematic 120 instances, resulting in the creation of the `SWE-bench-devin-full-filtered` subset.
-
-## Results on SWE-bench-devin-full-filtered
-
-[🤗 OpenDevin/SWE-bench-devin-full-filtered](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-full-filtered)
-
-| Model/Agent | #instances | #init | #apply | #resolve |
-|------------------------|------------|-------|--------|----------|
-| Gold | 450 | 450 | 450 | 450 |
-| Devin | 450 | 450 | 426 | 83 |
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results to our hosted huggingface repo via PR following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
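The new README points contributors at Hub pull requests for sharing results. As one hedged sketch (not an official OpenDevin script), `huggingface_hub` can open such a PR programmatically; the local folder and the `path_in_repo` layout below are assumptions, so check the evaluation space's own instructions for the expected structure.

```python
# Sketch: open a pull request against the OpenDevin/evaluation space with
# your local evaluation outputs. Authenticate first, e.g. `huggingface-cli login`.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="evaluation/evaluation_outputs/my_run",  # local results (assumed path)
    path_in_repo="outputs/my_model/CodeActAgent",        # hypothetical destination layout
    repo_id="OpenDevin/evaluation",
    repo_type="space",
    create_pr=True,                                       # opens a Hub pull request
    commit_message="Add CodeActAgent SWE-bench-lite results",
)
```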

evaluation/SWE-bench/README.md

Lines changed: 0 additions & 80 deletions
This file was deleted.

evaluation/SWE-bench/commands.sh

Lines changed: 0 additions & 155 deletions
This file was deleted.

evaluation/SWE-bench/environment.yml

Lines changed: 0 additions & 15 deletions
This file was deleted.
