Cosmos DB Linux Emulator fails to start on some Intel chips #45
Is there a way to add a constraint on the Azure Pipeline to use the CPU model that works? I am hitting this issue in Azure DevOps Pipelines, and I always get model 85, which always fails. I have tried specifying "ubuntu-latest", "ubuntu-20.04", and "ubuntu-18.04", but none have worked. The CPU below also fails.
This one with ubuntu-18.04 did work:
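Until this is fixed, one defensive option is to detect the failing CPU model up front and skip (rather than hang) the emulator-backed tests. The sketch below is illustrative, not from this thread: it assumes C# xUnit tests and the Xunit.SkippableFact package for the SkippableFact/Skip helpers.

```csharp
using System.IO;
using Xunit;

public static class CpuModelCheck
{
    // The thread reports that CPU model 85 (e.g. Xeon Platinum 8272CL)
    // reliably hangs the Linux emulator, while model 79 (E5-2673 v4) works.
    public static bool IsKnownBadCpu()
    {
        const string cpuInfoPath = "/proc/cpuinfo";
        if (!File.Exists(cpuInfoPath))
        {
            return false; // Non-Linux host; assume it's fine.
        }

        foreach (var line in File.ReadLines(cpuInfoPath))
        {
            // cpuinfo lines look like "model\t\t: 85".
            if (line.StartsWith("model\t") && line.TrimEnd().EndsWith(": 85"))
            {
                return true;
            }
        }

        return false;
    }
}

public class EmulatorSmokeTests
{
    [SkippableFact]
    public void Emulator_starts_on_supported_cpus()
    {
        Skip.If(CpuModelCheck.IsKnownBadCpu(),
            "Cosmos DB Linux emulator is known to hang on CPU model 85.");

        // ... start the emulator container and assert against it here ...
    }
}
```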
I am also facing this issue. Are there any workarounds, or has any progress been made?
@rrr-michael-aquilina The workaround that worked for me was moving to the Cosmos emulator (PowerShell) that's baked into the Windows pipeline. High-traffic times can cause the emulator to start slowly, > 5 min; I had to make some modifications to its timeout and such, but it's been pretty stable since. Definitely better than not using the emulator at all in the DevOps pipeline.
Also faced this issue using the Linux Docker image. It cost me a day of investigating network issues just to find out the container was immediately shutting down. Using ubuntu-18.04 as suggested in the other GitHub ticket worked for me, but a fix for 20.04 would be great.
The ubuntu-18.04 agent is getting deprecated, so the issue needs to be fixed before then.
We are seeing the exact same problem. Running fine on Azure DevOps agents running ubuntu-18.04, but it fails on ubuntu-20.04 and ubuntu-22.04.
I ran a test today and it works on 20.04, see https://github.com/eddumelendez/testcontainers-cosmodb-gha-test/actions/runs/3153248862/jobs/5129495371 Can someone else confirm?
@eddumelendez ddradar/ddradar#1002
I think it is flaky; I ran two more times and the first failed but the last one succeeded.
Yes, it's flaky. I continue to see random failures as well.
@milismsft do you have any updates on this issue?
+1 for getting this working on ubuntu 20/22. We run the emulator as part of integration testing; it only starts (sometimes) on ubuntu 18, and on anything higher it just hangs at the "Starting" message in the container logs forever. Our devs use a docker-compose stack for local dependencies which includes Cosmos DB, so we would like to just spin up the same stack in ADO pipelines. The ubuntu 18 deprecation date was pushed back to April '23, so we have a bit more time...
Very relevant during the current unscheduled brownout for 18! 20 doesn't work.
This repo doesn't look active, so I posted the question here.
Any news on this?
Someone asked again today, but all we got is the same answer: "We don't have a public-facing ETA we can share for now, but we will share on Azure updates when this will be available."
While we wait on this, is there a workaround? I am using a Windows agent to get around this problem, but the emulator on the Windows agent randomly takes too long to start.
@sajeetharan, it's essentially been two years now since this issue was opened, two years since people reported that it's blocking them from using Microsoft-hosted Linux runners on Azure DevOps Pipelines to run integration tests against Cosmos DB using the Docker emulator. I feel really let down by Microsoft here; no explanation has been offered as to why this hasn't been fixed yet, nor has a practical workaround been offered. For example, all of the following are not options which would work for us:
It feels like this blocking issue is not being appropriately prioritised by Microsoft between the Azure DevOps Pipelines and the Cosmos DB Emulator teams. Can this please get the attention it deserves?
Building on @JonathanLydall's excellent summary: perhaps it would be productive to change the title of this issue to "Cosmos DB does not work with Microsoft-hosted Linux runners on Azure DevOps Pipelines."
Any update on this? This is ridiculous.
This would be incorrect, though. It doesn't need to be Microsoft-hosted, and it doesn't need to be on Azure DevOps. It happens on most mainstream flavors of Linux, on any platform (namely, GitHub).
I've been running some xUnit samples using the Cosmos DB emulator, both with Testcontainers and FluentDocker. See my repo https://github.com/diegosasw/cosmosdb-sample/tree/634f6044ba7fe384bcb2d34ebef3f653ef85be0b They seem to run properly locally and in CI/CD in GitHub Actions. Am I missing anything? Has this issue finally been resolved with the latest emulator, or does it depend on how the Docker container is spun up?
Doesn't seem to be fixed; I just tested on a Microsoft-hosted
@JonathanLydall The following task can help in printing machine info:
Hi @niteshvijay1995,

Here is the YAML file: https://github.com/IntentArchitect/Intent.Modules.NET/blob/1cde41ccb3c8498ccba4a38d5060414eacdea762/azure-pipelines.yml#L137

Just a note that the task YAML you pasted above has incorrect whitespace; it should be:

- task: PowerShell@2
  name: 'PrintMachineInformation'
  inputs:
    targetType: 'inline'
    pwsh: true
    script: |
      echo "CPU Information:"
      lscpu
      echo "Memory Information:"
      free -h
      echo "Disk Usage:"
      df -h
      echo "Operating System Information:"
      uname -a
      echo "Network Configuration:"
      ip addr show
      echo "Docker Version:"
      docker --version
      echo "Docker Info:"
      docker info

I updated the pipeline now to add the information task and also to re-enable the Cosmos DB container image tests. Since enabling it this morning, 2 of 2 runs have succeeded; however, as this issue was intermittent before (I guess depending on which underlying hardware Azure Pipelines happened to use for a particular run), I wouldn't consider it solved until a few more days of testing. In the meantime, here is the PrintMachineInformation output of the two successful runs:

Run 1
Run 2
@JonathanLydall Thanks for sharing the details.
I've just added the task to our pipeline. On the first run, all of the tests failed:
Sample error - 1st run
System Information - 1st run
I then re-ran the pipeline, and this time only got a partial failure; most tests passed but a few didn't:
Sample error - 2nd run
System Information - 2nd run
After another re-run, this time all tests succeeded and we got a green build:
System Information - 3rd run
Hope that helps.
@kntajus, please attach the YAML file.
@niteshvijay1995 What is it that you're interested in within the YAML? I ask because I'm not sure if there's anything sensitive in there that I shouldn't be sharing, and it may take me a while to work that out. In terms of the tests being run, they are C# xUnit tests that use Testcontainers to spin up the Docker container, and the YAML for that task is simply:
I can confirm that we've had 100% success on local runs for many months (we introduced this near the start of this year); it's only ever failed (intermittently) when running in Azure DevOps pipelines.
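For context, a minimal sketch of the kind of Testcontainers-based setup being described here, assuming the Testcontainers.CosmosDb NuGet package and the Microsoft.Azure.Cosmos SDK (illustrative, not the poster's actual code):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Testcontainers.CosmosDb;
using Xunit;

// Starts the emulator container once per test class and tears it down after.
public sealed class CosmosDbFixture : IAsyncLifetime
{
    public CosmosDbContainer Container { get; } = new CosmosDbBuilder().Build();

    public Task InitializeAsync() => Container.StartAsync();

    public Task DisposeAsync() => Container.DisposeAsync().AsTask();
}

public class CosmosDbTests : IClassFixture<CosmosDbFixture>
{
    private readonly CosmosDbFixture _fixture;

    public CosmosDbTests(CosmosDbFixture fixture) => _fixture = fixture;

    [Fact]
    public async Task Can_create_database()
    {
        // Gateway mode plus the container's pre-configured HttpClient
        // (which trusts the emulator's self-signed certificate).
        var options = new CosmosClientOptions
        {
            ConnectionMode = ConnectionMode.Gateway,
            HttpClientFactory = () => _fixture.Container.HttpClient,
        };

        using var client = new CosmosClient(_fixture.Container.GetConnectionString(), options);
        var response = await client.CreateDatabaseIfNotExistsAsync("testdb");

        Assert.NotNull(response.Database);
    }
}
```

On the agents discussed in this thread, InitializeAsync is where the intermittent failure shows up: the container starts but the emulator never reports readiness, so the startup wait eventually times out.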
Hi @niteshvijay1995, we had a failure yesterday (in addition to a few more builds which were fine).
Machine information for failed build
Example of error of failing test
Diff of the machine information
diff --git "a/machine-info-success.txt" "b/machine-info-failure.txt"
index 6379f46..9ff21b4 100644
--- "a/machine-info-success.txt"
+++ "b/machine-info-failure.txt"
@@ -8,7 +8,7 @@ Help : https://docs.microsoft.com/azure/devops/pipelines/tasks/utility/p
==============================================================================
Generating script.
========================== Starting Command Output ===========================
-[command]/usr/bin/pwsh -NoLogo -NoProfile -NonInteractive -Command . '/home/vsts/work/_temp/ec24e82d-e7fa-4de5-95c5-306e9ec7ba53.ps1'
+[command]/usr/bin/pwsh -NoLogo -NoProfile -NonInteractive -Command . '/home/vsts/work/_temp/55bc2847-d486-47c6-ae07-e0e5a259cbe2.ps1'
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
@@ -17,30 +17,30 @@ Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
-Model name: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Model name: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
CPU family: 6
-Model: 79
+Model: 85
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
-Stepping: 1
-BogoMIPS: 4589.37
-Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear
+Stepping: 7
+BogoMIPS: 5187.81
+Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 64 KiB (2 instances)
L1i cache: 64 KiB (2 instances)
-L2 cache: 512 KiB (2 instances)
-L3 cache: 50 MiB (1 instance)
+L2 cache: 2 MiB (2 instances)
+L3 cache: 35.8 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0,1
-Vulnerability Gather data sampling: Not affected
+Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
-Vulnerability Retbleed: Not affected
+Vulnerability Retbleed: Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
@@ -49,11 +49,11 @@ Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown
Memory Information:
total used free shared buff/cache available
-Mem: 6.8Gi 672Mi 4.6Gi 35Mi 1.5Gi 5.8Gi
+Mem: 6.8Gi 632Mi 4.7Gi 35Mi 1.5Gi 5.8Gi
Swap: 4.0Gi 0B 4.0Gi
Disk Usage:
Filesystem Size Used Avail Use% Mounted on
-/dev/root 73G 52G 21G 72% /
+/dev/root 73G 53G 21G 72% /
tmpfs 3.4G 172K 3.4G 1% /dev/shm
tmpfs 1.4G 1.1M 1.4G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
@@ -61,7 +61,7 @@ tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/sda1 14G 4.1G 9.0G 31% /mnt
tmpfs 693M 12K 693M 1% /run/user/1001
Operating System Information:
-Linux fv-az367-389 6.5.0-1022-azure #23~22.04.1-Ubuntu SMP Thu May 9 17:59:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
+Linux fv-az356-146 6.5.0-1022-azure #23~22.04.1-Ubuntu SMP Thu May 9 17:59:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Network Configuration:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
@@ -70,13 +70,13 @@ Network Configuration:
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
- link/ether 00:22:48:9a:99:5f brd ff:ff:ff:ff:ff:ff
- inet 10.1.0.5/16 metric 100 brd 10.1.255.255 scope global eth0
+ link/ether 00:0d:3a:da:86:c4 brd ff:ff:ff:ff:ff:ff
+ inet 10.1.28.0/16 metric 100 brd 10.1.255.255 scope global eth0
valid_lft forever preferred_lft forever
- inet6 fe80::222:48ff:fe9a:995f/64 scope link
+ inet6 fe80::20d:3aff:feda:86c4/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
- link/ether 02:42:0e:68:7c:0b brd ff:ff:ff:ff:ff:ff
+ link/ether 02:42:35:c8:95:4a brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
Docker Version:
@@ -132,7 +132,7 @@ Server:
Architecture: x86_64
CPUs: 2
Total Memory: 6.759GiB
- Name: fv-az367-389
+ Name: fv-az356-146
ID: 2eba0d88-daa1-420e-94b5-971f98899c95
Docker Root Dir: /var/lib/docker
Debug Mode: false
@kntajus I was interested in knowing the emulator startup script.
@JonathanLydall Can you please confirm if the emulator startup is failing in the failed build?
@niteshvijay1995, there is no text like that in the logs for the run, but only some of the tests against Cosmos DB are failing, not all. Below is the start of the logs:
Details
@niteshvijay1995 I'm looking into how to grab the emulator logs for you. In my local testing I'm seeing nothing of interest in there. For a full run where all the tests are passing, it's literally just giving me:
Is there some kind of flag or environment variable I should be setting to tell the emulator to output more verbose logs?
@kntajus I think the logs should be accessible via a
Thanks @razvangoga. Given that I'm using Testcontainers, I've found a way to access the container logs directly from the test code anyway. @JonathanLydall I don't know if this might be useful for you too (looks like you're using .NET): you can call GetLogsAsync on the instance representing the Docker container that Testcontainers provides for you, at the end of the test/run.
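As a rough sketch of that suggestion (assuming the Testcontainers for .NET IContainer.GetLogsAsync method; the helper name here is made up), the logs can be captured before the container is disposed:

```csharp
using System;
using System.Threading.Tasks;
using DotNet.Testcontainers.Containers;

public static class EmulatorLogDump
{
    // Returns everything the container wrote to stdout and stderr.
    // Useful to call from a test's cleanup path on failure, before
    // the container (and its logs) are thrown away.
    public static async Task<string> CaptureAsync(IContainer container)
    {
        var (stdout, stderr) = await container.GetLogsAsync();
        return stdout + Environment.NewLine + stderr;
    }
}
```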
@kntajus cool! Genuinely curious if anything comes from them. The issue is hardware related, and the behaviour is in line with how Azure DevOps assigns agent VMs: based on the region of the Azure DevOps tenant, you will get agent VMs in the same Azure region or in a fallback one (docs). It explains why some people (like me, based in Germany, where agents come from the Germany or France Azure regions) get only failures, while some others have more success (different Azure zone => different hardware). I mean, we can also wait a couple of years more until MS upgrades all its datacenters away from these pesky Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz 😁 In any case, I'll drink a beer each day @niteshvijay1995 doesn't ghost the repo like @sajeetharan did the last time around.
I find it quite amusing that this issue has a Feature label. The emulator is supposed to work on Linux; there's nothing about it only working on certain Linux agents in ADO. This is a bug, not a feature.
It looks like it won't be solved soon, so what's the best workaround for this issue? Just to use a "real" instance of the DB? An in-memory approach is not a good fit for API integration tests.
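One shape that workaround can take (a hedged sketch; COSMOS_CONNECTION_STRING is a made-up pipeline variable name, and Testcontainers.CosmosDb is assumed): point the tests at a real test account when a connection string is supplied, and fall back to the emulator container locally.

```csharp
using System;
using System.Threading.Tasks;
using Testcontainers.CosmosDb;

public static class CosmosConnection
{
    // Resolves a connection string for integration tests. In CI, set the
    // (hypothetical) COSMOS_CONNECTION_STRING variable to a real account;
    // locally, the Testcontainers emulator is used instead. The caller is
    // responsible for disposing the returned container, if any.
    public static async Task<(string ConnectionString, CosmosDbContainer Container)> ResolveAsync()
    {
        var real = Environment.GetEnvironmentVariable("COSMOS_CONNECTION_STRING");
        if (!string.IsNullOrEmpty(real))
        {
            return (real, null);
        }

        var container = new CosmosDbBuilder().Build();
        await container.StartAsync();
        return (container.GetConnectionString(), container);
    }
}
```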
Is it resolved in the new Linux-based emulator (https://learn.microsoft.com/en-us/azure/cosmos-db/emulator-linux)?
Related to: actions/runner-images#5036 (comment)
The Cosmos DB Linux Emulator fails to start on some Intel chips.
lscpu output:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Stepping: 7
CPU MHz: 2593.907
BogoMIPS: 5187.81
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 64 KiB
L1i cache: 64 KiB
L2 cache: 2 MiB
L3 cache: 35.8 MiB
NUMA node0 CPU(s): 0,1
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
/proc/cpuinfo content:
/proc/cpuinfo